Essential Neural Network Functions
Important Functions Used in Neural Networks – A Beginner's Guide

Neural networks rely on a handful of core mathematical functions that determine how they learn from data and make predictions. As a beginner, understanding these functions will help you see what is happening inside each layer and neuron. In this guide, we introduce the most important ones in clear, intuitive language, with simple examples of where they are used in practice.
1. Activation functions
Activation functions decide how much signal a neuron passes forward. Without them, a neural network would behave like a simple linear model and could not learn complex patterns.
- Sigmoid: Squashes values into the range (0, 1). It is often used in the output layer for binary classification, where the result is interpreted as a probability.
- Tanh: Similar to sigmoid but outputs values between -1 and 1. It is zero-centered, which can help optimization compared with sigmoid in some cases.
- ReLU (Rectified Linear Unit): Outputs 0 for negative inputs and the input itself for positive values. ReLU is the most common activation in hidden layers because it is simple and helps deep networks train faster.
- Leaky ReLU and variants: Modify ReLU so that negative inputs are not completely zero, which can reduce the risk of "dead" neurons that never activate.
- Softmax: Converts a vector of raw scores into probabilities that sum to 1. It is typically used in the output layer for multi-class classification problems.
2. Loss (cost) functions
Loss functions measure how far the network's predictions are from the true targets. Training means adjusting weights to minimize this loss.
- Mean Squared Error (MSE): Common in regression tasks, it averages the squared difference between predicted and actual values. Large errors are penalized more strongly.
- Mean Absolute Error (MAE): Uses the absolute difference instead of the square. It is more robust to outliers but can be harder to optimize smoothly.
- Binary Cross-Entropy: Used for binary classification. It compares predicted probabilities with actual labels (0 or 1) and heavily penalizes confident but wrong predictions.
- Categorical Cross-Entropy: Generalizes binary cross-entropy to multiple classes. It is the standard loss for multi-class classification with softmax outputs.
3. Optimization-related functions
To minimize the loss, neural networks use optimization algorithms that rely on gradients.
- Gradient: The gradient is a vector of partial derivatives of the loss with respect to each weight. It tells us the direction in which the loss increases fastest; moving in the opposite direction reduces the loss.
- Gradient Descent: An iterative method that updates weights by subtracting a fraction of the gradient. Variants like stochastic and mini-batch gradient descent use subsets of data to speed up training.
- Learning Rate: A scalar that controls the step size in gradient descent. Too large and training may diverge; too small and training becomes very slow.
- Advanced Optimizers (SGD with momentum, Adam, RMSProp): These methods adapt the update step using past gradients or per-parameter statistics, often leading to faster and more stable training.
4. Regularization functions
Regularization functions help prevent overfitting, where a model memorizes training data instead of learning general patterns.
- L1 and L2 penalties: Add extra terms to the loss that penalize large weights. L1 can drive some weights to exactly zero (feature selection), while L2 encourages smaller, smoother weights.
- Dropout: Randomly "drops" a fraction of neurons during training. This is not a single formula but a simple rule that acts like an ensemble of many smaller networks, improving generalization.
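
The two regularization ideas above can be sketched in a few lines of plain Python. The function names, the strength `lam`, and the "inverted dropout" rescaling are illustrative assumptions rather than anything this guide prescribes.

```python
import random

def l1_penalty(weights, lam):
    """L1 term: lam times the sum of absolute weights."""
    return lam * sum(abs(w) for w in weights)

def l2_penalty(weights, lam):
    """L2 term: lam times the sum of squared weights."""
    return lam * sum(w * w for w in weights)

def dropout(activations, rate, rng):
    """Inverted dropout (an assumed variant): zero each value with
    probability `rate` and rescale the survivors by 1 / (1 - rate)."""
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

weights = [0.5, -2.0, 0.0, 1.5]          # hypothetical layer weights
print(l1_penalty(weights, lam=0.01))     # would be added to the data loss
print(l2_penalty(weights, lam=0.01))

rng = random.Random(0)                   # fixed seed so runs are repeatable
print(dropout([1.0, 2.0, 3.0, 4.0], rate=0.5, rng=rng))
```

Dropout is applied only during training; at prediction time all neurons are kept.
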
5. Output and utility functions
Finally, some functions are used to interpret or evaluate the network's outputs.
- Argmax: Selects the index of the largest output value, often used to choose the predicted class after a softmax layer.
- Accuracy, Precision, Recall, F1-score: These are evaluation metrics rather than training losses, but they are crucial for understanding how well your network performs on classification tasks.
As you build your first neural networks, focus on choosing an appropriate activation function for each layer, a suitable loss function for your task, and a reliable optimizer. With these core functions in place, you will have a solid foundation for exploring more advanced architectures and techniques.
Neural Networks are the foundation of modern Artificial Intelligence (AI) and Deep Learning. They power applications such as image recognition, speech recognition, fraud detection, recommendation systems, autonomous vehicles, and Generative AI. A neural network learns by processing data through multiple interconnected layers. During this process, several mathematical functions help the network make predictions, measure errors, and improve its performance over time. These functions can be broadly classified into four categories:
- Activation Functions – Decide whether a neuron should activate.
- Loss Functions – Measure prediction errors.
- Optimization Functions – Update weights to minimize errors.
- Evaluation Functions – Measure the final performance of the model.
1. Activation Functions
Activation functions introduce non-linearity into a neural network, allowing it to learn complex relationships that simple linear models cannot.
1.1 Linear Activation Function
Formula:
$$f(x) = x$$
The output is exactly equal to the input.
Example: Suppose the weighted sum of inputs is $z = 8$. The output is then $f(z) = 8$.
Advantages:
- Simple and computationally efficient
- Suitable for regression output layers
Disadvantages:
- Cannot learn complex nonlinear relationships
- Multiple linear layers behave like a single linear layer
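
The last point can be checked numerically. The sketch below uses a one-input neuron and made-up weights and biases (all hypothetical) to show that two stacked linear layers give the same outputs as one suitably chosen linear layer.

```python
def linear_layer(x, w, b):
    """One neuron with a linear activation: f(z) = z."""
    return w * x + b

# Two stacked linear layers with hypothetical weights and biases.
w1, b1 = 2.0, 1.0
w2, b2 = -3.0, 0.5

def two_layers(x):
    return linear_layer(linear_layer(x, w1, b1), w2, b2)

# The equivalent single layer: w = w2 * w1 and b = w2 * b1 + b2.
w, b = w2 * w1, w2 * b1 + b2

for x in [-1.0, 0.0, 4.0]:
    print(x, two_layers(x), linear_layer(x, w, b))
```

Both columns match for every input, which is why depth alone adds nothing without a non-linear activation between the layers.
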
1.2 Step Function
Formula:
$$f(x) = \begin{cases} 1, & \text{if } x \ge 0 \\ 0, & \text{otherwise} \end{cases}$$
Example:
| Input | Output |
|---|---|
| -3 | 0 |
| 5 | 1 |
Applications:
- Early Perceptron models
- Simple binary decisions
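
As a small illustration of a simple binary decision, the sketch below wires the step function into a single perceptron that computes logical AND. The weights (1, 1) and the bias (-1.5) are hand-picked for this example and are not taken from the text.

```python
def step(x):
    """Step activation: 1 if x >= 0, otherwise 0."""
    return 1 if x >= 0 else 0

def perceptron_and(a, b):
    """Perceptron for logical AND with hand-picked (hypothetical)
    weights 1 and 1 and bias -1.5."""
    return step(1.0 * a + 1.0 * b - 1.5)

print([step(x) for x in (-3, 5)])   # the two inputs from the table above

for a in (0, 1):
    for b in (0, 1):
        print(a, b, perceptron_and(a, b))
```
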
1.3 Sigmoid Function
Formula:
$$f(x) = \frac{1}{1 + e^{-x}}$$
Output range: 0 to 1
Example: For the input $x = 2$, the output is approximately 0.88. This can be read as the neuron activating with approximately 88% confidence.
Advantages:
- Produces probability values
- Smooth and differentiable
Disadvantages:
- Vanishing Gradient Problem
- Slow learning in deep networks
Applications:
- Binary Classification Output Layer
- Spam Detection
- Medical Diagnosis
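
The sketch below evaluates the sigmoid for the input used in the example and prints its derivative, s(x) * (1 - s(x)), at a few points. The shrinking derivative for large inputs is what the vanishing gradient problem refers to. The branch for negative inputs is an added safeguard against overflow, not part of the formula itself.

```python
import math

def sigmoid(x):
    """Sigmoid, written so that math.exp never receives a large positive value."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def sigmoid_grad(x):
    """Derivative of the sigmoid: s(x) * (1 - s(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

print(round(sigmoid(2), 2))

# The gradient peaks at x = 0 and shrinks quickly as |x| grows.
for x in (0, 2, 5, 10):
    print(x, sigmoid_grad(x))
```
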
1.4 Tanh Function
Formula:
$$f(x) = \tanh(x)$$
Output range: -1 to +1
Example: For the input $x = 2$, the output is approximately 0.964.
Advantages:
- Zero-centered output
- Performs better than Sigmoid in many hidden layers
Disadvantages:
- Still suffers from Vanishing Gradient
1.5 ReLU (Rectified Linear Unit)
Formula:
$$f(x) = \max(0, x)$$
Example:
| Input | Output |
|---|---|
| -5 | 0 |
| 8 | 8 |
Advantages:
- Very fast computation
- Helps solve Vanishing Gradient for positive inputs
- Most widely used activation function
Disadvantages:
- Dying ReLU Problem (neurons may stop learning)
1.6 Leaky ReLU
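Formula:
$$f(x) = \begin{cases} x, & \text{if } x > 0 \\ 0.01x, & \text{otherwise} \end{cases}$$
Example:
| Input | Output |
|---|---|
| -4 | -0.04 |
| 6 | 6 |
Advantages:
- Allows small negative values
- Reduces the risk of dead neurons
- Keeps a small gradient for negative inputs, so learning can continue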
1.7 ELU (Exponential Linear Unit)
Formula:
$$\text{ELU}(x) = \begin{cases} x, & \text{if } x > 0 \\ \alpha(e^{x} - 1), & \text{otherwise} \end{cases}$$
Advantages:
- Smooth negative outputs
- Faster convergence
- Often performs better than ReLU
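
The three ReLU-style activations from sections 1.5 to 1.7 can be written side by side for comparison. This is a minimal sketch: the slope of 0.01 comes from the Leaky ReLU formula above, while `alpha=1.0` for ELU is an assumed, commonly used default that the text does not specify.

```python
import math

def relu(x):
    """ReLU: 0 for negative inputs, the input itself otherwise."""
    return max(0.0, x)

def leaky_relu(x, slope=0.01):
    """Leaky ReLU: negative inputs are scaled by a small slope."""
    return x if x > 0 else slope * x

def elu(x, alpha=1.0):
    """ELU: smooth curve alpha * (e^x - 1) for non-positive inputs."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

# Inputs taken from the ReLU and Leaky ReLU example tables.
for x in (-5.0, -4.0, 6.0, 8.0):
    print(x, relu(x), leaky_relu(x), round(elu(x), 4))
```

For negative inputs ReLU returns exactly 0, Leaky ReLU returns a small negative value, and ELU approaches -alpha smoothly.
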
1.8 Softmax Function
Softmax converts raw output values into probabilities for multi-class classification.
Example:
| Animal | Score | Probability |
|---|---|---|
| Cat | 2 | 4.7% |
| Dog | 5 | 93.6% |
| Horse | 1 | 1.7% |
Applications:
- Image Classification
- Handwritten Digit Recognition
- Object Detection
- Natural Language Processing
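
The probabilities for the three scores in the example can be recomputed with a short sketch. Subtracting the largest score before exponentiating is a common stability trick added here as an assumption; it does not change the result. The last lines pick the predicted class with an argmax over the probabilities.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(scores)                          # shift to avoid overflow
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

labels = ["Cat", "Dog", "Horse"]
probs = softmax([2.0, 5.0, 1.0])

for label, p in zip(labels, probs):
    print(f"{label}: {p:.1%}")

# Argmax: the index of the largest probability is the predicted class.
best = max(range(len(probs)), key=probs.__getitem__)
print("Predicted:", labels[best])
```
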
2. Loss Functions
Loss functions quantify how far the model's predictions are from the actual values.
2.1 Mean Squared Error (MSE)
Used for Regression Problems.
Formula:
$$\text{MSE} = \text{Average of } (\text{Actual} - \text{Predicted})^2$$
Example:
| Actual | Predicted | Squared Error |
|---|---|---|
| 100 | 90 | 100 |
| 200 | 210 | 100 |
Both squared errors are 100, so the MSE for this example is 100.
Applications:
- House Price Prediction
- Sales Forecasting
- Demand Prediction
2.2 Mean Absolute Error (MAE)
Formula:
$$\text{MAE} = \text{Average of } |\text{Actual} - \text{Predicted}|$$
Advantages:
- Easy to interpret
- Less sensitive to outliers
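
To make the difference between the two regression losses concrete, the sketch below computes MSE and MAE for the two rows of the MSE example, then adds one badly predicted point. The extra point (actual 300, predicted 200) is a made-up outlier used only for illustration.

```python
def mse(actual, predicted):
    """Mean Squared Error: average of squared differences."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def mae(actual, predicted):
    """Mean Absolute Error: average of absolute differences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

actual = [100, 200]
predicted = [90, 210]
print(mse(actual, predicted), mae(actual, predicted))

# One hypothetical outlier: MSE grows far more than MAE.
actual_out = actual + [300]
predicted_out = predicted + [200]
print(mse(actual_out, predicted_out), mae(actual_out, predicted_out))
```

With the outlier included, MSE rises from 100 to 3400 while MAE rises only from 10 to 40.
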
2.3 Binary Cross-Entropy Loss
Used for Binary Classification.
Applications:
- Spam Detection
- Fraud Detection
- Disease Prediction
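
This section gives no formula, so the sketch below uses the standard definition for a single example, -(y log p + (1 - y) log(1 - p)). The clipping constant `eps` is an assumption added to avoid taking the logarithm of zero.

```python
import math

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Loss for one example; y_true is 0 or 1, p_pred is the predicted
    probability of class 1."""
    p = min(max(p_pred, eps), 1.0 - eps)   # keep p strictly inside (0, 1)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))

# True label is 1; the loss grows sharply as the prediction becomes
# confidently wrong.
for p in (0.9, 0.6, 0.1, 0.01):
    print(p, round(binary_cross_entropy(1, p), 3))
```
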
2.4 Categorical Cross-Entropy Loss
Used with the Softmax activation function for Multi-Class Classification. Examples include:
- Image Classification
- Speech Recognition
- Language Translation
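
For a single example with one correct class, the standard categorical cross-entropy reduces to the negative log of the probability assigned to that class. The sketch below assumes this definition; the probability vector is made up for illustration and stands in for a softmax output.

```python
import math

def categorical_cross_entropy(true_index, probs, eps=1e-12):
    """Loss for one example: negative log of the probability the model
    assigned to the correct class."""
    return -math.log(max(probs[true_index], eps))

probs = [0.1, 0.7, 0.2]   # hypothetical softmax output for three classes

print(categorical_cross_entropy(1, probs))   # correct class had p = 0.7
print(categorical_cross_entropy(0, probs))   # correct class had p = 0.1
```

The loss is small when the correct class receives a high probability and large when it receives a low one.
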
3. Optimization Functions
Optimizers determine how neural network weights are updated to reduce the loss.
3.1 Gradient Descent
Gradient Descent updates the weights in the direction that reduces the loss, which is the direction opposite to the gradient.
Formula:
$$\text{New Weight} = \text{Old Weight} - \text{Learning Rate} \times \text{Gradient}$$
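
The update rule can be tried on a toy problem. The sketch below assumes a one-parameter loss L(w) = w², whose gradient is 2w, and runs the same loop with three learning rates chosen for illustration: one too small, one reasonable, and one too large.

```python
def gradient_descent(lr, steps=20, w=5.0):
    """Minimize the toy loss L(w) = w**2, whose gradient is 2 * w."""
    for _ in range(steps):
        grad = 2.0 * w
        w = w - lr * grad   # New Weight = Old Weight - Learning Rate * Gradient
    return w

for lr in (0.001, 0.1, 1.1):
    print(lr, gradient_descent(lr))
```

With the small rate the weight barely moves from 5, with 0.1 it approaches the minimum at 0, and with 1.1 each step overshoots and the weight grows in magnitude instead of shrinking.
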
3.2 Stochastic Gradient Descent (SGD)
Instead of using the entire dataset, SGD updates weights after each training example.
Advantages:
- Faster updates
- Suitable for large datasets
3.3 Mini-Batch Gradient Descent
Uses small batches such as:
- 32 samples
- 64 samples
- 128 samples
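
The three variants differ only in how many examples feed each update. The sketch below fits a one-weight model y = w * x on a tiny synthetic dataset. The data, the learning rate, and the batch size of 8 are assumptions chosen so the example runs instantly; real projects would use batch sizes like those listed above.

```python
import random

def minibatch_gd(xs, ys, batch_size, lr=0.05, epochs=200, seed=0):
    """Fit y = w * x by gradient descent on the MSE loss, using
    `batch_size` examples per weight update."""
    rng = random.Random(seed)
    w = 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)                       # new order every epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of mean((w*x - y)**2) over the batch with respect to w.
            grad = sum(2.0 * (w * xs[i] - ys[i]) * xs[i] for i in batch) / len(batch)
            w -= lr * grad
    return w

xs = [i / 10 for i in range(1, 41)]   # 40 toy inputs
ys = [3.0 * x for x in xs]            # targets generated with w = 3

print(minibatch_gd(xs, ys, batch_size=1))    # SGD: one example per update
print(minibatch_gd(xs, ys, batch_size=8))    # mini-batch
print(minibatch_gd(xs, ys, batch_size=40))   # full-batch gradient descent
```

All three runs should recover a weight close to 3; they differ in how many updates they make per pass over the data.
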
3.4 Adam Optimizer
Adam combines:
- Momentum
- Adaptive Learning Rate
Advantages:
- Fast convergence
- Stable learning
- Excellent default optimizer
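
A single-parameter version of the Adam update shows how the two ingredients fit together. This is a sketch of the commonly published update rule; the hyperparameter values (`beta1=0.9`, `beta2=0.999`, `eps=1e-8`) are the usual defaults, while the learning rate, step count, and toy loss are assumptions for the example.

```python
import math

def adam_minimize(grad_fn, w, lr=0.1, beta1=0.9, beta2=0.999,
                  eps=1e-8, steps=300):
    m = 0.0   # running average of gradients (the momentum part)
    v = 0.0   # running average of squared gradients (the adaptive part)
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        m_hat = m / (1.0 - beta1 ** t)   # bias correction for early steps
        v_hat = v / (1.0 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w

# Toy loss L(w) = (w - 3)**2 with gradient 2 * (w - 3); the minimum is at w = 3.
print(adam_minimize(lambda w: 2.0 * (w - 3.0), w=0.0))
```

In practice you would use the optimizer provided by your deep learning framework rather than writing this loop by hand.
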
4. Evaluation Functions
Evaluation metrics help determine how well the trained model performs.
| Metric | Purpose |
|---|---|
| Accuracy | Overall classification performance |
| Precision | Measures correctness of positive predictions |
| Recall | Measures ability to identify all positive cases |
| F1 Score | Balances Precision and Recall |
| ROC-AUC | Evaluates binary classification models |
| Confusion Matrix | Displays prediction breakdown |
| RMSE | Regression accuracy |
| R² Score | Explains variance in regression models |
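
The four classification metrics at the top of the table can all be derived from the counts in a confusion matrix. The sketch below assumes binary labels (1 = positive, 0 = negative); the label lists are made up for illustration.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)   # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)   # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)   # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)   # false negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # hypothetical ground truth
y_pred = [1, 0, 0, 1, 0, 1, 1, 1]   # hypothetical model predictions
print(classification_metrics(y_true, y_pred))
```

Here the model finds 3 of the 4 positives (recall 0.75), but only 3 of its 5 positive predictions are correct (precision 0.6).
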
Summary Table
| Category | Functions | Primary Purpose |
|---|---|---|
| Activation Functions | Linear, Step, Sigmoid, Tanh, ReLU, Leaky ReLU, ELU, Softmax | Generate neuron outputs and introduce non-linearity |
| Loss Functions | MSE, MAE, Binary Cross-Entropy, Categorical Cross-Entropy | Measure prediction errors |
| Optimization Functions | Gradient Descent, SGD, Mini-Batch GD, Adam | Update weights to minimize loss |
| Evaluation Metrics | Accuracy, Precision, Recall, F1 Score, RMSE, ROC-AUC, R² | Evaluate model performance |
