Linear Regression
Linear regression is a way to understand the relationship between two things. For example, imagine you sell lemonade. You notice that on hotter days, you sell more cups. On cooler days, you sell fewer. You might start to wonder, "Can I predict how many cups I'll sell if I know the temperature?"
That's exactly what linear regression helps with. It takes your past data—like the temperature each day and how many cups you sold—and draws a straight line that best fits that data. This line shows the general trend: as temperature goes up, sales go up.
Once that line is drawn, you can use it to make predictions. If tomorrow is 30 degrees, you can look at your line and estimate how many cups you'll probably sell. It doesn't give a perfect answer, but it gives a good guess based on what's happened before.
This method isn't just for lemonade. It's used all the time in real life. Businesses use it to guess how much money they might make based on advertising. Teachers can use it to see how study time affects grades. Doctors might use it to check how weight changes with age.
So, in simple words: linear regression is a tool that finds a straight-line pattern in your data and helps you make predictions based on it.

📊 What is Linear Regression?
Linear regression is like drawing the best straight line through a bunch of dots on a graph to understand a relationship or make predictions.
🎯 Imagine This:
You run a lemonade stand. You notice that:
- On hot days, you sell more lemonade.
- On cooler days, you sell less.
You write down:
- The temperature each day
- How many cups of lemonade you sold
Now, you want to predict how many cups you'll sell if tomorrow is 30°C.
This is where linear regression helps!
It draws a straight line that best fits your data and helps you say:
"If it's 30°C, I might sell around 60 cups."
🧠 In Simple Words:
Linear regression helps answer:
"How does one thing affect another?"
Examples:
- "Does studying more hours lead to higher marks?"
- "Does advertising increase sales?"
- "Does height affect weight?"
🛠️ How It Works:
- It looks at past data (like temperature and sales)
- Finds the pattern (like higher temp = more sales)
- Draws a straight line through the points
- Uses the line to predict future outcomes
🔹 Mathematical Formula
For simple linear regression (one feature):
Y = β0 + β1X + ε
- Y = predicted value
- X = input feature
- β0 = intercept
- β1 = slope (coefficient)
- ε = error term
🔹 Real-Life Example
Problem: Predict salary based on years of experience.
Suppose past data suggests a base salary of 25,000 plus roughly 5,000 for each additional year of experience. We can model this as:
Salary = 25,000 + 5,000 × Experience
Thus, for a person with 6 years of experience:
Salary = 25,000 + 5,000 × 6 = 55,000
🔹 Python Example (Using scikit-learn)
from sklearn.linear_model import LinearRegression
import numpy as np
# Input data
X = np.array([[1], [2], [3], [4], [5]]) # Years of experience
y = np.array([30000, 35000, 40000, 45000, 50000]) # Salary
# Create and train model
model = LinearRegression()
model.fit(X, y)
# Predict salary for 6 years of experience
predicted_salary = model.predict([[6]])
print("Predicted Salary for 6 years of experience:", predicted_salary[0])

The graph above shows how linear regression fits a straight line through the training data points (experience vs salary). The red line represents the model's prediction—clearly illustrating the linear relationship between years of experience and salary.
Using multiple linear regression, we predicted the salary for:
- 6 years of experience
- Master's degree (education level = 2)
📌 Predicted Salary: $55,600
This model takes both experience and education level into account, showing how multiple factors can influence predictions more accurately than a single variable.
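A minimal sketch of how such a two-feature model could be fit with scikit-learn. The training data and the education-level encoding below are made up for illustration (they are not the dataset behind the $55,600 figure), so the exact prediction will differ:

from sklearn.linear_model import LinearRegression
import numpy as np

# Hypothetical training data: [years of experience, education level (1 = Bachelor's, 2 = Master's)]
X = np.array([
    [1, 1], [2, 1], [3, 2], [4, 1],
    [5, 2], [6, 1], [7, 2], [8, 2],
])
y = np.array([31000, 36000, 44000, 45000, 54000, 55000, 64000, 69000])  # salaries

model = LinearRegression()
model.fit(X, y)

# Predict for 6 years of experience with a Master's degree (education level = 2)
predicted = model.predict([[6, 2]])
print("Predicted salary:", predicted[0])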

This 3D plot illustrates how multiple linear regression fits a plane through the data points, showing how both years of experience and education level contribute to the salary prediction.
- The blue dots represent actual training data.
- The red surface represents the model's predictions.
You can see that with increasing experience and higher education level, the predicted salary also increases.
The general equation for multiple linear regression is:
Y = β0 + β1X1 + β2X2 + ⋯ + βnXn + ε
Where:
- Y: Dependent (response) variable
- X1, X2, …, Xn: Independent (predictor) variables
- β0: Intercept (the value of Y when all Xi = 0)
- β1, β2, …, βn: Coefficients (the change in Y for a one-unit change in the corresponding Xi, holding the other variables constant)
- ε: Error term (captures randomness or unexplained variation)
Example:
If predicting house prices based on size (in sq ft) and number of bedrooms:
Price = β0 + β1(Size) + β2(Bedrooms) + ε
LOSS
What Is Loss?
Loss is a numerical metric that measures how far off a model's predictions are from the actual target values. It focuses on the magnitude of the error, not its direction. For instance, if a model predicts 2 but the actual value is 5, the error is -3, but the loss considers the absolute difference, which is 3.
Common Methods to Calculate Loss
To ensure all errors contribute positively to the total loss, two common methods are used:
- Mean Absolute Error (MAE): Calculates the average of the absolute differences between predicted and actual values.
- Mean Squared Error (MSE): Calculates the average of the squared differences between predicted and actual values.
MSE is particularly popular in linear regression because it penalizes larger errors more than smaller ones, making it sensitive to outliers.
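As a quick sketch, both metrics can be computed in a few lines; the actual and predicted values below are made up just to show the arithmetic:

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Hypothetical actual and predicted values
y_true = np.array([5, 3, 8, 10])
y_pred = np.array([4, 3, 9, 7])

errors = y_true - y_pred                 # signed errors: [1, 0, -1, 3]
mae = np.mean(np.abs(errors))            # average absolute difference -> 1.25
mse = np.mean(errors ** 2)               # average squared difference  -> 2.75

print("MAE:", mae)
print("MSE:", mse)
# The same values via scikit-learn:
print(mean_absolute_error(y_true, y_pred), mean_squared_error(y_true, y_pred))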
Visualizing Loss
The concept of loss can be visualized by plotting arrows from each data point to the model's prediction line. These arrows represent the errors, and their lengths correspond to the magnitude of the loss for each point.
Importance of Minimizing Loss
Minimizing the loss function is crucial for training an effective model. Techniques like gradient descent are employed to adjust the model's parameters (weights and bias) iteratively, aiming to find the values that result in the lowest possible loss.
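To make this concrete, here is a minimal sketch of gradient descent for simple linear regression using the MSE loss. The data reuses the experience/salary numbers from the earlier example; the learning rate and number of iterations are arbitrary choices for illustration:

import numpy as np

# Training data: years of experience vs. salary (from the earlier example)
X = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([30000, 35000, 40000, 45000, 50000], dtype=float)

w, b = 0.0, 0.0           # weight (slope) and bias (intercept), starting at zero
learning_rate = 0.01
epochs = 10000
n = len(X)

for _ in range(epochs):
    y_pred = w * X + b
    error = y_pred - y
    # Gradients of MSE with respect to w and b
    dw = (2 / n) * np.sum(error * X)
    db = (2 / n) * np.sum(error)
    # Step against the gradient, scaled by the learning rate
    w -= learning_rate * dw
    b -= learning_rate * db

print("Learned slope:", w)       # approaches ~5000
print("Learned intercept:", b)   # approaches ~25000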
L1 and L2 Loss Functions
In addition to MAE and MSE, two widely used loss functions in linear regression are:
- L1 Loss (Least Absolute Deviations): The sum of the absolute differences between predicted and actual values (MAE is this sum averaged over the number of examples). L1 loss is robust to outliers, as it does not heavily penalize large errors.
- L2 Loss (Least Squares Error): The sum of the squared differences between predicted and actual values (MSE is this sum averaged over the number of examples). L2 loss penalizes larger errors more than smaller ones, making it sensitive to outliers.
Choosing between L1 and L2 loss functions depends on the specific problem and the presence of outliers in the data.
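A small sketch (with made-up numbers) showing why L2 loss is more sensitive to outliers than L1 loss:

import numpy as np

y_true = np.array([5.0, 3.0, 8.0, 10.0])
y_pred = np.array([4.0, 3.0, 9.0, 7.0])

l1 = np.sum(np.abs(y_true - y_pred))      # sum of absolute differences -> 5
l2 = np.sum((y_true - y_pred) ** 2)       # sum of squared differences  -> 11
print("No outlier   -> L1:", l1, " L2:", l2)

# Add one badly mispredicted point (error of 20)
y_true = np.append(y_true, 6.0)
y_pred = np.append(y_pred, 26.0)

l1 = np.sum(np.abs(y_true - y_pred))      # grows by 20 (linearly)       -> 25
l2 = np.sum((y_true - y_pred) ** 2)       # grows by 400 (quadratically) -> 411
print("With outlier -> L1:", l1, " L2:", l2)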
In linear regression, there are four main types of loss, summarized below:
- L1 loss: the sum of the absolute differences between the predicted values and the actual values.
- Mean Absolute Error (MAE): the average of the L1 loss across the dataset (L1 loss divided by the number of examples).
- L2 loss (squared loss): the sum of the squared differences between the predicted values and the actual values.
- Mean Squared Error (MSE): the average of the L2 loss across the dataset (L2 loss divided by the number of examples).
Learning Rate
1. What is Learning Rate?
The learning rate (α or η) is a hyperparameter that controls how much to change the model in response to the error each time the model weights are updated.
- It determines how big a step the optimizer takes during gradient descent.
- A small learning rate = slow learning; safer, but it may take a long time to converge.
- A large learning rate = faster learning, but it may overshoot and become unstable.
🧠 Learning Rate in Very Simple Terms
The learning rate is like the speed at which a machine learning model learns.
🎯 Think of it like this:
Imagine you're trying to find the lowest point in a valley (the best answer) by taking steps downhill.
Each time, you check which direction is downhill and take a step.
- 🐢 Small steps (small learning rate): Safe, but very slow.
- 🐇 Big steps (large learning rate): Fast, but you might miss the bottom or fall off the path.
📘 The Formula:
new weight = old weight − learning rate × gradient
✅ In simple words:
Learning rate controls how big a change we make to the model each time it tries to improve.
👶 Example:
Let's say the model is learning to predict the price of a house.
If it guesses too high, it tries to lower the guess.
- If the learning rate is too small, it lowers the guess a little.
- If the learning rate is too big, it might lower the guess too much and make a new mistake.
Example:
If you're minimizing a loss function, and the gradient is 0.6, then with a learning rate of 0.01:
Weight update = Weight - (0.01 * 0.6) = Weight - 0.006
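The same update written as code; the starting weight below is arbitrary, just to make the arithmetic concrete:

# One gradient descent step: new_weight = old_weight - learning_rate * gradient
weight = 1.0            # arbitrary starting weight
gradient = 0.6
learning_rate = 0.01

weight = weight - learning_rate * gradient
print(weight)           # 1.0 - 0.006 = 0.994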
2. How Many Times Are Weight and Bias Updated?
The number of times weights and biases are updated depends on:
✅ Number of epochs
An epoch is one full pass through the training dataset. If you train for 10 epochs, the weights are updated 10 times for each training sample (or for each batch, if using mini-batch training).
✅ Batch size
- If using stochastic gradient descent (SGD): an update happens after every sample.
- If using mini-batch gradient descent: an update happens after every batch.
- If using batch gradient descent: an update happens once per epoch.
Formula for Number of Updates:
If:
- E = number of epochs
- N = number of training samples
- B = batch size
Then:
- Number of updates = (N / B) × E
Example:
Say you have:
- 10,000 samples
- Batch size = 100
- Epochs = 5
Then:
- Number of updates = (10,000 / 100) × 5 = 100 × 5 = 500 updates
- So the weights and biases are updated 500 times.
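The same calculation as a small helper function (the function name is mine, for illustration):

def total_updates(num_samples, batch_size, epochs):
    """Number of weight/bias updates with mini-batch gradient descent."""
    batches_per_epoch = num_samples // batch_size
    return batches_per_epoch * epochs

print(total_updates(10_000, 100, 5))   # 500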
Let's go through examples to calculate and understand learning rate in action, especially how it affects weight updates using gradient descent.
🔁 Basic Formula: Gradient Descent Update
For a given weight w, the update rule is:
w_new = w_old − η · ∂L/∂w
Where:
- η = learning rate
- ∂L/∂w = gradient of the loss function with respect to the weight
✅ Example 1: Simple Gradient Descent Update
Given:
- Initial weight: w = 0.5
- Gradient: ∂L/∂w = 0.8
- Learning rate: η = 0.1
Calculation:
w_new = 0.5 − 0.1 × 0.8 = 0.5 − 0.08 = 0.42
So, the new weight is 0.42.
✅ Example 2: Try Different Learning Rates
Scenario:
- Initial weight: w = 1.0
- Gradient: ∂L/∂w = 0.5

⚠️ Observation:
- Too small: slow learning.
- Too large: can overshoot or diverge.
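To see this numerically, the sketch below applies the update rule to this scenario (w = 1.0, gradient = 0.5) with a few learning rates; the specific η values are chosen only for illustration:

w_old = 1.0
gradient = 0.5

# Apply one update step with several learning rates
for lr in [0.01, 0.1, 0.5, 1.0, 4.0]:
    w_new = w_old - lr * gradient
    print(f"learning rate {lr}: new weight = {w_new}")
# Tiny rates barely move the weight (slow learning), while a rate like 4.0
# jumps all the way past zero to -1.0, overshooting the region we were searching.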
✅ Example 3: With Bias Update
Let's include bias b.
Given:
- w = 0.6, b = 0.2
- Gradients: ∂L/∂w = 0.4, ∂L/∂b = 0.3
- Learning rate: η = 0.05
Update:
w_new = 0.6 − 0.05 × 0.4 = 0.6 − 0.02 = 0.58
b_new = 0.2 − 0.05 × 0.3 = 0.2 − 0.015 = 0.185
✅ New values: w = 0.58, b = 0.185
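And the same update in code, using the values from this example:

# One gradient descent step for both weight and bias
w, b = 0.6, 0.2
grad_w, grad_b = 0.4, 0.3
learning_rate = 0.05

w = w - learning_rate * grad_w
b = b - learning_rate * grad_b
print(round(w, 3), round(b, 3))   # 0.58 0.185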
Multiple choice questions (MCQs)
✅ 1. What is the purpose of the learning rate in a machine learning model?
A) To determine the number of layers in the model
B) To set the maximum number of epochs
C) To control how much the weights are adjusted during training
D) To control the size of the dataset
Answer: ✅ C
Explanation: The learning rate controls how much we adjust the weights of the model based on the error calculated.
✅ 2. What can happen if the learning rate is too high?
A) The model will train very slowly
B) The model may not learn at all
C) The model may overshoot the optimal point and diverge
D) It will always find the best solution quickly
Answer: ✅ C
Explanation: A high learning rate can cause the model to skip over the best values, leading to poor or unstable learning.
✅ 3. If the learning rate is too low, what is most likely to happen?
A) The model will converge very quickly
B) The model will never converge
C) The model will converge very slowly
D) The model will skip the global minimum
Answer: ✅ C
Explanation: A very small learning rate makes the learning slow and can get stuck before reaching the best solution.
✅ 4. Which of the following is a typical range for a learning rate?
A) 10–100
B) 0.01–0.1
C) 100–1000
D) -1 to 0
Answer: ✅ B
Explanation: Learning rates are usually small values like 0.01 or 0.001 depending on the problem and optimizer.
✅ 5. In the weight update formula w = w − η·∂L/∂w, what does η represent?
A) Bias
B) Epoch
C) Loss
D) Learning Rate
Answer: ✅ D
Explanation: The symbol η (eta) commonly denotes the learning rate in the formula.
✅ 6. Which method adjusts the learning rate during training?
A) Stochastic Gradient Descent
B) Adaptive Learning Rate
C) Batch Normalization
D) Dropout
Answer: ✅ B
Explanation: Adaptive methods like Adam, RMSProp, and Adagrad adjust the learning rate during training.
✅ 7. Which optimizer uses an adaptive learning rate for each parameter?
A) SGD
B) Adam
C) Linear Regression
D) KNN
Answer: ✅ B
Explanation: Adam (Adaptive Moment Estimation) adjusts the learning rate based on estimates of first and second moments of the gradients.
✅ 8. What is a common sign that the learning rate is too high?
A) The training loss decreases steadily
B) The training loss is flat
C) The training loss fluctuates wildly or increases
D) The model performs well on the test set
Answer: ✅ C
Explanation: High learning rates can cause the loss to increase or bounce around, indicating instability.
MCQs on Hyperparameters
1. Which of the following is a hyperparameter in linear regression with gradient descent?
A. Weights
B. Bias
C. Learning rate
D. Features
Answer: C. Learning rate
Explanation:
Hyperparameters are set before training begins and are not learned from the data. The learning rate is a typical hyperparameter in gradient descent. Weights and bias are learned parameters, and features are part of the dataset, not hyperparameters.
2. What is the effect of setting the learning rate too high in gradient descent?
A. The model will converge more slowly
B. The model may overshoot the minimum and not converge
C. The model will perfectly fit the training data
D. The model will stop updating weights
Answer: B. The model may overshoot the minimum and not converge
Explanation:
A high learning rate can cause the updates to "overshoot" the optimal values, potentially causing the loss to increase or fluctuate, preventing convergence.
3. What does it mean if a model's loss decreases very slowly during training?
A. The learning rate may be too high
B. The model has converged
C. The learning rate may be too low
D. The model has overfitted
Answer: C. The learning rate may be too low
Explanation:
A low learning rate results in very small updates to weights, causing the model to take a long time to converge.
4. Which of the following is not typically considered a hyperparameter in linear regression using gradient descent?
A. Batch size
B. Number of features
C. Learning rate
D. Number of epochs
Answer: B. Number of features
Explanation:
The number of features is determined by the dataset, not by the training process. In contrast, batch size, learning rate, and number of epochs are user-specified hyperparameters.
5. What is a good approach to finding the right hyperparameters for your model?
A. Set random values and hope for the best
B. Train the model without tuning
C. Try a range of values and compare performance
D. Use the same values for all models
Answer: C. Try a range of values and compare performance
Explanation:
Hyperparameter tuning involves trying different values (e.g., grid search or random search) and evaluating which set yields the best performance.
6. If your model's loss is fluctuating wildly during training, what could be the issue?
A. Too many training examples
B. Too few epochs
C. Learning rate is too low
D. Learning rate is too high
Answer: D. Learning rate is too high
Explanation:
A high learning rate can make training unstable, causing the loss to fluctuate instead of decreasing smoothly.
7. Which of the following best describes the role of a hyperparameter?
A. A parameter learned from the dataset
B. A setting that controls the training process
C. A component of the cost function
D. A value used to initialize the dataset
Answer: B. A setting that controls the training process
Explanation:
Hyperparameters are user-defined settings like learning rate, batch size, and epochs that influence how the model learns.
8. What is the most likely outcome of training a model for too many epochs?
A. The model becomes more generalized
B. The model underfits the training data
C. The model overfits the training data
D. The model stops learning
Answer: C. The model overfits the training data
Explanation:
Training for too many epochs can cause the model to memorize the training data, reducing its ability to generalize to new data (overfitting).
9. Which of the following best defines "epoch" in the context of model training?
A. A complete forward and backward pass for one training sample
B. A set of hyperparameter tuning trials
C. One complete pass through the entire training dataset
D. The number of layers in the neural network
Answer: C. One complete pass through the entire training dataset
Explanation:
An epoch is defined as one full pass over the entire training dataset during model training.
10. How does batch size affect training in gradient descent?
A. It changes the number of features in the dataset
B. It determines how many weights are updated
C. It controls how many examples are used to calculate the gradient in each step
D. It does not affect the training process
Answer: C. It controls how many examples are used to calculate the gradient in each step
Explanation:
Batch size defines how many training examples are used in one forward/backward pass. Smaller batches update more frequently, while larger ones give a smoother estimate of the gradient.
11. If your model's training loss is decreasing but validation loss is increasing, what's happening?
A. Underfitting
B. Normal convergence
C. Overfitting
D. The learning rate is too low
Answer: C. Overfitting
Explanation:
When the model performs well on training data but poorly on unseen data (validation), it indicates overfitting.
12. What is the trade-off when choosing a very small batch size?
A. Slower convergence and lower generalization
B. Faster convergence but noisier updates
C. Always better performance
D. Less RAM usage but no impact on training
Answer: B. Faster convergence but noisier updates
Explanation:
Smaller batch sizes can update weights more frequently, which speeds up convergence but makes the gradient estimates noisier.
13. What happens when you increase the batch size significantly?
A. You always get better accuracy
B. Training becomes unstable
C. Gradient estimates become more accurate but require more memory
D. Learning rate must be decreased
Answer: C. Gradient estimates become more accurate but require more memory
Explanation:
Larger batches provide a more accurate gradient approximation but need more memory and computation per step.
14. Which combination is most likely to cause overfitting?
A. Low learning rate, few epochs
B. High learning rate, small batch size
C. High learning rate, many epochs
D. Low learning rate, many epochs
Answer: D. Low learning rate, many epochs
Explanation:
A low learning rate with many epochs allows the model to train slowly and potentially memorize the training data, leading to overfitting.
15. What does tuning the learning rate primarily help with?
A. Reducing model size
B. Improving validation data quality
C. Speeding up or stabilizing convergence during training
D. Increasing the number of training examples
Answer: C. Speeding up or stabilizing convergence during training
Explanation:
A properly tuned learning rate helps the model converge efficiently to a minimum of the loss function.
16. Suppose you are using a learning rate of 0.01. After 100 epochs, the loss is still high. What should you try first?
A. Increase the learning rate
B. Decrease the learning rate
C. Increase the number of features
D. Decrease the batch size
Answer: A. Increase the learning rate
Explanation:
A high loss after many epochs with a small learning rate suggests slow
learning. Increasing the learning rate can help speed up convergence.
17. You are training a linear regression model on 10,000 samples using mini-batch gradient descent with a batch size of 100. How many batches will be there in one epoch?
A. 10
B. 100
C. 1,000
D. 10,000
Answer: B. 100
Explanation:
Number of batches per epoch=10,000/100=100
18. If the initial weight is 0 and learning rate is 0.1, and gradient = -4, what will be the updated weight after one step of gradient descent?
A. 0.4
B. -0.4
C. 0.1
D. -0.1
Answer: A. 0.4
Explanation:
w_new = w_old − η · gradient = 0 − (0.1 × (−4)) = 0 + 0.4 = 0.4
19. What would be the total number of weight updates in 5 epochs, using a dataset with 1,000 samples and a batch size of 50?
A. 50
B. 20
C. 5,000
D. 100
Answer: D. 100
Explanation:
Batches per epoch=1000/50=20⇒Total updates in 5 epochs=20×5=100
20. Which of the following is true about the effect of increasing the number of epochs?
A. It always reduces the training loss
B. It always improves validation performance
C. It may lead to overfitting
D. It resets the weights each time
Answer: C. It may lead to overfitting
Explanation:
Too many epochs can cause the model to memorize training data, decreasing
generalization.
21. You're training with a batch size of 200, dataset size of 10,000, and 10 epochs. How many total gradient steps will occur?
A. 50
B. 100
C. 500
D. 1,000
Answer: C. 500
Explanation:
Steps per epoch=10,000/200=50⇒50×10=500 steps
22. During training, your loss decreases for several epochs but then plateaus. What's a common solution?
A. Reduce training data
B. Increase learning rate slightly
C. Decrease learning rate slightly
D. Add more features
Answer: C. Decrease learning rate slightly
Explanation:
If the model is stuck near a minimum, reducing the learning rate may help
fine-tune and escape the plateau.
23. What happens if the learning rate is set to zero?
A. The model will train very slowly
B. The model will converge faster
C. The weights won't update at all
D. The model will overfit
Answer: C. The weights won't update at all
Explanation:
With a learning rate of 0, gradient descent updates are zero, so weights stay
constant.
24. If a model is training on 200,000 samples and the batch size is 2,000, how many batches per epoch will there be?
A. 100
B. 1,000
C. 200
D. 2
Answer: A. 100
Explanation:
Batches per epoch=200,000/2,000=100
25. If the learning rate is too small (e.g., 0.00001), what might you observe in the training curve?
A. Sharp drops in loss
B. Sudden divergence in loss
C. Flat or very slow decline in loss
D. Overfitting after first epoch
Answer: C. Flat or very slow decline in loss
Explanation:
Tiny learning rates cause small parameter updates, resulting in slow learning
and little progress per epoch.
26. You reduce the batch size from 256 to 32. Which of the following is the most likely outcome?
A. Fewer weight updates per epoch
B. Slower convergence due to fewer updates
C. More frequent updates with higher variance
D. More memory usage per step
Answer: C. More frequent updates with higher variance
Explanation:
Smaller batch sizes lead to more frequent gradient updates per epoch, but each
update is noisier due to less representative data.
27. A dataset has 60,000 examples. If you use mini-batch gradient descent with batch size = 600 and train for 10 epochs, how many total updates will occur?
A. 600
B. 6,000
C. 100
D. 1,000
Answer: D. 1,000
Explanation:
Batches per epoch=60,000/600=100⇒Total updates in 10 epochs=100×10=1,000
28. What would happen if the learning rate is set too high (e.g., 10.0)?
A. Model will converge faster
B. Model may skip over the minimum and diverge
C. Model will underfit the data
D. Gradient updates will be smaller
Answer: B. Model may skip over the minimum and diverge
Explanation:
An excessively high learning rate causes unstable updates, often skipping past
the minimum or even increasing the loss.
29. Which of the following is the best strategy to choose hyperparameters like learning rate and batch size?
A. Use trial and error without evaluation
B. Fix them for all datasets
C. Tune them using validation performance
D. Choose values from previous models
Answer: C. Tune them using validation performance
Explanation:
Hyperparameters should be optimized by evaluating performance on a separate
validation set.
30. A model with a learning rate of 0.5 diverges during training. What would be a safer value to try next?
A. 1.0
B. 0.05
C. 0.9
D. 2.0
Answer: B. 0.05
Explanation:
If 0.5 causes divergence, try a smaller value like 0.05 to stabilize updates
and promote convergence.
31. Which hyperparameter controls how many times the model sees the entire dataset?
A. Learning rate
B. Batch size
C. Epochs
D. Loss function
Answer: C. Epochs
Explanation:
The number of epochs defines how many complete passes over the training dataset
the model will make.
32. You are training a model with a dataset of 50,000 examples using stochastic gradient descent (batch size = 1). How many updates per epoch will occur?
A. 1
B. 50
C. 500
D. 50,000
Answer: D. 50,000
Explanation:
Stochastic gradient descent updates weights after each example. So, with 50,000
samples, it performs 50,000 updates per epoch.
33. You try a batch size of 10 and observe high variance in loss between steps. What is a likely solution?
A. Increase learning rate
B. Increase batch size
C. Reduce number of epochs
D. Add more layers
Answer: B. Increase batch size
Explanation:
A small batch size gives noisy gradients. A larger batch size helps reduce this
variance and stabilize training.
34. For a fixed number of training examples, increasing the number of epochs without early stopping will likely cause:
A. Underfitting
B. Data leakage
C. Overfitting
D. More training data to be required
Answer: C. Overfitting
Explanation:
Too many epochs may lead the model to memorize the training data, reducing its
generalization ability.
35. Which of the following is not typically tuned as a hyperparameter in gradient descent-based training?
A. Number of epochs
B. Activation function
C. Batch size
D. Learning rate
Answer: B. Activation function
Explanation:
Activation function is part of model architecture design, not typically treated
as a tunable training hyperparameter in simple linear regression.
36. If your model's training loss is decreasing but validation loss starts increasing after a few epochs, what should you do?
A. Increase the number of epochs
B. Increase the learning rate
C. Use early stopping or reduce epochs
D. Reduce the batch size
Answer: C. Use early stopping or reduce epochs
Explanation:
This pattern indicates overfitting. Early stopping prevents the model from
continuing to learn noise from training data.
37. Suppose the initial weight is 2.0, the learning rate is 0.1, and the computed gradient is 6. What will be the weight after one update?
A. 1.4
B. 2.6
C. 1.0
D. 0.6
Answer: A. 1.4
Explanation:
w_new = w_old − η · gradient = 2.0 − 0.1 × 6 = 2.0 − 0.6 = 1.4
38. Which of the following is a common consequence of using a very large batch size (e.g., 10,000)?
A. More noise in gradient estimates
B. Faster updates per epoch
C. Smoother convergence but slower generalization
D. Higher chance of overfitting due to noise
Answer: C. Smoother convergence but slower generalization
Explanation:
Large batches reduce gradient noise, making convergence smooth, but can cause
the model to converge to sharp minima with poor generalization.
39. You increase your batch size from 64 to 512. What immediate effect does this have on each epoch?
A. More weight updates per epoch
B. Fewer weight updates per epoch
C. More memory efficiency
D. Reduced model complexity
Answer: B. Fewer weight updates per epoch
Explanation:
Larger batches mean fewer batches (updates) per epoch, since the number of
examples is fixed.
40. If your dataset contains 12,000 samples and you choose a batch size of 300 with 8 epochs, how many total weight updates will occur?
A. 32
B. 320
C. 80
D. 96
Answer: B. 320
Explanation:
Batches per epoch = 12,000 / 300 = 40 ⇒ Total updates = 40 × 8 = 320