Linear Regression

Linear regression is a fundamental technique in statistics and machine learning used to model relationships between variables.
What it does
It fits a straight line, y = mx + b, through the data points and uses that line to predict values.
- y = dependent variable
- x = independent variable
- m = slope
- b = intercept
Intuition
The goal is to find the best-fit line that minimizes prediction error.
Types
- Simple Linear Regression
- Multiple Linear Regression
How it works
It uses the least squares method: the slope and intercept are chosen to minimize the sum of squared errors between the actual and predicted values.
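For a single input variable, the least squares line has a well-known closed form. As a reference, written with the m and b notation used above, and with x̄ and ȳ denoting the sample means (symbols introduced here for illustration):

```latex
\[
m = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
b = \bar{y} - m\,\bar{x}
\]
```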
Applications
- House price prediction
- Sales forecasting
- Risk analysis
- Trend prediction
Limitations
- Assumes linear relationship
- Sensitive to outliers
- Not suitable for nonlinear data
Linear Regression Interactive Lab
Concept
Linear regression models the relationship between variables using a straight line.
Linear Regression (Working Demo)
| X | Y | Predicted | Error |
|---|---|---|---|
| 1 | 30 | | |
| 2 | 37 | | |
| 3 | 39 | | |
| 4 | 46 | | |
| 5 | 52 | | |
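The Predicted and Error columns of the demo table can be filled in by fitting a least squares line to the five (X, Y) pairs. The following is a minimal sketch in plain Python; the function name `fit_line` is an illustrative choice, and Error is assumed here to mean actual minus predicted.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (the common 1/n factor cancels).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


xs = [1, 2, 3, 4, 5]
ys = [30, 37, 39, 46, 52]
slope, intercept = fit_line(xs, ys)
print(f"slope={slope:.2f}, intercept={intercept:.2f}")

# One row per data point: X, Y, predicted value, error (actual - predicted).
for x, y in zip(xs, ys):
    predicted = slope * x + intercept
    print(x, y, round(predicted, 2), round(y - predicted, 2))
```

For this data the fitted line is roughly Y = 5.3·X + 24.9, so the prediction for X = 3 is about 40.8.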
Linear regression is a way to understand the relationship between two things. For example, imagine you sell lemonade. You notice that on hotter days, you sell more cups. On cooler days, you sell fewer. You might start to wonder, "Can I predict how many cups I'll sell if I know the temperature?"
That's exactly what linear regression helps with. It takes your past data—like the temperature each day and how many cups you sold—and draws a straight line that best fits that data. This line shows the general trend: as temperature goes up, sales go up.
Once that line is drawn, you can use it to make predictions. If tomorrow is 30 degrees, you can look at your line and estimate how many cups you'll probably sell. It doesn't give a perfect answer, but it gives a good guess based on what's happened before.
This method isn't just for lemonade. It's used all the time in real life. Businesses use it to guess how much money they might make based on advertising. Teachers can use it to see how study time affects grades. Doctors might use it to check how weight changes with age.
So, in simple words: linear regression is a tool that finds a straight-line pattern in your data and helps you make predictions based on it.

Mathematical Formula
For simple linear regression (one feature):
$Y = \beta_0 + \beta_1 X + \varepsilon$
- $Y$ = dependent variable (the value being modeled)
- $X$ = input feature
- $\beta_0$ = intercept
- $\beta_1$ = slope (coefficient)
- $\varepsilon$ = error term
Real-Life Example
Problem: Predict the salary based on years of experience.

So we can model it as:
Salary = 25,000 + 5,000 × Experience
Thus, if a person has 6 years of experience:
Salary = 25,000 + 5,000 × 6 = 55,000

The graph above shows how linear regression fits a straight line through the training data points (experience vs salary). The red line represents the model's prediction—clearly illustrating the linear relationship between years of experience and salary.
Using multiple linear regression, we predicted the salary for:
- 6 years of experience
- Master's degree (education level = 2)
📌 Predicted Salary: $55,600
This model takes both experience and education level into account, showing how several factors together can yield more accurate predictions than a single variable.

This 3D plot illustrates how multiple linear regression fits a plane through the data points, showing how both years of experience and education level contribute to the salary prediction.
- The blue dots represent actual training data.
- The red surface represents the model's predictions.
You can see that with increasing experience and higher education level, the predicted salary also increases.
The general equation for multiple linear regression is:
$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_n X_n + \varepsilon$
Where:
- $Y$: Dependent (response) variable
- $X_1, X_2, \ldots, X_n$: Independent (predictor) variables
- $\beta_0$: Intercept (value of $Y$ when all $X_i = 0$)
- $\beta_1, \beta_2, \ldots, \beta_n$: Coefficients (each is the change in $Y$ for a one-unit change in the corresponding $X_i$, holding the other variables constant)
- $\varepsilon$: Error term (captures randomness or unexplained variation)
Example:
If predicting house prices based on size (in sq ft) and number of bedrooms:
$\text{Price} = \beta_0 + \beta_1(\text{Size}) + \beta_2(\text{Bedrooms}) + \varepsilon$
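Once the coefficients are known, making a prediction is just a weighted sum. The sketch below evaluates the linear model for any number of features. The helper name `predict` is illustrative; the salary coefficients come from the example above, while the house price coefficients are hypothetical values chosen only to show the call.

```python
def predict(intercept, coefficients, features):
    """Evaluate intercept + sum(coefficient_i * feature_i)."""
    if len(coefficients) != len(features):
        raise ValueError("coefficients and features must have the same length")
    return intercept + sum(c * x for c, x in zip(coefficients, features))


# Simple regression from the salary example: Salary = 25,000 + 5,000 * Experience
salary = predict(25_000, [5_000], [6])
print(salary)

# Multiple regression for house prices. These coefficients are hypothetical:
# 150 per square foot and 10,000 per bedroom on top of a 50,000 base price.
price = predict(50_000, [150, 10_000], [1_200, 3])
print(price)
```

The same function covers both the simple and the multiple case because simple linear regression is the special case with one coefficient.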
Loss Functions in Machine Learning
Loss functions measure how far the predicted values are from the actual values.
Types of Loss Functions
| Loss Type | Definition | Formula |
|---|---|---|
| L₁ Loss | Sum of absolute differences between actual and predicted values | $\sum \lvert \text{Actual} - \text{Predicted} \rvert$ |
| Mean Absolute Error (MAE) | Average of absolute errors across all observations | $\frac{1}{N} \sum \lvert \text{Actual} - \text{Predicted} \rvert$ |
| L₂ Loss | Sum of squared differences between actual and predicted values | $\sum (\text{Actual} - \text{Predicted})^2$ |
| Mean Squared Error (MSE) | Average of squared errors across all observations | $\frac{1}{N} \sum (\text{Actual} - \text{Predicted})^2$ |
Example
| Actual | Predicted | Error | Absolute Error | Squared Error |
|---|---|---|---|---|
| 10 | 8 | 2 | 2 | 4 |
| 15 | 18 | -3 | 3 | 9 |
| 20 | 17 | 3 | 3 | 9 |
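The four losses can be computed for the three rows of the example table with a few lines of plain Python. This is a small sketch; the function names are illustrative rather than taken from any library.

```python
def l1_loss(actual, predicted):
    """Sum of absolute differences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted))


def l2_loss(actual, predicted):
    """Sum of squared differences."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted))


def mae(actual, predicted):
    """Mean absolute error: L1 loss divided by the number of observations."""
    return l1_loss(actual, predicted) / len(actual)


def mse(actual, predicted):
    """Mean squared error: L2 loss divided by the number of observations."""
    return l2_loss(actual, predicted) / len(actual)


actual = [10, 15, 20]
predicted = [8, 18, 17]

print("L1 :", l1_loss(actual, predicted))
print("MAE:", round(mae(actual, predicted), 3))
print("L2 :", l2_loss(actual, predicted))
print("MSE:", round(mse(actual, predicted), 3))
```

For these rows the L₁ loss is 8 and the L₂ loss is 22, which gives an MAE of about 2.67 and an MSE of about 7.33.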
Key Insights
- MAE weights each error in proportion to its size, so a single large error does not dominate.
- MSE penalizes large errors more heavily.
- L₂ loss is more sensitive to outliers than L₁ loss.
When to Use
- Use MAE when you want robustness against outliers.
- Use MSE when large errors should be penalized more.
Hyperparameters
Hyperparameters in Machine Learning
Hyperparameters are settings that are defined before training a machine learning model. They control how the model learns.
Hyperparameters → Set by user
Parameters → Learned by model
Examples of Hyperparameters
| Model | Hyperparameter | Description |
|---|---|---|
| Linear Regression (trained with gradient descent) | Learning Rate | Step size of each weight update |
| Decision Tree | Max Depth | Maximum depth of the tree |
| KNN | K (Neighbors) | Number of nearest neighbors |
| Neural Network | Epochs | Number of full passes over the training data |
| Neural Network | Batch Size | Number of samples per update |
Why Hyperparameters Matter
- Control model performance
- Prevent overfitting and underfitting
- Improve accuracy
Hyperparameter Tuning
Hyperparameter tuning is the process of finding the best combination of hyperparameters.
- Grid Search
- Random Search
- Bayesian Optimization
Learning Rate
Learning rate is a hyperparameter that controls how much a model updates its weights during training.
Intuition
- Small learning rate → slow learning
- Large learning rate → unstable learning
- Optimal learning rate → fast and stable convergence
Effect of Learning Rate
| Learning Rate | Behavior |
|---|---|
| Too Small | Very slow training |
| Too Large | Overshooting, may not converge |
| Optimal | Efficient and stable learning |
Why It Matters
- Controls speed of training
- Affects model accuracy
- Prevents divergence
How Many Times Are Weight and Bias Updated?
The number of times weights and biases are updated depends on the dataset size, the batch size, and the number of epochs.
An epoch is one full pass through the training dataset. If you train for 10 epochs, the model sees the data 10 times.
How often an update happens within an epoch depends on the gradient descent variant:
- Stochastic Gradient Descent (SGD): An update happens after every sample.
- Mini-batch Gradient Descent: An update happens after every batch.
- Batch Gradient Descent: An update happens once per epoch.
Summary Table (dataset of 100 samples)
| Method | Batch Size | Updates per Epoch |
|---|---|---|
| SGD | 1 | 100 |
| Mini-batch | 10 | 10 |
| Batch Gradient Descent | 100 | 1 |
Formula
Total updates = (Dataset size / Batch size) × Epochs
Example
Dataset = 100 samples
Epochs = 5
Batch size = 10
Total updates = (100 / 10) × 5 = 10 × 5 = 50 updates
Key Insight
- More epochs → more learning cycles
- Smaller batch size → more frequent updates
- Larger batch size → fewer but more stable updates
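The update count from the example can be computed with a small helper. This is a sketch under one stated assumption: when the dataset size is not a multiple of the batch size, the final, smaller batch still counts as one update (hence the ceiling). The name `total_updates` is illustrative.

```python
import math


def total_updates(num_samples, batch_size, epochs):
    """Number of weight updates: batches per epoch multiplied by epochs."""
    # Assumption: a final partial batch still triggers an update.
    batches_per_epoch = math.ceil(num_samples / batch_size)
    return batches_per_epoch * epochs


print(total_updates(100, 10, 5))      # the example above
print(total_updates(100, 1, 1))       # SGD, one epoch
print(total_updates(100, 100, 1))     # batch gradient descent, one epoch
```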
Gradient Descent (Step-by-Step)
Gradient Descent is an optimization algorithm used to minimize loss.
Step-by-Step Example (learning rate = 0.1)
| Step | Weight | Gradient | Updated Weight |
|---|---|---|---|
| 1 | 10 | 2 | 10 - (0.1×2) = 9.8 |
| 2 | 9.8 | 1.5 | 9.8 - (0.1×1.5) = 9.65 |
| 3 | 9.65 | 1 | 9.65 - (0.1×1) = 9.55 |
Key Idea
- Move in direction of decreasing error
- Repeat until minimum is reached
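The three rows of the step-by-step table can be replayed in code. This sketch simply feeds in the gradients listed in the table (2, 1.5, 1) rather than computing them from a loss function, and uses the learning rate 0.1 that appears in the table's arithmetic.

```python
learning_rate = 0.1
weight = 10.0
history = []

# Gradients are taken directly from the table, one per step.
for gradient in [2.0, 1.5, 1.0]:
    weight = weight - learning_rate * gradient
    history.append(round(weight, 2))

print(history)
```

Each pass through the loop applies the same rule: move the weight against the gradient by an amount scaled by the learning rate.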
Types of Gradient Descent
Gradient Descent is used to update weights and minimize loss. There are three main types based on how data is processed.
Stochastic Gradient Descent (SGD)
- Uses one data point at a time
- Updates weights after every sample
- Very fast but noisy
Mini-batch Gradient Descent
- Uses small batches of data
- Updates weights after each batch
- Balanced approach
Batch Gradient Descent
- Uses entire dataset
- Updates weights once per epoch
- Very stable but slow
Comparison Table
| Type | Data Used | Update Frequency | Speed | Stability |
|---|---|---|---|---|
| SGD | 1 sample | Very frequent | Fast | Low |
| Mini-batch | Small batch | Moderate | Balanced | Medium |
| Batch GD | Full dataset | Once per epoch | Slow | High |
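The three variants differ only in how many samples feed each update, so a single training loop with a `batch_size` argument can express all of them. The sketch below fits y ≈ w·x + b in plain Python; the function name `train`, the synthetic data, and the learning rate are assumptions made for illustration.

```python
import random


def train(xs, ys, batch_size, epochs=1, learning_rate=0.01, seed=0):
    """Fit y ~ w*x + b with gradient descent; return (w, b, number_of_updates)."""
    rng = random.Random(seed)
    indices = list(range(len(xs)))
    w, b, updates = 0.0, 0.0, 0
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the samples in a new order each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            errors = [w * xs[i] + b - ys[i] for i in batch]
            # Gradients averaged over the samples in this batch.
            grad_w = sum(e * xs[i] for e, i in zip(errors, batch)) / len(batch)
            grad_b = sum(errors) / len(batch)
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
            updates += 1
    return w, b, updates


# Hypothetical data: 100 points on the line y = 2x + 1, with x in [0, 1).
xs = [i / 100 for i in range(100)]
ys = [2 * x + 1 for x in xs]

_, _, sgd_updates = train(xs, ys, batch_size=1)      # SGD
_, _, mini_updates = train(xs, ys, batch_size=10)    # mini-batch
_, _, batch_updates = train(xs, ys, batch_size=100)  # batch gradient descent
print(sgd_updates, mini_updates, batch_updates)
```

With 100 samples and one epoch, the three calls perform 100, 10, and 1 updates respectively.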
Gradient Descent Convergence (Multi-Iteration)
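Training data: x = 1, 2, 3 with targets y = 2, 4, 6, so the best fit is w = 2, b = 0
Learning Rate = 0.1
Each update uses the sample-averaged gradients (1/n) Σ (prediction − y)·x for w and (1/n) Σ (prediction − y) for b, which is the gradient of half the MSE.
| Iteration | w | b | Predictions (x=1,2,3) | Loss (MSE) |
|---|---|---|---|---|
| 0 | 0.00 | 0.00 | 0, 0, 0 | 18.67 |
| 1 | 0.93 | 0.40 | 1.33, 2.27, 3.20 | 3.76 |
| 2 | 1.35 | 0.57 | 1.92, 3.28, 4.63 | 0.81 |
| 3 | 1.54 | 0.65 | 2.19, 3.72, 5.26 | 0.22 |
| 4 | 1.63 | 0.67 | 2.30, 3.92, 5.55 | 0.10 |
| 5 | 1.67 | 0.68 | 2.35, 4.01, 5.68 | 0.07 |
Key Insight
- Loss decreases with each iteration, quickly at first and then more slowly
- w moves toward 2 from the start, while b first rises and only later shrinks back toward 0
- Given enough iterations, the model converges to the optimal solution (w = 2, b = 0)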
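A multi-iteration run like this can be generated with a short loop. The sketch below rests on two assumptions that are not stated explicitly in the text: the targets are y = 2, 4, 6 (which gives the initial loss of 18.67 and an optimum at w = 2, b = 0), and the gradients are averaged over the samples with a factor of 1/n. It prints w, b, and the MSE after each iteration.

```python
def mse(w, b, xs, ys):
    """Mean squared error of the line w*x + b on the data."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def gradient_descent(xs, ys, learning_rate=0.1, iterations=5):
    """Return a list of (iteration, w, b, mse) rows, starting from w = b = 0."""
    n = len(xs)
    w, b = 0.0, 0.0
    rows = [(0, w, b, mse(w, b, xs, ys))]
    for step in range(1, iterations + 1):
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        # Assumption: gradients averaged over samples (gradient of half the MSE).
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
        rows.append((step, w, b, mse(w, b, xs, ys)))
    return rows


rows = gradient_descent([1, 2, 3], [2, 4, 6])
for step, w, b, loss in rows:
    print(f"{step}  w={w:.2f}  b={b:.2f}  mse={loss:.2f}")
```

Raising `iterations` to a few thousand shows the slow tail of the convergence: w and b keep drifting toward 2 and 0 long after the loss looks small.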
📘 The Formula:
new weight = old weight − learning rate × gradient
gradient with respect to weight = ∂L/∂w
gradient with respect to bias = ∂L/∂b
Example:
If you're minimizing a loss function, and the gradient is 0.6, then with a learning rate of 0.01:
New weight = Weight − (0.01 × 0.6) = Weight − 0.006
Example:
Say you have:
- 10,000 samples
- Batch size = 100
- Epochs = 5
Then:
- Number of updates = (10,000 / 100) × 5 = 100 × 5 = 500 updates
- So the weights and biases are updated 500 times.
Let's go through examples to calculate and understand learning rate in action, especially how it affects weight updates using gradient descent.
🔁 Basic Formula: Gradient Descent Update
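For a given weight $w$, the update rule is:
$w_{\text{new}} = w_{\text{old}} - \eta \cdot \frac{\partial L}{\partial w}$
Where:
- $\eta$ = learning rate
- $\frac{\partial L}{\partial w}$ = gradient of the loss function with respect to the weight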
✅ Example 1: Simple Gradient Descent Update
Given:
- Initial weight: $w = 0.5$
- Gradient: $\partial L / \partial w = 0.8$
- Learning rate: $\eta = 0.1$
Calculation:
$w_{\text{new}} = 0.5 - 0.1 \times 0.8 = 0.5 - 0.08 = 0.42$
So, the new weight is 0.42.
⚠️ Observation:
- Learning rate too small: slow learning.
- Learning rate too large: updates can overshoot the minimum or diverge.
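The observation above can be checked numerically on a toy problem. The sketch below minimizes L(w) = w², whose gradient is 2w and whose minimum is at w = 0. The loss function, the starting point, and the three learning rates are all illustrative choices.

```python
def run_gradient_descent(learning_rate, start=1.0, steps=20):
    """Minimize L(w) = w**2 (gradient 2*w) and return the final weight."""
    w = start
    for _ in range(steps):
        w -= learning_rate * 2 * w
    return w


too_small = run_gradient_descent(0.01)   # barely moves toward 0
reasonable = run_gradient_descent(0.1)   # gets close to 0
too_large = run_gradient_descent(1.1)    # each step overshoots further

print(f"lr=0.01 -> w={too_small:.4f}")
print(f"lr=0.1  -> w={reasonable:.4f}")
print(f"lr=1.1  -> w={too_large:.4f}")
```

After 20 steps the weight is still around 0.67 with the smallest rate, about 0.012 with the middle one, and has grown to roughly 38 with the largest, because each step multiplies w by −1.2.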
✅ Example 3: With Bias Update
Let's include bias b.
Given:
- $w = 0.6$, $b = 0.2$
- Gradients: $\partial L / \partial w = 0.4$, $\partial L / \partial b = 0.3$
- Learning rate: $\eta = 0.05$
Update:
$w_{\text{new}} = 0.6 - 0.05 \times 0.4 = 0.6 - 0.02 = 0.58$
$b_{\text{new}} = 0.2 - 0.05 \times 0.3 = 0.2 - 0.015 = 0.185$
✅ New values: $w = 0.58$, $b = 0.185$
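Both worked examples follow the same rule, so they can be verified with one small function. This is a minimal sketch; the name `update_parameters` is illustrative, and for Example 1 the bias and its gradient are simply set to 0 because that example has no bias.

```python
def update_parameters(w, b, grad_w, grad_b, learning_rate):
    """Apply one gradient descent step to a weight and a bias."""
    new_w = w - learning_rate * grad_w
    new_b = b - learning_rate * grad_b
    return new_w, new_b


# Example 1: weight only (bias and its gradient set to 0 for this call).
w1, _ = update_parameters(0.5, 0.0, 0.8, 0.0, learning_rate=0.1)

# Example 3: weight and bias updated together.
w3, b3 = update_parameters(0.6, 0.2, 0.4, 0.3, learning_rate=0.05)

print(round(w1, 3), round(w3, 3), round(b3, 3))
```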
Adam Optimizer
Adam (Adaptive Moment Estimation) is an advanced optimization algorithm that combines momentum and adaptive learning rates.
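g = gradient of the loss with respect to the weight
m = first moment (moving average of the gradients)
v = second moment (moving average of the squared gradients)
Update Equations
m = β₁m + (1−β₁)g
v = β₂v + (1−β₂)g²
m̂ = m / (1−β₁ᵗ), v̂ = v / (1−β₂ᵗ) (bias correction at step t)
w = w − η × m̂ / (√v̂ + ε), where ε is a small constant that prevents division by zero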
Why Adam Works
- Uses past gradients → faster learning
- Adjusts learning rate → stable updates
- Reduces oscillations
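To make the roles of the two moving averages concrete, here is a single-parameter sketch of Adam in plain Python. It follows the standard formulation, including the bias-correction step and a small constant ε in the denominator, with the commonly used defaults β₁ = 0.9, β₂ = 0.999, ε = 1e-8. The function name `adam_minimize`, the test loss (w − 3)², the learning rate, and the step count are illustrative assumptions.

```python
import math


def adam_minimize(grad_fn, w0, learning_rate=0.1, beta1=0.9, beta2=0.999,
                  eps=1e-8, steps=500):
    """Minimize a one-dimensional function given its gradient, using Adam."""
    w, m, v = w0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g        # moving average of gradients
        v = beta2 * v + (1 - beta2) * g * g    # moving average of squared gradients
        m_hat = m / (1 - beta1 ** t)           # bias correction for early steps
        v_hat = v / (1 - beta2 ** t)
        w -= learning_rate * m_hat / (math.sqrt(v_hat) + eps)
    return w


# Illustrative loss L(w) = (w - 3)**2 with gradient 2 * (w - 3); minimum at w = 3.
w_final = adam_minimize(lambda w: 2 * (w - 3), w0=0.0)
print(round(w_final, 4))
```

Dividing by √v̂ makes the step size roughly independent of the gradient's scale, which is why Adam is often less sensitive to the choice of learning rate than plain gradient descent.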
Comparison
| Optimizer | Speed | Stability |
|---|---|---|
| SGD | Fast | Low |
| Batch GD | Slow | High |
| Adam | Fast | High |
📊 Real-Life Examples of Linear Regression
Linear Regression helps us understand relationships between variables and predict outcomes. Here are practical real-world applications across industries.
| Input | Output | Note |
|---|---|---|
| Area, Rooms | Price | Example: Larger houses → Higher prices |
| Experience | Salary | Insight: Salary grows linearly in early career stages |
| Marketing Budget | Sales | Use: Optimize marketing ROI |
| Age, Weight | BP/Cholesterol | Use: Preventive healthcare |
| Engine Size | Mileage | Insight: Bigger engines → Lower efficiency |
| Study Time | Marks | Example: More study → Better scores |
| Year | Temperature | Use: Climate change analysis |
Y = mX + c
Where:
m = slope (impact of X on Y)
c = intercept (starting value)
Multiple choice questions (MCQs)
✅ 1. What is the purpose of the learning rate in a machine learning model?
A) To determine the number of layers in the model
B) To set the maximum number of epochs
C) To control how much the weights are adjusted during training
D) To control the size of the dataset
Answer: ✅ C
Explanation: The learning rate controls how much we adjust the weights of the model based on the error calculated.
✅ 2. What can happen if the learning rate is too high?
A) The model will train very slowly
B) The model may not learn at all
C) The model may overshoot the optimal point and diverge
D) It will always find the best solution quickly
Answer: ✅ C
Explanation: A high learning rate can cause the model to skip over the best values, leading to poor or unstable learning.
✅ 3. If the learning rate is too low, what is most likely to happen?
A) The model will converge very quickly
B) The model will never converge
C) The model will converge very slowly
D) The model will skip the global minimum
Answer: ✅ C
Explanation: A very small learning rate makes learning slow, and training may end before the model reaches the best solution.
✅ 4. Which of the following is a typical range for a learning rate?
A) 10–100
B) 0.01–0.1
C) 100–1000
D) -1 to 0
Answer: ✅ B
Explanation: Learning rates are small positive values. The range 0.01 to 0.1 is a common starting point, and smaller values such as 0.001 are also used depending on the problem and optimizer.
✅ 5. In the weight update formula w = w − η⋅∂L/∂w, what does η represent?
A) Bias
B) Epoch
C) Loss
D) Learning Rate
Answer: ✅ D
Explanation: The symbol η (eta) commonly denotes the learning rate in the formula.
✅ 6. Which method adjusts the learning rate during training?
A) Stochastic Gradient Descent
B) Adaptive Learning Rate
C) Batch Normalization
D) Dropout
Answer: ✅ B
Explanation: Adaptive methods like Adam, RMSProp, and Adagrad adjust the learning rate during training.
✅ 7. Which optimizer uses an adaptive learning rate for each parameter?
A) SGD
B) Adam
C) Linear Regression
D) KNN
Answer: ✅ B
Explanation: Adam (Adaptive Moment Estimation) adjusts the learning rate based on estimates of first and second moments of the gradients.
✅ 8. What is a common sign that the learning rate is too high?
A) The training loss decreases steadily
B) The training loss is flat
C) The training loss fluctuates wildly or increases
D) The model performs well on the test set
Answer: ✅ C
Explanation: High learning rates can cause the loss to increase or bounce around, indicating instability.
MCQs on Hyperparameters
1. Which of the following is a hyperparameter in linear regression with gradient descent?
A. Weights
B. Bias
C. Learning rate
D. Features
Answer: C. Learning rate
Explanation:
Hyperparameters are set before training begins and are not learned from the data. The learning rate is a typical hyperparameter in gradient descent. Weights and bias are learned parameters, and features are part of the dataset, not hyperparameters.
2. What is the effect of setting the learning rate too high in gradient descent?
A. The model will converge more slowly
B. The model may overshoot the minimum and not converge
C. The model will perfectly fit the training data
D. The model will stop updating weights
Answer: B. The model may overshoot the minimum and not converge
Explanation:
A high learning rate can cause the updates to "overshoot" the optimal values, potentially causing the loss to increase or fluctuate, preventing convergence.
3. What does it mean if a model's loss decreases very slowly during training?
A. The learning rate may be too high
B. The model has converged
C. The learning rate may be too low
D. The model has overfitted
Answer: C. The learning rate may be too low
Explanation:
A low learning rate results in very small updates to weights, causing the model to take a long time to converge.
4. Which of the following is not typically considered a hyperparameter in linear regression using gradient descent?
A. Batch size
B. Number of features
C. Learning rate
D. Number of epochs
Answer: B. Number of features
Explanation:
The number of features is determined by the dataset, not by the training process. In contrast, batch size, learning rate, and number of epochs are user-specified hyperparameters.
5. What is a good approach to finding the right hyperparameters for your model?
A. Set random values and hope for the best
B. Train the model without tuning
C. Try a range of values and compare performance
D. Use the same values for all models
Answer: C. Try a range of values and compare performance
Explanation:
Hyperparameter tuning involves trying different values (e.g., grid search or random search) and evaluating which set yields the best performance.
6. If your model's loss is fluctuating wildly during training, what could be the issue?
A. Too many training examples
B. Too few epochs
C. Learning rate is too low
D. Learning rate is too high
Answer: D. Learning rate is too high
Explanation:
A high learning rate can make training unstable, causing the loss to fluctuate instead of decreasing smoothly.
7. Which of the following best describes the role of a hyperparameter?
A. A parameter learned from the dataset
B. A setting that controls the training process
C. A component of the cost function
D. A value used to initialize the dataset
Answer: B. A setting that controls the training process
Explanation:
Hyperparameters are user-defined settings like learning rate, batch size, and epochs that influence how the model learns.
8. What is the most likely outcome of training a model for too many epochs?
A. The model becomes more generalized
B. The model underfits the training data
C. The model overfits the training data
D. The model stops learning
Answer: C. The model overfits the training data
Explanation:
Training for too many epochs can cause the model to memorize the training data, reducing its ability to generalize to new data (overfitting).
9. Which of the following best defines "epoch" in the context of model training?
A. A complete forward and backward pass for one training sample
B. A set of hyperparameter tuning trials
C. One complete pass through the entire training dataset
D. The number of layers in the neural network
Answer: C. One complete pass through the entire training dataset
Explanation:
An epoch is defined as one full pass over the entire training dataset during model training.
10. How does batch size affect training in gradient descent?
A. It changes the number of features in the dataset
B. It determines how many weights are updated
C. It controls how many examples are used to calculate the gradient in each step
D. It does not affect the training process
Answer: C. It controls how many examples are used to calculate the gradient in each step
Explanation:
Batch size defines how many training examples are used in one forward/backward pass. Smaller batches update more frequently, while larger ones give a smoother estimate of the gradient.
11. If your model's training loss is decreasing but validation loss is increasing, what's happening?
A. Underfitting
B. Normal convergence
C. Overfitting
D. The learning rate is too low
Answer: C. Overfitting
Explanation:
When the model performs well on training data but poorly on unseen data (validation), it indicates overfitting.
12. What is the trade-off when choosing a very small batch size?
A. Slower convergence and lower generalization
B. Faster convergence but noisier updates
C. Always better performance
D. Less RAM usage but no impact on training
Answer: B. Faster convergence but noisier updates
Explanation:
Smaller batch sizes can update weights more frequently, which speeds up convergence but makes the gradient estimates noisier.
13. What happens when you increase the batch size significantly?
A. You always get better accuracy
B. Training becomes unstable
C. Gradient estimates become more accurate but require more memory
D. Learning rate must be decreased
Answer: C. Gradient estimates become more accurate but require more memory
Explanation:
Larger batches provide a more accurate gradient approximation but need more memory and computation per step.
14. Which combination is most likely to cause overfitting?
A. Low learning rate, few epochs
B. High learning rate, small batch size
C. High learning rate, many epochs
D. Low learning rate, many epochs
Answer: D. Low learning rate, many epochs
Explanation:
A low learning rate with many epochs allows the model to train slowly and potentially memorize the training data, leading to overfitting.
15. What does tuning the learning rate primarily help with?
A. Reducing model size
B. Improving validation data quality
C. Speeding up or stabilizing convergence during training
D. Increasing the number of training examples
Answer: C. Speeding up or stabilizing convergence during training
Explanation:
A properly tuned learning rate helps the model converge efficiently to a minimum of the loss function.
16. Suppose you are using a learning rate of 0.01. After 100 epochs, the loss is still high. What should you try first?
A. Increase the learning rate
B. Decrease the learning rate
C. Increase the number of features
D. Decrease the batch size
Answer: A. Increase the learning rate
Explanation:
A high loss after many epochs with a small learning rate suggests slow learning. Increasing the learning rate can help speed up convergence.
17. You are training a linear regression model on 10,000 samples using mini-batch gradient descent with a batch size of 100. How many batches will be there in one epoch?
A. 10
B. 100
C. 1,000
D. 10,000
Answer: B. 100
Explanation:
Number of batches per epoch=10,000/100=100
18. If the initial weight is 0 and learning rate is 0.1, and gradient = -4, what will be the updated weight after one step of gradient descent?
A. 0.4
B. -0.4
C. 0.1
D. -0.1
Answer: A. 0.4
Explanation:
wnew=wold−η⋅gradient=0−(0.1⋅−4)=0+0.4=0.4
19. What would be the total number of weight updates in 5 epochs, using a dataset with 1,000 samples and a batch size of 50?
A. 50
B. 20
C. 5,000
D. 100
Answer: D. 100
Explanation:
Batches per epoch=1000/50=20⇒Total updates in 5 epochs=20×5=100
20. Which of the following is true about the effect of increasing the number of epochs?
A. It always reduces the training loss
B. It always improves validation performance
C. It may lead to overfitting
D. It resets the weights each time
Answer: C. It may lead to overfitting
Explanation:
Too many epochs can cause the model to memorize training data, decreasing generalization.
21. You're training with a batch size of 200, dataset size of 10,000, and 10 epochs. How many total gradient steps will occur?
A. 50
B. 100
C. 500
D. 1,000
Answer: C. 500
Explanation:
Steps per epoch=10,000/200=50⇒50×10=500 steps
22. During training, your loss decreases for several epochs but then plateaus. What's a common solution?
A. Reduce training data
B. Increase learning rate slightly
C. Decrease learning rate slightly
D. Add more features
Answer: C. Decrease learning rate slightly
Explanation:
If the model is stuck near a minimum, reducing the learning rate may help fine-tune and escape the plateau.
23. What happens if the learning rate is set to zero?
A. The model will train very slowly
B. The model will converge faster
C. The weights won't update at all
D. The model will overfit
Answer: C. The weights won't update at all
Explanation:
With a learning rate of 0, gradient descent updates are zero, so weights stay constant.
24. If a model is training on 200,000 samples and the batch size is 2,000, how many batches per epoch will there be?
A. 100
B. 1,000
C. 200
D. 2
Answer: A. 100
Explanation:
Batches per epoch=200,000/2,000=100
25. If the learning rate is too small (e.g., 0.00001), what might you observe in the training curve?
A. Sharp drops in loss
B. Sudden divergence in loss
C. Flat or very slow decline in loss
D. Overfitting after first epoch
Answer: C. Flat or very slow decline in loss
Explanation:
Tiny learning rates cause small parameter updates, resulting in slow learning and little progress per epoch.
26. You reduce the batch size from 256 to 32. Which of the following is the most likely outcome?
A. Fewer weight updates per epoch
B. Slower convergence due to fewer updates
C. More frequent updates with higher variance
D. More memory usage per step
Answer: C. More frequent updates with higher variance
Explanation:
Smaller batch sizes lead to more frequent gradient updates per epoch, but each update is noisier due to less representative data.
27. A dataset has 60,000 examples. If you use mini-batch gradient descent with batch size = 600 and train for 10 epochs, how many total updates will occur?
A. 600
B. 6,000
C. 100
D. 1,000
Answer: D. 1,000
Explanation:
Batches per epoch=60,000/600=100⇒Total updates in 10 epochs=100×10=1,000
28. What would happen if the learning rate is set too high (e.g., 10.0)?
A. Model will converge faster
B. Model may skip over the minimum and diverge
C. Model will underfit the data
D. Gradient updates will be smaller
Answer: B. Model may skip over the minimum and diverge
Explanation:
An excessively high learning rate causes unstable updates, often skipping past the minimum or even increasing the loss.
29. Which of the following is the best strategy to choose hyperparameters like learning rate and batch size?
A. Use trial and error without evaluation
B. Fix them for all datasets
C. Tune them using validation performance
D. Choose values from previous models
Answer: C. Tune them using validation performance
Explanation:
Hyperparameters should be optimized by evaluating performance on a separate validation set.
30. A model with a learning rate of 0.5 diverges during training. What would be a safer value to try next?
A. 1.0
B. 0.05
C. 0.9
D. 2.0
Answer: B. 0.05
Explanation:
If 0.5 causes divergence, try a smaller value like 0.05 to stabilize updates and promote convergence.
31. Which hyperparameter controls how many times the model sees the entire dataset?
A. Learning rate
B. Batch size
C. Epochs
D. Loss function
Answer: C. Epochs
Explanation:
The number of epochs defines how many complete passes over the training dataset the model will make.
32. You are training a model with a dataset of 50,000 examples using stochastic gradient descent (batch size = 1). How many updates per epoch will occur?
A. 1
B. 50
C. 500
D. 50,000
Answer: D. 50,000
Explanation:
Stochastic gradient descent updates weights after each example. So, with 50,000 samples, it performs 50,000 updates per epoch.
33. You try a batch size of 10 and observe high variance in loss between steps. What is a likely solution?
A. Increase learning rate
B. Increase batch size
C. Reduce number of epochs
D. Add more layers
Answer: B. Increase batch size
Explanation:
A small batch size gives noisy gradients. A larger batch size helps reduce this variance and stabilize training.
34. For a fixed number of training examples, increasing the number of epochs without early stopping will likely cause:
A. Underfitting
B. Data leakage
C. Overfitting
D. More training data to be required
Answer: C. Overfitting
Explanation:
Too many epochs may lead the model to memorize the training data, reducing its generalization ability.
35. Which of the following is not typically tuned as a hyperparameter in gradient descent-based training?
A. Number of epochs
B. Activation function
C. Batch size
D. Learning rate
Answer: B. Activation function
Explanation:
Activation function is part of model architecture design, not typically treated as a tunable training hyperparameter in simple linear regression.
36. If your model's training loss is decreasing but validation loss starts increasing after a few epochs, what should you do?
A. Increase the number of epochs
B. Increase the learning rate
C. Use early stopping or reduce epochs
D. Reduce the batch size
Answer: C. Use early stopping or reduce epochs
Explanation:
This pattern indicates overfitting. Early stopping prevents the model from continuing to learn noise from training data.
37. Suppose the initial weight is 2.0, the learning rate is 0.1, and the computed gradient is 6. What will be the weight after one update?
A. 1.4
B. 2.6
C. 1.0
D. 0.6
Answer: A. 1.4
Explanation:
wnew=wold−η⋅gradient=2.0−0.1⋅6=2.0−0.6=1.4
38. Which of the following is a common consequence of using a very large batch size (e.g., 10,000)?
A. More noise in gradient estimates
B. Faster updates per epoch
C. Smoother convergence but potentially worse generalization
D. Higher chance of overfitting due to noise
Answer: C. Smoother convergence but potentially worse generalization
Explanation:
Large batches reduce gradient noise, making convergence smooth, but can cause the model to converge to sharp minima with poor generalization.
39. You increase your batch size from 64 to 512. What immediate effect does this have on each epoch?
A. More weight updates per epoch
B. Fewer weight updates per epoch
C. More memory efficiency
D. Reduced model complexity
Answer: B. Fewer weight updates per epoch
Explanation:
Larger batches mean fewer batches (updates) per epoch, since the number of examples is fixed.
40. If your dataset contains 12,000 samples and you choose a batch size of 300 with 8 epochs, how many total weight updates will occur?
A. 32
B. 320
C. 80
D. 96
Answer: B. 320
Explanation:
Batches per epoch = 12,000 / 300 = 40 ⇒ Total updates = 40 × 8 = 320
