Linear Regression

26/05/2025

Linear regression is a fundamental technique in statistics and machine learning used to model relationships between variables.

What it does

It fits a straight line through data points to predict values.

y = mx + b

y = dependent variable
x = independent variable
m = slope
b = intercept

Intuition

The goal is to find the best-fit line that minimizes prediction error.

Types

Simple Linear Regression
Multiple Linear Regression

y = b + m1x1 + m2x2 + ... + mnxn

How it works

Uses the least squares method: it chooses the slope and intercept that minimize the sum of squared errors between the actual and predicted values.

Applications

House price prediction
Sales forecasting
Risk analysis
Trend prediction

Limitations

Assumes linear relationship
Sensitive to outliers
Not suitable for nonlinear data

Linear Regression Lab

Linear Regression Interactive Lab

Concept

Linear regression models the relationship between variables using a straight line.

y = mx + b

Linear Regression Demo

Linear Regression (Working Demo)

X	Y	Predicted	Error
1	30
2	37
3	39
4	46
5	52
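
The empty Predicted and Error columns in the table above can be filled in by fitting a least-squares line to the X and Y values. The following is a minimal sketch in plain Python; the function name `fit_line` is an illustrative choice, and Error is taken as Actual − Predicted.

```python
# Demo data from the table above.
xs = [1, 2, 3, 4, 5]
ys = [30, 37, 39, 46, 52]

def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

slope, intercept = fit_line(xs, ys)
predicted = [slope * x + intercept for x in xs]
errors = [y - p for y, p in zip(ys, predicted)]

print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
for x, y, p, e in zip(xs, ys, predicted, errors):
    print(f"{x}\t{y}\t{p:.1f}\t{e:+.1f}")
```

For this data the fitted line is approximately y = 5.3x + 24.9, and the errors sum to zero, as they always do for a least-squares fit with an intercept.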

Linear regression is a way to understand the relationship between two things. For example, imagine you sell lemonade. You notice that on hotter days, you sell more cups. On cooler days, you sell fewer. You might start to wonder, "Can I predict how many cups I'll sell if I know the temperature?"

That's exactly what linear regression helps with. It takes your past data—like the temperature each day and how many cups you sold—and draws a straight line that best fits that data. This line shows the general trend: as temperature goes up, sales go up.

Once that line is drawn, you can use it to make predictions. If tomorrow is 30 degrees, you can look at your line and estimate how many cups you'll probably sell. It doesn't give a perfect answer, but it gives a good guess based on what's happened before.

This method isn't just for lemonade. It's used all the time in real life. Businesses use it to guess how much money they might make based on advertising. Teachers can use it to see how study time affects grades. Doctors might use it to check how weight changes with age.

So, in simple words: linear regression is a tool that finds a straight-line pattern in your data and helps you make predictions based on it.

Mathematical Formula

For simple linear regression (one feature):

Y = β0 + β1X + ε

Y = dependent variable (observed value)
X = input feature
β0 = intercept
β1 = slope (coefficient)
ε = error term

Real-Life Example

Problem: Predict the salary based on years of experience.

So we can model it as:

Salary = 25,000 + 5,000 × Experience

Thus, if a person has 6 years of experience:

Salary = 25,000 + 5,000 × 6 = 55,000

The graph above shows how linear regression fits a straight line through the training data points (experience vs salary). The red line represents the model's prediction—clearly illustrating the linear relationship between years of experience and salary.

Using multiple linear regression, we predicted the salary for:

6 years of experience
Master's degree (education level = 2)

📌 Predicted Salary: $55,600

This model takes both experience and education level into account, showing how combining several factors can give more accurate predictions than relying on a single variable.

This 3D plot illustrates how multiple linear regression fits a plane through the data points, showing how both years of experience and education level contribute to the salary prediction.

The blue dots represent actual training data.
The red surface represents the model's predictions.

You can see that with increasing experience and higher education level, the predicted salary also increases.

The general equation for multiple linear regression is:

Y=β0+β1X1+β2X2+⋯+βnXn+ε

Where:

Y: Dependent (response) variable
X1, X2, …, Xn: Independent (predictor) variables
β0: Intercept (value of Y when all Xi = 0)
β1, β2, …, βn: Coefficients (each is the change in Y for a one-unit change in the corresponding Xi, holding the other variables constant)
ε: Error term (captures randomness or unexplained variation)

Example:

If predicting house prices based on size (in sq ft) and number of bedrooms:

Price=β0+β1(Size)+β2(Bedrooms)+ε
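
Once the coefficients are known, making a prediction is just evaluating the equation. Here is a small sketch of that step. The `predict` helper is an illustrative name, and the house-price coefficients are hypothetical values chosen only to show the arithmetic; the salary numbers come from the earlier example.

```python
def predict(intercept, coefficients, features):
    """Evaluate Y = b0 + b1*X1 + ... + bn*Xn for one observation."""
    if len(coefficients) != len(features):
        raise ValueError("need one coefficient per feature")
    return intercept + sum(b * x for b, x in zip(coefficients, features))

# Salary example from the text: 25,000 + 5,000 x 6 years of experience.
salary = predict(25_000, [5_000], [6])
print(salary)  # 55000

# House-price example with HYPOTHETICAL coefficients (not from the text):
# 50,000 base price, 100 per sq ft, 10,000 per bedroom.
price = predict(50_000, [100, 10_000], [1_500, 3])
print(price)  # 230000
```

The error term ε does not appear in the code because it is the part of Y that the line cannot explain; a prediction uses only the fitted coefficients.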

LOSS

Loss Functions in Machine Learning

Loss functions measure how far the predicted values are from the actual values.

Error = Actual Value − Predicted Value

Types of Loss Functions

Loss Type	Definition	Formula
L₁ Loss	Sum of absolute differences between actual and predicted values	Σ |Actual − Predicted|
Mean Absolute Error (MAE)	Average of absolute errors across all observations	(1/N) Σ |Actual − Predicted|
L₂ Loss	Sum of squared differences between actual and predicted values	Σ (Actual − Predicted)²
Mean Squared Error (MSE)	Average of squared errors across all observations	(1/N) Σ (Actual − Predicted)²

Example

Actual	Predicted	Error	|Error|	Error²
10	8	2	2	4
15	18	-3	3	9
20	17	3	3	9
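
The four loss values for this example can be computed directly from the table. This is a short sketch using the three rows above; the variable names are illustrative.

```python
actual = [10, 15, 20]
predicted = [8, 18, 17]

# Error = Actual - Predicted, as defined earlier.
errors = [a - p for a, p in zip(actual, predicted)]
n = len(errors)

l1_loss = sum(abs(e) for e in errors)   # sum of absolute errors
mae = l1_loss / n                       # mean absolute error
l2_loss = sum(e ** 2 for e in errors)   # sum of squared errors
mse = l2_loss / n                       # mean squared error

print(errors)                   # [2, -3, 3]
print(l1_loss, round(mae, 3))   # 8 2.667
print(l2_loss, round(mse, 3))   # 22 7.333
```

The squared terms (4, 9, 9) grow faster than the absolute terms (2, 3, 3), which is why MSE weights large errors more heavily than MAE.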

Key Insights

MAE treats all errors equally.
MSE penalizes large errors more heavily.
L₂ loss is more sensitive to outliers than L₁ loss.

When to Use

Use MAE when you want robustness against outliers.
Use MSE when large errors should be penalized more.

Hyperparameters

Hyperparameters in Machine Learning

Hyperparameters are settings that are defined before training a machine learning model. They control how the model learns.

Hyperparameters ≠ Model Parameters
Hyperparameters → Set by user
Parameters → Learned by model

Examples of Hyperparameters

Model	Hyperparameter	Description
Linear Regression (trained with gradient descent)	Learning Rate	Controls the size of each weight update
Decision Tree	Max Depth	Maximum depth of the tree
KNN	K (Neighbors)	Number of nearest neighbors
Neural Network	Epochs	Number of full passes over the training data
Neural Network	Batch Size	Number of samples per update

Why Hyperparameters Matter

Control model performance
Prevent overfitting and underfitting
Improve accuracy

Hyperparameter Tuning

Hyperparameter tuning is the process of finding the best combination of hyperparameters.

Grid Search
Random Search
Bayesian Optimization

Key Insight

Better hyperparameters → Better model performance

Learning Rate in Machine Learning

Learning Rate

Learning rate is a hyperparameter that controls how much a model updates its weights during training.

New Weight = Old Weight − (Learning Rate × Gradient)

Intuition

Small learning rate → slow learning
Large learning rate → unstable learning
Optimal learning rate → fast and stable convergence

Effect of Learning Rate

Learning Rate	Behavior
Too Small	Very slow training
Too Large	Overshooting, may not converge
Optimal	Efficient and stable learning
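
To make the three behaviors concrete, here is a small experiment on a toy loss. It assumes L(w) = w², whose gradient is 2w and whose minimum is at w = 0; the loss, the starting point, and the three learning rates are illustrative choices, not values from the text.

```python
def run_gradient_descent(learning_rate, steps=20, start=1.0):
    """Minimize the toy loss L(w) = w**2, whose gradient is 2*w."""
    w = start
    for _ in range(steps):
        gradient = 2 * w
        # New Weight = Old Weight - (Learning Rate x Gradient)
        w = w - learning_rate * gradient
    return w

for lr in (0.01, 0.4, 1.1):
    print(f"learning rate {lr}: w after 20 steps = {run_gradient_descent(lr):.4g}")
```

With 0.01 the weight has only crept part of the way toward 0 after 20 steps, with 0.4 it is essentially at the minimum, and with 1.1 each step overshoots by more than the last, so the weight grows instead of shrinking.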

Why It Matters

Controls speed of training
Affects model accuracy
Prevents divergence

Weight and Bias Updates

How Many Times Are Weight and Bias Updated?

The number of times weights and biases are updated depends on:

✔ Number of epochs

An epoch is one full pass through the training dataset. If you train for 10 epochs, the model sees the data 10 times.

✔ Batch size

Stochastic Gradient Descent (SGD): Update happens after every sample.
Mini-batch Gradient Descent: Update happens after every batch.
Batch Gradient Descent: Update happens once per epoch.

Summary Table

Method	Batch Size	Updates per Epoch
SGD	1	100 (if dataset has 100 samples)
Mini-batch	10	10
Batch Gradient Descent	100	1

Formula

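Total Updates = Epochs × (Total Samples ÷ Batch Size)

If the batch size does not divide the sample count evenly, round the division up, since the final partial batch typically still triggers an update.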

Example

Dataset = 100 samples
Epochs = 5
Batch size = 10

Updates per epoch = 100 ÷ 10 = 10
Total updates = 5 × 10 = 50 updates
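
The same count can be wrapped in a small helper. This sketch assumes that a final partial batch still counts as one update, which is why it rounds up; the function name is illustrative.

```python
import math

def total_updates(n_samples, batch_size, epochs):
    """Number of weight updates, counting a final partial batch as one update."""
    updates_per_epoch = math.ceil(n_samples / batch_size)
    return epochs * updates_per_epoch

print(total_updates(100, 10, 5))       # 50  (the example above)
print(total_updates(100, 1, 1))        # 100 (SGD, one epoch)
print(total_updates(100, 100, 1))      # 1   (batch gradient descent, one epoch)
print(total_updates(10_000, 100, 5))   # 500
```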

Key Insight

More epochs → more learning cycles
Smaller batch size → more frequent updates
Larger batch size → fewer but more stable updates

Gradient Descent

Gradient Descent (Step-by-Step)

Gradient Descent is an optimization algorithm used to minimize loss.

New Weight = Old Weight − (Learning Rate × Gradient)

Step-by-Step Example

Step	Weight	Gradient	Updated Weight
1	10	2	10 - (0.1×2) = 9.8
2	9.8	1.5	9.8 - (0.1×1.5) = 9.65
3	9.65	1	9.65 - (0.1×1) = 9.55

Key Idea

Move in direction of decreasing error
Repeat until minimum is reached

Types of Gradient Descent

Gradient Descent is used to update weights and minimize loss. There are three main types based on how data is processed.

✔ 1. Stochastic Gradient Descent (SGD)

Uses one data point at a time
Updates weights after every sample
Very fast but noisy

Best for: Large datasets and fast learning

✔ 2. Mini-batch Gradient Descent

Uses small batches of data
Updates weights after each batch
Balanced approach

Best for: Most real-world ML systems

✔ 3. Batch Gradient Descent

Uses entire dataset
Updates weights once per epoch
Very stable but slow

Best for: Small datasets

Comparison Table

Type	Data Used	Update Frequency	Speed	Stability
SGD	1 sample	Very frequent	Fast	Low
Mini-batch	Small batch	Moderate	Balanced	Medium
Batch GD	Full dataset	Once per epoch	Slow	High
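
In code, the three variants differ only in the batch size passed to the same training loop. The sketch below fits y = wx + b by gradient descent on MSE. The dataset (100 points on the line y = 2x + 1), the learning rate, and the function name `train` are assumptions made for illustration, and the data is not shuffled so that the run is reproducible.

```python
def train(xs, ys, batch_size, epochs=5, learning_rate=0.1):
    """Fit y = w*x + b with gradient descent on MSE; returns (w, b, n_updates)."""
    w, b, n_updates = 0.0, 0.0, 0
    n = len(xs)
    for _ in range(epochs):
        # Walk through the data in consecutive batches (no shuffling, for simplicity).
        for start in range(0, n, batch_size):
            bx = xs[start:start + batch_size]
            by = ys[start:start + batch_size]
            m = len(bx)
            residuals = [(w * x + b) - y for x, y in zip(bx, by)]
            # Gradients of the batch MSE with respect to w and b.
            grad_w = 2 / m * sum(r * x for r, x in zip(residuals, bx))
            grad_b = 2 / m * sum(residuals)
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
            n_updates += 1
    return w, b, n_updates

# Hypothetical data: 100 points on the line y = 2x + 1, with x in [0, 1).
xs = [i / 100 for i in range(100)]
ys = [2 * x + 1 for x in xs]

for name, batch_size in [("SGD", 1), ("Mini-batch", 10), ("Batch GD", 100)]:
    w, b, n_updates = train(xs, ys, batch_size)
    print(f"{name:10s} batch_size={batch_size:3d} updates={n_updates:3d} w={w:.2f} b={b:.2f}")
```

Over 5 epochs the three settings perform 500, 50, and 5 updates respectively, which is the trade-off in the table: more frequent but noisier updates at one end, fewer but smoother updates at the other.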

Key Insight

Mini-batch Gradient Descent is the most widely used method in practice.

Convergence Table

Gradient Descent Convergence (Multi-Iteration)

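Model: y = wx + b
Training data: x = 1, 2, 3 and y = 2, 4, 6 (the targets implied by the initial loss of 18.67)
Learning Rate = 0.1
Update rule: w ← w − 0.1 × mean((prediction − y) × x), b ← b − 0.1 × mean(prediction − y)

Iteration	w	b	Predictions (x=1,2,3)	Loss (MSE)
0	0.00	0.00	0, 0, 0	18.67
1	0.93	0.40	1.33, 2.27, 3.20	3.76
2	1.35	0.57	1.92, 3.28, 4.63	0.81
3	1.54	0.65	2.19, 3.72, 5.26	0.22
4	1.63	0.67	2.30, 3.92, 5.55	0.10
5	1.67	0.68	2.35, 4.01, 5.68	0.07

Key Insight

Loss decreases with each iteration
w moves toward 2, while b rises at first and then decays slowly toward 0
With enough iterations the model converges to the optimal solution (w = 2, b = 0)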
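
A table like this can be generated with a few lines of code. This is a minimal sketch under two assumptions that the page does not state outright: the targets are y = 2, 4, 6 (the values implied by the initial loss of 18.67), and the gradients are averaged over the samples without the factor of 2, which reproduces the first update to w ≈ 0.93 and b = 0.40.

```python
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]          # assumed targets: y = 2x
learning_rate = 0.1

def mse(w, b):
    """Mean squared error of the line y = w*x + b on the training data."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

w, b = 0.0, 0.0
history = [(w, b, mse(w, b))]
for iteration in range(1, 6):
    residuals = [w * x + b - y for x, y in zip(xs, ys)]
    # Averaged gradients (factor of 2 omitted, see the assumption above).
    grad_w = sum(r * x for r, x in zip(residuals, xs)) / len(xs)
    grad_b = sum(residuals) / len(xs)
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b
    history.append((w, b, mse(w, b)))

# Columns: iteration, w, b, loss.
for i, (w_i, b_i, loss_i) in enumerate(history):
    print(f"{i}\t{w_i:.2f}\t{b_i:.2f}\t{loss_i:.2f}")
```

Increasing the iteration count in the loop shows how long the model actually needs to get close to w = 2 and b = 0.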

📘 The Formula:

new weight = old weight − learning rate × gradient

gradient with respect to weight = ∂L/∂w

gradient with respect to bias = ∂L/∂b

Example:
If you're minimizing a loss function, and the gradient is 0.6, then with a learning rate of 0.01:

Weight update = Weight - (0.01 * 0.6) = Weight - 0.006

Example:

Say you have:

10,000 samples
Batch size = 100
Epochs = 5

Then:

Number of updates = (10,000 / 100) × 5 = 100 × 5 = 500 updates
So the weights and biases are updated 500 times.

Let's go through examples to calculate and understand learning rate in action, especially how it affects weight updates using gradient descent.

🔁 Basic Formula: Gradient Descent Update

For a given weight w, the update rule is:

w_new = w_old − η · ∂L/∂w

Where:

η = learning rate
∂L/∂w = gradient of the loss function w.r.t weight

✅ Example 1: Simple Gradient Descent Update

Given:

Initial weight: w = 0.5
Gradient: ∂L/∂w = 0.8
Learning rate: η = 0.1

Calculation:

w_new = 0.5 − 0.1 × 0.8 = 0.5 − 0.08 = 0.42

So, the new weight is 0.42.

⚠️ Observation:

Too small: slow learning.
Too large: can overshoot or diverge.

✅ Example 3: With Bias Update

Let's include bias b.

Given:

w = 0.6, b = 0.2
Gradients: ∂L/∂w = 0.4, ∂L/∂b = 0.3
Learning rate: η = 0.05

Update:

w_new = 0.6 − 0.05 × 0.4 = 0.6 − 0.02 = 0.58

b_new = 0.2 − 0.05 × 0.3 = 0.2 − 0.015 = 0.185

✅ New values: w = 0.58, b = 0.185
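
The update rule used in Examples 1 and 3 can be sketched as a single helper, applied to the weight and the bias in the same way. The function name `gd_step` is an illustrative choice.

```python
def gd_step(value, gradient, learning_rate):
    """One gradient descent update: new = old - learning_rate * gradient."""
    return value - learning_rate * gradient

# Example 1: w = 0.5, gradient = 0.8, learning rate = 0.1
print(round(gd_step(0.5, 0.8, 0.1), 3))    # 0.42

# Example 3: weight and bias are updated with the same rule.
w_new = gd_step(0.6, 0.4, 0.05)
b_new = gd_step(0.2, 0.3, 0.05)
print(round(w_new, 3), round(b_new, 3))    # 0.58 0.185
```

The results are rounded only for display, because floating-point subtraction can leave tiny trailing digits.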

Adam Optimizer

Adam (Adaptive Moment Estimation) is an advanced optimization algorithm that combines momentum and adaptive learning rates.

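m = first moment (moving average of past gradients, i.e. momentum)
v = second moment (moving average of squared gradients)

Update Equations

m = β₁m + (1−β₁)g
v = β₂v + (1−β₂)g²
m̂ = m / (1−β₁ᵗ),  v̂ = v / (1−β₂ᵗ)   (bias correction at step t)
w = w − η × m̂ / (√v̂ + ε)

Here ε is a small constant that prevents division by zero.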
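
As a rough illustration, here is one way to write a single-parameter Adam update in plain Python. It includes the bias-correction terms and the small stabilizing constant ε of the standard Adam formulation. The default values shown (0.001, 0.9, 0.999, 1e-8) are the commonly cited defaults, while the toy loss L(w) = w² and the learning rate of 0.05 used in the loop are assumptions for the demo.

```python
import math

def adam_step(w, g, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single parameter; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * g          # moving average of gradients
    v = beta2 * v + (1 - beta2) * g * g      # moving average of squared gradients
    m_hat = m / (1 - beta1 ** t)             # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

# Minimize the toy loss L(w) = w**2 (gradient 2*w), starting from w = 1.
w, m, v = 1.0, 0.0, 0.0
for t in range(1, 201):
    w, m, v = adam_step(w, 2 * w, m, v, t, lr=0.05)
print(f"w after 200 steps: {w:.4f}")
```

On the very first step the bias-corrected ratio m̂ / √v̂ equals the sign of the gradient, so the weight moves by almost exactly the learning rate regardless of how large the gradient is. That normalization is what the "adaptive learning rate" refers to.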

Why Adam Works

Uses past gradients → faster learning
Adjusts learning rate → stable updates
Reduces oscillations

Comparison

Optimizer	Speed	Stability
SGD	Fast	Low
Batch GD	Slow	High
Adam	Fast	High

Key Insight

Adam combines momentum + adaptive learning for faster and stable convergence.

Real-Life Examples of Linear Regression

📊 Real-Life Examples of Linear Regression

Linear Regression helps us understand relationships between variables and predict outcomes. Here are practical real-world applications across industries.

🏠

House Price Prediction

Predict property prices based on area, location, and number of rooms.

Input: Area, Location, Rooms
Output: Price
Example: Larger houses → Higher prices

💼

Salary Prediction

Estimate salary based on years of experience.

Input: Experience
Output: Salary
Insight: Salary grows linearly in early career stages

📈

Sales Forecasting

Predict sales based on advertising spend.

Input: Marketing Budget
Output: Sales
Use: Optimize marketing ROI

🏥

Healthcare Predictions

Predict health indicators like blood pressure.

Input: Age, Weight
Output: BP/Cholesterol
Use: Preventive healthcare

🚗

Fuel Efficiency

Estimate mileage based on engine size and weight.

Input: Engine Size, Weight
Output: Mileage
Insight: Bigger engines → Lower efficiency

🎓

Student Performance

Predict exam scores based on study hours.

Input: Study Time
Output: Marks
Example: More study → Better scores

🌡️

Climate Trends

Analyze temperature changes over time.

Input: Year
Output: Temperature
Use: Climate change analysis

Linear Regression Formula:

Y = mX + c

Where:
m = slope (impact of X on Y)
c = intercept (starting value)

Multiple choice questions (MCQs)

✅ 1. What is the purpose of the learning rate in a machine learning model?

A) To determine the number of layers in the model
B) To set the maximum number of epochs
C) To control how much the weights are adjusted during training
D) To control the size of the dataset

Answer: ✅ C
Explanation: The learning rate controls how much we adjust the weights of the model based on the error calculated.

✅ 2. What can happen if the learning rate is too high?

A) The model will train very slowly
B) The model may not learn at all
C) The model may overshoot the optimal point and diverge
D) It will always find the best solution quickly

Answer: ✅ C
Explanation: A high learning rate can cause the model to skip over the best values, leading to poor or unstable learning.

✅ 3. If the learning rate is too low, what is most likely to happen?

A) The model will converge very quickly
B) The model will never converge
C) The model will converge very slowly
D) The model will skip the global minimum

Answer: ✅ C
Explanation: A very small learning rate produces tiny weight updates, so training progresses slowly and may run out of epochs before reaching the best solution.

✅ 4. Which of the following is a typical range for a learning rate?

A) 10–100
B) 0.01–0.1
C) 100–1000
D) -1 to 0

Answer: ✅ B
Explanation: Learning rates are usually small positive values. 0.01–0.1 is a common starting range, although smaller values such as 0.001 are also widely used, depending on the problem and optimizer.

✅ 5. In the weight update formula w = w − η · ∂L/∂w, what does η represent?

A) Bias
B) Epoch
C) Loss
D) Learning Rate

Answer: ✅ D
Explanation: The symbol η (eta) commonly denotes the learning rate in the formula.

✅ 6. Which method adjusts the learning rate during training?

A) Stochastic Gradient Descent
B) Adaptive Learning Rate
C) Batch Normalization
D) Dropout

Answer: ✅ B
Explanation: Adaptive methods like Adam, RMSProp, and Adagrad adjust the learning rate during training.

✅ 7. Which optimizer uses an adaptive learning rate for each parameter?

A) SGD
B) Adam
C) Linear Regression
D) KNN

Answer: ✅ B
Explanation: Adam (Adaptive Moment Estimation) adjusts the learning rate based on estimates of first and second moments of the gradients.

✅ 8. What is a common sign that the learning rate is too high?

A) The training loss decreases steadily
B) The training loss is flat
C) The training loss fluctuates wildly or increases
D) The model performs well on the test set

Answer: ✅ C
Explanation: High learning rates can cause the loss to increase or bounce around, indicating instability.

MCQs on Hyperparameter

1. Which of the following is a hyperparameter in linear regression with gradient descent?

A. Weights
B. Bias
C. Learning rate
D. Features

Answer: C. Learning rate
Explanation:
Hyperparameters are set before training begins and are not learned from the data. The learning rate is a typical hyperparameter in gradient descent. Weights and bias are learned parameters, and features are part of the dataset, not hyperparameters.

2. What is the effect of setting the learning rate too high in gradient descent?

A. The model will converge more slowly
B. The model may overshoot the minimum and not converge
C. The model will perfectly fit the training data
D. The model will stop updating weights

Answer: B. The model may overshoot the minimum and not converge
Explanation:
A high learning rate can cause the updates to "overshoot" the optimal values, potentially causing the loss to increase or fluctuate, preventing convergence.

3. What does it mean if a model's loss decreases very slowly during training?

A. The learning rate may be too high
B. The model has converged
C. The learning rate may be too low
D. The model has overfitted

Answer: C. The learning rate may be too low
Explanation:
A low learning rate results in very small updates to weights, causing the model to take a long time to converge.

4. Which of the following is not typically considered a hyperparameter in linear regression using gradient descent?

A. Batch size
B. Number of features
C. Learning rate
D. Number of epochs

Answer: B. Number of features
Explanation:
The number of features is determined by the dataset, not by the training process. In contrast, batch size, learning rate, and number of epochs are user-specified hyperparameters.

5. What is a good approach to finding the right hyperparameters for your model?

A. Set random values and hope for the best
B. Train the model without tuning
C. Try a range of values and compare performance
D. Use the same values for all models

Answer: C. Try a range of values and compare performance
Explanation:
Hyperparameter tuning involves trying different values (e.g., grid search or random search) and evaluating which set yields the best performance.

6. If your model's loss is fluctuating wildly during training, what could be the issue?

A. Too many training examples
B. Too few epochs
C. Learning rate is too low
D. Learning rate is too high

Answer: D. Learning rate is too high
Explanation:
A high learning rate can make training unstable, causing the loss to fluctuate instead of decreasing smoothly.

7. Which of the following best describes the role of a hyperparameter?

A. A parameter learned from the dataset
B. A setting that controls the training process
C. A component of the cost function
D. A value used to initialize the dataset

Answer: B. A setting that controls the training process
Explanation:
Hyperparameters are user-defined settings like learning rate, batch size, and epochs that influence how the model learns.

8. What is the most likely outcome of training a model for too many epochs?

A. The model becomes more generalized
B. The model underfits the training data
C. The model overfits the training data
D. The model stops learning

Answer: C. The model overfits the training data
Explanation:
Training for too many epochs can cause the model to memorize the training data, reducing its ability to generalize to new data (overfitting).

9. Which of the following best defines "epoch" in the context of model training?

A. A complete forward and backward pass for one training sample
B. A set of hyperparameter tuning trials
C. One complete pass through the entire training dataset
D. The number of layers in the neural network

Answer: C. One complete pass through the entire training dataset
Explanation:
An epoch is defined as one full pass over the entire training dataset during model training.

10. How does batch size affect training in gradient descent?

A. It changes the number of features in the dataset
B. It determines how many weights are updated
C. It controls how many examples are used to calculate the gradient in each step
D. It does not affect the training process

Answer: C. It controls how many examples are used to calculate the gradient in each step
Explanation:
Batch size defines how many training examples are used in one forward/backward pass. Smaller batches update more frequently, while larger ones give a smoother estimate of the gradient.

11. If your model's training loss is decreasing but validation loss is increasing, what's happening?

A. Underfitting
B. Normal convergence
C. Overfitting
D. The learning rate is too low

Answer: C. Overfitting
Explanation:
When the model performs well on training data but poorly on unseen data (validation), it indicates overfitting.

12. What is the trade-off when choosing a very small batch size?

A. Slower convergence and lower generalization
B. Faster convergence but noisier updates
C. Always better performance
D. Less RAM usage but no impact on training

Answer: B. Faster convergence but noisier updates
Explanation:
Smaller batch sizes can update weights more frequently, which speeds up convergence but makes the gradient estimates noisier.

13. What happens when you increase the batch size significantly?

A. You always get better accuracy
B. Training becomes unstable
C. Gradient estimates become more accurate but require more memory
D. Learning rate must be decreased

Answer: C. Gradient estimates become more accurate but require more memory
Explanation:
Larger batches provide a more accurate gradient approximation but need more memory and computation per step.

14. Which combination is most likely to cause overfitting?

A. Low learning rate, few epochs
B. High learning rate, small batch size
C. High learning rate, many epochs
D. Low learning rate, many epochs

Answer: D. Low learning rate, many epochs
Explanation:
A low learning rate with many epochs allows the model to train slowly and potentially memorize the training data, leading to overfitting.

15. What does tuning the learning rate primarily help with?

A. Reducing model size
B. Improving validation data quality
C. Speeding up or stabilizing convergence during training
D. Increasing the number of training examples

Answer: C. Speeding up or stabilizing convergence during training
Explanation:
A properly tuned learning rate helps the model converge efficiently to a minimum of the loss function.

16. Suppose you are using a learning rate of 0.01. After 100 epochs, the loss is still high. What should you try first?

A. Increase the learning rate
B. Decrease the learning rate
C. Increase the number of features
D. Decrease the batch size

Answer: A. Increase the learning rate
Explanation:
A high loss after many epochs with a small learning rate suggests slow learning. Increasing the learning rate can help speed up convergence.

17. You are training a linear regression model on 10,000 samples using mini-batch gradient descent with a batch size of 100. How many batches will be there in one epoch?

A. 10
B. 100
C. 1,000
D. 10,000

Answer: B. 100
Explanation:

Number of batches per epoch = 10,000 / 100 = 100

18. If the initial weight is 0 and learning rate is 0.1, and gradient = -4, what will be the updated weight after one step of gradient descent?

A. 0.4
B. -0.4
C. 0.1
D. -0.1

Answer: A. 0.4
Explanation:

w_new = w_old − η · gradient = 0 − (0.1 × (−4)) = 0 + 0.4 = 0.4

19. What would be the total number of weight updates in 5 epochs, using a dataset with 1,000 samples and a batch size of 50?

A. 50
B. 20
C. 5,000
D. 100

Answer: D. 100
Explanation:

Batches per epoch = 1,000 / 50 = 20 ⇒ Total updates in 5 epochs = 20 × 5 = 100

20. Which of the following is true about the effect of increasing the number of epochs?

A. It always reduces the training loss
B. It always improves validation performance
C. It may lead to overfitting
D. It resets the weights each time

Answer: C. It may lead to overfitting
Explanation:
Too many epochs can cause the model to memorize training data, decreasing generalization.

21. You're training with a batch size of 200, dataset size of 10,000, and 10 epochs. How many total gradient steps will occur?

A. 50
B. 100
C. 500
D. 1,000

Answer: C. 500
Explanation:

Steps per epoch = 10,000 / 200 = 50 ⇒ 50 × 10 = 500 steps

22. During training, your loss decreases for several epochs but then plateaus. What's a common solution?

A. Reduce training data
B. Increase learning rate slightly
C. Decrease learning rate slightly
D. Add more features

Answer: C. Decrease learning rate slightly
Explanation:
If the loss has plateaued near a minimum, reducing the learning rate lets the model take smaller steps and settle closer to that minimum instead of bouncing around it.

23. What happens if the learning rate is set to zero?

A. The model will train very slowly
B. The model will converge faster
C. The weights won't update at all
D. The model will overfit

Answer: C. The weights won't update at all
Explanation:
With a learning rate of 0, gradient descent updates are zero, so weights stay constant.

24. If a model is training on 200,000 samples and the batch size is 2,000, how many batches per epoch will there be?

A. 100
B. 1,000
C. 200
D. 2

Answer: A. 100
Explanation:

Batches per epoch = 200,000 / 2,000 = 100

25. If the learning rate is too small (e.g., 0.00001), what might you observe in the training curve?

A. Sharp drops in loss
B. Sudden divergence in loss
C. Flat or very slow decline in loss
D. Overfitting after first epoch

Answer: C. Flat or very slow decline in loss
Explanation:
Tiny learning rates cause small parameter updates, resulting in slow learning and little progress per epoch.

26. You reduce the batch size from 256 to 32. Which of the following is the most likely outcome?

A. Fewer weight updates per epoch
B. Slower convergence due to fewer updates
C. More frequent updates with higher variance
D. More memory usage per step

Answer: C. More frequent updates with higher variance
Explanation:
Smaller batch sizes lead to more frequent gradient updates per epoch, but each update is noisier due to less representative data.

27. A dataset has 60,000 examples. If you use mini-batch gradient descent with batch size = 600 and train for 10 epochs, how many total updates will occur?

A. 600
B. 10
C. 100
D. 1,000

Answer: D. 1,000
Explanation:

Batches per epoch = 60,000 / 600 = 100 ⇒ Total updates in 10 epochs = 100 × 10 = 1,000

28. What would happen if the learning rate is set too high (e.g., 10.0)?

A. Model will converge faster
B. Model may skip over the minimum and diverge
C. Model will underfit the data
D. Gradient updates will be smaller

Answer: B. Model may skip over the minimum and diverge
Explanation:
An excessively high learning rate causes unstable updates, often skipping past the minimum or even increasing the loss.

29. Which of the following is the best strategy to choose hyperparameters like learning rate and batch size?

A. Use trial and error without evaluation
B. Fix them for all datasets
C. Tune them using validation performance
D. Choose values from previous models

Answer: C. Tune them using validation performance
Explanation:
Hyperparameters should be optimized by evaluating performance on a separate validation set.

30. A model with a learning rate of 0.5 diverges during training. What would be a safer value to try next?

A. 1.0
B. 0.05
C. 0.9
D. 2.0

Answer: B. 0.05
Explanation:
If 0.5 causes divergence, try a smaller value like 0.05 to stabilize updates and promote convergence.

31. Which hyperparameter controls how many times the model sees the entire dataset?

A. Learning rate
B. Batch size
C. Epochs
D. Loss function

Answer: C. Epochs
Explanation:
The number of epochs defines how many complete passes over the training dataset the model will make.

32. You are training a model with a dataset of 50,000 examples using stochastic gradient descent (batch size = 1). How many updates per epoch will occur?

A. 1
B. 50
C. 500
D. 50,000

Answer: D. 50,000
Explanation:
Stochastic gradient descent updates weights after each example. So, with 50,000 samples, it performs 50,000 updates per epoch.

33. You try a batch size of 10 and observe high variance in loss between steps. What is a likely solution?

A. Increase learning rate
B. Increase batch size
C. Reduce number of epochs
D. Add more layers

Answer: B. Increase batch size
Explanation:
A small batch size gives noisy gradients. A larger batch size helps reduce this variance and stabilize training.

34. For a fixed number of training examples, increasing the number of epochs without early stopping will likely cause:

A. Underfitting
B. Data leakage
C. Overfitting
D. More training data to be required

Answer: C. Overfitting
Explanation:
Too many epochs may lead the model to memorize the training data, reducing its generalization ability.

35. Which of the following is not typically tuned as a hyperparameter in gradient descent-based training?

A. Number of epochs
B. Activation function
C. Batch size
D. Learning rate

Answer: B. Activation function
Explanation:
Activation function is part of model architecture design, not typically treated as a tunable training hyperparameter in simple linear regression.

36. If your model's training loss is decreasing but validation loss starts increasing after a few epochs, what should you do?

A. Increase the number of epochs
B. Increase the learning rate
C. Use early stopping or reduce epochs
D. Reduce the batch size

Answer: C. Use early stopping or reduce epochs
Explanation:
This pattern indicates overfitting. Early stopping prevents the model from continuing to learn noise from training data.

37. Suppose the initial weight is 2.0, the learning rate is 0.1, and the computed gradient is 6. What will be the weight after one update?

A. 1.4
B. 2.6
C. 1.0
D. 0.6

Answer: A. 1.4
Explanation:

w_new = w_old − η · gradient = 2.0 − 0.1 × 6 = 2.0 − 0.6 = 1.4

38. Which of the following is a common consequence of using a very large batch size (e.g., 10,000)?

A. More noise in gradient estimates
B. Faster updates per epoch
C. Smoother convergence but potentially poorer generalization
D. Higher chance of overfitting due to noise

Answer: C. Smoother convergence but potentially poorer generalization
Explanation:
Large batches reduce gradient noise, making convergence smooth, but can cause the model to converge to sharp minima with poor generalization.

39. You increase your batch size from 64 to 512. What immediate effect does this have on each epoch?

A. More weight updates per epoch
B. Fewer weight updates per epoch
C. More memory efficiency
D. Reduced model complexity

Answer: B. Fewer weight updates per epoch
Explanation:
Larger batches mean fewer batches (updates) per epoch, since the number of examples is fixed.

40. If your dataset contains 12,000 samples and you choose a batch size of 300 with 8 epochs, how many total weight updates will occur?

A. 32
B. 320
C. 80
D. 96

Answer: B. 320
Explanation:

Batches per epoch = 12,000 / 300 = 40 ⇒ Total updates = 40 × 8 = 320
