Building Decision Trees

17/05/2026

Understanding Decision Trees

Decision trees are intuitive models used to support decisions, classify data, or predict outcomes by following a series of simple rules. Each internal node represents a question, each branch represents a possible answer, and each leaf node represents a final decision or prediction. Because the logic is visual and easy to follow, decision trees are widely used in business, data science, and operations to explain complex choices in a transparent way.

They can handle both numerical and categorical data, work well as a first modeling approach, and help teams align on how decisions are actually made. Whether you are mapping customer journeys, evaluating risks, or building machine learning models, a clear decision tree can turn scattered information into a structured, repeatable process.

To build a decision tree, you start with a main question or goal, then repeatedly split your data or options based on the most informative criteria. At each step, you choose the question that best separates different outcomes, gradually forming a branching structure. This makes it easy to trace why a particular decision was reached, which is especially valuable in regulated or high‑stakes environments.

However, decision trees can become overly complex if not pruned or simplified, leading to overfitting in predictive models. Good practice includes limiting depth, combining similar branches, and regularly reviewing the tree with stakeholders. When designed thoughtfully, decision trees provide a powerful balance of clarity, flexibility, and analytical strength.

🌳 Decision Tree in Machine Learning

A Decision Tree is a supervised learning algorithm used for classification and regression.

Internal Node → Decision based on feature
Branch → Outcome of decision
Leaf Node → Final prediction

📊 Example Dataset

Age	Income	Student	Credit Rating	Buys Product
<30	High	No	Fair	No
<30	High	No	Excellent	No
31–40	High	No	Fair	Yes
>40	Medium	No	Fair	Yes
>40	Low	Yes	Fair	Yes
>40	Low	Yes	Excellent	No
31–40	Low	Yes	Excellent	Yes
<30	Medium	No	Fair	No
<30	Low	Yes	Fair	Yes
>40	Medium	Yes	Fair	Yes
<30	Medium	Yes	Excellent	Yes
31–40	Medium	No	Excellent	Yes
31–40	High	Yes	Fair	Yes
>40	Medium	No	Excellent	No

🧠 Step-by-Step Working

1. Root Node Selection

The root is the feature that yields the best split according to one of these criteria:

Entropy & Information Gain
Gini Index

Assumption: Age is selected as the root node.

2. Splitting

Age < 30
Age 31–40
Age > 40

3. Further Decisions

Case 1: Age < 30

Student = Yes → Yes
Student = No → No

Case 2: Age 31–40

All outcomes → Yes

Case 3: Age > 40

Credit Rating = Fair → Yes
Credit Rating = Excellent → No

🌿 Final Decision Tree

Age
├── <30 → Student
│   ├── Yes → Yes
│   └── No → No
├── 31–40 → Yes
└── >40 → Credit Rating
    ├── Fair → Yes
    └── Excellent → No

🔍 Example Prediction

Input: Age = <30, Student = Yes

Prediction: Yes
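
The prediction above can be reproduced with a small lookup. This is a minimal sketch, assuming the final tree is stored as nested Python dictionaries; the names `TREE` and `predict` are illustrative, and the 31–40 age band is written with an ASCII hyphen for convenience.

```python
# Nested-dict encoding of the final tree (an assumed representation).
# Internal nodes name a feature; leaves are the "Buys Product" label.
TREE = {
    "feature": "Age",
    "branches": {
        "<30": {
            "feature": "Student",
            "branches": {"Yes": "Yes", "No": "No"},
        },
        "31-40": "Yes",
        ">40": {
            "feature": "Credit Rating",
            "branches": {"Fair": "Yes", "Excellent": "No"},
        },
    },
}


def predict(tree, record):
    """Follow the matching branch at each node until a leaf (a string) is reached."""
    node = tree
    while isinstance(node, dict):
        value = record[node["feature"]]
        node = node["branches"][value]
    return node


print(predict(TREE, {"Age": "<30", "Student": "Yes"}))  # Yes
```

Only the features on the path are consulted, so a record for the 31-40 band needs no other fields.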

📌 Key Concepts

Entropy = - Σ (p log₂ p)
Information Gain = Entropy(parent) - Weighted Entropy(children)
Gini = 1 - Σ (p²)

⚡ Advantages

Easy to understand
Handles categorical & numerical data
No feature scaling needed

⚠️ Disadvantages

Overfitting risk
Sensitive to data changes

🚀 Applications

Credit risk analysis
Medical diagnosis
Fraud detection
Customer segmentation

🌳 Root Node Selection using Gini Impurity

📌 Step 1: Gini Formula

Gini = 1 - Σ (p²)

Where p is the proportion of samples that belong to each class.

---

📊 Step 2: Gini of Entire Dataset

Total Records = 14
Yes = 9
No = 5

p(Yes) = 9/14
p(No) = 5/14

Gini = 1 - (9/14)² - (5/14)²
     = 1 - (0.41 + 0.13)
     = 0.46

---

🔍 Step 3: Split by Features

✅ Feature 1: Age

Age < 30 → Yes=2, No=3

Gini = 0.48

Age 31–40 → Yes=4, No=0

Gini = 0

Age > 40 → Yes=3, No=2

Gini = 0.48

🎯 Weighted Gini (Age)

= (5/14)*0.48 + (4/14)*0 + (5/14)*0.48
≈ 0.343

---

✅ Feature 2: Student

Student = Yes → Yes=6, No=1

Gini ≈ 0.245

Student = No → Yes=3, No=4

Gini ≈ 0.49

🎯 Weighted Gini (Student)

= (7/14)*0.245 + (7/14)*0.49
≈ 0.367

---

✅ Feature 3: Credit Rating

Fair → Yes=6, No=2

Gini = 0.375

Excellent → Yes=3, No=3

Gini = 0.5

🎯 Weighted Gini (Credit Rating)

= (8/14)*0.375 + (6/14)*0.5
≈ 0.429

---

🏆 Step 4: Comparison

Feature	Weighted Gini
Age	0.343 ✅
Student	0.367
Credit Rating	0.429

---

🎯 Final Decision

Root Node = Age (Lowest Gini Impurity)

---

🧠 Key Insight

Lower Gini = Better split
Gini = 0 → Pure node
Decision Trees minimize impurity at each step

The algorithm selects the feature that makes the data most pure after splitting.
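
The hand calculation can be cross-checked in a few lines. The sketch below assumes the 14-row example dataset is typed in as tuples; `ROWS`, `FEATURES`, `gini` and `weighted_gini` are illustrative names. Income, which the walkthrough does not evaluate, is included for completeness.

```python
from collections import Counter, defaultdict

# (Age, Income, Student, Credit Rating, Buys Product) rows of the example dataset.
ROWS = [
    ("<30", "High", "No", "Fair", "No"),
    ("<30", "High", "No", "Excellent", "No"),
    ("31-40", "High", "No", "Fair", "Yes"),
    (">40", "Medium", "No", "Fair", "Yes"),
    (">40", "Low", "Yes", "Fair", "Yes"),
    (">40", "Low", "Yes", "Excellent", "No"),
    ("31-40", "Low", "Yes", "Excellent", "Yes"),
    ("<30", "Medium", "No", "Fair", "No"),
    ("<30", "Low", "Yes", "Fair", "Yes"),
    (">40", "Medium", "Yes", "Fair", "Yes"),
    ("<30", "Medium", "Yes", "Excellent", "Yes"),
    ("31-40", "Medium", "No", "Excellent", "Yes"),
    ("31-40", "High", "Yes", "Fair", "Yes"),
    (">40", "Medium", "No", "Excellent", "No"),
]
FEATURES = {"Age": 0, "Income": 1, "Student": 2, "Credit Rating": 3}


def gini(labels):
    """Gini = 1 - sum(p^2) over the class proportions p."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def weighted_gini(rows, col):
    """Size-weighted average of the Gini impurity of each branch."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[col]].append(row[-1])
    n = len(rows)
    return sum(len(g) / n * gini(g) for g in groups.values())


scores = {name: weighted_gini(ROWS, col) for name, col in FEATURES.items()}
for name, score in sorted(scores.items(), key=lambda kv: kv[1]):
    print(f"{name}: {score:.3f}")

best = min(scores, key=scores.get)
print("Root node:", best)
```

Values are computed at full precision, so the third decimal can differ slightly from hand-rounded figures. Age still comes out with the lowest weighted Gini.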

🌳 Root Node Selection using Entropy & Information Gain

📌 Step 1: Entropy Formula

Entropy = - Σ (p log₂ p)

Where p is the proportion of samples that belong to each class.

📊 Step 2: Entropy of Entire Dataset

Total Records = 14
Yes = 9
No = 5

p(Yes) = 9/14
p(No) = 5/14

Entropy(S) = -[(9/14) log₂ (9/14) + (5/14) log₂ (5/14)]
           ≈ 0.94

🔍 Step 3: Information Gain Calculation

✅ Feature 1: Age

Age Group	Yes	No	Entropy
<30	2	3	0.97
31–40	4	0	0
>40	3	2	0.97

🎯 Weighted Entropy (Age)

= (5/14)*0.97 + (4/14)*0 + (5/14)*0.97
= 0.693

📈 Information Gain (Age)

IG(Age) = 0.94 - 0.693 = 0.247

✅ Feature 2: Student

Student	Yes	No	Entropy
Yes	6	1	0.592
No	3	4	0.985

🎯 Weighted Entropy (Student)

= (7/14)*0.592 + (7/14)*0.985
≈ 0.788

📈 Information Gain (Student)

IG(Student) = 0.94 - 0.788 = 0.152

✅ Feature 3: Credit Rating

Credit Rating	Yes	No	Entropy
Fair	6	2	0.81
Excellent	3	3	1.00

🎯 Weighted Entropy (Credit Rating)

= (8/14)*0.81 + (6/14)*1.00
= 0.892

📈 Information Gain (Credit Rating)

IG(Credit) = 0.94 - 0.892 = 0.048

🏆 Step 4: Comparison

Feature	Information Gain
Age	0.247 ✅
Student	0.152
Credit Rating	0.048

🎯 Final Decision

Root Node = Age (Highest Information Gain)

🧠 Key Insights

Higher Information Gain = Better split
Entropy measures uncertainty
Information Gain measures reduction in uncertainty

The algorithm selects the feature that reduces uncertainty the most after splitting.
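
The same comparison can be sketched from the class counts alone. This is an illustrative check, not a full implementation: it assumes each branch is given as a (Yes, No) count pair taken from the tables above, and `entropy`, `information_gain`, `PARENT` and `SPLITS` are names chosen for this example.

```python
from math import log2


def entropy(counts):
    """Entropy = -sum(p * log2(p)); empty classes contribute nothing."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)


def information_gain(parent, children):
    """Entropy(parent) minus the size-weighted entropy of the children."""
    total = sum(parent)
    weighted = sum(sum(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted


PARENT = (9, 5)  # (Yes, No) counts for the whole dataset
SPLITS = {
    "Age": [(2, 3), (4, 0), (3, 2)],
    "Student": [(6, 1), (3, 4)],
    "Credit Rating": [(6, 2), (3, 3)],
}

gains = {feature: information_gain(PARENT, kids) for feature, kids in SPLITS.items()}
for feature, gain in sorted(gains.items(), key=lambda kv: -kv[1]):
    print(f"IG({feature}) = {gain:.3f}")

root = max(gains, key=gains.get)
print("Root node:", root)
```

Because nothing is rounded along the way, the printed gains may differ in the last digit from values worked out by hand with two-decimal entropies. The ranking, with Age first, is unaffected.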

⚡ Final Conclusion

Method	Selected Root
Gini Index	Age ✅
Information Gain	Age ✅

🔐 Cybersecurity Use Case: Intrusion Detection using Decision Trees

In cybersecurity, decision trees are widely used for threat detection, fraud analysis, and intrusion detection systems (IDS).

📊 Example Dataset (Network Activity)

Login Attempts	IP Reputation	Time of Access	Data Transfer	Threat
High	Unknown	Night	High	Yes
Low	Trusted	Day	Low	No
Medium	Unknown	Night	Medium	Yes
High	Blacklisted	Night	High	Yes
Low	Trusted	Day	Low	No
Medium	Trusted	Evening	Medium	No
High	Unknown	Night	High	Yes
Low	Unknown	Day	Low	No

🧠 Objective

Predict whether a network activity is a Threat (Yes/No).

---

🔍 Example Decision Logic

If IP Reputation = Blacklisted → Threat = Yes
If Login Attempts = High AND Time = Night → Threat = Yes
If IP Reputation = Trusted → Threat = No

---

🌿 Sample Decision Tree

IP Reputation
├── Blacklisted → Threat = Yes
├── Trusted → Threat = No
└── Unknown
    ├── Login Attempts = High/Medium → Yes
    └── Login Attempts = Low → No

---

📈 Why This Works in Cybersecurity

Identifies suspicious patterns quickly
Interpretable rules (important for audits & compliance)
Works well with categorical + behavioral data
Can be integrated into SIEM systems

---

⚠️ Real-World Considerations

Attackers may mimic normal behavior (evasion)
Requires continuous retraining with new threat data
Often combined with ensemble methods (Random Forest, XGBoost)

---

🚀 Advanced Insight (Expert Level)

In real-world cybersecurity systems, decision trees are rarely used alone. They typically appear as components of ensemble models or inside AI-driven SOC pipelines, where small trees can also help explain decisions made by more complex models.

🔐 Cybersecurity Threat Detection Simulator

The simulator takes the selected network parameters and applies the rules below, in order, to flag potential threats.

Decision Logic:

If IP Reputation = Blacklisted → Threat
If Login Attempts = High AND Time = Night → Threat
If IP Reputation = Trusted → Safe
Otherwise → Safe
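
The decision logic above maps directly onto a chain of conditionals. This is a minimal sketch, assuming the rules are evaluated top to bottom and the first match wins; the function name `detect_threat` and its string arguments are illustrative.

```python
def detect_threat(ip_reputation, login_attempts, time_of_access):
    """Apply the listed rules in order; the first matching rule decides."""
    if ip_reputation == "Blacklisted":
        return "Threat"
    if login_attempts == "High" and time_of_access == "Night":
        return "Threat"
    if ip_reputation == "Trusted":
        return "Safe"
    return "Safe"  # the "Otherwise" rule


print(detect_threat("Blacklisted", "Low", "Day"))    # Threat
print(detect_threat("Unknown", "High", "Night"))     # Threat
print(detect_threat("Unknown", "Medium", "Night"))   # Safe
```

Two consequences of the rule order are worth noting. A Trusted IP with High login attempts at Night is still flagged, because the second rule fires first. The Trusted rule is kept for readability, but it returns the same result as the fallback.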

🌳 Regression Tree in Machine Learning

A Regression Tree is used when the output variable is continuous (numerical) instead of categorical.

---

🎯 Problem: Predict House Price

📊 Dataset

Size (sq ft)	Bedrooms	Age (years)	Price (₹ Lakhs)
800	2	10	40
900	2	8	45
1000	3	6	50
1200	3	5	60
1500	4	4	75
1700	4	3	85
2000	5	2	100

---

🧠 How Regression Tree Works

Splits data based on feature values
Predicts mean value at leaf nodes
Minimizes variance / Mean Squared Error (MSE)

---

📌 Step 1: First Split

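Condition: Size < 1200 (chosen for illustration; an exhaustive MSE search on this dataset would favor a threshold between 1200 and 1500)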

Left Node (Size < 1200)

Prices: 40, 45, 50

Mean = 45

Right Node (Size ≥ 1200)

Prices: 60, 75, 85, 100

Mean = 80

---

📌 Step 2: Further Split

Condition (applied to the right node, Size ≥ 1200): Bedrooms < 4

Bedrooms < 4

Price: 60

Mean = 60

Bedrooms ≥ 4

Prices: 75, 85, 100

Mean = 86.7

---

🌿 Final Regression Tree

Size < 1200?
├── Yes → Predict Price = 45
└── No
    ├── Bedrooms < 4 → Predict = 60
    └── Bedrooms ≥ 4 → Predict = 86.7

---

🔍 Example Prediction

Input:

Size = 1600
Bedrooms = 4

Prediction Path:

Size ≥ 1200 → Right
Bedrooms ≥ 4 → Right

Predicted Price = ₹86.7 Lakhs
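
The prediction path can be traced in code. The sketch below assumes the two splits from the walkthrough (Size < 1200, then Bedrooms < 4) and recomputes each leaf mean from the dataset rather than hard-coding it; `HOUSES`, `leaf_mean` and `predict_price` are illustrative names.

```python
from statistics import mean

# (size_sqft, bedrooms, age_years, price_lakhs) rows from the dataset above.
HOUSES = [
    (800, 2, 10, 40),
    (900, 2, 8, 45),
    (1000, 3, 6, 50),
    (1200, 3, 5, 60),
    (1500, 4, 4, 75),
    (1700, 4, 3, 85),
    (2000, 5, 2, 100),
]


def leaf_mean(rows):
    """A regression leaf predicts the mean target value of its rows."""
    return mean(price for *_, price in rows)


# Partition the training rows exactly as the tree does.
small = [r for r in HOUSES if r[0] < 1200]
large_few_beds = [r for r in HOUSES if r[0] >= 1200 and r[1] < 4]
large_many_beds = [r for r in HOUSES if r[0] >= 1200 and r[1] >= 4]


def predict_price(size, bedrooms):
    """Route a house down the two-level tree and return its leaf mean."""
    if size < 1200:
        return leaf_mean(small)
    if bedrooms < 4:
        return leaf_mean(large_few_beds)
    return leaf_mean(large_many_beds)


print(round(predict_price(1600, 4), 1))  # 86.7
```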

---

📌 Key Formula

MSE = (1/n) Σ (yi - ȳ)²

Regression trees select splits that minimize prediction error.
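
To illustrate how a split is selected, the sketch below scans every midpoint between consecutive Size values and keeps the one with the lowest size-weighted MSE. It assumes a single numeric feature and the seven-row house dataset; `mse` and `best_threshold` are illustrative names, not a library API.

```python
def mse(values):
    """Mean squared deviation of the values from their own mean."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)


def best_threshold(xs, ys):
    """Return (threshold, weighted_mse) for the best binary split x < threshold."""
    pairs = sorted(zip(xs, ys))
    n = len(pairs)
    best = None
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between equal feature values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        score = (len(left) * mse(left) + len(right) * mse(right)) / n
        if best is None or score < best[1]:
            best = (threshold, score)
    return best


sizes = [800, 900, 1000, 1200, 1500, 1700, 2000]
prices = [40, 45, 50, 60, 75, 85, 100]

threshold, score = best_threshold(sizes, prices)
print(threshold, round(score, 2))
```

On this data the search settles on a threshold of 1350 (weighted MSE of about 76.5), compared with about 128.6 for Size < 1200. That suggests the Size < 1200 split in the walkthrough is illustrative rather than MSE-optimal.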

---

⚡ Classification vs Regression Tree

Aspect	Classification Tree	Regression Tree
Output	Class (Yes/No)	Continuous value
Metric	Gini / Entropy	MSE / Variance
Leaf Node	Majority class	Mean value

---

🚀 Real-World Use Cases

House price prediction
Sales forecasting
Stock price estimation
Risk scoring in finance
Cyber risk severity prediction

---

🔐 Cybersecurity Example

Failed Logins	Data Transfer	Risk Score
5	Low	20
20	Medium	60
50	High	90

Regression trees can predict continuous risk scores instead of just Yes/No threats.

Used in cybersecurity to estimate:
Risk Score (0–100)
Expected Loss
Attack Severity

🌳 ID3 Algorithm (Machine Learning)

The ID3 (Iterative Dichotomiser 3) algorithm is a decision tree algorithm used for classification problems. It selects the feature with the highest Information Gain.

---

🧠 Core Idea

ID3 selects the attribute that reduces uncertainty the most.

Entropy → Measures impurity
Information Gain → Reduction in entropy

---

📌 Key Formulas

Entropy

Entropy = - Σ (p log₂ p)

Information Gain

IG = Entropy(parent) - Weighted Entropy(children)

---

⚙️ Step-by-Step Working

1. Calculate Entropy

Measure impurity of dataset.

2. Compute Information Gain

Evaluate each feature and calculate entropy after split.

3. Select Best Feature

Feature with highest Information Gain becomes root node.

4. Repeat Recursively

Continue splitting subsets
Stop when a subset is pure or no features remain

---

📊 Example: Play Tennis

Outlook	Temperature	Humidity	Wind	Play
Sunny	Hot	High	Weak	No
Sunny	Hot	High	Strong	No
Overcast	Hot	High	Weak	Yes
Rain	Mild	High	Weak	Yes

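These are the first four rows of the classic 14-row Play Tennis dataset. ID3 computes the Information Gain of every feature and selects the one with the highest value as the root (here, Outlook).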

---

🌿 Resulting Tree (Concept)

Outlook
├── Sunny → further split
├── Overcast → Yes
└── Rain → further split
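
The recursive procedure can be sketched compactly. The code below is a simplified illustration of ID3 for categorical features only, run on just the four rows shown in the table; `entropy`, `information_gain`, `id3`, `COLUMNS` and `DATA` are names chosen for this example, and the nested-dict tree format is an assumption.

```python
from collections import Counter
from math import log2


def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(rows, feature, target):
    """Entropy of the parent minus the weighted entropy after splitting on feature."""
    n = len(rows)
    remainder = 0.0
    for value in {r[feature] for r in rows}:
        subset = [r[target] for r in rows if r[feature] == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy([r[target] for r in rows]) - remainder


def id3(rows, features, target):
    labels = [r[target] for r in rows]
    if len(set(labels)) == 1:   # pure subset: return a leaf
        return labels[0]
    if not features:            # no features remain: majority vote
        return Counter(labels).most_common(1)[0][0]
    best = max(features, key=lambda f: information_gain(rows, f, target))
    remaining = [f for f in features if f != best]
    return {
        best: {
            value: id3([r for r in rows if r[best] == value], remaining, target)
            for value in sorted({r[best] for r in rows})
        }
    }


COLUMNS = ["Outlook", "Temperature", "Humidity", "Wind", "Play"]
DATA = [dict(zip(COLUMNS, row)) for row in [
    ("Sunny", "Hot", "High", "Weak", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Weak", "Yes"),
]]

tree = id3(DATA, COLUMNS[:-1], "Play")
print(tree)
```

With only these four rows, every Outlook branch is already pure. The sketch therefore returns leaves directly, where a tree grown on a larger dataset may need further splits under Sunny and Rain.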

---

🔍 Characteristics of ID3

✅ Advantages

Simple and easy to understand
Fast computation
Works well on small datasets

⚠️ Limitations

Handles only categorical data
Prone to overfitting
No pruning mechanism
Biased toward features with many values

---

🔐 Cybersecurity Use Case

Suspicious Login Detection

IP Reputation	Login Attempts	Time	Threat
Trusted	Low	Day	No
Unknown	High	Night	Yes
Blacklisted	Medium	Night	Yes

ID3 helps identify patterns and build rule-based intrusion detection systems.

---

🧠 Key Insight

ID3 builds decision trees by selecting features that maximize reduction in uncertainty using Information Gain.

🌳 CART Algorithm (Machine Learning)

The CART (Classification and Regression Trees) algorithm is a decision tree algorithm used for both classification and regression problems.

CART always creates a binary tree (two branches at each split).

---

🧠 Core Idea

CART selects splits that minimize impurity (classification) or error (regression).

---

📌 Key Concepts

Classification → Gini Index

Gini = 1 - Σ (p²)

Lower Gini → Better split

Regression → Mean Squared Error (MSE)

MSE = (1/n) Σ (yi - ȳ)²

Lower MSE → Better split

---

⚙️ Step-by-Step Working

1. Select Best Feature

Evaluate all features using Gini or MSE.

2. Find Best Split Point

Example: Size < 1200

3. Create Binary Split

Left → Condition true
Right → Condition false

4. Repeat Recursively

Continue splitting subsets
Stop when node becomes pure or minimum samples reached

5. Apply Pruning

Remove unnecessary branches to reduce overfitting.

---

📊 Example (Classification)

Loan Approval

Income	Credit Score	Approved
High	Good	Yes
Low	Poor	No
Medium	Good	Yes
Low	Good	No

🌿 CART Tree

Income = Low?
├── Yes → Approved = No
└── No → Approved = Yes
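
Steps 1 to 3 can be sketched for categorical features as a search over binary tests of the form "feature = value?". This is a minimal illustration on the four-row loan table, assuming one-vs-rest tests; `LOANS`, `gini` and `binary_split_score` are illustrative names.

```python
from collections import Counter

LOANS = [
    {"Income": "High", "Credit Score": "Good", "Approved": "Yes"},
    {"Income": "Low", "Credit Score": "Poor", "Approved": "No"},
    {"Income": "Medium", "Credit Score": "Good", "Approved": "Yes"},
    {"Income": "Low", "Credit Score": "Good", "Approved": "No"},
]


def gini(labels):
    """Gini impurity of a list of class labels (0.0 for an empty list)."""
    if not labels:
        return 0.0
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def binary_split_score(rows, feature, value, target="Approved"):
    """Weighted Gini of the two children of the test `feature == value`."""
    left = [r[target] for r in rows if r[feature] == value]
    right = [r[target] for r in rows if r[feature] != value]
    n = len(rows)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


candidates = [
    (feature, value)
    for feature in ("Income", "Credit Score")
    for value in sorted({r[feature] for r in LOANS})
]
scores = {c: binary_split_score(LOANS, *c) for c in candidates}
for (feature, value), score in sorted(scores.items(), key=lambda kv: kv[1]):
    print(f"{feature} = {value}? -> weighted Gini {score:.3f}")

best_test = min(scores, key=scores.get)
```

On this table, the test Income = Low produces two pure children (weighted Gini 0). The test Credit Score = Good leaves one mixed child, because the Low-income applicant with a Good score was not approved.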

---

📈 Example (Regression)

House Price Prediction

Size	Price
800	40
1200	60
1500	75

🌿 CART Regression Tree

Size < 1200?
├── Yes → Mean = 40
└── No → Mean = 67.5

---

🔍 Characteristics of CART

✅ Advantages

Works for both classification and regression
Handles numerical and categorical data
Supports pruning
More practical than ID3

⚠️ Limitations

Can overfit if not pruned
Sensitive to small data changes
Binary splits may increase tree depth

---

🔐 Cybersecurity Use Case

Intrusion Detection

Feature	Example
Login Attempts	High
IP Reputation	Blacklisted
Time	Night

A CART classifier can flag suspicious activity, and the class proportions in its leaves can serve as a rough estimate of attack probability.

---

🧠 ID3 vs CART

Feature	ID3	CART
Split Type	Multi-way	Binary
Metric	Entropy	Gini / MSE
Pruning	No	Yes
Data Type	Categorical only	Both

---

💡 Key Insight

CART builds binary decision trees using Gini Index (classification) and MSE (regression), with pruning to improve generalization.

🌳 Pruning in Decision Trees

Pruning is the process of removing unnecessary branches from a decision tree to improve its performance on unseen data.

---

🧠 Why Pruning is Needed

Decision trees tend to overfit training data
They may learn noise and become too complex

Pruning improves generalization and reduces model complexity.

---

📌 Types of Pruning

1️⃣ Pre-Pruning (Early Stopping)

Stops tree growth early based on conditions:

Maximum depth reached
Minimum samples per node
Minimum information gain threshold

✅ Advantage

Faster training
Avoids overly complex trees

⚠️ Limitation

May underfit if stopped too early
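
The three stopping conditions can be combined into a single check that a tree builder calls before splitting a node. This is a sketch only: the function name `should_stop` and the default thresholds are hypothetical values for illustration, not recommendations.

```python
def should_stop(depth, n_samples, best_gain,
                max_depth=3, min_samples_split=4, min_gain=0.01):
    """Return True when any pre-pruning condition says to stop splitting.

    The default thresholds are hypothetical and would normally be tuned,
    for example with cross-validation.
    """
    return (
        depth >= max_depth                 # maximum depth reached
        or n_samples < min_samples_split   # too few samples in the node
        or best_gain < min_gain            # best split gains too little
    )


print(should_stop(depth=3, n_samples=50, best_gain=0.20))   # True
print(should_stop(depth=1, n_samples=50, best_gain=0.20))   # False
```

Setting the thresholds too aggressively is what causes the underfitting mentioned above, since useful splits deeper in the tree are never tried.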

---

2️⃣ Post-Pruning (Backward Pruning)

Build full tree first, then remove unnecessary branches.

Process:

Grow full tree
Evaluate subtrees
Remove branches that do not improve performance

✅ Advantage

Often generalizes better than pre-pruning, because each branch is judged after the full tree has been grown

---

📊 Example

🌿 Before Pruning (Overfitted Tree)

Age
├── <30
│   ├── Student=Yes → Yes
│   └── Student=No → No
├── 31–40 → Yes
└── >40
    ├── Credit=Fair → Yes
    └── Credit=Excellent
        ├── Income=High → No
        └── Income=Low → Yes

Too complex → captures noise

---

✂️ After Pruning

Age
├── <30 → split on Student
├── 31–40 → Yes
└── >40 → split on Credit

Simpler tree → better generalization

---

⚙️ Common Pruning Techniques

Cost Complexity Pruning (CART)

Rα(T) = R(T) + α|T|

R(T) → Error
|T| → Number of leaves
α → Complexity penalty

Balances accuracy and simplicity.
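
A small numeric sketch shows how α shifts the preferred tree. The error rates and leaf counts below are hypothetical numbers invented for illustration, not measurements from any dataset in this document; `cost_complexity`, `CANDIDATES` and `best_subtree` are likewise illustrative names.

```python
def cost_complexity(error, n_leaves, alpha):
    """R_alpha(T) = R(T) + alpha * |T|."""
    return error + alpha * n_leaves


# Hypothetical candidates: (name, error R(T), number of leaves |T|).
CANDIDATES = [
    ("full tree", 0.00, 6),
    ("pruned tree", 0.02, 5),
    ("root only", 0.36, 1),
]


def best_subtree(alpha):
    """Pick the candidate with the lowest penalized cost for this alpha."""
    return min(CANDIDATES, key=lambda c: cost_complexity(c[1], c[2], alpha))[0]


for alpha in (0.0, 0.05, 0.1):
    print(alpha, "->", best_subtree(alpha))
# 0.0 -> full tree
# 0.05 -> pruned tree
# 0.1 -> root only
```

With α = 0 the penalty vanishes and the largest tree wins. As α grows, each extra leaf has to earn its place with a larger error reduction, so progressively smaller trees are preferred.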

---

Reduced Error Pruning

Remove node if validation accuracy does not decrease

Minimum Error Pruning

Replace subtree with leaf if error reduces

---

🔍 Key Benefits

Prevents overfitting
Improves interpretability
Reduces noise impact
Faster predictions

---

⚠️ Trade-Off

Too Much Pruning	Too Little Pruning
Underfitting	Overfitting
High bias	High variance

---

🔐 Cybersecurity Use Case

In intrusion detection systems:

Without pruning → Overfitted rules that tend to raise many false positives
With pruning → Fewer, more general rules that focus on real threats

Pruning helps reduce alert fatigue and improves decision-making in SOC environments.

---

🧠 Key Insight

Pruning simplifies decision trees by removing branches that do not improve predictive performance, reducing overfitting and improving generalization.

🌳 Decision Tree Classifier (Machine Learning)

Problem: Predict whether a login is a Threat or Safe based on features.

📌 Python Program

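from sklearn.tree import DecisionTreeClassifier

# Each row is [Login Attempts, IP Reputation], encoded as integers.
# Login Attempts: 0 = Low, 1 = Medium, 2 = High
# IP Reputation:  0 = Trusted, 1 = Unknown, 2 = Blacklisted

X = [
    [2, 2],  # High attempts, Blacklisted IP
    [1, 1],  # Medium attempts, Unknown IP
    [2, 1],  # High attempts, Unknown IP
    [0, 0],  # Low attempts, Trusted IP
    [1, 0]   # Medium attempts, Trusted IP
]

# Labels: 1 = Threat, 0 = Safe
y = [1, 0, 1, 0, 0]

# random_state makes tie-breaking between equally good splits reproducible
model = DecisionTreeClassifier(random_state=0)
model.fit(X, y)

# Test sample: High attempts + Blacklisted IP
prediction = model.predict([[2, 2]])

if prediction[0] == 1:
    print("🚨 Threat Detected")
else:
    print("✅ Safe Activity")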

📊 Expected Output

🚨 Threat Detected

🧠 Explanation

We train a Decision Tree model using login behavior data.
Model learns patterns like high login attempts + bad IP = threat.
We test a sample with High attempts + Blacklisted IP, a combination that also appears in the training data.
The model predicts it as a Threat.

🌳 Interactive Quiz: Decision Trees

Q1. What is the main objective when a decision tree chooses a split?

A. Maximize variance
B. Minimize impurity
C. Increase data size
D. Normalize features

✅ Answer: B
Decision trees aim to reduce impurity using Gini or Entropy.

Q2. Which metric is used in ID3?

A. Gini Index
B. Entropy
C. MSE
D. Accuracy

✅ Answer: B
ID3 uses Entropy and Information Gain.

Q3. What does a leaf node represent?

A. Feature
B. Split condition
C. Final prediction
D. Dataset

✅ Answer: C
Leaf node gives the final output.

Q4. Gini impurity of a pure node is:

A. 1
B. 0
C. 0.5
D. -1

✅ Answer: B
Pure node means all samples belong to one class.

Q5. Overfitting occurs when the tree is:

A. Small
B. Deep
C. Balanced
D. Normalized

✅ Answer: B
Deep trees memorize training data and overfit.

Q6. Which algorithm uses Gini Index?

A. ID3
B. CART
C. KNN
D. Naive Bayes

✅ Answer: B
CART uses Gini Index for splitting.

Q7. Decision trees can handle:

A. Only numerical data
B. Only categorical data
C. Both types
D. Only binary data

✅ Answer: C
Handles both categorical and numerical data.

Q8. What is pruning?

A. Adding nodes
B. Removing nodes
C. Scaling features
D. Encoding data

✅ Answer: B
Pruning removes unnecessary branches to reduce overfitting.

Q9. Entropy measures:

A. Accuracy
B. Impurity
C. Speed
D. Size

✅ Answer: B
Entropy measures randomness or uncertainty.

Q10. A decision tree is best described as:

A. Linear model
B. Rule-based model
C. Neural model
D. Probabilistic model

✅ Answer: B
Decision trees follow if-then rules.