A Step by Step ID3 Decision Tree Example

Decision tree algorithms transfom raw data to rule based decision making trees. Herein, ID3 is one of the most common decision tree algorithm. Firstly, It was introduced in 1986 and it is acronym of Iterative Dichotomiser.

premonition_cover — Sandra Bullock, Premonition (2007)

First of all, dichotomisation means dividing into two completely opposite things. That’s why, the algorithm iteratively divides attributes into two groups which are the most dominant attribute and others to construct a tree. Then, it calculates the entropy and information gains of each atrribute. In this way, the most dominant attribute can be founded. After then, the most dominant one is put on the tree as decision node. Thereafter, entropy and gain scores would be calculated again among the other attributes. Thus, the next most dominant attribute is found. Finally, this procedure continues until reaching a decision for that branch. That’s why, it is called Iterative Dichotomiser. So, we’ll mention the algorithm step by step in this post.

🙋‍♂️ You may consider to enroll my top-rated machine learning course on Udemy

Vlog

Here, you should watch the following video to understand how decision tree algorithms work. No matter which decision tree algorithm you are running: ID3, C4.5, CART, CHAID or Regression Trees. They all look for the feature offering the highest information gain. Then, they add a decision rule for the found feature and build an another decision tree for the sub data set recursively until they reached a decision.

Besides, regular decision tree algorithms are designed to create branches for categorical features. Still, we are able to build trees with continuous and numerical features. The trick is here that we will convert continuos features into categorical. We will split the numerical feature where it offers the highest information gain.

ID3 in Python

This blog post mentions the deeply explanation of ID3 algorithm and we will solve a problem step by step. On the other hand, you might just want to run ID3 algorithm and its mathematical background might not attract your attention.

Herein, you can find the python implementation of ID3 algorithm here. You can build ID3 decision trees with a few lines of code. This package supports the most common decision tree algorithms such as ID3, C4.5, CART, CHAID or Regression Trees, also some bagging methods such as random forest and some boosting methods such as gradient boosting and adaboost. You can support this work just by starring the GitHub repository.

Objective

Decision rules will be found based on entropy and information gain pair of features.

Data set

For instance, the following table informs about decision making factors to play tennis at outside for previous 14 days.

Day	Outlook	Temp.	Humidity	Wind	Decision
1	Sunny	Hot	High	Weak	No
2	Sunny	Hot	High	Strong	No
3	Overcast	Hot	High	Weak	Yes
4	Rain	Mild	High	Weak	Yes
5	Rain	Cool	Normal	Weak	Yes
6	Rain	Cool	Normal	Strong	No
7	Overcast	Cool	Normal	Strong	Yes
8	Sunny	Mild	High	Weak	No
9	Sunny	Cool	Normal	Weak	Yes
10	Rain	Mild	Normal	Weak	Yes
11	Sunny	Mild	Normal	Strong	Yes
12	Overcast	Mild	High	Strong	Yes
13	Overcast	Hot	Normal	Weak	Yes
14	Rain	Mild	High	Strong	No

We can summarize the ID3 algorithm as illustrated below

Entropy(S) = ∑ – p(I) . log₂p(I)

Gain(S, A) = Entropy(S) – ∑ [ p(S|A) . Entropy(S|A) ]

These formulas might confuse your mind. Practicing will make it understandable.

Entropy

We need to calculate the entropy first. Decision column consists of 14 instances and includes two labels: yes and no. There are 9 decisions labeled yes, and 5 decisions labeled no.

Entropy(Decision) = – p(Yes) . log₂p(Yes) – p(No) . log₂p(No)

Entropy(Decision) = – (9/14) . log₂(9/14) – (5/14) . log₂(5/14) = 0.940

Now, we need to find the most dominant factor for decisioning.

Wind factor on decision

Gain(Decision, Wind) = Entropy(Decision) – ∑ [ p(Decision|Wind) . Entropy(Decision|Wind) ]

Wind attribute has two labels: weak and strong. We would reflect it to the formula.

Gain(Decision, Wind) = Entropy(Decision) – [ p(Decision|Wind=Weak) . Entropy(Decision|Wind=Weak) ] – [ p(Decision|Wind=Strong) . Entropy(Decision|Wind=Strong) ]

Now, we need to calculate (Decision|Wind=Weak) and (Decision|Wind=Strong) respectively.

Weak wind factor on decision

Day	Outlook	Temp.	Humidity	Wind	Decision
1	Sunny	Hot	High	Weak	No
3	Overcast	Hot	High	Weak	Yes
4	Rain	Mild	High	Weak	Yes
5	Rain	Cool	Normal	Weak	Yes
8	Sunny	Mild	High	Weak	No
9	Sunny	Cool	Normal	Weak	Yes
10	Rain	Mild	Normal	Weak	Yes
13	Overcast	Hot	Normal	Weak	Yes

There are 8 instances for weak wind. Decision of 2 items are no and 6 items are yes as illustrated below.

1- Entropy(Decision|Wind=Weak) = – p(No) . log₂p(No) – p(Yes) . log₂p(Yes)

2- Entropy(Decision|Wind=Weak) = – (2/8) . log₂(2/8) – (6/8) . log₂(6/8) = 0.811

Notice that if the number of instances of a class were 0 and total number of instances were n, then we need to calculate -(0/n) . log₂(0/n). Here, log(0) would be equal to -∞, and we cannot calculate 0 times ∞. This is a special case often appears in decision tree applications. Even though compilers cannot compute this operation, we can compute it with calculus. If you wonder how to compute this equation, please read this post.

Strong wind factor on decision

Day	Outlook	Temp.	Humidity	Wind	Decision
2	Sunny	Hot	High	Strong	No
6	Rain	Cool	Normal	Strong	No
7	Overcast	Cool	Normal	Strong	Yes
11	Sunny	Mild	Normal	Strong	Yes
12	Overcast	Mild	High	Strong	Yes
14	Rain	Mild	High	Strong	No

Here, there are 6 instances for strong wind. Decision is divided into two equal parts.

1- Entropy(Decision|Wind=Strong) = – p(No) . log₂p(No) – p(Yes) . log₂p(Yes)

2- Entropy(Decision|Wind=Strong) = – (3/6) . log₂(3/6) – (3/6) . log₂(3/6) = 1

Now, we can turn back to Gain(Decision, Wind) equation.

Gain(Decision, Wind) = Entropy(Decision) – [ p(Decision|Wind=Weak) . Entropy(Decision|Wind=Weak) ] – [ p(Decision|Wind=Strong) . Entropy(Decision|Wind=Strong) ] = 0.940 – [ (8/14) . 0.811 ] – [ (6/14). 1] = 0.048

Calculations for wind column is over. Now, we need to apply same calculations for other columns to find the most dominant factor on decision.

Other factors on decision

We have applied similar calculation on the other columns.

1- Gain(Decision, Outlook) = 0.246

2- Gain(Decision, Temperature) = 0.029

3- Gain(Decision, Humidity) = 0.151

As seen, outlook factor on decision produces the highest score. That’s why, outlook decision will appear in the root node of the tree.

Now, we need to test dataset for custom subsets of outlook attribute.

Overcast outlook on decision

Basically, decision will always be yes if outlook were overcast.

Day	Outlook	Temp.	Humidity	Wind	Decision
3	Overcast	Hot	High	Weak	Yes
7	Overcast	Cool	Normal	Strong	Yes
12	Overcast	Mild	High	Strong	Yes
13	Overcast	Hot	Normal	Weak	Yes

Sunny outlook on decision

Day	Outlook	Temp.	Humidity	Wind	Decision
1	Sunny	Hot	High	Weak	No
2	Sunny	Hot	High	Strong	No
8	Sunny	Mild	High	Weak	No
9	Sunny	Cool	Normal	Weak	Yes
11	Sunny	Mild	Normal	Strong	Yes

Here, there are 5 instances for sunny outlook. Decision would be probably 3/5 percent no, 2/5 percent yes.

1- Gain(Outlook=Sunny|Temperature) = 0.570

2- Gain(Outlook=Sunny|Humidity) = 0.970

3- Gain(Outlook=Sunny|Wind) = 0.019

Now, humidity is the decision because it produces the highest score if outlook were sunny.

At this point, decision will always be no if humidity were high.

Day	Outlook	Temp.	Humidity	Wind	Decision
1	Sunny	Hot	High	Weak	No
2	Sunny	Hot	High	Strong	No
8	Sunny	Mild	High	Weak	No

On the other hand, decision will always be yes if humidity were normal

Day	Outlook	Temp.	Humidity	Wind	Decision
9	Sunny	Cool	Normal	Weak	Yes
11	Sunny	Mild	Normal	Strong	Yes

Finally, it means that we need to check the humidity and decide if outlook were sunny.

Rain outlook on decision

Day	Outlook	Temp.	Humidity	Wind	Decision
4	Rain	Mild	High	Weak	Yes
5	Rain	Cool	Normal	Weak	Yes
6	Rain	Cool	Normal	Strong	No
10	Rain	Mild	Normal	Weak	Yes
14	Rain	Mild	High	Strong	No

1- Gain(Outlook=Rain | Temperature) = 0.01997309402197489

2- Gain(Outlook=Rain | Humidity) = 0.01997309402197489

3- Gain(Outlook=Rain | Wind) = 0.9709505944546686

Here, wind produces the highest score if outlook were rain. That’s why, we need to check wind attribute in 2nd level if outlook were rain.

So, it is revealed that decision will always be yes if wind were weak and outlook were rain.

Day	Outlook	Temp.	Humidity	Wind	Decision
4	Rain	Mild	High	Weak	Yes
5	Rain	Cool	Normal	Weak	Yes
10	Rain	Mild	Normal	Weak	Yes

What’s more, decision will be always no if wind were strong and outlook were rain.

Day	Outlook	Temp.	Humidity	Wind	Decision
6	Rain	Cool	Normal	Strong	No
14	Rain	Mild	High	Strong	No

So, decision tree construction is over. We can use the following rules for decisioning.

tree-v3
Final version of decision tree

Feature Importance

Decision trees are naturally explainable and interpretable algorithms. Besides, we can find the feature importance values as well to understand how model works.

What’s next?

Extended version of ID3 is C4.5. It supports numerical features and uses gain ratio instead of information gain. Here, you can find a tutorial deeply explained.

Gradient Boosting Decision Trees

Nowadays, gradient boosting decision trees are very popular in machine learning community. They are actually not different than the decision tree algorithm mentioned in this blog post. They mainly builds sequantial decision trees based on the errors in the previous loop.

Random Forest vs Gradient Boosting

The both random forest and gradient boosting are an approach instead of a core decision tree algorithm itself. They require to run core decision tree algorithms. They also build many decision trees in the background. So, we will discuss how they are similar and how they are different in the following video.

Adaboost vs Gradient Boosting

The both gradient boosting and adaboost are boosting techniques for decision tree based machine learning models. We will discuss how they are similar and how they are different than each other.

Conclusion

So, decision tree algorithms transform the raw data into rule based mechanism. In this post, we have mentioned one of the most common decision tree algorithm named as ID3. They can use nominal attributes whereas most of common machine learning algorithms cannot. However, it is required to transform numeric attributes to nominal in ID3. Besides, its evolved version C4.5 exists which can handle nominal data. Even though decision tree algorithms are powerful, they have long training time. On the other hand, they tend to fall over-fitting. Besides, they have evolved versions named random forests which tend not to fall over-fitting issue and have shorter training times.

Like this blog? Support me on Patreon

30 Comments

Workneh says:

June 6, 2018 at 9:00 pm

Thank you dear! It’s very helpful.
OGBONNA VITALIS says:

June 7, 2018 at 5:27 am

Thanks alot, very nice of
mr chidubem okafor says:

June 7, 2018 at 1:18 pm

very cool
jane doe says:

July 4, 2018 at 7:34 am

Instead of (Outlook=Sunny|Temperature)gain, shouldn’t it be Gain(Outlook=Sunny|Temperature)?
1. Sefik Serengil says:
  
  July 5, 2018 at 1:20 pm
  
  Right, I fixed the terms. Thanks.
Zubby zelletan says:

July 19, 2018 at 5:27 am

Thanks alot, the explanation was helpful.
Mustajab Hussain says:

October 28, 2018 at 9:38 am

Please give me the python code of this algorithm
1. Sefik Serengil says:
  
  October 28, 2018 at 9:40 am
  
  You can find it in my GitHub repo. Link is https://github.com/serengil/chefboost
Mustajab Hussain says:

October 28, 2018 at 9:54 am

Thanks Sefik Serengil. You are such a helping man
Yunusa Haruna says:

December 28, 2018 at 4:55 pm

Great tuts, we’ve already got the decision of yes or no. Then what are we looking for?

Regards!
1. Sefik Serengil says:
  
  December 28, 2018 at 5:12 pm
  
  We extract the behavior of previous habits. In this eay, we can answer rain outlook and hot temperature which does not exist in the data set.
Ayman Negm says:

August 7, 2019 at 8:09 pm

very nice explanation … thank you 🙂
dyari says:

November 14, 2019 at 10:59 am

i did same thing but it tells me ImportError: No module named commons. it can not recognize chefboost i use 2.7 jupyter notebook is that the problem ? thanks alot..
1. Sefik Serengil says:
  
  November 14, 2019 at 11:04 am
  
  First of all, you should run it on python 3.7.
  
  After then, commons is a directory in chefboost. You can see it here: https://github.com/serengil/chefboost
  
  commons folder exists in your environment?
  1. dyari says:
    
    November 14, 2019 at 12:48 pm
    
    Thank you so much worked on 3.7.. but i tried my own dataset in which one of the columns is numeric. and the decision tree only took the first value of the first row and split that attribute on that value, But instead of that I have to take an average of that column and split on the average, is there any way to do that ?! thanks again.
    1. Sefik Serengil says:
      
      November 14, 2019 at 12:49 pm
      
      I do not understand your problem actually. Could you share the data set and building decision tree configurations?
dyari says:

November 14, 2019 at 1:09 pm

this is the dataset and the rule.ps https://drive.google.com/open?id=1mDjy7tOqjjfUoBwp0LrkLB2sBvvoVCDJ …
i hope you can find out what i mean .. and another thing is it can not predict if one of the columns is a number. thanks
Mirbagher says:

January 21, 2020 at 5:19 am

Great step by step explanation! Please let me know if I can share these contents in my machine learning class?

Thank you!
1. Sefik Serengil says:
  
  January 21, 2020 at 5:23 am
  
  Of course, I would be happy if you share these content in a classroom. I will be appreciate if you reference the blog.
Ted says:

March 22, 2020 at 12:44 pm

Nice explanation.Just one doubt…Can you tell me how ID3 handles continuous values of attributes?I am aware that c4.5 is the way to go if there are continuous values but I just happen to find a ID3 problem that had continuous values in the attribute.Please tell me how does ID3 handle continuous values
Djamal says:

April 26, 2020 at 1:25 pm

@Ted ID3 only works with categorical attributes.
Tuba says:

July 24, 2020 at 10:31 am

can I apply Linear Regression on this data set
1. Sefik Serengil says:
  
  August 7, 2020 at 8:59 am
  
  Decision trees are non-linear (piere-wise actually) algorithms. So, linear regression will fail on this kind of data set.
2. Muhammad Yasir says:
  
  April 30, 2021 at 10:07 am
  
  When the target variable is a discrete value, either yes or no, true or false. then classification is suitable for such a dataset. Linear regression is useable when the target variable or label continues like prices, temperature, etc. In this case, the target variable is yes/no form so you can’t apply LR.
Sahe Alam says:

October 13, 2020 at 9:20 am

Thank you so much for the solution. It’s really helpful for me during my final examination time for Master of networking (Cyber Security).
DexlonS says:

February 17, 2021 at 5:13 pm

no temperature on decision tree?
1. Sefik Serengil says:
  
  February 17, 2021 at 5:13 pm
  
  yes, it seems that it is unimportant one.
sara says:

April 9, 2021 at 11:04 am

Thank you so much. but what is deference between this algorithm and id5 ones? how features are selected?
1. Sefik Serengil says:
  
  April 9, 2021 at 1:16 pm
  
  No idea what id5 is
Banar Agung Santosa says:

March 25, 2022 at 3:23 pm

hey can u make this recursive, keep asking the questions until the gain is 0?