A Step By Step C4.5 Decision Tree Example

Decision trees are still a hot topic in the data science world. ID3 is the most common conventional decision tree algorithm, but it has bottlenecks: attributes must be nominal, the dataset must not include missing data, and the algorithm tends to overfit. Ross Quinlan, the inventor of ID3, addressed these bottlenecks and created a new algorithm named C4.5. The new algorithm can create more generalized models, including models over continuous data, and it can handle missing data. Additionally, some resources such as Weka call this algorithm J48; that name actually refers to a re-implementation of C4.5 release 8.


Vlog

You should watch the following video to understand how decision tree algorithms work. No matter which decision tree algorithm you are running (ID3, C4.5, CART, CHAID or regression trees), they all look for the feature offering the highest information gain. Then, they add a decision rule for that feature and build another decision tree for the sub-dataset recursively until they reach a decision.


🙋‍♂️ You may consider enrolling in my top-rated machine learning course on Udemy

Decision Trees for Machine Learning

Besides, regular decision tree algorithms are designed to create branches for categorical features. Still, we are able to build trees with continuous and numerical features. The trick is that we convert continuous features into categorical ones by splitting the numerical feature at the point offering the highest information gain.

C4.5 in Python

This blog post covers the C4.5 algorithm in depth, and we will solve a problem step by step. On the other hand, you might just want to run the C4.5 algorithm without digging into its mathematical background.

Herein, you can find a Python implementation of the C4.5 algorithm here. You can build C4.5 decision trees with a few lines of code. This package supports the most common decision tree algorithms such as ID3, CART, CHAID and regression trees, as well as some bagging methods such as random forest and some boosting methods such as gradient boosting and AdaBoost.
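For instance, a minimal usage sketch of that package is shown below. It assumes the chefboost package is installed, that the training data sits in a CSV file whose last column is the target (Decision), and that the file path is illustrative; the exact function signatures may differ between package versions.

from chefboost import Chefboost as chef
import pandas as pd

# load the golf dataset; the file path is illustrative and the
# last column of the data frame is expected to be the target (Decision)
df = pd.read_csv("golf.csv")

# build a C4.5 decision tree with a single configuration dictionary
config = {'algorithm': 'C4.5'}
model = chef.fit(df, config)

# predict the decision for a new instance (feature order matches the data frame)
prediction = chef.predict(model, ['Sunny', 85, 85, 'Weak'])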

Objective

Decision rules will be found based on the entropy and information gain ratio of each feature. At each level of the decision tree, the feature with the maximum gain ratio (or gain, depending on the chosen metric) becomes the decision node.

Data set

We are going to create a decision tree for the following dataset. It describes the factors behind the decision to play tennis outside over the previous 14 days. The dataset might be familiar from the ID3 post. The difference is that the temperature and humidity columns have continuous values instead of nominal ones.

Day Outlook Temp. Humidity Wind Decision
1 Sunny 85 85 Weak No
2 Sunny 80 90 Strong No
3 Overcast 83 78 Weak Yes
4 Rain 70 96 Weak Yes
5 Rain 68 80 Weak Yes
6 Rain 65 70 Strong No
7 Overcast 64 65 Strong Yes
8 Sunny 72 95 Weak No
9 Sunny 69 70 Weak Yes
10 Rain 75 80 Weak Yes
11 Sunny 75 70 Strong Yes
12 Overcast 72 90 Strong Yes
13 Overcast 81 75 Weak Yes
14 Rain 71 80 Strong No

We will do what we did in the ID3 example. Firstly, we need to calculate the global entropy. There are 14 examples; 9 instances refer to a yes decision, and 5 instances refer to a no decision.

Entropy(Decision) = ∑ – p(I) . log2p(I) = – p(Yes) . log2p(Yes) – p(No) . log2p(No) = – (9/14) . log2(9/14) – (5/14) . log2(5/14) = 0.940
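If you want to verify this value programmatically, a minimal Python sketch is shown below; the entropy helper is my own, not part of any library.

import math

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((labels.count(c) / total) * math.log2(labels.count(c) / total)
                for c in set(labels))

# decisions of days 1 to 14 from the table above
decisions = ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes',
             'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No']

print(f"{entropy(decisions):.3f}")  # 0.940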

In the ID3 algorithm, we calculated gains for each attribute. Here, we need to calculate gain ratios instead of gains.

GainRatio(A) = Gain(A) / SplitInfo(A)

SplitInfo(A) = -∑ |Dj|/|D| x log2|Dj|/|D|

Wind Attribute

Wind is a nominal attribute. Its possible values are weak and strong.

Gain(Decision, Wind) = Entropy(Decision) – ∑ ( p(Decision|Wind) . Entropy(Decision|Wind) )

Gain(Decision, Wind) = Entropy(Decision) – [ p(Decision|Wind=Weak) . Entropy(Decision|Wind=Weak) ] – [ p(Decision|Wind=Strong) . Entropy(Decision|Wind=Strong) ]

There are 8 weak wind instances. 2 of them are concluded as no, 6 of them are concluded as yes.

Entropy(Decision|Wind=Weak) = – p(No) . log2p(No) – p(Yes) . log2p(Yes) = – (2/8) . log2(2/8) – (6/8) . log2(6/8) = 0.811

Entropy(Decision|Wind=Strong) = – (3/6) . log2(3/6) – (3/6) . log2(3/6) = 1

Gain(Decision, Wind) = 0.940 – (8/14).(0.811) – (6/14).(1) = 0.940 – 0.463 – 0.428 = 0.049

There are 8 instances of weak wind, and 6 instances of strong wind.

SplitInfo(Decision, Wind) = -(8/14).log2(8/14) – (6/14).log2(6/14) = 0.461 + 0.524 = 0.985

GainRatio(Decision, Wind) = Gain(Decision, Wind) / SplitInfo(Decision, Wind) = 0.049 / 0.985 = 0.049
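The same arithmetic can be scripted. The sketch below uses my own helper (not part of any library) to compute the gain and gain ratio of a nominal feature; applied to the wind column, it reproduces the values above up to rounding.

import math

def entropy(labels):
    total = len(labels)
    return -sum((labels.count(c) / total) * math.log2(labels.count(c) / total)
                for c in set(labels))

def gain_and_gain_ratio(feature_values, labels):
    """Information gain and gain ratio of a nominal feature."""
    total = len(labels)
    weighted_entropy = 0.0
    split_info = 0.0
    for value in set(feature_values):
        subset = [lab for val, lab in zip(feature_values, labels) if val == value]
        weight = len(subset) / total
        weighted_entropy += weight * entropy(subset)
        split_info -= weight * math.log2(weight)
    gain = entropy(labels) - weighted_entropy
    return gain, gain / split_info

# wind and decision columns for days 1 to 14
wind = ['Weak', 'Strong', 'Weak', 'Weak', 'Weak', 'Strong', 'Strong',
        'Weak', 'Weak', 'Weak', 'Strong', 'Strong', 'Weak', 'Strong']
decisions = ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes',
             'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No']

gain, gain_ratio = gain_and_gain_ratio(wind, decisions)
print(f"{gain:.3f} {gain_ratio:.3f}")  # 0.048 0.049 (matches the hand calculation up to rounding)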

Outlook Attribute

Outlook is a nominal attribute, too. Its possible values are sunny, overcast and rain.

Gain(Decision, Outlook) = Entropy(Decision) – ∑ ( p(Decision|Outlook) . Entropy(Decision|Outlook) )

Gain(Decision, Outlook) = Entropy(Decision) – p(Decision|Outlook=Sunny) . Entropy(Decision|Outlook=Sunny) – p(Decision|Outlook=Overcast) . Entropy(Decision|Outlook=Overcast) – p(Decision|Outlook=Rain) . Entropy(Decision|Outlook=Rain)

There are 5 sunny instances. 3 of them are concluded as no, 2 of them are concluded as yes.

Entropy(Decision|Outlook=Sunny) = – p(No) . log2p(No) – p(Yes) . log2p(Yes) = -(3/5).log2(3/5) – (2/5).log2(2/5) = 0.441 + 0.528 = 0.970

Entropy(Decision|Outlook=Overcast) = – p(No) . log2p(No) – p(Yes) . log2p(Yes) = -(0/4).log2(0/4) – (4/4).log2(4/4) = 0

Notice that log2(0) is actually equal to -∞, but we treat its contribution as 0 here. In fact, lim (x->0) x.log2(x) = 0. If you wonder about the proof, please look at this post.

Entropy(Decision|Outlook=Rain) = – p(No) . log2p(No) – p(Yes) . log2p(Yes) = -(2/5).log2(2/5) – (3/5).log2(3/5) = 0.528 + 0.441 = 0.970

Gain(Decision, Outlook) = 0.940 – (5/14).(0.970) – (4/14).(0) – (5/14).(0.970) = 0.246

There are 5 instances for sunny, 4 instances for overcast and 5 instances for rain.

SplitInfo(Decision, Outlook) = -(5/14).log2(5/14) -(4/14).log2(4/14) -(5/14).log2(5/14) = 1.577

GainRatio(Decision, Outlook) = Gain(Decision, Outlook)/SplitInfo(Decision, Outlook) = 0.246/1.577 = 0.155
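The gain_and_gain_ratio helper sketched in the wind section can be reused for the outlook column in the same way; the printed values agree with the hand calculation up to rounding.

# reuses gain_and_gain_ratio and decisions from the wind sketch above
outlook = ['Sunny', 'Sunny', 'Overcast', 'Rain', 'Rain', 'Rain', 'Overcast',
           'Sunny', 'Sunny', 'Rain', 'Sunny', 'Overcast', 'Overcast', 'Rain']
print(gain_and_gain_ratio(outlook, decisions))  # approximately (0.247, 0.156)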

Humidity Attribute

Unlike the attributes above, humidity is a continuous attribute. We need to convert its continuous values into nominal ones. C4.5 proposes performing a binary split based on a threshold value, and the threshold should be the value that offers the maximum gain for that attribute. Let's focus on the humidity attribute. Firstly, we need to sort the humidity values from smallest to largest.

Day Humidity Decision
7 65 Yes
6 70 No
9 70 Yes
11 70 Yes
13 75 Yes
3 78 Yes
5 80 Yes
10 80 Yes
14 80 No
1 85 No
2 90 No
12 90 Yes
8 95 No
4 96 Yes

Now, we need to iterate over all humidity values and separate the dataset into two parts: instances less than or equal to the current value, and instances greater than the current value. We calculate the gain (or gain ratio) for every candidate split. The value that maximizes the gain will be the threshold, as sketched in the code below.
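Here is a sketch of this threshold search in Python. The best_threshold helper is my own; it skips the largest value as a candidate split point because splitting there would leave the greater-than branch empty (the post makes the same point for the value 96 further down).

import math

def entropy(labels):
    total = len(labels)
    return -sum((labels.count(c) / total) * math.log2(labels.count(c) / total)
                for c in set(labels))

def best_threshold(values, labels):
    """Binary split point of a continuous feature that maximizes information gain."""
    total = len(labels)
    global_entropy = entropy(labels)
    best = (None, 0.0)
    for threshold in sorted(set(values))[:-1]:  # skip the largest value
        left = [lab for val, lab in zip(values, labels) if val <= threshold]
        right = [lab for val, lab in zip(values, labels) if val > threshold]
        gain = global_entropy - (len(left) / total) * entropy(left) \
                              - (len(right) / total) * entropy(right)
        if gain > best[1]:
            best = (threshold, gain)
    return best

# humidity and decision columns for days 1 to 14
humidity = [85, 90, 78, 96, 80, 70, 65, 95, 70, 80, 70, 90, 75, 80]
decisions = ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes',
             'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No']

print(best_threshold(humidity, decisions))  # (80, 0.10...): the threshold 80 maximizes the gain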

Check 65 as a threshold for humidity

Entropy(Decision|Humidity<=65) = – p(No) . log2p(No) – p(Yes) . log2p(Yes) = -(0/1).log2(0/1) – (1/1).log2(1/1) = 0

Entropy(Decision|Humidity>65) = -(5/13).log2(5/13) – (8/13).log2(8/13) =0.530 + 0.431 = 0.961

Gain(Decision, Humidity<> 65) = 0.940 – (1/14).0 – (13/14).(0.961) = 0.048

* The <> notation above refers to the two branches of the decision tree: humidity less than or equal to 65, and humidity greater than 65. It does not mean that humidity is not equal to 65!

SplitInfo(Decision, Humidity<> 65) = -(1/14).log2(1/14) -(13/14).log2(13/14) = 0.371

GainRatio(Decision, Humidity<> 65) = 0.126

Check 70 as a threshold for humidity

Entropy(Decision|Humidity<=70) = – (1/4).log2(1/4) – (3/4).log2(3/4) = 0.811

Entropy(Decision|Humidity>70) =  – (4/10).log2(4/10) – (6/10).log2(6/10) = 0.970

Gain(Decision, Humidity<> 70) = 0.940 – (4/14).(0.811) – (10/14).(0.970) = 0.940 – 0.231 – 0.692 = 0.014

SplitInfo(Decision, Humidity<> 70) = -(4/14).log2(4/14) -(10/14).log2(10/14) = 0.863

GainRatio(Decision, Humidity<> 70) = 0.016

Check 75 as a threshold for humidity

Entropy(Decision|Humidity<=75) = – (1/5).log2(1/5) – (4/5).log2(4/5) = 0.721

Entropy(Decision|Humidity>75) = – (4/9).log2(4/9) – (5/9).log2(5/9) = 0.991

Gain(Decision, Humidity<> 75) = 0.940 – (5/14).(0.721) – (9/14).(0.991) = 0.940 – 0.2575 – 0.637 = 0.045

SplitInfo(Decision, Humidity<> 75) = -(5/14).log2(5/14) -(9/14).log2(9/14) = 0.940

GainRatio(Decision, Humidity<> 75) = 0.047

I think these calculation demonstrations are enough. From now on, I will skip the calculations and write only the results.

Gain(Decision, Humidity <> 78) = 0.090, GainRatio(Decision, Humidity <> 78) = 0.090

Gain(Decision, Humidity <> 80) = 0.101, GainRatio(Decision, Humidity <> 80) = 0.107

Gain(Decision, Humidity <> 85) = 0.024, GainRatio(Decision, Humidity <> 85) = 0.027

Gain(Decision, Humidity <> 90) = 0.010, GainRatio(Decision, Humidity <> 90) = 0.016

Gain(Decision, Humidity <> 95) = 0.048, GainRatio(Decision, Humidity <> 95) = 0.128

Here, I ignore the value 96 as a threshold because it is the largest humidity value in the dataset; splitting there would leave the greater-than branch empty.

As seen, the gain is maximized when the threshold equals 80 for humidity. This means that we will compare the binary feature humidity <> 80 against the other nominal attributes when creating a branch in our tree.

The temperature feature is continuous as well. When I apply a binary split to temperature over all possible split points, the following split maximizes both gain and gain ratio.

Gain(Decision, Temperature <> 83) = 0.113, GainRatio(Decision, Temperature<> 83) = 0.305
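Reusing the best_threshold sketch from the humidity section on the temperature column reproduces this split point:

# reuses best_threshold and decisions from the humidity sketch above
temperature = [85, 80, 83, 70, 68, 65, 64, 72, 69, 75, 75, 72, 81, 71]
print(best_threshold(temperature, decisions))  # (83, 0.113...)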

Let's summarize the calculated gains and gain ratios. The outlook attribute has the highest gain, while the temperature split has the highest gain ratio. Which one becomes the root of the decision tree depends on the metric we choose.

Attribute Gain GainRatio
Wind  0.049  0.049
Outlook  0.246  0.155
Humidity <> 80  0.101  0.107
Temperature <> 83  0.113  0.305

If we use the gain metric, then outlook will be the root node because it has the highest gain value. On the other hand, if we use the gain ratio metric, then temperature will be the root node because it has the highest gain ratio value. I prefer to use gain here, similar to ID3. As homework, please try to build a C4.5 decision tree based on the gain ratio metric.

After that, we apply similar steps to ID3 and create the following decision tree. Outlook is put into the root node. Now, we should look at the decisions for the different outlook types.

Outlook = Sunny

We've split humidity into greater than 80, and less than or equal to 80. The decision is always no if humidity is greater than 80 when the outlook is sunny. Similarly, the decision is always yes if humidity is less than or equal to 80 for a sunny outlook.

Day Outlook Temp. Hum. > 80 Wind Decision
1 Sunny 85 Yes Weak No
2 Sunny 80 Yes Strong No
8 Sunny 72 Yes Weak No
9 Sunny 69 No Weak Yes
11 Sunny 75 No Strong Yes

Outlook = Overcast

If the outlook is overcast, then the decision is always yes, no matter what temperature, humidity or wind are.

Day Outlook Temp. Hum. > 80 Wind Decision
3 Overcast 83 No Weak Yes
7 Overcast 64 No Strong Yes
12 Overcast 72 Yes Strong Yes
13 Overcast 81 No Weak Yes

Outlook = Rain

We've filtered the rain outlook instances. As seen, the decision is yes when the wind is weak, and no when the wind is strong.

Day Outlook Temp. Hum. > 80 Wind Decision
4 Rain 70 Yes Weak Yes
5 Rain 68 No Weak Yes
6 Rain 65 No Strong No
10 Rain 75 No Weak Yes
14 Rain 71 No Strong No

The final form of the decision tree is demonstrated below.

Decision tree generated by C4.5
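Reading the branches off the tree above, the decision rules built with the gain metric can be sketched in plain Python; the findDecision name simply mirrors the rule dump in the bonus section below.

def findDecision(Outlook, Temperature, Humidity, Wind):
    # gain-based tree: outlook is the root node; temperature is not used
    if Outlook == 'Overcast':
        return 'Yes'
    elif Outlook == 'Sunny':
        # humidity was discretized with the threshold 80
        return 'Yes' if Humidity <= 80 else 'No'
    elif Outlook == 'Rain':
        return 'Yes' if Wind == 'Weak' else 'No'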

Feature Importance

Decision trees are naturally explainable and interpretable algorithms. Besides, we can extract feature importance values to understand how the model works.

Gradient Boosting Decision Trees

Nowadays, gradient boosting decision trees are very popular in the machine learning community. They are actually not different from the decision tree algorithm mentioned in this blog post. They mainly build sequential decision trees based on the errors of the previous iteration.

Random Forest vs Gradient Boosting

Both random forest and gradient boosting are ensemble approaches rather than core decision tree algorithms themselves. They require a core decision tree algorithm to run, and they build many decision trees in the background. We will discuss how they are similar and how they differ in the following video.

Adaboost vs Gradient Boosting

Both gradient boosting and AdaBoost are boosting techniques for decision-tree-based machine learning models. We will discuss how they are similar and how they differ from each other.

Conclusion

So, the C4.5 algorithm solves most of the problems in ID3. The algorithm uses gain ratios instead of gains. In this way, it creates more generalized trees and is less prone to overfitting. Moreover, the algorithm transforms continuous attributes into nominal ones based on gain maximization, and in this way it can handle continuous data. Additionally, it can ignore instances with missing data and thus handle datasets containing missing values. On the other hand, both ID3 and C4.5 have high CPU and memory demands. Besides, many authorities place decision tree algorithms in the data mining field rather than machine learning.

Bonus

In this post, we have used the gain metric to build a C4.5 decision tree. If we use gain ratio as the decision metric instead, then the built tree looks different; the following decision rules are produced with the gain ratio metric.

def findDecision(Outlook, Temperature, Humidity, Wind):
   # gain-ratio-based tree: temperature is the root node
   if Temperature <= 83:
      if Outlook == 'Rain':
         if Wind == 'Weak':
            return 'Yes'
         elif Wind == 'Strong':
            return 'No'
      elif Outlook == 'Overcast':
         return 'Yes'
      elif Outlook == 'Sunny':
         if Humidity > 65:
            if Wind == 'Strong':
               return 'No'
            elif Wind == 'Weak':
               return 'No'
   elif Temperature > 83:
      return 'No'
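As a quick check, calling findDecision('Sunny', 85, 85, 'Weak') with the rules above falls into the Temperature > 83 branch and returns 'No', which matches day 1 of the dataset.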

I ask you to derive this tree with the gain ratio metric as homework to understand the C4.5 algorithm better. The gain-ratio-based C4.5 decision tree looks like the decision rules shown above.
