A Step By Step C4.5 Decision Tree Example

Decision trees are still a hot topic in the data science world. ID3 is the most common conventional decision tree algorithm, but it has bottlenecks: attributes must be nominal, the dataset must not include missing data, and the algorithm tends to overfit. Ross Quinlan, the inventor of ID3, addressed these bottlenecks and created a new algorithm named C4.5. The new algorithm can create more generalized models, including models over continuous data, and it can handle missing data. Additionally, some resources such as Weka call this algorithm J48; that name actually refers to a re-implementation of C4.5 release 8.


Vlog

You should watch the following video to understand how decision tree algorithms work. No matter which decision tree algorithm you are running (ID3, C4.5, CART, CHAID or regression trees), they all look for the feature offering the highest information gain. Then, they add a decision rule for that feature and build another decision tree for the sub-dataset recursively until they reach a decision.


🙋‍♂️ You may consider enrolling in my top-rated machine learning course on Udemy

Decision Trees for Machine Learning

Besides, regular decision tree algorithms are designed to create branches for categorical features. Still, we are able to build trees with continuous and numerical features. The trick is that we convert continuous features into categorical ones by splitting the numerical feature at the point offering the highest information gain.

C4.5 in Python

This blog post covers the C4.5 algorithm in depth, and we will solve a problem step by step. On the other hand, you might just want to run the C4.5 algorithm without digging into its mathematical background.

Herein, you can find a Python implementation of the C4.5 algorithm here. You can build C4.5 decision trees with a few lines of code. This package supports the most common decision tree algorithms such as ID3, CART, CHAID and regression trees, as well as some bagging methods such as random forest and some boosting methods such as gradient boosting and AdaBoost.
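For instance, a minimal usage sketch of that package is shown below. It assumes the chefboost package is installed, that the training data sits in a CSV file whose last column is the target (Decision), and that the file path is illustrative; the exact function signatures may differ between package versions.

from chefboost import Chefboost as chef
import pandas as pd

# load the golf dataset; the file path is illustrative and the
# last column of the data frame is expected to be the target (Decision)
df = pd.read_csv("golf.csv")

# build a C4.5 decision tree with a single configuration dictionary
config = {'algorithm': 'C4.5'}
model = chef.fit(df, config)

# predict the decision for a new instance (feature order matches the data frame)
prediction = chef.predict(model, ['Sunny', 85, 85, 'Weak'])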

Objective

Decision rules will be found based on the entropy and information gain ratio of each feature. At each level of the decision tree, the feature with the maximum gain ratio (or gain, depending on the chosen metric) becomes the decision node.

Data set

We are going to create a decision tree for the following dataset. It describes the factors behind the decision to play tennis outside over the previous 14 days. The dataset might be familiar from the ID3 post. The difference is that the temperature and humidity columns have continuous values instead of nominal ones.

Day Outlook Temp. Humidity Wind Decision
1 Sunny 85 85 Weak No
2 Sunny 80 90 Strong No
3 Overcast 83 78 Weak Yes
4 Rain 70 96 Weak Yes
5 Rain 68 80 Weak Yes
6 Rain 65 70 Strong No
7 Overcast 64 65 Strong Yes
8 Sunny 72 95 Weak No
9 Sunny 69 70 Weak Yes
10 Rain 75 80 Weak Yes
11 Sunny 75 70 Strong Yes
12 Overcast 72 90 Strong Yes
13 Overcast 81 75 Weak Yes
14 Rain 71 80 Strong No

We will do what we did in the ID3 example. Firstly, we need to calculate the global entropy. There are 14 examples; 9 instances refer to a yes decision, and 5 instances refer to a no decision.

Entropy(Decision) = ∑ – p(I) . log2p(I) = – p(Yes) . log2p(Yes) – p(No) . log2p(No) = – (9/14) . log2(9/14) – (5/14) . log2(5/14) = 0.940
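If you want to verify this value programmatically, a minimal Python sketch is shown below; the entropy helper is my own, not part of any library.

import math

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((labels.count(c) / total) * math.log2(labels.count(c) / total)
                for c in set(labels))

# decisions of days 1 to 14 from the table above
decisions = ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes',
             'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No']

print(f"{entropy(decisions):.3f}")  # 0.940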

In the ID3 algorithm, we calculated gains for each attribute. Here, we need to calculate gain ratios instead of gains.

GainRatio(A) = Gain(A) / SplitInfo(A)

SplitInfo(A) = -∑ |Dj|/|D| x log2|Dj|/|D|

Wind Attribute

Wind is a nominal attribute. Its possible values are weak and strong.

Gain(Decision, Wind) = Entropy(Decision) – ∑ ( p(Decision|Wind) . Entropy(Decision|Wind) )

Gain(Decision, Wind) = Entropy(Decision) – [ p(Decision|Wind=Weak) . Entropy(Decision|Wind=Weak) ] – [ p(Decision|Wind=Strong) . Entropy(Decision|Wind=Strong) ]

There are 8 weak wind instances. 2 of them are concluded as no, 6 of them are concluded as yes.

Entropy(Decision|Wind=Weak) = – p(No) . log2p(No) – p(Yes) . log2p(Yes) = – (2/8) . log2(2/8) – (6/8) . log2(6/8) = 0.811

Entropy(Decision|Wind=Strong) = – (3/6) . log2(3/6) – (3/6) . log2(3/6) = 1

Gain(Decision, Wind) = 0.940 – (8/14).(0.811) – (6/14).(1) = 0.940 – 0.463 – 0.428 = 0.049

There are 8 instances of weak wind, and 6 instances of strong wind.

SplitInfo(Decision, Wind) = -(8/14).log2(8/14) – (6/14).log2(6/14) = 0.461 + 0.524 = 0.985

GainRatio(Decision, Wind) = Gain(Decision, Wind) / SplitInfo(Decision, Wind) = 0.049 / 0.985 = 0.049
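The same arithmetic can be scripted. The sketch below uses my own helper (not part of any library) to compute the gain and gain ratio of a nominal feature; applied to the wind column, it reproduces the values above up to rounding.

import math

def entropy(labels):
    total = len(labels)
    return -sum((labels.count(c) / total) * math.log2(labels.count(c) / total)
                for c in set(labels))

def gain_and_gain_ratio(feature_values, labels):
    """Information gain and gain ratio of a nominal feature."""
    total = len(labels)
    weighted_entropy = 0.0
    split_info = 0.0
    for value in set(feature_values):
        subset = [lab for val, lab in zip(feature_values, labels) if val == value]
        weight = len(subset) / total
        weighted_entropy += weight * entropy(subset)
        split_info -= weight * math.log2(weight)
    gain = entropy(labels) - weighted_entropy
    return gain, gain / split_info

# wind and decision columns for days 1 to 14
wind = ['Weak', 'Strong', 'Weak', 'Weak', 'Weak', 'Strong', 'Strong',
        'Weak', 'Weak', 'Weak', 'Strong', 'Strong', 'Weak', 'Strong']
decisions = ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes',
             'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No']

gain, gain_ratio = gain_and_gain_ratio(wind, decisions)
print(f"{gain:.3f} {gain_ratio:.3f}")  # 0.048 0.049 (matches the hand calculation up to rounding)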

Outlook Attribute

Outlook is a nominal attribute, too. Its possible values are sunny, overcast and rain.

Gain(Decision, Outlook) = Entropy(Decision) – ∑ ( p(Decision|Outlook) . Entropy(Decision|Outlook) )

Gain(Decision, Outlook) = Entropy(Decision) – p(Decision|Outlook=Sunny) . Entropy(Decision|Outlook=Sunny) – p(Decision|Outlook=Overcast) . Entropy(Decision|Outlook=Overcast) – p(Decision|Outlook=Rain) . Entropy(Decision|Outlook=Rain)

There are 5 sunny instances. 3 of them are concluded as no, 2 of them are concluded as yes.

Entropy(Decision|Outlook=Sunny) = – p(No) . log2p(No) – p(Yes) . log2p(Yes) = -(3/5).log2(3/5) – (2/5).log2(2/5) = 0.441 + 0.528 = 0.970

Entropy(Decision|Outlook=Overcast) = – p(No) . log2p(No) – p(Yes) . log2p(Yes) = -(0/4).log2(0/4) – (4/4).log2(4/4) = 0

Notice that log2(0) is actually equal to -∞, but we treat its contribution as 0 here. In fact, lim (x->0) x.log2(x) = 0. If you wonder about the proof, please look at this post.

Entropy(Decision|Outlook=Rain) = – p(No) . log2p(No) – p(Yes) . log2p(Yes) = -(2/5).log2(2/5) – (3/5).log2(3/5) = 0.528 + 0.441 = 0.970

Gain(Decision, Outlook) = 0.940 – (5/14).(0.970) – (4/14).(0) – (5/14).(0.970) = 0.246

There are 5 instances for sunny, 4 instances for overcast and 5 instances for rain.

SplitInfo(Decision, Outlook) = -(5/14).log2(5/14) -(4/14).log2(4/14) -(5/14).log2(5/14) = 1.577

GainRatio(Decision, Outlook) = Gain(Decision, Outlook)/SplitInfo(Decision, Outlook) = 0.246/1.577 = 0.155
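The gain_and_gain_ratio helper sketched in the wind section can be reused for the outlook column in the same way; the printed values agree with the hand calculation up to rounding.

# reuses gain_and_gain_ratio and decisions from the wind sketch above
outlook = ['Sunny', 'Sunny', 'Overcast', 'Rain', 'Rain', 'Rain', 'Overcast',
           'Sunny', 'Sunny', 'Rain', 'Sunny', 'Overcast', 'Overcast', 'Rain']
print(gain_and_gain_ratio(outlook, decisions))  # approximately (0.247, 0.156)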

Humidity Attribute

Unlike the attributes above, humidity is a continuous attribute. We need to convert its continuous values into nominal ones. C4.5 proposes performing a binary split based on a threshold value, and the threshold should be the value that offers the maximum gain for that attribute. Let's focus on the humidity attribute. Firstly, we need to sort the humidity values from smallest to largest.

Day Humidity Decision
7 65 Yes
6 70 No
9 70 Yes
11 70 Yes
13 75 Yes
3 78 Yes
5 80 Yes
10 80 Yes
14 80 No
1 85 No
2 90 No
12 90 Yes
8 95 No
4 96 Yes

Now, we need to iterate over all humidity values and separate the dataset into two parts: instances less than or equal to the current value, and instances greater than the current value. We calculate the gain (or gain ratio) for every candidate split. The value that maximizes the gain will be the threshold, as sketched in the code below.
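Here is a sketch of this threshold search in Python. The best_threshold helper is my own; it skips the largest value as a candidate split point because splitting there would leave the greater-than branch empty (the post makes the same point for the value 96 further down).

import math

def entropy(labels):
    total = len(labels)
    return -sum((labels.count(c) / total) * math.log2(labels.count(c) / total)
                for c in set(labels))

def best_threshold(values, labels):
    """Binary split point of a continuous feature that maximizes information gain."""
    total = len(labels)
    global_entropy = entropy(labels)
    best = (None, 0.0)
    for threshold in sorted(set(values))[:-1]:  # skip the largest value
        left = [lab for val, lab in zip(values, labels) if val <= threshold]
        right = [lab for val, lab in zip(values, labels) if val > threshold]
        gain = global_entropy - (len(left) / total) * entropy(left) \
                              - (len(right) / total) * entropy(right)
        if gain > best[1]:
            best = (threshold, gain)
    return best

# humidity and decision columns for days 1 to 14
humidity = [85, 90, 78, 96, 80, 70, 65, 95, 70, 80, 70, 90, 75, 80]
decisions = ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes',
             'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No']

print(best_threshold(humidity, decisions))  # (80, 0.10...): the threshold 80 maximizes the gain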

Check 65 as a threshold for humidity

Entropy(Decision|Humidity<=65) = – p(No) . log2p(No) – p(Yes) . log2p(Yes) = -(0/1).log2(0/1) – (1/1).log2(1/1) = 0

Entropy(Decision|Humidity>65) = -(5/13).log2(5/13) – (8/13).log2(8/13) =0.530 + 0.431 = 0.961

Gain(Decision, Humidity<> 65) = 0.940 – (1/14).0 – (13/14).(0.961) = 0.048

* The <> notation above refers to the two branches of the decision tree: humidity less than or equal to 65, and humidity greater than 65. It does not mean that humidity is not equal to 65!

SplitInfo(Decision, Humidity<> 65) = -(1/14).log2(1/14) -(13/14).log2(13/14) = 0.371

GainRatio(Decision, Humidity<> 65) = 0.126

Check 70 as a threshold for humidity

Entropy(Decision|Humidity<=70) = – (1/4).log2(1/4) – (3/4).log2(3/4) = 0.811

Entropy(Decision|Humidity>70) =  – (4/10).log2(4/10) – (6/10).log2(6/10) = 0.970

Gain(Decision, Humidity<> 70) = 0.940 – (4/14).(0.811) – (10/14).(0.970) = 0.940 – 0.231 – 0.692 = 0.014

SplitInfo(Decision, Humidity<> 70) = -(4/14).log2(4/14) -(10/14).log2(10/14) = 0.863

GainRatio(Decision, Humidity<> 70) = 0.016

Check 75 as a threshold for humidity

Entropy(Decision|Humidity<=75) = – (1/5).log2(1/5) – (4/5).log2(4/5) = 0.721

Entropy(Decision|Humidity>75) = – (4/9).log2(4/9) – (5/9).log2(5/9) = 0.991

Gain(Decision, Humidity<> 75) = 0.940 – (5/14).(0.721) – (9/14).(0.991) = 0.940 – 0.2575 – 0.637 = 0.045

SplitInfo(Decision, Humidity<> 75) = -(5/14).log2(5/14) -(9/14).log2(9/14) = 0.940

GainRatio(Decision, Humidity<> 75) = 0.047

I think these calculation demonstrations are enough. From now on, I will skip the calculations and write only the results.

Gain(Decision, Humidity <> 78) = 0.090, GainRatio(Decision, Humidity <> 78) = 0.090

Gain(Decision, Humidity <> 80) = 0.101, GainRatio(Decision, Humidity <> 80) = 0.107

Gain(Decision, Humidity <> 85) = 0.024, GainRatio(Decision, Humidity <> 85) = 0.027

Gain(Decision, Humidity <> 90) = 0.010, GainRatio(Decision, Humidity <> 90) = 0.016

Gain(Decision, Humidity <> 95) = 0.048, GainRatio(Decision, Humidity <> 95) = 0.128

Here, I ignore the value 96 as a threshold because it is the largest humidity value in the dataset; splitting there would leave the greater-than branch empty.

As seen, the gain is maximized when the threshold equals 80 for humidity. This means that we will compare the binary feature humidity <> 80 against the other nominal attributes when creating a branch in our tree.

The temperature feature is continuous as well. When I apply a binary split to temperature over all possible split points, the following split maximizes both gain and gain ratio.

Gain(Decision, Temperature <> 83) = 0.113, GainRatio(Decision, Temperature<> 83) = 0.305
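Reusing the best_threshold sketch from the humidity section on the temperature column reproduces this split point:

# reuses best_threshold and decisions from the humidity sketch above
temperature = [85, 80, 83, 70, 68, 65, 64, 72, 69, 75, 75, 72, 81, 71]
print(best_threshold(temperature, decisions))  # (83, 0.113...)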

Let's summarize the calculated gains and gain ratios. The outlook attribute has the highest gain, while the temperature split has the highest gain ratio. Which one becomes the root of the decision tree depends on the metric we choose.

Attribute Gain GainRatio
Wind  0.049  0.049
Outlook  0.246  0.155
Humidity <> 80  0.101  0.107
Temperature <> 83  0.113  0.305

If we use the gain metric, then outlook will be the root node because it has the highest gain value. On the other hand, if we use the gain ratio metric, then temperature will be the root node because it has the highest gain ratio value. I prefer to use gain here, similar to ID3. As homework, please try to build a C4.5 decision tree based on the gain ratio metric.

After that, we apply similar steps to ID3 and create the following decision tree. Outlook is put into the root node. Now, we should look at the decisions for the different outlook types.

Outlook = Sunny

We've split humidity into greater than 80, and less than or equal to 80. The decision is always no if humidity is greater than 80 when the outlook is sunny. Similarly, the decision is always yes if humidity is less than or equal to 80 for a sunny outlook.

Day Outlook Temp. Hum. > 80 Wind Decision
1 Sunny 85 Yes Weak No
2 Sunny 80 Yes Strong No
8 Sunny 72 Yes Weak No
9 Sunny 69 No Weak Yes
11 Sunny 75 No Strong Yes

Outlook = Overcast

If the outlook is overcast, then the decision is always yes, no matter what temperature, humidity or wind are.

Day Outlook Temp. Hum. > 80 Wind Decision
3 Overcast 83 No Weak Yes
7 Overcast 64 No Strong Yes
12 Overcast 72 Yes Strong Yes
13 Overcast 81 No Weak Yes

Outlook = Rain

We've filtered the rain outlook instances. As seen, the decision is yes when the wind is weak, and no when the wind is strong.

Day Outlook Temp. Hum. > 80 Wind Decision
4 Rain 70 Yes Weak Yes
5 Rain 68 No Weak Yes
6 Rain 65 No Strong No
10 Rain 75 No Weak Yes
14 Rain 71 No Strong No

The final form of the decision tree is demonstrated below.

Decision tree generated by C4.5
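Reading the branches off the tree above, the decision rules built with the gain metric can be sketched in plain Python; the findDecision name simply mirrors the rule dump in the bonus section below.

def findDecision(Outlook, Temperature, Humidity, Wind):
    # gain-based tree: outlook is the root node; temperature is not used
    if Outlook == 'Overcast':
        return 'Yes'
    elif Outlook == 'Sunny':
        # humidity was discretized with the threshold 80
        return 'Yes' if Humidity <= 80 else 'No'
    elif Outlook == 'Rain':
        return 'Yes' if Wind == 'Weak' else 'No'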

Feature Importance

Decision trees are naturally explainable and interpretable algorithms. Besides, we can extract feature importance values to understand how the model works.

Gradient Boosting Decision Trees

Nowadays, gradient boosting decision trees are very popular in the machine learning community. They are actually not different from the decision tree algorithm mentioned in this blog post. They mainly build sequential decision trees based on the errors of the previous iteration.

Random Forest vs Gradient Boosting

Both random forest and gradient boosting are ensemble approaches rather than core decision tree algorithms themselves. They require a core decision tree algorithm to run, and they build many decision trees in the background. We will discuss how they are similar and how they differ in the following video.

Adaboost vs Gradient Boosting

Both gradient boosting and AdaBoost are boosting techniques for decision-tree-based machine learning models. We will discuss how they are similar and how they differ from each other.

Conclusion

So, the C4.5 algorithm solves most of the problems in ID3. The algorithm uses gain ratios instead of gains. In this way, it creates more generalized trees and is less prone to overfitting. Moreover, the algorithm transforms continuous attributes into nominal ones based on gain maximization, and in this way it can handle continuous data. Additionally, it can ignore instances with missing data and thus handle datasets containing missing values. On the other hand, both ID3 and C4.5 have high CPU and memory demands. Besides, many authorities place decision tree algorithms in the data mining field rather than machine learning.

Bonus

In this post, we have used the gain metric to build a C4.5 decision tree. If we use gain ratio as the decision metric instead, then the built tree looks different; the following decision rules are produced with the gain ratio metric.

def findDecision(Outlook, Temperature, Humidity, Wind):
   # gain-ratio-based tree: temperature is the root node
   if Temperature <= 83:
      if Outlook == 'Rain':
         if Wind == 'Weak':
            return 'Yes'
         elif Wind == 'Strong':
            return 'No'
      elif Outlook == 'Overcast':
         return 'Yes'
      elif Outlook == 'Sunny':
         if Humidity > 65:
            if Wind == 'Strong':
               return 'No'
            elif Wind == 'Weak':
               return 'No'
   elif Temperature > 83:
      return 'No'
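As a quick check, calling findDecision('Sunny', 85, 85, 'Weak') with the rules above falls into the Temperature > 83 branch and returns 'No', which matches day 1 of the dataset.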

I ask you to derive this tree with the gain ratio metric as homework to understand the C4.5 algorithm better. The gain-ratio-based C4.5 decision tree looks like the decision rules shown above.
