Decision trees are still a hot topic in the data science world. ID3 is the most common conventional decision tree algorithm, but it has bottlenecks: attributes must be nominal, the dataset must not include missing data, and the algorithm tends to overfit. Ross Quinlan, the inventor of ID3, made some improvements for these bottlenecks and created a new algorithm named C4.5. The new algorithm creates more generalized models, handles continuous data and can cope with missing data. Additionally, some resources such as Weka name this algorithm J48. It actually refers to a re-implementation of C4.5 release 8.

Vlog
Here, you should watch the following video to understand how decision tree algorithms work. No matter which decision tree algorithm you are running (ID3, C4.5, CART, CHAID or regression trees), they all look for the feature offering the highest information gain. Then, they add a decision rule for that feature and build another decision tree for the sub data set recursively until they reach a decision.

Besides, regular decision tree algorithms are designed to create branches for categorical features. Still, we are able to build trees with continuous, numerical features. The trick is that we convert continuous features into categorical ones: we split the numerical feature at the point that offers the highest information gain.
C4.5 in Python
This blog post explains the C4.5 algorithm in depth and solves a problem step by step. On the other hand, you might just want to run the C4.5 algorithm without digging into its mathematical background.
Herein, you can find a Python implementation of the C4.5 algorithm. You can build C4.5 decision trees with a few lines of code. This package supports the most common decision tree algorithms such as ID3, CART, CHAID and regression trees, as well as some bagging methods such as random forest and some boosting methods such as gradient boosting and AdaBoost.
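As a quick illustration, training a C4.5 tree with that package takes just a few lines. The sketch below is a minimal example; the chefboost import path, the 'algorithm' configuration key and the default 'Decision' target column are assumptions about its interface rather than guarantees.

```python
# Minimal sketch, assuming the chefboost package and a CSV with a 'Decision' target column.
import pandas as pd
from chefboost import Chefboost as chef  # assumed import path

df = pd.read_csv("golf.txt")             # hypothetical file holding the play-tennis dataset
config = {"algorithm": "C4.5"}           # assumed configuration key
model = chef.fit(df, config=config)
```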
Objective
Decision rules will be found based on the entropy and information gain ratio of each feature. At each level of the decision tree, the feature with the maximum gain ratio becomes the decision rule.
Data set
We are going to create a decision tree for the following dataset. It records the factors behind the decision to play tennis outside over the previous 14 days. The dataset might be familiar from the ID3 post. The difference is that the temperature and humidity columns have continuous values instead of nominal ones.
Day | Outlook | Temp. | Humidity | Wind | Decision |
---|---|---|---|---|---|
1 | Sunny | 85 | 85 | Weak | No |
2 | Sunny | 80 | 90 | Strong | No |
3 | Overcast | 83 | 78 | Weak | Yes |
4 | Rain | 70 | 96 | Weak | Yes |
5 | Rain | 68 | 80 | Weak | Yes |
6 | Rain | 65 | 70 | Strong | No |
7 | Overcast | 64 | 65 | Strong | Yes |
8 | Sunny | 72 | 95 | Weak | No |
9 | Sunny | 69 | 70 | Weak | Yes |
10 | Rain | 75 | 80 | Weak | Yes |
11 | Sunny | 75 | 70 | Strong | Yes |
12 | Overcast | 72 | 90 | Strong | Yes |
13 | Overcast | 81 | 75 | Weak | Yes |
14 | Rain | 71 | 80 | Strong | No |
We will follow the same steps as in the ID3 example. First, we need to calculate the global entropy. There are 14 instances; 9 of them refer to a yes decision and 5 of them refer to a no decision.
Entropy(Decision) = ∑ – p(I) . log2p(I) = – p(Yes) . log2p(Yes) – p(No) . log2p(No) = – (9/14) . log2(9/14) – (5/14) . log2(5/14) = 0.940
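To see where 0.940 comes from, here is a quick check of that arithmetic in Python:

```python
import math

p_yes, p_no = 9 / 14, 5 / 14
entropy = -p_yes * math.log2(p_yes) - p_no * math.log2(p_no)
print(round(entropy, 3))  # 0.94
```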
In the ID3 algorithm, we calculated gains for each attribute. Here, we need to calculate gain ratios instead of gains.
GainRatio(A) = Gain(A) / SplitInfo(A)
SplitInfo(A) = -∑ |Dj|/|D| x log2|Dj|/|D|
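These formulas translate directly into a few helper functions. The sketch below is a minimal implementation over rows given as tuples of attribute values plus a separate list of class labels; the function and variable names are illustrative.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def gain(rows, attr_index, labels):
    """Information gain of splitting on the attribute at attr_index."""
    total = len(rows)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attr_index], []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

def split_info(rows, attr_index):
    """SplitInfo: entropy of the partition sizes themselves."""
    total = len(rows)
    counts = Counter(row[attr_index] for row in rows)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def gain_ratio(rows, attr_index, labels):
    return gain(rows, attr_index, labels) / split_info(rows, attr_index)
```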
Wind Attribute
Wind is a nominal attribute. Its possible values are weak and strong.
Gain(Decision, Wind) = Entropy(Decision) – ∑ ( p(Decision|Wind) . Entropy(Decision|Wind) )
Gain(Decision, Wind) = Entropy(Decision) – [ p(Decision|Wind=Weak) . Entropy(Decision|Wind=Weak) ] – [ p(Decision|Wind=Strong) . Entropy(Decision|Wind=Strong) ]
There are 8 weak wind instances. 2 of them are concluded as no, 6 of them are concluded as yes.
Entropy(Decision|Wind=Weak) = – p(No) . log2p(No) – p(Yes) . log2p(Yes) = – (2/8) . log2(2/8) – (6/8) . log2(6/8) = 0.811
Entropy(Decision|Wind=Strong) = – (3/6) . log2(3/6) – (3/6) . log2(3/6) = 1
Gain(Decision, Wind) = 0.940 – (8/14).(0.811) – (6/14).(1) = 0.940 – 0.463 – 0.428 = 0.049
There are 8 instances with weak wind and 6 instances with strong wind.
SplitInfo(Decision, Wind) = -(8/14).log2(8/14) – (6/14).log2(6/14) = 0.461 + 0.524 = 0.985
GainRatio(Decision, Wind) = Gain(Decision, Wind) / SplitInfo(Decision, Wind) = 0.049 / 0.985 = 0.049
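Reusing the helper functions defined above, we can verify these wind figures on the raw dataset, small rounding differences aside (the column order follows the table: outlook, temperature, humidity, wind):

```python
rows = [
    ("Sunny", 85, 85, "Weak"), ("Sunny", 80, 90, "Strong"), ("Overcast", 83, 78, "Weak"),
    ("Rain", 70, 96, "Weak"), ("Rain", 68, 80, "Weak"), ("Rain", 65, 70, "Strong"),
    ("Overcast", 64, 65, "Strong"), ("Sunny", 72, 95, "Weak"), ("Sunny", 69, 70, "Weak"),
    ("Rain", 75, 80, "Weak"), ("Sunny", 75, 70, "Strong"), ("Overcast", 72, 90, "Strong"),
    ("Overcast", 81, 75, "Weak"), ("Rain", 71, 80, "Strong"),
]
labels = ["No", "No", "Yes", "Yes", "Yes", "No", "Yes",
          "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"]

WIND = 3  # index of the wind column
print(round(gain(rows, WIND, labels), 3))        # ~0.048
print(round(split_info(rows, WIND), 3))          # ~0.985
print(round(gain_ratio(rows, WIND, labels), 3))  # ~0.049
```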
Outlook Attribute
Outlook is a nominal attribute, too. Its possible values are sunny, overcast and rain.
Gain(Decision, Outlook) = Entropy(Decision) – ∑ ( p(Decision|Outlook) . Entropy(Decision|Outlook) )
Gain(Decision, Outlook) = Entropy(Decision) – p(Decision|Outlook=Sunny) . Entropy(Decision|Outlook=Sunny) – p(Decision|Outlook=Overcast) . Entropy(Decision|Outlook=Overcast) – p(Decision|Outlook=Rain) . Entropy(Decision|Outlook=Rain)
There are 5 sunny instances. 3 of them are concluded as no, 2 of them are concluded as yes.
Entropy(Decision|Outlook=Sunny) = – p(No) . log2p(No) – p(Yes) . log2p(Yes) = -(3/5).log2(3/5) – (2/5).log2(2/5) = 0.441 + 0.528 = 0.970
Entropy(Decision|Outlook=Overcast) = – p(No) . log2p(No) – p(Yes) . log2p(Yes) = -(0/4).log2(0/4) – (4/4).log2(4/4) = 0
Notice that log2(0) is actually equal to -∞, but we take 0 . log2(0) as 0 because lim (x->0) x . log2(x) = 0. If you wonder about the proof, please look at this post.
Entropy(Decision|Outlook=Rain) = – p(No) . log2p(No) – p(Yes) . log2p(Yes) = -(2/5).log2(2/5) – (3/5).log2(3/5) = 0.528 + 0.441 = 0.970
Gain(Decision, Outlook) = 0.940 – (5/14).(0.970) – (4/14).(0) – (5/14).(0.970) = 0.246
There are 5 instances for sunny, 4 instances for overcast and 5 instances for rain.
SplitInfo(Decision, Outlook) = -(5/14).log2(5/14) -(4/14).log2(4/14) -(5/14).log2(5/14) = 1.577
GainRatio(Decision, Outlook) = Gain(Decision, Outlook)/SplitInfo(Decision, Outlook) = 0.246/1.577 = 0.155
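The same helpers confirm the outlook figures, again up to intermediate rounding:

```python
OUTLOOK = 0  # index of the outlook column
print(round(gain(rows, OUTLOOK, labels), 3))        # ~0.247
print(round(split_info(rows, OUTLOOK), 3))          # ~1.577
print(round(gain_ratio(rows, OUTLOOK, labels), 3))  # ~0.156
```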
Humidity Attribute
Humidity, however, is a continuous attribute. We need to convert its continuous values to nominal ones. C4.5 proposes to perform a binary split based on a threshold value. The threshold should be the value that offers the maximum gain for that attribute. Let’s focus on the humidity attribute. First, we need to sort the humidity values from smallest to largest.
Day | Humidity | Decision |
---|---|---|
7 | 65 | Yes |
6 | 70 | No |
9 | 70 | Yes |
11 | 70 | Yes |
13 | 75 | Yes |
3 | 78 | Yes |
5 | 80 | Yes |
10 | 80 | Yes |
14 | 80 | No |
1 | 85 | No |
2 | 90 | No |
12 | 90 | Yes |
8 | 95 | No |
4 | 96 | Yes |
Now, we need to iterate over all humidity values and separate the dataset into two parts: instances less than or equal to the current value, and instances greater than the current value. We calculate the gain or gain ratio at every step, and the value that maximizes the gain becomes the threshold.
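Before working through the candidates by hand, here is a small sketch of that search, reusing the rows, labels and helpers introduced above; the function name is illustrative.

```python
def best_threshold(rows, attr_index, labels):
    """Try each observed value as a binary split point and keep the one with the highest gain."""
    candidates = sorted({row[attr_index] for row in rows})[:-1]  # skip the maximum: nothing lies above it
    best = None
    for t in candidates:
        # Binarize the continuous attribute into '<= t' and '> t' buckets.
        binarized = [("<=" if row[attr_index] <= t else ">",) for row in rows]
        g = gain(binarized, 0, labels)
        if best is None or g > best[1]:
            best = (t, g, gain_ratio(binarized, 0, labels))
    return best  # (threshold, gain, gain ratio)

print(best_threshold(rows, 2, labels))  # humidity column -> threshold 80 maximizes the gain
```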
Check 65 as a threshold for humidity
Entropy(Decision|Humidity<=65) = – p(No) . log2p(No) – p(Yes) . log2p(Yes) = -(0/1).log2(0/1) – (1/1).log2(1/1) = 0
Entropy(Decision|Humidity>65) = -(5/13).log2(5/13) – (8/13).log2(8/13) =0.530 + 0.431 = 0.961
Gain(Decision, Humidity<> 65) = 0.940 – (1/14).0 – (13/14).(0.961) = 0.048
* The statement above refers to the two branches of the decision tree: humidity less than or equal to 65, and humidity greater than 65. It does not mean that humidity is not equal to 65!
SplitInfo(Decision, Humidity<> 65) = -(1/14).log2(1/14) -(13/14).log2(13/14) = 0.371
GainRatio(Decision, Humidity<> 65) = 0.126
Check 70 as a threshold for humidity
Entropy(Decision|Humidity<=70) = – (1/4).log2(1/4) – (3/4).log2(3/4) = 0.811
Entropy(Decision|Humidity>70) = – (4/10).log2(4/10) – (6/10).log2(6/10) = 0.970
Gain(Decision, Humidity<> 70) = 0.940 – (4/14).(0.811) – (10/14).(0.970) = 0.940 – 0.231 – 0.692 = 0.014
SplitInfo(Decision, Humidity<> 70) = -(4/14).log2(4/14) -(10/14).log2(10/14) = 0.863
GainRatio(Decision, Humidity<> 70) = 0.016
Check 75 as a threshold for humidity
Entropy(Decision|Humidity<=75) = – (1/5).log2(1/5) – (4/5).log2(4/5) = 0.721
Entropy(Decision|Humidity>75) = – (4/9).log2(4/9) – (5/9).log2(5/9) = 0.991
Gain(Decision, Humidity<> 75) = 0.940 – (5/14).(0.721) – (9/14).(0.991) = 0.940 – 0.2575 – 0.637 = 0.045
SplitInfo(Decision, Humidity<> 75) = -(5/14).log2(5/14) -(9/14).log2(9/14) = 0.940
GainRatio(Decision, Humidity<> 75) = 0.047
I think these calculation demonstrations are enough. From now on, I will skip the calculations and report only the results.
Gain(Decision, Humidity <> 78) =0.090, GainRatio(Decision, Humidity <> 78) =0.090
Gain(Decision, Humidity <> 80) = 0.101, GainRatio(Decision, Humidity <> 80) = 0.107
Gain(Decision, Humidity <> 85) = 0.024, GainRatio(Decision, Humidity <> 85) = 0.027
Gain(Decision, Humidity <> 90) = 0.010, GainRatio(Decision, Humidity <> 90) = 0.016
Gain(Decision, Humidity <> 95) = 0.048, GainRatio(Decision, Humidity <> 95) = 0.128
Here, I ignore the value 96 as a threshold because no instance has a humidity greater than 96, so that split would leave one branch empty.
As seen, the gain is maximized when the threshold is equal to 80 for humidity. This means that the humidity-at-80 split competes with the nominal attributes when we create a branch in our tree.
The temperature feature is continuous as well. When I apply a binary split to temperature over all possible split points, the following split maximizes both the gain and the gain ratio.
Gain(Decision, Temperature <> 83) = 0.113, GainRatio(Decision, Temperature<> 83) = 0.305
Let’s summarize the calculated gains and gain ratios. The outlook attribute has the highest gain, whereas temperature has the highest gain ratio.
Attribute | Gain | GainRatio |
---|---|---|
Wind | 0.049 | 0.049 |
Outlook | 0.246 | 0.155 |
Humidity <> 80 | 0.101 | 0.107 |
Temperature <> 83 | 0.113 | 0.305 |
If we use the gain metric, then outlook will be the root node because it has the highest gain value. On the other hand, if we use the gain ratio metric, then temperature will be the root node because it has the highest gain ratio value. I prefer to use gain here, just as in ID3. As homework, please try to build a C4.5 decision tree based on the gain ratio metric.
After that, we apply similar steps to ID3 and create the following decision tree. Outlook is put into the root node. Now, we should look at the decisions for the different outlook types.
Outlook = Sunny
We’ve split humidity into greater than 80, and less than or equal to 80. The decision is no if humidity is greater than 80 when the outlook is sunny. Similarly, the decision is yes if humidity is less than or equal to 80 for a sunny outlook.
Day | Outlook | Temp. | Hum. > 80 | Wind | Decision |
---|---|---|---|---|---|
1 | Sunny | 85 | Yes | Weak | No |
2 | Sunny | 80 | Yes | Strong | No |
8 | Sunny | 72 | Yes | Weak | No |
9 | Sunny | 69 | No | Weak | Yes |
11 | Sunny | 75 | No | Strong | Yes |
Outlook = Overcast
If the outlook is overcast, then no matter what the temperature, humidity or wind are, the decision will always be yes.
Day | Outlook | Temp. | Hum. > 80 | Wind | Decision |
---|---|---|---|---|---|
3 | Overcast | 83 | No | Weak | Yes |
7 | Overcast | 64 | No | Strong | Yes |
12 | Overcast | 72 | Yes | Strong | Yes |
13 | Overcast | 81 | No | Weak | Yes |
Outlook = Rain
We’ve just filtered the rain outlook instances. As seen, the decision is yes when the wind is weak, and no when the wind is strong.
Day | Outlook | Temp. | Hum. > 80 | Wind | Decision |
---|---|---|---|---|---|
4 | Rain | 70 | Yes | Weak | Yes |
5 | Rain | 68 | No | Weak | Yes |
6 | Rain | 65 | No | Strong | No |
10 | Rain | 75 | No | Weak | Yes |
14 | Rain | 71 | No | Strong | No |
The final form of the decision tree is demonstrated below.
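The same tree can also be written as nested rules; this is a direct transcription of the branches derived above (temperature does not appear in this gain-based tree):

```python
def find_decision(outlook, temperature, humidity, wind):
    """Gain-based C4.5 tree derived step by step in this post."""
    if outlook == "Overcast":
        return "Yes"
    if outlook == "Sunny":
        return "No" if humidity > 80 else "Yes"
    if outlook == "Rain":
        return "Yes" if wind == "Weak" else "No"

print(find_decision("Sunny", 85, 85, "Weak"))  # 'No', matching day 1
```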

Feature Importance
Decision trees are naturally explainable and interpretable algorithms. Besides, we can extract feature importance values to understand how the model works.
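One common way to obtain such values, shown here as a hedged sketch rather than how any particular library computes them, is to credit each feature with the entropy reduction of the splits it appears in, weighted by the fraction of instances reaching that node. For the small tree above this gives:

```python
# Hedged sketch: gain-weighted feature importance for the tree built in this post.
# The entropy reductions come from the calculations above; weights are node sizes over 14.
raw_importance = {
    "Outlook": (14 / 14) * 0.247,   # root split over all 14 instances
    "Humidity": (5 / 14) * 0.970,   # the sunny branch (entropy 0.970) becomes pure after this split
    "Wind": (5 / 14) * 0.970,       # the rain branch (entropy 0.970) becomes pure after this split
}
total = sum(raw_importance.values())
print({feature: round(value / total, 2) for feature, value in raw_importance.items()})
```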
Gradient Boosting Decision Trees
Nowadays, gradient boosting decision trees are very popular in the machine learning community. They are not actually different from the decision tree algorithm mentioned in this blog post: they mainly build sequential decision trees based on the errors of the previous round.
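To make the "errors of the previous round" idea concrete, here is a toy regression sketch (an illustration only, not the algorithm covered in this post) in which each new tree is fit to the residuals of the running prediction; scikit-learn's DecisionTreeRegressor is used purely as the base learner.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost(X, y, n_trees=50, learning_rate=0.1, max_depth=2):
    """Toy gradient boosting for squared loss: fit every new tree to the current residuals."""
    prediction = np.full(len(y), y.mean())  # start from the mean prediction
    trees = []
    for _ in range(n_trees):
        residuals = y - prediction                                 # errors of the previous round
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals)
        prediction += learning_rate * tree.predict(X)              # nudge the prediction toward the target
        trees.append(tree)
    return y.mean(), trees
```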
Random Forest vs Gradient Boosting
Both random forest and gradient boosting are ensemble approaches rather than core decision tree algorithms themselves. They require a core decision tree algorithm to run, and they build many decision trees in the background. We discuss how they are similar and how they differ in the following video.
Adaboost vs Gradient Boosting
Both gradient boosting and AdaBoost are boosting techniques for decision tree based machine learning models. We discuss how they are similar and how they differ from each other.
Conclusion
So, the C4.5 algorithm solves most of the problems in ID3. It uses gain ratios instead of gains; in this way, it creates more generalized trees and is less prone to overfitting. Moreover, it transforms continuous attributes into nominal ones based on gain maximization, so it can handle continuous data. Additionally, it can ignore instances with missing values and thus handle incomplete datasets. On the other hand, both ID3 and C4.5 have high CPU and memory demands. Besides, many authorities place decision tree algorithms in the data mining field rather than in machine learning.
Bonus
In this post, we have used the gain metric to build a C4.5 decision tree. If we use gain ratio as the decision metric instead, the resulting decision tree looks different.
```python
def findDecision(Outlook, Temperature, Humidity, Wind):
    if Temperature <= 83:
        if Outlook == 'Rain':
            if Wind == 'Weak':
                return 'Yes'
            elif Wind == 'Strong':
                return 'No'
        elif Outlook == 'Overcast':
            return 'Yes'
        elif Outlook == 'Sunny':
            if Humidity > 65:
                if Wind == 'Strong':
                    return 'No'
                elif Wind == 'Weak':
                    return 'No'
    elif Temperature > 83:
        return 'No'
```
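For instance, feeding day 1 of the dataset into these rules:

```python
print(findDecision('Sunny', 85, 85, 'Weak'))  # 'No': temperature 85 > 83, matching day 1's decision
```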
As homework, I ask you to apply the gain ratio metric yourself to better understand the C4.5 algorithm; the gain ratio based C4.5 decision tree looks like the decision rules above.
Note that if the gain ratio were the same for all attributes, you could select any one of them as the root node.