Decision trees are still a hot topic in the data science world. ID3 is the most common conventional decision tree algorithm, but it has bottlenecks: attributes must be nominal, the data set must not include missing values, and the algorithm tends to overfit. Ross Quinlan, the inventor of ID3, addressed these bottlenecks and created a new algorithm named C4.5. The new algorithm can build more generalized models, handle continuous data, and cope with missing values. Additionally, some resources such as Weka refer to this algorithm as J48; that name actually refers to a re-implementation of C4.5 release 8.
Vlog
You should watch the following video to understand how decision tree algorithms work. No matter which decision tree algorithm you run (ID3, C4.5, CART, CHAID or regression trees), they all look for the feature offering the highest information gain, add a decision rule for that feature, and recursively build another decision tree for each sub data set until they reach a decision.
Besides, regular decision tree algorithms are designed to create branches for categorical features. Still, we can build trees with continuous, numerical features. The trick is to convert a continuous feature into a categorical one: we split the numerical feature at the point that offers the highest information gain.
C4.5 in Python
This blog post explains the C4.5 algorithm in depth, and we will solve a problem step by step. On the other hand, you might just want to run the C4.5 algorithm without diving into its mathematical background.
In that case, you can find a Python implementation of the C4.5 algorithm here. You can build C4.5 decision trees with a few lines of code. The package supports the most common decision tree algorithms such as ID3, CART, CHAID and regression trees, as well as bagging methods such as random forest and boosting methods such as gradient boosting and AdaBoost.
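For instance, a minimal sketch of calling the package might look like the code below. The exact function signatures are my assumption based on the repository's README and may differ between versions, and golf2.txt is assumed to be a local copy of the data set shipped in the repo.

```python
# Minimal sketch (not a definitive reference) of building a C4.5 tree with chefboost.
# Assumptions: chefboost is installed, golf2.txt is a local copy of the repo's data set,
# and the target column is named 'Decision'.
import pandas as pd
from chefboost import Chefboost as chef

df = pd.read_csv("golf2.txt")
config = {"algorithm": "C4.5"}       # the package also supports ID3, CART, CHAID, Regression
model = chef.fit(df, config=config)  # builds the tree and exports the decision rules

# predict a single instance: Outlook, Temp., Humidity, Wind
print(chef.predict(model, ["Sunny", 85, 85, "Weak"]))
```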
Objective
Decision rules will be found based on the entropy and information gain ratio of each feature. At each level of the decision tree, the feature with the maximum gain ratio becomes the decision rule.
Data set
We are going to build a decision tree for the following data set. It lists the factors used to decide whether to play tennis outside over the previous 14 days. The data set might be familiar from the ID3 post; the difference is that the temperature and humidity columns contain continuous values instead of nominal ones.
Day | Outlook | Temp. | Humidity | Wind | Decision |
---|---|---|---|---|---|
1 | Sunny | 85 | 85 | Weak | No |
2 | Sunny | 80 | 90 | Strong | No |
3 | Overcast | 83 | 78 | Weak | Yes |
4 | Rain | 70 | 96 | Weak | Yes |
5 | Rain | 68 | 80 | Weak | Yes |
6 | Rain | 65 | 70 | Strong | No |
7 | Overcast | 64 | 65 | Strong | Yes |
8 | Sunny | 72 | 95 | Weak | No |
9 | Sunny | 69 | 70 | Weak | Yes |
10 | Rain | 75 | 80 | Weak | Yes |
11 | Sunny | 75 | 70 | Strong | Yes |
12 | Overcast | 72 | 90 | Strong | Yes |
13 | Overcast | 81 | 75 | Weak | Yes |
14 | Rain | 71 | 80 | Strong | No |
We will follow the same steps as in the ID3 example. First, we need to calculate the global entropy. There are 14 examples: 9 instances with a yes decision and 5 instances with a no decision.
Entropy(Decision) = ∑ – p(I) . log2p(I) = – p(Yes) . log2p(Yes) – p(No) . log2p(No) = – (9/14) . log2(9/14) – (5/14) . log2(5/14) = 0.940
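As a quick sanity check, here is a minimal helper (the function name is mine, not part of any library) that computes this entropy from raw class counts:

```python
import math

def entropy(class_counts):
    """Shannon entropy (base 2) of a class distribution given as raw counts."""
    total = sum(class_counts)
    result = 0.0
    for count in class_counts:
        if count == 0:
            continue  # lim x->0 of x*log2(x) is 0
        p = count / total
        result -= p * math.log2(p)
    return result

print(round(entropy([9, 5]), 3))  # global entropy of the 14 decisions -> 0.94
```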
In the ID3 algorithm, we calculated the gain for each attribute. Here, we need to calculate gain ratios instead of gains.
GainRatio(A) = Gain(A) / SplitInfo(A)
SplitInfo(A) = -∑ |Dj|/|D| x log2|Dj|/|D|
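Reusing the entropy helper sketched above, gain, split info and gain ratio can be computed from class counts as follows (again, the function names are mine, just for illustration):

```python
def information_gain(parent_counts, partitions):
    """Gain = Entropy(parent) - weighted sum of branch entropies.
    partitions is a list of class-count lists, one per branch of the candidate split."""
    total = sum(parent_counts)
    weighted = sum(sum(branch) / total * entropy(branch) for branch in partitions)
    return entropy(parent_counts) - weighted

def split_info(partitions):
    """SplitInfo = -sum(|Dj|/|D| * log2(|Dj|/|D|)), i.e. entropy over the branch sizes."""
    return entropy([sum(branch) for branch in partitions])

def gain_ratio(parent_counts, partitions):
    si = split_info(partitions)
    return information_gain(parent_counts, partitions) / si if si > 0 else 0.0
```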
Wind Attribute
Wind is a nominal attribute. Its possible values are weak and strong.
Gain(Decision, Wind) = Entropy(Decision) – ∑ ( p(Decision|Wind) . Entropy(Decision|Wind) )
Gain(Decision, Wind) = Entropy(Decision) – [ p(Decision|Wind=Weak) . Entropy(Decision|Wind=Weak) ] – [ p(Decision|Wind=Strong) . Entropy(Decision|Wind=Strong) ]
There are 8 weak wind instances. 2 of them are concluded as no, 6 of them are concluded as yes.
Entropy(Decision|Wind=Weak) = – p(No) . log2p(No) – p(Yes) . log2p(Yes) = – (2/8) . log2(2/8) – (6/8) . log2(6/8) = 0.811
Entropy(Decision|Wind=Strong) = – (3/6) . log2(3/6) – (3/6) . log2(3/6) = 1
Gain(Decision, Wind) = 0.940 – (8/14).(0.811) – (6/14).(1) = 0.940 – 0.463 – 0.429 = 0.048
There are 8 instances with weak wind and 6 instances with strong wind.
SplitInfo(Decision, Wind) = -(8/14).log2(8/14) – (6/14).log2(6/14) = 0.461 + 0.524 = 0.985
GainRatio(Decision, Wind) = Gain(Decision, Wind) / SplitInfo(Decision, Wind) = 0.048 / 0.985 = 0.049
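Plugging the wind counts into the helpers sketched above reproduces these figures:

```python
# weak wind: 6 yes / 2 no, strong wind: 3 yes / 3 no; overall: 9 yes / 5 no
wind_partitions = [[6, 2], [3, 3]]
print(round(information_gain([9, 5], wind_partitions), 3))  # -> 0.048
print(round(gain_ratio([9, 5], wind_partitions), 3))        # -> 0.049
```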
Outlook Attribute
Outlook is a nominal attribute, too. Its possible values are sunny, overcast and rain.
Gain(Decision, Outlook) = Entropy(Decision) – ∑ ( p(Decision|Outlook) . Entropy(Decision|Outlook) )
Gain(Decision, Outlook) = Entropy(Decision) – p(Decision|Outlook=Sunny) . Entropy(Decision|Outlook=Sunny) – p(Decision|Outlook=Overcast) . Entropy(Decision|Outlook=Overcast) – p(Decision|Outlook=Rain) . Entropy(Decision|Outlook=Rain)
There are 5 sunny instances. 3 of them are concluded as no, 2 of them are concluded as yes.
Entropy(Decision|Outlook=Sunny) = – p(No) . log2p(No) – p(Yes) . log2p(Yes) = -(3/5).log2(3/5) – (2/5).log2(2/5) = 0.441 + 0.528 = 0.970
Entropy(Decision|Outlook=Overcast) = – p(No) . log2p(No) – p(Yes) . log2p(Yes) = -(0/4).log2(0/4) – (4/4).log2(4/4) = 0
Notice that log2(0) is actually equal to -∞, but here we treat the term 0 . log2(0) as 0 because lim (x->0) x . log2(x) = 0. If you wonder about the proof, please look at this post.
Entropy(Decision|Outlook=Rain) = – p(No) . log2p(No) – p(Yes) . log2p(Yes) = -(2/5).log2(2/5) – (3/5).log2(3/5) = 0.528 + 0.441 = 0.970
Gain(Decision, Outlook) = 0.940 – (5/14).(0.970) – (4/14).(0) – (5/14).(0.970) = 0.246
There are 5 instances for sunny, 4 instances for overcast and 5 instances for rain.
SplitInfo(Decision, Outlook) = -(5/14).log2(5/14) -(4/14).log2(4/14) -(5/14).log2(5/14) = 1.577
GainRatio(Decision, Outlook) = Gain(Decision, Outlook)/SplitInfo(Decision, Outlook) = 0.246/1.577 = 0.155
Humidity Attribute
Humidity, however, is a continuous attribute. We need to convert its continuous values into nominal ones. C4.5 proposes performing a binary split based on a threshold value; the threshold should be the value that offers the maximum gain for that attribute. Let's focus on the humidity attribute. First, we need to sort the humidity values from smallest to largest.
Day | Humidity | Decision |
---|---|---|
7 | 65 | Yes |
6 | 70 | No |
9 | 70 | Yes |
11 | 70 | Yes |
13 | 75 | Yes |
3 | 78 | Yes |
5 | 80 | Yes |
10 | 80 | Yes |
14 | 80 | No |
1 | 85 | No |
2 | 90 | No |
12 | 90 | Yes |
8 | 95 | No |
4 | 96 | Yes |
Now, we need to iterate over all humidity values and separate the data set into two parts: instances less than or equal to the current value, and instances greater than the current value. We would calculate the gain or gain ratio for every step; the value which maximizes the gain would be the threshold.
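A short sketch of this threshold search, reusing the helpers above (the counting helper is my own, just for illustration):

```python
# (humidity, decision) pairs from the sorted table above
rows = [(65, "Yes"), (70, "No"), (70, "Yes"), (70, "Yes"), (75, "Yes"), (78, "Yes"),
        (80, "Yes"), (80, "Yes"), (80, "No"), (85, "No"), (90, "No"), (90, "Yes"),
        (95, "No"), (96, "Yes")]

def yes_no_counts(subset):
    return [sum(1 for _, d in subset if d == "Yes"), sum(1 for _, d in subset if d == "No")]

best_threshold, best_gain = None, -1.0
for threshold in sorted({h for h, _ in rows})[:-1]:  # skip 96, the largest value
    left = [r for r in rows if r[0] <= threshold]
    right = [r for r in rows if r[0] > threshold]
    gain = information_gain(yes_no_counts(rows), [yes_no_counts(left), yes_no_counts(right)])
    print(threshold, round(gain, 3))
    if gain > best_gain:
        best_threshold, best_gain = threshold, gain

print("best threshold:", best_threshold)  # -> 80, with gain ~0.101
```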
Check 65 as a threshold for humidity
Entropy(Decision|Humidity<=65) = – p(No) . log2p(No) – p(Yes) . log2p(Yes) = -(0/1).log2(0/1) – (1/1).log2(1/1) = 0
Entropy(Decision|Humidity>65) = -(5/13).log2(5/13) – (8/13).log2(8/13) =0.530 + 0.431 = 0.961
Gain(Decision, Humidity<> 65) = 0.940 – (1/14).0 – (13/14).(0.961) = 0.048
* The notation above refers to the branches of the decision tree for humidity less than or equal to 65 and greater than 65. It does not mean that humidity is not equal to 65!
SplitInfo(Decision, Humidity<> 65) = -(1/14).log2(1/14) -(13/14).log2(13/14) = 0.371
GainRatio(Decision, Humidity<> 65) = 0.126
Check 70 as a threshold for humidity
Entropy(Decision|Humidity<=70) = – (1/4).log2(1/4) – (3/4).log2(3/4) = 0.811
Entropy(Decision|Humidity>70) = – (4/10).log2(4/10) – (6/10).log2(6/10) = 0.970
Gain(Decision, Humidity<> 70) = 0.940 – (4/14).(0.811) – (10/14).(0.970) = 0.940 – 0.231 – 0.692 = 0.014
SplitInfo(Decision, Humidity<> 70) = -(4/14).log2(4/14) -(10/14).log2(10/14) = 0.863
GainRatio(Decision, Humidity<> 70) = 0.016
Check 75 as a threshold for humidity
Entropy(Decision|Humidity<=75) = – (1/5).log2(1/5) – (4/5).log2(4/5) = 0.721
Entropy(Decision|Humidity>75) = – (4/9).log2(4/9) – (5/9).log2(5/9) = 0.991
Gain(Decision, Humidity<> 75) = 0.940 – (5/14).(0.721) – (9/14).(0.991) = 0.940 – 0.2575 – 0.637 = 0.045
SplitInfo(Decision, Humidity<> 75) = -(5/14).log2(5/14) -(9/14).log2(9/14) = 0.940
GainRatio(Decision, Humidity<> 75) = 0.047
I think these demonstrations are enough. From now on, I will skip the calculations and report only the results.
Gain(Decision, Humidity <> 78) =0.090, GainRatio(Decision, Humidity <> 78) =0.090
Gain(Decision, Humidity <> 80) = 0.101, GainRatio(Decision, Humidity <> 80) = 0.107
Gain(Decision, Humidity <> 85) = 0.024, GainRatio(Decision, Humidity <> 85) = 0.027
Gain(Decision, Humidity <> 90) = 0.010, GainRatio(Decision, Humidity <> 90) = 0.016
Gain(Decision, Humidity <> 95) = 0.048, GainRatio(Decision, Humidity <> 95) = 0.128
Here, I ignore 96 as a threshold because it is the largest humidity value in the data set; no instance would fall into the greater-than branch.
As seen, gain is maximized when the threshold is equal to 80 for humidity. This means that we will compare the humidity <> 80 split against the nominal attributes when creating a branch in our tree.
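In other words, humidity now behaves like a binary nominal feature. A tiny pandas sketch of that discretization (only a few illustrative rows shown):

```python
import pandas as pd

# a few illustrative rows; in practice this would be the full 14-instance data set
df = pd.DataFrame({"Day": [1, 3, 4, 9], "Humidity": [85, 78, 96, 70]})
df["Humidity>80"] = df["Humidity"].apply(lambda h: "Yes" if h > 80 else "No")
print(df)  # days 1 and 4 fall into the 'Yes' branch, days 3 and 9 into 'No'
```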
The temperature feature is continuous as well. When I apply a binary split to temperature at all possible split points, the following split maximizes both gain and gain ratio.
Gain(Decision, Temperature <> 83) = 0.113, GainRatio(Decision, Temperature<> 83) = 0.305
Let's summarize the calculated gains and gain ratios. The outlook attribute comes with the maximum gain, whereas temperature <> 83 comes with the maximum gain ratio.
Attribute | Gain | GainRatio |
---|---|---|
Wind | 0.048 | 0.049 |
Outlook | 0.246 | 0.155 |
Humidity <> 80 | 0.101 | 0.107 |
Temperature <> 83 | 0.113 | 0.305 |
If we use the gain metric, then outlook will be the root node because it has the highest gain value. On the other hand, if we use the gain ratio metric, then temperature will be the root node because it has the highest gain ratio value. I prefer to use gain here, as in ID3. As homework, please try to build a C4.5 decision tree based on the gain ratio metric.
After that, we apply similar steps to ID3 and create the following decision tree. Outlook is placed at the root node. Now, we should look at the decisions for the different outlook values.
Outlook = Sunny
We've split humidity into greater than 80 and less than or equal to 80. Notice that the decision is always no if humidity is greater than 80 when outlook is sunny. Similarly, the decision is always yes if humidity is less than or equal to 80 for a sunny outlook.
Day | Outlook | Temp. | Hum. > 80 | Wind | Decision |
---|---|---|---|---|---|
1 | Sunny | 85 | Yes | Weak | No |
2 | Sunny | 80 | Yes | Strong | No |
8 | Sunny | 72 | Yes | Weak | No |
9 | Sunny | 69 | No | Weak | Yes |
11 | Sunny | 75 | No | Strong | Yes |
Outlook = Overcast
If outlook is overcast, then no matter what temperature, humidity or wind are, the decision will always be yes.
Day | Outlook | Temp. | Hum. > 80 | Wind | Decision |
---|---|---|---|---|---|
3 | Overcast | 83 | No | Weak | Yes |
7 | Overcast | 64 | No | Strong | Yes |
12 | Overcast | 72 | Yes | Strong | Yes |
13 | Overcast | 81 | No | Weak | Yes |
Outlook = Rain
We've filtered the rain outlook instances. As seen, the decision is yes when wind is weak and no when wind is strong.
Day | Outlook | Temp. | Hum. > 80 | Wind | Decision |
---|---|---|---|---|---|
4 | Rain | 70 | Yes | Weak | Yes |
5 | Rain | 68 | No | Weak | Yes |
6 | Rain | 65 | No | Strong | No |
10 | Rain | 75 | No | Weak | Yes |
14 | Rain | 71 | No | Strong | No |
The final form of the decision tree is shown below.
Feature Importance
Decision trees are naturally explainable and interpretable algorithms. Besides, we can compute feature importance values to understand how the model works.
Gradient Boosting Decision Trees
Nowadays, gradient boosting decision trees are very popular in the machine learning community. They are not fundamentally different from the decision tree algorithm mentioned in this blog post; they mainly build sequential decision trees based on the errors of the previous iteration.
Random Forest vs Gradient Boosting
Both random forest and gradient boosting are ensemble approaches rather than core decision tree algorithms themselves. They require a core decision tree algorithm to run, and they build many decision trees in the background. We discuss how they are similar and how they differ in the following video.
Adaboost vs Gradient Boosting
Both gradient boosting and AdaBoost are boosting techniques for decision tree based machine learning models. We discuss how they are similar and how they differ from each other.
Conclusion
So, the C4.5 algorithm solves most of the problems in ID3. The algorithm uses gain ratios instead of gains; in this way, it creates more generalized trees and is less prone to overfitting. Moreover, the algorithm transforms continuous attributes into nominal ones based on gain maximization, so it can handle continuous data. Additionally, it can ignore instances with missing data and thus handle incomplete data sets. On the other hand, both ID3 and C4.5 have high CPU and memory demands. Besides, many authorities place decision tree algorithms in the data mining field rather than machine learning.
Bonus
In this post, we have used the gain metric to build a C4.5 decision tree. If we use gain ratio as the decision metric instead, the built tree looks different, as shown below.
    def findDecision(Outlook, Temperature, Humidity, Wind):
        if Temperature <= 83:
            if Outlook == 'Rain':
                if Wind == 'Weak':
                    return 'Yes'
                elif Wind == 'Strong':
                    return 'No'
            elif Outlook == 'Overcast':
                return 'Yes'
            elif Outlook == 'Sunny':
                if Humidity > 65:
                    if Wind == 'Strong':
                        return 'No'
                    elif Wind == 'Weak':
                        return 'No'
        elif Temperature > 83:
            return 'No'
As homework, I ask you to rebuild the tree with the gain ratio metric yourself to understand the C4.5 algorithm better; your result should match the decision rules above.
Comments

GainRatio(Decision, Humidity <> 65) = 0.126
GainRatio(Decision, Humidity <> 80) = 0.107
How come you took a threshold of 80 into account when the gain ratio for 65 is higher?
In the case of Humidity <= 80, there are 2 no and 7 yes decisions; the total number of instances is 9.
Entropy(Decision|Humidity<=80) = – p(No) . log2p(No) – p(Yes) . log2p(Yes) = – (2/9) . log2(2/9) – (7/9) . log2(7/9) = 0.764
In the case of Humidity > 80, there are 3 no and 2 yes decisions; the total number of instances is 5.
Entropy(Decision|Humidity>80) = – p(No) . log2p(No) – p(Yes) . log2p(Yes) = – (3/5) . log2(3/5) – (2/5) . log2(2/5) = 0.971
The global entropy was calculated as 0.940 in the previous steps.
Now, it is time to calculate Gain.
Gain(Decision, Humidity<> 80) = Entropy(Decision) – p(Humidity<=80) * Entropy(Decision|Humidity<=80) - p(Humidity>80)*Entropy(Decision|Humidity>80)
Gain(Decision, Humidity<> 80) = 0.940 – (9/14)*0.764 – (5/14)*0.971 = 0.101
Now, we can calculate GainRatio but before we need to calculate SplitInfo first.
SplitInfo(Decision, Humidity<> 80) = -(9/14)*log2(9/14) – (5/14)*log2(5/14) = 0.940
GainRatio = Gain / SplitInfo = 0.101 / 0.940 = 0.107
I hope this explanation is understandable.
Understood, thanks
I don't understand why we choose 80 when its gain ratio is only 0.107, while for 65 it is 0.126, and 0.107 < 0.126?
If we used gain ratio, you would be right: Humidity <> 65 (0.126) is more dominant than Humidity <> 80 (0.107).
But I use information gain instead of gain ratio here. In this case, Humidity <> 80 (0.101) is more dominant than Humidity <> 65 (0.048).
If you ask why information gain is used instead of gain ratio, it is simply a choice. You might prefer gain ratio, and in that case 65 would be your threshold.
Why is humidity the best node if outlook = sunny? I don't get it.
You have the following
Gain(Decision, Humidity <> 70) = 0.940 – (4/14).(0.811) – (10/14).(0.970) = 0.940 – 0.231 – 0.692 = 0.014
there are 3 values of 70, so surely P(Humidity <> 70) = P(No) = 3/14, not 4/14
Here, 4 is the number of instances which are less than or equal to 70; the humidity values of days 6, 7, 9 and 11 are less than or equal to 70.
Similarly, 10 is the number of instances which are greater than 70.
I’m sorry I posted the wrong reference. It should have been
Gain(Decision, Humidity <> 70) = 0.940 – (4/14).(0.811) – (10/14).(0.970) = 0.940 – 0.231 – 0.692 = 0.014
so,
there are 3 values of 70, so surely P(Humidity <> 70) = P(No) = 3/14, not 4/14
thanks for your attention
You would be right if we needed Gain(Decision, Humidity = 70), but we need Gain(Decision, Humidity <> 70), i.e. the binary split into less than or equal to 70 and greater than 70. Gain(Decision, Humidity <> 70) = 0.940 – (4/14).(0.811) – (10/14).(0.970) = 0.940 – 0.231 – 0.692 = 0.014. In this equation, 4/14 is the proportion of instances less than or equal to 70, and 10/14 is the proportion of instances greater than 70.
I do understand that, but the statement reads ‘not equal to 70’, not ‘less than or equal’. If the objective is to split into values less than or equal and values greater than, then the calculation covers all values, surely.
Right, that is due to poor communication on my part. I did not intend the ‘not equal’ reading; I should clarify that in the post.
The statement Gain(Decision, Humidity <> 70) refers to what the branches of the decision tree would be for less than or equal to 70 and greater than 70. All calculations were made with this approach.
I hope everything is clear now. Thank you for your attention.
OK, now I get it. Thanks a lot both for your blog, attention and patience. It is much appreciated.
It looks as if the editor loses the not-equal sign, hence the poor communication.
The expression in question is
Gain(Decision, Humidity¬= 70) = 0.940 – (4/14).(0.811) – (10/14).(0.970) = 0.940 – 0.231 – 0.692 = 0.014
There being 3 instances of 70 so 3 instances where P(Humidity¬=70) is false, i.e. P(No) = 3/14
I’m sorry to have to return to the point made by Dinca Andrei but I think the confusion arises from a statement in your blog
The value which maximizes the gain would be the threshold.
Is it not the case that the threshold is the value that minimizes the weighted Entropy(Decision|Humidity <> threshold)?
Here are some calculations which, taken together with the ones you performed, do show 80 as the splitting point: the weighted Entropy(Decision|Humidity <> 80) is the smallest value.
Threshold | Entropy(<=) | Entropy(>) | Weighted Entropy | Gain | SplitInfo | GainRatio |
---|---|---|---|---|---|---|
65 | 0 | 0.961237 | 0.892577 | 0.047423 | 0.371232 | 0.127745 |
78 | 0.650022 | 1 | 0.850010 | 0.089990 | 0.985228 | 0.091340 |
80 | 0.764205 | 0.970951 | 0.838042 | 0.101958 | 0.940286 | 0.108433 |
85 | 0.881291 | 1 | 0.915208 | 0.024792 | 0.863121 | 0.028724 |
Thanks for your attention.
Yes, you are absolutely right. I summarized gain and gain ratios for every possible threshold point.
Threshold | Gain | GainRatio |
---|---|---|
65 | 0.048 | 0.126 |
70 | 0.014 | 0.016 |
75 | 0.045 | 0.047 |
78 | 0.090 | 0.090 |
80 | 0.101 | 0.107 |
85 | 0.024 | 0.027 |
90 | 0.010 | 0.016 |
I stated that “we would calculate the gain or gain ratio for every step; the value which maximizes the gain would be the threshold”. Now, it is up to you to decide the threshold point based on gain or gain ratio. I prefer to use gain, so my threshold is 80. If you prefer the gain ratio metric, your threshold would be 65. Both approaches are correct.
Hi, is this algorithm suitable for a large database, like a data set from a large manufacturer?
Thank you
Decision tree algorithms have a high memory demand. You should look at their extended version, random forests; that might be better adapted to your problem.
Why didn't we calculate the gain ratio of temperature here?
It’s all up to you. You can use either information gain or gain ratio. In this case, using information gain is my choice.
If we use information gain, then what's the difference between ID3 and C4.5?
If you use information gain, then you will have an ID3 decision tree.
Why did I get an error when I tried your chefboost from GitHub?
File “Chefboost.py”, line 81
print(header,end=”)
^
SyntaxError: invalid syntax
How can I solve this? Thanks.
What is your Python environment? This works in Python 3.X.
Ah sorry, I used Python 2.X.
I have another question: so, in the end, the ‘num_of_trees’ variable is not used? I thought we could use that variable to limit the number of trees.
Sorry if I am misreading the code.
The num_of_trees variable is used in random forest. Regular tree algorithms such as ID3 or C4.5 create a single tree.
Hi
Thank you for this excellent description. I just have a question: how can I calculate the accuracy on the training data?
thank you
When you run a decision tree algorithm, it builds decision rules. For example, when I run the C4.5 algorithm on the data set https://github.com/serengil/chefboost/blob/master/dataset/golf2.txt , the following rules are created.
Now, I check all instances in the same data set. Let’s focus on the first instance.
The decision rules say that the prediction is no because the temperature is 85, which is greater than 83. On the other hand, the decision column holds the actual value, and it is no, too, so this instance is classified correctly. I apply the same procedure to all instances and check how many of them are classified correctly. Dividing the number of correctly classified instances by the total number of instances gives the accuracy on the training set.
Similarly, applying these decision rules to a data set that was not used for training gives you the test accuracy.
I strongly recommend running this Python code to understand it clearly: https://github.com/serengil/chefboost
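For instance, a rough sketch of that training-accuracy loop might look like the code below. The chef.fit and chef.predict calls reflect the repository's README as far as I recall, so treat the exact signatures as assumptions.

```python
import pandas as pd
from chefboost import Chefboost as chef

df = pd.read_csv("golf2.txt")  # local copy of the data set linked above
model = chef.fit(df.copy(), config={"algorithm": "C4.5"})

correct = 0
for _, row in df.iterrows():
    features = row.drop("Decision").tolist()   # Outlook, Temp., Humidity, Wind
    if chef.predict(model, features) == row["Decision"]:
        correct += 1

print("training accuracy:", correct / len(df))
```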
Sir, I need a clarification: when should information gain be used and when should gain ratio be used?
ID3 uses information gain whereas C4.5 uses gain ratio in most cases.
I drew a decision tree using info gain and it came out perfectly, with more branches. When I drew a decision tree using gain ratio, the tree was pruned and I could not split it further, because I got gain ratio values either greater than 1.0 or negative. While splitting, one branch I got was [0, 0, 2] (classified as class III) and another was [0, 49, 3] (not fully classified; can I finalize that branch as class II?). Is it correct to stop abruptly and label class II as a leaf node?
Sorry, but I do not understand your case. Please explain it more clearly.
For a particular data set I got different decision trees for ID3 and C4.5 (the outputs of the two trees are not the same).
While using C4.5, I calculated the gain ratio and drew the decision tree with it, but suddenly I started to get gain ratio values either greater than 1.0 or negative. Can I stop growing the decision tree when I get gain ratio values like that?
It would make more sense if you shared your data set.
How can I cross-validate decision tree models using chefboost?
Cross validation is not currently supported in chefboost. I plan to add this feature in the next release.
Why didn't you calculate the Temp. attribute?
I did actually
I mean you didn't calculate the entropy, gain, split info and gain ratio for the Temp. attribute. In the table summarizing the calculated gains and gain ratios, you only provided the Wind, Outlook and Humidity <> 80 attributes. I was confused because I calculated all attributes and found Gain = 0.113401 and Gain Ratio = 0.305472 for Temp. <> 83, which makes Temp. <> 83 come with both the maximum gain and gain ratio; shouldn't it then be the root node?
Right, temp was missing and I've added it to the summary table. But in this post I check gain instead of gain ratio, and outlook's gain is 0.246, which is greater than the gain of temperature (0.113, as you mentioned).
Just to make sure: is the exact rule to use max gain to choose the continuous attribute's split value in the binary split process, and then use max gain ratio to determine the root and branches?
Nope, you can choose either max gain or max gain ratio to determine the root node. Both are valid, but you should pick one.
So if I choose to use gain, I should use gain both to choose the continuous attribute's split value in the binary split process and to determine the root node, and if I choose to use gain ratio, I should use gain ratio for both the binary split and the root node?
Exactly!
One more question, sir: which one should I choose when there is a duplicate gain or gain ratio? In this case I tried the gain ratio metric, and the max gain ratio for humidity is 0.128515, which appears for both Humidity <> 65 and Humidity <> 95.
Interesting case; I haven't had a similar one before. You might try to find the max gain ratio, and if there is a tie, fall back to the highest gain, because both metrics are meaningful. If the tie still remains, then you might choose the first one.
Additionally, I've added what the built decision tree would look like if we used the gain ratio metric. As you mentioned, temperature would be the root node in this case.
Thanks for the post.
I don’t understand one thing – you have calculated the best split for continuous values (humidity and temperature) globally. However, when the root node is Outlook, shouldn’t you calculate the best split again for the part of the tree we are already in?
For example, in Figure 1, I think we should calculate the optimal split for the Humidity assuming that Outlook == Sunny, because we are already in a subspace defined by Outlook | Sunny.
I've implemented your algorithm in my own version and got the same tree, apart from the value that humidity should be split on 🙂
Hello, we do calculate the best splits again once outlook becomes the root node.
1. If the parent entropy is less than the child entropy, can we still split the child node (can the child node become a subtree)?
2. If one branch is a pure set (4 yes / 0 no) and another branch is impure (3 yes / 3 no), what should we do? Can we split the impure node (3 yes / 3 no) when the parent entropy is less than the child entropy?
2- I mostly return the first class for impure branches.
1- Parent and child nodes are totally different. You should not split child node.
Thank you so much for your immediate response. I shall ask more questions later; I just need some confirmation first. Thank you for your reply in advance.
After selecting the outlook attribute as the root node, we have three branches from outlook, that is "Sunny", "Overcast" and "Rain", right? After this, in the next step, when you consider "outlook = sunny", how does the table become shorter? I cannot understand.
I filtered the raw data set down to the instances satisfying outlook == sunny. Then, I calculate the gain / gain ratio pair for this filtered data set.
Sir, can your library combine C4.5 and adaboost?
Thanks
No, it should not, because AdaBoost uses weak classifiers (decision stumps), whereas C4.5 trees are strong learners.
How does the algorithm deal with missing values marked with '?'? Please help!
Suppose that your column has values in the range [0, +∞]. If you set the missing values to -1, then the algorithm can still build decision trees. To sum up, find the minimum value in that column and assign a smaller value to the missing ones.
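A minimal pandas sketch of this workaround, assuming the column is numeric and missing cells are marked with '?':

```python
import numpy as np
import pandas as pd

humidity = pd.Series(["85", "?", "78", "96", "?"])       # '?' marks the missing values
humidity = pd.to_numeric(humidity.replace("?", np.nan))  # convert to numbers with NaNs

# fill missing cells with a value smaller than the column minimum
humidity = humidity.fillna(humidity.min() - 1)
print(humidity.tolist())  # -> [85.0, 77.0, 78.0, 96.0, 77.0]
```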
Is such workaround the way the original C4.5 handles missing values?
Yes, the algorithm handles missing values by assigning replacement values such as the average.
In the Gain ratio based C4.5 decision tree in the ‘Bonus’ section, I think the tree is wrong. In “outlook=sunny” -> “humidity>65” -> “wind=strong” -> “return no”, and also in “outlook=sunny” -> “humidity>65” -> “wind=weak” -> “return no”, I would like to tell you that there is 1 ‘yes’ decision each for wind=strong and wind=weak.
Please have a look at the dataset and let me know if the decision tree given in ‘Bonus’ section perfectly works for the given dataset or not.
The tree in the bonus section was built with information gain instead of gain ratio. Choosing the metric is entirely up to you; I mean that your tree is valid as well.
Thank you for your quick response to my previous question. I want to apply chefboost to analyse different algorithms; how can I get a confusion matrix for the training and testing data sets?
You need to call its prediction function and extract the confusion matrix yourself. As I mentioned, I plan to add this feature in the next release.
Thank you very much, I am waiting for your next release.
May I know what the decision tree would look like if the gain ratio metric were used? Temperature would be the root node because it has the highest gain ratio, but how does it proceed from there?
Actually, that tree is mentioned in the bonus section.
Hey Sefik, I really appreciate your effort on this wonderful library. Anyway, is there any workaround for testing with raw, undiscretized test data? I've constructed the decision tree using discretized training data (using Fayyad-Irani's EBB) and when I tried to run a test on it, I got an error message instead.
PS: For your information, I'm using traffic flow data, so it's all numeric.
Thank you
Could you share the dataset? I need it to understand the error.
I’ve sent you an email with the dataset attached to it
Sir, I have a doubt regarding a continuous data set [the Iris data set]. A threshold value is found for the continuous attribute; then, based on the threshold value that gives the highest gain ratio, the decision node is selected and the branches hold 'less than or equal to [svalue]' and 'greater than [svalue]'. My question is: should we remove the best attribute once it is selected, or can we reuse the same attribute in the next iteration as well? [In ID3 we have to remove the best attribute once it is selected as a decision node, but what should we do in the C4.5 algorithm with a continuous data set, sir?]
You should remove that feature once it has been selected with the highest gain ratio value.
Thank you for your immediate response sir.
Feeling gratitude for your immediate response sir.
Sir, I am not good at coding, especially in Python. I don't even know which code I need to write to get the split info and gain ratio, as well as the max gain ratio. I tried to implement the C4.5 algorithm on a data set with non-numerical values. Can you please help with this? 🙏
Hi, I have a question. If, say, the rules.py file is very large, do you have any method by which I can get the decision tree, as in any package which can directly give the decision tree?
rules.py is the decision tree itself
If the gain ratio is the same for all attributes, then what do we select as the root node?
You can select a random one in that case