A decision tree is an explainable machine learning algorithm all by itself. Beyond its transparency, feature importance is a common way to explain the models it builds. The coefficients of a linear regression equation give an idea of feature importance, but that approach fails for non-linear models. Feature importance derived from decision trees can explain non-linear models as well. In this post, we will walk through how to calculate feature importance in decision tree algorithms by hand.
This amazing flashcard about feature importance was created by Chris Albon.
🙋♂️ You may consider enrolling in my top-rated machine learning course on Udemy
Vlog
You can either watch the following video or follow this blog post. They both cover feature importance for decision trees.
Besides, decision trees are not the only way to find feature importance. We can find it in linear regression as well.
Building a decision tree
Firstly, we have to build a decision tree to calculate feature importance.
Suppose that we have the following data set. It is the regular golf data set mentioned in data mining classes; we have covered it in the ID3 post as well. The only difference is that the features here are numerical instead of nominal. Remember that binary splits can be applied to continuous features.
| Outlook | Temperature | Humidity | Wind | Decision |
| ------- | ----------- | -------- | ---- | -------- |
| 1 | 1 | 1 | 1 | No |
| 1 | 1 | 1 | 2 | No |
| 2 | 1 | 1 | 1 | Yes |
| 3 | 2 | 1 | 1 | Yes |
| 3 | 3 | 2 | 1 | Yes |
| 3 | 3 | 2 | 2 | No |
| 2 | 3 | 2 | 2 | Yes |
| 1 | 2 | 1 | 1 | No |
| 1 | 3 | 2 | 1 | Yes |
| 3 | 2 | 2 | 1 | Yes |
| 1 | 2 | 2 | 2 | Yes |
| 2 | 2 | 1 | 2 | Yes |
| 2 | 1 | 2 | 1 | Yes |
| 3 | 2 | 1 | 2 | No |
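For reference, here is a minimal sketch that loads the same table into a pandas DataFrame. The name `df` is just an assumption reused in the following snippets; the values are copied exactly from the table above.

```python
import pandas as pd

# Golf data set with numerically encoded features, copied from the table above
df = pd.DataFrame(
    [
        [1, 1, 1, 1, "No"], [1, 1, 1, 2, "No"], [2, 1, 1, 1, "Yes"],
        [3, 2, 1, 1, "Yes"], [3, 3, 2, 1, "Yes"], [3, 3, 2, 2, "No"],
        [2, 3, 2, 2, "Yes"], [1, 2, 1, 1, "No"], [1, 3, 2, 1, "Yes"],
        [3, 2, 2, 1, "Yes"], [1, 2, 2, 2, "Yes"], [2, 2, 1, 2, "Yes"],
        [2, 1, 2, 1, "Yes"], [3, 2, 1, 2, "No"],
    ],
    columns=["Outlook", "Temperature", "Humidity", "Wind", "Decision"],
)
print(df.shape)  # (14, 5)
```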
Secondly, decision tree algorithms use different metrics to find the decision splits. For example, CHAID uses the Chi-square test value, ID3 and C4.5 use entropy, and CART uses the GINI index. Herein, we should note the metric for each decision point in the tree based on the selected algorithm, along with the number of instances satisfying that rule in the data set.
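As a small illustration of the entropy metric used by ID3 and C4.5, the following sketch (the `entropy` helper is a hypothetical name) reproduces the root entropy of roughly 0.940 from the 14 decisions listed above.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels (the metric of ID3 / C4.5)."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

# The 14 decisions of the golf data set, in table order (9 Yes, 5 No)
decisions = ["No", "No", "Yes", "Yes", "Yes", "No", "Yes",
             "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"]
print(round(entropy(decisions), 3))  # 0.94
```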
The following decision tree was built by the C4.5 algorithm. You should read the C4.5 post to learn how it was built step by step.
Both the entropy and the number of satisfying instances in the data set are noted next to each decision point.
Extracting rules
By the way, you can just pass the data set to the chefboost framework and it will extract these rules for you; a minimal sketch of that call follows the rules below.
def findDecision(Outlook, Temperature, Humidity, Wind):
    # Root: Humidity, 14 instances, entropy 0.940
    if Humidity > 1:
        # 2nd level: Wind, 7 instances, entropy 0.591
        if Wind <= 1:
            return 'Yes'
        elif Wind > 1:
            # 3rd level: Outlook, 3 instances, entropy 0.918
            if Outlook > 1:
                return 'Yes'
            elif Outlook <= 1:
                return 'Yes'
    elif Humidity <= 1:
        # 2nd level: Outlook, 7 instances, entropy 0.985
        if Outlook > 1:
            # 3rd level: Wind, 4 instances, entropy 0.811
            if Wind > 1:
                return 'Yes'
            elif Wind <= 1:
                return 'Yes'
        elif Outlook <= 1:
            return 'No'
Feature importance formula
To calculate the importance of each feature, we need the metric value and the number of instances at the decision point itself and at its child nodes. The following formula covers the calculation of feature importance. Herein, the metric is entropy because the C4.5 algorithm is adopted; it would be GINI if the algorithm were CART.
Feature importance (FI) = metric of the node × number of instances at the node − metric of the left child × number of instances at the left child − metric of the right child × number of instances at the right child
Some sources state the feature importance formula a little differently. Still, the normalized values will be the same.
N_t / N * (impurity – N_t_R / N_t * right_impurity – N_t_L / N_t * left_impurity)
where N is the total number of samples, N_t is the number of samples at the current node, N_t_L is the number of samples in the left child, and N_t_R is the number of samples in the right child.
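In code form, the first formula could be sketched as the hypothetical helper below; a pure leaf child simply contributes zero because its entropy is 0.

```python
def node_importance(n_node, node_metric,
                    n_left=0, left_metric=0.0,
                    n_right=0, right_metric=0.0):
    """Contribution of one decision point: node metric weighted by its instances,
    minus the weighted metrics of its children. A pure leaf has metric 0,
    so it simply drops out of the calculation."""
    return n_node * node_metric - n_left * left_metric - n_right * right_metric

# For example, the Humidity node at the root of the tree above:
print(round(node_importance(14, 0.940, 7, 0.985, 7, 0.591), 3))  # 2.128
```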
Notice that a feature can appear in a decision tree several times as a decision point. For example, the feature Outlook appears twice in this decision tree, at the 2nd and 3rd levels. The sum of those individual contributions gives the feature importance of Outlook.
1st level of the decision tree
FI(Humidity|1st level) = 14×0.940 – 7×0.985 – 7×0.591 = 2.128
2nd level of the decision tree
FI(Outlook|2nd level) = 7×0.985 – 4×0.811 = 3.651
FI(Wind|2nd level) = 7×0.591 – 3×0.918 = 1.383
Notice that both the Outlook and Wind decision points in the 2nd level also have direct decision leaves. For instance, when Humidity is less than or equal to 1 and Outlook is less than or equal to 1, the decision is directly No. Such leaf branches contribute nothing to the feature importance calculation because the entropy of a pure decision is 0.
3rd level of the decision tree
FI(Wind|3rd level) = 4×0.811 = 3.244
FI(Outlook|3rd level) = 3×0.918 = 2.754
Results
FI(Humidity) = FI(Humidity|1st level) = 2.128
FI(Outlook) = FI(Outlook|2nd level) + FI(Outlook|3rd level) = 3.651 + 2.754 = 6.405
FI(Wind) = FI(Wind|2nd level) + FI(Wind|3rd level) = 1.383 + 3.244 = 4.627
Normalization
We can normalize these results by dividing each of them by their sum.
FI(Sum) = FI(Humidity) + FI(Outlook) + FI(Wind) = 2.128 + 6.405 + 4.627 = 13.16
FI(Humidity)’ = FI(Humidity) / FI(Sum) = 2.128 / 13.16 = 0.16
FI(Outlook)’ = FI(Outlook) / FI(Sum) = 6.405 / 13.16 = 0.49
FI(Wind)’ = FI(Wind) / FI(Sum) = 4.627 / 13.16 = 0.35
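The whole hand calculation above can be reproduced with a few lines of plain arithmetic; the dictionary below just mirrors the entropies and instance counts noted at each decision point.

```python
# Reproduce the hand calculation with the entropies and instance counts
# noted at each decision point of the C4.5 tree above.
fi = {
    "Humidity": 14 * 0.940 - 7 * 0.985 - 7 * 0.591,   # 1st level (root)
    "Outlook": (7 * 0.985 - 4 * 0.811) + 3 * 0.918,   # 2nd + 3rd level
    "Wind": (7 * 0.591 - 3 * 0.918) + 4 * 0.811,      # 2nd + 3rd level
}

total = sum(fi.values())  # 13.16

for feature, value in fi.items():
    print(feature, round(value, 3), round(value / total, 2))

# Humidity 2.128 0.16
# Outlook 6.405 0.49
# Wind 4.627 0.35
```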
Demonstration
So, Outlook is the most important feature, Wind comes after it, and Humidity follows Wind. Notice that the Temperature feature does not appear in the built decision tree at all. This means that its feature importance value is 0; in other words, it is an identity element in the sum.
We mostly represent feature importance values as horizontal bar charts.
Bagging and boosting methods
Gradient boosting machines and random forests consist of several decision trees. We calculate feature importance values for each tree in the same way and then average them to find the final feature importance values.
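As a minimal sketch of that averaging step, assume `per_tree_importances` holds hypothetical normalized importance vectors for three trees in the order Outlook, Humidity, Wind:

```python
import numpy as np

# Hypothetical per-tree normalized importances for [Outlook, Humidity, Wind]
per_tree_importances = np.array([
    [0.49, 0.16, 0.35],
    [0.40, 0.25, 0.35],
    [0.55, 0.10, 0.35],
])

# The ensemble-level importance is simply the average over the trees
ensemble_importance = per_tree_importances.mean(axis=0)
print(ensemble_importance)  # e.g. [0.48 0.17 0.35]
```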
Both random forest and gradient boosting are ensemble approaches rather than core decision tree algorithms themselves. They require a core decision tree algorithm to run, and they build many decision trees in the background. We discuss how they are similar and how they differ in the following video.
Adaboost vs Gradient Boosting
Both gradient boosting and AdaBoost are boosting techniques for decision tree based machine learning models. We discuss how they are similar to and different from each other.
Python Code
Herein, the chefboost framework for Python lets you build decision trees with a few lines of code. It covers feature importance calculation as well.
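As a rough cross-check rather than the post's own method, scikit-learn's CART implementation exposes the same kind of normalized importances. It assumes `df` is the golf data frame defined earlier; its values may differ from the hand calculation because scikit-learn chooses its own binary splits.

```python
# Cross-check with scikit-learn's CART trees (entropy criterion),
# assuming df is the golf data frame defined earlier in this post.
from sklearn.tree import DecisionTreeClassifier

X = df[["Outlook", "Temperature", "Humidity", "Wind"]]
y = df["Decision"]

tree = DecisionTreeClassifier(criterion="entropy", random_state=42)
tree.fit(X, y)

# Normalized feature importances, one value per column of X
for name, importance in zip(X.columns, tree.feature_importances_):
    print(name, round(importance, 2))
```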
Conclusion
So, we have covered how to calculate feature importance in decision trees, adopting the C4.5 algorithm to build the tree. We can apply the same logic to any decision tree algorithm. Decision tree algorithms offer both explainable rules and feature importance values for non-linear models.
Support this blog if you like it!