A decision tree is an explainable machine learning algorithm all by itself. Beyond its transparency, feature importance is a common way to explain the models it builds. The coefficients of a linear regression equation give an idea of feature importance, but that approach fails for non-linear models. Feature importance derived from decision trees can explain non-linear models as well. In this post, we will walk through how to calculate feature importance in decision tree algorithms by hand.
This amazing flashcard about feature importance was created by Chris Albon.
🙋♂️ You may consider enrolling in my top-rated machine learning course on Udemy
Vlog
You can either watch the following video or follow this blog post. They both cover feature importance for decision trees.
Besides, decision trees are not the only way to find feature importance. We can find it in linear regression as well.
Building a decision tree
Firstly, we have to build a decision tree to calculate feature importance.
Suppose that we have the following data set. It is the regular golf data set mentioned in data mining classes; we have covered it in the ID3 post as well. The only difference is that the features here are numerical instead of nominal. Remember that binary splits can be applied to continuous features.
| Outlook | Temperature | Humidity | Wind | Decision |
| ------- | ----------- | -------- | ---- | -------- |
| 1 | 1 | 1 | 1 | No |
| 1 | 1 | 1 | 2 | No |
| 2 | 1 | 1 | 1 | Yes |
| 3 | 2 | 1 | 1 | Yes |
| 3 | 3 | 2 | 1 | Yes |
| 3 | 3 | 2 | 2 | No |
| 2 | 3 | 2 | 2 | Yes |
| 1 | 2 | 1 | 1 | No |
| 1 | 3 | 2 | 1 | Yes |
| 3 | 2 | 2 | 1 | Yes |
| 1 | 2 | 2 | 2 | Yes |
| 2 | 2 | 1 | 2 | Yes |
| 2 | 1 | 2 | 1 | Yes |
| 3 | 2 | 1 | 2 | No |
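For reference, here is a minimal sketch that loads the same table into a pandas DataFrame. The name `df` is just an assumption reused in the following snippets; the values are copied exactly from the table above.

```python
import pandas as pd

# Golf data set with numerically encoded features, copied from the table above
df = pd.DataFrame(
    [
        [1, 1, 1, 1, "No"], [1, 1, 1, 2, "No"], [2, 1, 1, 1, "Yes"],
        [3, 2, 1, 1, "Yes"], [3, 3, 2, 1, "Yes"], [3, 3, 2, 2, "No"],
        [2, 3, 2, 2, "Yes"], [1, 2, 1, 1, "No"], [1, 3, 2, 1, "Yes"],
        [3, 2, 2, 1, "Yes"], [1, 2, 2, 2, "Yes"], [2, 2, 1, 2, "Yes"],
        [2, 1, 2, 1, "Yes"], [3, 2, 1, 2, "No"],
    ],
    columns=["Outlook", "Temperature", "Humidity", "Wind", "Decision"],
)
print(df.shape)  # (14, 5)
```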
Secondly, decision tree algorithms use different metrics to find the decision splits. For example, CHAID uses the Chi-square test value, ID3 and C4.5 use entropy, and CART uses the GINI index. Herein, we should note the metric for each decision point in the tree based on the selected algorithm, along with the number of instances satisfying that rule in the data set.
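As a small illustration of the entropy metric used by ID3 and C4.5, the following sketch (the `entropy` helper is a hypothetical name) reproduces the root entropy of roughly 0.940 from the 14 decisions listed above.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels (the metric of ID3 / C4.5)."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

# The 14 decisions of the golf data set, in table order (9 Yes, 5 No)
decisions = ["No", "No", "Yes", "Yes", "Yes", "No", "Yes",
             "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"]
print(round(entropy(decisions), 3))  # 0.94
```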
The following decision tree was built by the C4.5 algorithm. You should read the C4.5 post to learn how it was built step by step.
Both the entropy and the number of satisfying instances in the data set are noted next to each decision point.
Extracting rules
By the way, you can just pass the data set to the chefboost framework and it will extract these rules for you; a minimal sketch of that call follows the rules below.
def findDecision(Outlook, Temperature, Humidity, Wind):
    # Root: Humidity, 14 instances, entropy 0.940
    if Humidity > 1:
        # 2nd level: Wind, 7 instances, entropy 0.591
        if Wind <= 1:
            return 'Yes'
        elif Wind > 1:
            # 3rd level: Outlook, 3 instances, entropy 0.918
            if Outlook > 1:
                return 'Yes'
            elif Outlook <= 1:
                return 'Yes'
    elif Humidity <= 1:
        # 2nd level: Outlook, 7 instances, entropy 0.985
        if Outlook > 1:
            # 3rd level: Wind, 4 instances, entropy 0.811
            if Wind > 1:
                return 'Yes'
            elif Wind <= 1:
                return 'Yes'
        elif Outlook <= 1:
            return 'No'
Feature importance formula
To calculate the importance of each feature, we need the metric value and the number of instances at the decision point itself and at its child nodes. The following formula covers the calculation of feature importance. Herein, the metric is entropy because the C4.5 algorithm is adopted; it would be GINI if the algorithm were CART.
Feature importance (FI) = metric of the node × number of instances at the node − metric of the left child × number of instances at the left child − metric of the right child × number of instances at the right child
Some sources state the feature importance formula a little differently. Still, the normalized values will be the same.
N_t / N * (impurity – N_t_R / N_t * right_impurity – N_t_L / N_t * left_impurity)
where N is the total number of samples, N_t is the number of samples at the current node, N_t_L is the number of samples in the left child, and N_t_R is the number of samples in the right child.
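In code form, the first formula could be sketched as the hypothetical helper below; a pure leaf child simply contributes zero because its entropy is 0.

```python
def node_importance(n_node, node_metric,
                    n_left=0, left_metric=0.0,
                    n_right=0, right_metric=0.0):
    """Contribution of one decision point: node metric weighted by its instances,
    minus the weighted metrics of its children. A pure leaf has metric 0,
    so it simply drops out of the calculation."""
    return n_node * node_metric - n_left * left_metric - n_right * right_metric

# For example, the Humidity node at the root of the tree above:
print(round(node_importance(14, 0.940, 7, 0.985, 7, 0.591), 3))  # 2.128
```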
Notice that a feature can appear in a decision tree several times as a decision point. For example, the feature Outlook appears twice in this decision tree, at the 2nd and 3rd levels. The sum of those individual contributions gives the feature importance of Outlook.
1st level of the decision tree
FI(Humidity|1st level) = 14×0.940 – 7×0.985 – 7×0.591 = 2.128
2nd level of the decision tree
FI(Outlook|2nd level) = 7×0.985 – 4×0.811 = 3.651
FI(Wind|2nd level) = 7×0.591 – 3×0.918 = 1.383
Notice that both the Outlook and Wind decision points in the 2nd level also have direct decision leaves. For instance, when Humidity is less than or equal to 1 and Outlook is less than or equal to 1, the decision is directly No. Such leaf branches contribute nothing to the feature importance calculation because the entropy of a pure decision is 0.
3rd level of the decision tree
FI(Wind|3rd level) = 4×0.811 = 3.244
FI(Outlook|3rd level) = 3×0.918 = 2.754
Results
FI(Humidity) = FI(Humidity|1st level) = 2.128
FI(Outlook) = FI(Outlook|2nd level) + FI(Outlook|3rd level) = 3.651 + 2.754 = 6.405
FI(Wind) = FI(Wind|2nd level) + FI(Wind|3rd level) = 1.383 + 3.244 = 4.627
Normalization
We can normalize these results by dividing each of them by their sum.
FI(Sum) = FI(Humidity) + FI(Outlook) + FI(Wind) = 2.128 + 6.405 + 4.627 = 13.16
FI(Humidity)’ = FI(Humidity) / FI(Sum) = 2.128 / 13.16 = 0.16
FI(Outlook)’ = FI(Outlook) / FI(Sum) = 6.405 / 13.16 = 0.49
FI(Wind)’ = FI(Wind) / FI(Sum) = 4.627 / 13.16 = 0.35
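The whole hand calculation above can be reproduced with a few lines of plain arithmetic; the dictionary below just mirrors the entropies and instance counts noted at each decision point.

```python
# Reproduce the hand calculation with the entropies and instance counts
# noted at each decision point of the C4.5 tree above.
fi = {
    "Humidity": 14 * 0.940 - 7 * 0.985 - 7 * 0.591,   # 1st level (root)
    "Outlook": (7 * 0.985 - 4 * 0.811) + 3 * 0.918,   # 2nd + 3rd level
    "Wind": (7 * 0.591 - 3 * 0.918) + 4 * 0.811,      # 2nd + 3rd level
}

total = sum(fi.values())  # 13.16

for feature, value in fi.items():
    print(feature, round(value, 3), round(value / total, 2))

# Humidity 2.128 0.16
# Outlook 6.405 0.49
# Wind 4.627 0.35
```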
Demonstration
So, Outlook is the most important feature, Wind comes after it, and Humidity follows Wind. Notice that the Temperature feature does not appear in the built decision tree at all. This means that its feature importance value is 0; in other words, it is an identity element in the sum.
We mostly represent feature importance values as horizontal bar charts.
Bagging and boosting methods
Gradient boosting machines and random forests consist of several decision trees. We calculate feature importance values for each tree in the same way and then average them to find the final feature importance values.
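As a minimal sketch of that averaging step, assume `per_tree_importances` holds hypothetical normalized importance vectors for three trees in the order Outlook, Humidity, Wind:

```python
import numpy as np

# Hypothetical per-tree normalized importances for [Outlook, Humidity, Wind]
per_tree_importances = np.array([
    [0.49, 0.16, 0.35],
    [0.40, 0.25, 0.35],
    [0.55, 0.10, 0.35],
])

# The ensemble-level importance is simply the average over the trees
ensemble_importance = per_tree_importances.mean(axis=0)
print(ensemble_importance)  # e.g. [0.48 0.17 0.35]
```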
Both random forest and gradient boosting are ensemble approaches rather than core decision tree algorithms themselves. They require a core decision tree algorithm to run, and they build many decision trees in the background. We discuss how they are similar and how they differ in the following video.
Adaboost vs Gradient Boosting
Both gradient boosting and AdaBoost are boosting techniques for decision tree based machine learning models. We discuss how they are similar to and different from each other.
Python Code
Herein, the chefboost framework for Python lets you build decision trees with a few lines of code. It covers feature importance calculation as well.
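As a rough cross-check rather than the post's own method, scikit-learn's CART implementation exposes the same kind of normalized importances. It assumes `df` is the golf data frame defined earlier; its values may differ from the hand calculation because scikit-learn chooses its own binary splits.

```python
# Cross-check with scikit-learn's CART trees (entropy criterion),
# assuming df is the golf data frame defined earlier in this post.
from sklearn.tree import DecisionTreeClassifier

X = df[["Outlook", "Temperature", "Humidity", "Wind"]]
y = df["Decision"]

tree = DecisionTreeClassifier(criterion="entropy", random_state=42)
tree.fit(X, y)

# Normalized feature importances, one value per column of X
for name, importance in zip(X.columns, tree.feature_importances_):
    print(name, round(importance, 2))
```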
Conclusion
So, we have covered how to calculate feature importance in decision trees, adopting the C4.5 algorithm to build the tree. We can apply the same logic to any decision tree algorithm. Decision tree algorithms offer both explainable rules and feature importance values for non-linear models.
Support this blog if you like it!