Decision tree algorithms create understandable and readable decision rules. This is one of the most important advantages of this approach. It also enables us to modify some rules. This modification is called pruning in decision trees, and it is a common technique in applied machine learning studies. We can apply pruning to avoid overfitting and to improve performance. We will mention pruning techniques in this post.
Pruning can be handled in two ways: pre-pruning and post-pruning.
Pre-pruning
We’ve mentioned regression trees in a previous post. We are going to use the same data set as in that post, demonstrated below.
| Day | Outlook | Temp. | Humidity | Wind | Golf Players |
| --- | --- | --- | --- | --- | --- |
| 1 | Sunny | Hot | High | Weak | 25 |
| 2 | Sunny | Hot | High | Strong | 30 |
| 3 | Overcast | Hot | High | Weak | 46 |
| 4 | Rain | Mild | High | Weak | 45 |
| 5 | Rain | Cool | Normal | Weak | 52 |
| 6 | Rain | Cool | Normal | Strong | 23 |
| 7 | Overcast | Cool | Normal | Strong | 43 |
| 8 | Sunny | Mild | High | Weak | 35 |
| 9 | Sunny | Cool | Normal | Weak | 38 |
| 10 | Rain | Mild | Normal | Weak | 46 |
| 11 | Sunny | Mild | Normal | Strong | 48 |
| 12 | Overcast | Mild | High | Strong | 52 |
| 13 | Overcast | Hot | Normal | Weak | 44 |
| 14 | Rain | Mild | High | Strong | 30 |
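If you want to follow along in plain Python, here is the same data set as a pandas DataFrame. The sketches later in this post assume this `df`; the variable and column names here are my own shorthand.

```python
import pandas as pd

# the regression data set from the table above; the target column
# ('Decision') holds the number of golf players
df = pd.DataFrame({
    'Outlook': ['Sunny', 'Sunny', 'Overcast', 'Rain', 'Rain', 'Rain', 'Overcast',
                'Sunny', 'Sunny', 'Rain', 'Sunny', 'Overcast', 'Overcast', 'Rain'],
    'Temp': ['Hot', 'Hot', 'Hot', 'Mild', 'Cool', 'Cool', 'Cool',
             'Mild', 'Cool', 'Mild', 'Mild', 'Mild', 'Hot', 'Mild'],
    'Humidity': ['High', 'High', 'High', 'High', 'Normal', 'Normal', 'Normal',
                 'High', 'Normal', 'Normal', 'Normal', 'High', 'Normal', 'High'],
    'Wind': ['Weak', 'Strong', 'Weak', 'Weak', 'Weak', 'Strong', 'Strong',
             'Weak', 'Weak', 'Weak', 'Strong', 'Strong', 'Weak', 'Strong'],
    'Decision': [25, 30, 46, 45, 52, 23, 43, 35, 38, 46, 48, 52, 44, 30],
})
```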
Running the regression tree algorithm constructs the following decision tree.
```python
def findDecision(Outlook, Temp, Humidity, Wind):
    if Outlook == 'Rain':
        if Wind == 'Weak':
            if Humidity <= 95:
                if Temp <= 83:
                    return 46
            if Humidity > 95:
                return 45
        if Wind == 'Strong':
            if Temp <= 83:
                if Humidity <= 95:
                    return 23
    if Outlook == 'Sunny':
        if Temp <= 83:
            if Wind == 'Weak':
                if Humidity <= 95:
                    return 35
            if Wind == 'Strong':
                if Humidity <= 95:
                    return 30
        if Temp > 83:
            return 25
    if Outlook == 'Overcast':
        if Wind == 'Weak':
            if Temp <= 83:
                if Humidity <= 95:
                    return 46
        if Wind == 'Strong':
            if Temp <= 83:
                if Humidity <= 95:
                    return 43
```
A disappointingly huge tree is created. It smells like overfitting.
As seen, a huge tree is created. This is a typical problem of regression trees. The decision rules at the bottom cover only a few instances, or even a single one. This causes overfitting. Here, we can apply an early stop: we might check the number of instances in the current branch, or the ratio of the current branch's standard deviation to the standard deviation of the whole data set.
```python
# early stop (pre-pruning): terminate this branch and return the average
# when fewer than 5 instances remain
if algorithm == 'Regression' and subdataset.shape[0] < 5:
#if algorithm == 'Regression' and subdataset['Decision'].std(ddof=0)/global_stdev < 0.4:
    final_decision = subdataset['Decision'].mean()  # get average
    terminateBuilding = True
```
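To see the whole mechanism in one place, here is a minimal, self-contained sketch of a regression tree builder with this early stop. It is my own simplified version: it splits on the weighted standard deviation reduction and handles categorical features only, whereas the real implementation on GitHub covers more cases.

```python
def build_tree(df, target='Decision', min_instances=5):
    # early stop (pre-pruning): too few instances left in this branch,
    # or the branch is already pure -> return a leaf with the average
    if df.shape[0] < min_instances or df[target].std(ddof=0) == 0:
        return df[target].mean()
    features = [c for c in df.columns if c != target]
    if not features:  # nothing left to split on
        return df[target].mean()
    # pick the feature whose split reduces the weighted standard deviation most
    best_feature = min(
        features,
        key=lambda f: sum(
            (len(subset) / len(df)) * subset[target].std(ddof=0)
            for _, subset in df.groupby(f)
        ),
    )
    # recurse into each branch of the winning feature
    return {
        f"{best_feature} == '{value}'": build_tree(
            subset.drop(columns=[best_feature]), target, min_instances
        )
        for value, subset in df.groupby(best_feature)
    }

# uses the df built above; returns a nested dict of rules and leaf averages
print(build_tree(df, min_instances=5))
```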
Enabling the early stop when the sub data set in the current branch has fewer than, e.g., 5 instances constructs the following decision tree. As seen, more generalized decision rules are created. This avoids overfitting.
```python
def findDecision(Outlook, Temp, Humidity, Wind):
    if Outlook == 'Rain':
        if Wind == 'Weak':
            return 47.666666666666664
        if Wind == 'Strong':
            return 26.5
    if Outlook == 'Sunny':
        if Temp <= 83:
            return 37.75
        if Temp > 83:
            return 25
    if Outlook == 'Overcast':
        return 46.25
```
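By the way, mainstream libraries expose the same idea as hyperparameters. For instance, scikit-learn's DecisionTreeRegressor stops splitting nodes that have fewer instances than min_samples_split. A short sketch, assuming scikit-learn is installed and reusing the df built above:

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor, export_text

# one-hot encode the categorical features; reuses the df built above
X = pd.get_dummies(df.drop(columns=['Decision']))
y = df['Decision']

# pre-pruning: a node with fewer than 5 instances becomes a leaf
model = DecisionTreeRegressor(min_samples_split=5, random_state=42)
model.fit(X, y)
print(export_text(model, feature_names=list(X.columns)))
```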
Post-pruning
We’ve mentioned the C4.5 decision tree algorithm in a previous post. Suppose that we are going to work on the following data set.
| Day | Outlook | Temp. | Humidity | Wind | Decision |
| --- | --- | --- | --- | --- | --- |
| 1 | Sunny | 85 | 85 | Weak | No |
| 2 | Sunny | 80 | 90 | Strong | No |
| 3 | Overcast | 83 | 78 | Weak | Yes |
| 4 | Rain | 70 | 96 | Weak | Yes |
| 5 | Rain | 68 | 80 | Weak | Yes |
| 6 | Rain | 65 | 70 | Strong | No |
| 7 | Overcast | 64 | 65 | Strong | Yes |
| 8 | Sunny | 72 | 95 | Weak | No |
| 9 | Sunny | 69 | 70 | Weak | Yes |
| 10 | Rain | 75 | 80 | Weak | Yes |
| 11 | Sunny | 75 | 70 | Strong | Yes |
| 12 | Overcast | 72 | 90 | Strong | Yes |
| 13 | Overcast | 81 | 75 | Weak | Yes |
| 14 | Rain | 71 | 80 | Strong | No |
The C4.5 algorithm constructs the following decision tree. Notice that the built tree has a different form than in the related blog post because we picked the information gain metric there, whereas we pick the gain ratio metric in this post.
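For reference, gain ratio is information gain normalized by the split information of a feature: GainRatio(S, A) = Gain(S, A) / SplitInfo(S, A), where SplitInfo(S, A) = −Σᵢ (|Sᵢ|/|S|) · log₂(|Sᵢ|/|S|) and the Sᵢ are the branches that feature A creates. This normalization penalizes features that split the data into many small branches, which is why the two metrics can produce differently shaped trees.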
```python
def findDecision(Outlook, Temp, Humidity, Wind):
    if Temp <= 83:
        if Outlook == 'Rain':
            if Wind == 'Weak':
                return 'Yes'
            if Wind == 'Strong':
                return 'No'
        if Outlook == 'Overcast':
            return 'Yes'
        if Outlook == 'Sunny':
            if Humidity > 65:
                if Wind == 'Strong':
                    return 'Yes'
                if Wind == 'Weak':
                    return 'Yes'
    if Temp > 83:
        return 'No'
```
Redundant rules
Here, please focus on the decisions when temperature is less than or equal to 83 and the outlook is sunny. This branch makes a positive decision no matter what the wind is; still, it checks the wind feature. We can prune the wind check at that level. Also, you might notice that there is no answer when humidity is less than or equal to 65. This comes from the fact that 65 is the minimum value of the humidity feature in the data set. We can prune this rule, too. The final form of the decision tree is illustrated below.
```python
def findDecision(Outlook, Temp, Humidity, Wind):
    if Temp <= 83:
        if Outlook == 'Rain':
            if Wind == 'Weak':
                return 'Yes'
            if Wind == 'Strong':
                return 'No'
        if Outlook == 'Overcast':
            return 'Yes'
        if Outlook == 'Sunny':
            return 'Yes'
    if Temp > 83:
        return 'No'
```
This modification improves the runtime performance of the decision tree: it makes the same decisions with fewer checks, so the results do not change.
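This kind of collapse can be automated bottom-up. Below is a minimal sketch of my own over a nested-dict tree representation (the same shape as the pre-pruning sketch earlier); it is an illustration, not the implementation in the GitHub repo.

```python
def post_prune(node):
    # leaves are returned as-is
    if not isinstance(node, dict):
        return node
    # prune children first (bottom-up)
    pruned = {test: post_prune(child) for test, child in node.items()}
    children = list(pruned.values())
    # if every child is a leaf and they all agree, the test is redundant:
    # replace the whole subtree with the shared decision
    if all(not isinstance(c, dict) for c in children) and len(set(children)) == 1:
        return children[0]
    return pruned

# the redundant sunny branch from the tree above collapses to a single leaf
sunny_branch = {"Wind == 'Strong'": 'Yes', "Wind == 'Weak'": 'Yes'}
print(post_prune(sunny_branch))  # 'Yes'
```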
We have pruned these decision rules because their parent branch already covers both outcomes. But this is not the only reason to prune. You should also prune branches that derive from only a few instances in the training set. This helps avoid overfitting.
To sum up, post-pruning means building the decision tree first and then pruning some decision rules from the end towards the beginning. In contrast, pre-pruning happens while the decision tree is being built. In both cases, less complex trees are created, which makes the decision rules run faster. It also helps avoid overfitting.
All code and data sets have already been pushed to GitHub. You can run them by yourself.
Support this blog if you like it!