How Pruning Works in Decision Trees

Decision tree algorithms produce understandable, readable decision rules, and this is one of their most important advantages. It also makes it possible to modify some of those rules. This modification is called pruning, and it is a common technique in applied machine learning. We can apply pruning to avoid overfitting and to improve performance. We will cover pruning techniques in this post.

Pruning can be handled in two ways: pre-pruning and post-pruning.


Pre-pruning

We covered regression trees in a previous post. We are going to use the same data set from that post, shown below.

Day Outlook Temp. Humidity Wind Golf Players
1 Sunny Hot High Weak 25
2 Sunny Hot High Strong 30
3 Overcast Hot High Weak 46
4 Rain Mild High Weak 45
5 Rain Cool Normal Weak 52
6 Rain Cool Normal Strong 23
7 Overcast Cool Normal Strong 43
8 Sunny Mild High Weak 35
9 Sunny Cool Normal Weak 38
10 Rain Mild Normal Weak 46
11 Sunny Mild Normal Strong 48
12 Overcast Mild High Strong 52
13 Overcast Hot Normal Weak 44
14 Rain Mild High Strong 30

Running the regression tree algorithm constructs the following decision tree.

def findDecision(Outlook, Temp, Humidity, Wind):
 if Outlook == 'Rain':
  if Wind == 'Weak':
   if Humidity <= 95:
    if Temp <= 83:
     return 46
   if Humidity > 95:
    return 45
  if Wind == 'Strong':
   if Temp <= 83:
    if Humidity <= 95:
     return 23
 if Outlook == 'Sunny':
  if Temp <= 83:
   if Wind == 'Weak':
    if Humidity <= 95:
     return 35
   if Wind == 'Strong':
    if Humidity <= 95:
     return 30
  if Temp > 83:
   return 25
 if Outlook == 'Overcast':
  if Wind == 'Weak':
   if Temp <= 83:
    if Humidity <= 95:
     return 46
  if Wind == 'Strong':
   if Temp <= 83:
    if Humidity <= 95:
     return 43

As seen, a huge tree is created. This is a typical problem with regression trees: the decision rules at the bottom cover only a few instances, or even a single one, which causes overfitting. Here we can apply an early stop: we might check the number of instances in the current branch, or the ratio of the standard deviation of the current branch to the standard deviation of the entire data set.

#early stop: a branch with fewer than 5 instances becomes a leaf
if algorithm == 'Regression' and subdataset.shape[0] < 5:
#alternative criterion: the branch's standard deviation is small relative to the whole data set
#if algorithm == 'Regression' and subdataset['Decision'].std(ddof=0)/global_stdev < 0.4:
 final_decision = subdataset['Decision'].mean() #get the average of the branch
 terminateBuilding = True
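
For instance, we can verify the commented-out standard deviation criterion by hand. Below is a minimal check with pandas, assuming the 0.4 threshold and the ddof=0 population standard deviation from the snippet above:

import pandas as pd

#all 14 golf players values from the table above
decisions = pd.Series([25, 30, 46, 45, 52, 23, 43, 35, 38, 46, 48, 52, 44, 30])
global_stdev = decisions.std(ddof=0) #roughly 9.32

#the overcast branch covers days 3, 7, 12 and 13
overcast = pd.Series([46, 43, 52, 44])
ratio = overcast.std(ddof=0) / global_stdev #roughly 3.49 / 9.32 = 0.37

#0.37 is less than the 0.4 threshold, so this criterion would also
#terminate building here and return the branch average as the decision
print(round(ratio, 2), overcast.mean()) #0.37 46.25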

Enabling an early stop when the sub data set in the current branch has fewer than, say, 5 instances constructs the following decision tree. As seen, more generalized decision rules are created, which avoids overfitting.

def findDecision(Outlook, Temp, Humidity, Wind):
 if Outlook == 'Rain':
  if Wind == 'Weak':
   return 47.666666666666664
  if Wind == 'Strong':
   return 26.5
 if Outlook == 'Sunny':
  if Temp <= 83:
   return 37.75
  if Temp > 83:
   return 25
 if Outlook == 'Overcast':
  return 46.25
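
For comparison, mainstream libraries expose pre-pruning as hyperparameters. The sketch below uses scikit-learn, which is an assumption on my part rather than the library used in this post; its min_samples_split=5 refuses to split any node with fewer than 5 instances, just like the early stop above:

import pandas as pd
from sklearn.tree import DecisionTreeRegressor

#the golf players data set from the table above
rows = [['Sunny', 'Hot', 'High', 'Weak', 25], ['Sunny', 'Hot', 'High', 'Strong', 30],
        ['Overcast', 'Hot', 'High', 'Weak', 46], ['Rain', 'Mild', 'High', 'Weak', 45],
        ['Rain', 'Cool', 'Normal', 'Weak', 52], ['Rain', 'Cool', 'Normal', 'Strong', 23],
        ['Overcast', 'Cool', 'Normal', 'Strong', 43], ['Sunny', 'Mild', 'High', 'Weak', 35],
        ['Sunny', 'Cool', 'Normal', 'Weak', 38], ['Rain', 'Mild', 'Normal', 'Weak', 46],
        ['Sunny', 'Mild', 'Normal', 'Strong', 48], ['Overcast', 'Mild', 'High', 'Strong', 52],
        ['Overcast', 'Hot', 'Normal', 'Weak', 44], ['Rain', 'Mild', 'High', 'Strong', 30]]
df = pd.DataFrame(rows, columns=['Outlook', 'Temp', 'Humidity', 'Wind', 'Decision'])

#scikit-learn trees expect numeric features, so one-hot encode the categories
X = pd.get_dummies(df[['Outlook', 'Temp', 'Humidity', 'Wind']])
y = df['Decision']

#pre-pruning: a node covering fewer than 5 instances becomes a leaf
tree = DecisionTreeRegressor(min_samples_split=5, random_state=0).fit(X, y)
print(tree.get_n_leaves())

The resulting binary tree will not match the multiway tree above exactly, since scikit-learn splits on one feature value at a time, but the early stopping behaves the same way.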

Post-pruning

We covered the C4.5 decision tree algorithm in a previous post. Suppose that we are going to work on the following data set.

Day Outlook Temp. Humidity Wind Decision
1 Sunny 85 85 Weak No
2 Sunny 80 90 Strong No
3 Overcast 83 78 Weak Yes
4 Rain 70 96 Weak Yes
5 Rain 68 80 Weak Yes
6 Rain 65 70 Strong No
7 Overcast 64 65 Strong Yes
8 Sunny 72 95 Weak No
9 Sunny 69 70 Weak Yes
10 Rain 75 80 Weak Yes
11 Sunny 75 70 Strong Yes
12 Overcast 72 90 Strong Yes
13 Overcast 81 75 Weak Yes
14 Rain 71 80 Strong No

The C4.5 algorithm constructs the following decision tree. Notice that the built tree has a different form than the one in the related blog post, because we picked the information gain metric there but the gain ratio metric here.

def findDecision(Outlook, Temp, Humidity, Wind):
 if Temp <= 83:
  if Outlook == 'Rain':
   if Wind == 'Weak':
    return 'Yes'
   if Wind == 'Strong':
    return 'No'
  if Outlook == 'Overcast':
   return 'Yes'
  if Outlook == 'Sunny':
   if Humidity > 65:
    if Wind == 'Strong':
     return 'Yes'
    if Wind == 'Weak':
     return 'Yes'
 if Temp > 83:
  return 'No'

Here, please focus on the decisions when temperature is less than or equal to 83 and the outlook is sunny. This branch makes a positive decision no matter what the wind is, yet it still checks the wind feature. We can prune the wind check at that level. Also, you might notice that there is no answer when humidity is less than or equal to 65; this is because 65 is the minimum value the humidity feature takes in the data set, so no instance ever falls into that branch. We can prune this rule, too. The final form of the decision tree is illustrated below.

def findDecision(Outlook, Temp, Humidity, Wind):
 if Temp <= 83:
  if Outlook == 'Rain':
   if Wind == 'Weak':
    return 'Yes'
   if Wind == 'Strong':
    return 'No'
  if Outlook == 'Overcast':
   return 'Yes'
  if Outlook == 'Sunny':
   return 'Yes'
 if Temp > 83:
  return 'No'
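
Programmatically, this kind of bottom-up pruning can be sketched with a small recursive helper. The function below is hypothetical rather than taken from the post's code; it represents a tree as nested dicts and collapses any test whose subtrees all return the same decision:

def prune(tree):
    #a tree is either a leaf decision or a dict mapping feature tests to subtrees
    if not isinstance(tree, dict):
        return tree #leaves stay as they are
    #prune the children first, so collapsing proceeds from the leaves to the root
    tree = {test: prune(subtree) for test, subtree in tree.items()}
    children = list(tree.values())
    #if every child is a leaf and they all return the same decision, the test is redundant
    if all(not isinstance(c, dict) for c in children) and len(set(children)) == 1:
        return children[0]
    return tree

#the sunny branch above: humidity and wind are checked, but the answer is always 'Yes'
sunny = {"Humidity > 65": {"Wind == 'Strong'": 'Yes', "Wind == 'Weak'": 'Yes'}}
print(prune(sunny)) #prints Yes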

This modification improves the runtime performance of the decision tree: fewer feature checks are evaluated, while the returned decisions remain exactly the same.

We have pruned these decision rules because the parent branch already returns the same decision for all of them. But this is not the only criterion: you should also prune branches that derive from just a few instances in the training set. This helps to avoid overfitting.

To sum up, post-pruning consists of building the decision tree first and then pruning some decision rules from the leaves back toward the root. In contrast, pre-pruning is handled while the decision tree is being built. In both cases, less complex trees are created, so the decision rules run faster. This also helps to avoid overfitting.
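
As a side note, scikit-learn ships post-pruning as minimal cost-complexity pruning, a different algorithm from the rule-based pruning above but with the same goal. A minimal sketch, reusing X and y from the pre-pruning sketch:

from sklearn.tree import DecisionTreeRegressor

#grow the full tree first, then prune: larger ccp_alpha values remove more
#of the weakest subtrees
full = DecisionTreeRegressor(random_state=0).fit(X, y)
path = full.cost_complexity_pruning_path(X, y) #candidate alpha values
pruned = DecisionTreeRegressor(random_state=0, ccp_alpha=path.ccp_alphas[-2]).fit(X, y)
print(full.get_n_leaves(), '->', pruned.get_n_leaves())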

All of the code and data sets have already been pushed to GitHub. You can run them yourself.

