How Pruning Works in Decision Trees

Decision tree algorithms produce understandable, readable decision rules, and this is one of their most important advantages. It also makes it possible to modify some of those rules. This modification is called pruning, and it is a common technique in applied machine learning. We can apply pruning to avoid overfitting and to improve performance. We will cover pruning techniques in this post.

Pruning can be handled in two ways: pre-pruning and post-pruning.


Pre-pruning

We covered regression trees in a previous post. We are going to use the same data set from that post, shown below.

Day Outlook Temp. Humidity Wind Golf Players
1 Sunny Hot High Weak 25
2 Sunny Hot High Strong 30
3 Overcast Hot High Weak 46
4 Rain Mild High Weak 45
5 Rain Cool Normal Weak 52
6 Rain Cool Normal Strong 23
7 Overcast Cool Normal Strong 43
8 Sunny Mild High Weak 35
9 Sunny Cool Normal Weak 38
10 Rain Mild Normal Weak 46
11 Sunny Mild Normal Strong 48
12 Overcast Mild High Strong 52
13 Overcast Hot Normal Weak 44
14 Rain Mild High Strong 30

Running the regression tree algorithm constructs the following decision tree.

def findDecision(Outlook, Temp, Humidity, Wind):
 if Outlook == 'Rain':
  if Wind == 'Weak':
   if Humidity <= 95:
    if Temp <= 83:
     return 46
   if Humidity > 95:
    return 45
  if Wind == 'Strong':
   if Temp <= 83:
    if Humidity <= 95:
     return 23
 if Outlook == 'Sunny':
  if Temp <= 83:
   if Wind == 'Weak':
    if Humidity <= 95:
     return 35
   if Wind == 'Strong':
    if Humidity <= 95:
     return 30
  if Temp > 83:
   return 25
 if Outlook == 'Overcast':
  if Wind == 'Weak':
   if Temp <= 83:
    if Humidity <= 95:
     return 46
  if Wind == 'Strong':
   if Temp <= 83:
    if Humidity <= 95:
     return 43

As seen, a huge tree is created. This is a typical problem with regression trees: the decision rules at the bottom cover only a few instances, or even a single one, which causes overfitting. Here we can apply an early stop: we might check the number of instances in the current branch, or the ratio of the standard deviation of the current branch to the standard deviation of the entire data set.

#early stop: a branch with fewer than 5 instances becomes a leaf
if algorithm == 'Regression' and subdataset.shape[0] < 5:
#alternative criterion: the branch's standard deviation is small relative to the whole data set
#if algorithm == 'Regression' and subdataset['Decision'].std(ddof=0)/global_stdev < 0.4:
 final_decision = subdataset['Decision'].mean() #get the average of the branch
 terminateBuilding = True
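
For instance, we can verify the commented-out standard deviation criterion by hand. Below is a minimal check with pandas, assuming the 0.4 threshold and the ddof=0 population standard deviation from the snippet above:

import pandas as pd

#all 14 golf players values from the table above
decisions = pd.Series([25, 30, 46, 45, 52, 23, 43, 35, 38, 46, 48, 52, 44, 30])
global_stdev = decisions.std(ddof=0) #roughly 9.32

#the overcast branch covers days 3, 7, 12 and 13
overcast = pd.Series([46, 43, 52, 44])
ratio = overcast.std(ddof=0) / global_stdev #roughly 3.49 / 9.32 = 0.37

#0.37 is less than the 0.4 threshold, so this criterion would also
#terminate building here and return the branch average as the decision
print(round(ratio, 2), overcast.mean()) #0.37 46.25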

Enabling an early stop when the sub data set in the current branch has fewer than, say, 5 instances constructs the following decision tree. As seen, more generalized decision rules are created, which avoids overfitting.

def findDecision(Outlook, Temp, Humidity, Wind):
 if Outlook == 'Rain':
  if Wind == 'Weak':
   return 47.666666666666664
  if Wind == 'Strong':
   return 26.5
 if Outlook == 'Sunny':
  if Temp <= 83:
   return 37.75
  if Temp > 83:
   return 25
 if Outlook == 'Overcast':
  return 46.25
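
For comparison, mainstream libraries expose pre-pruning as hyperparameters. The sketch below uses scikit-learn, which is an assumption on my part rather than the library used in this post; its min_samples_split=5 refuses to split any node with fewer than 5 instances, just like the early stop above:

import pandas as pd
from sklearn.tree import DecisionTreeRegressor

#the golf players data set from the table above
rows = [['Sunny', 'Hot', 'High', 'Weak', 25], ['Sunny', 'Hot', 'High', 'Strong', 30],
        ['Overcast', 'Hot', 'High', 'Weak', 46], ['Rain', 'Mild', 'High', 'Weak', 45],
        ['Rain', 'Cool', 'Normal', 'Weak', 52], ['Rain', 'Cool', 'Normal', 'Strong', 23],
        ['Overcast', 'Cool', 'Normal', 'Strong', 43], ['Sunny', 'Mild', 'High', 'Weak', 35],
        ['Sunny', 'Cool', 'Normal', 'Weak', 38], ['Rain', 'Mild', 'Normal', 'Weak', 46],
        ['Sunny', 'Mild', 'Normal', 'Strong', 48], ['Overcast', 'Mild', 'High', 'Strong', 52],
        ['Overcast', 'Hot', 'Normal', 'Weak', 44], ['Rain', 'Mild', 'High', 'Strong', 30]]
df = pd.DataFrame(rows, columns=['Outlook', 'Temp', 'Humidity', 'Wind', 'Decision'])

#scikit-learn trees expect numeric features, so one-hot encode the categories
X = pd.get_dummies(df[['Outlook', 'Temp', 'Humidity', 'Wind']])
y = df['Decision']

#pre-pruning: a node covering fewer than 5 instances becomes a leaf
tree = DecisionTreeRegressor(min_samples_split=5, random_state=0).fit(X, y)
print(tree.get_n_leaves())

The resulting binary tree will not match the multiway tree above exactly, since scikit-learn splits on one feature value at a time, but the early stopping behaves the same way.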

Post-pruning

We covered the C4.5 decision tree algorithm in a previous post. Suppose that we are going to work on the following data set.

Day Outlook Temp. Humidity Wind Decision
1 Sunny 85 85 Weak No
2 Sunny 80 90 Strong No
3 Overcast 83 78 Weak Yes
4 Rain 70 96 Weak Yes
5 Rain 68 80 Weak Yes
6 Rain 65 70 Strong No
7 Overcast 64 65 Strong Yes
8 Sunny 72 95 Weak No
9 Sunny 69 70 Weak Yes
10 Rain 75 80 Weak Yes
11 Sunny 75 70 Strong Yes
12 Overcast 72 90 Strong Yes
13 Overcast 81 75 Weak Yes
14 Rain 71 80 Strong No

The C4.5 algorithm constructs the following decision tree. Notice that the built tree has a different form than the one in the related blog post, because we picked the information gain metric there but the gain ratio metric here.

def findDecision(Outlook, Temp, Humidity, Wind):
 if Temp <= 83:
  if Outlook == 'Rain':
   if Wind == 'Weak':
    return 'Yes'
   if Wind == 'Strong':
    return 'No'
  if Outlook == 'Overcast':
   return 'Yes'
  if Outlook == 'Sunny':
   if Humidity > 65:
    if Wind == 'Strong':
     return 'Yes'
    if Wind == 'Weak':
     return 'Yes'
 if Temp > 83:
  return 'No'

Here, please focus on the decisions when temperature is less than or equal to 83 and the outlook is sunny. This branch makes a positive decision no matter what the wind is, yet it still checks the wind feature. We can prune the wind check at that level. Also, you might notice that there is no answer when humidity is less than or equal to 65; this is because 65 is the minimum value the humidity feature takes in the data set, so no instance ever falls into that branch. We can prune this rule, too. The final form of the decision tree is illustrated below.

def findDecision(Outlook, Temp, Humidity, Wind):
 if Temp <= 83:
  if Outlook == 'Rain':
   if Wind == 'Weak':
    return 'Yes'
   if Wind == 'Strong':
    return 'No'
  if Outlook == 'Overcast':
   return 'Yes'
  if Outlook == 'Sunny':
   return 'Yes'
 if Temp > 83:
  return 'No'
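
Programmatically, this kind of bottom-up pruning can be sketched with a small recursive helper. The function below is hypothetical rather than taken from the post's code; it represents a tree as nested dicts and collapses any test whose subtrees all return the same decision:

def prune(tree):
    #a tree is either a leaf decision or a dict mapping feature tests to subtrees
    if not isinstance(tree, dict):
        return tree #leaves stay as they are
    #prune the children first, so collapsing proceeds from the leaves to the root
    tree = {test: prune(subtree) for test, subtree in tree.items()}
    children = list(tree.values())
    #if every child is a leaf and they all return the same decision, the test is redundant
    if all(not isinstance(c, dict) for c in children) and len(set(children)) == 1:
        return children[0]
    return tree

#the sunny branch above: humidity and wind are checked, but the answer is always 'Yes'
sunny = {"Humidity > 65": {"Wind == 'Strong'": 'Yes', "Wind == 'Weak'": 'Yes'}}
print(prune(sunny)) #prints Yes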

This modification improves the runtime performance of the decision tree: fewer feature checks are evaluated, while the returned decisions remain exactly the same.

We have pruned these decision rules because the parent branch already returns the same decision for all of them. But this is not the only criterion: you should also prune branches that derive from just a few instances in the training set. This helps to avoid overfitting.

To sum up, post-pruning consists of building the decision tree first and then pruning some decision rules from the leaves back toward the root. In contrast, pre-pruning is handled while the decision tree is being built. In both cases, less complex trees are created, so the decision rules run faster. This also helps to avoid overfitting.
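
As a side note, scikit-learn ships post-pruning as minimal cost-complexity pruning, a different algorithm from the rule-based pruning above but with the same goal. A minimal sketch, reusing X and y from the pre-pruning sketch:

from sklearn.tree import DecisionTreeRegressor

#grow the full tree first, then prune: larger ccp_alpha values remove more
#of the weakest subtrees
full = DecisionTreeRegressor(random_state=0).fit(X, y)
path = full.cost_complexity_pruning_path(X, y) #candidate alpha values
pruned = DecisionTreeRegressor(random_state=0, ccp_alpha=path.ccp_alphas[-2]).fit(X, y)
print(full.get_n_leaves(), '->', pruned.get_n_leaves())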

All of the code and data sets have already been pushed to GitHub. You can run them yourself.

