A Step by Step Adaboost Example

Adaptive boosting, or adaboost for short, is an award-winning boosting algorithm. The principle is simple: a single weak worker cannot move a heavy rock, but many weak workers together can move heavy rocks and build a pyramid. The algorithm runs weak learners sequentially. A weak learner could even be a single-layer perceptron; it cannot solve non-linear problems on its own, but using weak learners sequentially enables adaboost to solve non-linear problems. The trick is to increase the weights of incorrectly classified instances and to decrease the weights of correctly classified ones between rounds. Adaboost is not tied to decision trees. You might use a 1-level decision tree (a decision stump), but this is not a must.

tug-of-war
Tug of war

Adaboost in Python

This blog post explains the adaboost algorithm in depth, and we will solve a problem step by step. On the other hand, you might just want to run the adaboost algorithm and skip its mathematical background.
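
If that is the case, a ready-made implementation such as scikit-learn's AdaBoostClassifier can be used directly. Here is a minimal sketch on the data set of this post; note that scikit-learn is not the package used in the rest of the walkthrough, and the query point is arbitrary.

from sklearn.ensemble import AdaBoostClassifier

#the same 2-dimensional data set used in this post, with true = +1 and false = -1
X = [[2, 3], [2.1, 2], [4.5, 6], [4, 3.5], [3.5, 1], [5, 7], [5, 3], [6, 5.5], [8, 6], [8, 2]]
y = [1, 1, 1, -1, -1, 1, -1, 1, -1, -1]

#decision stumps are the default weak learners of AdaBoostClassifier
model = AdaBoostClassifier(n_estimators=4, random_state=0)
model.fit(X, y)
print(model.predict([[3, 4]])) #classify an arbitrary new point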


🙋‍♂️ You may consider enrolling in my top-rated machine learning course on Udemy:

Decision Trees for Machine Learning

Herein, you can find the Python implementation of the adaboost algorithm here. This package supports regular decision tree algorithms such as ID3, C4.5, CART, CHAID and regression trees, as well as bagging methods such as random forest and boosting methods such as gradient boosting. You can support this work by just starring⭐️ the GitHub repository.

Data set

We are going to work on the following data set. Each instance is represented in a 2-dimensional feature space, and we also have its class value. You can find the raw data set here.

x1 x2 Decision
2 3 true
2.1 2 true
4.5 6 true
4 3.5 false
3.5 1 false
5 7 true
5 3 false
6 5.5 true
8 6 false
8 2 false

True classes are encoded as +1 and false classes as -1 in the Decision column.

Plotting the data set

We should plot the features and class values to understand the data set clearly.

import pandas as pd
import matplotlib.pyplot as plt

#the Decision column is expected to store +1 for true and -1 for false classes
df = pd.read_csv("dataset/adaboost.txt")

positives = df[df['Decision'] >= 0]
negatives = df[df['Decision'] < 0]

#plot true classes as blue plus markers and false classes as red minus markers
plt.scatter(positives['x1'], positives['x2'], marker='+', s=500*abs(positives['Decision']), c='blue')
plt.scatter(negatives['x1'], negatives['x2'], marker='_', s=500*abs(negatives['Decision']), c='red')
plt.show()

This code block produces the following graph. As seen, true classes are marked with plus characters whereas false classes are marked with minus characters.

adaboost-dataset
Data set for adaboost

We would like to separate the true and false classes. This is not a linearly separable problem, so linear classifiers such as perceptrons or decision stumps cannot solve it on their own. Herein, adaboost enables linear classifiers to solve this problem.

Decision stumps

Decision trees approach problems with a divide and conquer method. They might have lots of nested decision rules, which makes them non-linear classifiers. In contrast, decision stumps are 1-level decision trees, and they are linear classifiers just like (single-layer) perceptrons. For example, you might decide that if someone's height is greater than 1.70 meters (5.57 feet), then they are male; otherwise, female. This decision stump would classify gender correctly with at least 50% accuracy. That's why such classifiers are called weak learners.

decision-stump-for-gender
Decision stump for gender
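
That gender rule can be written as a tiny function; this is just a toy sketch to show how simple a decision stump is.

def findGender(height): #height in meters
 if height > 1.70: return 'male'
 else: return 'female'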

Decision stump building

I’ve modified my decision tree repository to handle decision stumps. Basically, the buildDecisionTree function calls itself recursively until it reaches a decision. I terminate this recursive calling after the first level if adaboost is enabled.

Regression for adaboost

The main principle in adaboost is to increase the weights of misclassified instances and to decrease the weights of correctly classified ones. However, we are working on a classification problem and the target values in the data set are nominal. That's why we are going to transform the problem into a regression task: I will set true classes to +1 and false classes to -1 to handle this.

Initially, we distribute the weights uniformly. I set the weight of every instance to 1/n, where n is the total number of instances.

x1 x2 actual weight weighted_actual
2 3 1 0.1 0.1
2 2 1 0.1 0.1
4 6 1 0.1 0.1
4 3 -1 0.1 -0.1
4 1 -1 0.1 -0.1
5 7 1 0.1 0.1
5 3 -1 0.1 -0.1
6 5 1 0.1 0.1
8 6 -1 0.1 -0.1
8 2 -1 0.1 -0.1
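
The table above can be reproduced with a few lines of pandas; this is a sketch built on the df frame we loaded while plotting, and the column names are illustrative.

#true is stored as +1 and false as -1 in the Decision column
df['actual'] = df['Decision']
df['weight'] = 1 / len(df) #uniform initial weights, 1/n with n = 10
df['weighted_actual'] = df['weight'] * df['actual']
print(df[['x1', 'x2', 'actual', 'weight', 'weighted_actual']])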

The weighted_actual column stores weight times actual value for each line. Now, we are going to use weighted actual as the target value whereas x1 and x2 are the features to build a decision stump. The following rule set is created when I run the decision stump algorithm.

def findDecision(x1, x2):
 if x1>2.1: return -0.025
 if x1<=2.1: return 0.1 

We’ve set the actual values to ±1, but the decision stump returns decimal values. Here, the trick is to apply the sign function to the stump's output.

 def sign(x):
  if x > 0: return 1
  elif x < 0: return -1
  else: return 0

To sum up, prediction will be sign(-0.025) = -1 when x1 is greater than 2.1, and it will be sign(0.1) = +1 when x1 is less than or equal to 2.1.

I’ll add the predictions as a column. I also check whether actual and prediction match in the loss column: it will be 0 if the prediction is correct and 1 if the prediction is incorrect.

x1 x2 actual weight weighted_actual prediction loss weight * loss
2 3 1 0.1 0.1 1 0 0
2 2 1 0.1 0.1 1 0 0
4 6 1 0.1 0.1 -1 1 0.1
4 3 -1 0.1 -0.1 -1 0 0
4 1 -1 0.1 -0.1 -1 0 0
5 7 1 0.1 0.1 -1 1 0.1
5 3 -1 0.1 -0.1 -1 0 0
6 5 1 0.1 0.1 -1 1 0.1
8 6 -1 0.1 -0.1 -1 0 0
8 2 -1 0.1 -0.1 -1 0 0

The sum of the weight * loss column gives the total error, which is 0.3 in this case. Here, we'll define a new variable alpha: half of the natural logarithm of (1 - epsilon)/epsilon, where epsilon denotes the total error.

alpha = ln[(1 - epsilon)/epsilon] / 2 = ln[(1 - 0.3)/0.3] / 2

alpha = 0.42

We’ll use alpha to update weights in the next round.
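
These calculations can be expressed with a short sketch; it continues from the df frame above and assumes the findDecision and sign functions defined earlier in this round.

import math

df['prediction'] = df.apply(lambda row: sign(findDecision(row['x1'], row['x2'])), axis=1)
df['loss'] = (df['prediction'] != df['actual']).astype(int) #0 if correct, 1 if incorrect
epsilon = (df['weight'] * df['loss']).sum() #total error, 0.3 in round 1
alpha = math.log((1 - epsilon) / epsilon) / 2 #0.42 in round 1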

w_(i+1) = w_i * exp(-alpha * actual * prediction), where i refers to the boosting round.

Also, the sum of the weights must be equal to 1. That's why we have to normalize the weight values: dividing each weight by the sum of the w_(i+1) column gives the normalized weights.

x1 x2 actual weight prediction w_(i+1) norm(w_(i+1))
2 3 1 0.1 1 0.065 0.071
2 2 1 0.1 1 0.065 0.071
4 6 1 0.1 -1 0.153 0.167
4 3 -1 0.1 -1 0.065 0.071
4 1 -1 0.1 -1 0.065 0.071
5 7 1 0.1 -1 0.153 0.167
5 3 -1 0.1 -1 0.065 0.071
6 5 1 0.1 -1 0.153 0.167
8 6 -1 0.1 -1 0.065 0.071
8 2 -1 0.1 -1 0.065 0.071
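
The update and normalization steps look like this in code; the sketch continues from the previous snippet.

import numpy as np

df['w_next'] = df['weight'] * np.exp(-alpha * df['actual'] * df['prediction'])
df['w_next'] = df['w_next'] / df['w_next'].sum() #normalize so that the weights sum to 1
df['weight'] = df['w_next'] #these become the weights of the next round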

This round is over.

Round 2

In this round, I move the normalized w_(i+1) column into the weight column and build a new decision stump. Still, x1 and x2 are the features whereas weighted actual is the target value.

x1 x2 actual weight weighted_actual
2 3 1 0.071 0.071
2 2 1 0.071 0.071
4 6 1 0.167 0.167
4 3 -1 0.071 -0.071
4 1 -1 0.071 -0.071
5 7 1 0.167 0.167
5 3 -1 0.071 -0.071
6 5 1 0.167 0.167
8 6 -1 0.071 -0.071
8 2 -1 0.071 -0.071

The graph of the new data set is shown below. The weights of correctly classified instances decreased whereas the weights of misclassified ones increased.

adaboost-dataset-v2
Data set with tuned weights

The following decision stump will be built for this data set.

def findDecision(x1, x2):
 if x2<=3.5: return -0.02380952380952381
 if x2>3.5: return 0.10714285714285714

I’ve applied the sign function to the predictions. Then, I added the loss and weight * loss values as columns.

x1 x2 actual weight prediction loss weight * loss
2 3 1 0.071 -1 1 0.071
2 2 1 0.071 -1 1 0.071
4 6 1 0.167 1 0 0.000
4 3 -1 0.071 -1 0 0.000
4 1 -1 0.071 -1 0 0.000
5 7 1 0.167 1 0 0.000
5 3 -1 0.071 -1 0 0.000
6 5 1 0.167 1 0 0.000
8 6 -1 0.071 1 1 0.071
8 2 -1 0.071 -1 0 0.000

I can calculate error and alpha values for round 2.

epsilon = 0.21, alpha = 0.65

So, weights for the following round can be found.

x1 x2 actual weight prediction w_(i+1) norm(w_(i+1))
2 3 1 0.071 -1 0.137 0.167
2 2 1 0.071 -1 0.137 0.167
4 6 1 0.167 1 0.087 0.106
4 3 -1 0.071 -1 0.037 0.045
4 1 -1 0.071 -1 0.037 0.045
5 7 1 0.167 1 0.087 0.106
5 3 -1 0.071 -1 0.037 0.045
6 5 1 0.167 1 0.087 0.106
8 6 -1 0.071 1 0.137 0.167
8 2 -1 0.071 -1 0.037 0.045

Round 3

I skipped the detailed calculations for the following rounds and only show the resulting tables, error and alpha values, and decision stumps.

x1 x2 actual weight prediction loss w * loss w_(i+1) norm(w_(i+1))
2 3 1 0.167 1 0 0.000 0.114 0.122
2 2 1 0.167 1 0 0.000 0.114 0.122
4 6 1 0.106 -1 1 0.106 0.155 0.167
4 3 -1 0.045 -1 0 0.000 0.031 0.033
4 1 -1 0.045 -1 0 0.000 0.031 0.033
5 7 1 0.106 -1 1 0.106 0.155 0.167
5 3 -1 0.045 -1 0 0.000 0.031 0.033
6 5 1 0.106 -1 1 0.106 0.155 0.167
8 6 -1 0.167 -1 0 0.000 0.114 0.122
8 2 -1 0.045 -1 0 0.000 0.031 0.033

epsilon = 0.31, alpha = 0.38

adaboost-dataset-v3
Weights in round 3
def findDecision(x1, x2):
 if x1>2.1:
  return -0.003787878787878794
 if x1<=2.1:
  return 0.16666666666666666

Round 4

x1 x2 actual weight prediction loss w * loss w_(i+1) norm(w_(i+1))
2 3 1 0.122 1 0 0.000 0.041 0.068
2 2 1 0.122 1 0 0.000 0.041 0.068
4 6 1 0.167 1 0 0.000 0.056 0.093
4 3 -1 0.033 1 1 0.033 0.100 0.167
4 1 -1 0.033 1 1 0.033 0.100 0.167
5 7 1 0.167 1 0 0.000 0.056 0.093
5 3 -1 0.033 1 1 0.033 0.100 0.167
6 5 1 0.167 1 0 0.000 0.056 0.093
8 6 -1 0.122 -1 0 0.000 0.041 0.068
8 2 -1 0.033 -1 0 0.000 0.011 0.019

epsilon = 0.10, alpha = 1.10

adaboost-dataset-v4
Weights in round 4
def findDecision(x1,x2):
 if x1<=6.0: 
   return 0.08055555555555555 
 if x1>6.0:
   return -0.07777777777777778

Prediction

The sum of each round's alpha times that round's prediction gives the final score, and its sign gives the final prediction.

round 1 alpha round 2 alpha round 3 alpha round 4 alpha
0.42 0.65 0.38 1.1
round 1 prediction round 2 prediction  round 3 prediction round 4 prediction
1 -1 1 1
1 -1 1 1
-1 1 -1 1
-1 -1 -1 1
-1 -1 -1 1
-1 1 -1 1
-1 -1 -1 1
-1 1 -1 1
-1 1 -1 -1
-1 -1 -1 -1

For example, prediction of the 1st instance will be

0.42 x 1 + 0.65 x (-1) + 0.38 x 1 + 1.1 x 1 = 1.25

Then, we apply the sign function:

sign(1.25) = +1, i.e. true, so this instance is correctly classified.
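
In code, the final decision is just the sign of the alpha-weighted sum of the weak classifiers' outputs. The sketch below hard-codes the four decision stumps found above as lambdas (with rounded return values) and reuses the sign function defined earlier.

#one lambda per round, mirroring the findDecision rules found above
stumps = [
 lambda x1, x2: -0.025 if x1 > 2.1 else 0.1, #round 1
 lambda x1, x2: -0.0238 if x2 <= 3.5 else 0.1071, #round 2
 lambda x1, x2: -0.0038 if x1 > 2.1 else 0.1667, #round 3
 lambda x1, x2: 0.0806 if x1 <= 6.0 else -0.0778 #round 4
]
alphas = [0.42, 0.65, 0.38, 1.10]

def predict(x1, x2):
 total = sum(alpha * sign(stump(x1, x2)) for alpha, stump in zip(alphas, stumps))
 return sign(total)

print(predict(2, 3)) #prints 1, i.e. true, for the 1st instance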

We can summarize the adaboost calculations as illustrated below. There are 4 different (weak) classifiers and their multiplier alphas. Instances located in the blue area will be classified as true whereas those located in the red area will be classified as false.

adaboost-calculations

If we apply this calculation to all instances, every instance is classified correctly. In the same way, you can find the decision for a new instance that does not appear in the training set.

adaboost-final-decision-v2
Final decision

Adaboost vs Gradient Boosting

Both gradient boosting and adaboost are boosting techniques for decision tree based machine learning models. We will discuss how they are similar to and how they differ from each other.

Pruning

You might realize that round 1 and round 3 produce the same decision rule. Pruning in adaboost proposes removing such duplicate weak classifiers to improve performance. Besides, you should increase the multiplier alpha of the remaining one. In this case, I remove round 3 and add its coefficient to round 1.
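
In the prediction sketch above, this pruning boils down to dropping the duplicate stump and transferring its alpha:

#round 3 duplicates round 1, so add its alpha to round 1 and drop it
alphas[0] += alphas[2] #0.42 + 0.38 = 0.80
del stumps[2], alphas[2]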

adaboost-calculations-pruned-v2

Even though we’ve used linear weak classifiers, all instances can be classified correctly.

Feature Importance

Explainable AI and machine learning interpretability are among the hottest topics in the data world nowadays. Herein, adaboost is a naturally explainable algorithm because it is based on regular decision trees, and decision trees are explainable.
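
As a quick illustration, reusing the scikit-learn AdaBoostClassifier fitted in the sketch near the top of this post (not the chefboost package), the contribution of each feature can be read from the fitted ensemble:

#importances are aggregated over the weak learners of the ensemble
for name, importance in zip(['x1', 'x2'], model.feature_importances_):
 print(name, importance)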

Adaboost in real world

Face recognition is a hot topic in deep learning nowadays. Face detection and face alignment are mandatory early stages of a face recognition pipeline. The most common classical method for face detection is the haar cascade, and it is mainly based on the adaboost algorithm.

haar-cascade-schema
Haar cascade schema map
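
For example, OpenCV ships pre-trained haar cascades; a minimal face detection sketch looks like this, where the image path is a placeholder.

import cv2

#load the pre-trained frontal face cascade shipped with OpenCV
cascade_path = cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
face_detector = cv2.CascadeClassifier(cascade_path)

img = cv2.imread("img.jpg") #placeholder image path
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
faces = face_detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

for (x, y, w, h) in faces:
 cv2.rectangle(img, (x, y), (x + w, y + h), (255, 0, 0), 2)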

You can see the performance of the haar cascade algorithm in the following video. Even though more modern algorithms exist, haar cascade is still a promising one.

Conclusion

So, we’ve covered the adaptive boosting algorithm. In this example, we used decision stumps as weak classifiers. You might use perceptrons for more complex data sets. I’ve pushed the adaboost logic into my GitHub repository.

Special thanks to Olga Veksler; her lecture notes helped me to understand this concept.

Adaboost in Ancient Egypt

