A Step by Step Adaboost Example

Adaptive boosting, or adaboost for short, is an award-winning boosting algorithm. The principle is simple: a single weak worker cannot move a heavy rock, but many weak workers together can move heavy rocks and build a pyramid. The algorithm expects to run weak learners. This could be a single-layer perceptron, too. A weak learner cannot solve non-linear problems on its own, but using weak learners sequentially enables us to solve non-linear problems. The trick is to increase the weights of incorrect decisions and to decrease the weights of correct decisions between rounds. Adaboost is not tied to decision trees. You might use a 1-level decision tree (a decision stump), but this is not a must.

Tug of war

Adaboost in Python

This blog post explains the adaboost algorithm in depth, and we will solve a problem step by step. On the other hand, you might just want to run the adaboost algorithm without caring about its mathematical background.


🙋‍♂️ You may consider enrolling in my top-rated machine learning course on Udemy

Decision Trees for Machine Learning

You can find the python implementation of the adaboost algorithm here. This package supports regular decision tree algorithms such as ID3, C4.5, CART, CHAID and regression trees, as well as bagging methods such as random forest and some boosting methods such as gradient boosting. You can support this work by just starring⭐️ the GitHub repository.
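For example, a minimal adaboost run with this package might look like the following sketch. The configuration keys shown here (enableAdaboost, num_of_weak_classifier) are illustrative of the package's config-dictionary style; please check the repository README for the exact, up-to-date options.

from chefboost import Chefboost as chef
import pandas as pd

df = pd.read_csv("dataset/adaboost.txt")

# config keys are illustrative; see the repository README for the exact options
config = {'enableAdaboost': True, 'num_of_weak_classifier': 4}

model = chef.fit(df, config = config)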

Data set

We are going to work on the following data set. Each instance is represented as a point in 2-dimensional space, and we also have its class value. You can find the raw data set here.

x1 x2 Decision
2 3 true
2.1 2 true
4.5 6 true
4 3.5 false
3.5 1 false
5 7 true
5 3 false
6 5.5 true
8 6 false
8 2 false

True classes are replaced with +1, and false classes are replaced with -1 in the Decision column.

Plotting the data set

We should plot the features and class values to understand the data set clearly.

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

df = pd.read_csv("dataset/adaboost.txt")

# split instances by class (Decision is +1 for true, -1 for false)
positives = df[df['Decision'] >= 0]
negatives = df[df['Decision'] < 0]

# plot true classes as blue plus markers and false classes as red minus markers
plt.scatter(positives['x1'], positives['x2'], marker='+', s=500*abs(positives['Decision']), c='blue')
plt.scatter(negatives['x1'], negatives['x2'], marker='_', s=500*abs(negatives['Decision']), c='red')
plt.show()

This code block produces the following graph. As seen, true classes are marked with plus characters whereas false classes are marked with minus characters.

Data set for adaboost

We would like to separate the true and false classes. This is not a linearly separable problem, so a single linear classifier such as a perceptron or a decision stump cannot classify it. Herein, adaboost enables linear classifiers to solve this problem.

Decision stumps

Decision trees approach problems with a divide-and-conquer method. They might have lots of nested decision rules. This makes them non-linear classifiers. In contrast, decision stumps are 1-level decision trees. They are linear classifiers, just like (single-layer) perceptrons. You might decide that if someone's height is greater than 1.70 meters (about 5.6 feet), then they are male; otherwise, they are female. This decision stump would classify gender correctly with somewhat better than 50% accuracy. That's why these classifiers are called weak learners.

Decision stump for gender
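As a toy illustration, that gender stump could be coded as a single if-else. This is just a sketch with a hand-picked threshold, not a trained model.

def gender_stump(height_in_meters):
    # a 1-level decision: a single split on one feature
    if height_in_meters > 1.70:
        return "male"
    return "female"

print(gender_stump(1.83))  # male
print(gender_stump(1.62))  # female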

Decision stump building

I’ve modified my decision tree repository to handle decision stumps. Basically, the buildDecisionTree function calls itself recursively until it reaches a decision. I terminate this recursive calling after the first split when adaboost is enabled.
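The idea can be sketched in a simplified, self-contained form as below. It builds regression style splits by standard deviation reduction and cuts the recursion right after the first split when adaboost is enabled. The names and structure are illustrative, not the exact code of the repository.

import numpy as np

def find_best_split(x, y):
    # pick the (feature, threshold) pair with the highest standard deviation reduction
    best_gain, best_split = 0, None
    for feature in range(x.shape[1]):
        for threshold in np.unique(x[:, feature])[:-1]:
            mask = x[:, feature] <= threshold
            left, right = y[mask], y[~mask]
            weighted_std = (len(left) * left.std() + len(right) * right.std()) / len(y)
            gain = y.std() - weighted_std
            if gain > best_gain:
                best_gain, best_split = gain, (feature, threshold)
    return best_split

def build_tree(x, y, depth=0, adaboost=False):
    # adaboost mode: terminate the recursion right after the first split (decision stump)
    if adaboost and depth >= 1:
        return y.mean()
    split = find_best_split(x, y)
    if split is None:  # nothing useful left to split on, so return a leaf (average target)
        return y.mean()
    feature, threshold = split
    mask = x[:, feature] <= threshold
    return {"feature": feature, "threshold": threshold,
            "left": build_tree(x[mask], y[mask], depth + 1, adaboost),
            "right": build_tree(x[~mask], y[~mask], depth + 1, adaboost)}

# usage sketch: X is an (n, 2) numpy array of x1, x2 and y holds the target values
# stump = build_tree(X, y, adaboost=True)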

Regression for adaboost

The main principle of adaboost is to increase the weights of misclassified instances and to decrease the weights of correctly classified ones between rounds. But we are working on a classification problem, and the target values in the data set are nominal. That's why we are going to transform the problem into a regression task: I will set true classes to +1 and false classes to -1 to handle this.

Initially, we distribute the weights uniformly. I set the weight of every instance to 1/n, where n is the total number of instances.
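Continuing with the dataframe df from the plotting code above (Decision already holds +1 / -1), this initialization is just a couple of lines:

n = len(df)

# uniform initial weights: every instance gets 1/n
df['weight'] = 1.0 / n

# regression target for the stump: weight times the actual (+1 / -1) class
df['weighted_actual'] = df['weight'] * df['Decision']

print(df[['x1', 'x2', 'Decision', 'weight', 'weighted_actual']])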

x1 x2 actual weight weighted_actual
2 3 1 0.1 0.1
2 2 1 0.1 0.1
4 6 1 0.1 0.1
4 3 -1 0.1 -0.1
4 1 -1 0.1 -0.1
5 7 1 0.1 0.1
5 3 -1 0.1 -0.1
6 5 1 0.1 0.1
8 6 -1 0.1 -0.1
8 2 -1 0.1 -0.1

The weighted actual column stores weight times actual value for each line. Now, we are going to use weighted actual as the target value, with x1 and x2 as features, to build a decision stump. The following rule set is created when I run the decision stump algorithm.

def findDecision(x1, x2):
 if x1>2.1: return -0.025
 if x1<=2.1: return 0.1 

We set the actual values to ±1, but the decision stump returns decimal values. The trick here is to apply the sign function to its output.

def sign(x):
    if x > 0:
        return 1
    elif x < 0:
        return -1
    else:
        return 0

To sum up, prediction will be sign(-0.025) = -1 when x1 is greater than 2.1, and it will be sign(0.1) = +1 when x1 is less than or equal to 2.1.

I'll put the predictions in a column. Also, I check the equality of actual and prediction in the loss column: it will be 0 if the prediction is correct and 1 if the prediction is incorrect.

x1 x2 actual weight weighted_actual prediction loss weight * loss
2 3 1 0.1 0.1 1 0 0
2 2 1 0.1 0.1 1 0 0
4 6 1 0.1 0.1 -1 1 0.1
4 3 -1 0.1 -0.1 -1 0 0
4 1 -1 0.1 -0.1 -1 0 0
5 7 1 0.1 0.1 -1 1 0.1
5 3 -1 0.1 -0.1 -1 0 0
6 5 1 0.1 0.1 -1 1 0.1
8 6 -1 0.1 -0.1 -1 0 0
8 2 -1 0.1 -0.1 -1 0 0

The sum of the weight times loss column gives the total error, epsilon. It is 0.3 in this case. Here, we'll define a new variable, alpha. It is half of the natural logarithm of (1 – epsilon)/epsilon.

alpha = ln[(1-epsilon)/epsilon] / 2 = ln[(1 – 0.3)/0.3] / 2

alpha = 0.42

We’ll use alpha to update weights in the next round.

For each instance, the weight of the next round is w_(i+1) = w_i * exp(-alpha * actual * prediction), where i refers to the round number.

Also, the sum of the weights must be equal to 1. That's why we have to normalize the weight values. Dividing each weight by the sum of the weight column performs the normalization.
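Putting the round 1 bookkeeping together, the error, alpha and weight update calculations can be sketched as below, continuing with the same dataframe:

import numpy as np

# round 1 stump output passed through the sign function
df['prediction'] = np.sign(df['x1'].apply(lambda x1: -0.025 if x1 > 2.1 else 0.1))

# loss is 1 for misclassified instances, 0 otherwise
df['loss'] = (df['prediction'] != df['Decision']).astype(int)

# total error and the alpha of this weak classifier
epsilon = (df['weight'] * df['loss']).sum()    # 0.3
alpha = 0.5 * np.log((1 - epsilon) / epsilon)  # ~0.42

# weight update: misclassified instances get heavier, correct ones get lighter
df['weight'] = df['weight'] * np.exp(-alpha * df['Decision'] * df['prediction'])

# normalize so that the weights sum to 1 again
df['weight'] = df['weight'] / df['weight'].sum()

print(df[['x1', 'x2', 'Decision', 'weight']])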

x1 x2 actual weight prediction w_(i+1) norm(w_(i+1))
2 3 1 0.1 1 0.065 0.071
2 2 1 0.1 1 0.065 0.071
4 6 1 0.1 -1 0.153 0.167
4 3 -1 0.1 -1 0.065 0.071
4 1 -1 0.1 -1 0.065 0.071
5 7 1 0.1 -1 0.153 0.167
5 3 -1 0.1 -1 0.065 0.071
6 5 1 0.1 -1 0.153 0.167
8 6 -1 0.1 -1 0.065 0.071
8 2 -1 0.1 -1 0.065 0.071

This round is over.

Round 2

I move the normalized w_(i+1) column into the weight column in this round. Then, I build a new decision stump. Still, x1 and x2 are the features whereas weighted actual is the target value.

x1 x2 actual weight weighted_actual
2 3 1 0.071 0.071
2 2 1 0.071 0.071
4 6 1 0.167 0.167
4 3 -1 0.071 -0.071
4 1 -1 0.071 -0.071
5 7 1 0.167 0.167
5 3 -1 0.071 -0.071
6 5 1 0.167 0.167
8 6 -1 0.071 -0.071
8 2 -1 0.071 -0.071

The graph of the new data set is shown below. The weights of correctly classified instances decreased whereas the weights of misclassified ones increased.

Data set with tuned weights

The following decision stump will be built for this data set.

def findDecision(x1, x2):
    if x2 <= 3.5:
        return -0.02380952380952381
    if x2 > 3.5:
        return 0.10714285714285714

I've applied the sign function to the predictions. Then, I put the loss and weight times loss values in columns.

x1 x2 actual weight prediction loss weight * loss
2 3 1 0.071 -1 1 0.071
2 2 1 0.071 -1 1 0.071
4 6 1 0.167 1 0 0.000
4 3 -1 0.071 -1 0 0.000
4 1 -1 0.071 -1 0 0.000
5 7 1 0.167 1 0 0.000
5 3 -1 0.071 -1 0 0.000
6 5 1 0.167 1 0 0.000
8 6 -1 0.071 1 1 0.071
8 2 -1 0.071 -1 0 0.000

I can calculate error and alpha values for round 2.

epsilon = 0.21, alpha = 0.65

So, weights for the following round can be found.

x1 x2 actual weight prediction w_(i+1) norm(w_(i+1))
2 3 1 0.071 -1 0.137 0.167
2 2 1 0.071 -1 0.137 0.167
4 6 1 0.167 1 0.087 0.106
4 3 -1 0.071 -1 0.037 0.045
4 1 -1 0.071 -1 0.037 0.045
5 7 1 0.167 1 0.087 0.106
5 3 -1 0.071 -1 0.037 0.045
6 5 1 0.167 1 0.087 0.106
8 6 -1 0.071 1 0.137 0.167
8 2 -1 0.071 -1 0.037 0.045

Round 3

I skip the detailed calculations for the following rounds and just show the results.

x1 x2 actual weight prediction loss w * loss w_(i+1) norm(w_(i+1))
2 3 1 0.167 1 0 0.000 0.114 0.122
2 2 1 0.167 1 0 0.000 0.114 0.122
4 6 1 0.106 -1 1 0.106 0.155 0.167
4 3 -1 0.045 -1 0 0.000 0.031 0.033
4 1 -1 0.045 -1 0 0.000 0.031 0.033
5 7 1 0.106 -1 1 0.106 0.155 0.167
5 3 -1 0.045 -1 0 0.000 0.031 0.033
6 5 1 0.106 -1 1 0.106 0.155 0.167
8 6 -1 0.167 -1 0 0.000 0.114 0.122
8 2 -1 0.045 -1 0 0.000 0.031 0.033

epsilon = 0.31, alpha = 0.38

Weights in round 3

def findDecision(x1, x2):
 if x1>2.1:
  return -0.003787878787878794
 if x1<=2.1:
  return 0.16666666666666666

Round 4

x1 x2 actual weight prediction loss w * loss w_(i+1) norm(w_(i+1))
2 3 1 0.122 1 0 0.000 0.041 0.068
2 2 1 0.122 1 0 0.000 0.041 0.068
4 6 1 0.167 1 0 0.000 0.056 0.093
4 3 -1 0.033 1 1 0.033 0.100 0.167
4 1 -1 0.033 1 1 0.033 0.100 0.167
5 7 1 0.167 1 0 0.000 0.056 0.093
5 3 -1 0.033 1 1 0.033 0.100 0.167
6 5 1 0.167 1 0 0.000 0.056 0.093
8 6 -1 0.122 -1 0 0.000 0.041 0.068
8 2 -1 0.033 -1 0 0.000 0.011 0.019

epsilon = 0.10, alpha = 1.10

Weights in round 4

def findDecision(x1,x2):
 if x1<=6.0: 
   return 0.08055555555555555 
 if x1>6.0:
   return -0.07777777777777778

Prediction

The cumulative sum of each round's alpha times that round's prediction gives the final prediction.

round 1 alpha round 2 alpha round 3 alpha round 4 alpha
0.42 0.65 0.38 1.1
round 1 prediction round 2 prediction  round 3 prediction round 4 prediction
1 -1 1 1
1 -1 1 1
-1 1 -1 1
-1 -1 -1 1
-1 -1 -1 1
-1 1 -1 1
-1 -1 -1 1
-1 1 -1 1
-1 1 -1 -1
-1 -1 -1 -1

For example, prediction of the 1st instance will be

0.42 x 1 + 0.65 x (-1) + 0.38 x 1 + 1.1 x 1 = 1.25

And we will apply the sign function:

sign(1.25) = +1 aka true, which is correctly classified.
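The same aggregation can be sketched in a few lines of python. The alphas and the stump rules are the ones found in rounds 1 to 4 above.

import numpy as np

alphas = [0.42, 0.65, 0.38, 1.10]

# the four decision stumps found in rounds 1-4
stumps = [
    lambda x1, x2: -0.025 if x1 > 2.1 else 0.1,        # round 1
    lambda x1, x2: -0.0238 if x2 <= 3.5 else 0.1071,   # round 2
    lambda x1, x2: -0.0038 if x1 > 2.1 else 0.1667,    # round 3
    lambda x1, x2: 0.0806 if x1 <= 6.0 else -0.0778,   # round 4
]

def predict(x1, x2):
    # weighted vote: sign of the sum of alpha times the sign of each stump's output
    score = sum(a * np.sign(s(x1, x2)) for a, s in zip(alphas, stumps))
    return int(np.sign(score))

print(predict(2, 3))  # +1, i.e. true (score = 0.42 - 0.65 + 0.38 + 1.10 = 1.25)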

We can summarize the adaboost calculations as illustrated below. There are 4 different (weak) classifiers and their multiplier alphas. Instances located in the blue area will be classified as true whereas instances located in the red area will be classified as false.

Adaboost calculations

If we apply this calculation to all instances, every instance is classified correctly. In this way, you can also find the decision for a new instance that does not appear in the training set.

adaboost-final-decision-v2
Final decision

Adaboost vs Gradient Boosting

Both gradient boosting and adaboost are boosting techniques for decision tree based machine learning models. We will discuss how they are similar and how they differ from each other.

Pruning

You might notice that round 1 and round 3 produce the same decision rule (both split on x1 > 2.1). Pruning in adaboost proposes removing such duplicate weak classifiers to simplify the ensemble. Besides, you should add the multiplier alpha value of the removed classifier to the remaining one. In this case, I remove round 3 and add its coefficient to round 1.
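For example, the score of the 1st instance stays the same after pruning: (0.42 + 0.38) x 1 + 0.65 x (-1) + 1.1 x 1 = 0.80 – 0.65 + 1.10 = 1.25, so sign(1.25) = +1 as before.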

Pruned adaboost calculations

Even though we’ve used linear weak classifiers, all instances can be classified correctly.

Feature Importance

Explainable AI and machine learning interpretability are among the hottest topics in the data world nowadays. Herein, adaboost is a naturally explainable algorithm because it is based on regular decision trees, and decision trees are explainable.

Adaboost in real world

Face recognition is a hot topic in deep learning nowadays. Face detection and face alignment are mandatory early stages of a face recognition pipeline. One of the most common methods for face detection is the haar cascade, and it is mainly based on the adaboost algorithm.

Haar cascade schema map
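If you want to try it quickly, OpenCV ships pre-trained haar cascade models. A minimal face detection sketch, assuming an input image named face.jpg, looks like this:

import cv2

# load the pre-trained frontal face haar cascade shipped with OpenCV
cascade_path = cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
face_detector = cv2.CascadeClassifier(cascade_path)

# haar cascades work on grayscale images
img = cv2.imread("face.jpg")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# detect faces and draw bounding boxes
faces = face_detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
for (x, y, w, h) in faces:
    cv2.rectangle(img, (x, y), (x + w, y + h), (255, 0, 0), 2)

cv2.imwrite("face-detected.jpg", img)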

You can see the performance of the haar cascade algorithm in the following video. Even though more modern algorithms exist, haar cascade is still a promising one.

Conclusion

So, we've covered the adaptive boosting algorithm. In this example, we used decision stumps as the weak classifiers. You might use perceptrons for more complex data sets. I've pushed the adaboost logic to my GitHub repository.

Special thanks to Olga Veksler. Her lecture notes helped me to understand this concept.

Adaboost in Ancient Egypt


15 Comments

  1. “Initially we distribute weights normally”. That’s a uniform distribution you are using at initialization, not a normal distribution.

  2. How do we calculate the findDecision function?
    I tried to calculate it with the Gini index but it’s not working.
    For round 1, why does x1>2.1 give -0.025?
    What is -0.025?

    1. Procedures for regression trees are applied here instead of the Gini index. Please read that article.

      1. Thanks for the quick reply. One thing I don’t know about is regression tree depth.
        You just said to stop propagation when the sample data reaches 5 or fewer instances;
        however, if you deal with 1 million data samples it is not that simple to say 5.

        This blog is really helpful for beginner and intermediate level learners (I already shared it with my friends),
        but I expect advanced topics based on this basic knowledge,
        for example we could talk about decision trees applied in recognition.

        1. The termination rule is not strictly defined, unfortunately. You can customize it based on your problem. If you are working on a large scale data set, e.g. 1M instances, then you can check the ratio of the sub data set to the base data set. If the size of the sub data set is less than 2% of the base data set, you can terminate the decision tree building.

  3. How did the value of -0.025 come about and on what basis? I need clarity on it, and on what grounds is the split at x1 = 2.1 taken?

    1. 1- Get all unique values in the x1 column (2, 2.1, 3.5, 4, 4.5, 5, 6, 8).
      2- Iterate over these unique values and mark each row of the raw data set as greater than this unique value or not. There is no case x1 < 2, that's why I begin with 2.1.

      index x1>2.1 Decision
      0 FALSE 1
      1 FALSE 1
      2 TRUE 1
      3 TRUE -1
      4 TRUE -1
      5 TRUE 1
      6 TRUE -1
      7 TRUE 1
      8 TRUE -1
      9 TRUE -1

      These are basic regression trees. You can find a detailed description of regression trees here: https://sefiks.com/2018/08/28/a-step-by-step-regression-decision-tree-example/

      You need to find the average and standard deviation of the Decision column. The average is 0, the stdev is 1.

      Now, separate the data set based on true and false.

      Sub data set 1
      index x1>2.1 Decision
      0 FALSE 1
      1 FALSE 1

      Sub data set 2
      index x1>2.1 Decision
      2 TRUE 1
      3 TRUE -1
      4 TRUE -1
      5 TRUE 1
      6 TRUE -1
      7 TRUE 1
      8 TRUE -1
      9 TRUE -1

      Find the average and standard deviation again for sub data set 1 and sub data set 2.

      Average of sub data set 1 is 1, stdev is 0
      Average of sub data set 2 is -0.25, stdev is 0.968

      Notice that sub data set 1 consists of 2 items, and sub data set 2 consists of 8 items. Let’s find the weighted standard deviation values.

      weighted standard deviation for x1>2.1 is (2/10) * 0 + (8/10) * 0.968 = 0.774

      Remember that the standard deviation of the raw data set was 1. Find the standard deviation reduction. It is equal to the global stdev minus the weighted stdev.

      Stdev reduction for x1>2.1 is (1 – 0.774) = 0.225

      You should do this for x1 > 3.5, x1 > 4, x1 > 4.5, x1 > 5 and x1 > 6. I skip these calculations here; I did them and none of the reductions is greater than 0.225. That’s why the decision stump checks whether x1 is greater than 2.1 or not.

      So, if x1 > 2.1 is satisfied, then the average of the decision column in sub data set 2 will be returned. It is -0.25. (In the post itself the stump is trained on the weighted actual values of ±0.1 instead of ±1, so this average becomes -0.25 * 0.1 = -0.025, which is where that number comes from.)

      Otherwise, if x1 is not greater than 2.1, then the average of the decision column in sub data set 1 will be returned. It is 1 (or 0.1 on the weighted actual scale).

      I hope this explanation makes adaboost clearer. I strongly recommend you to read the regression trees post.
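      A quick sketch of the same weighted standard deviation reduction check in python, using the ±1 Decision values from this reply:

      import numpy as np

      x1 = np.array([2, 2.1, 4.5, 4, 3.5, 5, 5, 6, 8, 8])
      decision = np.array([1, 1, 1, -1, -1, 1, -1, 1, -1, -1])

      global_std = decision.std()  # 1.0

      # split the target on x1 > 2.1
      right = decision[x1 > 2.1]   # 8 instances, mean -0.25, stdev 0.968
      left = decision[x1 <= 2.1]   # 2 instances, mean  1.00, stdev 0
      weighted_std = (len(left) * left.std() + len(right) * right.std()) / len(decision)

      print(round(global_std - weighted_std, 3))  # stdev reduction ~ 0.225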

  4. Hi , I have a few questions. Your reply will be highly appreciated.
    1) Considering you have found the alphas for say T rounds in Adaboost and when real testing with new test data is to be done, do you just find the predictions on the test data from the T classifiers, multiply the predictions with corresponding alpha, take the sign of sum of those to get the final prediction? I am trying to implement Adaboost in Tensorflow using custom Estimator. So my understanding after reading your article is that I will have to save T models during training and then for prediction, load the models one by one to get the predictions. Is that right?

    2) I have existing Tensorflow training script using Tensorflow Estimator for classification problem in which softmax cross entropy loss function is being minimized. Now to implement Adaboost, I pass the sample weights to the softmax_cross_entropy function and use the returned weighted loss (sum of weighted loss) as epsilon or error to calculate the classifier voting weight. Is this the correct approach?

    1. 1- Adaboost works just like you described. You should apply T different classifiers, and each classifier has an alpha value. The sign function applied to the cumulative sum of each round’s prediction times that round’s alpha will be the final prediction on your test data. I’ve already implemented adaboost from scratch. You should clone / fork https://github.com/serengil/chefboost . The adaboost related file of the repository is https://github.com/serengil/chefboost/blob/master/tuning/adaboost.py .

      2- Yes, but I cannot imagine how to implement this algorithm recursively in TF, because you update the target labels in each round.

  5. Hi,
    First of all, thank you very much. It was quite educative.
    But I have a question about the final prediction. The final H is 1.25 as you have found, but while inserting it into the sign function it is written as Sign(0.25) = +1 aka true. Where did 0.25 come from, or is it a typo?
