Adaptive boosting, or adaboost for short, is an award-winning boosting algorithm. The principle is simple: a weak worker cannot move a heavy rock alone, but weak workers who come together can move heavy rocks and build a pyramid. The algorithm expects to run weak learners; this could be a single-layer perceptron, too. A weak learner cannot solve non-linear problems on its own, but its sequential usage enables solving non-linear problems. The trick is to increase the weights of incorrect decisions and to decrease the weights of correct decisions between rounds. Adaboost is not tied to decision trees; you might use a 1-level decision tree (a decision stump), but this is not a must.

Adaboost in Python
This blog post covers a deep explanation of the adaboost algorithm, and we will solve a problem step by step. On the other hand, you might just want to run the adaboost algorithm, and its mathematical background might not attract your attention.
🙋♂️ You may consider enrolling in my top-rated machine learning course on Udemy

Herein, you can find the Python implementation of the adaboost algorithm here. This package supports regular decision tree algorithms such as ID3, C4.5, CART, CHAID and regression trees, as well as bagging methods such as random forest and some boosting methods such as gradient boosting. You can support this work by just starring⭐️ the GitHub repository.
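If you only want to run adaboost rather than follow the math, scikit-learn also ships an off-the-shelf implementation. The following is a minimal sketch on this post's toy data set; this is scikit-learn, not the package above, and note that older scikit-learn versions call the weak learner argument base_estimator instead of estimator.

from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# features and ±1 labels of the toy data set used in this post
X = [[2, 3], [2.1, 2], [4.5, 6], [4, 3.5], [3.5, 1],
     [5, 7], [5, 3], [6, 5.5], [8, 6], [8, 2]]
y = [1, 1, 1, -1, -1, 1, -1, 1, -1, -1]

# decision stumps (max_depth=1) as weak learners, 4 boosting rounds
clf = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),
    n_estimators=4,
)
clf.fit(X, y)
print(clf.predict([[3, 4]]))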
Data set
We are going to work on the following data set. Each instance is represented as a point in 2-dimensional space, and we also have its class value. You can find the raw data set here.
x1 | x2 | Decision |
2 | 3 | true |
2.1 | 2 | true |
4.5 | 6 | true |
4 | 3.5 | false |
3.5 | 1 | false |
5 | 7 | true |
5 | 3 | false |
6 | 5.5 | true |
8 | 6 | false |
8 | 2 | false |
True classes are replaced with +1, and false classes are replaced with -1 in the Decision column.
Plotting the data set
We should plot the features and class values to understand the data set clearly.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

df = pd.read_csv("dataset/adaboost.txt")

positives = df[df['Decision'] >= 0]
negatives = df[df['Decision'] < 0]

plt.scatter(positives['x1'], positives['x2'], marker='+', s=500*abs(positives['Decision']), c='blue')
plt.scatter(negatives['x1'], negatives['x2'], marker='_', s=500*abs(negatives['Decision']), c='red')
plt.show()
This code block produces the following graph. As seen, true classes are marked with plus characters whereas false classes are marked with minus characters.

We would like to separate the true and false classes, but this is not a linearly separable problem. Linear classifiers such as perceptrons or decision stumps cannot classify it on their own. Herein, adaboost enables linear classifiers to solve this problem.
Decision stumps
Decision trees approach problems with a divide-and-conquer method. They might have lots of nested decision rules, which makes them non-linear classifiers. In contrast, decision stumps are 1-level decision trees. They are linear classifiers just like (single-layer) perceptrons. You might think that if someone's height is greater than 1.70 meters (5.58 feet), then that person is male; otherwise, female. This decision stump would classify gender correctly with at least 50% accuracy. That’s why these classifiers are called weak learners.
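As a minimal sketch of that idea (the 1.70-meter threshold is just the example above, not a trained model), a decision stump boils down to a single if/else on one feature:

def height_stump(height_in_meters):
    # a decision stump is a 1-level decision tree: one feature, one threshold, two leaves
    if height_in_meters > 1.70:
        return "male"
    return "female"

print(height_stump(1.82))  # male
print(height_stump(1.60))  # female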

Decision stump building
I’ve modified my decision tree repository to handle decision stumps. Basically, the buildDecisionTree function calls itself until reaching a decision. I terminate this recursive call if adaboost is enabled.
Regression for adaboost
The main principle in adaboost is to increase the weights of misclassified instances and to decrease the weights of correctly classified ones. But we are working on a classification problem: the target values in the data set are nominal. That’s why we are going to transform the problem into a regression task. I will set true classes to 1 and false classes to -1 to handle this.
Initially, we distribute the weights uniformly. I set the weight of every instance to 1/n where n is the total number of instances.
x1 | x2 | actual | weight | weighted_actual |
2 | 3 | 1 | 0.1 | 0.1 |
2 | 2 | 1 | 0.1 | 0.1 |
4 | 6 | 1 | 0.1 | 0.1 |
4 | 3 | -1 | 0.1 | -0.1 |
4 | 1 | -1 | 0.1 | -0.1 |
5 | 7 | 1 | 0.1 | 0.1 |
5 | 3 | -1 | 0.1 | -0.1 |
6 | 5 | 1 | 0.1 | 0.1 |
8 | 6 | -1 | 0.1 | -0.1 |
8 | 2 | -1 | 0.1 | -0.1 |
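A short pandas sketch of this initialization, assuming the data set is loaded as in the plotting snippet and the Decision column already holds ±1 (the column names follow the tables in this post):

import pandas as pd

df = pd.read_csv("dataset/adaboost.txt")  # columns: x1, x2, Decision (+1 / -1)

n = len(df)
df['weight'] = 1 / n                                 # uniform initial weights, 1/n each
df['weighted_actual'] = df['weight'] * df['Decision']
print(df)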
The weighted_actual column stores weight times actual value for each line. Now, we are going to use weighted_actual as the target value whereas x1 and x2 are the features to build a decision stump. The following rule set is created when I run the decision stump algorithm.
def findDecision(x1, x2):
    if x1 > 2.1:
        return -0.025
    if x1 <= 2.1:
        return 0.1
We’ve set the actual values to ±1, but the decision stump returns decimal values. Here, the trick is that applying the sign function handles this issue.
def sign(x):
    if x > 0:
        return 1
    elif x < 0:
        return -1
    else:
        return 0
To sum up, prediction will be sign(-0.025) = -1 when x1 is greater than 2.1, and it will be sign(0.1) = +1 when x1 is less than or equal to 2.1.
I’ll put the predictions in a column. Also, I check the equality of actual and prediction in the loss column. It will be 0 if the prediction is correct and 1 if the prediction is incorrect.
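In pandas, this bookkeeping could look like the following sketch; it assumes the df, findDecision and sign definitions from the snippets above:

def predict(x1, x2):
    # round-1 stump followed by the sign function
    return sign(findDecision(x1, x2))

df['prediction'] = df.apply(lambda row: predict(row['x1'], row['x2']), axis=1)
df['loss'] = (df['prediction'] != df['Decision']).astype(int)  # 1 if misclassified, else 0
df['weight_times_loss'] = df['weight'] * df['loss']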
x1 | x2 | actual | weight | weighted_actual | prediction | loss | weight * loss |
2 | 3 | 1 | 0.1 | 0.1 | 1 | 0 | 0 |
2 | 2 | 1 | 0.1 | 0.1 | 1 | 0 | 0 |
4 | 6 | 1 | 0.1 | 0.1 | -1 | 1 | 0.1 |
4 | 3 | -1 | 0.1 | -0.1 | -1 | 0 | 0 |
4 | 1 | -1 | 0.1 | -0.1 | -1 | 0 | 0 |
5 | 7 | 1 | 0.1 | 0.1 | -1 | 1 | 0.1 |
5 | 3 | -1 | 0.1 | -0.1 | -1 | 0 | 0 |
6 | 5 | 1 | 0.1 | 0.1 | -1 | 1 | 0.1 |
8 | 6 | -1 | 0.1 | -0.1 | -1 | 0 | 0 |
8 | 2 | -1 | 0.1 | -0.1 | -1 | 0 | 0 |
The sum of the weight times loss column stores the total error, epsilon. It is 0.3 in this case. Here, we’ll define a new variable, alpha: it stores the natural logarithm of (1 - epsilon)/epsilon, divided by 2.
alpha = ln[(1-epsilon)/epsilon] / 2 = ln[(1 – 0.3)/0.3] / 2
alpha = 0.42
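The same calculation in code, where epsilon is just the sum of the weight times loss column from the sketch above:

import math

epsilon = df['weight_times_loss'].sum()          # total error, 0.3 in round 1
alpha = 0.5 * math.log((1 - epsilon) / epsilon)  # ln[(1 - 0.3)/0.3] / 2
print(round(alpha, 2))                           # 0.42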
We’ll use alpha to update weights in the next round.
w_(i+1) = w_i * exp(-alpha * actual * prediction), where i refers to the instance number.
Also, the sum of the weights must be equal to 1. That’s why we have to normalize the weight values. Dividing each weight by the sum of the weights column enables normalization.
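A sketch of the update and normalization steps, continuing with the data frame and alpha from the snippets above:

import numpy as np

# increase weights of misclassified instances, decrease weights of correct ones
df['weight'] = df['weight'] * np.exp(-alpha * df['Decision'] * df['prediction'])

# normalize so that the weights sum to 1 again
df['weight'] = df['weight'] / df['weight'].sum()
df['weighted_actual'] = df['weight'] * df['Decision']  # new target for the next stump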
x1 | x2 | actual | weight | prediction | w_(i+1) | norm(w_(i+1)) |
2 | 3 | 1 | 0.1 | 1 | 0.065 | 0.071 |
2 | 2 | 1 | 0.1 | 1 | 0.065 | 0.071 |
4 | 6 | 1 | 0.1 | -1 | 0.153 | 0.167 |
4 | 3 | -1 | 0.1 | -1 | 0.065 | 0.071 |
4 | 1 | -1 | 0.1 | -1 | 0.065 | 0.071 |
5 | 7 | 1 | 0.1 | -1 | 0.153 | 0.167 |
5 | 3 | -1 | 0.1 | -1 | 0.065 | 0.071 |
6 | 5 | 1 | 0.1 | -1 | 0.153 | 0.167 |
8 | 6 | -1 | 0.1 | -1 | 0.065 | 0.071 |
8 | 2 | -1 | 0.1 | -1 | 0.065 | 0.071 |
This round is over.
Round 2
I shift the normalized w_(i+1) column into the weight column in this round. Then, I build a new decision stump. Still, x1 and x2 are the features whereas weighted_actual is the target value.
x1 | x2 | actual | weight | weighted_actual |
2 | 3 | 1 | 0.071 | 0.071 |
2 | 2 | 1 | 0.071 | 0.071 |
4 | 6 | 1 | 0.167 | 0.167 |
4 | 3 | -1 | 0.071 | -0.071 |
4 | 1 | -1 | 0.071 | -0.071 |
5 | 7 | 1 | 0.167 | 0.167 |
5 | 3 | -1 | 0.071 | -0.071 |
6 | 5 | 1 | 0.167 | 0.167 |
8 | 6 | -1 | 0.071 | -0.071 |
8 | 2 | -1 | 0.071 | -0.071 |
The graph of the new data set is shown below. The weights of correctly classified instances decreased whereas the weights of misclassified ones increased.

The following decision stump will be built for this data set.
def findDecision(x1, x2):
    if x2 <= 3.5:
        return -0.02380952380952381
    if x2 > 3.5:
        return 0.10714285714285714
I’ve applied the sign function to the predictions. Then, I put the loss and weight times loss values as columns.
x1 | x2 | actual | weight | prediction | loss | weight * loss |
2 | 3 | 1 | 0.071 | -1 | 1 | 0.071 |
2 | 2 | 1 | 0.071 | -1 | 1 | 0.071 |
4 | 6 | 1 | 0.167 | 1 | 0 | 0.000 |
4 | 3 | -1 | 0.071 | -1 | 0 | 0.000 |
4 | 1 | -1 | 0.071 | -1 | 0 | 0.000 |
5 | 7 | 1 | 0.167 | 1 | 0 | 0.000 |
5 | 3 | -1 | 0.071 | -1 | 0 | 0.000 |
6 | 5 | 1 | 0.167 | 1 | 0 | 0.000 |
8 | 6 | -1 | 0.071 | 1 | 1 | 0.071 |
8 | 2 | -1 | 0.071 | -1 | 0 | 0.000 |
I can calculate error and alpha values for round 2.
epsilon = 0.21, alpha = 0.65
So, weights for the following round can be found.
x1 | x2 | actual | weight | prediction | w_(i+1) | norm(w_(i+1)) |
2 | 3 | 1 | 0.071 | -1 | 0.137 | 0.167 |
2 | 2 | 1 | 0.071 | -1 | 0.137 | 0.167 |
4 | 6 | 1 | 0.167 | 1 | 0.087 | 0.106 |
4 | 3 | -1 | 0.071 | -1 | 0.037 | 0.045 |
4 | 1 | -1 | 0.071 | -1 | 0.037 | 0.045 |
5 | 7 | 1 | 0.167 | 1 | 0.087 | 0.106 |
5 | 3 | -1 | 0.071 | -1 | 0.037 | 0.045 |
6 | 5 | 1 | 0.167 | 1 | 0.087 | 0.106 |
8 | 6 | -1 | 0.071 | 1 | 0.137 | 0.167 |
8 | 2 | -1 | 0.071 | -1 | 0.037 | 0.045 |
Round 3
I skipped the detailed calculations for the following rounds.
x1 | x2 | actual | weight | prediction | loss | w * loss | w_(i+1) | norm(w_(i+1)) |
2 | 3 | 1 | 0.167 | 1 | 0 | 0.000 | 0.114 | 0.122 |
2 | 2 | 1 | 0.167 | 1 | 0 | 0.000 | 0.114 | 0.122 |
4 | 6 | 1 | 0.106 | -1 | 1 | 0.106 | 0.155 | 0.167 |
4 | 3 | -1 | 0.045 | -1 | 0 | 0.000 | 0.031 | 0.033 |
4 | 1 | -1 | 0.045 | -1 | 0 | 0.000 | 0.031 | 0.033 |
5 | 7 | 1 | 0.106 | -1 | 1 | 0.106 | 0.155 | 0.167 |
5 | 3 | -1 | 0.045 | -1 | 0 | 0.000 | 0.031 | 0.033 |
6 | 5 | 1 | 0.106 | -1 | 1 | 0.106 | 0.155 | 0.167 |
8 | 6 | -1 | 0.167 | -1 | 0 | 0.000 | 0.114 | 0.122 |
8 | 2 | -1 | 0.045 | -1 | 0 | 0.000 | 0.031 | 0.033 |
epsilon = 0.31, alpha = 0.38

def findDecision(x1, x2):
    if x1 > 2.1:
        return -0.003787878787878794
    if x1 <= 2.1:
        return 0.16666666666666666
Round 4
x1 | x2 | actual | weight | prediction | loss | w * loss | w_(i+1) | norm(w_(i+1)) |
2 | 3 | 1 | 0.122 | 1 | 0 | 0.000 | 0.041 | 0.068 |
2 | 2 | 1 | 0.122 | 1 | 0 | 0.000 | 0.041 | 0.068 |
4 | 6 | 1 | 0.167 | 1 | 0 | 0.000 | 0.056 | 0.093 |
4 | 3 | -1 | 0.033 | 1 | 1 | 0.033 | 0.100 | 0.167 |
4 | 1 | -1 | 0.033 | 1 | 1 | 0.033 | 0.100 | 0.167 |
5 | 7 | 1 | 0.167 | 1 | 0 | 0.000 | 0.056 | 0.093 |
5 | 3 | -1 | 0.033 | 1 | 1 | 0.033 | 0.100 | 0.167 |
6 | 5 | 1 | 0.167 | 1 | 0 | 0.000 | 0.056 | 0.093 |
8 | 6 | -1 | 0.122 | -1 | 0 | 0.000 | 0.041 | 0.068 |
8 | 2 | -1 | 0.033 | -1 | 0 | 0.000 | 0.011 | 0.019 |
epsilon = 0.10, alpha = 1.10

def findDecision(x1, x2):
    if x1 <= 6.0:
        return 0.08055555555555555
    if x1 > 6.0:
        return -0.07777777777777778
Prediction
The sum of each round’s alpha times that round’s prediction gives the final score, and its sign gives the final prediction.
round 1 alpha | round 2 alpha | round 3 alpha | round 4 alpha |
0.42 | 0.65 | 0.38 | 1.1 |
round 1 prediction | round 2 prediction | round 3 prediction | round 4 prediction |
1 | -1 | 1 | 1 |
1 | -1 | 1 | 1 |
-1 | 1 | -1 | 1 |
-1 | -1 | -1 | 1 |
-1 | -1 | -1 | 1 |
-1 | 1 | -1 | 1 |
-1 | -1 | -1 | 1 |
-1 | 1 | -1 | 1 |
-1 | 1 | -1 | -1 |
-1 | -1 | -1 | -1 |
For example, the final score of the 1st instance will be
0.42 x 1 + 0.65 x (-1) + 0.38 x 1 + 1.1 x 1 = 1.25
And we will apply the sign function:
Sign(1.25) = +1, aka true, which means this instance is correctly classified.
We can summarize the adaboost calculations as illustrated below. There are 4 different (weak) classifiers and their multiplier alphas. Instances located in the blue area will be classified as true whereas those located in the red area will be classified as false.
If we apply this calculation to all instances, every instance is classified correctly. In this way, you can also find the decision for a new instance that does not appear in the training set.
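Here is a sketch of this aggregation step for all 10 instances, using the alphas and the per-round sign predictions from the tables above:

alphas = [0.42, 0.65, 0.38, 1.10]

# per-round sign predictions (rows: instances, columns: rounds 1 to 4)
round_predictions = [
    [1, -1, 1, 1], [1, -1, 1, 1], [-1, 1, -1, 1], [-1, -1, -1, 1], [-1, -1, -1, 1],
    [-1, 1, -1, 1], [-1, -1, -1, 1], [-1, 1, -1, 1], [-1, 1, -1, -1], [-1, -1, -1, -1],
]
actuals = [1, 1, 1, -1, -1, 1, -1, 1, -1, -1]

for preds, actual in zip(round_predictions, actuals):
    score = sum(a * p for a, p in zip(alphas, preds))
    prediction = 1 if score > 0 else -1
    print(round(score, 2), prediction, prediction == actual)  # every row prints True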

Adaboost vs Gradient Boosting
Both gradient boosting and adaboost are boosting techniques for decision tree based machine learning models. We will discuss how they are similar to and how they differ from each other.
Pruning
You might realize that round 1 and round 3 produce the same decision stump. Pruning in adaboost proposes removing duplicated weak classifiers to improve performance. Besides, you should increase the multiplier alpha value of the remaining one. In this case, I remove round 3 and add its alpha to round 1’s. A small sketch of this idea follows below.
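In this sketch, the stump labels are just shorthand for the rules found in rounds 1 to 4, and duplicated stumps are merged by summing their alphas:

# (stump rule, alpha) pairs from the four rounds; rounds 1 and 3 share the same rule
ensemble = [("x1>2.1", 0.42), ("x2>3.5", 0.65), ("x1>2.1", 0.38), ("x1<=6.0", 1.10)]

pruned = {}
for rule, alpha in ensemble:
    # merge duplicated weak classifiers by summing their alphas
    pruned[rule] = pruned.get(rule, 0) + alpha

# the duplicated stump x1>2.1 now carries alpha 0.42 + 0.38 = 0.80
print(pruned)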
Even though we’ve used linear weak classifiers, all instances can be classified correctly.
Feature Importance
Explainable AI and machine learning interpretability are among the hottest topics in the data world nowadays. Herein, adaboost is a naturally explainable algorithm because it is based on regular decision trees, and decision trees are explainable.
Adaboost in real world
Face recognition is a hot topic in deep learning nowadays. Face detection and face alignment are mandatory early stages of a face recognition pipeline. The most common method for face detection is the haar cascade, and it is mainly based on the adaboost algorithm.
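As a side note, running OpenCV's pre-trained haar cascade takes only a few lines; this sketch uses OpenCV's bundled frontal face model and a hypothetical input image named photo.jpg:

import cv2

# load OpenCV's pre-trained frontal face cascade
cascade_path = cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
face_detector = cv2.CascadeClassifier(cascade_path)

img = cv2.imread("photo.jpg")  # hypothetical input image
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# each detection is an (x, y, w, h) bounding box
faces = face_detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
for (x, y, w, h) in faces:
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)

cv2.imwrite("detected.jpg", img)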

You can see the performance of the haar cascade algorithm in the following video. Even though more modern algorithms exist, haar cascade is still a promising one.
Conclusion
So, we’ve covered the adaptive boosting algorithm. In this example, we’ve used decision stumps as the weak classifier. You might use perceptrons for more complex data sets. I’ve pushed the adaboost logic into my GitHub repository.
Special thanks to Olga Veksler. Her lecture notes helped me to understand this concept.
