A Step by Step Gradient Boosting Example for Classification

Gradient boosting machines might be confusing for beginners. Even though most resources say that GBM can handle both regression and classification problems, practical examples almost always cover regression studies. This is actually a tricky statement, because GBM is designed for regression only. However, we can transform classification tasks into regression tasks.


What is GBM?

Notice that gradient boosting is not a decision tree algorithm itself. It proposes to run regression trees sequentially.



GBM in Python

Hands-on coding might help some people understand algorithms better. You can find the Python implementation of the gradient boosting classification algorithm here.

Data set

Here, we are going to work on the Iris data set. There are 150 instances evenly split among 3 classes: setosa, versicolor and virginica. The species is the target output, whereas the sepal and petal lengths and widths are the input features.

Figure: Iris data set
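If you want to follow along in code, a minimal sketch for loading the data set with scikit-learn (an assumption for illustration; the original post works on its own copy of the data) could look like this:

from sklearn.datasets import load_iris
import pandas as pd

# Load the Iris data set: 150 instances, 4 features and 3 classes
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df["label"] = [iris.target_names[t] for t in iris.target]

print(df.shape)              # (150, 5)
print(df["label"].unique())  # ['setosa' 'versicolor' 'virginica']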

Why do we need boosting?

Applying the C4.5 decision tree algorithm to this data set classifies 105 instances correctly and 45 instances incorrectly. This means 70% accuracy, which is far from satisfactory. We will run the same C4.5 algorithm in the following steps, but boosting enables us to increase the accuracy.

You can find the decision tree building code here.

Binary classification with multiple trees

We are going to apply one-hot encoding to the target output. Thus, the output will be represented as a three dimensional vector. However, decision tree algorithms can handle only one output. That’s why we will build 3 different regression trees in each round. You might think of each decision tree as a separate binary classification problem.

I’ve selected sample rows of the data set to illustrate. This is the original one.

instance sepal_length sepal_width petal_length petal_width label
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3 1.4 0.2 setosa
51 7 3.2 4.7 1.4 versicolor
101 6.3 3.3 6 2.5 virginica

The label consists of 3 classes: setosa, versicolor and virginica.

Data set for setosa class

Firstly, I prepare a data set to check whether each instance is setosa or not.

instance sepal_length sepal_width petal_length petal_width setosa
1 5.1 3.5 1.4 0.2 1
2 4.9 3 1.4 0.2 1
51 7 3.2 4.7 1.4 0
101 6.3 3.3 6 2.5 0

Data set for versicolor class

Secondly, I prepare a data set to check whether each instance is versicolor or not.

instance sepal_length sepal_width petal_length petal_width versicolor
1 5.1 3.5 1.4 0.2 0
2 4.9 3 1.4 0.2 0
51 7 3.2 4.7 1.4 1
101 6.3 3.3 6 2.5 0

Data set for virginica class

Finally, I prepare a data set to check whether each instance is virginica or not.

instance sepal_length sepal_width petal_length petal_width virginica
1 5.1 3.5 1.4 0.2 0
2 4.9 3 1.4 0.2 0
51 7 3.2 4.7 1.4 0
101 6.3 3.3 6 2.5 1

Now, I have 3 different data sets. I can build 3 decision trees for these data sets.
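The following sketch shows this setup in code, assuming scikit-learn’s DecisionTreeRegressor as the regression tree implementation rather than the post’s own C4.5-based code:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeRegressor

iris = load_iris()
X = iris.data                # sepal and petal measurements
Y = np.eye(3)[iris.target]   # one-hot encoded target, shape (150, 3)

# Build one regression tree per class column (setosa, versicolor, virginica)
trees = []
for class_index in range(3):
    tree = DecisionTreeRegressor(max_depth=3)
    tree.fit(X, Y[:, class_index])
    trees.append(tree)

# Raw predictions F: one column per class, like the F_ columns below
F = np.column_stack([tree.predict(X) for tree in trees])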

I’m going to put actual labels and predictions in the same table in the following steps. Columns with the F_ prefix are predictions.

instance Y_setosa Y_versicolor Y_virginica F_setosa F_versicolor F_virginica
1 1 0 0 1 0 0
2 1 0 0 1 0 0
51 0 1 0 0 1 0
101 0 0 1 0 1 1

Notice that instance 101 is predicted as both versicolor and virginica with the same score, so this prediction is erroneous.

Softmax probability function

Initially, we need to apply the softmax function to the predictions. This function normalizes all inputs to the scale of [0, 1], and the sum of the normalized values is always equal to 1. There is no built-in softmax function in core Python, but we can create it easily as coded below.

import numpy as np

def softmax(w):
    # normalize raw scores into probabilities that sum to 1
    e = np.exp(np.array(w))
    dist = e / np.sum(e)
    return dist
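For example, the raw prediction vector [1, 0, 0] of instance 1 maps to the probabilities shown in the next table:

print(softmax([1, 0, 0]))  # approximately [0.576, 0.212, 0.212]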

 

I’m going to add these probabilities as columns. I’ve also hidden the actual values (columns with the Y_ prefix) to fit the table.

instance F_setosa F_versicolor F_virginica P_setosa P_versicolor P_virginica
1 1 0 0 0.576 0.212 0.212
2 1 0 0 0.576 0.212 0.212
51 0 1 0 0.212 0.576 0.212
101 0 1 1 0.155 0.422 0.422

Finding negative gradient

Remember that in the regression case we built each new tree on actual minus prediction as the target value. This difference comes from the derivative of the mean squared error. Here, we’ve applied the softmax function instead. The class with the maximum probability among the predictions (columns with the P_ prefix) would be the predicted class; in other words, we would one-hot encode the result as 1 for the maximum and 0 for the others. Cross entropy measures the relation between the predicted probabilities and the one-hot encoded actuals. Applying softmax and cross entropy respectively has a surprisingly simple derivative: it is equal to prediction (the probabilities in this case) minus actual. The negative gradient is therefore actual (columns with the Y_ prefix) minus prediction (columns with the P_ prefix). We will derive this value.





instance Y_setosa–P_setosa Y_versicolor–P_versicolor Y_virginica–P_virginica
1 0.424 -0.212 -0.212
2 0.424 -0.212 -0.212
51 -0.212 0.424 -0.212
101 -0.155 -0.422 0.578

This is round 1. The target values will be replaced with these negative gradients in the following round.
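A minimal sketch of this round-1 computation, reusing the hypothetical Y and F variables and the softmax function from the snippets above:

P = np.apply_along_axis(softmax, 1, F)   # row-wise softmax of the raw predictions
negative_gradients = Y - P               # actual minus prediction, one column per class

# roughly [0.424, -0.212, -0.212] for instance 1
# if its first-round raw prediction was [1, 0, 0]
print(negative_gradients[0])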

Reflecting negative gradient

Target column for setosa will be replaced with Y_setosa – P_setosa.

instance sepal_length sepal_width petal_length petal_width setosa
1 5.1 3.5 1.4 0.2 0.424
2 4.9 3 1.4 0.2 0.424
51 7 3.2 4.7 1.4 -0.212
101 6.3 3.3 6 2.5 -0.155

Target column for versicolor will be replaced with Y_versicolor – P_versicolor.

instance sepal_length sepal_width petal_length petal_width versicolor
1 5.1 3.5 1.4 0.2 -0.212
2 4.9 3 1.4 0.2 -0.212
51 7 3.2 4.7 1.4 0.424
101 6.3 3.3 6 2.5 -0.422

I will apply a similar replacement for virginica, too. These are my new data sets. I’m going to build 3 different decision trees for these 3 different data sets. This operation will be repeated until I get a satisfactory result.
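Putting the rounds together, a rough sketch of the whole training loop, assuming scikit-learn’s DecisionTreeRegressor and reusing the softmax function defined earlier (the original post uses its own tree implementation instead):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def train_gbm_classifier(X, Y, rounds=10, max_depth=3):
    # Y is the one-hot encoded target, shape (n_samples, n_classes)
    n_samples, n_classes = Y.shape
    F = np.zeros((n_samples, n_classes))   # accumulated raw predictions
    all_trees = []

    for _ in range(rounds):
        P = np.apply_along_axis(softmax, 1, F)   # current class probabilities
        residuals = Y - P                        # negative gradients for this round
        round_trees = []
        for c in range(n_classes):
            tree = DecisionTreeRegressor(max_depth=max_depth)
            tree.fit(X, residuals[:, c])         # one regression tree per class
            F[:, c] += tree.predict(X)           # sum predictions over rounds
            round_trees.append(tree)
        all_trees.append(round_trees)

    return all_trees, F

Calling train_gbm_classifier(X, Y, rounds=10) returns the trees of every round together with the accumulated scores F used in the final prediction below.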

Final predictions

Finally, I’m going to sum the predictions (F_ prefix) over all rounds. The index of the maximum value will be my prediction.
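With the accumulated scores F from the sketch above, this corresponds to something like:

predicted = np.argmax(F, axis=1)     # 0: setosa, 1: versicolor, 2: virginica
actual = np.argmax(Y, axis=1)
print(np.mean(predicted == actual))  # fraction of correctly classified instances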

At round 10, I can classify 144 instances correctly and 6 instances incorrectly. This means I got 96% accuracy. Remember that I got 70% accuracy before boosting. This is a major improvement!

Random Forest vs Gradient Boosting

Both random forest and gradient boosting are ensemble approaches rather than core decision tree algorithms themselves. They require running a core decision tree algorithm and they build many decision trees in the background. We will discuss how they are similar and how they differ in the following video.

Adaboost vs Gradient Boosting

Both gradient boosting and adaboost are boosting techniques for decision tree based machine learning models. We will discuss how they are similar and how they differ from each other.

Feature importance

Decision trees are naturally explainable and interpretable algorithms. However, a GBM made of many trees is hard to explain. Herein, feature importance offers a way to understand the built trees better.
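For instance, sticking with the hypothetical scikit-learn based sketch above (not the original post’s implementation), an overall importance score could be approximated by averaging the importances of all boosted trees:

importances = np.mean(
    [tree.feature_importances_ for round_trees in all_trees for tree in round_trees],
    axis=0,
)
for name, score in zip(iris.feature_names, importances):
    print(name, round(score, 3))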

Bonus: binary classification

I’ve demonstrated gradient boosting for classification on a multi-class classification problem where the number of classes is greater than 2. Running it for a binary classification problem (true/false) might instead use the sigmoid function. Still, the softmax and cross-entropy pair works for binary classification as well.
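As a quick illustration, a sigmoid counterpart of the softmax function above would simply be the standard definition (a sketch, not code from the original post):

def sigmoid(x):
    # squash a raw score into the (0, 1) range
    return 1 / (1 + np.exp(-x))

print(sigmoid(0))    # 0.5
print(sigmoid(2.0))  # roughly 0.88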





Conclusion

So, we’ve covered a step by step gradient boosting example for classification. I could not find such a walkthrough in the literature. Basically, we’ve transformed a classification task into multiple regression tasks to boost. I am grateful to Cheng Li, whose lecture notes guided me in understanding this topic. Finally, running and debugging the code yourself makes the concept much more understandable. That’s why I’ve already pushed the code of gradient boosting for classification to GitHub.

