A Step by Step Gradient Boosting Example for Classification

Gradient boosting machines might be confusing for beginners. Even though most resources say that GBM can handle both regression and classification problems, practical examples almost always cover regression studies. This is actually a tricky statement, because GBM is designed for regression only. However, we can transform classification tasks into regression tasks.


What is GBM?

Notice that gradient boosting is not a decision tree algorithm itself. It proposes to run regression trees sequentially.



GBM in Python

Hands-on coding might help some people understand algorithms better. You can find the Python implementation of the gradient boosting classification algorithm here.

Data set

Here, we are going to work on the Iris data set. There are 150 instances evenly split among 3 classes: setosa, versicolor and virginica. The species is the target output, whereas the sepal and petal lengths and widths are the input features.

Figure: Iris data set
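If you want to follow along in code, a minimal sketch for loading the data set with scikit-learn (an assumption for illustration; the original post works on its own copy of the data) could look like this:

from sklearn.datasets import load_iris
import pandas as pd

# Load the Iris data set: 150 instances, 4 features and 3 classes
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df["label"] = [iris.target_names[t] for t in iris.target]

print(df.shape)              # (150, 5)
print(df["label"].unique())  # ['setosa' 'versicolor' 'virginica']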

Why do we need boosting?

Applying the C4.5 decision tree algorithm to this data set classifies 105 instances correctly and 45 instances incorrectly. This means 70% accuracy, which is far from satisfactory. We will run the same C4.5 algorithm in the following steps, but boosting enables us to increase the accuracy.

You can find the decision tree building code here.

Binary classification with multiple trees

We are going to apply one-hot encoding to the target output. Thus, the output will be represented as a three dimensional vector. However, decision tree algorithms can handle only one output. That’s why we will build 3 different regression trees in each round. You might think of each decision tree as a separate binary classification problem.

I’ve selected sample rows of the data set to illustrate. This is the original one.

instance sepal_length sepal_width petal_length petal_width label
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3 1.4 0.2 setosa
51 7 3.2 4.7 1.4 versicolor
101 6.3 3.3 6 2.5 virginica

The label consists of 3 classes: setosa, versicolor and virginica.

Data set for setosa class

Firstly, I prepare a data set to check whether each instance is setosa or not.

instance sepal_length sepal_width petal_length petal_width setosa
1 5.1 3.5 1.4 0.2 1
2 4.9 3 1.4 0.2 1
51 7 3.2 4.7 1.4 0
101 6.3 3.3 6 2.5 0

Data set for versicolor class

Secondly, I prepare a data set to check whether each instance is versicolor or not.

instance sepal_length sepal_width petal_length petal_width versicolor
1 5.1 3.5 1.4 0.2 0
2 4.9 3 1.4 0.2 0
51 7 3.2 4.7 1.4 1
101 6.3 3.3 6 2.5 0

Data set for virginica class

Finally, I prepare a data set to check whether each instance is virginica or not.

instance sepal_length sepal_width petal_length petal_width virginica
1 5.1 3.5 1.4 0.2 0
2 4.9 3 1.4 0.2 0
51 7 3.2 4.7 1.4 0
101 6.3 3.3 6 2.5 1

Now, I have 3 different data sets. I can build 3 decision trees for these data sets.
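The following sketch shows this setup in code, assuming scikit-learn’s DecisionTreeRegressor as the regression tree implementation rather than the post’s own C4.5-based code:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeRegressor

iris = load_iris()
X = iris.data                # sepal and petal measurements
Y = np.eye(3)[iris.target]   # one-hot encoded target, shape (150, 3)

# Build one regression tree per class column (setosa, versicolor, virginica)
trees = []
for class_index in range(3):
    tree = DecisionTreeRegressor(max_depth=3)
    tree.fit(X, Y[:, class_index])
    trees.append(tree)

# Raw predictions F: one column per class, like the F_ columns below
F = np.column_stack([tree.predict(X) for tree in trees])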

I’m going to put actual labels and predictions in the same table in the following steps. Columns with the F_ prefix are predictions.

instance Y_setosa Y_versicolor Y_virginica F_setosa F_versicolor F_virginica
1 1 0 0 1 0 0
2 1 0 0 1 0 0
51 0 1 0 0 1 0
101 0 0 1 0 1 1

Notice that instance 101 is predicted as both versicolor and virginica with the same score, so this prediction is erroneous.

Softmax probability function

Initially, we need to apply the softmax function to the predictions. This function normalizes all inputs to the scale of [0, 1], and the sum of the normalized values is always equal to 1. There is no built-in softmax function in core Python, but we can create it easily as coded below.

import numpy as np

def softmax(w):
    # normalize raw scores into probabilities that sum to 1
    e = np.exp(np.array(w))
    dist = e / np.sum(e)
    return dist
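For example, the raw prediction vector [1, 0, 0] of instance 1 maps to the probabilities shown in the next table:

print(softmax([1, 0, 0]))  # approximately [0.576, 0.212, 0.212]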

 

I’m going to add these probabilities as columns. I’ve also hidden the actual values (columns with the Y_ prefix) to fit the table.

instance F_setosa F_versicolor F_virginica P_setosa P_versicolor P_virginica
1 1 0 0 0.576 0.212 0.212
2 1 0 0 0.576 0.212 0.212
51 0 1 0 0.212 0.576 0.212
101 0 1 1 0.155 0.422 0.422

Finding negative gradient

Remember that in the regression case we built each new tree on actual minus prediction as the target value. This difference comes from the derivative of the mean squared error. Here, we’ve applied the softmax function instead. The class with the maximum probability among the predictions (columns with the P_ prefix) would be the predicted class; in other words, we would one-hot encode the result as 1 for the maximum and 0 for the others. Cross entropy measures the relation between the predicted probabilities and the one-hot encoded actuals. Applying softmax and cross entropy respectively has a surprisingly simple derivative: it is equal to prediction (the probabilities in this case) minus actual. The negative gradient is therefore actual (columns with the Y_ prefix) minus prediction (columns with the P_ prefix). We will derive this value.





instance Y_setosa–P_setosa Y_versicolor–P_versicolor Y_virginica–P_virginica
1 0.424 -0.212 -0.212
2 0.424 -0.212 -0.212
51 -0.212 0.424 -0.212
101 -0.155 -0.422 0.578

This is round 1. The target values will be replaced with these negative gradients in the following round.
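A minimal sketch of this round-1 computation, reusing the hypothetical Y and F variables and the softmax function from the snippets above:

P = np.apply_along_axis(softmax, 1, F)   # row-wise softmax of the raw predictions
negative_gradients = Y - P               # actual minus prediction, one column per class

# roughly [0.424, -0.212, -0.212] for instance 1
# if its first-round raw prediction was [1, 0, 0]
print(negative_gradients[0])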

Reflecting negative gradient

Target column for setosa will be replaced with Y_setosa – P_setosa.

instance sepal_length sepal_width petal_length petal_width setosa
1 5.1 3.5 1.4 0.2 0.424
2 4.9 3 1.4 0.2 0.424
51 7 3.2 4.7 1.4 -0.212
101 6.3 3.3 6 2.5 -0.155

Target column for versicolor will be replaced with Y_versicolor – P_versicolor.

instance sepal_length sepal_width petal_length petal_width versicolor
1 5.1 3.5 1.4 0.2 -0.212
2 4.9 3 1.4 0.2 -0.212
51 7 3.2 4.7 1.4 0.424
101 6.3 3.3 6 2.5 -0.422

I will apply a similar replacement for virginica, too. These are my new data sets. I’m going to build 3 different decision trees for these 3 different data sets. This operation will be repeated until I get a satisfactory result.
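Putting the rounds together, a rough sketch of the whole training loop, assuming scikit-learn’s DecisionTreeRegressor and reusing the softmax function defined earlier (the original post uses its own tree implementation instead):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def train_gbm_classifier(X, Y, rounds=10, max_depth=3):
    # Y is the one-hot encoded target, shape (n_samples, n_classes)
    n_samples, n_classes = Y.shape
    F = np.zeros((n_samples, n_classes))   # accumulated raw predictions
    all_trees = []

    for _ in range(rounds):
        P = np.apply_along_axis(softmax, 1, F)   # current class probabilities
        residuals = Y - P                        # negative gradients for this round
        round_trees = []
        for c in range(n_classes):
            tree = DecisionTreeRegressor(max_depth=max_depth)
            tree.fit(X, residuals[:, c])         # one regression tree per class
            F[:, c] += tree.predict(X)           # sum predictions over rounds
            round_trees.append(tree)
        all_trees.append(round_trees)

    return all_trees, F

Calling train_gbm_classifier(X, Y, rounds=10) returns the trees of every round together with the accumulated scores F used in the final prediction below.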

Final predictions

Finally, I’m going to sum the predictions (F_ prefix) over all rounds. The index of the maximum value will be my prediction.
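With the accumulated scores F from the sketch above, this corresponds to something like:

predicted = np.argmax(F, axis=1)     # 0: setosa, 1: versicolor, 2: virginica
actual = np.argmax(Y, axis=1)
print(np.mean(predicted == actual))  # fraction of correctly classified instances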

At round 10, I can classify 144 instances correctly and 6 instances incorrectly. This means I got 96% accuracy. Remember that I got 70% accuracy before boosting. This is a major improvement!

Random Forest vs Gradient Boosting

Both random forest and gradient boosting are ensemble approaches rather than core decision tree algorithms themselves. They require running a core decision tree algorithm and they build many decision trees in the background. We will discuss how they are similar and how they differ in the following video.

Adaboost vs Gradient Boosting

Both gradient boosting and adaboost are boosting techniques for decision tree based machine learning models. We will discuss how they are similar and how they differ from each other.

Feature importance

Decision trees are naturally explainable and interpretable algorithms. However, a GBM made of many trees is hard to explain. Herein, feature importance offers a way to understand the built trees better.
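For instance, sticking with the hypothetical scikit-learn based sketch above (not the original post’s implementation), an overall importance score could be approximated by averaging the importances of all boosted trees:

importances = np.mean(
    [tree.feature_importances_ for round_trees in all_trees for tree in round_trees],
    axis=0,
)
for name, score in zip(iris.feature_names, importances):
    print(name, round(score, 3))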

Bonus: binary classification

I’ve demonstrated gradient boosting for classification on a multi-class classification problem where the number of classes is greater than 2. Running it for a binary classification problem (true/false) might instead use the sigmoid function. Still, the softmax and cross-entropy pair works for binary classification as well.
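As a quick illustration, a sigmoid counterpart of the softmax function above would simply be the standard definition (a sketch, not code from the original post):

def sigmoid(x):
    # squash a raw score into the (0, 1) range
    return 1 / (1 + np.exp(-x))

print(sigmoid(0))    # 0.5
print(sigmoid(2.0))  # roughly 0.88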





Conclusion

So, we’ve covered a step by step gradient boosting example for classification. I could not find such a walkthrough in the literature. Basically, we’ve transformed a classification task into multiple regression tasks to boost. I am grateful to Cheng Li, whose lecture notes guided me in understanding this topic. Finally, running and debugging the code yourself makes the concept much more understandable. That’s why I’ve already pushed the code of gradient boosting for classification to GitHub.

