Gradient boosting machines can be confusing for beginners. Even though most resources say that GBM can handle both regression and classification problems, practical examples almost always cover regression. This is actually a tricky statement, because GBM is designed for regression only. However, we can transform classification tasks into regression tasks.
What is GBM?
Notice that gradient boosting is not a decision tree algorithm itself. It proposes to run regression trees sequentially.
GBM in Python
Hands-on coding might help some people understand algorithms better. You can find the Python implementation of gradient boosting for classification here.
Data set
Here, we are going to work on the Iris data set. There are 150 instances of 3 balanced classes: setosa, versicolor and virginica. The class is the target output, whereas sepal length, sepal width, petal length and petal width are the input features.
Why do we need boosting?
Applying the C4.5 decision tree algorithm to this data set classifies 105 instances correctly and 45 incorrectly. That is 70% accuracy, which is far from a success. We will run the same C4.5 algorithm in the following steps, but boosting enables us to increase the accuracy.
You can find the decision tree building code here.
Binary classification with multiple trees
We are going to apply one-hot encoding to the target output, so the output will be represented as a three-dimensional vector. However, decision tree algorithms can handle only one output. That's why we will build 3 different regression trees each time. You can think of each decision tree as a separate binary classification problem.
To illustrate, I've selected a few sample rows from the data set. This is the original version.
instance | sepal_length | sepal_width | petal_length | petal_width | label |
1 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
2 | 4.9 | 3 | 1.4 | 0.2 | setosa |
51 | 7 | 3.2 | 4.7 | 1.4 | versicolor |
101 | 6.3 | 3.3 | 6 | 2.5 | virginica |
Label consists of 3 classes: setosa, versicolor and virginica.
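The encoding step described above can be sketched in a few lines. The sample labels follow the table rows; this is illustrative code, not the author's original implementation:

```python
import numpy as np

# Sample labels matching the table rows above (instances 1, 2, 51, 101)
labels = ["setosa", "setosa", "versicolor", "virginica"]
classes = ["setosa", "versicolor", "virginica"]

# One column per class: 1 if the instance belongs to that class, else 0
one_hot = np.array([[1 if label == c else 0 for c in classes] for label in labels])
print(one_hot)
```

Each column of this matrix can now serve as the target of its own regression tree, which is exactly what the three data sets below do.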
Data set for setosa class
Firstly, I prepare a data set to check whether an instance is setosa or not.
instance | sepal_length | sepal_width | petal_length | petal_width | setosa |
1 | 5.1 | 3.5 | 1.4 | 0.2 | 1 |
2 | 4.9 | 3 | 1.4 | 0.2 | 1 |
51 | 7 | 3.2 | 4.7 | 1.4 | 0 |
101 | 6.3 | 3.3 | 6 | 2.5 | 0 |
Data set for versicolor class
Secondly, I prepare a data set to check whether an instance is versicolor or not.
instance | sepal_length | sepal_width | petal_length | petal_width | versicolor |
1 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
2 | 4.9 | 3 | 1.4 | 0.2 | 0 |
51 | 7 | 3.2 | 4.7 | 1.4 | 1 |
101 | 6.3 | 3.3 | 6 | 2.5 | 0 |
Data set for virginica class
Finally, I'll prepare a data set to check whether an instance is virginica or not.
instance | sepal_length | sepal_width | petal_length | petal_width | virginica |
1 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
2 | 4.9 | 3 | 1.4 | 0.2 | 0 |
51 | 7 | 3.2 | 4.7 | 1.4 | 0 |
101 | 6.3 | 3.3 | 6 | 2.5 | 1 |
Now, I have 3 different data sets. I can build 3 decision trees for these data sets.
I'm going to put the actual labels and the predictions in the same table in the following steps. Columns beginning with the F_ prefix are predictions.
instance | Y_setosa | Y_versicolor | Y_virginica | F_setosa | F_versicolor | F_virginica |
1 | 1 | 0 | 0 | 1 | 0 | 0 |
2 | 1 | 0 | 0 | 1 | 0 | 0 |
51 | 0 | 1 | 0 | 0 | 1 | 0 |
101 | 0 | 0 | 1 | 0 | 1 | 1 |
Notice that instance 101 is scored equally for versicolor and virginica. This prediction is ambiguous, so it contains an error.
Softmax probability function
Initially, we need to apply the softmax function to the predictions. This function normalizes all inputs into the [0, 1] range, and the normalized values always sum to 1. There is no out-of-the-box softmax function in Python's standard library, but we can easily create one, as coded below.
import numpy as np

def softmax(w):
    # Exponentiate the raw scores, then normalize so they sum to 1
    e = np.exp(np.array(w))
    dist = e / np.sum(e)
    return dist
I'm going to add these probabilities as columns. I've also hidden the actual values (Y_ prefix) to fit the table.
ins | F_setosa | F_versicolor | F_virginica | P_setosa | P_versicolor | P_virginica |
1 | 1 | 0 | 0 | 0.576 | 0.212 | 0.212 |
2 | 1 | 0 | 0 | 0.576 | 0.212 | 0.212 |
51 | 0 | 1 | 0 | 0.212 | 0.576 | 0.212 |
101 | 0 | 1 | 1 | 0.155 | 0.422 | 0.422 |
Finding negative gradient
Remember that in the regression case, we built each new tree on the actual-minus-prediction target value. That difference comes from the derivative of the mean squared error. Here, we have applied the softmax function instead. The class with the maximum probability (columns with the P_ prefix) becomes the prediction; in other words, we one-hot-encode the result as 1 for the maximum and 0 for the others. Cross entropy measures the relation between the predicted probabilities and these one-hot-encoded results. Applying softmax and cross entropy together has a surprisingly simple derivative: it is equal to prediction (the probabilities, in this case) minus actual. The negative gradient is therefore actual (columns with the Y_ prefix) minus prediction (columns with the P_ prefix). We will derive this value.
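That surprisingly simple derivative can be written out explicitly. With softmax probabilities p_i over raw scores F_i and cross-entropy loss L against one-hot labels y_i (standard notation, not from the original post):

```latex
p_i = \frac{e^{F_i}}{\sum_k e^{F_k}}, \qquad
L = -\sum_k y_k \log p_k
```

```latex
\frac{\partial L}{\partial F_i} = p_i - y_i
\quad\Rightarrow\quad
-\frac{\partial L}{\partial F_i} = y_i - p_i
```

This is why the new regression targets in the table below are simply the Y_ columns minus the P_ columns.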
instance | Y_setosa – P_setosa | Y_versicolor – P_versicolor | Y_virginica – P_virginica |
1 | 0.424 | -0.212 | -0.212 |
2 | 0.424 | -0.212 | -0.212 |
51 | -0.212 | 0.424 | -0.212 |
101 | -0.155 | -0.422 | 0.578 |
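The table above can be reproduced with a few lines of NumPy. This is a sketch: the softmax helper is the one defined earlier, and the sample Y_ and F_ values come from the tables above.

```python
import numpy as np

def softmax(w):
    # Same helper as defined earlier in the post
    e = np.exp(np.array(w))
    return e / np.sum(e)

# One-hot actuals (Y_ columns) and raw tree outputs (F_ columns)
# for instances 1, 2, 51 and 101
Y = np.array([[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]], dtype=float)
F = np.array([[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 1, 1]], dtype=float)

P = np.array([softmax(row) for row in F])  # probabilities (P_ columns)
residuals = Y - P                          # negative gradients
print(np.round(residuals, 3))
```

The printed rows match the table: for instance 1 the setosa residual is 1 − 0.576 = 0.424, and for instance 101 the virginica residual is 1 − 0.422 = 0.578.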
This was round 1. In the following round, the target values will be replaced by these negative gradients.
Reflecting negative gradient
Target column for setosa will be replaced with Y_setosa – P_setosa.
instance | sepal_length | sepal_width | petal_length | petal_width | setosa |
1 | 5.1 | 3.5 | 1.4 | 0.2 | 0.424 |
2 | 4.9 | 3 | 1.4 | 0.2 | 0.424 |
51 | 7 | 3.2 | 4.7 | 1.4 | -0.212 |
101 | 6.3 | 3.3 | 6 | 2.5 | -0.155 |
Target column for versicolor will be replaced with Y_versicolor – P_versicolor.
instance | sepal_length | sepal_width | petal_length | petal_width | versicolor |
1 | 5.1 | 3.5 | 1.4 | 0.2 | -0.212 |
2 | 4.9 | 3 | 1.4 | 0.2 | -0.212 |
51 | 7 | 3.2 | 4.7 | 1.4 | 0.424 |
101 | 6.3 | 3.3 | 6 | 2.5 | -0.422 |
I will apply a similar replacement for virginica, too. These are my new data sets. I'm going to build 3 different regression trees for these 3 different data sets. This operation is repeated until satisfactory accuracy is reached.
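The whole loop described so far can be sketched in Python. Note the assumptions: this uses scikit-learn's DecisionTreeRegressor (a CART implementation, not the C4.5 algorithm from the original experiments), the tree depth of 3 and 10 boosting rounds are arbitrary illustrative choices, and accuracy is measured on the training set as in the post.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeRegressor

def softmax_rows(F):
    # Row-wise softmax; subtracting the row max avoids overflow
    e = np.exp(F - F.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

X, y = load_iris(return_X_y=True)
Y = np.eye(3)[y]          # one-hot targets, one column per class
F = np.zeros_like(Y)      # accumulated raw predictions over rounds

for _ in range(10):       # boosting rounds
    residuals = Y - softmax_rows(F)   # negative gradients
    for c in range(3):                # one regression tree per class
        tree = DecisionTreeRegressor(max_depth=3)
        tree.fit(X, residuals[:, c])
        F[:, c] += tree.predict(X)

accuracy = (F.argmax(axis=1) == y).mean()
print(f"training accuracy: {accuracy:.2f}")
```

Each round fits one regression tree per class to the current residuals and adds its output to the running F scores, exactly mirroring the table-by-table walkthrough above.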
Final predictions
Finally, I'm going to sum the predictions (F_ prefix) over all rounds. The class with the maximum summed value will be my prediction.
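As a tiny sketch of that final step, with made-up summed scores (the numbers below are illustrative, not taken from the tables above):

```python
import numpy as np

# Hypothetical summed raw predictions (F_ columns) for one instance
F_total = np.array([-0.3, 0.9, 2.1])
classes = ["setosa", "versicolor", "virginica"]

# The index of the maximum summed score picks the class
predicted = classes[int(np.argmax(F_total))]
print(predicted)  # virginica
```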
At round 10, I can classify 144 instances correctly and 6 incorrectly. This means I got 96% accuracy. Remember that I got 70% accuracy before boosting. This is a major improvement!
Random Forest vs Gradient Boosting
Both random forest and gradient boosting are ensemble approaches rather than core decision tree algorithms themselves. They require a core decision tree algorithm to run, and they both build many decision trees in the background. We discuss how they are similar and how they differ in the following video.
Adaboost vs Gradient Boosting
Both gradient boosting and AdaBoost are boosting techniques for decision-tree-based machine learning models. We will discuss how they are similar and how they differ from each other.
Feature importance
Decision trees are naturally explainable and interpretable algorithms, but a GBM ensemble of many trees is much harder to explain. Herein, feature importance helps us understand the built trees better.
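One hedged way to get at this, assuming scikit-learn trees are used as the base learners: each fitted tree exposes feature_importances_, and averaging them across the per-class trees gives an ensemble-level view. This is a sketch for a single round, not the author's original code.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeRegressor

X, y = load_iris(return_X_y=True)
Y = np.eye(3)[y]  # one-hot targets, one column per class

# Fit one regression tree per class (a single boosting round, for illustration)
trees = [DecisionTreeRegressor(max_depth=3).fit(X, Y[:, c]) for c in range(3)]

# Average per-tree importances into one ensemble-level vector
importance = np.mean([t.feature_importances_ for t in trees], axis=0)
for name, imp in zip(load_iris().feature_names, importance):
    print(f"{name}: {imp:.3f}")
```

On Iris, the petal measurements typically dominate the sepal measurements, which matches what the individual trees split on.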
Bonus: binary classification
I've demonstrated gradient boosting for classification on a multi-class problem where the number of classes is greater than 2. Running it on a binary classification problem (true/false) would typically use the sigmoid function instead. Still, the softmax and cross-entropy pair works for binary classification as well.
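A minimal sketch of the binary case: with sigmoid in place of softmax and log loss in place of cross entropy, the negative gradient has the same actual-minus-prediction form. The F and y values below are made up for illustration.

```python
import numpy as np

def sigmoid(x):
    # Squashes a raw score into a probability in (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical accumulated raw scores F and actual binary labels y
F = np.array([2.0, -1.0, 0.5])
y = np.array([1.0, 0.0, 1.0])

p = sigmoid(F)       # probability of the positive class
residuals = y - p    # negative gradient of the log loss
print(np.round(residuals, 3))
```

These residuals would then become the targets of the next regression tree, just as in the multi-class walkthrough.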
Conclusion
So, we've walked through a step-by-step gradient boosting example for classification. I could not find such a walkthrough in the literature. Basically, we transformed a classification problem into multiple regression tasks in order to boost. I am grateful to Cheng Li; his lecture notes guided me to understand this topic. Finally, running and debugging the code yourself makes the concept much easier to understand. That's why I've already pushed the code of gradient boosting for classification to GitHub.