A Gentle Introduction to LightGBM for Applied Machine Learning

Decision tree based machine learning algorithms dominate Kaggle competitions; more than half of the winning solutions have adopted XGBoost. Recently, Microsoft announced its own gradient boosting framework, LightGBM, and nowadays it steals the spotlight in gradient boosting machines. Kagglers have started to use LightGBM more than XGBoost. Even though XGBoost might reach slightly higher accuracy, LightGBM used to run about 10 times faster than XGBoost and still runs roughly 6 times faster. Moreover, there are already tens of winning solutions built on it standing atop challenge podiums.

alonso-podium-in-f1
Podium ceremony in Formula 1

What is GBM?

LightGBM stands for Light Gradient Boosting Machine.


🙋‍♂️ You may consider enrolling in my top-rated machine learning course on Udemy

Decision Trees for Machine Learning

Gradient boosting machines build decision trees sequentially. Each tree is fit to the errors of the previous trees, and the final prediction is the sum of the predictions of all of those trees.
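
To make the idea concrete, here is a minimal sketch of boosting with plain scikit-learn regression trees. It is an illustration of the principle only, not LightGBM's actual implementation: each new tree fits the residual error of the current ensemble, and the final prediction is the running sum.

#minimal gradient boosting sketch (illustration only, not LightGBM internals)
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.array([[85], [80], [83], [70], [68]]) #a single numeric feature
y = np.array([0, 0, 1, 1, 1], dtype=float)   #target values

learning_rate = 0.5
prediction = np.full_like(y, y.mean()) #start from the mean prediction
trees = []

for _ in range(10):
    residuals = y - prediction              #error of the current ensemble
    tree = DecisionTreeRegressor(max_depth=1)
    tree.fit(X, residuals)                  #the next tree learns the previous error
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)

print(prediction.round(2)) #approaches y as more trees are added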

Hands-on video

We will build a machine learning model in Python with the LightGBM framework in this episode.

Installation

You can run the pip install lightgbm command to install the LightGBM package. Then, we import the library.

import lightgbm as lgb

Tree growth

XGBoost grows trees level-wise (depth by depth), whereas LightGBM grows them leaf-wise, always splitting the leaf with the largest loss reduction. Leaf-wise growth reaches a lower loss with fewer nodes, which is one of the reasons LightGBM is faster.

tree-growth
Tree growth types. Illustration: Felipe Sulser
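
In code, leaf-wise growth is controlled mainly by the num_leaves parameter, optionally bounded by max_depth. A minimal sketch of these two parameters; the values shown are LightGBM's defaults and the dictionary name is only for illustration.

#leaf-wise growth is capped by the number of leaves rather than the depth
growth_params = {
    'num_leaves': 31 #maximum leaves per tree; the main capacity setting for leaf-wise growth
    , 'max_depth': -1 #-1 means no depth limit; set a positive value to bound tree depth
}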

Data set

The data set that we are going to work on is about the decision to play golf based on weather features. You can find the data set here. I chose this data set because it has both numeric and string features. The Decision column is the target from which we would like to extract decision rules. I will load the data set with pandas because it simplifies column based operations in the following steps.

import pandas as pd
dataset = pd.read_csv('golf2.txt')
dataset.head()

The data frame’s head function prints the first 5 rows.

Outlook Temp. Humidity Wind Decision
0 Sunny 85 85 Weak No
1 Sunny 80 90 Strong No
2 Overcast 83 78 Weak Yes
3 Rain 70 96 Weak Yes
4 Rain 68 80 Weak Yes

Label encoding

LightGBM expects categorical features to be encoded as integers. Here, the temperature and humidity features are already numeric, but the outlook and wind features are categorical. We need to convert them. I will use scikit-learn’s LabelEncoder.

Even though categorical features are converted to integers, we will still tell LightGBM which features are categorical in the following steps. That’s why I store the full feature list and the categorical feature list in separate variables. I also set an is_regression flag and keep the target column name here, because the same flow works for regression data sets as well; our golf decision is a classification target.

import numpy as np
from sklearn import preprocessing
le = preprocessing.LabelEncoder()

is_regression = False #the golf decision is a classification target
target_name = dataset.columns[-1] #'Decision'

features = []; categorical_features = []
num_of_columns = dataset.shape[1]

for i in range(0, num_of_columns):
    column_name = dataset.columns[i]
    column_type = dataset[column_name].dtypes

    if i != num_of_columns - 1: #skip target
        features.append(column_name)

    if column_type == 'object':
        le.fit(dataset[column_name])
        feature_classes = list(le.classes_)
        encoded_feature = le.transform(dataset[column_name])
        dataset[column_name] = pd.DataFrame(encoded_feature)

        if i != num_of_columns - 1: #skip target
            categorical_features.append(column_name)

        if is_regression == False and i == num_of_columns - 1:
            num_of_classes = len(feature_classes)

In this way, we can handle different data sets. Let’s check the encoded data set.

dataset.head()
Outlook Temp. Humidity Wind Decision
0 2 85 85 1 0
1 2 80 90 0 0
2 0 83 78 1 1
3 1 70 96 1 1
4 1 68 80 1 1

The data set is now in its final form. We need to separate the input features and the output labels to feed LightGBM.

y_train = dataset['Decision'].values
x_train = dataset.drop(columns=['Decision']).values

Specifying categorical features

Remember that we have converted string features to integers. Here, we still need to tell LightGBM which of those features are categorical. Training would work even if we did not mention them, but then a node in the decision tree might split a categorical feature with a numeric comparison such as greater than, or less than or equal to, some threshold. Suppose gender were a feature in our data set and we encoded unknown as 0, male as 1 and female as 2. What would a split like gender greater than 0, or gender less than or equal to 0, actually mean? We might lose important information about gender. Specifying categorical features lets the tree treat male, female and unknown as separate categories instead.

lgb_train = lgb.Dataset(x_train, y_train
,feature_name = features
, categorical_feature = categorical_features
)
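
As a side note, LightGBM can also detect categorical features automatically when you pass a pandas data frame whose columns use the 'category' dtype, so the manual list is not the only option. A sketch assuming the raw file is re-loaded; the variable names are only for illustration.

#alternative sketch: mark columns as pandas 'category' and let LightGBM handle them
raw = pd.read_csv('golf2.txt')
x_df = raw.drop(columns=['Decision'])
for col in ['Outlook', 'Wind']:
    x_df[col] = x_df[col].astype('category')
y_raw = (raw['Decision'] == 'Yes').astype(int).values
lgb_train_alt = lgb.Dataset(x_df, label=y_raw) #categorical_feature defaults to 'auto'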

Training

We can solve this problem as either classification or regression; typically, only the objective and metric parameters need to differ. Passing the parameter set and LightGBM’s data set to lgb.train starts the training.

params = {
    'task': 'train'
    , 'boosting_type': 'gbdt'
    , 'objective': 'regression' if is_regression == True else 'multiclass'
    , 'metric': 'rmse' if is_regression == True else 'multi_logloss'
    , 'min_data': 1
    , 'verbose': -1
}

if is_regression == False: #num_class is only required for the multiclass objective
    params['num_class'] = num_of_classes

gbm = lgb.train(params, lgb_train, num_boost_round=50)
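
On a 14-instance toy data set there is nothing to tune, but on real problems you would normally pass a held-out validation set and stop boosting when the validation metric stops improving. Here is a sketch of that workflow; note that the early-stopping callback used below exists in recent LightGBM releases, while older versions accepted an early_stopping_rounds argument in lgb.train instead, and the validation set here reuses the training data only for illustration.

#sketch: early stopping against a validation set (use genuinely unseen data in practice)
lgb_valid = lgb.Dataset(x_train, y_train, reference=lgb_train)
gbm_es = lgb.train(params, lgb_train
    , num_boost_round=500
    , valid_sets=[lgb_valid]
    , callbacks=[lgb.early_stopping(stopping_rounds=10)]
)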

Prediction

The trained model is stored in the gbm variable. We can ask gbm to predict the decision for a new instance. Similarly, we can feed the features of the training set instances and ask gbm to predict their decisions.

predictions = gbm.predict(x_train)

for index, instance in dataset.iterrows():
    actual = instance[target_name]

    if is_regression == True:
        prediction = round(predictions[index])
    else: #classification
        prediction = np.argmax(predictions[index])

    print((index+1),". actual= ",actual,", prediction= ",prediction)

This code block makes the following predictions for the training data set. As seen, all instances are predicted correctly.

actual=  0 , prediction=  0
actual=  0 , prediction=  0
actual=  1 , prediction=  1
actual=  1 , prediction=  1
actual=  1 , prediction=  1
actual=  0 , prediction=  0
actual=  1 , prediction=  1
actual=  0 , prediction=  0
actual=  1 , prediction=  1
actual=  1 , prediction=  1
actual=  1 , prediction=  1
actual=  1 , prediction=  1
actual=  1 , prediction=  1
actual=  0 , prediction=  0

Visualization

Luckily, LightGBM can visualize the built decision trees and the importance of the data set features. This makes the decisions understandable. It requires the Graphviz graph visualization software to be installed.

Firstly, you need to run the pip install graphviz command to install the Python package.

Secondly, please install the Graphviz package for your operating system here. You can then add the installation directory to the path as illustrated below.

import matplotlib.pyplot as plt
import os
os.environ["PATH"] += os.pathsep + 'C:/Program Files (x86)/Graphviz2.38/bin'

Plotting the feature importance and the tree is an easy task now.

ax = lgb.plot_importance(gbm, max_num_features=10)
plt.show()

ax = lgb.plot_tree(gbm)
plt.show()

Decision rules can be extracted from the built tree easily.

lgb-built-tree
Built decision tree
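
If you prefer to read the rules programmatically instead of from the plot, the trained booster can also be dumped as a dictionary; a short sketch based on LightGBM's JSON dump format:

#dump the model structure and inspect the first tree's split rules
model_dict = gbm.dump_model()
print(model_dict['tree_info'][0]['tree_structure'])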

Now, we know the feature importance values for the data set.

lgb-features
Feature importance values found by LightGBM

Accuracy Report

We can compute the accuracy score as coded below.

predictions_classes = []
for i in predictions:
    if is_regression == True:
        predictions_classes.append(round(i))
    else:
        predictions_classes.append(np.argmax(i))

predictions_classes = np.array(predictions_classes)

from sklearn.metrics import confusion_matrix, accuracy_score, roc_curve, auc
accuracy = accuracy_score(predictions_classes, y_train)*100
print(accuracy,"%")
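
The confusion_matrix function imported above can be used in the same way to see where the model confuses the classes:

#rows are actual classes, columns are predicted classes
print(confusion_matrix(y_train, predictions_classes))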

Moreover, if the problem is a classification problem, then metrics such as precision, recall and ROC AUC are more informative than the raw accuracy.

if is_regression == False:
    actuals_onehot = pd.get_dummies(y_train).values
    #use the positive class (column 1) probabilities to draw the ROC curve
    false_positive_rate, recall, thresholds = roc_curve(actuals_onehot[:, 1], predictions[:, 1])
    roc_auc = auc(false_positive_rate, recall)
    print("AUC score ", roc_auc)

LightGBM vs XGBoost

LightGBM and XGBoost are the most popular gradient boosting frameworks.

Random Forest vs Gradient Boosting

LightGBM covers both the random forest and gradient boosting algorithms. We discuss how they are similar and how they differ in the following video; a small code sketch follows as well.
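
For example, the same training API can run in random forest mode just by changing the boosting type; LightGBM's 'rf' mode requires row bagging to be enabled. A minimal sketch reusing the params dictionary from above, with arbitrary example fractions:

#sketch: random forest mode in LightGBM; 'rf' requires row bagging to be turned on
rf_params = dict(params)
rf_params['boosting_type'] = 'rf'
rf_params['bagging_fraction'] = 0.8 #use 80% of the rows for each tree
rf_params['bagging_freq'] = 1       #re-sample the rows at every iteration
rf_params['feature_fraction'] = 0.8 #use 80% of the features for each tree
rf_model = lgb.train(rf_params, lgb_train, num_boost_round=50)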

Feature importance

Decision trees are naturally interpretable and explainable machine learning algorithms, so LightGBM is explainable as well. Have you ever wondered how to explain decision trees?
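
Besides the plot shown earlier, the importance values can be read programmatically from the trained booster; a short sketch:

#how many times each feature is used in a split across all boosted trees
for name, score in zip(gbm.feature_name(), gbm.feature_importance(importance_type='split')):
    print(name, score)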

Conclusion

So, we have discovered Microsoft’s light gradient boosting machine framework, which has been adopted in many applied machine learning studies. We’ve mentioned its pros and cons compared to its alternatives, and we’ve developed a hello world model with LightGBM. Finally, I pushed the source code of this blog post to my GitHub profile.



