A Gentle Introduction to LightGBM for Applied Machine Learning

Decision tree based machine learning algorithms dominate Kaggle competitions; more than half of the winning solutions have adopted XGBoost. Recently, Microsoft announced its own gradient boosting framework, LightGBM, and nowadays it steals the spotlight in gradient boosting machines. Kagglers have started to use LightGBM more than XGBoost. Even though XGBoost might reach slightly higher accuracy, LightGBM used to run about 10 times faster than XGBoost and still runs roughly 6 times faster. Moreover, there are already tens of winning solutions built on it standing atop challenge podiums.

alonso-podium-in-f1
Podium ceremony in Formula 1

What is GBM?

LightGBM stands for Light Gradient Boosting Machine.


🙋‍♂️ You may consider enrolling in my top-rated machine learning course on Udemy

Decision Trees for Machine Learning

Gradient boosting machines build decision trees sequentially. Each tree is fit to the errors of the previous trees, and the final prediction is the sum of the predictions of all of those trees.
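
To make the idea concrete, here is a minimal sketch of boosting with plain scikit-learn regression trees. It is an illustration of the principle only, not LightGBM's actual implementation: each new tree fits the residual error of the current ensemble, and the final prediction is the running sum.

#minimal gradient boosting sketch (illustration only, not LightGBM internals)
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.array([[85], [80], [83], [70], [68]]) #a single numeric feature
y = np.array([0, 0, 1, 1, 1], dtype=float)   #target values

learning_rate = 0.5
prediction = np.full_like(y, y.mean()) #start from the mean prediction
trees = []

for _ in range(10):
    residuals = y - prediction              #error of the current ensemble
    tree = DecisionTreeRegressor(max_depth=1)
    tree.fit(X, residuals)                  #the next tree learns the previous error
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)

print(prediction.round(2)) #approaches y as more trees are added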

Hands-on video

We will build a machine learning model in Python with the LightGBM framework in this episode.

Installation

You can run the pip install lightgbm command to install the LightGBM package. Then, we import the library.

import lightgbm as lgb

Tree growth

XGBoost grows trees level-wise (depth by depth), whereas LightGBM grows them leaf-wise, always splitting the leaf with the largest loss reduction. Leaf-wise growth reaches a lower loss with fewer nodes, which is one of the reasons LightGBM is faster.

tree-growth
Tree growth types. Illustration: Felipe Sulser
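
In code, leaf-wise growth is controlled mainly by the num_leaves parameter, optionally bounded by max_depth. A minimal sketch of these two parameters; the values shown are LightGBM's defaults and the dictionary name is only for illustration.

#leaf-wise growth is capped by the number of leaves rather than the depth
growth_params = {
    'num_leaves': 31 #maximum leaves per tree; the main capacity setting for leaf-wise growth
    , 'max_depth': -1 #-1 means no depth limit; set a positive value to bound tree depth
}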

Data set

The data set that we are going to work on is about the decision to play golf based on weather features. You can find the data set here. I chose this data set because it has both numeric and string features. The Decision column is the target from which we would like to extract decision rules. I will load the data set with pandas because it simplifies column based operations in the following steps.

import pandas as pd
dataset = pd.read_csv('golf2.txt')
dataset.head()

The data frame’s head function prints the first 5 rows.

Outlook Temp. Humidity Wind Decision
0 Sunny 85 85 Weak No
1 Sunny 80 90 Strong No
2 Overcast 83 78 Weak Yes
3 Rain 70 96 Weak Yes
4 Rain 68 80 Weak Yes

Label encoding

LightGBM expects categorical features to be encoded as integers. Here, the temperature and humidity features are already numeric, but the outlook and wind features are categorical. We need to convert them. I will use scikit-learn’s LabelEncoder.

Even though categorical features are converted to integers, we will still tell LightGBM which features are categorical in the following steps. That’s why I store the full feature list and the categorical feature list in separate variables. I also set an is_regression flag and keep the target column name here, because the same flow works for regression data sets as well; our golf decision is a classification target.

import numpy as np
from sklearn import preprocessing
le = preprocessing.LabelEncoder()

is_regression = False #the golf decision is a classification target
target_name = dataset.columns[-1] #'Decision'

features = []; categorical_features = []
num_of_columns = dataset.shape[1]

for i in range(0, num_of_columns):
    column_name = dataset.columns[i]
    column_type = dataset[column_name].dtypes

    if i != num_of_columns - 1: #skip target
        features.append(column_name)

    if column_type == 'object':
        le.fit(dataset[column_name])
        feature_classes = list(le.classes_)
        encoded_feature = le.transform(dataset[column_name])
        dataset[column_name] = pd.DataFrame(encoded_feature)

        if i != num_of_columns - 1: #skip target
            categorical_features.append(column_name)

        if is_regression == False and i == num_of_columns - 1:
            num_of_classes = len(feature_classes)

In this way, we can handle different data sets. Let’s check the encoded data set.

dataset.head()
Outlook Temp. Humidity Wind Decision
0 2 85 85 1 0
1 2 80 90 0 0
2 0 83 78 1 1
3 1 70 96 1 1
4 1 68 80 1 1

The data set is now in its final form. We need to separate the input features and the output labels to feed LightGBM.

y_train = dataset['Decision'].values
x_train = dataset.drop(columns=['Decision']).values

Specifying categorical features

Remember that we have converted string features to integers. Here, we still need to tell LightGBM which of those features are categorical. Training would work even if we did not mention them, but then a node in the decision tree might split a categorical feature with a numeric comparison such as greater than, or less than or equal to, some threshold. Suppose gender were a feature in our data set and we encoded unknown as 0, male as 1 and female as 2. What would a split like gender greater than 0, or gender less than or equal to 0, actually mean? We might lose important information about gender. Specifying categorical features lets the tree treat male, female and unknown as separate categories instead.

lgb_train = lgb.Dataset(x_train, y_train
,feature_name = features
, categorical_feature = categorical_features
)
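
As a side note, LightGBM can also detect categorical features automatically when you pass a pandas data frame whose columns use the 'category' dtype, so the manual list is not the only option. A sketch assuming the raw file is re-loaded; the variable names are only for illustration.

#alternative sketch: mark columns as pandas 'category' and let LightGBM handle them
raw = pd.read_csv('golf2.txt')
x_df = raw.drop(columns=['Decision'])
for col in ['Outlook', 'Wind']:
    x_df[col] = x_df[col].astype('category')
y_raw = (raw['Decision'] == 'Yes').astype(int).values
lgb_train_alt = lgb.Dataset(x_df, label=y_raw) #categorical_feature defaults to 'auto'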

Training

We can solve this problem as either classification or regression; typically, only the objective and metric parameters need to differ. Passing the parameter set and LightGBM’s data set to lgb.train starts the training.

params = {
    'task': 'train'
    , 'boosting_type': 'gbdt'
    , 'objective': 'regression' if is_regression == True else 'multiclass'
    , 'metric': 'rmse' if is_regression == True else 'multi_logloss'
    , 'min_data': 1
    , 'verbose': -1
}

if is_regression == False: #num_class is only required for the multiclass objective
    params['num_class'] = num_of_classes

gbm = lgb.train(params, lgb_train, num_boost_round=50)
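
On a 14-instance toy data set there is nothing to tune, but on real problems you would normally pass a held-out validation set and stop boosting when the validation metric stops improving. Here is a sketch of that workflow; note that the early-stopping callback used below exists in recent LightGBM releases, while older versions accepted an early_stopping_rounds argument in lgb.train instead, and the validation set here reuses the training data only for illustration.

#sketch: early stopping against a validation set (use genuinely unseen data in practice)
lgb_valid = lgb.Dataset(x_train, y_train, reference=lgb_train)
gbm_es = lgb.train(params, lgb_train
    , num_boost_round=500
    , valid_sets=[lgb_valid]
    , callbacks=[lgb.early_stopping(stopping_rounds=10)]
)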

Prediction

The trained model is stored in the gbm variable. We can ask gbm to predict the decision for a new instance. Similarly, we can feed the features of the training set instances and ask gbm to predict their decisions.

predictions = gbm.predict(x_train)

for index, instance in dataset.iterrows():
    actual = instance[target_name]

    if is_regression == True:
        prediction = round(predictions[index])
    else: #classification
        prediction = np.argmax(predictions[index])

    print((index+1),". actual= ",actual,", prediction= ",prediction)

This code block makes the following predictions for the training data set. As seen, all instances are predicted correctly.

actual=  0 , prediction=  0
actual=  0 , prediction=  0
actual=  1 , prediction=  1
actual=  1 , prediction=  1
actual=  1 , prediction=  1
actual=  0 , prediction=  0
actual=  1 , prediction=  1
actual=  0 , prediction=  0
actual=  1 , prediction=  1
actual=  1 , prediction=  1
actual=  1 , prediction=  1
actual=  1 , prediction=  1
actual=  1 , prediction=  1
actual=  0 , prediction=  0

Visualization

Luckily, LightGBM can visualize the built decision trees and the importance of the data set features. This makes the decisions understandable. It requires the Graphviz graph visualization software to be installed.

Firstly, you need to run the pip install graphviz command to install the Python package.

Secondly, please install the Graphviz package for your operating system here. You can then add the installation directory to the path as illustrated below.

import matplotlib.pyplot as plt
import os
os.environ["PATH"] += os.pathsep + 'C:/Program Files (x86)/Graphviz2.38/bin'

Plotting the feature importance and the tree is an easy task now.

ax = lgb.plot_importance(gbm, max_num_features=10)
plt.show()

ax = lgb.plot_tree(gbm)
plt.show()

Decision rules can be extracted from the built tree easily.

lgb-built-tree
Built decision tree
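
If you prefer to read the rules programmatically instead of from the plot, the trained booster can also be dumped as a dictionary; a short sketch based on LightGBM's JSON dump format:

#dump the model structure and inspect the first tree's split rules
model_dict = gbm.dump_model()
print(model_dict['tree_info'][0]['tree_structure'])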

Now, we know the feature importance values for the data set.

lgb-features
Feature importance values found by LightGBM

Accuracy Report

We can compute the accuracy score as coded below.

predictions_classes = []
for i in predictions:
    if is_regression == True:
        predictions_classes.append(round(i))
    else:
        predictions_classes.append(np.argmax(i))

predictions_classes = np.array(predictions_classes)

from sklearn.metrics import confusion_matrix, accuracy_score, roc_curve, auc
accuracy = accuracy_score(predictions_classes, y_train)*100
print(accuracy,"%")
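
The confusion_matrix function imported above can be used in the same way to see where the model confuses the classes:

#rows are actual classes, columns are predicted classes
print(confusion_matrix(y_train, predictions_classes))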

Moreover, if the problem is a classification problem, then metrics such as precision, recall and ROC AUC are more informative than the raw accuracy.

if is_regression == False:
    actuals_onehot = pd.get_dummies(y_train).values
    #use the positive class (column 1) probabilities to draw the ROC curve
    false_positive_rate, recall, thresholds = roc_curve(actuals_onehot[:, 1], predictions[:, 1])
    roc_auc = auc(false_positive_rate, recall)
    print("AUC score ", roc_auc)

LightGBM vs XGBoost

LightGBM and XGBoost are the most popular gradient boosting frameworks.

Random Forest vs Gradient Boosting

LightGBM covers both the random forest and gradient boosting algorithms. We discuss how they are similar and how they differ in the following video; a small code sketch follows as well.
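
For example, the same training API can run in random forest mode just by changing the boosting type; LightGBM's 'rf' mode requires row bagging to be enabled. A minimal sketch reusing the params dictionary from above, with arbitrary example fractions:

#sketch: random forest mode in LightGBM; 'rf' requires row bagging to be turned on
rf_params = dict(params)
rf_params['boosting_type'] = 'rf'
rf_params['bagging_fraction'] = 0.8 #use 80% of the rows for each tree
rf_params['bagging_freq'] = 1       #re-sample the rows at every iteration
rf_params['feature_fraction'] = 0.8 #use 80% of the features for each tree
rf_model = lgb.train(rf_params, lgb_train, num_boost_round=50)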

Feature importance

Decision trees are naturally interpretable and explainable machine learning algorithms, so LightGBM is explainable as well. Have you ever wondered how to explain decision trees?
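
Besides the plot shown earlier, the importance values can be read programmatically from the trained booster; a short sketch:

#how many times each feature is used in a split across all boosted trees
for name, score in zip(gbm.feature_name(), gbm.feature_importance(importance_type='split')):
    print(name, score)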

Conclusion

So, we have discovered Microsoft’s light gradient boosting machine framework, which has been adopted in many applied machine learning studies. We’ve mentioned its pros and cons compared to its alternatives, and we’ve developed a hello world model with LightGBM. Finally, I pushed the source code of this blog post to my GitHub profile.



