Decision tree based machine learning algorithms dominate Kaggle competitions; more than half of the winning solutions have adopted XGBoost. Recently, Microsoft released its own gradient boosting framework, LightGBM, and it has been stealing the spotlight: Kagglers have started to use LightGBM more than XGBoost. Even though XGBoost might offer slightly higher accuracy, LightGBM used to run about 10 times faster and still runs roughly 6 times faster than XGBoost. Moreover, tens of solutions built on it already stand atop challenge podiums.
What is GBM?
LightGBM stands for Light Gradient Boosting Machine.
Gradient boosting machines build decision trees sequentially: each new tree is fit to the errors of the trees built before it. The final prediction is the sum of all of those trees.
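A minimal, purely illustrative sketch of this idea with scikit-learn regression trees might look as follows. LightGBM implements it far more efficiently; the boost and predict_ensemble helpers below are hypothetical names, not part of any library.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost(x, y, n_trees=10, learning_rate=0.1):
    trees = []
    prediction = np.zeros(len(y))
    for _ in range(n_trees):
        residuals = y - prediction          #errors of the current ensemble
        tree = DecisionTreeRegressor(max_depth=2)
        tree.fit(x, residuals)              #the next tree learns those errors
        prediction += learning_rate * tree.predict(x)
        trees.append(tree)
    return trees

def predict_ensemble(trees, x, learning_rate=0.1):
    #final prediction is the scaled sum of all trees
    return sum(learning_rate * t.predict(x) for t in trees)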
Hands-on video
In this episode, we will build a machine learning model in Python with the LightGBM framework.
Installation
You can run the pip install lightgbm command to install the LightGBM package. Then, we reference the library in Python.
import lightgbm as lgb
Tree growth
XGBoost applies level-wise tree growth whereas LightGBM applies leaf-wise tree growth. This makes LightGBM faster.
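In practice, leaf-wise growth is controlled mainly by the num_leaves parameter, optionally bounded by max_depth. A hypothetical configuration might look like this; the values are illustrative only.

params_growth = {
    'num_leaves': 31, #maximum number of leaves per tree (leaf-wise growth)
    'max_depth': -1   #-1 means no depth limit; set a positive value to restrict growth
}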
Data set
The data set we are going to work on is about the decision to play golf based on weather features. You can find the data set here. I chose this data set because it has both numeric and string features. The Decision column is the target from which we would like to extract decision rules. I will load the data set with pandas because it simplifies column based operations in the following steps.
import pandas as pd

dataset = pd.read_csv('golf2.txt')
dataset.head()
Data frame’s head function prints the first 5 rows.
  | Outlook | Temp. | Humidity | Wind | Decision
0 | Sunny | 85 | 85 | Weak | No
1 | Sunny | 80 | 90 | Strong | No
2 | Overcast | 83 | 78 | Weak | Yes
3 | Rain | 70 | 96 | Weak | Yes
4 | Rain | 68 | 80 | Weak | Yes
Label encoding
LightGBM expects categorical features to be encoded as integers. Here, the temperature and humidity features are already numeric, but the outlook and wind features are categorical strings. We need to convert these features. I will use scikit-learn's LabelEncoder.
Even though the categorical features will be converted to integers, we still need to point them out to LightGBM in a following step. That is why I store all features and the categorical ones in separate variables.
from sklearn import preprocessing

le = preprocessing.LabelEncoder()

is_regression = False #set this to True for a regression target
target_name = dataset.columns[-1] #'Decision' is the target column

features = []; categorical_features = []

num_of_columns = dataset.shape[1]
for i in range(0, num_of_columns):
    column_name = dataset.columns[i]
    column_type = dataset[column_name].dtypes
    if i != num_of_columns - 1: #skip target
        features.append(column_name)
    if column_type == 'object':
        le.fit(dataset[column_name])
        feature_classes = list(le.classes_)
        encoded_feature = le.transform(dataset[column_name])
        dataset[column_name] = pd.DataFrame(encoded_feature)
        if i != num_of_columns - 1: #skip target
            categorical_features.append(column_name)
        if is_regression == False and i == num_of_columns - 1:
            num_of_classes = len(feature_classes)
In this way, we can handle different data sets. Let’s check the encoded data set.
dataset.head()
  | Outlook | Temp. | Humidity | Wind | Decision
0 | 2 | 85 | 85 | 1 | 0
1 | 2 | 80 | 90 | 0 | 0
2 | 0 | 83 | 78 | 1 | 1
3 | 1 | 70 | 96 | 1 | 1
4 | 1 | 68 | 80 | 1 | 1
The data set has been transformed into its final form. We need to separate the input features and the output labels to feed LightGBM.
y_train = dataset['Decision'].values
x_train = dataset.drop(columns=['Decision']).values
Specifying categorical features
Remember that we have converted the string features to integers. Here, we need to tell LightGBM which features are categorical. Training would still work if the categorical features were not mentioned, but in that case some nodes in the decision tree might check whether such a feature is greater than, or less than or equal to, some value. Suppose gender were a feature in our data set and we encoded unknown as 0, male as 1, and female as 2. What would a split such as gender greater than 0, or less than or equal to 0, actually mean? We might lose important gender information. Specifying categorical features lets the tree check male, female and unknown as separate categories.
lgb_train = lgb.Dataset(x_train, y_train,
    feature_name=features,
    categorical_feature=categorical_features)
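As a side note, if you keep the data in a pandas DataFrame, you can alternatively mark the string columns with the category dtype and let LightGBM detect them itself (categorical_feature defaults to 'auto'). The snippet below is only a sketch of that alternative, not the approach used in the rest of this post.

x_train_df = dataset.drop(columns=['Decision'])
for column in ['Outlook', 'Wind']: #the categorical columns in this data set
    x_train_df[column] = x_train_df[column].astype('category')
lgb_train_auto = lgb.Dataset(x_train_df, y_train) #category columns handled as categorical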
Training
We can solve both classification and regression problems with this pipeline; typically, only the objective and metric parameters differ. Passing the parameter set and LightGBM's data set starts the training.
params = {
    'task': 'train',
    'boosting_type': 'gbdt',
    'objective': 'regression' if is_regression == True else 'multiclass',
    'metric': 'rmse' if is_regression == True else 'multi_logloss',
    'min_data': 1,
    'verbose': -1
}

if is_regression == False:
    params['num_class'] = num_of_classes #required for the multiclass objective only

gbm = lgb.train(params, lgb_train, num_boost_round=50)
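If you have a hold-out set, you can also pass it as a validation set and stop training early once the metric stops improving. The snippet below is only a sketch: lgb_valid is a placeholder built from the training data, and the exact early stopping API differs between LightGBM versions (an early_stopping_rounds argument in older releases, the lgb.early_stopping callback in newer ones).

#lgb_valid is a placeholder: in practice build it from a real hold-out split
lgb_valid = lgb.Dataset(x_train, y_train, reference=lgb_train)

gbm_with_es = lgb.train(params, lgb_train,
    num_boost_round=50,
    valid_sets=[lgb_valid],
    callbacks=[lgb.early_stopping(stopping_rounds=10)])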
Prediction
The trained model is stored in the gbm variable. We can ask gbm to predict the decision for a new instance. Similarly, we can feed the features of the training set instances and ask gbm to predict their decisions.
import numpy as np

predictions = gbm.predict(x_train)

for index, instance in dataset.iterrows():
    actual = instance[target_name]
    if is_regression == True:
        prediction = round(predictions[index])
    else: #classification
        prediction = np.argmax(predictions[index])
    print((index+1), ". actual= ", actual, ", prediction= ", prediction)
This code block makes the following predictions for the training data set. As seen, all instances are predicted correctly.
actual= 0 , prediction= 0
actual= 0 , prediction= 0
actual= 1 , prediction= 1
actual= 1 , prediction= 1
actual= 1 , prediction= 1
actual= 0 , prediction= 0
actual= 1 , prediction= 1
actual= 0 , prediction= 0
actual= 1 , prediction= 1
actual= 1 , prediction= 1
actual= 1 , prediction= 1
actual= 1 , prediction= 1
actual= 1 , prediction= 1
actual= 0 , prediction= 0
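We can also predict a single, hypothetical new instance. Using the same label encoding as above (Sunny = 2, Weak = 1), a sketch looks like this; the feature values themselves are made up for illustration.

new_instance = np.array([[2, 72, 90, 1]]) #Outlook=Sunny, Temp=72, Humidity=90, Wind=Weak
probabilities = gbm.predict(new_instance) #class probabilities for the multiclass objective
print(np.argmax(probabilities[0])) #0 means No, 1 means Yes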
Visualization
Luckily, LightGBM makes it possible to visualize the built decision trees and the importance of the data set features. This makes the decisions understandable. It requires the Graphviz graph visualization software to be installed.
Firstly, you need to run the pip install graphviz command to install the Python package.
Secondly, please install the Graphviz package for your OS from here. You can then add the installation directory to the path as illustrated below.
import matplotlib.pyplot as plt
import os

os.environ["PATH"] += os.pathsep + 'C:/Program Files (x86)/Graphviz2.38/bin'
Plotting the tree is an easy task now.
ax = lgb.plot_importance(gbm, max_num_features=10)
plt.show()

ax = lgb.plot_tree(gbm)
plt.show()
Decision rules can be extracted from the built tree easily.
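One programmatic way to get at the rules is the booster's dump_model function, which returns the boosted trees as a nested dictionary. A small sketch:

model_json = gbm.dump_model() #trees as a nested dictionary
first_tree = model_json['tree_info'][0]['tree_structure']
print(first_tree.get('split_feature'), first_tree.get('threshold')) #root split of the first tree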
Now, we know the feature importances for the data set.
Accuracy Report
We can monitor the accuracy score as coded below.
from sklearn.metrics import confusion_matrix, accuracy_score, roc_curve, auc

predictions_classes = []
for i in predictions:
    if is_regression == True:
        predictions_classes.append(round(i))
    else:
        predictions_classes.append(np.argmax(i))
predictions_classes = np.array(predictions_classes)

accuracy = accuracy_score(y_train, predictions_classes) * 100
print(accuracy, "%")
Moreover, if the problem is a classification problem, then precision, recall and the AUC score are often more informative metrics than raw accuracy.
if is_regression == False:
    actuals_onehot = pd.get_dummies(y_train).values
    #use the scores of the positive class (column 1) for the ROC curve
    false_positive_rate, recall, thresholds = roc_curve(actuals_onehot[:, 1], predictions[:, 1])
    roc_auc = auc(false_positive_rate, recall)
    print("AUC score ", roc_auc)
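Precision and recall themselves can be computed with scikit-learn as well, assuming the predicted classes stored in predictions_classes above:

from sklearn.metrics import precision_score, recall_score

if is_regression == False:
    print("precision: ", precision_score(y_train, predictions_classes))
    print("recall: ", recall_score(y_train, predictions_classes))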
LightGBM vs XGBoost
LightGBM and XGBoost are the two most popular gradient boosting frameworks.
Random Forest vs Gradient Boosting
LightGBM covers both the random forest and gradient boosting algorithms. We will discuss how they are similar and how they differ in the following video.
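For instance, LightGBM can be switched into random forest mode just by changing a few parameters; the 'rf' boosting type requires bagging to be enabled. A hypothetical configuration might look like this:

rf_params = {
    'boosting_type': 'rf', #random forest mode instead of gradient boosting
    'objective': 'multiclass',
    'num_class': num_of_classes,
    'bagging_fraction': 0.8, #rf mode requires bagging_fraction < 1
    'bagging_freq': 1,       #and bagging_freq > 0
    'feature_fraction': 0.8,
    'min_data': 1,
    'verbose': -1
}
rf_model = lgb.train(rf_params, lgb_train, num_boost_round=50)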
Feature importance
Decision trees are naturally interpretable and explainable machine learning algorithms, so LightGBM is explainable as well. Have you ever wondered how to explain decision trees?
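For example, the raw importance scores behind the plot above can be read directly from the booster; importance_type can be 'split' (how many times a feature is used) or 'gain' (the total gain it brings).

for name, score in zip(features, gbm.feature_importance(importance_type='gain')):
    print(name, score)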
Conclusion
So, we have discovered Microsoft's light gradient boosting machine framework, which has been adopted in many applied machine learning studies. Moreover, we have mentioned its pros and cons compared to its alternatives. Besides, we have developed a hello-world model with LightGBM. Finally, I pushed the source code of this blog post to my GitHub profile.
How can I print the accuracy report?
I've just added the accuracy report section to the post. Please stay tuned.
There are errors during execution. Some variables are not defined.
Are you running the code from GitHub?
Hi, I get the error ‘is_regression is not defined’. Running on Jupyter Notebook
Are you using the same notebook? https://github.com/serengil/decision-trees-for-ml/blob/master/python/LightGBM/LightGBM.ipynb
In the 3rd code block, that variable is initialized:
is_regression = False #set this to True to run regression
NameError Traceback (most recent call last)
in ()
      2     'task': 'train'
      3     , 'boosting_type': 'gbdt'
----> 4     , 'objective': 'regression' if is_regression == True else 'multiclass'
      5     , 'num_class': num_of_classes
      6     , 'metric': 'rmsle' if is_regression == True else 'multi_logloss'
NameError: name 'is_regression' is not defined
Hello,
Please look at this repo: https://github.com/serengil/decision-trees-for-ml/blob/master/python/LightGBM/LightGBM.ipynb
In the 3rd code block, that variable is set and initialized: is_regression = False
I got the same kind of error, too.
Set is_regression = False in a cell above. You should follow the GitHub repo of the post.