A Gentle Introduction to XGBoost for Applied Machine Learning

XGBoost was first introduced in 2016 by University of Washington researchers Tianqi Chen and Carlos Guestrin. Although XGBoost first appeared at an academic venue, its track record in winning Kaggle competitions is what made it far more popular in daily data science work than in academia. The name of the framework is an acronym for extreme gradient boosting. It applies several improvements over regular GBM, such as regularization to avoid overfitting, pruning and parallelism.

Figure: Creators of XGBoost

What is gradient boosting?

Gradient boosting builds decision trees sequentially. Each decision tree is built on the errors of the previous one. Finally, the sum of the predictions of all of those trees becomes the boosted prediction. That is why those sequential trees are called boosted trees.
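To make this concrete, here is a minimal sketch of the residual-fitting idea for regression with plain scikit-learn decision trees; the toy data and hyper-parameters are made up for illustration only.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# toy regression data (illustrative only)
X = np.array([[1], [2], [3], [4], [5]], dtype=float)
y = np.array([1.2, 1.9, 3.1, 3.9, 5.2])

learning_rate = 0.1
trees = []
prediction = np.zeros_like(y) # start from a zero prediction

for _ in range(50):
    residuals = y - prediction # errors of the current ensemble
    tree = DecisionTreeRegressor(max_depth=2)
    tree.fit(X, residuals) # each new tree fits the previous errors
    prediction += learning_rate * tree.predict(X) # the boosted prediction is the sum
    trees.append(tree)

print(prediction) # approaches y as more trees are added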



Road map

We will develop a hello world application with the XGBoost framework from scratch. The following video covers step-by-step explanations of the hands-on code.

Data set

We are going to work on the Golf data set. It stores golf playing decisions based on outlook, temperature, humidity and wind.

import pandas as pd
df = pd.read_csv("golf2.txt")
df.head()
Figure: Data set

The Outlook and Wind features are categorical whereas the Temperature and Humidity features are numerical. The following code block will find the categorical features of any data set.

target_name = 'Decision' # the target column; it is excluded from the feature list below

categorical_features = []
features_names = []

for col in df.columns:
    features_names.append(col)

    if df[col].dtype == 'object': # categorical features are stored as object dtype
        if col != target_name:
            categorical_features.append(col)

Encoding

XGBoost expects features and target values in numerical format. Here, we can apply label encoding if the feature stores ordinal information, such as weekday. On the other hand, if a nominal feature does not store ordinal information, then we should apply one-hot encoding, because the built decision tree might detect spurious patterns in the assigned sequence. For example, suppose I set 1 for Monday, 2 for Tuesday and 7 for Sunday. A decision rule such as "day is greater than or equal to 6" then means weekends, which is a meaningful split. On the other hand, if the feature stores randomly assigned ID information, a decision rule would need separate branches for every individual ID, so label encoding adds no value there.
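To illustrate the difference, the following snippet label encodes an ordinal weekday feature and one-hot encodes a nominal one with pandas; the column names and values are made up for this example.

import pandas as pd

sample = pd.DataFrame({
    "Weekday": ["Monday", "Tuesday", "Sunday"], # ordinal: order is meaningful
    "Outlook": ["Sunny", "Rain", "Overcast"] # nominal: order is meaningless
})

# label encoding keeps the natural order of weekdays
weekday_order = {"Monday": 1, "Tuesday": 2, "Wednesday": 3, "Thursday": 4,
                 "Friday": 5, "Saturday": 6, "Sunday": 7}
sample["Weekday"] = sample["Weekday"].map(weekday_order)

# one-hot encoding creates a binary column per class for the nominal feature
sample = pd.get_dummies(sample, columns=["Outlook"], prefix="Outlook")
print(sample)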

Label encoding

We should apply label encoding for the target label.

target_name = 'Decision'

unique_values = df[target_name].unique()
print(target_name, " column has ", unique_values, " classes")

for j in range(0, len(unique_values)):
    idx = df[df[target_name] == unique_values[j]].index
    df.loc[idx, target_name] = j
    print(unique_values[j], " is transformed to ", j)

df[target_name] = df[target_name].astype(int) # cast the object column to a numeric dtype

One hot encoding

One hot encoding should be applied to categorical features in the data set.

for column in categorical_features:
    unique_values = df[column].unique()
    one_hot = pd.get_dummies(unique_values, prefix=column)
    one_hot[column] = unique_values

    df = df.merge(one_hot, left_on = [column], right_on = [column], how = "left")
    df = df.drop(columns = [column])

The encoded data set is shown below.

Figure: Encoded data set

You might need to apply label encoding to a feature in your data set.

Consider that you must apply one-hot encoding to a feature having 1000 classes over millions of rows. Pandas runs on a single CPU core, so this might take hours. That is why I use XGBoost within H2O: it speeds pre-processing tasks up.

Modelling

We will model this problem as both classification and regression.

Classification

Cross entropy is the loss function in classification problems. Here, the number of classes is 2, so this is a binary classification problem and we can use the sigmoid (logistic) objective. On the other hand, if the number of classes were more than 2, then the objective should be softmax.

if len(df[target_name].unique()) == 2:
    objective = 'binary:logistic'
else:
    objective = 'multi:softmax'

eval_metric = 'logloss'

We would normally split the data set into train and validation sets, but the number of instances is very small in this case. That is why I use the same set for both training and validation.

import xgboost

params = {
    'learning_rate': 0.01,
    'max_depth': 5,
    'min_child_weight': 0.5,
    'n_estimators': 250,
    'objective': objective
}

model = xgboost.XGBClassifier(**params)

eval_set = [(df.drop(columns=[target_name]), df[target_name])]

model.fit(df.drop(columns=[target_name]), df[target_name]
    , eval_metric=eval_metric
    , eval_set=eval_set, early_stopping_rounds=5, verbose=True
)

We can make predictions once the GBM model is built. Here, we can find either the dominant class or the class probabilities. In other words, the built model can say whether an instance would be a Yes or a No.

predictions = model.predict(df.drop(columns=[target_name]))
actuals = df[target_name].values
Figure: Classification results

On the other hand, the built model can predict the probabilities of both Yes and No.

prediction_proba = model.predict_proba(df.drop(columns=[target_name]))
pd.DataFrame(prediction_proba, columns=['P_No', 'P_Yes'])
Figure: Prediction probabilities

Regression

The loss function will be squared error, evaluated with root mean squared error, if the problem is defined as regression.

objective = 'reg:squarederror'
eval_metric = 'rmse'

We will construct XGBRegressor instead of XGBClassifier.

params = {
    'learning_rate': 0.01,
    'max_depth': 5,
    'min_child_weight': 0.5,
    'n_estimators': 250,
    'seed': 17,
    'objective': objective
}

model = xgboost.XGBRegressor(**params)
model.fit(df.drop(columns=[target_name]), df[target_name])

You can think of the No decision as 0 and the Yes decision as 1. Our predictions will be in the [0, 1] range. We can assign each prediction to the closest class.

mae = 0

predictions = model.predict(df.drop(columns=[target_name]))
for i in range(0, len(predictions)):
    prediction = predictions[i]
    actual = actuals[i]

    error = abs(actual - prediction)
    mae += error
    print("Prediction is ", prediction, " whereas actual was ", actual, " (Error: ", error, ")")
Figure: Regression results
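To assign each regression output to the closest class as described above, a minimal sketch could threshold the predictions at 0.5 (the 0/1 codes come from the label encoding applied earlier):

# assign each regression output in [0, 1] to the closest class (0 or 1)
rounded_predictions = [1 if p >= 0.5 else 0 for p in predictions]

correct = sum(1 for p, a in zip(rounded_predictions, actuals) if p == a)
print("accuracy: ", correct / len(actuals))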

Feature importance

The built model stores feature importance values. This is important for explaining the model and making it interpretable.

from xgboost import plot_importance
plot_importance(model)
Figure: Feature importance

Have you ever wondered how feature importance is found in decision trees?

Parallelism

The regular GBM algorithm builds an initial decision tree and then builds another one based on the errors of the previous one. Building these sequential decision trees cannot be parallelized in theory. However, you can still build the branches of a single tree in parallel. Suppose that the most dominant feature is Outlook and it has Rain, Overcast and Sunny classes. These classes are independent, and you do not need the result of the Rain branch to create the Overcast branch. This is called level-wise tree growth, and this approach makes XGBoost fast.

Figure: Level-wise tree growth in XGBoost
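In practice this parallelism comes almost for free: the scikit-learn style wrapper exposes an n_jobs parameter, so a sketch like the one below should use all available CPU cores. The other parameter values are illustrative.

import xgboost

# n_jobs = -1 asks XGBoost to use all available CPU cores while building branches
parallel_model = xgboost.XGBClassifier(n_estimators=250, max_depth=5, n_jobs=-1)
parallel_model.fit(df.drop(columns=[target_name]), df[target_name])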

Tree Growth Approaches

LightGBM applies leaf-wise tree growth. If you expand the whole tree, the level-wise and leaf-wise approaches build the same trees. However, we mostly apply early stopping and pruning in decision trees. That is why the leaf-wise approach performs faster. This makes LightGBM almost 10 times faster than XGBoost on CPU.

Figure: Leaf-wise tree growth in LightGBM

Building trees in GPU

XGBoost becomes the faster one when a GPU is enabled.

Besides, if you have a GPU, then you just need to pass a parameter to run your XGBoost code on it. On the other hand, running LightGBM on a GPU is really problematic: you need to compile the framework from scratch.
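For instance, XGBoost 1.x releases accept tree_method = 'gpu_hist' (in 2.x this becomes tree_method = 'hist' together with device = 'cuda'). The sketch below assumes a CUDA-enabled XGBoost build; the other parameters are illustrative.

import xgboost

gpu_params = {
    'learning_rate': 0.01,
    'max_depth': 5,
    'n_estimators': 250,
    'tree_method': 'gpu_hist' # builds histograms on the GPU; needs a CUDA-enabled build
}

gpu_model = xgboost.XGBClassifier(**gpu_params)
gpu_model.fit(df.drop(columns=[target_name]), df[target_name])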

Robustness

Feature engineering will make up most of your work in daily data science studies. That is why trial and error is a large fraction of a data scientist's tasks. Here, LightGBM is really fast: you can evaluate 10 models with LightGBM in the time it takes to evaluate 1 model with XGBoost. However, XGBoost builds much more robust models. My experiments show that XGBoost builds models that are almost 2% more accurate than LightGBM's. One option is to run LightGBM for the early iterations and XGBoost for your final model.

Categorical Features

Both XGBoost and LightGBM expect you to transform your nominal features and target into numerical format. However, LightGBM offers categorical feature support: you just declare the categorical columns. On the other hand, in XGBoost we have to apply one-hot encoding to those categorical features ourselves.

# LightGBM categorical feature support
import lightgbm as lgb

lgb_train = lgb.Dataset(x_train, y_train, categorical_feature = ['Outlook'])
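For completeness, a minimal training sketch on that Dataset might look like the following; x_train, y_train and the parameter values are illustrative, and note that LightGBM expects declared categorical columns as integer codes or the pandas 'category' dtype.

lgb_params = {
    'objective': 'binary', # illustrative settings, not tuned
    'learning_rate': 0.01,
    'num_leaves': 31
}

# train directly on the Dataset built above; LightGBM splits on 'Outlook' natively
booster = lgb.train(lgb_params, lgb_train, num_boost_round=250)
lgb_predictions = booster.predict(x_train) # probabilities for the positive class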

LightGBM vs XGBoost

LightGBM and XGBoost are the most popular gradient boosting frameworks.

Random Forest vs Gradient Boosting

XGBoost covers both the random forest and gradient boosting algorithms. So, we will discuss how they are similar and how they differ in the following video.

Conclusion

We have walked through building a GBM model with XGBoost from scratch in this post. Besides, we have focused on the idea behind it and its pros and cons compared to LightGBM.

I pushed the source code of this post to GitHub.

