XGBoost was first introduced in 2016 by Tianqi Chen and Carlos Guestrin of the University of Washington. Even though XGBoost appeared in academia first, the fact that it powered more than half of the winning Kaggle solutions made it far more popular in daily data science work than in academic studies. The name of the framework is an acronym for extreme gradient boosting. It applies several improvements over regular GBM, such as regularization to avoid overfitting, pruning and parallelism.
What was gradient boosting?
Gradient boosting builds sequential decision trees. Each decision tree is built based on the previous tree's errors. Finally, the sum of the predictions of all of those trees becomes the boosted prediction. That is why those sequential trees are called boosted trees.
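To make the idea concrete, here is a minimal sketch of that loop with plain scikit-learn regression trees on toy data. It is for intuition only and is not the exact XGBoost procedure.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# toy data: 20 points with a noisy sinusoidal target
x = np.arange(20).reshape(-1, 1)
y = np.sin(x.ravel()) + np.random.normal(0, 0.1, 20)

learning_rate = 0.1
prediction = np.zeros(len(y))  # the ensemble starts from a constant (zero) prediction
trees = []

for _ in range(50):
    residuals = y - prediction                      # errors of the current ensemble
    tree = DecisionTreeRegressor(max_depth=2)
    tree.fit(x, residuals)                          # each tree is fit on the previous errors
    prediction += learning_rate * tree.predict(x)   # the boosted prediction is the running sum
    trees.append(tree)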
🙋♂️ You may consider enrolling in my top-rated machine learning course on Udemy
Road map
We will develop a hello-world application with the XGBoost framework from scratch. The following video covers step-by-step explanations of the hands-on code.
Data set
We’re going to work on the Golf data set. It states golf playing decisions based on outlook, temperature, humidity and wind.
import pandas as pd

df = pd.read_csv("golf2.txt")
df.head()
Outlook and Wind features are categorical whereas Temperature and Humidity features are numerical. The following code block will find the categorical features for any data set.
target_name = 'Decision' # the target column; defined here so the check below works

categorical_features = []
features_names = []
for col in df.columns:
    features_names.append(col)
    if df[col].dtype == 'object': #categorical features
        if col != target_name:
            categorical_features.append(col)
Encoding
XGBoost expects features and target values in numerical format. Herein, we can apply label encoding if the feature stores ordinal information, such as the day of the week. On the other hand, if a nominal feature does not store any ordering, then we should apply one hot encoding, because the built decision tree might otherwise detect spurious patterns in the assigned numbers. For example, suppose I set 1 to Monday, 2 to Tuesday and 7 to Sunday. A decision rule could then be "day is greater than or equal to 6", which means weekend. On the other hand, if the feature stores randomly assigned ID information, such a threshold rule is meaningless and the decision rule would need a separate branch for every individual ID.
Label encoding
We should apply label encoding for the target label.
target_name = 'Decision'
unique_values = df[target_name].unique()
print(target_name, " column has ", unique_values, " classes")

for j in range(0, len(unique_values)):
    idx = df[df[target_name] == unique_values[j]].index
    df.loc[idx, target_name] = j
    print(unique_values[j], " is transformed to ", (j))

df[target_name] = df[target_name].astype(int) # the column stays object dtype after replacement; cast it to int
One hot encoding
One hot encoding should be applied to categorical features in the data set.
for column in categorical_features:
    unique_values = df[column].unique()
    one_hot = pd.get_dummies(unique_values, prefix=column)
    one_hot[column] = unique_values
    df = df.merge(one_hot, left_on=[column], right_on=[column], how="left")
    df = df.drop(columns=[column])
The encoded data set is shown below.
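As a side note, pandas can produce the same one-hot encoding in a single call; the line below is an alternative to the loop above, not something to run in addition to it.

# one-hot encode every categorical feature in one call
df = pd.get_dummies(df, columns=categorical_features)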
You might need to apply label encoding to a feature in your data set.
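For instance, here is a hypothetical ordinal Day column (it is not part of the golf data set) encoded with an explicit mapping so that the order of the days is preserved.

import pandas as pd

# hypothetical ordinal feature: day of the week
days = pd.DataFrame({'Day': ['Monday', 'Thursday', 'Saturday', 'Sunday']})
day_mapping = {'Monday': 1, 'Tuesday': 2, 'Wednesday': 3, 'Thursday': 4,
               'Friday': 5, 'Saturday': 6, 'Sunday': 7}
days['Day'] = days['Day'].map(day_mapping)
# a decision rule such as Day >= 6 now means weekend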
Consider that you must apply one hot encoding to a feature having 1000 classes over millions of rows. Pandas runs on a single CPU core, so this might take hours. That's why I use XGBoost within H2O: it speeds the pre-processing tasks up.
Modelling
We will model this problem as both classification and regression.
Classification
Cross entropy is the loss function in classification problems. Herein, the number of classes is 2, so this is a binary classification problem and it can use the sigmoid (logistic) function. On the other hand, if the number of classes were more than 2, then the loss function would have to be softmax.
if len(df[target_name].unique()) == 2:
    objective = 'binary:logistic'
else:
    objective = 'multi:softmax'

eval_metric = 'logloss'
We would normally split the data set into train and validation sets, but the number of instances is very small in this case. That's why I use the same set for both training and validation.
import xgboost

params = {
    'learning_rate': 0.01,
    'max_depth': 5,
    'min_child_weight': 0.5,
    'n_estimators': 250,
    'objective': objective
}

model = xgboost.XGBClassifier(**params)

eval_set = [(df.drop(columns=[target_name]), df[target_name])]
model.fit(df.drop(columns=[target_name]), df[target_name],
    eval_metric=eval_metric,
    eval_set=eval_set,
    early_stopping_rounds=5,
    verbose=True)
We can make predictions once the GBM model is built. Herein, we can find either the dominant prediction or the class probabilities. I mean that the built model can say whether an instance would be Yes or No.
predictions = model.predict(df.drop(columns=[target_name]))
actuals = df[target_name].values
On the other hand, the built model can also predict the probabilities of both Yes and No.
prediction_proba = model.predict_proba(df.drop(columns=[target_name]))
pd.DataFrame(prediction_proba, columns=['P_No', 'P_Yes'])
Regression
The loss function will be squared error if the problem is defined as regression, and we will monitor the root mean square error as the evaluation metric.
objective = 'reg:squarederror'
eval_metric = 'rmse'
We will construct XGBRegressor instead of XGBClassifier.
params = {
    'learning_rate': 0.01,
    'max_depth': 5,
    'min_child_weight': 0.5,
    'n_estimators': 250,
    'seed': 17,
    'objective': objective
}

model = xgboost.XGBRegressor(**params)
model.fit(df.drop(columns=[target_name]), df[target_name])
You can think of the No decision as 0 and the Yes decision as 1. Our predictions will then be on the [0, 1] scale, and we can assign each prediction to the closest class.
predictions = model.predict(df.drop(columns=[target_name]))

mae = 0
for i in range(0, len(predictions)):
    prediction = predictions[i]
    actual = actuals[i]
    error = abs(actual - prediction)
    mae += error
    print("Prediction is ", prediction, " whereas actual was ", actual, " (Error: ", error, ")")

mae = mae / len(predictions)
print("MAE: ", mae)
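To actually assign each regression output to the closest class, as mentioned above, rounding at 0.5 is enough.

# round each regression output to the nearest class (0 = No, 1 = Yes)
predicted_classes = [1 if p >= 0.5 else 0 for p in predictions]
print(predicted_classes)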
Feature importance
The built model stores feature importance values. These are important for explaining the model and making it interpretable.
from xgboost import plot_importance
import matplotlib.pyplot as plt

plot_importance(model)
plt.show()
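If you prefer the raw numbers to a plot, the scikit-learn wrapper also exposes the importance values directly; the following is a small sketch of turning them into a sorted table.

# feature importance values as a sorted table
importances = pd.DataFrame({
    'feature': df.drop(columns=[target_name]).columns,
    'importance': model.feature_importances_
}).sort_values(by='importance', ascending=False)
print(importances)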
Have you ever wondered how feature importance is found in decision trees?
Parallelism
The regular GBM algorithm is based on building an initial decision tree and then building another one based on the error of the previous one. Herein, building the sequential trees themselves cannot be parallelized in theory. However, you can still build the branches of a single tree in parallel. Suppose that the most dominant feature is Outlook and it has Rain, Overcast and Sunny classes. These classes are independent, and you don't need the result of the rain branch to create the overcast branch. This is called level-wise tree growth, and this approach makes XGBoost fast.
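In the scikit-learn interface of XGBoost, this per-tree parallelism is controlled with the n_jobs parameter; the snippet below is just a sketch with placeholder hyper-parameters.

# n_jobs = -1 uses all available CPU cores while building each tree
model = xgboost.XGBClassifier(n_estimators=250, max_depth=5, n_jobs=-1)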
Tree Growth Approaches
LightGBM applies leaf-wise tree growth. If you expand all trees fully, the level-wise and leaf-wise approaches build the same trees. However, we mostly apply early stopping and pruning to decision trees. That's why the leaf-wise approach performs faster. This makes LightGBM almost 10 times faster than XGBoost on CPU.
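For reference, XGBoost can also mimic leaf-wise growth when its histogram tree method is used; the parameters below are a sketch, and num_leaves is the corresponding LightGBM setting.

# XGBoost: loss-guided (leaf-wise) growth is available with the histogram method
model = xgboost.XGBClassifier(tree_method='hist', grow_policy='lossguide', max_leaves=31)

# LightGBM grows leaf-wise by default; num_leaves bounds the size of each tree
# model = lgb.LGBMClassifier(num_leaves=31)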
Building trees in GPU
XGBoost becomes the faster one when a GPU is enabled.
Besides, if you have a GPU, then you just need to pass a parameter to run your XGBoost code on it. On the other hand, running LightGBM code on a GPU is really problematic: you need to compile the framework from scratch.
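For example, switching XGBoost to GPU training is a single parameter; the exact spelling depends on your XGBoost version, so treat the lines below as a sketch.

# XGBoost 1.x style: run histogram-based tree building on the GPU
model = xgboost.XGBClassifier(tree_method='gpu_hist')

# XGBoost 2.x style
# model = xgboost.XGBClassifier(tree_method='hist', device='cuda')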
Robustness
Feature engineering will be most of your work in daily data science studies. That's why trial and error makes up a large fraction of a data scientist's tasks. Herein, LightGBM is really fast: you can evaluate 10 models with LightGBM in the time it takes to evaluate 1 model with XGBoost. However, XGBoost builds much more robust models. My experiments show that XGBoost builds almost 2% more accurate models than LightGBM. One option is to run LightGBM for the early steps and XGBoost for your final model.
Categorical Features
Both XGBoost and LightGBM expect you to transform your nominal features and target to numerical values. However, LightGBM offers native categorical feature support, whereas in XGBoost we have to apply one-hot encoding to truly categorical features.
#LightGBM categorical feature support
import lightgbm as lgb

lgb_train = lgb.Dataset(x_train, y_train, categorical_feature=['Outlook'])
LightGBM vs XGBoost
LightGBM and XGBoost are the most popular gradient boosting frameworks.
Random Forest vs Gradient Boosting
XGBoost covers both the random forest and gradient boosting algorithms. So, we will discuss how they are similar and how they differ in the following video.
Conclusion
We've mentioned how to build a GBM model with XGBoost from scratch in this post. Besides, we've focused on the idea behind it and its pros and cons compared to LightGBM.
I pushed the source code of this post to GitHub.
Support this blog if you like it!