XGBoost triggered the rise of tree-based models in the machine learning world. It earned its reputation with robust models: the models it builds usually gain almost 2% more accuracy. On the other hand, it is a fact that XGBoost is almost 10 times slower than LightGBM. Speed means a lot in a data challenge, because you have to try several models during the feature engineering step. That's why most data scientists prefer speed over accuracy. Herein, h2o.ai offers a faster XGBoost implementation covering both model building and pre-processing steps.
What was gradient boosting?
XGBoost stands for extreme gradient boosting. Do you remember what gradient boosting is?
Gradient boosting builds sequential decision trees. Each tree is built on the previous tree's error. Finally, the sum of the predictions of all those trees becomes the boosted prediction. Besides, these sequential trees are called boosted trees.
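As a toy illustration, the following sketch shows the idea with plain scikit-learn regression trees; the depth, learning rate and number of rounds are arbitrary values picked for demonstration, not the XGBoost internals.

#Toy gradient boosting sketch (illustration only)
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost(x, y, rounds=10, learning_rate=0.1):
    trees = []
    prediction = np.zeros(len(y))
    for _ in range(rounds):
        residual = y - prediction  #error of the trees built so far
        tree = DecisionTreeRegressor(max_depth=3).fit(x, residual)
        prediction = prediction + learning_rate * tree.predict(x)  #boosted prediction = sum of trees
        trees.append(tree)
    return trees, prediction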
Data set
You can see the difference between regular XGBoost and H2O XGBoost on large-scale data. Recently, I joined the ASHRAE Energy Prediction competition on Kaggle. It has a challenging data set: the train set consists of 20M rows whereas the submission set consists of 40M rows. We will work on this data set to compare the XGBoost distributions.
There are 3 different files in the data set. We will merge them as illustrated below.
import pandas as pd
import numpy as np

train = pd.read_csv('train.csv')
metadata = pd.read_csv('building_metadata.csv')
weather = pd.read_csv('weather_train.csv')

train = train.merge(metadata, on="building_id", how="left")
train = train.merge(weather, on=["site_id", "timestamp"], how="left")
We could also import these files into h2o and merge them there. Remember that an h2o frame runs on multiple CPU cores, which makes the merge operation fast. However, merging two frames on columns with different data types causes a problem: train and weather should be merged on the float site_id and timestamp columns, whose types do not match here. That's why I read and merged the data files with pandas and converted the result to an h2o frame later.
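For reference, a purely h2o-based version of the merge might look roughly like the sketch below; it assumes the join columns already share the same type, which was exactly the problem in my case.

#Hypothetical h2o-only merge (works only if the join columns have matching types)
hf_train = h2o.import_file('train.csv')
hf_metadata = h2o.import_file('building_metadata.csv')
hf_weather = h2o.import_file('weather_train.csv')

hf_train = hf_train.merge(hf_metadata, all_x=True, by_x=['building_id'], by_y=['building_id'])
hf_train = hf_train.merge(hf_weather, all_x=True, by_x=['site_id', 'timestamp'], by_y=['site_id', 'timestamp'])

Since the type mismatch blocks this route here, the pandas result is converted instead.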
hf_train = h2o.H2OFrame(train)  #requires import h2o and h2o.init() beforehand
The 20M-row train set is converted to an h2o frame in less than 4 minutes in my experiments. If we used regular XGBoost, we would not spend any time converting the data frame.
Target
The meter_reading column is the target we would like to predict. It is a numerical column on the scale [0, 21904700]. This is a very large scale. We should apply the log1p function to shrink the target. This applies the ln(1 + x) function. Notice that ln(0) is undefined (it tends to negative infinity) whereas ln(1) is 0. That's why adding 1 to the target smooths the set, because the minimum target value was 0.
train['meter_reading'] = np.log1p(train['meter_reading']).astype(np.float32)
Then, we can restore the target and the predictions with the expm1 function. This calculates e to the power of x, minus 1. You should apply this approach to all regression problems with a large target scale.
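For instance, restoring some hypothetical log-space predictions back to the original scale looks like this:

#predictions here is a hypothetical array of log-space model outputs
predictions = np.array([0.0, 2.5, 7.1])
restored = np.expm1(predictions)  #e**x - 1, the inverse of log1p
#np.expm1(np.log1p(y)) gives y back, so restored values are on the meter_reading scale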
Feature engineering
We can expand the timestamp column into time-based features. This post does not aim to find the best solution to this competition; we just want to compare the regular and h2o-based XGBoost distributions. That's why I add only some common features.
def expandFeatures(df):
    #time-based features from the timestamp column
    df['year'] = df['timestamp'].year()
    df['month'] = df['timestamp'].month()
    df['day'] = df['timestamp'].day()
    df['hour'] = df['timestamp'].hour()
    df['weekday'] = df['timestamp'].dayOfWeek()
    df['square_feet'] = df['square_feet'].log()
    #building age instead of construction year
    df['year_built'] = df['year_built'].asnumeric()
    df['year_built'] = df[df['year_built'] > 0]['year_built'].max() - df['year_built']
    #shift sea level pressure so that its minimum becomes 0
    df['sea_level_pressure'] = df['sea_level_pressure'] - df[df['sea_level_pressure'] >= 0]['sea_level_pressure'].min()
    df = df.drop('timestamp')
    return df

hf_train = expandFeatures(hf_train)
The data frame is expanded in 10 seconds as an h2o frame whereas it takes 24 seconds as a pandas data frame.
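For comparison, a pandas version of the same expansion might look roughly like the sketch below; this is an approximation, assuming pandas and numpy are imported and the timestamp column is still a string.

#Rough pandas counterpart of expandFeatures (a sketch, not necessarily the exact timed code)
def expandFeaturesPandas(df):
    df['timestamp'] = pd.to_datetime(df['timestamp'])
    df['year'] = df['timestamp'].dt.year
    df['month'] = df['timestamp'].dt.month
    df['day'] = df['timestamp'].dt.day
    df['hour'] = df['timestamp'].dt.hour
    df['weekday'] = df['timestamp'].dt.weekday
    df['square_feet'] = np.log(df['square_feet'])
    df['year_built'] = df[df['year_built'] > 0]['year_built'].max() - df['year_built']
    df['sea_level_pressure'] = df['sea_level_pressure'] - df[df['sea_level_pressure'] >= 0]['sea_level_pressure'].min()
    return df.drop(columns=['timestamp'])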
Nominal features
Similar to LightGBM, XGBoost expects you to transform string features into numerical ones. Passing string features to XGBoost raises the following exception message.
DataFrame.dtypes for data must be int, float or bool. Did not expect the data types in fields primary_use
You have to apply label encoding for nominal features as illustrated below.
#For Regular XGBoost
feature_classes = train['primary_use'].unique()
for j in range(len(feature_classes)):
    feature_class = feature_classes[j]
    train['primary_use'] = train['primary_use'].replace(feature_class, str(j))
train = train.astype({'primary_use': 'int32'})
On the other hand, nominal features are fine in h2o. String columns are already loaded as the enum type. You can check this with the describe() command. You don't have to do anything in h2o.
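For example, you can confirm the column type as follows; primary_use is expected to show up as enum.

print(hf_train.types['primary_use'])  #string columns are loaded as 'enum'
hf_train['primary_use'].describe()    #also reports the column type and summary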
Pandas applies label encoding in 16 seconds, whereas no label encoding is needed at all in h2o.
Categorical features
Some features (no matter whether numerical or nominal) might be categorical. Even though LightGBM has categorical feature support, XGBoost does not. You just need to pass the categorical feature names when creating the data set in LightGBM. On the other hand, you have to apply one-hot encoding to categorical features in XGBoost.
#For Regular XGBoost
categorical_features = ['meter', 'primary_use', 'site_id'] #, 'building_id']
for column in categorical_features:
    unique_values = train[column].unique()
    one_hot = pd.get_dummies(unique_values, prefix=column)
    one_hot[column] = unique_values
    train = train.merge(one_hot, left_on=[column], right_on=[column], how="left")
    train = train.drop(columns=[column])
You can specify which columns are categorical in H2O as shown below.
#For XGBoost within H2O
categorical_features = ['meter', 'primary_use', 'site_id'] #, 'building_id']
for key, col_type in hf_train.types.items():
    if key in categorical_features:
        hf_train[key] = hf_train[key].asfactor()
    else:
        hf_train[key] = hf_train[key].asnumeric()
Here, building_id has thousands of categories. Applying one-hot encoding to it makes regular XGBoost run far too long on CPU. However, I did apply one-hot encoding to building_id in my GPU tests.
This step lasts 311 seconds for regular XGBoost when building_id is included, whereas it lasts 40 seconds when building_id is discarded. On the other hand, h2o completes it in milliseconds in both cases.
Train test split
We can split the data set into train, validation and test sets with scikit-learn for regular XGBoost.
#For Regular XGBoost
from sklearn.model_selection import train_test_split
x_train, x_test = train_test_split(train, test_size=0.30, random_state=17)
x_test, x_validation = train_test_split(x_test, test_size=0.50, random_state=17)
H2O offers its own split function.
#For XGBoost within H2O
train, test, validation = hf_train.split_frame(ratios=[0.70, 0.15], seed=17)
Splitting lasts 18 seconds for regular XGBoost if one-hot encoding is not applied to building_id, whereas it lasts 483 seconds if it is. On the other hand, h2o completes the split in 5 seconds in both cases.
Training
We will build boosted trees with the same configuration.
#Regular XGBoost
import xgboost
model = xgboost.XGBRegressor(
    n_estimators=250
    , max_depth=10
    , learning_rate=0.01
    , seed=4241
    , nthread=5
    #, tree_method='gpu_hist'
    #, gpu_id=0
)

eval_set = [(x_validation.drop(columns=[target_label]), x_validation[target_label])]

model.fit(x_train.drop(columns=[target_label]), x_train[target_label]
    , eval_metric="rmse", eval_set=eval_set, early_stopping_rounds=50, verbose=True)
In H2O, training is handled similarly to regular XGBoost.
#For XGBoost within H2O
from h2o.estimators.xgboost import H2OXGBoostEstimator
model = H2OXGBoostEstimator(
    ntrees=250
    , max_depth=10
    , learn_rate=0.01
    , seed=4241
    , stopping_rounds=50
    , stopping_metric="RMSE"
)

model.train(x=feature_names, y=target_label
    , training_frame=train, validation_frame=validation)
Herein, h2o runs on GPU by default, whereas you have to pass the tree_method and gpu_id parameters to regular XGBoost.
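For the record, forcing the device explicitly would look roughly like this; on the h2o side, backend and gpu_id are the parameter names I assume for H2OXGBoostEstimator, so treat that part as a sketch.

#Regular XGBoost on GPU
model = xgboost.XGBRegressor(n_estimators=250, max_depth=10, learning_rate=0.01,
                             tree_method='gpu_hist', gpu_id=0)

#XGBoost within H2O, forcing the backend (assumed parameter names)
model = H2OXGBoostEstimator(ntrees=250, max_depth=10, learn_rate=0.01,
                            backend='gpu', gpu_id=0)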
Training lasts 10412 seconds (2.9 hours) in h2o when GPU is disabled, whereas it lasts 16389 seconds (4.5 hours) in regular XGBoost on CPU. This means that XGBoost within h2o is roughly 1.5 times faster than regular XGBoost.
Besides, training lasts 204 seconds in h2o when GPU is enabled, whereas regular XGBoost cannot handle the memory when GPU is enabled and this causes the kernel to die. That is almost 80 times faster than the regular XGBoost CPU training time.
Loss
The root mean square error (RMSE) on the validation data is 1.24 in h2o whereas it is 1.32 in regular XGBoost. Similarly, the mean squared error (MSE) on the test data is 1.55 in h2o whereas it is 1.76 in regular XGBoost. We can say that h2o offers a faster and more robust model than regular XGBoost.
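These figures come from the standard evaluation calls; a sketch of how they can be read is shown below, where test is the held-out h2o frame and x_test the pandas one.

#For XGBoost within H2O
print(model.rmse(valid=True))               #validation RMSE
print(model.model_performance(test).mse())  #test MSE

#Regular XGBoost
from sklearn.metrics import mean_squared_error
test_predictions = model.predict(x_test.drop(columns=[target_label]))
print(mean_squared_error(x_test[target_label], test_predictions))  #test MSE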
Feature importance
You might think that h2o does not apply one-hot encoding to the data set and that this explains its speed. We can see that one-hot encoding is indeed applied when we plot the feature importance values.
#Regular XGBoost
from xgboost import plot_importance
plot_importance(model, max_num_features=15, show_values=True)

#For XGBoost within H2O
variables = model._model_json['output']['variable_importances']['variable']
scaled_importance = model._model_json['output']['variable_importances']['scaled_importance']
model.varimp_plot()
As seen, the meter and site_id columns have some postfixes. Besides, the variables list stores the actual feature names.
Have you ever wondered how feature importance is found in decision trees?
Total run time
GPU-enabled XGBoost within H2O completed in 554 seconds (9 minutes) whereas its CPU implementation (limited to 5 CPU cores) completed in 10743 seconds (about 3 hours).
On the other hand, regular XGBoost on CPU lasts 16932 seconds (4.7 hours) and it dies if GPU is enabled.
To sum up, the h2o distribution is 1.6 times faster than regular XGBoost on CPU. Besides, for a data set this large, building a model on a GPU is only possible with h2o.
Random Forest vs Gradient Boosting
XGBoost covers both the random forest and gradient boosting algorithms. So, we will discuss how they are similar and how they differ in the following video.
Conclusion
I believe that h2o comes with two significant advantages. Firstly, you can skip most of the boring pre-processing steps with the H2O implementation. In this way, you can focus on the satisfying part of a data science study. Secondly, even when you do need to manipulate the data, it performs much faster. Still, its training time on CPU is much slower than LightGBM's.
So, the H2O implementation is really amazing. That's why you should build your XGBoost models within H2O!
I pushed the source code of this study as notebooks to my personal GitHub profile. You can find the XGBoost within H2O GPU, XGBoost within H2O CPU and regular XGBoost CPU notebooks there. You can compare the running time of each block and of the entire code in these notebooks.
Finally, you can support this study by starring the repository.
Support this blog if you like it!