A Gentle Introduction to H2O GBM

GBM dominates tabular-data Kaggle challenges, so putting it in the toolbox is a must for a data scientist. XGBoost, LightGBM and CatBoost are the most common framework candidates. Herein, h2o covers both XGBoost and its own GBM implementation. We have already mentioned that the h2o frame provides a significant advantage over regular Pandas for large-scale data sets. Consuming a GBM algorithm inside the h2o platform is therefore reasonable because of its powerful components.


What is GBM?

Gradient boosting machines build decision trees sequentially. Each tree is fit on the errors of the trees that came before it, and the final boosted prediction is the sum of the predictions of all those trees. These sequential trees are called boosted trees.
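
To make the idea concrete, here is a minimal sketch of squared-error gradient boosting built from scikit-learn regression trees. This is only an illustration of the concept, not h2o's implementation, and the function names are made up.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_boosted_trees(X, y, n_trees=100, learning_rate=0.1):
    # start from a constant prediction, then let every new tree fix the residuals
    base = float(np.mean(y))
    prediction = np.full(len(y), base)
    trees = []
    for _ in range(n_trees):
        residuals = y - prediction                     # errors of the current ensemble
        tree = DecisionTreeRegressor(max_depth=3).fit(X, residuals)
        prediction += learning_rate * tree.predict(X)  # boost the prediction
        trees.append(tree)
    return base, trees

def predict_boosted(base, trees, X, learning_rate=0.1):
    # boosted prediction = base + sum of all trees' scaled contributions
    return base + learning_rate * sum(tree.predict(X) for tree in trees)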


🙋‍♂️ You may consider enrolling in my top-rated machine learning course on Udemy

Decision Trees for Machine Learning

Problem

We will build a GBM model for the kinship problem. The data set contains two different similarity scores for three different face recognition models as features, and a boolean is-related flag as the label. The data set can be found here.

Getting the h2o framework up

Importing a Python library is usually enough to get a framework up. Herein, h2o additionally needs to be initialized, because it starts a Java server in the background.

import h2o
h2o.init()

Sometimes, initialization requires limiting the maximum memory and the number of threads. This is very common in Docker containers. You should consider your environment's resources: the number of threads is related to your CPU core count, and the maximum memory size is related to the available memory. You can check the current values of these resources as shown below, and then limit memory and threads in the initialization step.

import multiprocessing
print("CPU: ",multiprocessing.cpu_count())

import psutil
print("Memory: ",psutil.virtual_memory())

h2o.init(ip="127.0.0.1", max_mem_size_GB = 100, nthreads = 5)

Your h2o server should come up at the http://localhost:54321 address, and the initialization command prints the Java-related log file paths. You should check these files if you run into trouble, because the Python API might print unrelated messages.
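
If you are unsure whether the server is healthy, the cluster object can be inspected as well; a small optional check:

h2o.cluster().show_status()   # h2o version, total memory, allowed cores, health
# h2o.cluster().shutdown()    # stops the background java server when you are done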

Data manipulation

Consider common GBM frameworks such as XGBoost or LightGBM. With those, you would first apply data manipulation with Pandas. Pandas comes with single-core support only; it is like having a Porsche but being stuck in second gear. Herein, the h2o frame is a multi-core data manipulation tool equivalent to Pandas.

Positive and negative instances are stored in different files. We will read them separately and merge them into a single data frame.

hf_positive = h2o.import_file('dataset/train_true_positive_features.csv')
hf_negative = h2o.import_file('dataset/train_true_negative_features.csv')
hf = hf_positive.rbind(hf_negative)

The data set contains some additional features, but we do not actually need them. We can discard unnecessary features in the h2o frame just as we would in Pandas.

hf = hf[['vgg_cosine', 'vgg_euclidean_l2'
 , 'facenet_cosine', 'facenet_euclidean_l2'
 , 'openface_cosine', 'openface_euclidean_l2'
 , 'is_related']]
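
A quick look at the resulting frame helps to verify the selection. This optional check uses standard h2o frame methods.

hf.head(5)      # first rows, similar to pandas
hf.describe()   # column types, min / max / mean and missing counts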

Big Data

Moreover, real-world data sets will not resemble Kaggle data sets. You often have to merge several very large data sources to obtain your master data; I have seen terabytes of master data several times. This limits you to CPU cores, because no GPU has that much memory. Furthermore, merging this volume of data cannot be handled with a single-core library like Pandas. PySpark or Dask can replace Pandas because they have multi-core support, but the right solution for larger data sets is Scala Spark: you can use even 1500 cores of a distributed system with it, whereas PySpark and Dask are convenient for workloads that can be handled by 70-80 cores at most.
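
As a rough illustration with hypothetical file and column names, a multi-core merge in Dask would look like the sketch below; a Scala Spark job follows the same idea on a much larger cluster.

# Hypothetical sources; Dask keeps a Pandas-like API but plans the work lazily
# and executes it on all available cores (or a distributed cluster).
import dask.dataframe as dd

transactions = dd.read_csv("sources/transactions_*.csv")              # many partitions
customers = dd.read_csv("sources/customers.csv")
master = transactions.merge(customers, on="customer_id", how="left")  # lazy plan
master.to_parquet("master_data/")                                     # triggers parallel execution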

Sparkling Water enables you to run Scala Spark code with h2o; also, PySparkling is the PySpark equivalent.

We will skip the Spark-related operations in this post because we are working on a small data set.

Train test split

Besides, we split the data set into train, test and validation sets to avoid over-fitting. Scikit-learn is the most common way to split data, but herein you no longer need scikit-learn to split your data frame or evaluate the model's performance.

#70% train, 15% test, 15% validation
train, test, validation = hf.split_frame(ratios=[0.70, 0.15], seed=17)
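
A quick sanity check on the split sizes is optional but useful:

print(train.nrows, test.nrows, validation.nrows)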

Regression or classification

The target label is numerical, so the model would be a regressor by default and the evaluation metrics would be regression related. To convert the problem to classification, we need to transform the label's type to enum. Note that, since train, test and validation were derived from hf, this conversion should be applied before calling split_frame (or the frame should be split again afterwards) so that the split frames carry the enum label.

hf['is_related'] = hf['is_related'].asfactor()
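
You can verify the conversion and inspect the class balance with standard h2o frame methods:

print(hf['is_related'].levels())   # category levels after asfactor()
hf['is_related'].table()           # instance counts per class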

Modelling

GBM frameworks are responsible for modelling only. h2o supports several machine learning algorithms such as linear models, tree-based models including random forest and gradient boosting, and deep learning. Besides, it covers both XGBoost and its own GBM implementation. We will build an h2o GBM model.

from h2o.estimators.gbm import H2OGradientBoostingEstimator

GBM models are very successful but dangerous learners: they tend to over-fit. We should use early stopping while building trees. The competition evaluates submissions with AUC scores, so we specify the stopping metric as AUC.

model = H2OGradientBoostingEstimator(
 ntrees = 1000
 , learn_rate = 0.01
 , stopping_rounds = 50
 , stopping_metric = "AUC"
)

We will set the validation frame to the test h2o frame. Building will be terminated if there is no improvement on the test set for 50 scoring rounds.

model.train(x = hf.names[0:-1], y = hf.names[-1]
 , training_frame = train
 , validation_frame = test
 #, verbose = True
 , model_id = "GBM_Kinship"
)

Model

The trained h2o model is the most comprehensive object I have ever seen in my machine learning studies. You should inspect the model object when training is over.

model

The following metrics are stored in the model object for both the training and validation frames.

ModelMetricsBinomial: gbm
** Reported on validation data. **

MSE: 0.19726966691087394
RMSE: 0.444150500293396
LogLoss: 0.5786944223627535
Mean Per-Class Error: 0.33191619190721755
AUC: 0.727593424555288
pr_auc: 0.610124995970803
Gini: 0.45518684911057594
[Figure: Scoring history]
[Figure: Confusion matrix]
[Figure: Feature importance]
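
The same information can also be pulled out programmatically with standard h2o model methods, for example:

model.scoring_history()             # per-tree training and validation metrics
model.confusion_matrix(valid=True)  # confusion matrix on the validation frame
model.varimp(use_pandas=True)       # feature importances as a pandas DataFrame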

Evaluation

With regular frameworks, we would also use scikit-learn to compute accuracy metrics (confusion matrix, AUC score or accuracy). Herein, h2o offers its own evaluation functionality.

We fed the test data as the validation frame in the training step. We also have our own validation set that was not fed into training at all. We will evaluate the built model on this validation data.

val_perf = model.model_performance(validation)
print(val_perf.auc())
print(val_perf.accuracy())
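
The performance object exposes many more metrics as well; a few examples:

print(val_perf.confusion_matrix())  # confusion matrix at the max-F1 threshold
print(val_perf.F1())                # best F1 score and its threshold
print(val_perf.logloss())           # log loss on the validation data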

The GBM model got a 72.22% AUC score and a 69.40% accuracy score. That is very satisfactory.

Predictions

The validation object is an h2o frame, so we can pass it to the predict method directly.

predictions = model.predict(test_data = validation)
predictions.tail()

Predictions is a 3-column frame. The first column is the predicted class and the other two are class probabilities. For example, the first instance is predicted as 46.28% unrelated and 53.71% related.

[Figure: Predictions]
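
If you prefer to continue in pandas, the prediction frame can be converted; a small sketch:

predictions_df = predictions.as_data_frame()  # h2o frame to pandas DataFrame
print(predictions_df.head())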

Saving and restoring the model

We might need to store the built model and restore it later. The force argument of the save model function enables overwriting the model file if it already exists.

saved_model = h2o.save_model(model, path = "", force=True)

restored_model = h2o.load_model(saved_model)
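
For production deployments, you might also export the model as a MOJO, which can be scored from Java without a running h2o cluster. This is a sketch; the target path is up to you.

mojo_path = model.download_mojo(path=".", get_genmodel_jar=True)
print(mojo_path)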

The last few words

Scalability is probably the most important problem in production pipelines. Herein, h2o offers almost any functionality a data scientist may need, which is a huge advantage for production-oriented projects.

It is obvious that this platform is developed by data scientists with a developer background. I believe this because it handles all the problems I had run into previously.

The source code of this blog post is pushed to GitHub.

