A Gentle Introduction to H2O AutoML

People are often uneasy about AI because they fear losing their jobs. The jobs of AI practitioners are not immune either, because AutoML tools have started to dominate Kaggle competitions. Building models sits at the core of machine learning work, and AutoML can often do it better than we can.

Tweet by Erin LeDell about AutoML

Previously, I covered Autokeras, an AutoML tool for image data. It builds convolutional neural networks but offers no solution for tabular data. H2O AutoML, on the other hand, is very successful on tabular data. As you might guess from its name, it is developed by H2O.ai, a leading machine learning company. I attended H2O World ’19 and saw this AutoML tool there for the first time. Today, we will walk through the H2O AutoML module on a custom use case.


Vlog

You can either watch the following webinar or follow this blog post. Both cover the same steps.

Supported Algorithms

Notice that decision tree based algorithms tend to be much more successful on structured data sets than neural networks. Accordingly, H2O AutoML covers the following algorithms:

  • Generalized Linear Model (GLM)
  • Distributed Random Forest (DRF)
  • XGBoost
  • H2O's own Gradient Boosting Machine (GBM) implementation
  • Deep Learning (fully connected neural networks, not CNNs)

In addition, it builds ensembles on top of these algorithms, and these ensembles are often very successful.
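If you want the search to cover only some of these families, a minimal sketch could look like the one below. It assumes the exclude_algos argument of H2OAutoML and the algorithm names as they are spelled in the H2O documentation. The helper names (algos_to_exclude, build_tree_only_automl) are made up for illustration and are not part of H2O.

```python
# Algorithm family names as H2O AutoML spells them (assumed from the H2O docs).
ALL_ALGOS = ["GLM", "DRF", "XGBoost", "GBM", "DeepLearning", "StackedEnsemble"]
TREE_BASED = {"DRF", "XGBoost", "GBM"}


def algos_to_exclude(keep):
    """Return every known family that is NOT in `keep`, in a stable order."""
    return [algo for algo in ALL_ALGOS if algo not in keep]


def build_tree_only_automl(max_runtime_secs=600):
    """Hypothetical helper: an AutoML run restricted to tree based models."""
    # Imported lazily so the helper above can be used without H2O installed.
    from h2o.automl import H2OAutoML

    return H2OAutoML(
        max_runtime_secs=max_runtime_secs,
        exclude_algos=algos_to_exclude(TREE_BASED),
    )
```

Excluding StackedEnsemble as well, as this sketch does, means the leaderboard will contain only single models, which can be handy when you need a simpler model to deploy.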

Use Case: Kinship Prediction

Recently, I joined a Kaggle competition that aims to determine whether two individuals are related. I’ve published a dedicated blog post about it, and you can also find my public kernel here.

My approach is mainly based on the cosine and Euclidean distances between two faces, computed with 3 different face recognition models. This gives 6 features plus the boolean is_related label. I built a GBM model with LightGBM and it got a ROC AUC score of 64% on both the public and private test sets. So, I wondered what the score would be if I ran AutoML.
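To make the features concrete, here is a small sketch of the two distance measures. It assumes, based only on the column names such as vgg_euclidean_l2, that the Euclidean distance is taken between L2-normalized embeddings. The function names are illustrative, and plain Python lists stand in for the real face embeddings.

```python
import math


def l2_normalize(vector):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def cosine_distance(a, b):
    """1 - cosine similarity; 0 means same direction, 1 means orthogonal."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return 1.0 - sum(x * y for x, y in zip(a_n, b_n))


def euclidean_l2_distance(a, b):
    """Euclidean distance between the L2-normalized versions of a and b."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a_n, b_n)))
```

Under this assumption the two measures are closely related (the squared Euclidean L2 distance equals twice the cosine distance), so the 6 features carry partly overlapping information.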

Data set

I’ve already shared this pre-processed data set on Kaggle.

import pandas as pd

tp_df = pd.read_csv("dataset/train_true_positive_features.csv")
tn_df = pd.read_csv("dataset/train_true_negative_features.csv")
df = pd.concat([tp_df, tn_df])
df = df.reset_index(drop = True)

df = df[['vgg_cosine', 'vgg_euclidean_l2'
, 'facenet_cosine', 'facenet_euclidean_l2'
, 'openface_cosine', 'openface_euclidean_l2'
, 'is_related']]

df.head()

I will split the data set into train and test sets so that I can evaluate the best model once it is found. BTW, this step is not strictly required, because H2O AutoML handles validation itself and uses cross-validation on the training data by default.

from sklearn.model_selection import train_test_split
x_train, x_test = train_test_split(df, test_size=0.15, random_state=17)

Then, we will import the H2O libraries. I ran my tests on H2O version 3.26.0.3.

import h2o
from h2o.automl import H2OAutoML

After that, you need to initialize the H2O engine.

h2o.init()

I can initialize the H2O engine on my local computer without problems, but I had some trouble when running it on a server. In that case, check the free memory and core count of the machine, because you should limit these values at initialization.

import multiprocessing
print(multiprocessing.cpu_count())

import psutil
print(psutil.virtual_memory())

h2o.init(ip="127.0.0.1", max_mem_size_GB = 200, nthreads = 10)
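Instead of hard-coding the limits, you could derive them from the machine. The sketch below is one possible approach, not an H2O recommendation: the 75% share and the helper names are my own assumptions. It relies on the documented nthreads and max_mem_size arguments of h2o.init, where max_mem_size accepts a string such as "48G".

```python
def suggest_h2o_limits(total_cores, free_mem_bytes, fraction=0.75):
    """Reserve a share of the cores and free memory for H2O (at least 1 each)."""
    nthreads = max(1, int(total_cores * fraction))
    mem_gb = max(1, int(free_mem_bytes * fraction) // 1024 ** 3)
    return {"nthreads": nthreads, "max_mem_size": f"{mem_gb}G"}


def start_h2o(limits):
    """Hypothetical helper: start a local H2O engine with the given limits."""
    # Imported lazily so suggest_h2o_limits works without H2O installed.
    import h2o

    h2o.init(
        ip="127.0.0.1",
        nthreads=limits["nthreads"],
        max_mem_size=limits["max_mem_size"],
    )
```

You can feed it the values printed above, for example multiprocessing.cpu_count() for the cores and psutil.virtual_memory().available for the free memory.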

You should see a summary table when initialization is successful.

H2O initialization

We’ve read the data set as a pandas data frame. We must convert it to an H2O frame before starting AutoML.

hf = h2o.H2OFrame(x_train)
#hf = h2o.import_file('dataset/x_train.csv') #import h2o frame directly

Next, we will initialize the H2O AutoML object. Here, we can limit the processing time. I set it to 1 hour to search for the best model.

aml = H2OAutoML(max_runtime_secs=60*60*1)
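The time limit is not the only way to budget the search. The sketch below collects a few other H2OAutoML constructor arguments (max_models, seed, sort_metric) in one place. The helper names and default values are illustrative assumptions, not the settings I used for the results in this post.

```python
def automl_settings(hours=1.0, max_models=None, seed=17):
    """Build keyword arguments for H2OAutoML from a simple budget."""
    settings = {
        "max_runtime_secs": int(hours * 3600),  # wall-clock budget in seconds
        "seed": seed,                           # for more repeatable runs
        "sort_metric": "AUC",                   # rank the leaderboard by AUC
    }
    if max_models is not None:
        settings["max_models"] = max_models     # cap on the number of models
    return settings


def build_automl(**budget):
    """Hypothetical helper: create the AutoML object from the budget above."""
    # Imported lazily so automl_settings works without H2O installed.
    from h2o.automl import H2OAutoML

    return H2OAutoML(**automl_settings(**budget))
```

As far as I know from the H2O documentation, a run limited only by time is generally not reproducible even with a seed, so max_models is the safer budget when you need repeatable results.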

Now, we can start the search. The data frame stores both the features and the target label, so we should specify the column names for each.

y_label = "is_related"
x_labels = list(df.drop(columns=[y_label]).columns)

This is a binary classification problem, so we must convert the is_related column to enum type before training. Otherwise, H2O handles this as a regression problem.

#convert target label to enum because this is a classification problem
hf[y_label] = hf[y_label].asfactor()

aml.train(x = x_labels, y = y_label, training_frame = hf)

You can see how well the built models performed once processing is done.

lb = aml.leaderboard
lb.head()
#lb.head(rows=lb.nrows) #to show all models

This shows classification related metrics. Here, an ensemble model got the best result.

AutoML Leaderboard

We can see the details of the best model, including the confusion matrix and several accuracy metrics. It is well worth a look!

aml.leader

We should save the best model so that we can restore it later.

saved_model = h2o.save_model(aml.leader, path = "")
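Restoring the model later might look like the sketch below. It assumes that h2o.save_model returned the path of the saved file, as it does for saved_model above, and that an H2O engine is already running. The function name is made up for illustration.

```python
def restore_and_score(model_path, pandas_frame):
    """Load a saved H2O model and return its predictions as a pandas frame."""
    # Imported lazily; h2o.init() must have been called before using this.
    import h2o

    model = h2o.load_model(model_path)    # path returned by h2o.save_model
    frame = h2o.H2OFrame(pandas_frame)    # pandas data frame to H2O frame
    return model.predict(frame).as_data_frame()
```

A call such as restore_and_score(saved_model, x_test) would then score the test set. To my knowledge, models saved this way should be loaded with the same H2O version that created them.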

Validation

We’ve already split the data set into train and test sets. The model was built on the train set, so we can evaluate it on the test set. Notice that the H2O engine has not seen the test data before.

hf_val = h2o.H2OFrame(x_test)

predictions = aml.predict(hf_val)
predictions_pd = predictions.as_data_frame() #h2o frame to pandas
predictions_pd.head()

actuals = hf_val['is_related']
actuals_pd = actuals.as_data_frame() #h2o frame to pandas

The predictions variable is a 3-column data frame. The first column, predict, is the class prediction, whereas the p0 column is the probability of class 0 and the p1 column is the probability of class 1.

Predictions

perf = aml.model_performance(hf_val)
perf.auc() #area under the ROC curve
perf.accuracy()[0][1] #accuracy at the threshold that maximizes it
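If you wonder what the AUC value actually measures, it can be computed by hand from the p1 probabilities and the actual labels. The sketch below uses the pairwise definition: the share of (positive, negative) pairs in which the positive example gets the higher score, with ties counting as half. It is a plain Python illustration, not how H2O computes the metric internally, and the function name is my own.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # a tie counts as half
    return wins / (len(positives) * len(negatives))
```

With the frames built earlier, you could call it as roc_auc(actuals_pd['is_related'].tolist(), predictions_pd['p1'].tolist()). The nested loop is quadratic, so it is meant for understanding rather than for large test sets.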

The area under the ROC curve is 72.64% on my custom test set. It also scored 72.9% on the public submission set and 73.5% on the private submission set. The scores are close to each other, so we can say that it is a robust model.

Submission

Notice that my previous LightGBM model got a ROC AUC score of 64%. An increase of almost 10 points is remarkable. AutoML delivers a better score with less effort.

Just a few words

So, we’ve covered one of the strongest AutoML tools on the market. Its main competitor, Google AutoML, is neither open-source nor available on-premise (it is cloud only). H2O, in contrast, is free and open-source, and it has a huge community.

I pushed the source code of this post to GitHub. There are many ways to support a project, and starring the GitHub repo is one of them.

Finally, I am very grateful to Erin LeDell for her valuable feedback on this post.

