People have always been wary of AI because they fear losing their daily jobs. Interestingly, the jobs of AI creators are threatened by AI, too, because AutoML tools have started to dominate Kaggle competitions. Constructing models is the state of the art of machine learning work, and AutoML can already do it better than most of us.
Previously, I mentioned AutoKeras – an AutoML tool for image based data. It builds convolutional neural networks, but it does not offer a solution for tabular data. H2O AutoML, on the other hand, is very successful with tabular data. As you might guess from its name, it is developed by the leading machine learning company – H2O.ai. I attended H2O World ’19 and saw this AutoML tool there for the first time. Today, we will walk through the H2O AutoML module for a custom use case scenario.
🙋♂️ You may consider enrolling in my top-rated machine learning course on Udemy
Vlog
You can either watch the following webinar or follow this blog post. They both cover the same steps.
Supported Algorithms
Notice that decision tree based algorithms are much more successful on structured data sets than neural networks. Accordingly, H2O AutoML covers the following algorithms:
- Generalized Linear Model (GLM)
- Distributed random forest (DRF)
- XGBoost
- Its own GBM implementation
- Deep Learning (actually fully connected neural networks, not CNNs)
Besides, it builds very successful stacked ensembles on top of these algorithms. If you want to limit the search to a subset of these algorithms, see the sketch below.
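The H2OAutoML constructor accepts include_algos and exclude_algos arguments for this purpose. The snippet below is just a minimal sketch of the idea, not a configuration used in this post; the algorithm names follow H2O’s own naming convention.

from h2o.automl import H2OAutoML

#a sketch: skip deep learning and stacked ensembles, keep tree based models and GLM
aml = H2OAutoML(
    max_runtime_secs = 60 * 60,
    exclude_algos = ["DeepLearning", "StackedEnsemble"]
)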
Use Case: Kinship Prediction
Recently, I entered a Kaggle competition that aims to determine whether two individuals are related. I’ve published a dedicated blog post about it, and you can also find my public kernel here.
My approach is mainly based on computing the cosine and Euclidean distances between two faces with 3 different face recognition models. This yields 6 features plus a boolean is_related label. I built a GBM model with LightGBM and scored 64% on both the public and private test sets. Herein, I wondered what the score would be if I ran AutoML.
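To give an idea of how those 6 features are produced, the sketch below shows the kind of distance calculations involved. This is only an illustration and assumes the face embeddings have already been extracted as numpy vectors; feature names such as vgg_cosine come from applying these functions to the embeddings of the respective face recognition model.

import numpy as np

def cosine_distance(a, b):
    #1 minus cosine similarity between two embedding vectors
    return 1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def euclidean_l2_distance(a, b):
    #euclidean distance after l2-normalizing both embeddings
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    return np.linalg.norm(a - b)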
Data set
I’ve already shared this pre-processed data set on Kaggle.
import pandas as pd

tp_df = pd.read_csv("dataset/train_true_positive_features.csv")
tn_df = pd.read_csv("dataset/train_true_negative_features.csv")

df = pd.concat([tp_df, tn_df])
df = df.reset_index(drop = True)

df = df[['vgg_cosine', 'vgg_euclidean_l2',
         'facenet_cosine', 'facenet_euclidean_l2',
         'openface_cosine', 'openface_euclidean_l2',
         'is_related']]

df.head()
I will split the data set into train and test sets to evaluate the model after the best one is found. BTW, this step is not strictly required because H2O already splits the training data internally for validation and cross-validation.
from sklearn.model_selection import train_test_split
x_train, x_test = train_test_split(df, test_size=0.15, random_state=17)
Then, we will import the H2O libraries. My tests were run on H2O version 3.26.0.3.
import h2o
from h2o.automl import H2OAutoML
After that, you need to initialize your H2O engine.
h2o.init()
I can initialize the H2O engine on my local computer without any problem, but I had some trouble when running it on a server. In that case, you should check your free memory and core count, because you may need to limit these values during initialization.
import multiprocessing
print(multiprocessing.cpu_count())

import psutil
print(psutil.virtual_memory())

h2o.init(ip="127.0.0.1", max_mem_size_GB = 200, nthreads = 10)
You should see a summary table when initialization is successful.
We’ve read the data set as a pandas data frame. We must convert it to an H2OFrame before starting AutoML.
hf = h2o.H2OFrame(x_train)
#hf = h2o.import_file('dataset/x_train.csv') #import h2o frame directly
Thereafter, we will initialize the H2O AutoML object. Here, we can limit the processing time. I set it to 1 hour to look for the best model.
aml = H2OAutoML(max_runtime_secs=60*60*1)
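Processing time is not the only configuration option. For instance, you can cap the number of models, fix a seed for reproducibility, or change the ranking metric. The following is just a sketch of some alternative arguments, not the configuration I used in this experiment.

#an alternative configuration (a sketch, not the one used in this post)
aml = H2OAutoML(
    max_models = 20, #stop after 20 base models are trained
    seed = 17, #make the run reproducible
    sort_metric = "AUC", #rank the leaderboard by AUC
    nfolds = 5 #number of cross validation folds
)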
This is a binary classification problem, so we should first convert the type of the is_related column to enum. Otherwise, H2O handles it as a regression problem.
#convert target label to enum because this is a classification problem
y_label = "is_related"
hf[y_label] = hf[y_label].asfactor()
Now, we can start the search. The data frame stores both the features and the target label. That’s why we should specify the column names of the features and of the target label.
x_labels = list(df.drop(columns=[y_label]).columns)
aml.train(x = x_labels, y = y_label, training_frame = hf)
You can see how successful the built models are once processing is done.
lb = aml.leaderboard
lb.head()
#lb.head(rows=lb.nrows) #to show all models
This shows classification related metrics. Here, the stacked ensemble model got the best result.
We can see the details of the best model. It stores the confusion matrix and some accuracy metrics. You should definitely check it out!
aml.leader
We should save the best model so that we can restore it later.
saved_model = h2o.save_model(aml.leader, path = "")
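Restoring it later is straightforward. Here is a minimal sketch, assuming the path returned by save_model above is still available.

#load the saved leader model back and use it for predictions later
restored_model = h2o.load_model(saved_model)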
Validation
We’ve already separated the data set into train and test sets. The model is built on the train data and we can evaluate it with the test data. Notice that the H2O engine has not seen the test data before.
hf_val = h2o.H2OFrame(x_test)

predictions = aml.predict(hf_val)
predictions_pd = predictions.as_data_frame() #h2o frame to pandas
predictions_pd.head()

actuals = hf_val['is_related']
actuals_pd = actuals.as_data_frame() #h2o frame to pandas
The predictions variable is a data frame with 3 columns. The first column (predict) is the predicted class, whereas the p0 column is the probability of class 0 and the p1 column is the probability of class 1.
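If you prefer to evaluate the predictions on the pandas side as well, the following sketch compares them against the ground truth with scikit-learn. This is an alternative to the H2O performance call below and assumes scikit-learn is installed.

from sklearn.metrics import accuracy_score, roc_auc_score

#compare predicted classes and positive class probabilities to the actual labels
print(accuracy_score(actuals_pd['is_related'], predictions_pd['predict'].astype(int)))
print(roc_auc_score(actuals_pd['is_related'], predictions_pd['p1']))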
perf = aml.model_performance(hf_val)
perf.auc()
perf.accuracy()[0][1]
The area under the ROC curve score is 72.64 on my custom test set. Besides, it got 72.9 on the public submission set and 73.5 on the private one. We can say that it is a robust model.
Notice that my previous LightGBM model got a 64 ROC AUC score. An almost 10 point increase is unbelievable. AutoML comes with less effort and higher accuracy.
Just a few words
So, we’ve mentioned one of the strongest AutoML tools on the market. Its main competitor, Google AutoML, is neither available on-premise (it is cloud only) nor open-source. Herein, H2O is free, open-source, and has a huge community.
I pushed the source code of this post to GitHub. There are many ways to support a project – starring the GitHub repos is one.
Finally, I am very grateful to Erin LeDell for her valuable feedback on this post.
Support this blog if you like it!