Explaining h2o models with Lime

Interpretability and accuracy are inversely proportional concepts. Models offering higher accuracy, such as deep learning or GBM, tend to be less interpretable. However, interpretability cannot be discarded even if the model will be deployed to production. Being explainable also helps to avoid overfitting in the research step. Built h2o models already store feature importance values, similar to logistic regression. We can also explain individual predictions made by h2o models with Lime.

h2o-lime-med-v2
Lime water

Pre-trained model

Remember that we’ve built a GBM model for kinship prediction. The pre-trained model is already pushed to GitHub.


import h2o

#start or connect to an h2o cluster before loading the pre-trained model
h2o.init()

model = h2o.load_model("models/GBM_Kinship")
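
As mentioned in the introduction, built h2o models already store global feature importance values. As a minimal sketch, assuming the model loaded above, you could inspect them before moving on to instance-level explanations.

#global feature importance stored in the trained model (illustrative sketch)
importance_df = model.varimp(use_pandas = True)
print(importance_df)

#bar chart of the relative importances
model.varimp_plot()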

Data set

We will use the same data set that the pre-trained model was built on.

hf_positive = h2o.import_file('dataset/train_true_positive_features.csv')
hf_negative = h2o.import_file('dataset/train_true_negative_features.csv')
hf = hf_positive.rbind(hf_negative)

#discard unnecessary features
hf = hf[['vgg_cosine', 'vgg_euclidean_l2'
, 'facenet_cosine', 'facenet_euclidean_l2'
, 'openface_cosine', 'openface_euclidean_l2'
, 'is_related']]

#convert target label to enum to transform the problem to classification
hf['is_related'] = hf['is_related'].asfactor()
hf.head()

#70% train, 15% test, 15% validation
train, test, validation = hf.split_frame(ratios=[0.70, 0.15], seed=17)

We set the seed value to 17, the same value used while building the model. This way, we can be sure that the built model was trained on the train h2o frame and that the test h2o frame was used for early stopping. In other words, the built model has never seen the validation data.
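
If you want to double check the split, a quick sanity sketch, assuming the frame names above, is to compare the frame sizes against the 70% / 15% / 15% ratios.

#illustrative check of the split sizes
print(hf.nrows, train.nrows, test.nrows, validation.nrows)
print(train.nrows / hf.nrows, test.nrows / hf.nrows, validation.nrows / hf.nrows)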

h2o-lime-dataset
Data set

Here, the is_related column is the boolean target whereas the others are features.

Lime

Lime is available on PyPI, the Python package index. You can install it with the command pip install lime. Then, we can import it in our study.

import lime
import lime.lime_tabular

We first define the lime explainer.

explainer = lime.lime_tabular.LimeTabularExplainer(
    train_features_numpy
    , feature_names = feature_names
    #, class_names = class_names
)

It expects the train set features in numpy format and the feature names list. Optionally, we can pass the label class names, but this is not a must. If class names are not set, the probability index values are used instead. The label classes are already 0 and 1 in our data set; that’s why I do not set class names.
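
If you prefer human-readable labels in the explanation plots, a hedged alternative is to pass class names explicitly. The names unrelated and related below are illustrative assumptions rather than values from the original notebook.

explainer = lime.lime_tabular.LimeTabularExplainer(
    train_features_numpy
    , feature_names = feature_names
    , class_names = ['unrelated', 'related'] #assumed names for labels 0 and 1
)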

Firstly, we need to find the feature names. The columns attribute of the h2o frame returns the column names. The target label is the rightmost column of the data set, so we can discard it by slicing from index 0 to -1.

feature_names = train.columns[0: -1]

Secondly, we need the train set features in numpy format. Passing the feature names in square brackets returns only those columns, still as an h2o frame. Then, the as_data_frame function converts this h2o frame to a pandas data frame, and the values attribute converts the pandas data frame to a numpy array.

train_features_numpy = train[feature_names].as_data_frame().values

Explaining instance

We’ve already created the explainer. It has an explain_instance function, which expects the features of an instance in numpy format, a prediction function and the total number of features to show.

exp = explainer.explain_instance(
        instance_numpy
        , findPrediction
        , num_features = len(feature_names)
    )

Let’s get the features of an instance first.

idx = 17
validation_df = validation.as_data_frame()
instance_numpy = validation_df.iloc[idx].values[0:-1]

We’ve loaded the pre-trained model into the model variable. The prediction function depends on that object, but here we need to create a mediator function. This function receives instance features in numpy format. We convert that numpy array to a pandas data frame and then to an h2o frame, because the model expects its input features in h2o frame format. The predict function returns a frame with 3 columns: the first one is the predicted class whereas the others are the class probabilities. Herein, we just need the probabilities. That’s why we filter the predictions from index 1 to the end, ignoring the 0-index column.

import pandas as pd

def findPrediction(instance):
    #lime passes instance features as a numpy array
    df = pd.DataFrame(data = instance, columns = feature_names)
    hf = h2o.H2OFrame(df)
    predictions = model.predict(hf).as_data_frame()

    #predictions is a 3 columned data frame: 1st column is the class prediction, the others are class probabilities
    #lime needs just the prediction probabilities
    predictions = predictions.iloc[:, 1:].values
    return predictions

Now, the explain_instance function is ready to call. We can then show the returned explanation in the notebook.

exp.show_in_notebook(show_table=True, show_all=False)
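
As a hedged side note, if you are running outside of a notebook, the same explanation object can also be dumped as plain feature and weight pairs or as a matplotlib figure.

#alternative outputs of the same explanation object
print(exp.as_list()) #list of (feature rule, weight) tuples for class 1
fig = exp.as_pyplot_figure()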

Let’s explain some instances. The first instance in the validation set is predicted as 0. You can see why this instance is classified as unrelated: the facenet cosine feature plays the pivotal role. The prediction gets closer to 0 when the facenet cosine value is greater than 1.

h2o-lime-explaining-0
Explaining 0th index validation instance

The 15th index instance in the validation set is classified as related. Facenet cosine is the most dominant feature in this case, too. The prediction gets closer to 1 when the facenet cosine value is less than or equal to 0.75.

h2o-lime-explaining-15
Explaining 15th index validation instance

Lime aims to explain any machine learning model. In this post, we’ve adapted it for an h2o GBM model. The Driverless AI module of h2o supports k-lime out of the box. In this way, you can confirm the robustness of your model before deploying it to production.

The source code is pushed to GitHub as a notebook. The explanation graphics disappear when the notebook is exported, so you should run the notebook yourself to see the same graphs.
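
If you need the graphics outside of the notebook, one hedged workaround is to persist each explanation as a standalone html file; the file name below is just an example.

#store the interactive explanation as an html file (illustrative file name)
exp.save_to_file("lime_explanation_17.html")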

