Interpretable Machine Learning with H2O and SHAP

Previously, we made explanations for h2o.ai models with LIME. LIME lets us question the predictions made by built models. Here, SHAP offers some improvements over LIME. For example, you can discover feature importance values or visualize explanations for many instances at once. Besides, it includes LIME's single prediction explanation module. We will mention how h2o and SHAP can be used together in this post.

h2o-shap-cover
SHAP enables interpretable h2o models

Vlog

We cover machine learning interpretability in the following video as well. You can follow this blog post alongside it.


🙋‍♂️ You may consider enrolling in my top-rated machine learning course on Udemy

Decision Trees for Machine Learning

Naturally explainable algorithms

Some algorithms, such as linear regression or decision trees, are naturally explainable, and we can read feature importance values from them directly. However, algorithms such as deep learning or gradient boosted decision trees are total black boxes. SHAP helps us explain those black boxes. You can read more about the naturally interpretable algorithms in the posts linked below; a short sketch follows the links.

Feature importance for linear regression

Feature importance for decision trees
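
As a small illustration of such a naturally explainable model (a sketch that is not part of the original post and uses scikit-learn rather than h2o), a decision tree exposes its feature importance values directly, with no extra explanation framework.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

#fit a shallow decision tree on the bundled iris data
iris = load_iris()
tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(iris.data, iris.target)

#feature importances come for free with the trained tree
for name, importance in zip(iris.feature_names, tree.feature_importances_):
    print(name, round(importance, 4))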

Data set

We are going to use the Iris data set. Iris is actually a flower, and each instance belongs to one of 3 different types: setosa, versicolor and virginica. The data set contains sepal and petal sizes as width and length. These sizes will be the features, whereas the flower type will be the target in our model.

import h2o

#start a local h2o cluster before loading data
h2o.init()

feature_names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
target_name = 'iris_class'

features = h2o.import_file('iris-attr.data', col_names = feature_names)
labels = h2o.import_file('iris-labels.data', col_names = [target_name])

hf = features.cbind(labels)

#this is a classification problem
hf['iris_class'] = hf['iris_class'].asfactor()

hf.tail(5)

We are going to work on the following data set:

iris-tail
Iris data set sample

Model

We are going to build a shallow GBM model. As you know, GBM has low interpretability but high accuracy by default.

from h2o.estimators.gbm import H2OGradientBoostingEstimator

model = H2OGradientBoostingEstimator(
    ntrees = 10
    , learn_rate = 0.01
    , stopping_metric = "logloss"
)

model.train(x = feature_names, y = target_name
    , training_frame = hf
    , model_id = "GBM_for_iris"
)

model

The shallow boosted tree achieved a high accuracy.

h2o-iris-model-detail
Built model metrics

Feature Importance

The built model already stores feature importance values.
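
A minimal sketch for retrieving them, assuming the model object trained above; varimp returns the relative, scaled and percentage importance of each feature.

#retrieve variable importances as a pandas data frame
importances = model.varimp(use_pandas=True)
print(importances)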





h2o-iris-importance
Feature importance

We can plot these importance values.

model.varimp_plot()
h2o-iris-importance-plot
Plotting global feature importance

Predictions

The built model has a predict function. It expects an h2o frame as input and returns predictions in h2o frame format, too.

predictions = model.predict(test_data = hf)
predictions.tail(5)
predictions_pd = predictions['predict'].as_data_frame() #h2o frame to pandas

Predictions come back as an h2o frame. It stores the class probabilities as p0, p1 and p2, and the highest scored class in the predict column.
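
For instance, assuming the probability columns are named p0, p1 and p2 as in the frame above, they can be pulled out next to the predicted class like this (a small sketch, not from the original post):

#extract the class probability columns and the predicted class
probabilities_pd = predictions[['p0', 'p1', 'p2']].as_data_frame()
print(probabilities_pd.head())
print(predictions_pd.head())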

h2o-iris-predictions
Predictions

Bridge between h2o and SHAP

SHAP expects a prediction function and a background data frame as input. With Keras, we can pass the model.predict function directly because that API both consumes and returns numpy arrays. However, h2o expects h2o frames as input and output. That's why we are going to build a mediator function. It is basically responsible for converting numpy arrays to h2o frames and vice versa.

import pandas as pd

def h2opredict(nf): #nf is a numpy array
    df = pd.DataFrame(nf, columns = feature_names) #numpy to pandas
    df[target_name] = 0 #initialize the target column
    hf = h2o.H2OFrame(df) #pandas to h2o frame
    predictions = model.predict(test_data = hf) #predictions in h2o frame type
    predictions_pd = predictions[predictions.columns[1:]].as_data_frame() #h2o frame to pandas
    return predictions_pd.values #pandas to numpy

We first convert the input numpy object to a pandas data frame and then to an h2o frame, because the prediction function of the built model expects its input in h2o frame type. Then, we convert the predictions from h2o frame back to pandas and then to numpy. Here, we retrieve just the class probabilities with the predictions[predictions.columns[1:]] expression. This discards the first predict column in the prediction frame.
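
As a quick sanity check of the mediator (a sketch, assuming the frames defined above), we can feed it a few rows as a numpy array and confirm that a rows-by-classes probability matrix comes back.

#3 input rows should yield a 3x3 matrix of class probabilities
sample_rows = features.as_data_frame().values[:3]
sample_probs = h2opredict(sample_rows)
print(sample_probs.shape) #expected: (3, 3)
print(sample_probs)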

SHAP

We are going to use the Kernel Explainer. We pass the mediator function and a pandas data frame as input.

import shap
df = hf.as_data_frame(); df = df.drop(columns = ['iris_class'])
explainer = shap.KernelExplainer(h2opredict, df, link="logit")
shap_values = explainer.shap_values(df, nsamples=100)
shap.initjs()

Calculating the SHAP values is the most costly operation.
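
If this step becomes too slow on larger data sets, a common workaround (not used in the original post, just a hedged sketch) is to summarize the background data with shap.kmeans before building the explainer, which reduces the number of model calls per explained instance at the cost of some accuracy.

#optional: summarize the background data into 10 weighted centroids
background = shap.kmeans(df, 10)
fast_explainer = shap.KernelExplainer(h2opredict, background, link="logit")
fast_shap_values = fast_explainer.shap_values(df, nsamples=100)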

Single Explanation

You can question why a decision was made by the built model. Let's focus on the 17th indexed instance in the data set.

sample = 17 #explain the 17th indexed instance in the data set

labels_pd = labels.as_data_frame()
actual = labels_pd.iloc[sample].values[0]
prediction = int(predictions_pd.iloc[sample]['predict']) #predicted class index
print("Prediction for the", sample, "th instance is", prediction, "whereas its actual value is", actual)

shap.force_plot(explainer.expected_value[prediction], shap_values[prediction][sample,:], df.iloc[sample])

Here, the expected value of the explainer has 3 items. Each item refers to a class, and we just need the predicted class. Also, the 17th indexed instance is predicted as 0 whereas its actual value is 0. The shap_values variable has 3 items as well, one per class, and again we focus on the predicted class. Each item, e.g. shap_values[0], has a shape of (150, 4), matching the number of instances and features. Here, we should focus on the 17th instance. That's why I passed shap_values[prediction][sample,:] as the second parameter.
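
These shapes can be verified directly; the check below assumes the objects computed above and the list-style shap_values returned by the SHAP version used in this post.

#one expected value and one (instances x features) matrix per class
print(len(explainer.expected_value)) #3 classes
print(len(shap_values)) #3 classes
print(shap_values[0].shape) #(150, 4)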

h2o-iris-single-explanation
Single explanation

This single instance explanation is very similar to LIME, but SHAP goes beyond it.





Many Explanations

We can evaluate the built model for all instances at once. This lets us monitor the model more deeply.

#class names were not defined earlier; derive them from the target factor levels
class_names = hf[target_name].levels()[0]

for idx in range(0, len(class_names)):
    print(class_names[idx])
    shap.force_plot(explainer.expected_value[idx], shap_values[idx], df, link="logit")

This offers a deep analysis for every class. In this way, you get an idea of the output class for any input set.

h2o-iris-setosa-analysis
Setosa class analysis
h2o-iris-versicolor-analysis
Versicolor class analysis
h2o-iris-virginica-analysis
Virginica class analysis

Feature Importance

The built model already stores global feature importance values, but SHAP stores feature importance values at the output class level.

shap.summary_plot(shap_values, features.as_data_frame(), plot_type="bar")

This shows an integrated graph for all class candidates.

h2o-iris-feature-importance-integrated
Integrated feature importance graph

You can plot feature importance values for individual classes as well.

for i in range(0, len(class_names)):
    current_class = class_names[i]
    print("Feature importances for ", current_class)
    shap.summary_plot(shap_values[i], features.as_data_frame(), plot_type="bar")

In this way, you can see the relation between features and classes. A feature might not be important for one class but might be very important for another.

h2o-iris-feature-importance-custom
Plotting feature importance values for custom classes

So, we've mentioned how to enable SHAP for h2o models. To be honest, SHAP offers much deeper explanations than LIME. On the other hand, its time cost is much higher than LIME's. No matter which interpretability framework you use, you can only trust complex machine learning models when they are explainable. I pushed the source code of this post to GitHub. Some blocks disappear because of javascript, so you should run the same code in your own environment to reproduce the graphs.


Like this blog? Support me on Patreon

Buy me a coffee


3 Comments

  1. Thanks! for the wonderful explanation. I have one doubt here, How can we calculate Shap values in H2O models if the dataset contains non-numeric type column?

    1. In the background, it applies one hot encoding for categorical features. In other words, a categorical feature has multiple classes and each class has an importance value.
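
For illustration only (this sketch is not from the post and the column below is hypothetical), one way to keep the numpy-to-h2o bridge purely numeric is to one-hot encode such a column with pandas before building the SHAP background frame:

import pandas as pd

#hypothetical frame with a non-numeric 'color' column
raw_df = pd.DataFrame({
    'petal_length': [1.4, 4.3],
    'petal_width': [0.2, 1.3],
    'color': ['purple', 'white']
})

#each category becomes its own 0/1 column, so it gets its own importance value
encoded_df = pd.get_dummies(raw_df, columns=['color'])
print(encoded_df.columns.tolist())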
