Interpretable Machine Learning with H2O and SHAP

Previously, we made explanations for h2o.ai models with LIME. LIME lets us question the predictions made by built models. Here, SHAP offers some improvements over LIME. For example, you can discover feature importance values or visualize explanations for many instances at once. Besides, it includes LIME's single prediction explanation module. We will mention how h2o and SHAP can be used together in this post.

h2o-shap-cover
SHAP enables interpretable h2o models

Vlog

We cover machine learning interpretability in the following video as well. You can follow this blog post alongside it.


🙋‍♂️ You may consider enrolling in my top-rated machine learning course on Udemy

Decision Trees for Machine Learning

Naturally explainable algorithms

Some algorithms, such as linear regression or decision trees, are naturally explainable, and we can read feature importance values from them directly. However, algorithms such as deep learning or gradient boosted decision trees are total black boxes. SHAP helps us explain those black boxes. You can read more about the naturally interpretable algorithms in the posts linked below; a short sketch follows the links.

Feature importance for linear regression

Feature importance for decision trees
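
As a small illustration of such a naturally explainable model (a sketch that is not part of the original post and uses scikit-learn rather than h2o), a decision tree exposes its feature importance values directly, with no extra explanation framework.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

#fit a shallow decision tree on the bundled iris data
iris = load_iris()
tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(iris.data, iris.target)

#feature importances come for free with the trained tree
for name, importance in zip(iris.feature_names, tree.feature_importances_):
    print(name, round(importance, 4))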

Data set

We are going to use the Iris data set. Iris is actually a flower, and each instance belongs to one of 3 different types: setosa, versicolor and virginica. The data set contains sepal and petal sizes as width and length. These sizes will be the features, whereas the flower type will be the target in our model.

import h2o

#start a local h2o cluster before loading data
h2o.init()

feature_names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
target_name = 'iris_class'

features = h2o.import_file('iris-attr.data', col_names = feature_names)
labels = h2o.import_file('iris-labels.data', col_names = [target_name])

hf = features.cbind(labels)

#this is a classification problem
hf['iris_class'] = hf['iris_class'].asfactor()

hf.tail(5)

We are going to work on the following data set:

iris-tail
Iris data set sample

Model

We are going to build a shallow GBM model. As you know, GBM has low interpretability but high accuracy by default.

from h2o.estimators.gbm import H2OGradientBoostingEstimator

model = H2OGradientBoostingEstimator(
    ntrees = 10
    , learn_rate = 0.01
    , stopping_metric = "logloss"
)

model.train(x = feature_names, y = target_name
    , training_frame = hf
    , model_id = "GBM_for_iris"
)

model

The shallow boosted tree achieved a high accuracy.

h2o-iris-model-detail
Built model metrics

Feature Importance

The built model already stores feature importance values.
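
A minimal sketch for retrieving them, assuming the model object trained above; varimp returns the relative, scaled and percentage importance of each feature.

#retrieve variable importances as a pandas data frame
importances = model.varimp(use_pandas=True)
print(importances)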





h2o-iris-importance
Feature importance

We can plot these importance values.

model.varimp_plot()
h2o-iris-importance-plot
Plotting global feature importance

Predictions

The built model has a predict function. It expects an h2o frame as input and returns predictions in h2o frame format, too.

predictions = model.predict(test_data = hf)
predictions.tail(5)
predictions_pd = predictions['predict'].as_data_frame() #h2o frame to pandas

Predictions come back as an h2o frame. It stores the class probabilities as p0, p1 and p2, and the highest scored class in the predict column.
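
For instance, assuming the probability columns are named p0, p1 and p2 as in the frame above, they can be pulled out next to the predicted class like this (a small sketch, not from the original post):

#extract the class probability columns and the predicted class
probabilities_pd = predictions[['p0', 'p1', 'p2']].as_data_frame()
print(probabilities_pd.head())
print(predictions_pd.head())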

h2o-iris-predictions
Predictions

Bridge between h2o and SHAP

SHAP expects a prediction function and a background data frame as input. With Keras, we can pass the model.predict function directly because that API both consumes and returns numpy arrays. However, h2o expects h2o frames as input and output. That's why we are going to build a mediator function. It is basically responsible for converting numpy arrays to h2o frames and vice versa.

import pandas as pd

def h2opredict(nf): #nf is a numpy array
    df = pd.DataFrame(nf, columns = feature_names) #numpy to pandas
    df[target_name] = 0 #initialize the target column
    hf = h2o.H2OFrame(df) #pandas to h2o frame
    predictions = model.predict(test_data = hf) #predictions in h2o frame type
    predictions_pd = predictions[predictions.columns[1:]].as_data_frame() #h2o frame to pandas
    return predictions_pd.values #pandas to numpy

We first convert the input numpy object to a pandas data frame and then to an h2o frame, because the prediction function of the built model expects its input in h2o frame type. Then, we convert the predictions from h2o frame back to pandas and then to numpy. Here, we retrieve just the class probabilities with the predictions[predictions.columns[1:]] expression. This discards the first predict column in the prediction frame.
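
As a quick sanity check of the mediator (a sketch, assuming the frames defined above), we can feed it a few rows as a numpy array and confirm that a rows-by-classes probability matrix comes back.

#3 input rows should yield a 3x3 matrix of class probabilities
sample_rows = features.as_data_frame().values[:3]
sample_probs = h2opredict(sample_rows)
print(sample_probs.shape) #expected: (3, 3)
print(sample_probs)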

SHAP

We are going to use the Kernel Explainer. We pass the mediator function and a pandas data frame as input.

import shap
df = hf.as_data_frame(); df = df.drop(columns = ['iris_class'])
explainer = shap.KernelExplainer(h2opredict, df, link="logit")
shap_values = explainer.shap_values(df, nsamples=100)
shap.initjs()

Calculating the SHAP values is the most costly operation.
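
If this step becomes too slow on larger data sets, a common workaround (not used in the original post, just a hedged sketch) is to summarize the background data with shap.kmeans before building the explainer, which reduces the number of model calls per explained instance at the cost of some accuracy.

#optional: summarize the background data into 10 weighted centroids
background = shap.kmeans(df, 10)
fast_explainer = shap.KernelExplainer(h2opredict, background, link="logit")
fast_shap_values = fast_explainer.shap_values(df, nsamples=100)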

Single Explanation

You can question why a decision was made by the built model. Let's focus on the 17th indexed instance in the data set.

sample = 17 #explain the 17th indexed instance in the data set

labels_pd = labels.as_data_frame()
actual = labels_pd.iloc[sample].values[0]
prediction = int(predictions_pd.iloc[sample]['predict']) #predicted class index
print("Prediction for the", sample, "th instance is", prediction, "whereas its actual value is", actual)

shap.force_plot(explainer.expected_value[prediction], shap_values[prediction][sample,:], df.iloc[sample])

Here, the expected value of the explainer has 3 items. Each item refers to a class, and we just need the predicted class. Also, the 17th indexed instance is predicted as 0 whereas its actual value is 0. The shap_values variable has 3 items as well, one per class, and again we focus on the predicted class. Each item, e.g. shap_values[0], has a shape of (150, 4), matching the number of instances and features. Here, we should focus on the 17th instance. That's why I passed shap_values[prediction][sample,:] as the second parameter.
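
These shapes can be verified directly; the check below assumes the objects computed above and the list-style shap_values returned by the SHAP version used in this post.

#one expected value and one (instances x features) matrix per class
print(len(explainer.expected_value)) #3 classes
print(len(shap_values)) #3 classes
print(shap_values[0].shape) #(150, 4)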

h2o-iris-single-explanation
Single explanation

This single instance explanation is very similar to LIME, but SHAP goes beyond it.





Many Explanations

We can evaluate the built model for all instances at once. This lets us monitor the model more deeply.

#class names were not defined earlier; derive them from the target factor levels
class_names = hf[target_name].levels()[0]

for idx in range(0, len(class_names)):
    print(class_names[idx])
    shap.force_plot(explainer.expected_value[idx], shap_values[idx], df, link="logit")

This offers a deep analysis for every class. In this way, you get an idea of the output class for any input set.

h2o-iris-setosa-analysis
Setosa class analysis
h2o-iris-versicolor-analysis
Versicolor class analysis
h2o-iris-virginica-analysis
Virginica class analysis

Feature Importance

The built model already stores global feature importance values, but SHAP stores feature importance values at the output class level.

shap.summary_plot(shap_values, features.as_data_frame(), plot_type="bar")

This shows an integrated graph for all class candidates.

h2o-iris-feature-importance-integrated
Integrated feature importance graph

You can plot feature importance values for individual classes as well.

for i in range(0, len(class_names)):
    current_class = class_names[i]
    print("Feature importances for ", current_class)
    shap.summary_plot(shap_values[i], features.as_data_frame(), plot_type="bar")

In this way, you can see the relation between features and classes. A feature might not be important for one class but might be very important for another.

h2o-iris-feature-importance-custom
Plotting feature importance values for custom classes

So, we've mentioned how to enable SHAP for h2o models. To be honest, SHAP offers much deeper explanations than LIME. On the other hand, its time cost is much higher than LIME's. No matter which interpretability framework you use, you can only trust complex machine learning models when they are explainable. I pushed the source code of this post to GitHub. Some blocks disappear because of javascript, so you should run the same code in your own environment to reproduce the graphs.


Like this blog? Support me on Patreon

Buy me a coffee


3 Comments

  1. Thanks! for the wonderful explanation. I have one doubt here, How can we calculate Shap values in H2O models if the dataset contains non-numeric type column?

    1. In the background, it applies one hot encoding for categorical features. In other words, a categorical feature has multiple classes and each class has an importance value.
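
For illustration only (this sketch is not from the post and the column below is hypothetical), one way to keep the numpy-to-h2o bridge purely numeric is to one-hot encode such a column with pandas before building the SHAP background frame:

import pandas as pd

#hypothetical frame with a non-numeric 'color' column
raw_df = pd.DataFrame({
    'petal_length': [1.4, 4.3],
    'petal_width': [0.2, 1.3],
    'color': ['purple', 'white']
})

#each category becomes its own 0/1 column, so it gets its own importance value
encoded_df = pd.get_dummies(raw_df, columns=['color'])
print(encoded_df.columns.tolist())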
