Feature Importance in Logistic Regression for Machine Learning Interpretability

Feature importance is a common way to build interpretable machine learning models and also to explain existing ones. It enables us to see the big picture while making decisions and to avoid black box models. We’ve mentioned feature importance for linear regression and decision trees before. Besides, we’ve mentioned the SHAP and LIME libraries for explaining complex models such as deep learning or gradient boosting. In this post, we will find the feature importance of the logistic regression algorithm from scratch.

Sigmoid function by Ian Goodfellow

Ian Goodfellow presented the sigmoid function in his PhD defense in a very funny way.


🙋‍♂️ You may consider enrolling in my top-rated machine learning course on Udemy

Decision Trees for Machine Learning

Vlog

You can either watch the following video or read this tutorial. They both cover the feature importance of the logistic regression algorithm in Python for machine learning interpretability and explainable AI.

Linear regression

Logistic regression is derived from linear regression. The main difference between the two algorithms is that linear regression produces continuous outputs, whereas logistic regression produces class probabilities for binary classification. Remembering feature importance in linear regression helps to understand feature importance in logistic regression. I summarized feature importance in linear regression in the following video.

Logistic regression is linear

Logistic regression is mainly based on the sigmoid function. The graph of the sigmoid has an S-shape. That might confuse you, and you may assume it is a non-linear function. But that is not true. Logistic regression is just a linear model. That’s why most resources mention it as a generalized linear model (GLM).

Actually, logistic regression is very similar to the perceptron. In the perceptron, we just used the identity function as the activation; you can drop that activation because it is a dummy layer. In logistic regression, the activation function becomes the sigmoid function.

Logistic regression schema

Remember that hidden layers are what make multilayer perceptrons (or neural networks) non-linear. Notice that there is no hidden layer in logistic regression. In other words, we cannot summarize the output of a neural network as a linear function of its inputs, but we can do it for logistic regression.

So, it is naturally easy to explain such a linear model.
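To make this more concrete, here is a minimal sketch (with made-up weights and a made-up instance, not the fitted ones) showing that the only learnable part of logistic regression is a single linear combination of the inputs, with the sigmoid applied on top.

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# illustrative (assumed) weights, intercept and instance -- not the fitted values
w = np.array([0.4, -1.4, 1.5, 1.6])
b = -0.1
x = np.array([5.7, 2.8, 4.1, 1.3])

z = np.dot(w, x) + b   # the linear part (log-odds); there is no hidden layer
p = sigmoid(z)         # probability of the positive class
print(z, p)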

Data set

We are going to build a logistic regression model for the iris data set. Its features are sepal length, sepal width, petal length and petal width, and its target classes are setosa, versicolor and virginica. However, having 3 classes in the target would require building 3 different binary classification models with logistic regression. To keep it simple, I will drop the virginica class and turn it into a binary data set.

Iris data set
Pre-processing

Let’s remember the logistic regression equation first.

z = w0 + w1x1 + w2x2 + w3x3 + w4x4

y = 1 / (1 + e^(-z))

x1 stands for sepal length; x2 stands for sepal width; x3 stands for petal length; x4 stands for petal width.

The output y is the probability of the positive class. If it gets closer to 1, the instance is predicted as versicolor, whereas it is predicted as setosa when the probability gets closer to 0.

The output is unitless. If the left side of the equation has no unit, then its right side must be unitless as well. Let’s focus on the z equation. The x1 term stands for sepal length and its unit is centimeters. To keep z unitless, the product of x1 and w1 has to be unitless as well.

We can divide the x1 term by its standard deviation to get rid of the unit, because the unit of the standard deviation is the same as its feature’s. Alternatively, we can feed x1 as is and find w1 first; its unit becomes 1/centimeters in that case, so multiplying w1 by the standard deviation of x1 works as well. I prefer the first approach in this study.

Loading the data set

Luckily, sklearn offers the iris data set as an out-of-the-box function.

from sklearn.datasets import load_iris
import pandas as pd

feature_names = ["sepal_length", "sepal_width", "petal_length", "petal_width"]

x, y = load_iris(return_X_y=True)
df = pd.DataFrame(x, columns = feature_names)
df['target'] = y
print(df.head())

As I mentioned before, I’m going to drop the virginica class in the data set to make it a binary classification problem.

#0: setosa, 1: versicolor, 2: virginica
df = df[df['target'] != 2]
Normalize inputs

I’m going to walk over the columns and divide each instance by the standard deviation of its column. In this way, the features become unitless.

for feature_name in feature_names:
    df[feature_name] = df[feature_name] / df[feature_name].std()

Some researchers subtract the mean of the column from each instance first, then divide by the standard deviation. Both approaches will work here.
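If you prefer the mean-subtracting variant, a minimal sketch (meant to replace the loop above, not to run after it) could look like this:

# alternative: z-score normalization (subtract the mean, then divide by the standard deviation)
for feature_name in feature_names:
    df[feature_name] = (df[feature_name] - df[feature_name].mean()) / df[feature_name].std()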

Modelling

We’ve finished pre-processing the data set. We have unitless features and binary class values in the target. We can build the logistic regression model now.

from sklearn.linear_model import LogisticRegression
model = LogisticRegression(random_state=0).fit(df[feature_names].values, df['target'].values)

score = model.score(df[feature_names].values, df['target'].values)
print(score)

I got 100% accuracy for 100 instances. Of course, that’s the training set accuracy; I should split the data set into train, test and validation sets, but this is an experimental study, so I skip those stages. I am not interested in overfitting in this study.
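If you do want a proper evaluation, a minimal hold-out split could look like the sketch below; the rest of this post keeps working on the full data set, so this is just for reference.

from sklearn.model_selection import train_test_split

# hold out 20% of the instances for testing
x_train, x_test, y_train, y_test = train_test_split(
    df[feature_names].values, df['target'].values, test_size=0.2, random_state=0
)

split_model = LogisticRegression(random_state=0).fit(x_train, y_train)
print(split_model.score(x_test, y_test))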

Prediction

The built model already stores the intercept and the coefficients. Let’s focus on those parameters to understand the algorithm well.

w0 = model.intercept_[0]
w = w1, w2, w3, w4 = model.coef_[0]

equation = "y = %f + (%f * x1) + (%f * x2) + (%f * x3) + (%f * x4)" % (w0, w1, w2, w3, w4)
print(equation)

Logistic regression model has the following equation:

y = -0.102763 + (0.444753 * x1) + (-1.371312 * x2) + (1.544792 * x3) + (1.590001 * x4)

Let’s predict an instance based on the built model.

idx = 99
x = df.iloc[idx][feature_names].values
y = model.predict_proba(x.reshape(1, -1))[0]
print(y[1])

Prediction of the 100th instance (notice that the index starts at 0) is 0.9782192589879745 based on the predict_proba function. We can find the same value based on the equation.

import math
def sigmoid(x):
    return 1 / (1 + pow(math.e, -x))

result = 0
result += w0
for i in range(0, 4):
    result += x[i] * w[i]
result = sigmoid(result)
print(result)

This calculates the result 0.9782192589879745 as well.

Interpretability

We will use the coefficient values to explain the logistic regression model. However, unlike linear regression, the coefficients are not directly proportional to importance here.

The odds of being in the positive class versus being in the negative class can be expressed as below for a binary classification task.

P(y = 1) / P(y = 0) = P(y = 1) / (1 – P(y = 1))

Remember that we express the probability with the logistic function:

P(y = 1) / (1 – P(y = 1)) = [ 1 / (1 + e^(-z)) ] / [ 1 – (1 / (1 + e^(-z))) ]

P(y = 1) / (1 – P(y = 1)) = [ 1 / (1 + e^(-z)) ] / [ (1 + e^(-z) – 1) / (1 + e^(-z)) ] = 1 / e^(-z) = e^z

Let’s put the z term into the equation:

P(y = 1) / P(y = 0) = e^(w0 + w1x1+ w2x2+ w3x3 + w4x4)

BTW, we call the left side of this equation odds.
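We can verify this relation numerically with the fitted model: decision_function returns the linear term z, and the odds computed from predict_proba should match e^z. This is just a sanity-check sketch.

import math

idx = 99
x = df.iloc[idx][feature_names].values.reshape(1, -1)

z = model.decision_function(x)[0]   # w0 + w1x1 + w2x2 + w3x3 + w4x4
p = model.predict_proba(x)[0][1]    # P(y = 1)

print(p / (1 - p))   # odds computed from the probability
print(math.exp(z))   # e^z -- should print the same value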

Let’s focus on a specific feature, e.g. x3. What happens to the prediction when we change x3 by 1 unit, i.e. replace x3 with (x3 + 1)? This is very similar to the definition of a derivative.

odd(x3 -> x3+1) / odd = e^(w0 + w1x1+ w2x2+ w3(x3+1) + w4x4) / e^(w0 + w1x1+ w2x2+ w3x3 + w4x4)

Remember that e^a / e^b = e^(a-b). I will apply this rule to the equation above.

odd(x3 -> x3+1) / odd = e^(w0 + w1x1+ w2x2+ w3(x3+1) + w4x4 – (w0 + w1x1+ w2x2+ w3x3 + w4x4))

odd(x3 -> x3+1) / odd = e^(w0 + w1x1+ w2x2+ w3(x3+1) + w4x4 – w0 – w1x1– w2x2 – w3x3 – w4x4)

odd(x3 -> x3+1) / odd = e^(w3(x3+1) – w3x3) = e^(w3x3+w3 – w3x3)

odd(x3 -> x3+1) / odd = e^w3

So, if we increase the x3 feature by one unit, the odds change by a factor of e to the power of its weight. We can apply this rule to all of the weights to find the feature importance.
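As a sanity check, the following sketch increases petal length (x3) by one unit for a single instance and compares the resulting odds ratio against e^w3 from the fitted model.

import math

def odds(instance):
    p = model.predict_proba(instance.reshape(1, -1))[0][1]
    return p / (1 - p)

x = df.iloc[99][feature_names].values.astype(float)
x_plus = x.copy()
x_plus[2] += 1   # increase x3 (petal length) by one unit

print(odds(x_plus) / odds(x))   # odds ratio after the one-unit increase
print(math.exp(w3))             # e^w3 -- should match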

Feature importance

So, we will raise the Euler number e to the power of each coefficient to find the importance.

import matplotlib.pyplot as plt

feature_importance = pd.DataFrame(feature_names, columns = ["feature"])
feature_importance["importance"] = pow(math.e, w)
feature_importance = feature_importance.sort_values(by = ["importance"], ascending=False)

ax = feature_importance.plot.barh(x='feature', y='importance')
plt.show()
Feature importance

To sum up, the strongest feature in iris data set is petal width. An increase of the petal width feature by one unit increases the odds of being versicolor class by a factor of 4.90 when all other features remain the same.
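For reference, raising e to the coefficients reported above gives roughly the following importance values (rounded; your exact numbers may differ slightly depending on the sklearn version):

import math

# coefficients reported in the equation above
reported_coefficients = {
    "sepal_length": 0.444753,
    "sepal_width": -1.371312,
    "petal_length": 1.544792,
    "petal_width": 1.590001,
}

for name, w_i in reported_coefficients.items():
    print(name, round(math.exp(w_i), 2))

# expected output (approximately):
# sepal_length 1.56
# sepal_width 0.25
# petal_length 4.69
# petal_width 4.9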

Conclusion

So, we’ve mentioned how to explain built logistic regression models in this post. Even though its equation is very similar to linear regression, we relate the weights to importance as powers of the number e rather than using them directly.

Special thanks to Christoph Molnar, the author of the book Interpretable Machine Learning: A Guide for Making Black Box Models Explainable, for helping me understand this calculation.

I pushed the source code of this study to GitHub. You can support this study if you star⭐️ the repo.

Bonus

If you like this content, feature importance in decision trees might attract your attention, too. Unlike linear regression and logistic regression, decision trees compute it in their own way. I summarized the feature importance in decision trees topic in the following video.


Support this blog if you like it!

Buy me a coffee


3 Comments

  1. Hi! I have a doubt about interpretability and feature importance. In this case, positive values of w_n tend to classify as versicolor (because it is the positive target), and negative values of w_n tend to classify as setosa (because it is the negative target). So petal width is the strongest feature for classifying versicolor because it has the most positive w_n value, and sepal_width is the strongest feature for classifying setosa because it has the most negative w_n value. Then the feature importance order depends on which number we assign to each class, and this does not seem right. Is that correct? Thanks for the great article!

Comments are closed.