A Gentle Introduction to Feature Importance in Machine Learning

Machine learning interpretability and explainable AI are among the hottest topics in the data world nowadays. Even though linear regression is overlooked by most machine learning practitioners, the algorithm still provides one of the strongest forms of explainability for data sets. Today, even the most complex models, including deep learning and GBM, can be explained by feeding their input and prediction pairs to a linear regression algorithm, even if that amounts to an overfitting cheat. Besides, feature importance values help data scientists in the feature selection process: we can drop or ignore unimportant features to speed up model training. In this post, we are going to cover how to calculate the feature importance values of a data set with linear regression from scratch.


Vlog

Here, you can either watch the following video or follow this blog post; both cover feature importance for linear regression.


🙋‍♂️ You may consider enrolling in my top-rated machine learning course on Udemy

Decision Trees for Machine Learning

Linear Regression

Remember the basic linear regression formula.

y = β0 + β1X1 + β2X2 + … + βpXp

Here, the x values are input features whereas the beta values are their coefficients. Suppose X1 and y were education time and income respectively; then β1 would quantify the effect of education on income. So, these coefficients will give us an idea about feature importance.
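Before moving to real data, here is a minimal sanity-check sketch with synthetic data (all names are illustrative): we generate a target from known coefficients and confirm that linear regression recovers them.

import numpy as np
from sklearn.linear_model import LinearRegression

# synthetic target: y = 3 + 2*x1 + 5*x2 plus a little noise
rng = np.random.default_rng(42)
X = rng.random((100, 2))
y = 3 + 2 * X[:, 0] + 5 * X[:, 1] + rng.normal(0, 0.01, 100)

model = LinearRegression().fit(X, y)
print(model.intercept_)  # close to 3
print(model.coef_)       # close to [2, 5]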

Let’s build a simple linear regression model for a real world example. We are going to predict house prices based on a few property features. You can find the raw data set here.

import pandas as pd

# load the King County house sales data set
df = pd.read_csv("kc_house_data.csv")
Figure: raw data set

Even though the data set has several features, we will focus on just a few of them, because our intent here is not to develop the best model. The number of bedrooms, the square feet of living area and the built year will be the features, whereas the price will be the target.

x = df[["bedrooms", "sqft_living", "yr_built"]]
y = df["price"]
Figure: final data set

Model

Now, we can build the linear regression model. The scikit-learn implementation of linear regression is very neat. We just need the coefficients β1 to βp and the intercept β0.

from sklearn.linear_model import LinearRegression

# fit an ordinary least squares model on the three features
regressor = LinearRegression()
regressor.fit(x, y)

Accuracy

The fitted regressor provides a predict function. I will call it with the same features used during training. I also have the actual values in the y series, so I can find the mean absolute error of the built model on the training set.

predictions = regressor.predict(x)

# accumulate absolute errors over the training set
mae = 0
for i in range(0, len(predictions)):
    prediction = predictions[i]
    actual = y.iloc[i]

    error = abs(actual - prediction)
    mae = mae + error

# mean absolute error
mae = mae / len(predictions)

The built model has roughly 30% error relative to the mean price. It is not the best model, but it is enough to illustrate the feature importance concept.

mae: 163801.15090700495
mean: 540088.1417665294
mae / mean ratio: 30.328596064198223
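The same figure can be cross-checked with scikit-learn's built-in metric; a minimal vectorized sketch:

from sklearn.metrics import mean_absolute_error

# vectorized equivalent of the loop above
mae = mean_absolute_error(y, predictions)
print(100 * mae / y.mean())  # error as a percentage of the mean price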

Coefficients

In the linear regression equation, β0 was the intercept whereas β1 to βp were the coefficients of the features. I am only interested in the coefficients here because they will give us an idea about feature importance.

intercept = regressor.intercept_

# coefficients indexed by feature name
features = pd.DataFrame(regressor.coef_, x.columns, columns=['coefficient'])
features.head()

So, I have the beta coefficients of all the features fed to the model.

Figure: coefficients of features

Here, we can ignore the signs of the coefficients because a negative sign just states an inversely proportional relation.

features.coefficient = features.coefficient.abs()
Figure: absolute coefficients

It seems that we can sort the coefficients of features as shown below.

coefficient of number of bedrooms > coefficient of year built > coefficient of square feet living area

Comparing coefficients

Could I say that the number of bedrooms is more important than the built year, and the built year is more important than the square feet of living area? The answer is absolutely no!

Let’s focus on the equation of linear regression again.

y = β0 + β1X1 + β2X2 + β3X3

The target y is the house price and its unit is dollars. If the term on the left side has units of dollars, then the right side of the equation must have units of dollars, too. The intercept β0 is a single value and its unit is dollars. Similarly, the unit of the term β1X1 must be dollars. I fed the number of bedrooms to X1, so the unit of β1 must be dollars per bedroom to satisfy the equation. Likewise, the units of β2 and β3 must be dollars per square foot and dollars per year respectively. For example, a hypothetical β1 of 1000 dollars per bedroom multiplied by a 3-bedroom count yields 3000 dollars.

As we learnt in elementary school, we cannot compare magnitudes that have different units. So, raw coefficients alone tell us nothing about feature importance.

Standard deviation

The standard deviation can help us convert the units of the coefficients to the same unit. Remember its formula.

σ = √[ Σ(xi − x_avg)² / (n − 1) ]

Consider applying this formula to the number of bedrooms. The unit of xi, x_avg and their difference is bedrooms. The formula squares that difference, so the unit of the numerator becomes bedrooms squared. The term n is unitless, and dividing bedrooms squared by a unitless term still gives bedrooms squared. Finally, the formula takes the square root, and the unit becomes bedrooms again.

So, the unit of the standard deviation is the same as the unit of the corresponding data.
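To make this concrete, here is a minimal sketch that computes the sample standard deviation of the bedrooms column by hand and compares it with the pandas built-in (pandas .std() divides by n − 1 by default):

import numpy as np

values = df["bedrooms"]

# sample standard deviation with the n - 1 denominator
manual_std = np.sqrt(((values - values.mean()) ** 2).sum() / (len(values) - 1))
print(manual_std, values.std())  # the two results match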

Transforming the units of coefficients

The unit of each coefficient in the linear regression equation was dollars per unit of the corresponding feature. If we multiply each coefficient by the standard deviation of its feature, then all terms have the unit of dollars. So, we can compare these standard deviation times coefficient products.

stdevs = []
for i in x.columns:
    stdev = df[i].std()
    stdevs.append(stdev)

import numpy as np

# multiply each coefficient by the standard deviation of its feature
features["stdev"] = np.array(stdevs)
features["importance"] = features["coefficient"] * features["stdev"]
Figure: feature importance
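Sorting the data frame by the importance column makes the ranking explicit:

features.sort_values(by="importance", ascending=False)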

Let’s compare coefficient and importance columns. We can sort the coefficient values as shown below.

coefficient of number of bedrooms > coefficient of built year > coefficient of square feet living area.

However, the importance values sort in a different order.

importance of square feet living area > importance of built year > importance of number of bedrooms.

We can also scale the importance column to the range [0, 100], relative to the most important feature.

features['importance_normalized'] = 100*features['importance'] / features['importance'].max()
Figure: normalized feature importance

In this way, we can plot the normalized importance values.

import matplotlib.pyplot as plt

# horizontal bar chart of normalized importance values
plt.barh(features.index, features.importance_normalized)
plt.show()
Figure: plotting feature importance percentages

To sum up, comparing raw coefficients to find feature importance would misguide you.

So, we have covered the feature importance concept on a basic linear regression example. Even though we would mostly not use linear regression for daily problems, the algorithm still helps us explain machine learning models and build interpretable ones.

Future work

Finding feature importance in linear regression is easy, but life is mostly non-linear. Herein, decision tree algorithms are naturally explainable non-linear algorithms, and we can extract feature importance from them to explain a model well.

We will cover feature importance in decision trees in a following post.




Comments

  1. Can you please explain how to get feature variable importance when we have categorical variables.

    1. You cannot feed categorical features to the linear regression algorithm directly. That's why you have to apply one-hot encoding to categorical features. On the other hand, decision tree algorithms offer feature importance as well, and it is fine to feed categorical features to those algorithms. Here you can find the feature importance for decision tree algorithms: https://sefiks.com/2020/04/06/feature-importance-in-decision-trees/
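    A minimal one-hot encoding sketch with pandas, using the zipcode column of this data set as an illustrative categorical feature:

    import pandas as pd
    df = pd.read_csv("kc_house_data.csv")

    # expand the categorical zipcode column into binary indicator columns
    encoded = pd.get_dummies(df, columns=["zipcode"])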

  2. Since the normalized importance should be between 0 and 100, I think there is an error in the calculations.
    The normalized feature importance should be 71%, 16%, and 12%.
    In this line of code:
    features['importance_normalized'] = 100*features['importance'] / features['importance'].max()
    you could use .sum() instead of .max()

  3. Hi, thanks for the great post! I am wondering if there are any paper backed up the idea of determining feature importance by multiplying coefficient and std? If so, could you point me to the page for me to have a further reading? Thank you.
