A Gentle Introduction to Feature Importance in Machine Learning

Machine learning interpretability and explainable AI are among the hottest topics in the data world nowadays. Even though linear regression is overlooked by most machine learning practitioners, the algorithm still provides one of the strongest forms of explainability for data sets. Today, even the most complex models, including deep learning and GBM, can be explained by feeding their input and prediction pairs to a linear regression algorithm, even if that amounts to an overfitting cheat. Besides, feature importance values help data scientists in the feature selection process: we can drop or ignore unimportant features to speed up model training. In this post, we are going to cover how to calculate the feature importance values of a data set with linear regression from scratch.


Vlog

Here, you can either watch the following video or follow this blog post; both cover feature importance for linear regression.


🙋‍♂️ You may consider enrolling in my top-rated machine learning course on Udemy

Decision Trees for Machine Learning

Linear Regression

Remember the basic linear regression formula.

y = β0 + β1X1 + β2X2 + … + βpXp

Here, the x values are input features whereas the beta values are their coefficients. Suppose X1 and y were education time and income respectively; then β1 would quantify the effect of education on income. So, these coefficients will give us an idea about feature importance.
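Before moving to real data, here is a minimal sanity-check sketch with synthetic data (all names are illustrative): we generate a target from known coefficients and confirm that linear regression recovers them.

import numpy as np
from sklearn.linear_model import LinearRegression

# synthetic target: y = 3 + 2*x1 + 5*x2 plus a little noise
rng = np.random.default_rng(42)
X = rng.random((100, 2))
y = 3 + 2 * X[:, 0] + 5 * X[:, 1] + rng.normal(0, 0.01, 100)

model = LinearRegression().fit(X, y)
print(model.intercept_)  # close to 3
print(model.coef_)       # close to [2, 5]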

Let’s build a simple linear regression model for a real world example. We are going to predict house prices based on a few property features. You can find the raw data set here.

import pandas as pd

# load the King County house sales data set
df = pd.read_csv("kc_house_data.csv")
Figure: raw data set

Even though the data set has several features, we will focus on just a few of them, because our intent here is not to develop the best model. The number of bedrooms, the square feet of living area and the built year will be the features, whereas the price will be the target.

x = df[["bedrooms", "sqft_living", "yr_built"]]
y = df["price"]
Figure: final data set

Model

Now, we can build the linear regression model. The scikit-learn implementation of linear regression is very neat. We just need the coefficients β1 to βp and the intercept β0.

from sklearn.linear_model import LinearRegression

# fit an ordinary least squares model on the three features
regressor = LinearRegression()
regressor.fit(x, y)

Accuracy

The fitted regressor provides a predict function. I will call it with the same features used during training. I also have the actual values in the y series, so I can find the mean absolute error of the built model on the training set.

predictions = regressor.predict(x)

# accumulate absolute errors over the training set
mae = 0
for i in range(0, len(predictions)):
    prediction = predictions[i]
    actual = y.iloc[i]

    error = abs(actual - prediction)
    mae = mae + error

# mean absolute error
mae = mae / len(predictions)

The built model has roughly 30% error relative to the mean price. It is not the best model, but it is enough to illustrate the feature importance concept.

mae: 163801.15090700495
mean: 540088.1417665294
mae / mean ratio: 30.328596064198223
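The same figure can be cross-checked with scikit-learn's built-in metric; a minimal vectorized sketch:

from sklearn.metrics import mean_absolute_error

# vectorized equivalent of the loop above
mae = mean_absolute_error(y, predictions)
print(100 * mae / y.mean())  # error as a percentage of the mean price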

Coefficients

In the linear regression equation, β0 was the intercept whereas β1 to βp were the coefficients of the features. I am only interested in the coefficients here because they will give us an idea about feature importance.

intercept = regressor.intercept_

# coefficients indexed by feature name
features = pd.DataFrame(regressor.coef_, x.columns, columns=['coefficient'])
features.head()

So, I have the beta coefficients of all the features fed to the model.

Figure: coefficients of features

Here, we can ignore the signs of the coefficients because a negative sign just states an inversely proportional relation.

features.coefficient = features.coefficient.abs()
Figure: absolute coefficients

It seems that we can sort the coefficients of features as shown below.

coefficient of number of bedrooms > coefficient of year built > coefficient of square feet living area

Comparing coefficients

Could I say that the number of bedrooms is more important than the built year, and the built year is more important than the square feet of living area? The answer is absolutely no!

Let’s focus on the equation of linear regression again.

y = β0 + β1X1 + β2X2 + β3X3

The target y is the house price and its unit is dollars. If the term on the left side has units of dollars, then the right side of the equation must have units of dollars, too. The intercept β0 is a single value and its unit is dollars. Similarly, the unit of the term β1X1 must be dollars. I fed the number of bedrooms to X1, so the unit of β1 must be dollars per bedroom to satisfy the equation. Likewise, the units of β2 and β3 must be dollars per square foot and dollars per year respectively. For example, a hypothetical β1 of 1000 dollars per bedroom multiplied by a 3-bedroom count yields 3000 dollars.

As we learnt in elementary school, we cannot compare magnitudes that have different units. So, raw coefficients alone tell us nothing about feature importance.

Standard deviation

The standard deviation can help us convert the units of the coefficients to the same unit. Remember its formula.

σ = √[ Σ(xi − x_avg)² / (n − 1) ]

Consider applying this formula to the number of bedrooms. The unit of xi, x_avg and their difference is bedrooms. The formula squares that difference, so the unit of the numerator becomes bedrooms squared. The term n is unitless, and dividing bedrooms squared by a unitless term still gives bedrooms squared. Finally, the formula takes the square root, and the unit becomes bedrooms again.

So, the unit of the standard deviation is the same as the unit of the corresponding data.
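To make this concrete, here is a minimal sketch that computes the sample standard deviation of the bedrooms column by hand and compares it with the pandas built-in (pandas .std() divides by n − 1 by default):

import numpy as np

values = df["bedrooms"]

# sample standard deviation with the n - 1 denominator
manual_std = np.sqrt(((values - values.mean()) ** 2).sum() / (len(values) - 1))
print(manual_std, values.std())  # the two results match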

Transforming the units of coefficients

The unit of each coefficient in the linear regression equation was dollars per unit of the corresponding feature. If we multiply each coefficient by the standard deviation of its feature, then all terms have the unit of dollars. So, we can compare these standard deviation times coefficient products.

stdevs = []
for i in x.columns:
    stdev = df[i].std()
    stdevs.append(stdev)

import numpy as np

# multiply each coefficient by the standard deviation of its feature
features["stdev"] = np.array(stdevs)
features["importance"] = features["coefficient"] * features["stdev"]
Figure: feature importance
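Sorting the data frame by the importance column makes the ranking explicit:

features.sort_values(by="importance", ascending=False)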

Let’s compare coefficient and importance columns. We can sort the coefficient values as shown below.

coefficient of number of bedrooms > coefficient of built year > coefficient of square feet living area.

However, the importance values sort in a different order.

importance of square feet living area > importance of built year > importance of number of bedrooms.

We can also scale the importance column to the range [0, 100], relative to the most important feature.

features['importance_normalized'] = 100*features['importance'] / features['importance'].max()
Figure: normalized feature importance

In this way, we can plot the normalized importance values.

import matplotlib.pyplot as plt

# horizontal bar chart of normalized importance values
plt.barh(features.index, features.importance_normalized)
plt.show()
Figure: plotting feature importance percentages

To sum up, comparing raw coefficients to find feature importance would misguide you.

So, we have covered the feature importance concept on a basic linear regression example. Even though we would mostly not use linear regression for daily problems, the algorithm still helps us explain machine learning models and build interpretable ones.

Future work

Finding feature importance in linear regression is easy, but life is mostly non-linear. Herein, decision tree algorithms are naturally explainable non-linear algorithms, and we can extract feature importance from them to explain a model well.

We will cover feature importance in decision trees in a following post.




Comments

  1. Can you please explain how to get feature variable importance when we have categorical variables.

    1. You cannot feed categorical features to the linear regression algorithm directly. That's why you have to apply one-hot encoding to categorical features. On the other hand, decision tree algorithms offer feature importance as well, and it is fine to feed categorical features to those algorithms. Here you can find the feature importance for decision tree algorithms: https://sefiks.com/2020/04/06/feature-importance-in-decision-trees/
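    A minimal one-hot encoding sketch with pandas, using the zipcode column of this data set as an illustrative categorical feature:

    import pandas as pd
    df = pd.read_csv("kc_house_data.csv")

    # expand the categorical zipcode column into binary indicator columns
    encoded = pd.get_dummies(df, columns=["zipcode"])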

  2. Since the normalized importance should be between 0 and 100, I think there is an error in the calculations.
    The normalized feature importance should be 71%, 16%, and 12%.
    In this line of code:
    features['importance_normalized'] = 100*features['importance'] / features['importance'].max()
    you could use .sum() instead of .max()

  3. Hi, thanks for the great post! I am wondering if there are any paper backed up the idea of determining feature importance by multiplying coefficient and std? If so, could you point me to the page for me to have a further reading? Thank you.
