Chris Rinaldi

Permutation Importance


Permutation Importance is a way to better understand which features in your model have the most impact when predicting the target variable. In other words, it is a way to measure feature importance.

Conceptually, it is easy to understand and can be applied to any model. There is also a nice Python package, eli5, that will calculate it for us.

Create Datasets

This isn't the point of this post, but we'll be using the California Housing Dataset. Let's quickly create the datasets we'll need.

In [1]:
import warnings
warnings.filterwarnings('ignore')

import sklearn
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split

def init_data():
    """Fetch the California data and create Pandas DataFrame."""
    raw_data = fetch_california_housing()
    data = pd.DataFrame(raw_data['data'], columns=raw_data['feature_names'])
    data['Price'] = raw_data['target']
    return data

def create_datasets(data, validation_frac=0.2):
    """Create training and validation datasets."""
    y = data['Price']
    X = data.drop(columns=['Price'])
    return train_test_split(X, y, test_size=validation_frac, shuffle=True)

data = init_data()
X_train, X_val, y_train, y_val = create_datasets(data)

print(X_train.shape)
X_train.head()
(16512, 8)
Out[1]:
MedInc HouseAge AveRooms AveBedrms Population AveOccup Latitude Longitude
20307 2.6098 18.0 3.706056 1.016248 3129.0 4.621861 34.15 -119.17
2005 0.7990 25.0 3.645435 1.150743 1343.0 2.851380 36.74 -119.80
8296 4.3889 52.0 4.290064 1.028846 1144.0 1.833333 33.76 -118.14
3476 8.1248 18.0 7.851309 1.021990 3189.0 3.339267 34.32 -118.52
5901 2.4074 24.0 3.333333 1.050401 2522.0 2.888889 34.17 -118.31
In [2]:
y_train.head()
Out[2]:
20307    1.461
2005     0.518
8296     3.780
3476     3.740
5901     2.194
Name: Price, dtype: float64

Train a RandomForest Model

Also not the point of this post, but we need a trained model for which to evaluate permutation importance. In this case, we'll use sklearn's RandomForestRegressor and evaluate its predictions using mean_absolute_error().

In [3]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

def train_model_and_predict(X_train, y_train, X_val):
    model = RandomForestRegressor()
    model.fit(X_train, y_train)
    return model, model.predict(X_val)

baseline_pred = y_train.mean()
model, preds = train_model_and_predict(X_train, y_train, X_val)

# Evaluate MAE of predictions
baseline_mae = mean_absolute_error([baseline_pred] * len(y_val), y_val)
mae = mean_absolute_error(preds, y_val)
print(f'Baseline MAE on validation dataset: {baseline_mae:.3f}.')
print(f'RF MAE on validation dataset: {mae:.3f}')
Baseline MAE on validation dataset: 0.922.
RF MAE on validation dataset: 0.328

Permutation Importance

Permutation Importance allows us to answer the question, "how would our model have performed if it didn't have access to a particular feature?" Conceptually, the most important features will be the ones that have the largest impact when removed from the modeling process.

To actually do this, we will do the following:

  1. Take the trained model
  2. Randomly shuffle 1 column
  3. Create predictions using this new dataset
  4. Calculate the MAE of our predictions
  5. Subtract our original MAE from the new MAE (which should be higher); the difference is that feature's importance

Randomly shuffling the feature has roughly the same effect as removing it, since it completely destroys the feature's relationship with the outcome variable while leaving its distribution (and the rest of the data) untouched.
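To see why, here is a tiny synthetic sketch (not part of the original analysis): shuffling a column keeps exactly the same values, and therefore the same distribution, but breaks its alignment with the target.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 3 * x + rng.normal(size=1000)          # y depends strongly on x
x_shuffled = rng.permutation(x)

print(np.corrcoef(x, y)[0, 1])             # strong correlation (~0.95)
print(np.corrcoef(x_shuffled, y)[0, 1])    # near zero after shuffling
print(np.allclose(np.sort(x), np.sort(x_shuffled)))  # True: same values, same distribution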

In [4]:
import numpy as np

def shuffle_column(df, column):
    """Shuffle a single column of a DataFrame, return a copy."""
    result = df.copy()
    result[column] = np.random.permutation(df[column].values)
    return result


def calc_feature_permutation_mae(model, X, y, feature):
    """Calculate the MAE after permuting a particular feature."""
    X_shuffled = shuffle_column(X, feature)
    preds_shuffled = model.predict(X_shuffled)
    return mean_absolute_error(preds_shuffled, y)


def calc_permutation_importance(model, X, y):
    """Calculate Permutation Importance for all model features."""
    importances = {}
    preds = model.predict(X)
    mae = mean_absolute_error(preds, y)
    for feature in X:
        importances[feature] = calc_feature_permutation_mae(model, X, y, feature) - mae
    return sorted(importances.items(), key=lambda x: x[1], reverse=True)
In [5]:
import matplotlib.pyplot as plt

result = calc_permutation_importance(model, X_val, y_val)
print(result)

features = [x[0] for x in result]
importances = [x[1] for x in result]
plt.figure(figsize=(8, 6))
plt.bar(features, importances)
plt.xticks(rotation=45)
plt.title('Permutation Importance')
plt.ylabel('Change in MAE')
plt.show()
[('MedInc', 0.4656078229006018), ('Latitude', 0.3681427987533668), ('Longitude', 0.313516140948717), ('AveOccup', 0.15820252426438614), ('HouseAge', 0.05195238744685643), ('AveRooms', 0.03337708491198049), ('AveBedrms', 0.009278008643218094), ('Population', 0.006155723338973851)]

We see that for our model, the most important features are:

  • MedInc
  • Latitude
  • Longitude

Permutation Importance found that removing any of the above features degrades the model's performance the most. Conversely, it found we could drop the Population feature and lose almost no predictive power.
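As a quick sanity check (not part of the original post), we could retrain the same kind of model without Population and confirm the validation MAE barely moves. A minimal sketch, reusing the X_train, X_val, y_train and y_val datasets from above:

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# Drop the least important feature and retrain.
reduced_model = RandomForestRegressor()
reduced_model.fit(X_train.drop(columns=['Population']), y_train)
reduced_preds = reduced_model.predict(X_val.drop(columns=['Population']))
print(f'MAE without Population: {mean_absolute_error(y_val, reduced_preds):.3f}')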

ELI5 and Permutation Importance

eli5 is a Python package that makes it simple to calculate permutation importance (amongst other things). If we use neg_mean_absolute_error as our scoring function, we'll see that we get values very similar to the ones we calculated above. It also includes a measure of uncertainty, since it repeats the permutation process multiple times.

In [6]:
import eli5
from eli5.sklearn import PermutationImportance

perm = PermutationImportance(model, scoring='neg_mean_absolute_error').fit(X_val, y_val)
eli5.show_weights(perm, feature_names=X_val.columns.tolist())
Out[6]:
Weight Feature
0.4718 ± 0.0103 MedInc
0.3635 ± 0.0079 Latitude
0.3101 ± 0.0111 Longitude
0.1623 ± 0.0077 AveOccup
0.0566 ± 0.0036 HouseAge
0.0355 ± 0.0021 AveRooms
0.0091 ± 0.0005 AveBedrms
0.0078 ± 0.0022 Population
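That uncertainty comes from repeating the shuffle. We could get a similar mean ± standard deviation out of our own implementation by permuting each feature several times; a rough sketch reusing the helpers defined above (the function name and the n_repeats default are mine, and eli5 may compute its interval slightly differently):

def calc_permutation_importance_repeated(model, X, y, n_repeats=5):
    """Average the MAE increase over several shuffles of each feature."""
    base_mae = mean_absolute_error(model.predict(X), y)
    results = {}
    for feature in X:
        drops = [calc_feature_permutation_mae(model, X, y, feature) - base_mae
                 for _ in range(n_repeats)]
        results[feature] = (np.mean(drops), np.std(drops))
    return sorted(results.items(), key=lambda item: item[1][0], reverse=True)

for feature, (mean_drop, std_drop) in calc_permutation_importance_repeated(model, X_val, y_val):
    print(f'{mean_drop:.4f} ± {std_drop:.4f}  {feature}')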

