Permutation Importance is a way to better understand which features in your model have the most impact when predicting the target variable. In other words, it is a way to measure feature importance.
Conceptually, it is easy to understand and can be applied to any model. There is also a nice Python package, eli5, that calculates it.
Create Datasets
This isn't the point of this post, but we'll be using the California Housing Dataset. Let's quickly create the datasets we need.
import warnings
warnings.filterwarnings('ignore')
import sklearn
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
def init_data():
    """Fetch the California data and create Pandas DataFrame."""
    raw_data = fetch_california_housing()
    data = pd.DataFrame(raw_data['data'], columns=raw_data['feature_names'])
    data['Price'] = raw_data['target']
    return data

def create_datasets(data, validation_frac=0.2):
    """Create training and validation datasets."""
    y = data['Price']
    X = data.drop(columns=['Price'])
    return train_test_split(X, y, test_size=validation_frac, shuffle=True)
data = init_data()
X_train, X_val, y_train, y_val = create_datasets(data)
print(X_train.shape)
X_train.head()
y_train.head()
Train a RandomForest Model
Also not the point of this post, but we need a trained model whose permutation importance we can evaluate. In this case, we'll be using sklearn's RandomForestRegressor and evaluating its predictions using mean_absolute_error().
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
def train_model_and_predict(X_train, y_train, X_val):
    """Fit a random forest and return it with its validation predictions."""
    model = RandomForestRegressor()
    model.fit(X_train, y_train)
    return model, model.predict(X_val)
baseline_pred = y_train.mean()
model, preds = train_model_and_predict(X_train, y_train, X_val)
# Evaluate MAE of predictions
baseline_mae = mean_absolute_error(y_val, [baseline_pred] * len(y_val))
mae = mean_absolute_error(y_val, preds)
print(f'Baseline MAE on validation dataset: {baseline_mae:.3f}.')
print(f'RF MAE on validation dataset: {mae:.3f}')
Permutation Importance
Permutation Importance allows us to answer the question, "How would our model have performed if it didn't have access to a particular feature?" Conceptually, the more important features are the ones whose removal from the modeling process has the largest impact.
To actually do this, we will do the following:
- Take the trained model
- Randomly shuffle 1 column
- Create predictions using this new dataset
- Calculate the MAE of our predictions
- Subtract our original MAE from the new MAE (which should be higher)
Randomly shuffling a feature has much the same effect as removing it, since the shuffle destroys the relationship between that feature and the outcome variable of interest while leaving the feature's own distribution unchanged.
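To see why shuffling works as a stand-in for removal, here is a minimal sketch on toy data. The data, the seed, and the variable names are assumptions made up for illustration and are not part of the housing dataset. It checks that a shuffled column keeps exactly the same values while its correlation with the target collapses.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility

# Toy data (hypothetical): the target depends strongly on the feature.
feature = rng.normal(size=1_000)
target = 3.0 * feature + rng.normal(scale=0.5, size=1_000)

# Shuffle the feature, as permutation importance does for one column.
shuffled = rng.permutation(feature)

# The shuffled column holds exactly the same values, in a different order.
same_values = np.array_equal(np.sort(feature), np.sort(shuffled))

# Correlation with the target is strong before shuffling and near zero after.
corr_before = np.corrcoef(feature, target)[0, 1]
corr_after = np.corrcoef(shuffled, target)[0, 1]

print(f"Same values after shuffling: {same_values}")
print(f"Correlation before: {corr_before:.3f}")
print(f"Correlation after:  {corr_after:.3f}")
```

Because the model still receives a column with a realistic range of values, it can produce predictions without being retrained, which is what makes the approach cheap.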
import numpy as np
def shuffle_column(df, column):
    """Shuffle a single column of a DataFrame, return a copy."""
    result = df.copy()
    result[column] = np.random.permutation(df[column].values)
    return result

def calc_feature_permutation_mae(model, X, y, feature):
    """Calculate the MAE after permuting a particular feature."""
    X_shuffled = shuffle_column(X, feature)
    preds_shuffled = model.predict(X_shuffled)
    return mean_absolute_error(y, preds_shuffled)

def calc_permutation_importance(model, X, y):
    """Calculate Permutation Importance for all model features."""
    importances = {}
    preds = model.predict(X)
    mae = mean_absolute_error(y, preds)
    for feature in X:
        # Importance is the increase in MAE caused by shuffling this feature.
        importances[feature] = calc_feature_permutation_mae(model, X, y, feature) - mae
    return sorted(importances.items(), key=lambda x: x[1], reverse=True)
import matplotlib.pyplot as plt
result = calc_permutation_importance(model, X_val, y_val)
print(result)
features = [x[0] for x in result]
importances = [x[1] for x in result]
plt.figure(figsize=(8, 6))
plt.bar(features, importances)
plt.xticks(rotation=45)
plt.title('Permutation Importance')
plt.ylabel('Change in MAE')
plt.show()
We see that for our model, the most important features are:
- MedInc
- Latitude
- Longitude
Permutation Importance found that removing the features above would degrade the model's performance the most. Conversely, it suggests we could drop the Population feature and lose little or no predictive power.
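One way to check a claim like "this feature can be dropped" is to retrain without it and compare the validation error. The sketch below does this on toy data, using a plain least-squares fit as a stand-in for the random forest. The data, the `fit_and_score` helper, and the train/validation split are all hypothetical, so treat it as an illustration of the check rather than a result about the housing data.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
n = 2_000

# Toy data (hypothetical): one informative feature and one unrelated feature.
signal = rng.normal(size=n)
noise = rng.normal(size=n)
y = 2.0 * signal + rng.normal(scale=0.3, size=n)

X_full = np.column_stack([signal, noise])
X_reduced = X_full[:, [0]]  # same data with the unrelated feature dropped
train, val = slice(0, 1_500), slice(1_500, n)

def fit_and_score(X):
    """Fit least squares (with intercept) on the train rows, return validation MAE."""
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A[train], y[train], rcond=None)
    return np.mean(np.abs(A[val] @ coef - y[val]))

mae_full = fit_and_score(X_full)
mae_reduced = fit_and_score(X_reduced)

print(f"Validation MAE with both features:       {mae_full:.3f}")
print(f"Validation MAE without the noise column: {mae_reduced:.3f}")
```

With correlated features the two views can disagree: a feature may show low permutation importance because another feature carries the same information, so the retraining check is worth running before actually dropping anything.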
ELI5 and Permutation Importance
eli5 is a Python package that makes it simple to calculate permutation importance (amongst other things). If we use neg_mean_absolute_error as our scoring function, you'll see that we get values very similar to the ones we calculated above. It also includes a measure of uncertainty, since it repeats the permutation process multiple times.
import eli5
from eli5.sklearn import PermutationImportance
perm = PermutationImportance(model, scoring='neg_mean_absolute_error').fit(X_val, y_val)
eli5.show_weights(perm, feature_names=X_val.columns.tolist())
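The idea of repeating the shuffle to get an uncertainty estimate can be sketched by hand. The code below is a rough illustration of that idea and not eli5's actual implementation. The function name `repeated_permutation_importance`, the `n_repeats` and `seed` parameters, and the toy linear "model" are all assumptions introduced here, and the function expects NumPy arrays rather than a DataFrame.

```python
import numpy as np

def repeated_permutation_importance(predict, X, y, n_repeats=5, seed=0):
    """Return the mean and std of the MAE increase per column over several shuffles."""
    rng = np.random.default_rng(seed)
    base_mae = np.mean(np.abs(predict(X) - y))
    increases = np.empty((X.shape[1], n_repeats))
    for j in range(X.shape[1]):
        for r in range(n_repeats):
            X_perm = X.copy()
            X_perm[:, j] = rng.permutation(X[:, j])  # shuffle only column j
            increases[j, r] = np.mean(np.abs(predict(X_perm) - y)) - base_mae
    return increases.mean(axis=1), increases.std(axis=1)

# Toy setup (hypothetical): a fixed linear "model" that ignores the second column.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 2))
y = 4.0 * X[:, 0] + rng.normal(scale=0.1, size=500)

def predict(data):
    return 4.0 * data[:, 0]

means, stds = repeated_permutation_importance(predict, X, y)
for name, m, s in zip(["used column", "ignored column"], means, stds):
    print(f"{name}: {m:.3f} +/- {s:.3f}")
```

The spread across repeats shows how much of a single-shuffle importance value is down to the luck of one particular permutation, which matters most for features whose importance is close to zero.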