This post introduced the idea behind Permutation Importance. Here, we will work through an example to further illustrate why permutation importance can give us a measure of feature importance.
Example Dataset
We'll construct a toy example where one of our features (x1) has a strong, linear relationship with our outcome variable. The other feature (x2) has no relationship.
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
x1 = np.random.random(size=100)
x2 = np.random.random(size=100)
y = 2 * x1 + np.random.normal(scale=0.01, size=100)
df = pd.DataFrame({
'x1': x1,
'x2': x2,
'y': y
})
df.sort_values('x1', inplace=True)
df
fig, (ax1, ax2) = plt.subplots(figsize=(10, 6), nrows=1, ncols=2)
df.plot(kind='scatter', x='x1', y='y', ax=ax1)
df.plot(kind='scatter', x='x2', y='y', ax=ax2)
plt.show()
Permuting x1
To determine the Permutation Importance, we shuffle one column at a time, and see what impact that has on our ability to predict our target variable.
In this case, we would expect shuffling x1 to have a large impact because, after permuting the data, x1 no longer has any predictive power.
df_shuffled = df.copy()
df_shuffled['x1'] = np.random.permutation(df['x1'])
df_shuffled
Instead of a nice line, we now just have a blob, which is expected because we just randomly shuffled the data.
df_shuffled.plot(x='x1', y='y', kind='scatter')
plt.show()
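The "blob" can also be checked numerically. The following is a minimal sketch, not part of the original notebook: it rebuilds x1 and y with a fixed seed (an assumption, added only for reproducibility) and compares the Pearson correlation between x1 and y before and after shuffling.

```python
import numpy as np

# Fixed seed is an assumption for reproducibility; the original data are unseeded.
rng = np.random.default_rng(0)

# Rebuild the toy relationship: y depends linearly on x1 plus a little noise.
x1 = rng.random(100)
y = 2 * x1 + rng.normal(scale=0.01, size=100)

# Pearson correlation between x1 and y before shuffling.
corr_before = np.corrcoef(x1, y)[0, 1]

# Shuffle x1 independently of y, which breaks the pairing between the two.
x1_shuffled = rng.permutation(x1)
corr_after = np.corrcoef(x1_shuffled, y)[0, 1]

print(f"correlation before shuffling: {corr_before:.3f}")
print(f"correlation after shuffling:  {corr_after:.3f}")
```

The shuffled column contains exactly the same values as before, so its marginal distribution is unchanged. Only its relationship to y is destroyed, which is what lets permutation isolate the contribution of a single feature.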
Train a Model
To calculate the Permutation Importance, we must first have a trained model (BEFORE we do the shuffling). Below, we see that our model has an R^2 of 99.7%, which makes sense because, based on the plot of x1 vs y, there is a strong, linear relationship between the two. (RandomForestRegressor is overkill in this particular case since a linear model would have worked just as well.)
# construct training and validation datasets
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
# create datasets
X = df[['x1', 'x2']]
y = df['y']
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)
# train the model
model = RandomForestRegressor()
model.fit(X_train, y_train)
# make predictions on the validation set
predictions = model.predict(X_val)
# evaluate r2 (r2_score expects the true values first, then the predictions)
r2 = r2_score(y_val, predictions)
print(f'R^2: {r2}')
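With a trained model in hand, the shuffle-and-score idea can be sketched by hand. The block below is an illustrative sketch rather than any library's implementation: the helper name `manual_permutation_importance`, the fixed seeds, and the choice of 10 repeats are all assumptions made for this example. It rebuilds the toy data so that it runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)  # fixed seed: an assumption for reproducibility

# Rebuild the toy data: y depends on x1 only.
x1 = rng.random(100)
x2 = rng.random(100)
y = 2 * x1 + rng.normal(scale=0.01, size=100)
X = pd.DataFrame({'x1': x1, 'x2': x2})

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Train once, BEFORE any shuffling.
model = RandomForestRegressor(random_state=0)
model.fit(X_train, y_train)

# Baseline score on the untouched validation set.
baseline = r2_score(y_val, model.predict(X_val))


def manual_permutation_importance(model, X_val, y_val, baseline, rng, n_repeats=10):
    """Hypothetical helper: mean drop in R^2 when each column is shuffled."""
    importances = {}
    for col in X_val.columns:
        drops = []
        for _ in range(n_repeats):
            X_perm = X_val.copy()
            # Shuffle a single column, leaving the others intact.
            X_perm[col] = rng.permutation(X_perm[col].to_numpy())
            drops.append(baseline - r2_score(y_val, model.predict(X_perm)))
        importances[col] = float(np.mean(drops))
    return importances


importances = manual_permutation_importance(model, X_val, y_val, baseline, rng)
print(f"baseline R^2: {baseline:.3f}")
print(importances)
```

The model is never retrained. Only the validation inputs change, so a large drop in R^2 for a column means the model was relying on that column to make its predictions.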
Identify Important Features
Since we have a trained model, we can use eli5 to evaluate the Permutation Importance.
import eli5
from eli5.sklearn import PermutationImportance
perm = PermutationImportance(model, scoring='r2').fit(X_val, y_val)
eli5.show_weights(perm, feature_names=X_val.columns.tolist())
As expected, x1 comes out as the most important feature.
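For readers who do not have eli5 installed, scikit-learn (version 0.22 and later) provides a comparable utility, `sklearn.inspection.permutation_importance`. The sketch below is an alternative to the eli5 call, not a reproduction of it. The fixed seeds and `n_repeats=10` are assumptions chosen for this example, and the exact numbers will differ from the eli5 output.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)  # fixed seed: an assumption for reproducibility

# Column 0 plays the role of x1 (predictive), column 1 the role of x2 (noise).
X = rng.random((100, 2))
y = 2 * X[:, 0] + rng.normal(scale=0.01, size=100)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0
)
model = RandomForestRegressor(random_state=0).fit(X_train, y_train)

# Shuffle each column n_repeats times and record the drop in R^2.
result = permutation_importance(
    model, X_val, y_val, scoring='r2', n_repeats=10, random_state=0
)

for name, mean, std in zip(['x1', 'x2'], result.importances_mean, result.importances_std):
    print(f"{name}: {mean:.3f} +/- {std:.3f}")
```

Here too, x1 should receive a large importance and x2 an importance close to zero.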