Merge remote-tracking branch 'refs/remotes/origin/master'
christophM committed Nov 11, 2021
2 parents e7d12ed + 2f4afd6 commit a5f0a6b
Showing 2 changed files with 55 additions and 36 deletions.
10 changes: 5 additions & 5 deletions manuscript/04.5-interpretable-tree.Rmd
@@ -29,7 +29,7 @@ dat_sim = data.frame(feature_x1 = rep(c(3,3,4,4), times = n), feature_x2 = rep(c
dat_sim = dat_sim[sample(1:nrow(dat_sim), size = 0.9 * nrow(dat_sim)), ]
dat_sim$y = dat_sim$y + rnorm(nrow(dat_sim), sd = 0.2)
ct = ctree(y ~ feature_x1 + feature_x2, dat_sim)
plot(ct, inner_panel = node_inner(ct, pval = FALSE, id = FALSE),
terminal_panel = node_boxplot(ct, id = FALSE))
```

@@ -118,7 +118,7 @@ The feature importance measure shows that the time trend is far more important t
imp = round(100 * x$variable.importance / sum(x$variable.importance),0)
imp.df = data.frame(feature = names(imp), importance = imp)
imp.df$feature = factor(imp.df$feature, levels = as.character(imp.df$feature)[order(imp.df$importance)])
ggplot(imp.df) + geom_point(aes(x = importance, y = feature)) +
scale_y_discrete("")
```

@@ -140,8 +140,8 @@ A tree with a depth of three requires a maximum of three features and split poin
The truthfulness of the prediction depends on the predictive performance of the tree.
The explanations for short trees are very simple and general, because for each split the instance falls into either one or the other leaf, and binary decisions are easy to understand.

There is no need to transform features.
In linear models, it is sometimes necessary to take the logarithm of a feature.
A decision tree works equally well with any monotonic transformation of a feature.
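
A small illustration of this point (a hypothetical simulation, not one of the chapter's examples): fitting a tree once to a feature and once to its logarithm changes the split thresholds, but not which observations end up in which leaf, and therefore not the predictions.

```r
# Hypothetical example: the same partition emerges for x and log(x).
library(rpart)
set.seed(1)
x = runif(200, min = 1, max = 100)
y = ifelse(x > 30, 5, 1) + rnorm(200, sd = 0.3)
tree_raw = rpart(y ~ x, data = data.frame(x = x, y = y))
tree_log = rpart(y ~ x, data = data.frame(x = log(x), y = y))
# The split thresholds differ (around 30 vs. around log(30)), but both trees
# separate the same observations and produce the same predictions.
all.equal(as.numeric(predict(tree_raw)), as.numeric(predict(tree_log)))
```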


@@ -173,7 +173,7 @@ Decision trees are very interpretable -- as long as they are short.
**The number of terminal nodes increases quickly with depth.**
The more terminal nodes and the deeper the tree, the more difficult it becomes to understand the decision rules of a tree.
A depth of 1 means 2 terminal nodes.
Depth of 2 means max. 4 nodes.
Depth of 3 means max. 8 nodes.
The maximum number of terminal nodes in a tree is 2 to the power of the depth.
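A tree of depth 10, for example, can already have up to $2^{10}=1024$ terminal nodes.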

81 changes: 50 additions & 31 deletions manuscript/05.6-agnostic-permfeatimp.Rmd
@@ -7,16 +7,16 @@ set.seed(42)

## Permutation Feature Importance {#feature-importance}

Permutation feature importance measures the increase in the prediction error of the model after we permute the feature's values, which breaks the relationship between the feature and the true outcome.

### Theory

The concept is really straightforward:
We measure the importance of a feature by calculating the increase in the model's prediction error after permuting the feature.
A feature is "important" if shuffling its values increases the model error, because in this case the model relied on the feature for the prediction.
A feature is "unimportant" if shuffling its values leaves the model error unchanged, because in this case the model ignored the feature for the prediction.
The permutation feature importance measurement was introduced by Breiman (2001)[^Breiman2001] for random forests.
Based on this idea, Fisher, Rudin, and Dominici (2018)[^Fisher2018] proposed a model-agnostic version of the feature importance and called it model reliance.
They also introduced more advanced ideas about feature importance, for example a (model-specific) version that takes into account that many prediction models may predict the data well.
Their paper is worth reading.

@@ -31,10 +31,10 @@ Input: Trained model $\hat{f}$, feature matrix $X$, target vector $y$, error mea
- Calculate permutation feature importance as quotient $FI_j= e_{perm}/e_{orig}$ or difference $FI_j = e_{perm}- e_{orig}$
3. Sort features by descending FI.
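
Below is a minimal R sketch of this algorithm (not the `iml` implementation used for the examples later in this chapter). It assumes a fitted `model` with a `predict` method, a data.frame `X` of features and a vector `y` with the true outcome.

```r
# Minimal sketch of permutation feature importance with MAE as the error measure.
permutation_importance = function(model, X, y,
                                  loss = function(y, p) mean(abs(y - p))) {
  e_orig = loss(y, predict(model, newdata = X))      # original model error
  fi = sapply(names(X), function(j) {
    X_perm = X
    X_perm[[j]] = sample(X_perm[[j]])                # permute feature j, breaking its link to y
    e_perm = loss(y, predict(model, newdata = X_perm))
    e_perm / e_orig                                  # ratio version; use e_perm - e_orig for the difference
  })
  sort(fi, decreasing = TRUE)                        # sort features by descending importance
}
```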

Fisher, Rudin, and Dominici (2018) suggest in their paper to split the dataset in half and swap the values of feature j of the two halves instead of permuting feature j.
This is exactly the same as permuting feature j, if you think about it.
If you want a more accurate estimate, you can estimate the error of permuting feature j by pairing each instance with the value of feature j of each other instance (except with itself).
This gives you a dataset of size `n(n-1)` to estimate the permutation error, and it takes a large amount of computation time.
I can only recommend using the `n(n-1)`-method if you are serious about getting extremely accurate estimates.
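
A sketch of this exact variant, again only for illustration and under the same assumptions as the sketch above (a `model` with a `predict` method, data.frame `X`, outcome `y`):

```r
# Error estimate for feature j based on all n(n-1) pairings of instances.
perm_error_exact = function(model, X, y, j,
                            loss = function(y, p) mean(abs(y - p))) {
  n = nrow(X)
  errs = sapply(1:n, function(k) {
    X_swap = X
    X_swap[[j]] = X[[j]][k]    # give every instance the feature value of instance k
    # leave out instance k itself, so it is never paired with its own value
    loss(y[-k], predict(model, newdata = X_swap[-k, , drop = FALSE]))
  })
  mean(errs)                   # average over all n(n-1) pairings
}
```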


@@ -66,7 +66,7 @@ perf2 = mlr::performance(pred2, measures = list(mlr::mae))
```


*tl;dr: You should probably use test data.*

Answering the question about training or test data touches the fundamental question of what feature importance is.
The best way to understand the difference between feature importance based on training vs. based on test data is an "extreme" example.
@@ -94,18 +94,18 @@ imp2$results$dat.type = "Test data"
imp.dat = rbind(imp$results, imp2$results)
ggplot(imp.dat) + geom_boxplot(aes(x = dat.type, y = importance)) +
scale_y_continuous("Feature importance of all features") +
scale_x_discrete("")
```
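
A hypothetical sketch of how the training and test importances (`imp` and `imp2`) could be computed with `iml`; the SVM fitted to purely random data here is only a stand-in for the overfitted "garbage" model of this example, and the data sizes are made up:

```r
library(iml)
library(e1071)
set.seed(42)
# Random features and a random target: any good fit can only be overfitting.
dat = data.frame(matrix(rnorm(500 * 50), ncol = 50), y = rnorm(500))
train = dat[1:400, ]
test  = dat[401:500, ]
mod = svm(y ~ ., data = train)   # flexible model, will fit the noise

pred_train = Predictor$new(mod, data = train[, -51], y = train$y)
pred_test  = Predictor$new(mod, data = test[, -51],  y = test$y)
imp  = FeatureImp$new(pred_train, loss = "mae")  # importance measured on training data
imp2 = FeatureImp$new(pred_test,  loss = "mae")  # importance measured on test data
```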

It is unclear to me which of the two results is more desirable.
So I will try to make a case for both versions.

**The case for test data**

This is a simple case:
Model error estimates based on training data are garbage -> feature importance relies on model error estimates -> feature importance based on training data is garbage.
Really, it is one of the first things you learn in machine learning:
If you measure the model error (or performance) on the same data on which the model was trained, the measurement is usually too optimistic, which means that the model seems to work much better than it does in reality.
And since the permutation feature importance relies on measurements of the model error, we should use unseen test data.
@@ -139,17 +139,15 @@ This means no unused test data is left to compute the feature importance.
You have the same problem when you want to estimate the generalization error of your model.
If you used (nested) cross-validation for the feature importance estimation, you would have the problem that the feature importance is not calculated on the final model with all the data, but on models trained on subsets of the data that might behave differently.

In the end, you need to decide whether you want to know how much the model relies on each feature for making predictions (-> training data) or how much the feature contributes to the performance of the model on unseen data (-> test data).
To the best of my knowledge, there is no research addressing the question of training vs. test data.
It will require more thorough examination than my "garbage-SVM" example.
We need more research and more experience with these tools to gain a better understanding.
However, in the end I recommend using test data for permutation feature importance, because if you are interested in how much the model's predictions are influenced by a feature, you should use other importance measures such as [SHAP importance](#shap-feature-importance) anyway.

Next, we will look at some examples.
I based the importance computation on the training data, because I had to choose one and using the training data needed a few lines less code.

### Example and Interpretation

I show examples for classification and regression.

**Cervical cancer (classification)**

@@ -195,7 +193,7 @@ predictor = Predictor$new(mod, data = bike[-which(names(bike) == "cnt")], y = bi
importance = FeatureImp$new(predictor, loss = 'mae')
imp.dat = importance$results
best = which(imp.dat$importance == max(imp.dat$importance))
worst = which(imp.dat$importance == min(imp.dat$importance))
```


@@ -209,12 +207,12 @@ plot(importance) +

**Nice interpretation**: Feature importance is the increase in model error when the feature's information is destroyed.

Feature importance provides a **highly compressed, global insight** into the model's behavior.

A positive aspect of using the error ratio instead of the error difference is that the feature importance measurements are **comparable across different problems**.

The importance measure automatically **takes into account all interactions** with other features.
By permuting the feature you also destroy the interaction effects with other features.
This means that the permutation feature importance takes into account both the main feature effect and the interaction effects on model performance.
This is also a disadvantage because the importance of the interaction between two features is included in the importance measurements of both features.
This means that the feature importances do not add up to the total drop in performance, but the sum is larger.
@@ -233,26 +231,26 @@ You remove the feature and retrain the model.
The model performance remains the same because another equally good feature gets a non-zero weight and your conclusion would be that the feature was not important.
Another example:
The model is a decision tree and we analyze the importance of the feature that was chosen as the first split.
You remove the feature and retrain the model.
Since another feature is chosen as the first split, the whole tree can be very different, which means that we compare the error rates of (potentially) completely different trees to decide how important that feature is for one of the trees.

### Disadvantages

Permutation feature importance is **linked to the error of the model**.
This is not inherently bad, but in some cases not what you need.
In some cases, you might prefer to know how much the model's output varies for a feature without considering what it means for performance.
For example, you want to find out how robust your model's output is when someone manipulates the features.
In this case, you would not be interested in how much the model performance decreases when a feature is permuted, but how much of the model's output variance is explained by each feature.
Model variance (explained by the features) and feature importance correlate strongly when the model generalizes well (i.e. it does not overfit).

You **need access to the true outcome**.
If someone only provides you with the model and unlabeled data -- but not the true outcome -- you cannot compute the permutation feature importance.

The permutation feature importance depends on shuffling the feature, which adds randomness to the measurement.
When the permutation is repeated, the **results might vary greatly**.
Repeating the permutation and averaging the importance measures over repetitions stabilizes the measure, but increases the time of computation.
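
For example, with the minimal sketch from the theory section (assuming the same hypothetical `model`, `X` and `y`), one could average over several permutations:

```r
# Repeat the permutation 10 times and average; reindex by feature name so that
# the runs line up (the helper sorts its output by importance).
reps = replicate(10, permutation_importance(model, X, y)[names(X)])
rowMeans(reps)   # one stabilized importance value per feature
```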

If features are correlated, the permutation feature importance **can be biased by unrealistic data instances**.
The problem is the same as with [partial dependence plots](#pdp):
The permutation of features produces unlikely data instances when two or more features are correlated.
When they are positively correlated (like height and weight of a person) and I shuffle one of the features, I create new instances that are unlikely or even physically impossible (2 meter person weighing 30 kg for example), yet I use these new instances to measure the importance.
@@ -269,10 +267,10 @@ Now imagine another scenario in which I additionally include the temperature at
The temperature at 9:00 AM does not give me much additional information if I already know the temperature at 8:00 AM.
But having more features is always good, right?
I train a random forest with the two temperature features and the uncorrelated features.
Some of the trees in the random forest pick up the 8:00 AM temperature, others the 9:00 AM temperature, again others both and again others none.
The two temperature features together have a bit more importance than the single temperature feature before, but instead of being at the top of the list of important features, each temperature is now somewhere in the middle.
By introducing a correlated feature, I kicked the most important feature from the top of the importance ladder to mediocrity.
On one hand this is fine, because it simply reflects the behavior of the underlying machine learning model, here the random forest.
The 8:00 AM temperature has simply become less important because the model can now rely on the 9:00 AM measurement as well.
On the other hand, it makes the interpretation of the feature importance considerably more difficult.
Imagine you want to check the features for measurement errors.
@@ -281,15 +279,36 @@ In the first case you would check the temperature, in the second case you would
Even though the importance values might make sense at the level of model behavior, it is confusing if you have correlated features.
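
The following hypothetical simulation (not one of the book's examples) mimics this story with a random forest and its permutation importance: adding a nearly identical copy of the temperature feature splits the importance between the two correlated features.

```r
library(randomForest)
set.seed(42)
n = 1000
temp_8am = rnorm(n)
temp_9am = temp_8am + rnorm(n, sd = 0.1)                # almost a copy of the 8:00 AM value
noise = as.data.frame(matrix(rnorm(n * 5), ncol = 5))   # uncorrelated features
y = temp_8am + rowSums(0.3 * noise) + rnorm(n, sd = 0.5)

rf_single = randomForest(y ~ ., data = cbind(noise, temp_8am, y), importance = TRUE)
rf_both   = randomForest(y ~ ., data = cbind(noise, temp_8am, temp_9am, y), importance = TRUE)

importance(rf_single, type = 1)  # temp_8am should clearly dominate
importance(rf_both, type = 1)    # its importance is now shared with temp_9am
```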


### Alternatives

An algorithm called [PIMP](https://academic.oup.com/bioinformatics/article/26/10/1340/193348) adapts the permutation feature importance algorithm to provide p-values for the importances.
Another loss-based alternative is to omit the feature from the training data, retrain the model and measure the increase in loss.
Permuting a feature and measuring the increase in loss is not the only way to measure the importance of a feature.
The different importance measures can be divided into model-specific and model-agnostic methods.
The Gini importance for random forests or standardized regression coefficients for regression models are examples of model-specific importance measures.

Variance-based measures are a model-agnostic alternative to permutation feature importance.
Variance-based feature importance measures such as Sobol's indices or [functional ANOVA](#decomposition) give higher importance to features that cause high variance in the prediction function.
Also [SHAP importance](#shap) has similarities to a variance-based importance measure.
If changing a feature greatly changes the output, then it is important.
This definition of importance differs from the loss-based definition as in the case of permutation feature importance.
This is evident in cases where a model overfitted.
If a model overfits and uses a feature that is unrelated to the output, then the permutation feature importance would assign an importance of zero because this feature does not contribute to producing correct predictions.
A variance-based importance measure, on the other hand, might assign the feature high importance as the prediction can change a lot when the feature is changed.

A good overview of various importance techniques is provided in the paper by Wei (2015) [^overview-imp].

### Software


The `iml` R package was used for the examples.
The R packages `DALEX` and `vip`, as well as the Python libraries `alibi`, `scikit-learn` and `rfpimp`, also implement model-agnostic permutation feature importance.


[^Breiman2001]: Breiman, Leo. “Random Forests.” Machine Learning 45 (1). Springer: 5-32 (2001).

[^Fisher2018]: Fisher, Aaron, Cynthia Rudin, and Francesca Dominici. “All Models are Wrong, but Many are Useful: Learning a Variable's Importance by Studying an Entire Class of Prediction Models Simultaneously.” http://arxiv.org/abs/1801.01489 (2018).

[^overview-imp]: Wei, Pengfei, Zhenzhou Lu, and Jingwen Song. "Variable importance analysis: a comprehensive review." Reliability Engineering & System Safety 142 (2015): 399-432.
