I found SHAP to be a very helpful and interesting tool, and this post is mainly my understanding of the two papers about SHAP by Scott M. Lundberg. The first paper, A Unified Approach to Interpreting Model Predictions, covers the general framework behind SHAP; the second, Consistent Individualized Feature Attribution for Tree Ensembles, focuses on TreeExplainer. If you find my understanding incorrect, please contact me by email; I would be very happy to discuss it with you.
A Unified Approach to Interpreting Model Predictions
This is the first paper about SHAP. It provides a unified framework, additive feature attribution methods, that covers several existing explanation methods, and it justifies the theoretical validity of SHAP as the unique solution satisfying three properties: local accuracy, missingness, and consistency.
Additive Feature Attribution Methods
A simple model can serve as its own explanation. But complex models, such as deep learning models and tree-based ensembles, are hard to understand directly, so the notion of an explanation model comes into play. An explanation model is defined as any interpretable approximation of the original model.
Notations
- \(f\) is the original model, \(x\) is a single input, a feature vector. For example, a two-dimensional input, \(x = (3.5, 4.6)\).
- \(g\) is the explanation model
- \(x^{\prime}\) is a simplified input, a binary vector. \(h_{x}(x^{\prime})\) is a mapping function which maps a binary pattern of missing features, represented by \(x^{\prime}\), back to the original input space. Note that \(h_{x}\) is specific to the current input \(x\). For example, if the first feature is missing, then \(x^{\prime} = (0, 1)\); if neither feature is missing, then \(x^{\prime} = (1, 1)\). This notion is introduced to handle missing features. In the usual case when all features are available, \(x^{\prime} = (1, 1)\), and the mapping function \(h_x\) maps it back to \(x\). When there are missing features, for example \(x^{\prime} = (1, 0)\), the first element is mapped to the original \(x_1\), while the second element might be mapped to 0, to some reference value, or to the average of neighboring values, depending on the choice of \(h_x\). (A small code sketch of one such choice of \(h_x\) appears below.)
The goal of the explanation model is to ensure \(g(z^{\prime}) \approx f(h_x(z^\prime))\).
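As a concrete illustration, here is a minimal sketch of one possible \(h_x\): it fills in "missing" features with values from a reference (background) point. The function names and the reference-point strategy are my own choices for illustration, not notation from the paper.

```python
import numpy as np

def make_h_x(x, x_ref):
    """Build a mapping h_x for the current input x.

    x     : the original input we want to explain
    x_ref : a reference (background) point used to fill in "missing" features
    Returns a function that maps a binary vector z_prime (1 = feature present,
    0 = feature missing) to a point in the original feature space.
    """
    x, x_ref = np.asarray(x, dtype=float), np.asarray(x_ref, dtype=float)

    def h_x(z_prime):
        z_prime = np.asarray(z_prime)
        # Keep x_i where the feature is present, otherwise use the reference value.
        return np.where(z_prime == 1, x, x_ref)

    return h_x

h_x = make_h_x(x=[3.5, 4.6], x_ref=[0.0, 0.0])
print(h_x([1, 1]))  # [3.5 4.6] -- all features present, recovers x
print(h_x([1, 0]))  # [3.5 0. ] -- second feature "missing", replaced by the reference
```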
Definition
Additive feature attribution methods have an explanation model with the following format:
\[g(z^{\prime}) = \phi_0 + \sum_{i=1}^M \phi_iz_i^\prime,\]where \(z^\prime \in \{0, 1\}^M\), \(M\) is the number of simplified input features, and \(\phi_i \in \mathbb{R}\).
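In code, the explanation model is simply the base value plus a dot product between the attribution vector and the binary simplified input. A minimal sketch (the variable names are mine):

```python
import numpy as np

def additive_explanation(phi_0, phi, z_prime):
    """Evaluate g(z') = phi_0 + sum_i phi_i * z'_i.

    phi_0   : base value (expected model output)
    phi     : attribution value for each of the M simplified features
    z_prime : binary vector in {0, 1}^M indicating which features are present
    """
    return phi_0 + np.dot(phi, z_prime)

# With all features present, g should approximate f(x) (local accuracy).
print(additive_explanation(phi_0=0.5, phi=np.array([1.2, -0.3]), z_prime=np.array([1, 1])))
```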
Shapley regression values
Shapley regression values are used to evaluate feature importances of linear models in the presence of multicollinearity. The idea is as follows: suppose we want the feature importance of the \(i\)th feature. Let \(F\) denote the set of all features. Then for every possible subset \(S \subseteq F \backslash \{i\}\), retrain two models, \(f_{S \cup \{i\}}\) and \(f_S\), and compute the difference in their fitted values on the current input, namely \(f_{S \cup \{i\}}(x_{S \cup \{i\}}) - f_S(x_S)\). Finally, take a weighted average of these differences over all possible \(S\) as the feature importance value.
\[\phi_i = \sum_{S \subseteq F \backslash \{i\}} \frac{|S|! (|F| - |S| - 1)!}{|F|!} [f_{S \cup \{i\}}(x_{S \cup \{i\}}) - f_S(x_S)],\]where \(\frac{|S|! (|F| - |S| - 1)!}{|F|!} = \frac{1}{|F|} \cdot \frac{1}{\binom{|F|-1}{|S|}}\). The factor \(\frac{1}{|F|}\) appears because there are \(|F|\) features in total, and the weights sum to one over all subsets, which keeps the feature importance values on the same scale as the fitted values.
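Below is a brute-force sketch of this formula that retrains a small linear regression for every subset \(S\). The helper name and the synthetic data are my own, and this approach is only practical for a handful of features, since the number of subsets grows exponentially with \(|F|\).

```python
import numpy as np
from itertools import combinations
from math import factorial
from sklearn.linear_model import LinearRegression

def shapley_regression_value(X, y, x, i):
    """Brute-force Shapley regression value of feature i for a single input x."""
    F = list(range(X.shape[1]))
    others = [j for j in F if j != i]
    phi_i = 0.0
    for size in range(len(others) + 1):
        for S in combinations(others, size):
            weight = factorial(len(S)) * factorial(len(F) - len(S) - 1) / factorial(len(F))
            # Model retrained with feature i included.
            cols_with = list(S) + [i]
            f_with = LinearRegression().fit(X[:, cols_with], y).predict(x[cols_with].reshape(1, -1))[0]
            # Model retrained without feature i; the empty subset just predicts the mean of y.
            f_without = (LinearRegression().fit(X[:, list(S)], y).predict(x[list(S)].reshape(1, -1))[0]
                         if S else y.mean())
            phi_i += weight * (f_with - f_without)
    return phi_i

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2 * X[:, 0] + X[:, 1] + rng.normal(scale=0.1, size=200)
x = X[0]
print([shapley_regression_value(X, y, x, i) for i in range(3)])
```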
A good property of Shapley regression values is that they are the unique solution satisfying three desirable properties: local accuracy, missingness, and consistency.
Three properties
Local accuracy
Suppose we want to explain the prediction at a point \(x\). We then build an explanation model for this \(x\), denoted \(g_x\). Local accuracy requires that the output of the explanation model matches the original model at least at the point \(x\); it is fine if the two models do not match at other points. Local accuracy is also critical for consistency: if local accuracy is not satisfied, then a larger difference in fitted values does not necessarily mean a larger impact.
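In the paper's notation, local accuracy requires
\[f(x) = g(x^\prime) = \phi_0 + \sum_{i=1}^M \phi_i x_i^\prime,\]where \(x = h_x(x^\prime)\).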
Consider a similar example. Suppose we want to use a straight line to approximate a quadratic function and we care about the point \(x\); then we would like the line and the quadratic function to have the same value at \(x\) for a good local approximation.
Note that SHAP is actually an individualized interpretation tool, so it makes sense to consider local accuracy.
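Local accuracy can also be checked numerically: for each prediction, the SHAP values plus the base value should reproduce the model output. Here is a minimal sketch using the shap library with an XGBoost regressor; the data is synthetic and only for illustration.

```python
import numpy as np
import shap
import xgboost as xgb

# Toy data and model, just to have something to explain.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = 2 * X[:, 0] + X[:, 1] ** 2 + rng.normal(scale=0.1, size=500)
model = xgb.XGBRegressor(n_estimators=100, max_depth=3).fit(X, y)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)  # shape: (n_samples, n_features)

# Local accuracy: base value + sum of per-feature attributions == model prediction.
reconstructed = explainer.expected_value + shap_values.sum(axis=1)
print(np.allclose(reconstructed, model.predict(X), atol=1e-3))  # expected: True
```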
Missingness
When one feature is missing, the corresponding feature attribution value is 0.
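In the paper's notation, missingness requires
\[x_i^\prime = 0 \implies \phi_i = 0.\]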
Consistency
If the model changes so that a feature's contribution increases or stays the same (regardless of the other features), then that feature's attribution value should not decrease. A method that assigns feature attribution values in this way is said to be consistent.
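Formally, the paper states consistency as follows. Let \(f_x(z^\prime) = f(h_x(z^\prime))\) and let \(z^\prime \backslash i\) denote setting \(z_i^\prime = 0\). For any two models \(f\) and \(f^\prime\), if
\[f_x^\prime(z^\prime) - f_x^\prime(z^\prime \backslash i) \geq f_x(z^\prime) - f_x(z^\prime \backslash i)\]for all \(z^\prime \in \{0, 1\}^M\), then \(\phi_i(f^\prime, x) \geq \phi_i(f, x)\).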
Consistent Individualized Feature Attribution for Tree Ensembles
Abstract
Main problems of previous tree ensemble feature attribution methods:
- heuristic
- not individualized for each prediction (why stress individualized? If we only consider explanation, then yes, individualized is better; but when we try to use the feature attribution values for feature selection, maybe global is better.)
- inconsistent: the feature attribution value decreases when the impact of the feature actually increases (in terms of human intuition? How do we determine the actual impact?).
What are the advantages of SHAP?
- consistent
- locally accurate (need to talk more about this later)
- rich visualization
- better agreement with human intuition
- fast run time
- clustering performance
- better identification of influential features (how to define influential features?)
1 Introduction
Importance values
- individualized. Computed for a single prediction.
- global. Computed for the entire dataset to explain a model’s overall behaviour.
Inconsistency issue
Why is consistency an issue? When a feature attribution method is inconsistent, its values cannot reliably represent feature importance, and in that case they are of little use.
The author thinks inconsistency is a big issue. Indeed, when the goal is to rank features, inconsistency will give us untrustworthy results. But when the goal is to select features, there are four possible cases:
- inconsistency happens between two similarly important features;
- inconsistency happens also between two similarly un-important features;
- inconsistency happens between two similarly medium-important features (close to the border between selecting and dropping);
- inconsistency could happen between any two features.
Cases 1 and 2 would be fine, since the two inconsistent features will be selected or dropped together. But under cases 3 and 4, there might be issues. Lundberg used a very simple example, from which we cannot tell which case it falls into. (Some further study may be needed.)
Contribution
SHAP is theoretically optimal. It is a model-agnostic method. (Is this the same notion as distribution-free?)
This paper derives an algorithm that can compute SHAP values very quickly, and it proposes SHAP dependence plots and SHAP summary plots.
2 Inconsistency in current feature attribution methods
Global feature importance values:
- Gain: the total reduction of loss or impurity contributed by all splits for a given feature. It is largely heuristic.
- Split Count: simply count how many times a feature is used to split. (used in XGBoost? How about LGBM?)
- Permutation: randomly permute the values of a feature in the test set and observe the change in the model's error. The idea is: if a certain feature is important, then such a permutation should largely increase the model's error. (Looks similar to that of LOCO.)
- Mean Magnitude of SHAP: average the absolute SHAP values for each feature over the whole dataset.
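To make the comparison concrete, here is a minimal sketch that computes all four global importance measures for an XGBoost model. The synthetic data is my own; the importance_type strings are XGBoost's ("gain" for Gain and "weight" for Split Count).

```python
import numpy as np
import shap
import xgboost as xgb
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data, only for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = 3 * X[:, 0] + X[:, 1] ** 2 + rng.normal(scale=0.1, size=1000)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = xgb.XGBRegressor(n_estimators=200, max_depth=3).fit(X_train, y_train)
booster = model.get_booster()

# 1. Gain: total loss reduction contributed by each feature's splits.
gain = booster.get_score(importance_type="gain")

# 2. Split count: how many times each feature is used to split.
split_count = booster.get_score(importance_type="weight")

# 3. Permutation importance: drop in the model's test score after shuffling a feature.
perm = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)

# 4. Mean magnitude of SHAP: average absolute SHAP value per feature.
shap_values = shap.TreeExplainer(model).shap_values(X_test)
mean_abs_shap = np.abs(shap_values).mean(axis=0)

print(gain, split_count, perm.importances_mean, mean_abs_shap, sep="\n")
```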