---
title: Shapley, LIME and SHAP
date: 2018-12-02
template: post
comments: true
---

In this post I explain LIME (Ribeiro et al. 2016), the Shapley values (Shapley, 1953) and the SHAP values (Lundberg-Lee, 2017).

Shapley values
--------------

A coalitional game $(v, N)$ of $n$ players involves

- The set $N = \{1, 2, ..., n\}$ that represents the players.
- A function $v: 2^N \to \mathbb R$, where $v(S)$ is the worth of coalition $S \subset N$.

The Shapley values $\phi_i(v)$ of such a game specify a fair way to distribute the total worth $v(N)$ to the players. They are defined as (in the following, for a set $S \subset N$ we use the convention $s = |S|$ for the number of elements of $S$, and the shorthands $S - i := S \setminus \{i\}$ and $S + i := S \cup \{i\}$)

$$\phi_i(v) = \sum_{S: i \in S} {(n - s)! (s - 1)! \over n!} (v(S) - v(S - i)).$$

$\phi_i(v)$ is an expectation:

$$\phi_i(v) = \mathbb E_{S \sim \nu_i} (v(S) - v(S - i))$$

where $\nu_i(S) = n^{-1} {n - 1 \choose s - 1}^{-1} 1_{i \in S}$, that is, first pick the size $s$ uniformly from $\{1, 2, ..., n\}$, then pick $S$ uniformly from the subsets of $N$ that have size $s$ and contain $i$.

The Shapley values satisfy some nice properties which are readily verified, including:

- **Efficiency**. $\sum_i \phi_i(v) = v(N) - v(\emptyset)$.
- **Symmetry**. If for some $i, j \in N$, for all $S \subset N$, we have $v(S + i) = v(S + j)$, then $\phi_i(v) = \phi_j(v)$.
- **Null player**. If for some $i \in N$, for all $S \subset N$, we have $v(S + i) = v(S)$, then $\phi_i(v) = 0$.
- **Linearity**. $\phi_i$ is linear in games. That is $\phi_i(v) + \phi_i(w) = \phi_i(v + w)$, where $v + w$ is defined by $(v + w)(S) := v(S) + w(S)$.

In the literature, an added assumption $v(\emptyset) = 0$ is often given, in which case the Efficiency property reads $\sum_i \phi_i(v) = v(N)$. Here I discard this assumption to avoid minor inconsistencies across different sources. For example, in the LIME paper, the local model is defined without an intercept, even though the underlying $v(\emptyset)$ may not be $0$. In the SHAP paper, an intercept $\phi_0 = v(\emptyset)$ is added, which fixes this problem when making connections to the Shapley values.

Conversely, according to Strumbelj-Kononenko (2010), it was shown in Shapley's original paper (Shapley, 1953) that these four properties together with $v(\emptyset) = 0$ define the Shapley values.

LIME
----

LIME (Ribeiro et al. 2016) is a method that offers a way to explain feature contributions of supervised learning models locally.

Let $f: X_1 \times X_2 \times ... \times X_n \to \mathbb R$ be a function. We can think of $f$ as a model, where $X_j$ is the space of the $j$th feature. For example, in a language model, $X_j$ may be the count of the $j$th word in the vocabulary. The output may be something like housing price, or log-probability of something.

LIME tries to assign a value to each feature *locally*. By locally, we mean that given a specific sample $x \in X := \prod_{i = 1}^n X_i$, we want to fit a model around it.

More specifically, let $h_x: 2^N \to X$ be a function defined by

$$(h_x(S))_i =
\begin{cases}
x_i, & \text{if }i \in S; \\
0, & \text{otherwise.}
\end{cases}$$

That is, $h_x(S)$ masks the features that are not in $S$, or in other words, we are perturbing the sample $x$. In particular, $h_x(N) = x$. Alternatively, the $0$ in the "otherwise" case can be replaced by some kind of default value (see the last section of this post).
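To make the definitions so far concrete, here is a minimal sketch in numpy that computes $\phi_i(v)$ by direct enumeration for the game $v(S) = f(h_x(S))$. The model $f$, the sample $x$ and the choice of $0$ as the mask value are made up purely for illustration.

```python
import itertools
from math import factorial

import numpy as np

# A toy model and sample, made up purely for illustration.
def f(x):
    return 2.0 * x[0] + x[1] * x[2]

x = np.array([1.0, 2.0, 3.0])
n = len(x)

def h(S):
    """h_x(S): keep the features in S, mask the others with 0."""
    z = np.zeros(n)
    z[list(S)] = x[list(S)]
    return z

def v(S):
    """Worth of coalition S: the model evaluated on the masked sample."""
    return f(h(S))

def shapley(v, n):
    """Brute-force Shapley values, enumerating all coalitions containing i."""
    phi = np.zeros(n)
    for i in range(n):
        for size in range(1, n + 1):
            for S in itertools.combinations(range(n), size):
                if i not in S:
                    continue
                weight = factorial(n - size) * factorial(size - 1) / factorial(n)
                phi[i] += weight * (v(S) - v(tuple(j for j in S if j != i)))
    return phi

phi = shapley(v, n)
print(phi)                             # [2. 3. 3.]
print(phi.sum(), v(range(n)) - v(()))  # Efficiency: both equal 8.0
```

For this toy game the result is $[2, 3, 3]$, and the last line checks the Efficiency property $\sum_i \phi_i(v) = v(N) - v(\emptyset) = 8$.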
For a set $S \subset N$, let us denote by $1_S \in \{0, 1\}^n$ the $n$-bit vector whose $k$th bit is $1$ if and only if $k \in S$.

Basically, LIME samples $S_1, S_2, ..., S_m \subset N$ to obtain a set of perturbed samples $h_x(S_i)$ in the $X$ space, and then fits a linear model $g$ using $1_{S_i}$ as the input samples and $f(h_x(S_i))$ as the output samples:

**Problem** (LIME). Find $w = (w_1, w_2, ..., w_n)$ that minimises

$$\sum_i (w \cdot 1_{S_i} - f(h_x(S_i)))^2 \pi_x(h_x(S_i))$$

where $\pi_x(x')$ is a function that penalises $x'$s that are far away from $x$. In the LIME paper the Gaussian kernel was used:

$$\pi_x(x') = \exp\left({- \|x - x'\|^2 \over \sigma^2}\right).$$

Then $w_i$ represents the importance of the $i$th feature.

LIME is a more general framework, but the specific model considered in the paper is the one described above, with a Lasso for feature selection.

Shapley values and LIME
-----------------------

The connection between the Shapley values and LIME is noted in Lundberg-Lee (2017), but the underlying connection goes back to 1988 (Charnes et al.).

To see the connection, we need to modify LIME a bit.

First, we need to make LIME less efficient by considering *all* the $2^n$ subsets instead of the $m$ samples $S_1, S_2, ..., S_m$.

Then we need to relax the definition of $\pi_x$. It no longer needs to penalise samples that are far away from $x$. In fact, we will see later that the choice of $\pi_x(x')$ that yields the Shapley values is high when $x'$ is very close or very far away from $x$, and low otherwise. We further add the restriction that $\pi_x(h_x(S))$ only depends on the size of $S$, thus we rewrite it as $q(s)$ instead.

We also denote $v(S) := f(h_x(S))$ and $w(S) := \sum_{i \in S} w_i$.

Finally, we add the Efficiency property as a constraint: $\sum_{i = 1}^n w_i = f(x) - f(h_x(\emptyset)) = v(N) - v(\emptyset)$.

Then the problem becomes a weighted linear regression:

**Problem**. Minimise $\sum_{S \subset N} (w(S) - v(S))^2 q(s)$ over $w$ subject to $w(N) = v(N) - v(\emptyset)$.

**Claim** (Charnes et al. 1988). The solution to this problem is

$$w_i = {1 \over n} (v(N) - v(\emptyset)) + \left(\sum_{s = 1}^{n - 1} {n - 2 \choose s - 1} q(s)\right)^{-1} \sum_{S \subset N: i \in S} \left({n - s \over n} q(s) v(S) - {s - 1 \over n} q(s - 1) v(S - i)\right). \qquad (-1)$$

Specifically, if we choose

$$q(s) = c {n - 2 \choose s - 1}^{-1}$$

for any constant $c$, then $w_i = \phi_i(v)$ are the Shapley values.

**Remark**. Don't worry about this specific choice of $q(s)$ when $s = 0$ or $n$, because $q(0)$ and $q(n)$ do not appear on the right hand side of (-1). Therefore they can be defined to be of any value. A common convention of the binomial coefficients is to set ${\ell \choose k} = 0$ if $k < 0$ or $k > \ell$, in which case $q(0) = q(n) = \infty$.

In Lundberg-Lee (2017), $c$ is chosen to be $1 / n$, see Theorem 2 there.
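Before the proof, here is a quick numerical sanity check of the Claim, in the same spirit as the earlier sketch. The toy game is made up; the constrained weighted least squares problem is solved through its KKT system, with $c = 1$.

```python
import itertools
from math import comb

import numpy as np

# Same toy game as before: v(S) = f(h_x(S)) with masking at 0.
def f(x):
    return 2.0 * x[0] + x[1] * x[2]

x = np.array([1.0, 2.0, 3.0])
n = len(x)

def v(S):
    z = np.zeros(n)
    z[list(S)] = x[list(S)]
    return f(z)

# All subsets with 1 <= s <= n - 1.  The S = emptyset and S = N terms are constant
# in w once the constraint w(N) = v(N) - v(emptyset) is imposed, so dropping them
# does not change the minimiser (and q(0), q(n) would be infinite anyway).
subsets = [S for size in range(1, n)
           for S in itertools.combinations(range(n), size)]
A = np.array([[1.0 if i in S else 0.0 for i in range(n)] for S in subsets])
b = np.array([v(S) for S in subsets])
q = np.array([1.0 / comb(n - 2, len(S) - 1) for S in subsets])   # q(s) with c = 1

# Minimise sum_S q(s) (w . 1_S - v(S))^2  subject to  sum_i w_i = v(N) - v(emptyset),
# by solving the KKT system of this equality-constrained least squares problem.
Q = np.diag(q)
kkt = np.block([[2 * A.T @ Q @ A, np.ones((n, 1))],
                [np.ones((1, n)), np.zeros((1, 1))]])
rhs = np.concatenate([2 * A.T @ Q @ b, [v(range(n)) - v(())]])
w = np.linalg.solve(kkt, rhs)[:n]
print(w)    # [2. 3. 3.], matching the brute-force Shapley values above
```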
**Proof**. The Lagrangian is

$$L(w, \lambda) = \sum_{S \subset N} (v(S) - w(S))^2 q(s) - \lambda(w(N) - v(N) + v(\emptyset)),$$

and by setting $\partial_{w_i} L(w, \lambda) = 0$ we have

$${1 \over 2} \lambda = \sum_{S \subset N: i \in S} (w(S) - v(S)) q(s). \qquad (0)$$

Summing (0) over $i$ and dividing by $n$, we have

$${1 \over 2} \lambda = {1 \over n} \sum_i \sum_{S: i \in S} (w(S) q(s) - v(S) q(s)). \qquad (1)$$

We examine each of the two terms on the right hand side.

Counting the terms involving $w_i$ and $w_j$ for $j \neq i$, and using $w(N) = v(N) - v(\emptyset)$, we have

$$\begin{aligned}
&\sum_{S \subset N: i \in S} w(S) q(s) \\
&= \sum_{s = 1}^n {n - 1 \choose s - 1} q(s) w_i + \sum_{j \neq i}\sum_{s = 2}^n {n - 2 \choose s - 2} q(s) w_j \\
&= q(1) w_i + \sum_{s = 2}^n q(s) \left({n - 1 \choose s - 1} w_i + \sum_{j \neq i} {n - 2 \choose s - 2} w_j\right) \\
&= q(1) w_i + \sum_{s = 2}^n \left({n - 2 \choose s - 1} w_i + {n - 2 \choose s - 2} (v(N) - v(\emptyset))\right) q(s) \\
&= \sum_{s = 1}^{n - 1} {n - 2 \choose s - 1} q(s) w_i + \sum_{s = 2}^n {n - 2 \choose s - 2} q(s) (v(N) - v(\emptyset)). \qquad (2)
\end{aligned}$$

Summing (2) over $i$, we obtain

$$\begin{aligned}
&\sum_i \sum_{S: i \in S} w(S) q(s)\\
&= \sum_{s = 1}^{n - 1} {n - 2 \choose s - 1} q(s) (v(N) - v(\emptyset)) + \sum_{s = 2}^n n {n - 2 \choose s - 2} q(s) (v(N) - v(\emptyset))\\
&= \sum_{s = 1}^n s{n - 1 \choose s - 1} q(s) (v(N) - v(\emptyset)). \qquad (3)
\end{aligned}$$

For the second term in (1), we have

$$\sum_i \sum_{S: i \in S} v(S) q(s) = \sum_{S \subset N} s v(S) q(s). \qquad (4)$$

Plugging (3) and (4) into (1), we have

$${1 \over 2} \lambda = {1 \over n} \left(\sum_{s = 1}^n s {n - 1 \choose s - 1} q(s) (v(N) - v(\emptyset)) - \sum_{S \subset N} s q(s) v(S)\right). \qquad (5)$$

Plugging (5) and (2) into (0) and solving for $w_i$, we have

$$w_i = {1 \over n} (v(N) - v(\emptyset)) + \left(\sum_{s = 1}^{n - 1} {n - 2 \choose s - 1} q(s) \right)^{-1} \left( \sum_{S: i \in S} q(s) v(S) - {1 \over n} \sum_{S \subset N} s q(s) v(S) \right). \qquad (6)$$

By splitting all subsets of $N$ into ones that contain $i$ and ones that do not and pairing them up, we have

$$\sum_{S \subset N} s q(s) v(S) = \sum_{S: i \in S} (s q(s) v(S) + (s - 1) q(s - 1) v(S - i)).$$

Plugging this back into (6) we get the desired result. $\square$

SHAP
----

The SHAP paper (Lundberg-Lee 2017) is not clear in its definition of the "SHAP values" and their relation to LIME, so the following is my interpretation of their interpretation model.

Recall that we want to calculate feature contributions to a model $f$ at a sample $x$.

Let $\mu$ be a probability density function over the input space $X = X_1 \times ... \times X_n$. A natural choice would be the density that generates the data, or one that approximates such a density (e.g. the empirical distribution).

The feature contribution (SHAP value) is thus defined as the Shapley value $\phi_i(v)$, where

$$v(S) = \mathbb E_{z \sim \mu} (f(z) | z_S = x_S). \qquad (7)$$

So it is a conditional expectation where $z_i$ is clamped to $x_i$ for $i \in S$.

One simplification is to assume the $n$ features are independent, thus $\mu = \mu_1 \times \mu_2 \times ... \times \mu_n$. In this case, (7) becomes

$$v(S) = \mathbb E_{z_{N \setminus S} \sim \mu_{N \setminus S}} f(x_S, z_{N \setminus S}). \qquad (8)$$

For example, Strumbelj-Kononenko (2010) considers this where $\mu$ is the uniform distribution over $X$, see Definition 4 there.

A further simplification is model linearity, which means $f$ is linear. In this case, (8) becomes

$$v(S) = f(x_S, \mathbb E_{\mu_{N \setminus S}} z_{N \setminus S}). \qquad (9)$$

It is worth noting that to make the modified LIME model considered in the previous section fall under the linear SHAP framework (9), we need to make a further specialisation, that is, change the definition of $h_x(S)$ to

$$(h_x(S))_i =
\begin{cases}
x_i, & \text{if }i \in S; \\
\mathbb E_{\mu_i} z_i, & \text{otherwise.}
\end{cases}$$
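As an illustration of (8), here is a minimal Monte Carlo sketch. The model, the sample and the background data standing in for $\mu$ are all made-up assumptions: $v(S)$ is estimated by clamping the features in $S$ to $x$ and averaging $f$ over background rows for the remaining features, which matches (8) exactly because the background columns are independent by construction; the Shapley values are then computed by direct enumeration as before.

```python
import itertools
from math import factorial

import numpy as np

rng = np.random.default_rng(0)

# Toy model, sample and background data, made up for illustration.
def f(x):
    return 2.0 * x[..., 0] + x[..., 1] * x[..., 2]

background = rng.normal(size=(1000, 3))   # stands in for mu, with independent columns
x = np.array([1.0, 2.0, 3.0])
n = len(x)

def v(S):
    """Monte Carlo estimate of (8): clamp z_S = x_S, then average f over the
    background samples for the remaining features."""
    z = background.copy()
    z[:, list(S)] = x[list(S)]
    return f(z).mean()

def shapley(v, n):
    """Brute-force Shapley values, as in the first sketch."""
    phi = np.zeros(n)
    for i in range(n):
        for size in range(1, n + 1):
            for S in itertools.combinations(range(n), size):
                if i not in S:
                    continue
                weight = factorial(n - size) * factorial(size - 1) / factorial(n)
                phi[i] += weight * (v(S) - v(tuple(j for j in S if j != i)))
    return phi

print(shapley(v, n))   # SHAP values of f at x under the independence assumption
```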
How about DeepLIFT?
-------------------

TODO

References
----------

- Charnes, A., B. Golany, M. Keane, and J. Rousseau. "Extremal Principle Solutions of Games in Characteristic Function Form: Core, Chebychev and Shapley Value Generalizations." In Econometrics of Planning and Efficiency, edited by Jati K. Sengupta and Gopal K. Kadekodi, 123--33. Dordrecht: Springer Netherlands, 1988.
- Lundberg, Scott, and Su-In Lee. "A Unified Approach to Interpreting Model Predictions." ArXiv:1705.07874 [Cs, Stat], May 22, 2017.
- Ribeiro, Marco Tulio, Sameer Singh, and Carlos Guestrin. "'Why Should I Trust You?': Explaining the Predictions of Any Classifier." ArXiv:1602.04938 [Cs, Stat], February 16, 2016.
- Shapley, L. S. "17. A Value for n-Person Games." In Contributions to the Theory of Games (AM-28), Volume II, Vol. 2. Princeton: Princeton University Press, 1953.
- Strumbelj, Erik, and Igor Kononenko. "An Efficient Explanation of Individual Classifications Using Game Theory." J. Mach. Learn. Res. 11 (March 2010): 1--18.