---
title: Shapley, LIME and SHAP
date: 2018-12-02
template: post
comments: true
---

In this post I explain LIME (Ribeiro et al. 2016), the Shapley values
(Shapley, 1953) and the SHAP values (Strumbelj-Kononenko, 2014; Lundberg-Lee, 2017).

__Acknowledgement__. Thanks to Josef Lindman Hörnlund for bringing the LIME
and SHAP papers to my attention. The research was done while I was working
at the KTH mathematics department.

_If you are reading on a mobile device, you may need to "request desktop site"
for the equations to be properly displayed. This post is licensed under CC BY-SA
and GNU FDL._

Shapley values
--------------

A coalitional game $(v, N)$ of $n$ players involves

-   The set $N = \{1, 2, ..., n\}$ that represents the players.
-   A function $v: 2^N \to \mathbb R$, where $v(S)$ is the worth of
    coalition $S \subset N$.

The Shapley values $\phi_i(v)$ of such a game specify a fair way to
distribute the total worth $v(N)$ to the players. They are defined as
follows (for a set $S \subset N$, we write $s = |S|$ for the number of
elements of $S$, and use the shorthands $S - i := S \setminus \{i\}$ and
$S + i := S \cup \{i\}$):

$$\phi_i(v) = \sum_{S: i \in S} {(n - s)! (s - 1)! \over n!} (v(S) - v(S - i)).$$

It is not hard to see that $\phi_i(v)$ can be viewed as an expectation:

$$\phi_i(v) = \mathbb E_{S \sim \nu_i} (v(S) - v(S - i))$$

where $\nu_i(S) = n^{-1} {n - 1 \choose s - 1}^{-1} 1_{i \in S}$, that
is, first pick the size $s$ uniformly from $\{1, 2, ..., n\}$, then pick
$S$ uniformly from the subsets of $N$ that have size $s$ and contain
$i$.
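
To make the definition concrete, here is a brute-force computation in
Python (a minimal sketch of my own, not taken from any of the cited
papers; it is exponential in $n$, so only usable for small games):

```python
from itertools import combinations
from math import factorial

def shapley_values(v, n):
    """Compute phi_i(v) for i = 1, ..., n by enumerating all coalitions.

    v: a function from frozensets S (subsets of {1, ..., n}) to reals.
    Returns the list [phi_1, ..., phi_n].
    """
    players = range(1, n + 1)
    phi = []
    for i in players:
        total = 0.0
        for s in range(1, n + 1):
            weight = factorial(n - s) * factorial(s - 1) / factorial(n)
            # all coalitions S of size s that contain i
            for rest in combinations([j for j in players if j != i], s - 1):
                S = frozenset(rest) | {i}
                total += weight * (v(S) - v(S - {i}))
        phi.append(total)
    return phi

# A toy three-player game: worth 1 exactly when players 1 and 2 cooperate.
v = lambda S: 1.0 if {1, 2} <= S else 0.0
print(shapley_values(v, 3))   # [0.5, 0.5, 0.0]
```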

The Shapley values satisfy some nice properties which are readily
verified (see the numerical check after the list), including:

-   **Efficiency**.
    $\sum_i \phi_i(v) = v(N) - v(\emptyset)$.
-   **Symmetry**. If for some $i, j \in N$, for all
    $S \subset N$, we have $v(S + i) = v(S + j)$, then
    $\phi_i(v) = \phi_j(v)$.
-   **Null player**. If for some $i \in N$, for all
    $S \subset N$, we have $v(S + i) = v(S)$, then $\phi_i(v) = 0$.
-   **Linearity**. $\phi_i$ is linear in games. That is
    $\phi_i(v) + \phi_i(w) = \phi_i(v + w)$, where $v + w$ is defined by
    $(v + w)(S) := v(S) + w(S)$.
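
All four properties can be checked numerically with the sketch above
(`shapley_values` and the toy game `v` are the hypothetical names from
that sketch):

```python
# Two games on n = 3 players: v from the sketch above, plus w to test Linearity.
v = lambda S: 1.0 if {1, 2} <= S else 0.0
w = lambda S: float(len(S))
n = 3
phi_v, phi_w = shapley_values(v, n), shapley_values(w, n)

# Efficiency: the contributions sum to v(N) - v(emptyset)
assert abs(sum(phi_v) - (v(frozenset({1, 2, 3})) - v(frozenset()))) < 1e-9
# Symmetry: players 1 and 2 are interchangeable in v
assert abs(phi_v[0] - phi_v[1]) < 1e-9
# Null player: player 3 never changes the worth of a coalition
assert abs(phi_v[2]) < 1e-9
# Linearity: phi_i(v + w) = phi_i(v) + phi_i(w)
phi_vw = shapley_values(lambda S: v(S) + w(S), n)
assert all(abs(a - b - c) < 1e-9 for a, b, c in zip(phi_vw, phi_v, phi_w))
```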

In the literature, an added assumption $v(\emptyset) = 0$ is often
given, in which case the Efficiency property is defined as
$\sum_i \phi_i(v) = v(N)$. Here I discard this assumption to avoid minor
inconsistencies across different sources. For example, in the LIME
paper, the local model is defined without an intercept, even though the
underlying $v(\emptyset)$ may not be $0$. In the SHAP paper, an
intercept $\phi_0 = v(\emptyset)$ is added which fixes this problem when
making connections to the Shapley values.

Conversely, according to Strumbelj-Kononenko (2010), it was shown in
Shapley's original paper (Shapley, 1953) that these four properties
together with $v(\emptyset) = 0$ define the Shapley values.

LIME
----

LIME (Ribeiro et al. 2016) is a method that explains the feature
contributions of supervised learning models locally.

Let $f: X_1 \times X_2 \times ... \times X_n \to \mathbb R$ be a
function. We can think of $f$ as a model, where $X_j$ is the space of
the $j$th feature. For example, in a language model, $X_j$ may correspond to
the count of the $j$th word in the vocabulary, i.e. the bag-of-words model.

The output may be something like housing price, or log-probability of
something.

LIME tries to assign a value to each feature *locally*. By locally, we
mean that given a specific sample $x \in X := \prod_{i = 1}^n X_i$, we
want to fit a model around it.

More specifically, let $h_x: 2^N \to X$ be a function defined by

$$(h_x(S))_i = 
\begin{cases}
x_i, & \text{if }i \in S; \\
0, & \text{otherwise.}
\end{cases}$$

That is, $h_x(S)$ masks the features that are not in $S$; in other
words, we perturb the sample $x$. In particular, $h_x(N) = x$.
Alternatively, the $0$ in the "otherwise" case can be replaced by some
kind of default value (see the section titled SHAP in this post).

For a set $S \subset N$, let us denote by $1_S \in \{0, 1\}^n$ the
$n$-bit vector whose $k$th bit is $1$ if and only if $k \in S$.

Basically, LIME samples $S_1, S_2, ..., S_m \subset N$ to obtain a set
of perturbed samples $h_x(S_1), ..., h_x(S_m)$ in the $X$ space, and
then fits a linear model $g$ using the $1_{S_i}$ as the input samples
and the $f(h_x(S_i))$ as the output samples:

**Problem** (LIME). Find $w = (w_1, w_2, ..., w_n)$ that
minimises

$$\sum_i (w \cdot 1_{S_i} - f(h_x(S_i)))^2 \pi_x(h_x(S_i))$$

where $\pi_x(x')$ is a function that penalises $x'$s that are far away
from $x$. In the LIME paper the Gaussian kernel was used:

$$\pi_x(x') = \exp\left({- \|x - x'\|^2 \over \sigma^2}\right).$$

Then $w_i$ represents the importance of the $i$th feature.

The LIME paper sets up a more general framework, but the specific model
considered there is the one described above, with a Lasso for
feature selection.
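
To make the procedure concrete, here is a minimal `numpy` sketch of the
fit above (my own toy code and naming, not the authors' implementation;
their reference implementation is the `lime` package, and the Lasso
step is omitted here):

```python
import numpy as np

def lime_weights(f, x, m=1000, sigma=1.0, seed=0):
    """Fit LIME's local linear model around the sample x.

    f: the black-box model, mapping a batch of shape (m, n) to outputs (m,).
    x: the sample to explain, of shape (n,).
    Returns w of shape (n,), one importance weight per feature.
    """
    rng = np.random.default_rng(seed)
    n = len(x)
    # sample m subsets S_i of N, encoded as the binary masks 1_{S_i}
    masks = rng.integers(0, 2, size=(m, n)).astype(float)
    perturbed = masks * x              # h_x(S_i): features outside S_i zeroed
    y = f(perturbed)                   # f(h_x(S_i))
    # Gaussian kernel pi_x(h_x(S_i)), penalising samples far from x
    pi = np.exp(-np.sum((perturbed - x) ** 2, axis=1) / sigma ** 2)
    # weighted least squares without an intercept (as in the main text):
    # minimise sum_i pi_i (w . 1_{S_i} - y_i)^2
    A = masks.T @ (pi[:, None] * masks)
    b = masks.T @ (pi * y)
    return np.linalg.lstsq(A, b, rcond=None)[0]

# Example: a model with an interaction term, explained at x = (1, 1, 1).
f = lambda Z: Z[:, 0] + 2 * Z[:, 1] * Z[:, 2]
print(lime_weights(f, np.ones(3)))
```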

**Remark**. One difference between our account here and the one in the LIME paper
is: the dimension of the data space may differ from $n$ (see Section 3.1 of that paper).
But in the case of text data, they do use bag-of-words (our $X$) as an
"intermediate" representation. So my understanding is that, in their
context, there is an "original" data space (let's call it $X'$) and a
one-one correspondence between $X'$ and $X$ (let's call it
$r: X' \to X$), so that given a sample $x' \in X'$, we can compute the
output of the model at a coalition $S$ as $f(r^{-1}(h_{r(x')}(S)))$.
For example, when $X$ is the bag-of-words space, $X'$ may be the word
embedding vector space, so that $r(x') = A^{-1} x'$, where $A$
is the word embedding matrix.
Therefore, without loss of generality, we assume the input space to be
$X$ which is of dimension $n$.

Shapley values and LIME
-----------------------

The connection between the Shapley values and LIME is noted in
Lundberg-Lee (2017), but the underlying result goes back to 1988
(Charnes et al.).

To see the connection, we need to modify LIME a bit.

First, we need to make LIME less efficient by considering *all* the
$2^n$ subsets instead of the $m$ samples $S_1, S_2, ..., S_{m}$.

Then we need to relax the definition of $\pi_x$. It no longer needs to
penalise samples that are far away from $x$. In fact, we will see later
that the choice of $\pi_x(x')$ that yields the Shapley values is high
when $x'$ is very close or very far away from $x$, and low otherwise. We
further add the restriction that $\pi_x(h_x(S))$ only depends on the
size of $S$, and thus rewrite it as $q(s)$.

We also denote $v(S) := f(h_x(S))$ and $w(S) := \sum_{i \in S} w_i$.

Finally, we add the Efficiency property as a constraint:
$\sum_{i = 1}^n w_i = f(x) - f(h_x(\emptyset)) = v(N) - v(\emptyset)$.

Then the problem becomes a weighted linear regression:

**Problem**. Minimise
$\sum_{S \subset N} (w(S) - v(S))^2 q(s)$ over $w$ subject to
$w(N) = v(N) - v(\emptyset)$.

**Claim** (Charnes et al. 1988). The solution to this problem
is

$$w_i = {1 \over n} (v(N) - v(\emptyset)) + \left(\sum_{s = 1}^{n - 1} {n - 2 \choose s - 1} q(s)\right)^{-1} \sum_{S \subset N: i \in S} \left({n - s \over n} q(s) v(S) - {s - 1 \over n} q(s - 1) v(S - i)\right). \qquad (-1)$$

Specifically, if we choose

$$q(s) = c {n - 2 \choose s - 1}^{-1}$$

for any constant $c$, then $w_i = \phi_i(v)$ are the Shapley values.

**Remark**. Don't worry about this specific choice of $q(s)$
when $s = 0$ or $n$, because $q(0)$ and $q(n)$ do not appear on the
right hand side of (-1), so they can be defined to be of any
value. A common convention for the binomial coefficients is to set
${\ell \choose k} = 0$ if $k < 0$ or $k > \ell$, in which case
$q(0) = q(n) = \infty$.

In Lundberg-Lee (2017), $c$ is chosen to be $1 / n$; see Theorem 2
there.

In Charnes et al. 1988, the $w_i$ defined in (-1) are called the
generalised Shapley values.

**Proof**. The Lagrangian is

$$L(w, \lambda) = \sum_{S \subset N} (v(S) - w(S))^2 q(s) - \lambda(w(N) - v(N) + v(\emptyset)).$$

Setting $\partial_{w_i} L(w, \lambda) = 0$, we have

$${1 \over 2} \lambda = \sum_{S \subset N: i \in S} (w(S) - v(S)) q(s). \qquad (0)$$

Summing (0) over $i$ and dividing by $n$, we have

$${1 \over 2} \lambda = {1 \over n} \sum_i \sum_{S: i \in S} (w(S) q(s) - v(S) q(s)). \qquad (1)$$

We examine each of the two terms on the right hand side.

Counting the terms involving $w_i$ and $w_j$ for $j \neq i$, and using
$w(N) = v(N) - v(\emptyset)$ we have

$$\begin{aligned}
&\sum_{S \subset N: i \in S} w(S) q(s) \\
&= \sum_{s = 1}^n {n - 1 \choose s - 1} q(s) w_i + \sum_{j \neq i}\sum_{s = 2}^n {n - 2 \choose s - 2} q(s) w_j \\
&= q(1) w_i + \sum_{s = 2}^n q(s) \left({n - 1 \choose s - 1} w_i + \sum_{j \neq i} {n - 2 \choose s - 2} w_j\right) \\
&= q(1) w_i + \sum_{s = 2}^n \left({n - 2 \choose s - 1} w_i + {n - 2 \choose s - 2} (v(N) - v(\emptyset))\right) q(s) \\
&= \sum_{s = 1}^{n - 1} {n - 2 \choose s - 1} q(s) w_i + \sum_{s = 2}^n {n - 2 \choose s - 2} q(s) (v(N) - v(\emptyset)). \qquad (2)
\end{aligned}$$

Summing (2) over $i$, we obtain

$$\begin{aligned}
&\sum_i \sum_{S: i \in S} w(S) q(s)\\
&= \sum_{s = 1}^{n - 1} {n - 2 \choose s - 1} q(s) (v(N) - v(\emptyset)) + \sum_{s = 2}^n n {n - 2 \choose s - 2} q(s) (v(N) - v(\emptyset))\\
&= \sum_{s = 1}^n s{n - 1 \choose s - 1} q(s) (v(N) - v(\emptyset)). \qquad (3)
\end{aligned}$$

For the second term in (1), we have

$$\sum_i \sum_{S: i \in S} v(S) q(s) = \sum_{S \subset N} s v(S) q(s). \qquad (4)$$

Plugging (3) and (4) into (1), we have

$${1 \over 2} \lambda = {1 \over n} \left(\sum_{s = 1}^n s {n - 1 \choose s - 1} q(s) (v(N) - v(\emptyset)) - \sum_{S \subset N} s q(s) v(S)\right). \qquad (5)$$

Plugging (5) and (2) into (0) and solving for $w_i$, we have

$$w_i = {1 \over n} (v(N) - v(\emptyset)) + \left(\sum_{s = 1}^{n - 1} {n - 2 \choose s - 1} q(s) \right)^{-1} \left( \sum_{S: i \in S} q(s) v(S) - {1 \over n} \sum_{S \subset N} s q(s) v(S) \right). \qquad (6)$$

By splitting all subsets of $N$ into ones that contain $i$ and ones that
do not, and pairing them up, we have

$$\sum_{S \subset N} s q(s) v(S) = \sum_{S: i \in S} (s q(s) v(S) + (s - 1) q(s - 1) v(S - i)).$$

Plugging this back into (6) we get the desired result. $\square$
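
For a quick numerical check of the Claim, one can set up the constrained
weighted regression with $q(s) = {n - 2 \choose s - 1}^{-1}$ (i.e.
$c = 1$) and solve its KKT system directly; the result should agree with
the brute-force `shapley_values` sketch from earlier. A minimal sketch
of my own (per the Remark, the $S = \emptyset$ and $S = N$ terms are
constants under the constraint and do not affect the minimiser, so they
are dropped):

```python
import numpy as np
from itertools import combinations
from math import comb

def regression_shapley(v, n):
    """Minimise sum_S q(s) (w(S) - v(S))^2 s.t. w(N) = v(N) - v(empty),
    with q(s) = 1 / C(n - 2, s - 1), via the KKT linear system."""
    players = list(range(1, n + 1))
    rows, ys, qs = [], [], []
    for s in range(1, n):          # s = 0 and s = n contribute constants only
        for S in combinations(players, s):
            rows.append([1.0 if i in S else 0.0 for i in players])
            ys.append(v(frozenset(S)))
            qs.append(1.0 / comb(n - 2, s - 1))
    M, y, q = np.array(rows), np.array(ys), np.array(qs)
    # stationarity in w, plus the Efficiency constraint as the last row
    A = np.zeros((n + 1, n + 1))
    A[:n, :n] = 2 * M.T @ (q[:, None] * M)
    A[:n, n] = A[n, :n] = 1.0
    b = np.zeros(n + 1)
    b[:n] = 2 * M.T @ (q * y)
    b[n] = v(frozenset(players)) - v(frozenset())
    return np.linalg.solve(A, b)[:n]

v = lambda S: 1.0 if {1, 2} <= S else 0.0
print(regression_shapley(v, 3))   # agrees with shapley_values(v, 3)
```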

SHAP
----

The paper that coined the term "SHAP values" (Lundberg-Lee 2017)
is not clear in its definition of the "SHAP values" or their
relation to LIME, so the following is my
interpretation of their interpretation model, which coincides with a
model studied in Strumbelj-Kononenko 2014.

Recall that we want to calculate feature contributions to a model $f$ at
a sample $x$.

Let $\mu$ be a probability density function over the input space
$X = X_1 \times ... \times X_n$. A natural choice would be the density
that generates the data, or one that approximates it (e.g. the
empirical distribution).

The feature contribution (SHAP value) is thus defined as the Shapley
value $\phi_i(v)$, where

$$v(S) = \mathbb E_{z \sim \mu} (f(z) | z_S = x_S). \qquad (7)$$

So it is a conditional expectation where $z_i$ is clamped for $i \in S$.
In fact, the definition of feature contributions in this form predates 
Lundberg-Lee 2017. For example, it can be found in Strumbelj-Kononenko 2014.

One simplification is to assume the $n$ features are independent, thus
$\mu = \mu_1 \times \mu_2 \times ... \times \mu_n$. In this case, (7)
becomes

$$v(S) = \mathbb E_{z_{N \setminus S} \sim \mu_{N \setminus S}} f(x_S, z_{N \setminus S}) \qquad (8)$$

For example, Strumbelj-Kononenko (2010) considers this scenario where $\mu$ is
the uniform distribution over $X$; see Definition 4 there.
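
Concretely, (8) suggests a straightforward Monte Carlo estimator: clamp
the features in $S$ to their values in $x$, fill the remaining features
from background samples approximating draws from $\mu$, and average the
model outputs. A sketch of my own (indices are 0-based here, and
`background` is an assumed data matrix):

```python
import numpy as np

def v_hat(f, x, S, background):
    """Monte Carlo estimate of (8): E_{z ~ mu} f(x_S, z_{N \ S}),
    assuming the rows of `background` are (approximately) draws from mu.

    f: the model, mapping a batch of shape (m, n) to outputs of shape (m,).
    x: the sample being explained, of shape (n,).
    S: the feature indices (0-based) clamped to their values in x.
    """
    z = background.copy()
    idx = list(S)
    z[:, idx] = x[idx]    # clamp the features in S; the rest stay random
    return f(z).mean()    # average out the features outside S
```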

A further simplification is model linearity, which means $f$ is linear.
In this case, (8) becomes

$$v(S) = f(x_S, \mathbb E_{\mu_{N \setminus S}} z_{N \setminus S}). \qquad (9)$$
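
As a concrete instance of (9) (a standard computation, matching the
linear-model formula in the SHAP paper): if $f(z) = \sum_j a_j z_j + b$,
then

$$v(S) = \sum_{j \in S} a_j x_j + \sum_{j \notin S} a_j \mathbb E_{\mu_j} z_j + b,$$

so $v(S) - v(S - i) = a_i (x_i - \mathbb E_{\mu_i} z_i)$ for every
$S \ni i$. Since the coefficients ${(n - s)! (s - 1)! \over n!}$ in the
definition of $\phi_i(v)$ sum to $1$ over $\{S : i \in S\}$, this yields

$$\phi_i(v) = a_i (x_i - \mathbb E_{\mu_i} z_i).$$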

It is worth noting that to make the modified LIME model considered in
the previous section fall under the linear SHAP framework (9), we need
to make two further specialisations. The first is rather cosmetic: we need to
change the definition of $h_x(S)$ to

$$(h_x(S))_i = 
\begin{cases}
x_i, & \text{if }i \in S; \\
\mathbb E_{\mu_i} z_i, & \text{otherwise.}
\end{cases}$$

But we also need to boldly assume the original $f$ to be linear, which,
in my view, defeats the purpose of interpretability, because linear models
are already interpretable by themselves.

One may argue that perhaps we do not need linearity to define $v(S)$ as in (9).
If we do so, however, then (9) no longer agrees with (8) and loses its
meaning as a conditional expectation.
A bigger question is: how effective is SHAP?
An even bigger question: in general, how do we evaluate models of interpretation?

Evaluating SHAP
---------------

The contribution of the SHAP paper can be decoupled into two independent
components: showing the nice properties of the Shapley values, and
choosing the coalitional game $v$.

The SHAP paper argues that the Shapley values $\phi_i(v)$ are a good
measurement because they are the only values satisfying some nice
properties, including the Efficiency property mentioned at the beginning
of the post, invariance under permutation, and monotonicity; see the
paragraph below Theorem 1 there, which refers to Theorem 2 of
Young (1985).

Indeed, both efficiency (the "additive feature attribution methods" in the paper) 
and monotonicity are meaningful when considering $\phi_i(v)$ as the 
feature contribution of the $i$th feature.

The question is thus reduced to the second component: what constitutes 
a nice choice of $v$?

The SHAP paper answers this question with three options of increasing
simplification: (7), (8) and (9) in the previous section of this post
(corresponding to (9), (11) and (12) in the paper).
They are intuitive, but it would be interesting to see more concrete
(or even mathematical) justifications of such choices.

References
----------

-   Charnes, A., B. Golany, M. Keane, and J. Rousseau. "Extremal
    Principle Solutions of Games in Characteristic Function Form: Core,
    Chebychev and Shapley Value Generalizations." In Econometrics of
    Planning and Efficiency, edited by Jati K. Sengupta and Gopal K.
    Kadekodi, 123--33. Dordrecht: Springer Netherlands, 1988.
    <https://doi.org/10.1007/978-94-009-3677-5_7>.
-   Lundberg, Scott, and Su-In Lee. "A Unified Approach to Interpreting
    Model Predictions." ArXiv:1705.07874 [Cs, Stat], May 22, 2017.
    <http://arxiv.org/abs/1705.07874>.
-   Ribeiro, Marco Tulio, Sameer Singh, and Carlos Guestrin. "'Why
    Should I Trust You?': Explaining the Predictions of Any Classifier."
    ArXiv:1602.04938 [Cs, Stat], February 16, 2016.
    <http://arxiv.org/abs/1602.04938>.
-   Shapley, L. S. "17. A Value for n-Person Games." In Contributions to
    the Theory of Games (AM-28), Volume II, Vol. 2. Princeton: Princeton
    University Press, 1953. <https://doi.org/10.1515/9781400881970-018>.
-   Strumbelj, Erik, and Igor Kononenko. "An Efficient Explanation of
    Individual Classifications Using Game Theory." J. Mach. Learn. Res.
    11 (March 2010): 1--18.
-   Strumbelj, Erik, and Igor Kononenko. "Explaining Prediction Models
    and Individual Predictions with Feature Contributions." Knowledge
    and Information Systems 41, no. 3 (December 2014): 647--65.
    <https://doi.org/10.1007/s10115-013-0679-x>.
-   Young, H. P. "Monotonic Solutions of Cooperative Games." International
    Journal of Game Theory 14, no. 2 (June 1, 1985): 65--72.
    <https://doi.org/10.1007/BF01769885>.