---
title: Discriminant analysis
date: 2019-01-03
template: post
comments: true
---
In this post I talk about the theory and implementation of linear and
quadratic discriminant analysis, classical methods in statistical
learning.
**Acknowledgement**. Various sources were of great
help to my understanding of the subject, including Chapter 4 of [The
Elements of Statistical
Learning](https://web.stanford.edu/~hastie/ElemStatLearn/), [Stanford
CS229 Lecture notes](http://cs229.stanford.edu/notes/cs229-notes2.pdf),
and [the scikit-learn
code](https://github.com/scikit-learn/scikit-learn/blob/7389dba/sklearn/discriminant_analysis.py).
Research was done while working at the KTH mathematics department.
_If you are reading on a mobile device, you may need to "request desktop site"
for the equations to be properly displayed. This post is licensed under CC BY-SA
and GNU FDL._
Theory
------
Quadratic discriminant analysis (QDA) is a classical classification
algorithm. It assumes that the data is generated by Gaussian
distributions, where each class has its own mean and covariance.
$$(x | y = i) \sim N(\mu_i, \Sigma_i).$$
It also assumes a categorical class prior:
$$\mathbb P(y = i) = \pi_i$$
The log posterior is thus
$$\begin{aligned}
\log \mathbb P(y = i | x) &= \log \mathbb P(x | y = i) + \log \mathbb P(y = i) + C\\
&= - {1 \over 2} \log \det \Sigma_i - {1 \over 2} (x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) + \log \pi_i + C', \qquad (0)
\end{aligned}$$
where $C$ and $C'$ are constants.
Thus prediction is done by taking the argmax of the above formula over $i$.
In training, let $X$, $y$ be the input data, where $X$ is of shape
$m \times n$, and $y$ of shape $m$. We adopt the convention that each
row of $X$ is a sample $x^{(i)T}$. So there are $m$ samples and $n$
features. Denote by $m_i = \#\{j: y_j = i\}$ the number of samples in
class $i$, and let $n_c$ be the number of classes.
We estimate $\mu_i$ by the sample means, and $\pi_i$ by the frequencies:
$$\begin{aligned}
\mu_i &:= {1 \over m_i} \sum_{j: y_j = i} x^{(j)}, \\
\pi_i &:= \mathbb P(y = i) = {m_i \over m}.
\end{aligned}$$
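As a concrete illustration, here is a minimal numpy sketch of these two estimates (the function and variable names are my own, chosen for clarity; this is not the implementation linked in the Implementation section below):

```python
import numpy as np

def estimate_means_and_priors(X, y):
    """Estimate the class means mu_i and the priors pi_i from data."""
    classes = np.unique(y)          # sorted list of class labels
    m = X.shape[0]
    # mu_i: sample mean of the rows of X belonging to class i
    mus = np.stack([X[y == c].mean(axis=0) for c in classes])
    # pi_i: fraction of samples in class i
    pis = np.array([(y == c).sum() / m for c in classes])
    return classes, mus, pis
```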
Linear discriminant analysis (LDA) is a specialisation of QDA: it
assumes all classes share the same covariance, i.e. $\Sigma_i = \Sigma$
for all $i$.
Gaussian Naive Bayes is a different specialisation of QDA: it assumes
that all $\Sigma_i$ are diagonal, since all the features are assumed to
be independent.
### QDA
We look at QDA.
We estimate $\Sigma_i$ by the sample covariance:
$$\begin{aligned}
\Sigma_i &= {1 \over m_i - 1} \sum_{j: y_j = i} \hat x^{(j)} \hat x^{(j)T}.
\end{aligned}$$
where $\hat x^{(j)} = x^{(j)} - \mu_{y_j}$ are the centred $x^{(j)}$.
Plugging this into (0) we are done.
There are two problems that can break the algorithm. First, if one of
the $m_i$ is $1$, then $\Sigma_i$ is ill-defined. Second, one of the
$\Sigma_i$'s might be singular.
In either case, there is no way around it, and the implementation should
throw an exception.
This won't be a problem for LDA, though, unless there is only one
sample for each class.
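Putting the pieces together, a hedged sketch of QDA training and prediction might look as follows. It follows formula (0) directly and raises an exception in the two degenerate cases just mentioned; all names are my own and the code favours clarity over performance.

```python
import numpy as np

def qda_fit(X, y):
    """Fit QDA: estimate mu_i, pi_i and Sigma_i for each class."""
    classes = np.unique(y)
    mus, pis, sigmas = [], [], []
    for c in classes:
        Xi = X[y == c]
        m_i = Xi.shape[0]
        if m_i < 2:
            raise ValueError(f"class {c} has fewer than 2 samples: Sigma_i is ill-defined")
        mu = Xi.mean(axis=0)
        centred = Xi - mu
        sigma = centred.T @ centred / (m_i - 1)       # sample covariance of class i
        if np.linalg.matrix_rank(sigma) < sigma.shape[0]:
            raise ValueError(f"covariance of class {c} is singular")
        mus.append(mu)
        pis.append(m_i / X.shape[0])
        sigmas.append(sigma)
    return classes, np.stack(mus), np.array(pis), np.stack(sigmas)

def qda_predict(Xnew, classes, mus, pis, sigmas):
    """Predict by taking the argmax of formula (0) over the classes."""
    scores = []
    for mu, pi, sigma in zip(mus, pis, sigmas):
        diff = Xnew - mu
        # -1/2 log det Sigma_i - 1/2 (x - mu_i)^T Sigma_i^{-1} (x - mu_i) + log pi_i
        maha = np.einsum('ij,ij->i', diff @ np.linalg.inv(sigma), diff)
        scores.append(-0.5 * np.linalg.slogdet(sigma)[1] - 0.5 * maha + np.log(pi))
    return classes[np.argmax(np.stack(scores, axis=1), axis=1)]
```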
### Vanilla LDA
Now let us look at LDA.
Since all classes share the same covariance, we estimate $\Sigma$ using
the pooled sample covariance
$$\begin{aligned}
\Sigma &= {1 \over m - n_c} \sum_j \hat x^{(j)} \hat x^{(j)T},
\end{aligned}$$
where $\hat x^{(j)} = x^{(j)} - \mu_{y_j}$ and ${1 \over m - n_c}$ comes
from [Bessel's
Correction](https://en.wikipedia.org/wiki/Bessel%27s_correction).
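A minimal sketch of this pooled estimate, continuing the notation of the earlier snippets (again, the names are mine):

```python
import numpy as np

def pooled_covariance(X, y, mus, classes):
    """Pooled within-class covariance with the 1 / (m - n_c) correction.

    Assumes `classes` is sorted, as returned by np.unique."""
    m, n_c = X.shape[0], len(classes)
    # Centre each sample by the mean of its own class
    Xc = X - mus[np.searchsorted(classes, y)]
    return Xc.T @ Xc / (m - n_c)
```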
Let us write down the decision function (0). We can drop the first
term on the right hand side, since all $\Sigma_i$ are the same and we
only care about the argmax of that expression. Thus it becomes
$$- {1 \over 2} (x - \mu_i)^T \Sigma^{-1} (x - \mu_i) + \log\pi_i. \qquad (1)$$
Notice that we have just worked around the problem of computing
$\log \det \Sigma$ when $\Sigma$ is singular.
But how about $\Sigma^{-1}$?
We sidestep this problem by using the pseudoinverse of $\Sigma$ instead.
This can be seen as applying a linear transformation to $X$ to turn its
covariance matrix into the identity, and thus the model becomes a sort
of nearest neighbour classifier.
### Nearest neighbour classifier
More specifically, we want to transform the first term of (1) into a
norm, to get a classifier based on nearest neighbour modulo $\log \pi_i$:
$$- {1 \over 2} \|A(x - \mu_i)\|^2 + \log\pi_i$$
To compute $A$, we denote
$$X_c = X - M,$$
where the $i$th row of $M$ is $\mu_{y_i}^T$, the mean of the class $x^{(i)}$
belongs to, so that $\Sigma = {1 \over m - n_c} X_c^T X_c$.
Let
$${1 \over \sqrt{m - n_c}} X_c = U_x \Sigma_x V_x^T$$
be the SVD of ${1 \over \sqrt{m - n_c}}X_c$. Let
$D_x = \text{diag} (s_1, ..., s_r)$ be the diagonal matrix of all the
nonzero singular values, and redefine $V_x$ to be the $n \times r$ matrix
consisting of the first $r$ columns of $V_x$.
Then with an abuse of notation, the pseudoinverse of $\Sigma$ is
$$\Sigma^{-1} = V_x D_x^{-2} V_x^T.$$
So we just need to take $A = D_x^{-1} V_x^T$. When it comes to
prediction, we transform $x$ with $A$, find the nearest centroid
$A \mu_i$ (again, modulo $\log \pi_i$), and label the input with $i$.
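Continuing the earlier sketches, here is one way to compute $A$ from the SVD and classify by nearest transformed centroid (the tolerance `1e-12` for deciding which singular values count as nonzero is an arbitrary choice of mine):

```python
import numpy as np

def lda_transform_and_centroids(X, y, mus, classes):
    """Compute A = D_x^{-1} V_x^T from the SVD of X_c / sqrt(m - n_c)."""
    m, n_c = X.shape[0], len(classes)
    Xc = X - mus[np.searchsorted(classes, y)]              # centred data
    _, s, Vt = np.linalg.svd(Xc / np.sqrt(m - n_c), full_matrices=False)
    r = np.sum(s > 1e-12)                                  # number of nonzero singular values
    A = Vt[:r] / s[:r, None]                               # A = D_x^{-1} V_x^T, shape r x n
    return A, mus @ A.T                                    # transformed centroids A mu_i

def lda_predict_nearest(Xnew, A, centroids, pis, classes):
    """Nearest transformed centroid, modulo log pi_i."""
    Z = Xnew @ A.T                                         # transform the samples with A
    d2 = ((Z[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return classes[np.argmax(-0.5 * d2 + np.log(pis), axis=1)]
```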
### Dimensionality reduction
We can further simplify the prediction by dimensionality reduction.
Assume $n_c \le n$. Then the centroids span an affine space of dimension
$p$, which is at most $n_c - 1$. So what we can do is to project both the
transformed sample $Ax$ and centroids $A\mu_i$ to the linear subspace
parallel to the affine space, and do the nearest neighbour
classification there.
So we can perform SVD on the matrix $(M - \bar x) V_x D_x^{-1}$ where
$\bar x$, a row vector, is the sample mean of all data i.e. average of
rows of $X$:
$$(M - \bar x) V_x D_x^{-1} = U_m \Sigma_m V_m^T.$$
Again, we let $V_m$ be the $r \times p$ matrix obtained by keeping the
first $p$ columns of $V_m$.
The projection operator is thus $V_m$. And so the final transformation
is $V_m^T D_x^{-1} V_x^T$.
There is no reason to stop here, and we can set $p$ even smaller, which
will result in a lossy compression / regularisation equivalent to doing
[principal component
analysis](https://en.wikipedia.org/wiki/Principal_component_analysis) on
$(M - \bar x) V_x D_x^{-1}$.
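Continuing the sketches above, the second SVD and the final transformation $V_m^T D_x^{-1} V_x^T$ could be computed roughly as follows (the optional parameter `p`, standing in for the reduced dimension, is my own addition):

```python
import numpy as np

def lda_reduced_transform(X, y, mus, classes, p=None):
    """Compute the transformation B = V_m^T D_x^{-1} V_x^T of rank at most p."""
    m, n_c = X.shape[0], len(classes)
    M = mus[np.searchsorted(classes, y)]                   # i-th row: centroid of the class of x^(i)
    Xc = X - M
    _, s, Vt = np.linalg.svd(Xc / np.sqrt(m - n_c), full_matrices=False)
    r = np.sum(s > 1e-12)
    A = Vt[:r] / s[:r, None]                               # D_x^{-1} V_x^T
    xbar = X.mean(axis=0)
    # SVD of (M - xbar) V_x D_x^{-1}
    _, sm, Vmt = np.linalg.svd((M - xbar) @ A.T, full_matrices=False)
    if p is None:
        p = np.sum(sm > 1e-12)                             # at most n_c - 1
    return Vmt[:p] @ A                                     # V_m^T D_x^{-1} V_x^T, shape p x n
```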
Note that as of 2019-01-04, in the [scikit-learn implementation of LDA](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/discriminant_analysis.py), the prediction is done without
any lossy compression, even if the parameter `n_components` is set smaller
than the dimension of the affine space spanned by the centroids.
In other words, the prediction does not change regardless of `n_components`.
### Fisher discriminant analysis
Fisher discriminant analysis involves finding an $n$-dimensional
vector $a$ that maximises the between-class variance relative to the
within-class variance:
$${a^T M_c^T M_c a \over a^T X_c^T X_c a},$$
where $M_c = M - \bar x$ is the centred sample mean matrix.
As it turns out, this is (almost) equivalent to the derivation above,
modulo a constant. In particular, $a = c V_x D_x^{-1} V_m$ with
$p = 1$, for an arbitrary constant $c$.
To see this, we can first multiply the denominator with a constant
${1 \over m - n_c}$ so that the matrix in the denominator becomes the
covariance estimate $\Sigma$.
We decompose $a$: $a = V_x D_x^{-1} b + \tilde V_x \tilde b$, where
$\tilde V_x$ consists of column vectors orthogonal to the column space
of $V_x$.
We ignore the second term in the decomposition. In other words, we only
consider $a$ in the column space of $V_x$.
Then the problem is to find an $r$-dimensional vector $b$ to maximise
$${b^T (M_c V_x D_x^{-1})^T (M_c V_x D_x^{-1}) b \over b^T b}.$$
This is the problem of principal component analysis, and so $b$ is the
first column of $V_m$.
Therefore, the solution to Fisher discriminant analysis is
$a = c V_x D_x^{-1} V_m$ with $p = 1$.
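As a sanity check (not part of the derivation), one could verify numerically that this direction agrees, up to scale and sign, with the leading eigenvector of $(X_c^T X_c)^+ M_c^T M_c$, at least on generic data where $X_c^T X_c$ has full rank. A rough sketch, reusing the quantities from the earlier snippets:

```python
import numpy as np

def fisher_direction_check(X, y, mus, classes, B1, tol=1e-6):
    """Check that the p = 1 transformation B1 (a 1 x n or flat vector) is
    parallel to the leading eigenvector of pinv(X_c^T X_c) M_c^T M_c."""
    xbar = X.mean(axis=0)
    M = mus[np.searchsorted(classes, y)]
    Xc, Mc = X - M, M - xbar
    Sw, Sb = Xc.T @ Xc, Mc.T @ Mc                          # within- and between-class scatter
    w, V = np.linalg.eig(np.linalg.pinv(Sw) @ Sb)
    a = np.real(V[:, np.argmax(np.real(w))])               # leading eigenvector
    b1 = np.ravel(B1)
    cos = abs(a @ b1) / (np.linalg.norm(a) * np.linalg.norm(b1))
    return np.isclose(cos, 1.0, atol=tol)
```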
### Linear model
The model is called linear discriminant analysis because it is a linear
model. To see this, let $B = V_m^T D_x^{-1} V_x^T$ be the matrix of
transformation. Now we are comparing
$$- {1 \over 2} \| B x - B \mu_k\|^2 + \log \pi_k$$
across all $k$s. Expanding the norm and removing the common term
$\|B x\|^2$, we see a linear form:
$$\mu_k^T B^T B x - {1 \over 2} \|B \mu_k\|^2 + \log\pi_k$$
So, writing $K$ for the $n_c \times n$ matrix whose $k$th row is $\mu_k^T$, the prediction for $X_{\text{new}}$ is
$$\text{argmax}_{\text{axis}=0} \left(K B^T B X_{\text{new}}^T - {1 \over 2} \|K B^T\|_{\text{axis}=1}^2 + \log \pi\right)$$
thus the decision boundaries are linear.
This is how scikit-learn implements LDA, by inheriting from
`LinearClassifierMixin` and redirecting the classification there.
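To make the linearity concrete, here is a sketch of computing the coefficients and intercepts of this linear form from the quantities above; it is only a sketch of the formula, not the scikit-learn code itself.

```python
import numpy as np

def lda_linear_form(B, mus, pis):
    """Row k of `coef` is mu_k^T B^T B; intercept_k = -1/2 ||B mu_k||^2 + log pi_k."""
    coef = (mus @ B.T) @ B                                 # shape n_c x n
    intercept = -0.5 * np.sum((mus @ B.T) ** 2, axis=1) + np.log(pis)
    return coef, intercept

def lda_predict_linear(Xnew, coef, intercept, classes):
    """argmax over the classes of the affine scores."""
    return classes[np.argmax(Xnew @ coef.T + intercept, axis=1)]
```

One can check on random data that, as long as `p` is not truncated, this gives the same predictions as the nearest-centroid sketch above.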
Implementation
--------------
This is where things get interesting. How do I validate my understanding
of the theory? By implementing and testing the algorithm.
I try to implement it as closely as possible to the natural language /
mathematical descriptions of the model, which means clarity over
performance.
How about testing? Numerical experiments are harder to test than
combinatorial / discrete algorithms in general because the output is
less verifiable by hand. My shortcut solution to this problem is to test
against output from the scikit-learn package.
It turned out to be harder than expected, as I had to dig into the code
of scikit-learn whenever the outputs mismatched. Their code is quite
well-written though.
The result is
[here](https://github.com/ycpei/machine-learning/tree/master/discriminant-analysis).
### Fun facts about LDA
One property that can be used to test the LDA implementation is the fact
that the scatter matrix $B(X - \bar x)^T (X - \bar x) B^T$ of the
transformed centred sample is diagonal.
This can be derived by using another fun fact: the sum of the
within-class scatter matrix and the between-class scatter matrix is the
sample scatter matrix:
$$X_c^T X_c + M_c^T M_c = (X - \bar x)^T (X - \bar x) = (X_c + M_c)^T (X_c + M_c).$$
The verification is not very hard and left as an exercise.
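Such checks are easy to automate. A hedged sketch, again reusing the quantities from the earlier snippets:

```python
import numpy as np

def check_scatter_facts(X, y, mus, classes, B, tol=1e-8):
    """Check that B (X - xbar)^T (X - xbar) B^T is diagonal, and that the
    within-class and between-class scatter matrices sum to the total scatter."""
    xbar = X.mean(axis=0)
    M = mus[np.searchsorted(classes, y)]
    Xc, Mc, T = X - M, M - xbar, X - xbar
    S = B @ T.T @ T @ B.T
    diagonal = np.allclose(S, np.diag(np.diag(S)), atol=tol)
    additive = np.allclose(Xc.T @ Xc + Mc.T @ Mc, T.T @ T, atol=tol)
    return diagonal, additive
```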