# Copyright (C) 2013-2021 Yuchen Pei.
# Permission is granted to copy, distribute and/or modify this
# document under the terms of the GNU Free Documentation License,
# Version 1.3 or any later version published by the Free Software
# Foundation; with no Invariant Sections, no Front-Cover Texts, and
# no Back-Cover Texts. You should have received a copy of the GNU
# Free Documentation License. If not, see <https://www.gnu.org/licenses/>.
# This work is licensed under the Creative Commons Attribution-ShareAlike
# 4.0 International License. To view a copy of this license, visit
# http://creativecommons.org/licenses/by-sa/4.0/ or send a letter to
# Creative Commons, PO Box 1866, Mountain View, CA 94042, USA.

#+title: Discriminant analysis

#+DATE: <2019-01-03>

In this post I talk about the theory and implementation of linear and quadratic discriminant analysis, classical methods in statistical learning.

*Acknowledgement*. Various sources were of great help to my understanding of the subject, including Chapter 4 of [[https://web.stanford.edu/~hastie/ElemStatLearn/][The Elements of Statistical Learning]], [[http://cs229.stanford.edu/notes/cs229-notes2.pdf][Stanford CS229 Lecture notes]], and [[https://github.com/scikit-learn/scikit-learn/blob/7389dba/sklearn/discriminant_analysis.py][the scikit-learn code]]. Research was done while working at KTH mathematics department.

/If you are reading on a mobile device, you may need to "request desktop site" for the equations to be properly displayed. This post is licensed under CC BY-SA and GNU FDL./

** Theory
:PROPERTIES:
:CUSTOM_ID: theory
:ID: 69be3baf-7f60-42f2-9184-ee8840eea554
:END:

Quadratic discriminant analysis (QDA) is a classical classification algorithm. It assumes that the data is generated by Gaussian distributions, where each class has its own mean and covariance:

$$(x | y = i) \sim N(\mu_i, \Sigma_i).$$

It also assumes a categorical class prior:

$$\mathbb P(y = i) = \pi_i.$$

The log posterior is thus

$$\begin{aligned}
\log \mathbb P(y = i | x) &= \log \mathbb P(x | y = i) + \log \mathbb P(y = i) + C\\
&= - {1 \over 2} \log \det \Sigma_i - {1 \over 2} (x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) + \log \pi_i + C', \qquad (0)
\end{aligned}$$

where $C$ and $C'$ are constants. Prediction is thus done by taking the argmax of the above formula over $i$.

In training, let $X$, $y$ be the input data, where $X$ is of shape $m \times n$ and $y$ of shape $m$. We adopt the convention that each row of $X$ is a sample $x^{(i)T}$, so there are $m$ samples and $n$ features. Let $m_i = \#\{j: y_j = i\}$ be the number of samples in class $i$, and let $n_c$ be the number of classes.

We estimate $\mu_i$ by the sample means and $\pi_i$ by the class frequencies:

$$\begin{aligned}
\mu_i &:= {1 \over m_i} \sum_{j: y_j = i} x^{(j)}, \\
\pi_i &:= \mathbb P(y = i) = {m_i \over m}.
\end{aligned}$$

Linear discriminant analysis (LDA) is a specialisation of QDA: it assumes that all classes share the same covariance, i.e. $\Sigma_i = \Sigma$ for all $i$.

Gaussian Naive Bayes is a different specialisation of QDA: it assumes that all $\Sigma_i$ are diagonal, since all the features are assumed to be independent.

*** QDA
:PROPERTIES:
:CUSTOM_ID: qda
:ID: f6e95892-01cf-4569-b01e-22ed238d0577
:END:

We look at QDA first. We estimate each $\Sigma_i$ by the sample covariance of class $i$:

$$\begin{aligned}
\Sigma_i &= {1 \over m_i - 1} \sum_{j: y_j = i} \hat x^{(j)} \hat x^{(j)T},
\end{aligned}$$

where $\hat x^{(j)} = x^{(j)} - \mu_{y_j}$ are the centred $x^{(j)}$. Plugging this into (0) we are done.
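To make the estimation and prediction steps concrete, here is a minimal NumPy sketch of QDA following the formulas above. The names =qda_fit= and =qda_predict= are made up for illustration and are not taken from the implementation discussed later in this post.

#+begin_src python
import numpy as np

def qda_fit(X, y):
    """Estimate (mu_i, Sigma_i, pi_i) for each class, as described above."""
    classes = np.unique(y)
    m = len(y)
    params = []
    for c in classes:
        X_i = X[y == c]
        m_i = len(X_i)
        if m_i < 2:
            raise ValueError("need at least two samples per class")
        mu = X_i.mean(axis=0)
        centred = X_i - mu
        Sigma = centred.T @ centred / (m_i - 1)   # sample covariance of class i
        params.append((mu, Sigma, m_i / m))       # m_i / m is the prior pi_i
    return classes, params

def qda_predict(X, classes, params):
    """Evaluate the discriminant (0) for each class and take the argmax."""
    scores = []
    for mu, Sigma, pi in params:
        sign, logdet = np.linalg.slogdet(Sigma)
        if sign <= 0:
            raise ValueError("singular covariance matrix")
        diff = X - mu
        # Mahalanobis term (x - mu_i)^T Sigma_i^{-1} (x - mu_i), one value per row
        maha = np.einsum('ij,ij->i', diff @ np.linalg.inv(Sigma), diff)
        scores.append(-0.5 * logdet - 0.5 * maha + np.log(pi))
    return classes[np.argmax(np.stack(scores, axis=1), axis=1)]
#+end_src

The two exceptions raised in the sketch correspond to the two failure modes discussed next.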
There are two problems that can break the algorithm. First, if one of the $m_i$ is $1$, then $\Sigma_i$ is ill-defined. Second, one of the $\Sigma_i$'s might be singular. In either case there is no way around it, and the implementation should throw an exception. This won't be a problem for LDA, though, unless there is only one sample per class.

*** Vanilla LDA
:PROPERTIES:
:CUSTOM_ID: vanilla-lda
:ID: 5a6ca0ca-f385-4054-9b19-9cac69b1a59a
:END:

Now let us look at LDA. Since all classes share the same covariance, we estimate $\Sigma$ by the pooled sample covariance

$$\begin{aligned}
\Sigma &= {1 \over m - n_c} \sum_j \hat x^{(j)} \hat x^{(j)T},
\end{aligned}$$

where $\hat x^{(j)} = x^{(j)} - \mu_{y_j}$ and the factor ${1 \over m - n_c}$ comes from [[https://en.wikipedia.org/wiki/Bessel%27s_correction][Bessel's Correction]].

Let us write down the decision function (0). We can drop the first term on the right hand side, since all $\Sigma_i$ are the same and we only care about the argmax of that equation. Thus it becomes

$$- {1 \over 2} (x - \mu_i)^T \Sigma^{-1} (x - \mu_i) + \log\pi_i. \qquad (1)$$

Notice that we have just sidestepped the problem of computing $\log \det \Sigma$ when $\Sigma$ is singular. But what about $\Sigma^{-1}$? We get around this by using the pseudoinverse of $\Sigma$ instead. This can be seen as applying a linear transformation to $X$ that turns its covariance matrix into the identity, and thus the model becomes a sort of nearest neighbour classifier.

*** Nearest neighbour classifier
:PROPERTIES:
:CUSTOM_ID: nearest-neighbour-classifier
:ID: 8880764c-6fbe-4023-97dd-9711c7c50ea9
:END:

More specifically, we want to transform the first term of (1) into a norm, to get a classifier based on nearest neighbours modulo $\log \pi_i$:

$$- {1 \over 2} \|A(x - \mu_i)\|^2 + \log\pi_i.$$

To compute $A$, we write

$$X_c = X - M,$$

where the $i$th row of $M$ is $\mu_{y_i}^T$, the mean of the class that $x^{(i)}$ belongs to, so that $\Sigma = {1 \over m - n_c} X_c^T X_c$. Let

$${1 \over \sqrt{m - n_c}} X_c = U_x \Sigma_x V_x^T$$

be the SVD of ${1 \over \sqrt{m - n_c}}X_c$. Let $D_x = \text{diag} (s_1, ..., s_r)$ be the diagonal matrix of the nonzero singular values, and redefine $V_x$ to be the $n \times r$ matrix consisting of the first $r$ columns of $V_x$. Then, with an abuse of notation, the pseudoinverse of $\Sigma$ is

$$\Sigma^{-1} = V_x D_x^{-2} V_x^T.$$

So we just need to take $A = D_x^{-1} V_x^T$. When it comes to prediction, we simply transform $x$ with $A$, find the nearest centroid $A \mu_i$ (again, modulo $\log \pi_i$) and label the input with $i$.

*** Dimensionality reduction
:PROPERTIES:
:CUSTOM_ID: dimensionality-reduction
:ID: 70e1afc1-9c45-4a35-a842-48573e077b36
:END:

We can further simplify the prediction by dimensionality reduction. Assume $n_c \le n$. Then the centroids span an affine space of dimension $p$, which is at most $n_c - 1$. So what we can do is project both the transformed samples $Ax$ and the centroids $A\mu_i$ onto the linear subspace parallel to this affine space, and do the nearest neighbour classification there.

To do so, we perform an SVD on the matrix $(M - \bar x) V_x D_x^{-1}$, where $\bar x$, a row vector, is the sample mean of all the data, i.e. the average of the rows of $X$:

$$(M - \bar x) V_x D_x^{-1} = U_m \Sigma_m V_m^T.$$

Again, we let $V_m$ be the $r \times p$ matrix obtained by keeping the first $p$ columns of $V_m$. The projection operator is thus $V_m$, and the final transformation is $V_m^T D_x^{-1} V_x^T$.

There is no reason to stop here, and we can set $p$ even smaller, which will result in a lossy compression / regularisation equivalent to doing [[https://en.wikipedia.org/wiki/Principal_component_analysis][principal component analysis]] on $(M - \bar x) V_x D_x^{-1}$.
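The SVD recipe above translates almost directly into NumPy. Below is a minimal sketch under the same caveats as before: the names =lda_fit= and =lda_predict= are made up, and the tolerance deciding which singular values count as nonzero is an arbitrary choice.

#+begin_src python
import numpy as np

def lda_fit(X, y, p=None):
    """Compute B = V_m^T D_x^{-1} V_x^T together with class means and priors."""
    classes = np.unique(y)
    m, n = X.shape
    n_c = len(classes)
    means = np.vstack([X[y == c].mean(axis=0) for c in classes])  # n_c x n, row i is mu_i^T
    priors = np.array([(y == c).mean() for c in classes])
    M = means[np.searchsorted(classes, y)]                        # m x n, row j is mu_{y_j}^T
    # SVD of the scaled centred data; keep the r nonzero singular values
    _, s_x, Vt_x = np.linalg.svd((X - M) / np.sqrt(m - n_c), full_matrices=False)
    r = int(np.sum(s_x > 1e-10 * s_x[0]))
    A = Vt_x[:r] / s_x[:r, None]                                  # A = D_x^{-1} V_x^T, r x n
    # SVD of (M - xbar) V_x D_x^{-1}; project onto its top p right singular vectors
    xbar = X.mean(axis=0)
    _, _, Vt_m = np.linalg.svd((M - xbar) @ A.T, full_matrices=False)
    p = n_c - 1 if p is None else p
    B = Vt_m[:p] @ A                                              # B = V_m^T D_x^{-1} V_x^T, p x n
    return B, means, priors, classes

def lda_predict(X, B, means, priors, classes):
    """Nearest centroid in the transformed space, modulo the log priors."""
    Z = X @ B.T                                                   # transformed samples
    centroids = means @ B.T                                       # transformed centroids
    scores = -0.5 * ((Z[:, None, :] - centroids[None, :, :]) ** 2).sum(-1) + np.log(priors)
    return classes[np.argmax(scores, axis=1)]
#+end_src

For example, =lda_predict(X_new, *lda_fit(X, y))= labels new samples by the nearest transformed centroid.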
Note that as of 2019-01-04, in the [[https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/discriminant_analysis.py][scikit-learn implementation of LDA]], the prediction is done without any lossy compression, even if the parameter =n_components= is set to be smaller than the dimension of the affine space spanned by the centroids. In other words, the prediction does not change regardless of =n_components=.

*** Fisher discriminant analysis
:PROPERTIES:
:CUSTOM_ID: fisher-discriminant-analysis
:ID: 05ff25da-8c52-4f20-a0ac-4422f19e10ce
:END:

Fisher discriminant analysis involves finding an $n$-dimensional vector $a$ that maximises the between-class covariance relative to the within-class covariance:

$${a^T M_c^T M_c a \over a^T X_c^T X_c a},$$

where $M_c = M - \bar x$ is the centred sample mean matrix.

As it turns out, this is (almost) equivalent to the derivation above, modulo a constant. In particular, $a = c V_x D_x^{-1} V_m$ with $p = 1$, for an arbitrary constant $c$.

To see this, we can first multiply the denominator by the constant ${1 \over m - n_c}$, so that the matrix in the denominator becomes the covariance estimate $\Sigma$.

We then decompose $a$ as $a = V_x D_x^{-1} b + \tilde V_x \tilde b$, where $\tilde V_x$ consists of column vectors orthogonal to the column space of $V_x$. We ignore the second term in the decomposition; in other words, we only consider $a$ in the column space of $V_x$. Then the problem is to find an $r$-dimensional vector $b$ that maximises

$${b^T (M_c V_x D_x^{-1})^T (M_c V_x D_x^{-1}) b \over b^T b}.$$

This is the problem of principal component analysis, and so $b$ is the first column of $V_m$. Therefore, the solution to Fisher discriminant analysis is $a = c V_x D_x^{-1} V_m$ with $p = 1$.

*** Linear model
:PROPERTIES:
:CUSTOM_ID: linear-model
:ID: feb827b6-0064-4192-b96b-86a942c8839e
:END:

The model is called linear discriminant analysis because it is a linear model. To see this, let $B = V_m^T D_x^{-1} V_x^T$ be the matrix of transformation. Now we are comparing

$$- {1 \over 2} \| B x - B \mu_k\|^2 + \log \pi_k$$

across all $k$s. Expanding the norm and removing the common term $-{1 \over 2}\|B x\|^2$, we see a form that is linear in $x$:

$$\mu_k^T B^T B x - {1 \over 2} \|B \mu_k\|^2 + \log\pi_k.$$

So the prediction for $X_{\text{new}}$ is

$$\text{argmax}_{\text{axis}=0} \left(K B^T B X_{\text{new}}^T - {1 \over 2} \|K B^T\|_{\text{axis}=1}^2 + \log \pi\right),$$

where the $k$th row of $K$ is $\mu_k^T$; thus the decision boundaries are linear. This is how scikit-learn implements LDA: it inherits from =LinearClassifierMixin= and redirects the classification there.

** Implementation
:PROPERTIES:
:CUSTOM_ID: implementation
:ID: b567283c-20ee-41a8-8216-7392066a5ac5
:END:

This is where things get interesting. How do I validate my understanding of the theory? By implementing and testing the algorithm.

I try to implement it as closely as possible to the natural language / mathematical descriptions of the model, which means clarity over performance.

How about testing? Numerical experiments are in general harder to test than combinatorial / discrete algorithms, because the output is less verifiable by hand. My shortcut solution to this problem is to test against the output of the scikit-learn package.
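For example, a test along the following lines compares the predictions of the earlier LDA sketch with those of scikit-learn's =LinearDiscriminantAnalysis=. This is an illustrative sketch rather than the actual test suite, and exact agreement assumes no sample falls exactly on a decision boundary.

#+begin_src python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# assumes lda_fit and lda_predict from the earlier sketch are in scope

def test_lda_against_sklearn():
    rng = np.random.RandomState(0)
    X = rng.randn(200, 5)
    y = rng.randint(0, 3, size=200)

    reference = LinearDiscriminantAnalysis(solver='svd').fit(X, y).predict(X)
    mine = lda_predict(X, *lda_fit(X, y))

    assert np.array_equal(mine, reference)
#+end_src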
It turned out to be harder than expected, as I had to dig into the code of scikit-learn whenever the outputs did not match. Their code is quite well-written though. The result is [[https://github.com/ycpei/machine-learning/tree/master/discriminant-analysis][here]].

*** Fun facts about LDA
:PROPERTIES:
:CUSTOM_ID: fun-facts-about-lda
:ID: f1d47f43-27f6-49dd-bd0d-2e685c38e241
:END:

One property that can be used to test the LDA implementation is the fact that the scatter matrix $B(X - \bar x)^T (X - \bar x) B^T$ of the transformed centred samples is diagonal.

This can be derived from another fun fact: the sum of the within-class scatter matrix and the between-class scatter matrix is the sample scatter matrix,

$$X_c^T X_c + M_c^T M_c = (X - \bar x)^T (X - \bar x) = (X_c + M_c)^T (X_c + M_c).$$

The verification is not very hard and is left as an exercise.
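A quick numerical check of the diagonality property, again reusing the hypothetical =lda_fit= from the earlier sketch:

#+begin_src python
import numpy as np

# assumes lda_fit from the earlier sketch is in scope

def check_transformed_scatter_is_diagonal(X, y):
    """Check that B (X - xbar)^T (X - xbar) B^T is (numerically) diagonal."""
    B, _, _, _ = lda_fit(X, y)
    Xc = X - X.mean(axis=0)
    S = B @ Xc.T @ Xc @ B.T
    off_diagonal = S - np.diag(np.diag(S))
    # off-diagonal entries should vanish up to floating point error
    assert np.allclose(off_diagonal, 0, atol=1e-8 * np.abs(S).max())
#+end_src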