diff --git a/posts/2019-01-03-discriminant-analysis.md b/posts/2019-01-03-discriminant-analysis.md
deleted file mode 100644
index 47e2cc8..0000000
--- a/posts/2019-01-03-discriminant-analysis.md
+++ /dev/null
@@ -1,278 +0,0 @@
----
-title: Discriminant analysis
-date: 2019-01-03
-template: post
-comments: true
----
-
-In this post I talk about the theory and implementation of linear and
-quadratic discriminant analysis, classical methods in statistical
-learning.
-
-**Acknowledgement**. Various sources were of great
-help to my understanding of the subject, including Chapter 4 of [The
-Elements of Statistical
-Learning](https://web.stanford.edu/~hastie/ElemStatLearn/), [Stanford
-CS229 Lecture notes](http://cs229.stanford.edu/notes/cs229-notes2.pdf),
-and [the scikit-learn
-code](https://github.com/scikit-learn/scikit-learn/blob/7389dba/sklearn/discriminant_analysis.py).
-Research was done while working at KTH mathematics department.
-
-_If you are reading on a mobile device, you may need to "request desktop site"
-for the equations to be properly displayed. This post is licensed under CC BY-SA
-and GNU FDL._
-
-Theory
-------
-
-Quadratic discriminant analysis (QDA) is a classical classification
-algorithm. It assumes that the data is generated by Gaussian
-distributions, where each class has its own mean and covariance.
-
-$$(x | y = i) \sim N(\mu_i, \Sigma_i).$$
-
-It also assumes a categorical class prior:
-
-$$\mathbb P(y = i) = \pi_i$$
-
-The log of the posterior probability is thus
-
-$$\begin{aligned}
-\log \mathbb P(y = i | x) &= \log \mathbb P(x | y = i) + \log \mathbb P(y = i) + C\\
-&= - {1 \over 2} \log \det \Sigma_i - {1 \over 2} (x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) + \log \pi_i + C', \qquad (0)
-\end{aligned}$$
-
-where $C$ and $C'$ are constants.
-
-Thus prediction is done by taking the argmax of the above formula over $i$.
-
-In training, let $X$, $y$ be the input data, where $X$ is of shape
-$m \times n$, and $y$ of shape $m$. We adopt the convention that each
-row of $X$ is a sample $x^{(i)T}$. So there are $m$ samples and $n$
-features. Let $m_i = \#\{j: y_j = i\}$ denote the number of samples in
-class $i$. Let $n_c$ be the number of classes.
-
-We estimate $\mu_i$ by the sample means, and $\pi_i$ by the frequencies:
-
-$$\begin{aligned}
-\mu_i &:= {1 \over m_i} \sum_{j: y_j = i} x^{(j)}, \\
-\pi_i &:= \mathbb P(y = i) = {m_i \over m}.
-\end{aligned}$$
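-
-To make these estimates concrete, here is a minimal numpy sketch (the
-function name and the return convention are mine, not those of the
-implementation linked at the end of the post):
-
-```python
-import numpy as np
-
-def estimate_means_and_priors(X, y):
-    """Estimate the class means mu_i and the priors pi_i."""
-    classes = np.unique(y)
-    # mu_i: mean of the rows of X belonging to class i
-    means = np.stack([X[y == c].mean(axis=0) for c in classes])
-    # pi_i: relative frequency m_i / m of class i
-    priors = np.array([np.mean(y == c) for c in classes])
-    return classes, means, priors
-```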
-
-Linear discriminant analysis (LDA) is a specialisation of QDA: it
-assumes all classes share the same covariance, i.e. $\Sigma_i = \Sigma$
-for all $i$.
-
-Gaussian Naive Bayes is a different specialisation of QDA: it assumes
-that all $\Sigma_i$ are diagonal, because the features are assumed to
-be conditionally independent given the class.
-
-### QDA
-
-We look at QDA.
-
-We estimate $\Sigma_i$ by the sample covariance:
-
-$$\begin{aligned}
-\Sigma_i &= {1 \over m_i - 1} \sum_{j: y_j = i} \hat x^{(j)} \hat x^{(j)T}.
-\end{aligned}$$
-
-where $\hat x^{(j)} = x^{(j)} - \mu_{y_j}$ are the centred $x^{(j)}$.
-Plugging this into (0) we are done.
-
-There are two problems that can break the algorithm. First, if one of
-the $m_i$ is $1$, then $\Sigma_i$ is ill-defined. Second, one of
-$\Sigma_i$'s might be singular.
-
-In either case, there is no way around it, and the implementation should
-throw an exception.
-
-This won't be a problem for LDA, though, unless there is only one
-sample per class.
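-
-Continuing the sketch above, the covariance estimates and the decision
-rule (0) might look as follows (again, the names are mine); a class with
-fewer than two samples or a singular $\Sigma_i$ makes it throw, as
-discussed:
-
-```python
-def qda_fit_covariances(X, y, classes, means):
-    """Per-class covariance estimates with Bessel's correction."""
-    covs = []
-    for c, mu in zip(classes, means):
-        Xi = X[y == c]
-        if len(Xi) < 2:
-            raise ValueError("QDA needs at least two samples per class")
-        covs.append((Xi - mu).T @ (Xi - mu) / (len(Xi) - 1))
-    return np.stack(covs)
-
-def qda_predict(X, classes, means, priors, covs):
-    """Argmax over classes of the decision function (0)."""
-    scores = []
-    for mu, pi, S in zip(means, priors, covs):
-        _, logdet = np.linalg.slogdet(S)
-        diff = X - mu
-        # per-sample quadratic form (x - mu)^T S^{-1} (x - mu);
-        # np.linalg.inv raises when S is singular
-        maha = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(S), diff)
-        scores.append(-0.5 * logdet - 0.5 * maha + np.log(pi))
-    return classes[np.argmax(np.stack(scores, axis=1), axis=1)]
-```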
-
-### Vanilla LDA
-
-Now let us look at LDA.
-
-Since all classes share the same covariance, we estimate $\Sigma$ using
-the pooled sample covariance
-
-$$\begin{aligned}
-\Sigma &= {1 \over m - n_c} \sum_j \hat x^{(j)} \hat x^{(j)T},
-\end{aligned}$$
-
-where $\hat x^{(j)} = x^{(j)} - \mu_{y_j}$ and ${1 \over m - n_c}$ comes
-from [Bessel's
-Correction](https://en.wikipedia.org/wiki/Bessel%27s_correction).
-
-Let us write down the decision function (0). We can remove the first
-term on the right hand side, since all $\Sigma_i$ are the same, and we
-only care about argmax of that equation. Thus it becomes
-
-$$- {1 \over 2} (x - \mu_i)^T \Sigma^{-1} (x - \mu_i) + \log\pi_i. \qquad (1)$$
-
-Notice that we have just circumvented the problem of computing
-$\log \det \Sigma$ when $\Sigma$ is singular.
-
-But how about $\Sigma^{-1}$?
-
-We sidestep this problem by using the pseudoinverse of $\Sigma$ instead.
-This can be seen as applying a linear transformation to $X$ that turns
-its covariance matrix into the identity, and thus the model becomes a
-sort of nearest neighbour classifier, with the class centroids as the
-neighbours.
-
-### Nearest neighbour classifier
-
-More specifically, we want to transform the first term of (1) into a norm
-to get a classifier based on nearest neighbour modulo $\log \pi_i$:
-
-$$- {1 \over 2} \|A(x - \mu_i)\|^2 + \log\pi_i$$
-
-To compute $A$, we denote
-
-$$X_c = X - M,$$
-
-where the $i$th row of $M$ is $\mu_{y_i}^T$, the mean of the class $x^{(i)}$
-belongs to, so that $\Sigma = {1 \over m - n_c} X_c^T X_c$.
-
-Let
-
-$${1 \over \sqrt{m - n_c}} X_c = U_x \Sigma_x V_x^T$$
-
-be the SVD of ${1 \over \sqrt{m - n_c}}X_c$. Let
-$D_x = \text{diag} (s_1, ..., s_r)$ be the diagonal matrix with all the
-nonzero singular values, and redefine $V_x$ as the $n \times r$ matrix
-consisting of its first $r$ columns.
-
-Then with an abuse of notation, the pseudoinverse of $\Sigma$ is
-
-$$\Sigma^{-1} = V_x D_x^{-2} V_x^T.$$
-
-So we just need to make $A = D_x^{-1} V_x^T$. When it comes to
-prediction, just transform $x$ with $A$, and find the nearest centroid
-$A \mu_i$ (again, modulo $\log \pi_i$) and label the input with $i$.
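-
-A numpy sketch of this computation, in the same style as above (the
-rank cutoff `1e-10` is an arbitrary tolerance of mine):
-
-```python
-def lda_whitening(X, y, classes, means):
-    """Compute A = D_x^{-1} V_x^T from the SVD of the class-centred data."""
-    m, n_c = len(X), len(classes)
-    Xc = X - means[np.searchsorted(classes, y)]   # centre each sample by its class mean
-    _, s, Vt = np.linalg.svd(Xc / np.sqrt(m - n_c), full_matrices=False)
-    r = int(np.sum(s > 1e-10))                    # keep only the nonzero singular values
-    return Vt[:r] / s[:r, None]                   # A = D_x^{-1} V_x^T, shape (r, n)
-
-def lda_predict_centroids(X, A, classes, means, priors):
-    """Label each sample by the nearest transformed centroid, modulo log pi_i."""
-    Z, C = X @ A.T, means @ A.T                   # transformed samples and centroids
-    d2 = ((Z[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
-    return classes[np.argmax(-0.5 * d2 + np.log(priors), axis=1)]
-```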
-
-### Dimensionality reduction
-
-We can further simplify the prediction by dimensionality reduction.
-Assume $n_c \le n$. Then the centroids span an affine space of dimension
-$p$, which is at most $n_c - 1$. So what we can do is to project both the
-transformed sample $Ax$ and centroids $A\mu_i$ to the linear subspace
-parallel to the affine space, and do the nearest neighbour
-classification there.
-
-So we can perform SVD on the matrix $(M - \bar x) V_x D_x^{-1}$ where
-$\bar x$, a row vector, is the sample mean of all the data, i.e. the
-average of the rows of $X$:
-
-$$(M - \bar x) V_x D_x^{-1} = U_m \Sigma_m V_m^T.$$
-
-Again, we let $V_m$ be the $r \times p$ matrix obtained by keeping the
-first $p$ columns of $V_m$.
-
-The projection operator is thus $V_m$. And so the final transformation
-is $V_m^T D_x^{-1} V_x^T$.
-
-There is no reason to stop here, and we can set $p$ even smaller, which
-will result in a lossy compression / regularisation equivalent to doing
-[principal component
-analysis](https://en.wikipedia.org/wiki/Principal_component_analysis) on
-$(M - \bar x) V_x D_x^{-1}$.
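-
-A sketch of this second SVD, with `p` as a parameter so that choosing it
-smaller than $n_c - 1$ gives the lossy compression just mentioned:
-
-```python
-def lda_projection(X, y, classes, means, A, p=None):
-    """Compute the final transformation B = V_m^T D_x^{-1} V_x^T."""
-    if p is None:
-        p = min(len(classes) - 1, A.shape[0])     # dimension of the centroid space
-    M = means[np.searchsorted(classes, y)]        # ith row is mu_{y_i}
-    Mc = M - X.mean(axis=0)                       # M - xbar
-    _, _, Vmt = np.linalg.svd(Mc @ A.T, full_matrices=False)
-    return Vmt[:p] @ A                            # B = V_m^T D_x^{-1} V_x^T, shape (p, n)
-```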
-
-Note that as of 2019-01-04, in the [scikit-learn implementation of LDA](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/discriminant_analysis.py), the prediction is done without
-any lossy compression, even if the parameter `n_components` is set to be smaller
-than the dimension of the affine space spanned by the centroids.
-In other words, the prediction does not change regardless of `n_components`.
-
-### Fisher discriminant analysis
-
-Fisher discriminant analysis involves finding an $n$-dimensional
-vector $a$ that maximises the between-class covariance relative to the
-within-class covariance:
-
-$${a^T M_c^T M_c a \over a^T X_c^T X_c a},$$
-
-where $M_c = M - \bar x$ is the centred sample mean matrix.
-
-As it turns out, this is (almost) equivalent to the derivation above,
-modulo a constant. In particular, $a = c V_x D_x^{-1} V_m$ where
-$p = 1$, for an arbitrary constant $c$.
-
-To see this, we can first multiply the denominator by a constant
-${1 \over m - n_c}$ so that the matrix in the denominator becomes the
-covariance estimate $\Sigma$.
-
-We decompose $a$: $a = V_x D_x^{-1} b + \tilde V_x \tilde b$, where
-$\tilde V_x$ consists of column vectors orthogonal to the column space
-of $V_x$.
-
-We ignore the second term in the decomposition. In other words, we only
-consider $a$ in the column space of $V_x$.
-
-Then the problem is to find an $r$-dimensional vector $b$ to maximise
-
-$${b^T (M_c V_x D_x^{-1})^T (M_c V_x D_x^{-1}) b \over b^T b}.$$
-
-This is the problem of principal component analysis, and so $b$ is the
-first column of $V_m$.
-
-Therefore, the solution to Fisher discriminant analysis is
-$a = c V_x D_x^{-1} V_m$ with $p = 1$.
-
-### Linear model
-
-The model is called linear discriminant analysis because it is a linear
-model. To see this, let $B = V_m^T D_x^{-1} V_x^T$ be the matrix of
-transformation. Now we are comparing
-
-$$- {1 \over 2} \| B x - B \mu_k\|^2 + \log \pi_k$$
-
-across all $k$s. Expanding the norm and removing the common term
-$\|B x\|^2$, we see a linear form:
-
-$$\mu_k^T B^T B x - {1 \over 2} \|B \mu_k\|^2 + \log\pi_k$$
-
-So, writing $K$ for the $n_c \times n$ matrix whose $k$th row is $\mu_k^T$,
-the prediction for $X_{\text{new}}$ is
-
-$$\text{argmax}_{\text{axis}=0} \left(K B^T B X_{\text{new}}^T - {1 \over 2} \|K B^T\|_{\text{axis}=1}^2 + \log \pi\right)$$
-
-thus the decision boundaries are linear.
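-
-In code the linear form can be read off directly; a sketch, with the
-matrix of class means playing the role of $K$:
-
-```python
-def lda_linear_form(B, means, priors):
-    """Coefficients and intercepts of the linear decision function."""
-    coef = means @ B.T @ B                                   # K B^T B, shape (n_c, n)
-    intercept = -0.5 * ((means @ B.T) ** 2).sum(axis=1) + np.log(priors)
-    return coef, intercept
-
-def lda_predict_linear(X_new, coef, intercept, classes):
-    """argmax_k of mu_k^T B^T B x - 1/2 ||B mu_k||^2 + log pi_k."""
-    return classes[np.argmax(X_new @ coef.T + intercept, axis=1)]
-```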
-
-This is how scikit-learn implements LDA, by inheriting from
-`LinearClassifierMixin` and redirecting the classification there.
-
-Implementation
---------------
-
-This is where things get interesting. How do I validate my understanding
-of the theory? By implementing and testing the algorithm.
-
-I try to implement it as closely as possible to the natural language /
-mathematical descriptions of the model, which means clarity over
-performance.
-
-How about testing? Numerical algorithms are in general harder to test
-than combinatorial / discrete ones because their output is less easily
-verified by hand. My shortcut solution to this problem is to test
-against output from the scikit-learn package.
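-
-For instance, a test along the following lines, chaining the sketches
-above and comparing against scikit-learn's SVD solver (I would expect
-exact agreement on well-conditioned data, though rank tolerances could
-in principle differ):
-
-```python
-import numpy as np
-from sklearn.datasets import make_classification
-from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
-
-def test_lda_against_sklearn():
-    X, y = make_classification(n_samples=200, n_features=5, n_informative=3,
-                               n_redundant=0, n_classes=3, random_state=0)
-    classes, means, priors = estimate_means_and_priors(X, y)
-    A = lda_whitening(X, y, classes, means)
-    B = lda_projection(X, y, classes, means, A)
-    coef, intercept = lda_linear_form(B, means, priors)
-    ours = lda_predict_linear(X, coef, intercept, classes)
-    theirs = LinearDiscriminantAnalysis(solver='svd').fit(X, y).predict(X)
-    assert np.array_equal(ours, theirs)
-```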
-
-It turned out to be harder than expected, as I had to dig into the code
-of scikit-learn whenever the outputs mismatched. Their code is quite
-well-written though.
-
-The result is
-[here](https://github.com/ycpei/machine-learning/tree/master/discriminant-analysis).
-
-### Fun facts about LDA
-
-One property that can be used to test the LDA implementation is the fact
-that the scatter matrix $B(X - \bar x)^T (X - \bar x) B^T$ of the
-transformed centred sample is diagonal.
-
-This can be derived using another fun fact, namely that the sum of the
-within-class scatter matrix and the between-class scatter matrix is the
-sample scatter matrix:
-
-$$X_c^T X_c + M_c^T M_c = (X - \bar x)^T (X - \bar x) = (X_c + M_c)^T (X_c + M_c).$$
-
-The verification is not very hard and left as an exercise.
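-
-As a quick numerical version of this check (the algebraic verification
-remains the exercise):
-
-```python
-def check_diagonal_scatter(X, B, atol=1e-8):
-    """B (X - xbar)^T (X - xbar) B^T should be (numerically) diagonal."""
-    Xc = X - X.mean(axis=0)
-    S = B @ Xc.T @ Xc @ B.T
-    assert np.allclose(S - np.diag(np.diag(S)), 0, atol=atol)
-```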