From be0f45c7b29fc3c89bca4c43afec4d01ef9be8a5 Mon Sep 17 00:00:00 2001 From: Yuchen Pei Date: Thu, 3 Jan 2019 10:57:46 +0100 Subject: added a post on discriminant analysis --- posts/2019-01-03-discriminant-analysis.md | 268 ++++++++++++++++++++++++++++++ 1 file changed, 268 insertions(+) create mode 100644 posts/2019-01-03-discriminant-analysis.md (limited to 'posts') diff --git a/posts/2019-01-03-discriminant-analysis.md b/posts/2019-01-03-discriminant-analysis.md new file mode 100644 index 0000000..f019454 --- /dev/null +++ b/posts/2019-01-03-discriminant-analysis.md @@ -0,0 +1,268 @@ +--- +title: Discriminant analysis +date: 2019-01-03 +template: post +comments: true +--- + +In this post I will talk about theory and implementation of linear and +quadratic discriminant analysis, classical methods in statistical +learning. + +[]{#Acknowledgement}**Acknowledgement**. Various sources were of great +help to my understanding of the subject, including Chapter 4 of [The +Elements of Statistical +Learning](https://web.stanford.edu/~hastie/ElemStatLearn/), [Stanford +CS229 Lecture notes](http://cs229.stanford.edu/notes/cs229-notes2.pdf), +and [the scikit-learn +code](https://github.com/scikit-learn/scikit-learn/blob/7389dba/sklearn/discriminant_analysis.py). + +Theory {#Theory} +------ + +Quadratic discriminant analysis (QDA) is a classical classification +algorithm. It assumes that the data is generated by Gaussian +distributions, where each class has its own mean and covariance. + +$$(x | y = i) \sim N(\mu_i, \Sigma_i).$$ + +It also assumes a categorical class prior: + +$$\mathbb P(y = i) = \pi_i$$ + +The log-likelihood is thus + +$$\begin{aligned} +\log \mathbb P(y = i | x) &= \log \mathbb P(x | y = i) \log \mathbb P(y = i) + C\\ +&= - {1 \over 2} \log \det \Sigma_i - {1 \over 2} (x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) + \log \pi_i + C', \qquad (0) +\end{aligned}$$ + +where $C$ and $C'$ are constants. + +Thus the prediction is done by taking argmax of the above formula. + +In training, let $X$, $y$ be the input data, where $X$ is of shape +$m \times n$, and $y$ of shape $m$. We adopt the convention that each +row of $X$ is a sample $x^{(i)T}$. So there are $m$ samples and $n$ +features. Denote by $m_i = \#\{j: y_j = i\}$ be the number of samples in +class $i$. Let $n_c$ be the number of classes. + +We estimate $\mu_i$ by the sample means, and $\pi_i$ by the frequencies: + +$$\begin{aligned} +\mu_i &:= {1 \over m_i} \sum_{j: y_j = i} x^{(j)}, \\ +\pi_i &:= \mathbb P(y = i) = {m_i \over m}. +\end{aligned}$$ + +Linear discriminant analysis (LDA) is a specialisation of QDA: it +assumes all classes share the same covariance, i.e. $\Sigma_i = \Sigma$ +for all $i$. + +Guassian Naive Bayes is a different specialisation of QDA: it assumes +that all $\Sigma_i$ are diagonal, since all the features are assumed to +be independent. + +### QDA {#QDA} + +We look at QDA. + +We estimate $\Sigma_i$ by the variance mean: + +$$\begin{aligned} +\Sigma_i &= {1 \over m_i - 1} \sum_{j: y_j = i} \hat x^{(j)} \hat x^{(j)T}. +\end{aligned}$$ + +where $\hat x^{(j)} = x^{(j)} - \mu_{y_j}$ are the centred $x^{(j)}$. +Plugging this into (0) we are done. + +There are two problems that can break the algorithm. First, if one of +the $m_i$ is $1$, then $\Sigma_i$ is ill-defined. Second, one of +$\Sigma_i$\'s might be singular. + +In either case, there is no way around it, and the implementation should +throw an exception. + +This won\'t be a problem of the LDA, though, unless there is only one +sample for each class. + +### Vanilla LDA {#Vanilla LDA} + +Now let us look at LDA. + +Since all classes share the same covariance, we estimate $\Sigma$ using +sample variance + +$$\begin{aligned} +\Sigma &= {1 \over m - n_c} \sum_j \hat x^{(j)} \hat x^{(j)T}, +\end{aligned}$$ + +where $\hat x^{(j)} = x^{(j)} - \mu_{y_j}$ and ${1 \over m - n_c}$ comes +from [Bessel\'s +Correction](https://en.wikipedia.org/wiki/Bessel%27s_correction). + +Let us write down the decision function (0). We can remove the first +term on the right hand side, since all $\Sigma_i$ are the same, and we +only care about argmax of that equation. Thus it becomes + +$$- {1 \over 2} (x - \mu_i)^T \Sigma^{-1} (x - \mu_i) + \log\pi_i. \qquad (1)$$ + +Notice that we just walked around the problem of figuring out +$\log \det \Sigma$ when $\Sigma$ is singular. + +But how about $\Sigma^{-1}$? + +We sidestep this problem by using the pseudoinverse of $\Sigma$ instead. +This can be seen as applying a linear transformation to $X$ to turn its +covariance matrix to identity. And thus the model becomes a sort of a +nearest neighbour classifier. + +### Nearest neighbour classifier {#Nearest neighbour classifier} + +More specifically, we want to transform the first term of (0) to a norm +to get a classifier based on nearest neighbour modulo $\log \pi_i$: + +$$- {1 \over 2} \|A(x - \mu_i)\|^2 + \log\pi_i$$ + +To compute $A$, we denote + +$$X_c = X - M,$$ + +where the $i$th row of $M$ is $\mu_{y_i}^T$, the mean of the class $x_i$ +belongs to, so that $\Sigma = {1 \over m - n_c} X_c^T X_c$. + +Let + +$${1 \over \sqrt{m - n_c}} X_c = U_x \Sigma_x V_x^T$$ + +be the SVD of ${1 \over \sqrt{m - n_c}}X_c$. Let +$D_x = \text{diag} (s_1, ..., s_r)$ be the diagonal matrix with all the +nonzero singular values, and rewrite $V_x$ as an $n \times r$ matrix +consisting of the first $r$ columns of $V_x$. + +Then with an abuse of notation, the pseudoinverse of $\Sigma$ is + +$$\Sigma^{-1} = V_x D_x^{-2} V_x^T.$$ + +So we just need to make $A = D_x^{-1} V_x^T$. When it comes to +prediction, just transform $x$ with $A$, and find the nearest centroid +$A \mu_i$ (again, modulo $\log \pi_i$) and label the input with $i$. + +### Dimensionality reduction {#Dimensionality reduction} + +We can further simplify the prediction by dimensionality reduction. +Assume $n_c \le n$. Then the centroid spans an affine space of dimension +$p$ which is at most $n_c - 1$. So what we can do is to project both the +transformed sample $Ax$ and centroids $A\mu_i$ to the linear subspace +parallel to the affine space, and do the nearest neighbour +classification there. + +So we can perform SVD on the matrix $(M - \bar x) V_x D_x^{-1}$ where +$\bar x$, a row vector, is the sample mean of all data i.e. average of +rows of $X$: + +$$(M - \bar x) V_x D_x^{-1} = U_m \Sigma_m V_m^T.$$ + +Again, we let $V_m$ be the $r \times p$ matrix by keeping the first $p$ +columns of $V_m$. + +The projection operator is thus $V_m$. And so the final transformation +is $V_m^T D_x^{-1} V_x^T$. + +There is no reason to stop here, and we can set $p$ even smaller, which +will result in a lossy compression / regularisation equivalent to doing +[principle component +analysis](https://en.wikipedia.org/wiki/Principal_component_analysis) on +$(M - \bar x) V_x D_x^{-1}$. + +### Fisher discriminant analysis {#Fisher discriminant analysis} + +The Fisher discriminant analysis involves finding an $n$-dimensional +vector $a$ that maximises between-class covariance with respect to +within-class covariance: + +$${a^T M_c^T M_c a \over a^T X_c^T X_c a},$$ + +where $M_c = M - \bar x$ is the centred sample mean matrix. + +As it turns out, this is (almost) equivalent to the derivation above, +modulo a constant. In particular, $a = c V_m^T D_x^{-1} V_x^T$ where +$p = 1$ for arbitrary constant $c$. + +To see this, we can first multiply the denominator with a constant +${1 \over m - n_c}$ so that the matrix in the denominator becomes the +covariance estimate $\Sigma$. + +We decompose $a$: $a = V_x D_x^{-1} b + \tilde V_x \tilde b$, where +$\tilde V_x$ consists of column vectors orthogonal to the column space +of $V_x$. + +We ignore the second term in the decomposition. In other words, we only +consider $a$ in the column space of $V_x$. + +Then the problem is to find an $r$-dimensional vector $b$ to maximise + +$${b^T (M_c V_x D_x^{-1})^T (M_c V_x D_x^{-1}) b \over b^T b}.$$ + +This is the problem of principle component analysis, and so $b$ is first +column of $V_m$. + +Therefore, the solution to Fisher discriminant analysis is +$a = c V_x D_x^{-1} V_m$ with $p = 1$. + +### Linear model {#Linear model} + +The model is called linear discriminant analysis because it is a linear +model. To see this, let $B = V_m^T D_x^{-1} V_x^T$ be the matrix of +transformation. Now we are comparing + +$$- {1 \over 2} \| B x - B \mu_k\|^2 + \log \pi_k$$ + +across all $k$s. Expanding the norm and removing the common term +$\|B x\|^2$, we see a linear form: + +$$\mu_k^T B^T B x - {1 \over 2} \|B \mu_k\|^2 + \log\pi_k$$ + +So the prediction for $X_{\text{new}}$ is + +$$\text{argmax}_{\text{axis}=0} \left(K B^T B X_{\text{new}}^T - {1 \over 2} \|K B^T\|_{\text{axis}=1}^2 + \log \pi\right)$$ + +thus the decision boundaries are linear. + +This is how scikit-learn implements LDA, by inheriting from +`LinearClassifierMixin` and redirecting the classification there. + +Implementation {#Implementation} +-------------- + +This is where things get interesting. How do I validate my understanding +of the theory? By implementing and testing the algorithm. + +I try to implement it as close as possible to the natural language / +mathematical descriptions of the model, which means clarity over +performance. + +How about testing? Numerical experiments are harder to test than +combinatorial / discrete algorithms in general because the output is +less verifiable by hand. My shortcut solution to this problem is to test +against output from the scikit-learn package. + +It turned out to be harder than expected, as I had to dig into the code +of scikit-learn when the outputs mismatch. Their code is quite +well-written though. + +The result is +[here](https://github.com/ycpei/machine-learning/tree/master/discriminant-analysis). + +### Fun facts about LDA {#Fun facts about LDA} + +One property that can be used to test the LDA implementation is the fact +that the scatter matrix $B(X - \bar x)^T (X - \bar X) B^T$ of the +transformed centred sample is diagonal. + +This can be derived by using another fun fact that the sum of the +in-class scatter matrix and the between-class scatter matrix is the +sample scatter matrix: + +$$X_c^T X_c + M_c^T M_c = (X - \bar x)^T (X - \bar x) = (X_c + M_c)^T (X_c + M_c).$$ + +The verification is not very hard and left as an exercise. -- cgit v1.2.3