From 147a19e84a743f1379f05bf2f444143b4afd7bd6 Mon Sep 17 00:00:00 2001
From: Yuchen Pei
Date: Fri, 18 Jun 2021 12:58:44 +1000
Subject: Updated.

---
 .../posts/2019-01-03-discriminant-analysis.html | 177 +++++++++++++++++++++
 1 file changed, 177 insertions(+)
 create mode 100644 site-from-md/posts/2019-01-03-discriminant-analysis.html

(limited to 'site-from-md/posts/2019-01-03-discriminant-analysis.html')

diff --git a/site-from-md/posts/2019-01-03-discriminant-analysis.html b/site-from-md/posts/2019-01-03-discriminant-analysis.html
new file mode 100644
index 0000000..c28df60
--- /dev/null
+++ b/site-from-md/posts/2019-01-03-discriminant-analysis.html
@@ -0,0 +1,177 @@
Discriminant analysis
Posted on 2019-01-03
In this post I talk about the theory and implementation of linear and quadratic discriminant analysis, classical methods in statistical learning.

+

Acknowledgement. Various sources were of great help to my understanding of the subject, including Chapter 4 of The Elements of Statistical Learning, Stanford CS229 Lecture notes, and the scikit-learn code. Research was done while working at KTH mathematics department.

+

If you are reading on a mobile device, you may need to “request desktop site” for the equations to be properly displayed. This post is licensed under CC BY-SA and GNU FDL.

+

Theory

+

Quadratic discriminant analysis (QDA) is a classical classification algorithm. It assumes that the data is generated by Gaussian distributions, where each class has its own mean and covariance.

+

\[(x | y = i) \sim N(\mu_i, \Sigma_i).\]

+

It also assumes a categorical class prior:

+

\[\mathbb P(y = i) = \pi_i\]

+

The log of the posterior is thus

+

\[\begin{aligned}
\log \mathbb P(y = i | x) &= \log \mathbb P(x | y = i) + \log \mathbb P(y = i) + C\\
&= - {1 \over 2} \log \det \Sigma_i - {1 \over 2} (x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) + \log \pi_i + C', \qquad (0)
\end{aligned}\]

+

where \(C\) and \(C'\) are constants.

+

Thus the prediction is made by taking the argmax of the above formula over \(i\).

+

In training, let \(X\), \(y\) be the input data, where \(X\) is of shape \(m \times n\), and \(y\) of shape \(m\). We adopt the convention that each row of \(X\) is a sample \(x^{(i)T}\). So there are \(m\) samples and \(n\) features. Denote by \(m_i = \#\{j: y_j = i\}\) the number of samples in class \(i\), and let \(n_c\) be the number of classes.

+

We estimate \(\mu_i\) by the sample means, and \(\pi_i\) by the frequencies:

+

\[\begin{aligned}
\mu_i &:= {1 \over m_i} \sum_{j: y_j = i} x^{(j)}, \\
\pi_i &:= \mathbb P(y = i) = {m_i \over m}.
\end{aligned}\]
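As a concrete sketch (in numpy; the function and variable names here are mine, and I assume \(X\) is an \((m, n)\) array and \(y\) an \((m,)\) array of class labels), these estimates can be computed as:

```python
import numpy as np

def estimate_means_and_priors(X, y):
    classes = np.unique(y)                                    # sorted distinct class labels
    mu = np.stack([X[y == c].mean(axis=0) for c in classes])  # (n_c, n): row i is mu_i
    pi = np.array([(y == c).mean() for c in classes])         # (n_c,): class frequencies
    return classes, mu, pi
```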

+

Linear discriminant analysis (LDA) is a specialisation of QDA: it assumes all classes share the same covariance, i.e. \(\Sigma_i = \Sigma\) for all \(i\).

+

Gaussian Naive Bayes is a different specialisation of QDA: it assumes that all \(\Sigma_i\) are diagonal, since the features are assumed to be conditionally independent given the class.

+

QDA

+

We look at QDA.

+

We estimate \(\Sigma_i\) by the sample covariance:

+

\[\begin{aligned}
\Sigma_i &= {1 \over m_i - 1} \sum_{j: y_j = i} \hat x^{(j)} \hat x^{(j)T},
\end{aligned}\]

+

where \(\hat x^{(j)} = x^{(j)} - \mu_{y_j}\) are the centred \(x^{(j)}\). Plugging this into (0) we are done.
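Continuing the sketch above, the per-class covariance estimates and the decision rule (0) might look like the following (again the names are mine; note that np.linalg.solve raises an exception when a \(\Sigma_i\) is singular, which matches the behaviour discussed below):

```python
def qda_fit_covariances(X, y, classes, mu):
    sigmas = []
    for k, c in enumerate(classes):
        Xc = X[y == c] - mu[k]                    # centred samples of class c
        sigmas.append(Xc.T @ Xc / (len(Xc) - 1))  # ill-defined when m_i = 1
    return sigmas

def qda_predict_one(x, mu, pi, sigmas):
    scores = []
    for k, S in enumerate(sigmas):
        _, logdet = np.linalg.slogdet(S)          # log det Sigma_i
        d = x - mu[k]
        # np.linalg.solve raises LinAlgError if S is singular
        scores.append(-0.5 * logdet - 0.5 * d @ np.linalg.solve(S, d) + np.log(pi[k]))
    return int(np.argmax(scores))
```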

+

There are two problems that can break the algorithm. First, if one of the \(m_i\) is \(1\), then \(\Sigma_i\) is ill-defined. Second, one of the \(\Sigma_i\)'s might be singular.

+

In either case, there is no way around it, and the implementation should throw an exception.

+

This won't be a problem for LDA, though, unless there is only one sample in each class.

+

Vanilla LDA

+

Now let us look at LDA.

+

Since all classes share the same covariance, we estimate \(\Sigma\) using the pooled sample covariance

+

\[\begin{aligned}
\Sigma &= {1 \over m - n_c} \sum_j \hat x^{(j)} \hat x^{(j)T},
\end{aligned}\]

+

where \(\hat x^{(j)} = x^{(j)} - \mu_{y_j}\) and \({1 \over m - n_c}\) comes from Bessel's Correction.
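In code, the pooled estimate might be computed like this (a sketch, reusing the conventions above):

```python
def lda_pooled_covariance(X, y, classes, mu):
    Xc = X - mu[np.searchsorted(classes, y)]    # centre each sample by its class mean
    return Xc.T @ Xc / (len(X) - len(classes))  # divisor m - n_c as above
```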

+

Let us write down the decision function (0). We can remove the first term on the right hand side, since all \(\Sigma_i\) are the same, and we only care about argmax of that equation. Thus it becomes

+

\[- {1 \over 2} (x - \mu_i)^T \Sigma^{-1} (x - \mu_i) + \log\pi_i. \qquad (1)\]

+

Notice that we have just avoided the problem of computing \(\log \det \Sigma\) when \(\Sigma\) is singular.

+

But how about \(\Sigma^{-1}\)?

+

We sidestep this problem by using the pseudoinverse of \(\Sigma\) instead. This can be seen as applying a linear transformation to \(X\) to turn its covariance matrix into the identity, and the model thus becomes a sort of nearest neighbour classifier.

+

Nearest neighbour classifier

+

More specifically, we want to transform the first term of (0) to a norm to get a classifier based on nearest neighbour modulo \(\log \pi_i\):

+

\[- {1 \over 2} \|A(x - \mu_i)\|^2 + \log\pi_i\]

+

To compute \(A\), we denote

+

\[X_c = X - M,\]

+

where the \(i\)th row of \(M\) is \(\mu_{y_i}^T\), the mean of the class \(x^{(i)}\) belongs to, so that \(\Sigma = {1 \over m - n_c} X_c^T X_c\).

+

Let

+

\[{1 \over \sqrt{m - n_c}} X_c = U_x \Sigma_x V_x^T\]

+

be the SVD of \({1 \over \sqrt{m - n_c}}X_c\). Let \(D_x = \text{diag} (s_1, ..., s_r)\) be the diagonal matrix containing all the nonzero singular values, and redefine \(V_x\) to be the \(n \times r\) matrix consisting of the first \(r\) columns of \(V_x\).

+

Then with an abuse of notation, the pseudoinverse of \(\Sigma\) is

+

\[\Sigma^{-1} = V_x D_x^{-2} V_x^T.\]

+

So we just need to take \(A = D_x^{-1} V_x^T\). At prediction time, transform \(x\) with \(A\), find the nearest centroid \(A \mu_i\) (again, modulo \(\log \pi_i\)), and label the input with \(i\).
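Here is a sketch of this whitening-plus-nearest-centroid view (the tolerance used to decide which singular values count as nonzero is my own choice, not something prescribed by the derivation):

```python
def lda_whitening(X, y, classes, mu):
    m, n_c = len(X), len(classes)
    Xc = (X - mu[np.searchsorted(classes, y)]) / np.sqrt(m - n_c)
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    r = int(np.sum(s > 1e-12 * s[0]))  # number of "nonzero" singular values
    return Vt[:r] / s[:r, None]        # A = D_x^{-1} V_x^T, shape (r, n)

def nearest_centroid_predict(x, A, mu, pi, classes):
    d = A @ (x - mu).T                                   # column i is A (x - mu_i)
    scores = -0.5 * np.sum(d * d, axis=0) + np.log(pi)
    return classes[np.argmax(scores)]

# Sanity check on the pseudoinverse claim: A.T @ A = V_x D_x^{-2} V_x^T, so
# np.allclose(np.linalg.pinv(lda_pooled_covariance(X, y, classes, mu)), A.T @ A)
# is expected to hold, up to the tolerance chosen above.
```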

+

Dimensionality reduction

+

We can further simplify the prediction by dimensionality reduction. Assume \(n_c \le n\). Then the centroids span an affine space of dimension \(p\), which is at most \(n_c - 1\). So what we can do is to project both the transformed sample \(Ax\) and the centroids \(A\mu_i\) onto the linear subspace parallel to the affine space, and do the nearest neighbour classification there.

+

So we can perform SVD on the matrix \((M - \bar x) V_x D_x^{-1}\) where \(\bar x\), a row vector, is the sample mean of all data i.e. average of rows of \(X\):

+

\[(M - \bar x) V_x D_x^{-1} = U_m \Sigma_m V_m^T.\]

+

Again, we let \(V_m\) be the \(r \times p\) matrix obtained by keeping the first \(p\) columns of \(V_m\).

+

The projection operator is thus \(V_m\). And so the final transformation is \(V_m^T D_x^{-1} V_x^T\).
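As a sketch, the second SVD and the final transformation might be computed like this:

```python
def lda_final_transform(X, y, classes, mu, A, p=None):
    xbar = X.mean(axis=0)
    M = mu[np.searchsorted(classes, y)]          # ith row is the mean of x^(i)'s class
    Mc = (M - xbar) @ A.T                        # (M - xbar) V_x D_x^{-1}
    _, _, Vmt = np.linalg.svd(Mc, full_matrices=False)
    if p is None:
        p = min(len(classes) - 1, Vmt.shape[0])  # dimension of the affine span
    return Vmt[:p] @ A                           # V_m^T D_x^{-1} V_x^T, shape (p, n)
```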

+

There is no reason to stop here, and we can set \(p\) even smaller, which will result in a lossy compression / regularisation equivalent to doing principal component analysis on \((M - \bar x) V_x D_x^{-1}\).

+

Note that as of 2019-01-04, in the scikit-learn implementation of LDA, the prediction is done without any lossy compression, even if the parameter n_components is set to be smaller than the dimension of the affine space spanned by the centroids. In other words, the prediction does not change regardless of n_components.
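One can check this behaviour empirically along these lines (a sketch on the iris data):

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X_iris, y_iris = load_iris(return_X_y=True)
lda1 = LinearDiscriminantAnalysis(n_components=1).fit(X_iris, y_iris)
lda2 = LinearDiscriminantAnalysis(n_components=2).fit(X_iris, y_iris)
# Expected to print True: predictions are unaffected by n_components.
print((lda1.predict(X_iris) == lda2.predict(X_iris)).all())
```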

+

Fisher discriminant analysis

+

Fisher discriminant analysis involves finding an \(n\)-dimensional vector \(a\) that maximises the between-class covariance relative to the within-class covariance:

+

\[{a^T M_c^T M_c a \over a^T X_c^T X_c a},\]

+

where \(M_c = M - \bar x\) is the centred sample mean matrix.

+

As it turns out, this is (almost) equivalent to the derivation above, modulo a constant. In particular, \(a = c V_x D_x^{-1} V_m\) with \(p = 1\), for an arbitrary constant \(c\).

+

To see this, we can first multiply the denominator by the constant \({1 \over m - n_c}\) so that the matrix in the denominator becomes the covariance estimate \(\Sigma\).

+

We decompose \(a\): \(a = V_x D_x^{-1} b + \tilde V_x \tilde b\), where \(\tilde V_x\) consists of column vectors orthogonal to the column space of \(V_x\).

+

We ignore the second term in the decomposition. In other words, we only consider \(a\) in the column space of \(V_x\).

+

Then the problem is to find an \(r\)-dimensional vector \(b\) to maximise

+

\[{b^T (M_c V_x D_x^{-1})^T (M_c V_x D_x^{-1}) b \over b^T b}.\]

+

This is the problem of principal component analysis, and so \(b\) is the first column of \(V_m\).

+

Therefore, the solution to Fisher discriminant analysis is \(a = c V_x D_x^{-1} V_m\) with \(p = 1\).
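In terms of the sketches above (and assuming \(X_c\) has full column rank, so that restricting to the column space of \(V_x\) loses nothing), the Fisher direction is, up to scaling, just the single row of the \(p = 1\) transform:

```python
# a is proportional to V_x D_x^{-1} (first column of V_m), read off as a row vector.
a = lda_final_transform(X, y, classes, mu, A, p=1)[0]
```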

+

Linear model

+

The model is called linear discriminant analysis because it is a linear model. To see this, let \(B = V_m^T D_x^{-1} V_x^T\) be the matrix of transformation. Now we are comparing

+

\[- {1 \over 2} \| B x - B \mu_k\|^2 + \log \pi_k\]

+

across all \(k\)s. Expanding the norm and removing the common term \(\|B x\|^2\), we see a linear form:

+

\[\mu_k^T B^T B x - {1 \over 2} \|B \mu_k\|^2 + \log\pi_k\]

+

So, writing \(K\) for the matrix whose \(k\)th row is \(\mu_k^T\), the prediction for \(X_{\text{new}}\) is

+

\[\text{argmax}_{\text{axis}=0} \left(K B^T B X_{\text{new}}^T - {1 \over 2} \|K B^T\|_{\text{axis}=1}^2 + \log \pi\right)\]

+

thus the decision boundaries are linear.
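A sketch of this linear form in code, with \(K\) the matrix whose \(k\)th row is \(\mu_k^T\) (which in the sketches above is just mu):

```python
def lda_linear_form(B, mu, pi):
    coef = mu @ B.T @ B                                              # row k is mu_k^T B^T B
    intercept = -0.5 * np.sum((mu @ B.T) ** 2, axis=1) + np.log(pi)
    return coef, intercept

def lda_predict(X_new, coef, intercept, classes):
    scores = X_new @ coef.T + intercept                              # (m_new, n_c) decision values
    return classes[np.argmax(scores, axis=1)]
```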

+

This is how scikit-learn implements LDA, by inheriting from LinearClassifierMixin and redirecting the classification there.

+

Implementation

+

This is where things get interesting. How do I validate my understanding of the theory? By implementing and testing the algorithm.

+

I try to implement it as closely as possible to the natural language / mathematical descriptions of the model, which means clarity over performance.

+

How about testing? Numerical experiments are harder to test than combinatorial / discrete algorithms in general because the output is less verifiable by hand. My shortcut solution to this problem is to test against output from the scikit-learn package.
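For instance, stringing the sketch functions above together and comparing with scikit-learn on the iris data might look like this (these are my sketch names, not the interface of the implementation linked below):

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)
classes, mu, pi = estimate_means_and_priors(X, y)
A = lda_whitening(X, y, classes, mu)
B = lda_final_transform(X, y, classes, mu, A)
coef, intercept = lda_linear_form(B, mu, pi)

skl = LinearDiscriminantAnalysis(solver="svd").fit(X, y)
# Expected to print True (up to borderline numerical ties).
print((lda_predict(X, coef, intercept, classes) == skl.predict(X)).all())
```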

+

It turned out to be harder than expected, as I had to dig into the code of scikit-learn whenever the outputs did not match. Their code is quite well-written though.

+

The result is here.

+

Fun facts about LDA

+

One property that can be used to test the LDA implementation is the fact that the scatter matrix \(B(X - \bar x)^T (X - \bar x) B^T\) of the transformed centred sample is diagonal.

+

This can be derived by using another fun fact that the sum of the in-class scatter matrix and the between-class scatter matrix is the sample scatter matrix:

+

\[X_c^T X_c + M_c^T M_c = (X - \bar x)^T (X - \bar x) = (X_c + M_c)^T (X_c + M_c).\]

+

The verification is not very hard and is left as an exercise.
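Still, here is a quick numerical check of the first fun fact, reusing the names from the comparison sketch above:

```python
Xbar_c = X - X.mean(axis=0)
S = B @ Xbar_c.T @ Xbar_c @ B.T  # scatter matrix of the transformed centred sample
# Off-diagonal entries are expected to vanish up to floating point error.
print(np.allclose(S - np.diag(np.diag(S)), 0))
```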

-- 
cgit v1.2.3