From 6c8e5849392cc2541bbdb84d43ce4be2d7fe4319 Mon Sep 17 00:00:00 2001
From: Yuchen Pei
Date: Thu, 1 Jul 2021 12:20:22 +1000
Subject: Removed files no longer in use. Also renamed agpl license file.

---
 .../posts/2019-01-03-discriminant-analysis.html | 177 ---------------------
 1 file changed, 177 deletions(-)
 delete mode 100644 site-from-md/posts/2019-01-03-discriminant-analysis.html

(limited to 'site-from-md/posts/2019-01-03-discriminant-analysis.html')

diff --git a/site-from-md/posts/2019-01-03-discriminant-analysis.html b/site-from-md/posts/2019-01-03-discriminant-analysis.html
deleted file mode 100644
index c28df60..0000000
--- a/site-from-md/posts/2019-01-03-discriminant-analysis.html
+++ /dev/null
@@ -1,177 +0,0 @@
Discriminant analysis

Posted on 2019-01-03

In this post I talk about the theory and implementation of linear and quadratic discriminant analysis, classical methods in statistical learning.

-

Acknowledgement. Various sources were of great help to my understanding of the subject, including Chapter 4 of The Elements of Statistical Learning, Stanford CS229 Lecture notes, and the scikit-learn code. Research was done while working at KTH mathematics department.

-

If you are reading on a mobile device, you may need to “request desktop site” for the equations to be properly displayed. This post is licensed under CC BY-SA and GNU FDL.

-

Theory

-

Quadratic discriminant analysis (QDA) is a classical classification algorithm. It assumes that the data is generated by Gaussian distributions, where each class has its own mean and covariance.

-

\[(x | y = i) \sim N(\mu_i, \Sigma_i).\]

-

It also assumes a categorical class prior:

-

\[\mathbb P(y = i) = \pi_i\]
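As a concrete illustration of this generative model, here is a minimal numpy sketch that samples data from it; the number of classes, the means and the covariances below are made-up example values, not anything from the rest of the post.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up example parameters: 3 classes in 2 dimensions.
pis = np.array([0.5, 0.3, 0.2])                 # class priors pi_i
mus = np.array([[0., 0.], [3., 0.], [0., 3.]])  # class means mu_i
sigmas = np.stack([np.eye(2),                   # class covariances Sigma_i
                   [[1., .5], [.5, 1.]],
                   [[2., 0.], [0., .5]]])

m = 300
y = rng.choice(len(pis), size=m, p=pis)         # y ~ Categorical(pi)
X = np.stack([rng.multivariate_normal(mus[i], sigmas[i]) for i in y])  # x | y = i ~ N(mu_i, Sigma_i)
```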

-

The log posterior is thus

-

\[\begin{aligned}
\log \mathbb P(y = i | x) &= \log \mathbb P(x | y = i) + \log \mathbb P(y = i) + C\\
&= - {1 \over 2} \log \det \Sigma_i - {1 \over 2} (x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) + \log \pi_i + C', \qquad (0)
\end{aligned}\]

-

where \(C\) and \(C'\) are constants.

-

Thus the prediction is done by taking argmax of the above formula.

-

In training, let \(X\), \(y\) be the input data, where \(X\) is of shape \(m \times n\) and \(y\) of shape \(m\). We adopt the convention that each row of \(X\) is a sample \(x^{(i)T}\), so there are \(m\) samples and \(n\) features. Let \(m_i = \#\{j: y_j = i\}\) be the number of samples in class \(i\), and let \(n_c\) be the number of classes.

-

We estimate \(\mu_i\) by the sample means, and \(\pi_i\) by the frequencies:

-

\[\begin{aligned}
\mu_i &:= {1 \over m_i} \sum_{j: y_j = i} x^{(j)}, \\
\pi_i &:= \mathbb P(y = i) = {m_i \over m}.
\end{aligned}\]
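In numpy these estimates might look like the following sketch (the function and variable names are my own, not from the post):

```python
import numpy as np

def estimate_means_and_priors(X, y):
    """Sample mean mu_i and frequency pi_i for each class."""
    classes = np.unique(y)
    mus = np.stack([X[y == c].mean(axis=0) for c in classes])  # shape (n_c, n)
    pis = np.array([np.mean(y == c) for c in classes])         # shape (n_c,)
    return classes, mus, pis
```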

-

Linear discriminant analysis (LDA) is a specialisation of QDA: it assumes all classes share the same covariance, i.e. \(\Sigma_i = \Sigma\) for all \(i\).

-

Gaussian Naive Bayes is a different specialisation of QDA: it assumes that all \(\Sigma_i\) are diagonal, since the features are assumed to be independent within each class.

-

QDA

-

We look at QDA.

-

We estimate \(\Sigma_i\) by the sample covariance of each class:

-

\[\begin{aligned}
\Sigma_i &= {1 \over m_i - 1} \sum_{j: y_j = i} \hat x^{(j)} \hat x^{(j)T},
\end{aligned}\]

-

where \(\hat x^{(j)} = x^{(j)} - \mu_{y_j}\) are the centred \(x^{(j)}\). Plugging this into (0) we are done.
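Put together, a minimal QDA sketch could look like this (an illustration in numpy rather than the post's actual code; it assumes each class has at least two samples and a non-singular \(\Sigma_i\), and otherwise lets numpy raise):

```python
import numpy as np

def qda_fit(X, y):
    classes = np.unique(y)
    params = []
    for c in classes:
        Xi = X[y == c]
        if len(Xi) < 2:
            raise ValueError(f"class {c} has fewer than two samples")
        mu = Xi.mean(axis=0)
        Sigma = (Xi - mu).T @ (Xi - mu) / (len(Xi) - 1)  # per-class sample covariance
        params.append((mu, Sigma, len(Xi) / len(X)))     # (mu_i, Sigma_i, pi_i)
    return classes, params

def qda_predict(X_new, classes, params):
    scores = []
    for mu, Sigma, pi in params:
        # formula (0): -1/2 log det Sigma_i - 1/2 (x - mu_i)^T Sigma_i^{-1} (x - mu_i) + log pi_i
        diff = X_new - mu
        _, logdet = np.linalg.slogdet(Sigma)
        maha = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(Sigma), diff)
        scores.append(-0.5 * logdet - 0.5 * maha + np.log(pi))
    return classes[np.argmax(np.stack(scores, axis=1), axis=1)]
```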

-

There are two problems that can break the algorithm. First, if one of the \(m_i\) is \(1\), then \(\Sigma_i\) is ill-defined. Second, one of \(\Sigma_i\)'s might be singular.

-

In either case, there is no way around it, and the implementation should throw an exception.

-

This won't be a problem for LDA, though, unless there is only one sample in each class.

-

Vanilla LDA

-

Now let us look at LDA.

-

Since all classes share the same covariance, we estimate \(\Sigma\) using the pooled sample covariance

-

\[\begin{aligned}
\Sigma &= {1 \over m - n_c} \sum_j \hat x^{(j)} \hat x^{(j)T},
\end{aligned}\]

-

where \(\hat x^{(j)} = x^{(j)} - \mu_{y_j}\) and \({1 \over m - n_c}\) comes from Bessel's Correction.
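A sketch of this pooled estimate in numpy (again with my own naming):

```python
import numpy as np

def pooled_covariance(X, y):
    """Shared covariance estimate with Bessel's correction (divide by m - n_c)."""
    classes = np.unique(y)
    means = {c: X[y == c].mean(axis=0) for c in classes}
    X_hat = X - np.stack([means[c] for c in y])  # centre each sample by its class mean
    return X_hat.T @ X_hat / (len(X) - len(classes))
```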

-

Let us write down the decision function (0). We can remove the first term on the right hand side, since all \(\Sigma_i\) are the same, and we only care about argmax of that equation. Thus it becomes

-

\[- {1 \over 2} (x - \mu_i)^T \Sigma^{-1} (x - \mu_i) + \log\pi_i. \qquad (1)\]

-

Notice that we just walked around the problem of figuring out \(\log \det \Sigma\) when \(\Sigma\) is singular.

-

But how about \(\Sigma^{-1}\)?

-

We sidestep this problem by using the pseudoinverse of \(\Sigma\) instead. This can be seen as applying a linear transformation to \(X\) that turns its covariance matrix into the identity, and thus the model becomes a sort of nearest neighbour classifier.

-

Nearest neighbour classifier

-

More specifically, we want to transform the quadratic term of (1) into a squared norm, to get a classifier based on nearest neighbour modulo \(\log \pi_i\):

-

\[- {1 \over 2} \|A(x - \mu_i)\|^2 + \log\pi_i\]

-

To compute \(A\), we denote

-

\[X_c = X - M,\]

-

where the \(i\)th row of \(M\) is \(\mu_{y_i}^T\), the mean of the class \(x^{(i)}\) belongs to, so that \(\Sigma = {1 \over m - n_c} X_c^T X_c\).

-

Let

-

\[{1 \over \sqrt{m - n_c}} X_c = U_x \Sigma_x V_x^T\]

-

be the SVD of \({1 \over \sqrt{m - n_c}}X_c\). Let \(D_x = \text{diag} (s_1, ..., s_r)\) be the diagonal matrix of the nonzero singular values, and redefine \(V_x\) to be the \(n \times r\) matrix consisting of its first \(r\) columns.

-

Then with an abuse of notation, the pseudoinverse of \(\Sigma\) is

-

\[\Sigma^{-1} = V_x D_x^{-2} V_x^T.\]

-

So we can take \(A = D_x^{-1} V_x^T\). At prediction time, we transform \(x\) with \(A\), find the nearest centroid \(A \mu_i\) (again, modulo \(\log \pi_i\)), and label the input with \(i\).
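A numpy sketch of this whitening-plus-nearest-centroid view (an illustration with my own names; `tol` decides which singular values count as nonzero):

```python
import numpy as np

def whitening_transform(X, y, tol=1e-12):
    classes = np.unique(y)
    mus = np.stack([X[y == c].mean(axis=0) for c in classes])
    pis = np.array([np.mean(y == c) for c in classes])
    M = mus[np.searchsorted(classes, y)]             # row j is the mean of the class of x^(j)
    Xc = (X - M) / np.sqrt(len(X) - len(classes))
    _, s, Vxt = np.linalg.svd(Xc, full_matrices=False)
    r = np.sum(s > tol)
    A = Vxt[:r] / s[:r, None]                        # A = D_x^{-1} V_x^T
    return A, classes, mus, pis

def nearest_centroid_predict(X_new, A, classes, mus, pis):
    # squared distances between transformed samples and transformed centroids
    d2 = ((X_new @ A.T)[:, None, :] - (mus @ A.T)[None, :, :]) ** 2
    scores = -0.5 * d2.sum(axis=2) + np.log(pis)     # modulo log pi_i
    return classes[np.argmax(scores, axis=1)]
```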

-

Dimensionality reduction

-

We can further simplify the prediction by dimensionality reduction. Assume \(n_c \le n\). Then the centroids span an affine space of dimension \(p\), which is at most \(n_c - 1\). So what we can do is project both the transformed sample \(Ax\) and the transformed centroids \(A\mu_i\) onto the linear subspace parallel to that affine space, and do the nearest neighbour classification there.

-

So we can perform SVD on the matrix \((M - \bar x) V_x D_x^{-1}\) where \(\bar x\), a row vector, is the sample mean of all data i.e. average of rows of \(X\):

-

\[(M - \bar x) V_x D_x^{-1} = U_m \Sigma_m V_m^T.\]

-

Again, with an abuse of notation, we let \(V_m\) be the \(r \times p\) matrix obtained by keeping the first \(p\) columns of \(V_m\).

-

The projection onto that subspace is given by \(V_m^T\), and so the final transformation is \(V_m^T D_x^{-1} V_x^T\).
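Continuing the sketch from the previous section, the final transformation can be computed with a second SVD (again my own naming; `p=None` means keeping the full dimension of the affine span):

```python
import numpy as np

def lda_transform(X, y, p=None, tol=1e-12):
    classes = np.unique(y)
    mus = np.stack([X[y == c].mean(axis=0) for c in classes])
    M = mus[np.searchsorted(classes, y)]
    Xc = (X - M) / np.sqrt(len(X) - len(classes))
    _, s, Vxt = np.linalg.svd(Xc, full_matrices=False)
    r = np.sum(s > tol)
    A = Vxt[:r] / s[:r, None]                        # D_x^{-1} V_x^T
    _, sm, Vmt = np.linalg.svd((M - X.mean(axis=0)) @ A.T, full_matrices=False)
    if p is None:
        p = np.sum(sm > tol)                         # at most n_c - 1
    return Vmt[:p] @ A                               # B = V_m^T D_x^{-1} V_x^T
```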

-

There is no reason to stop here, and we can set \(p\) even smaller, which results in a lossy compression / regularisation equivalent to doing principal component analysis on \((M - \bar x) V_x D_x^{-1}\).

-

Note that as of 2019-01-04, in the scikit-learn implementation of LDA, the prediction is done without any lossy compression, even if the parameter n_components is set to be smaller than the dimension of the affine space spanned by the centroids. In other words, the prediction does not change regardless of n_components.

-

Fisher discriminant analysis

-

Fisher discriminant analysis involves finding an \(n\)-dimensional vector \(a\) that maximises the ratio of the between-class variance to the within-class variance:

-

\[{a^T M_c^T M_c a \over a^T X_c^T X_c a},\]

-

where \(M_c = M - \bar x\) is the centred sample mean matrix.

-

As it turns out, this is (almost) equivalent to the derivation above, modulo a constant. In particular, \(a = c V_x D_x^{-1} V_m\) with \(p = 1\), for an arbitrary constant \(c\).

-

To see this, we can first multiply the denominator by the constant \({1 \over m - n_c}\) so that the matrix in the denominator becomes the covariance estimate \(\Sigma\).

-

We decompose \(a\): \(a = V_x D_x^{-1} b + \tilde V_x \tilde b\), where \(\tilde V_x\) consists of column vectors orthogonal to the column space of \(V_x\).

-

We ignore the second term in the decomposition. In other words, we only consider \(a\) in the column space of \(V_x\).

-

Then the problem is to find an \(r\)-dimensional vector \(b\) to maximise

-

\[{b^T (M_c V_x D_x^{-1})^T (M_c V_x D_x^{-1}) b \over b^T b}.\]

-

This is the problem of principal component analysis, and so \(b\) is the first column of \(V_m\).

-

Therefore, the solution to Fisher discriminant analysis is \(a = c V_x D_x^{-1} V_m\) with \(p = 1\).
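A short numerical sanity check of this claim (a sketch; `lda_transform` is the hypothetical helper from the previous section, and the comparison against random directions only makes sense when the within-class scatter is well conditioned):

```python
import numpy as np

def rayleigh_quotient(a, X, y):
    """Between-class over within-class variance along direction a."""
    classes = np.unique(y)
    mus = np.stack([X[y == c].mean(axis=0) for c in classes])
    M = mus[np.searchsorted(classes, y)]
    Xc, Mc = X - M, M - X.mean(axis=0)
    return (a @ Mc.T @ Mc @ a) / (a @ Xc.T @ Xc @ a)

# The Fisher direction is the single row of B when p = 1:
#   a_fisher = lda_transform(X, y, p=1)[0]
# For random directions a, rayleigh_quotient(a, X, y) should not exceed
# rayleigh_quotient(a_fisher, X, y), up to numerical error.
```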

-

Linear model

-

The model is called linear discriminant analysis because it is a linear model. To see this, let \(B = V_m^T D_x^{-1} V_x^T\) be the matrix of transformation. Now we are comparing

-

\[- {1 \over 2} \| B x - B \mu_k\|^2 + \log \pi_k\]

-

across all \(k\)s. Expanding the norm and removing the common term \(\|B x\|^2\), we see a linear form:

-

\[\mu_k^T B^T B x - {1 \over 2} \|B \mu_k\|^2 + \log\pi_k\]

-

So, writing \(K\) for the \(n_c \times n\) matrix whose \(k\)th row is \(\mu_k^T\), the prediction for \(X_{\text{new}}\) is

-

\[\text{argmax}_{\text{axis}=0} \left(K B^T B X_{\text{new}}^T - {1 \over 2} \|K B^T\|_{\text{axis}=1}^2 + \log \pi\right)\]
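In code, this amounts to precomputing a coefficient matrix and an intercept vector once (a sketch; `B`, `classes`, `mus` — whose rows are \(\mu_k^T\), i.e. the \(K\) above — and `pis` are assumed to come from the earlier sketches):

```python
import numpy as np

def linear_form(B, mus, pis):
    coef = mus @ B.T @ B                                               # K B^T B
    intercept = -0.5 * np.sum((mus @ B.T) ** 2, axis=1) + np.log(pis)  # -1/2 ||B mu_k||^2 + log pi_k
    return coef, intercept

def linear_predict(X_new, coef, intercept, classes):
    return classes[np.argmax(X_new @ coef.T + intercept, axis=1)]
```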

-

thus the decision boundaries are linear.

-

This is how scikit-learn implements LDA, by inheriting from LinearClassifierMixin and redirecting the classification there.

-

Implementation

-

This is where things get interesting. How do I validate my understanding of the theory? By implementing and testing the algorithm.

-

I try to implement it as closely as possible to the natural language / mathematical descriptions of the model, which means favouring clarity over performance.

-

How about testing? Numerical experiments are harder to test than combinatorial / discrete algorithms in general because the output is less verifiable by hand. My shortcut solution to this problem is to test against output from the scikit-learn package.
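For instance, a test along these lines (a sketch: `my_lda_fit` and `my_lda_predict` are hypothetical stand-ins for the implementation described above, and scikit-learn's `LinearDiscriminantAnalysis` with the default `svd` solver serves as the reference):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

reference = LinearDiscriminantAnalysis(solver='svd').fit(X, y)

model = my_lda_fit(X, y)   # hypothetical: whatever the implementation above returns
np.testing.assert_array_equal(my_lda_predict(X, model), reference.predict(X))
```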

-

It turned out to be harder than expected, as I had to dig into the scikit-learn code whenever the outputs did not match. Their code is quite well-written though.

-

The result is here.

-

Fun facts about LDA

-

One property that can be used to test the LDA implementation is the fact that the scatter matrix \(B(X - \bar x)^T (X - \bar x) B^T\) of the transformed centred sample is diagonal.

-

This can be derived by using another fun fact that the sum of the in-class scatter matrix and the between-class scatter matrix is the sample scatter matrix:

-

\[X_c^T X_c + M_c^T M_c = (X - \bar x)^T (X - \bar x) = (X_c + M_c)^T (X_c + M_c).\]
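Both facts are easy to check numerically (a sketch; `B` is the transformation from the earlier sketches):

```python
import numpy as np

def check_scatter_facts(X, y, B):
    classes = np.unique(y)
    mus = np.stack([X[y == c].mean(axis=0) for c in classes])
    M = mus[np.searchsorted(classes, y)]
    Xc, Mc, Xbar = X - M, M - X.mean(axis=0), X - X.mean(axis=0)

    # in-class scatter + between-class scatter == total scatter
    assert np.allclose(Xc.T @ Xc + Mc.T @ Mc, Xbar.T @ Xbar)

    # the transformed total scatter B (X - xbar)^T (X - xbar) B^T is diagonal
    S = B @ Xbar.T @ Xbar @ B.T
    assert np.allclose(S, np.diag(np.diag(S)), atol=1e-8 * np.abs(S).max())
```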

-

The verification is not very hard and left as an exercise.
