Diffstat (limited to 'site-from-md/posts/2019-01-03-discriminant-analysis.html')
-rw-r--r-- | site-from-md/posts/2019-01-03-discriminant-analysis.html | 177 |
1 files changed, 177 insertions, 0 deletions
diff --git a/site-from-md/posts/2019-01-03-discriminant-analysis.html b/site-from-md/posts/2019-01-03-discriminant-analysis.html new file mode 100644 index 0000000..c28df60 --- /dev/null +++ b/site-from-md/posts/2019-01-03-discriminant-analysis.html @@ -0,0 +1,177 @@ +<!doctype html> +<html lang="en"> + <head> + <meta charset="utf-8"> + <title>Discriminant analysis</title> + <link rel="stylesheet" href="../assets/css/default.css" /> + <script data-isso="/comments/" + data-isso-css="true" + data-isso-lang="en" + data-isso-reply-to-self="false" + data-isso-require-author="true" + data-isso-require-email="true" + data-isso-max-comments-top="10" + data-isso-max-comments-nested="5" + data-isso-reveal-on-click="5" + data-isso-avatar="true" + data-isso-avatar-bg="#f0f0f0" + data-isso-avatar-fg="#9abf88 #5698c4 #e279a3 #9163b6 ..." + data-isso-vote="true" + data-vote-levels="" + src="/comments/js/embed.min.js"></script> + <script src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML" type="text/javascript"></script> + <script src="../assets/js/analytics.js" type="text/javascript"></script> + </head> + <body> + <header> + <span class="logo"> + <a href="../blog.html">Yuchen's Blog</a> + </span> + <nav> + <a href="../index.html">About</a><a href="../postlist.html">All posts</a><a href="../blog-feed.xml">Feed</a> + </nav> + </header> + + <div class="main"> + <div class="bodyitem"> + <h2> Discriminant analysis </h2> + <p>Posted on 2019-01-03 | <a href="/posts/2019-01-03-discriminant-analysis.html#isso-thread">Comments</a> </p> + <!DOCTYPE html> +<html xmlns="http://www.w3.org/1999/xhtml" lang="" xml:lang=""> +<head> + <meta charset="utf-8" /> + <meta name="generator" content="pandoc" /> + <meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=yes" /> + <title>Untitled</title> + <style> + code{white-space: pre-wrap;} + span.smallcaps{font-variant: small-caps;} + span.underline{text-decoration: underline;} + div.column{display: inline-block; vertical-align: top; width: 50%;} + </style> + <script src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/MathJax.js?config=TeX-AMS_CHTML-full" type="text/javascript"></script> + <!--[if lt IE 9]> + <script src="//cdnjs.cloudflare.com/ajax/libs/html5shiv/3.7.3/html5shiv-printshiv.min.js"></script> + <![endif]--> +</head> +<body> +<nav id="TOC"> +<ul> +<li><a href="#theory">Theory</a><ul> +<li><a href="#qda">QDA</a></li> +<li><a href="#vanilla-lda">Vanilla LDA</a></li> +<li><a href="#nearest-neighbour-classifier">Nearest neighbour classifier</a></li> +<li><a href="#dimensionality-reduction">Dimensionality reduction</a></li> +<li><a href="#fisher-discriminant-analysis">Fisher discriminant analysis</a></li> +<li><a href="#linear-model">Linear model</a></li> +</ul></li> +<li><a href="#implementation">Implementation</a><ul> +<li><a href="#fun-facts-about-lda">Fun facts about LDA</a></li> +</ul></li> +</ul> +</nav> +<p>In this post I talk about the theory and implementation of linear and quadratic discriminant analysis, classical methods in statistical learning.</p> +<p><strong>Acknowledgement</strong>. 
Various sources were of great help to my understanding of the subject, including Chapter 4 of <a href="https://web.stanford.edu/~hastie/ElemStatLearn/">The Elements of Statistical Learning</a>, <a href="http://cs229.stanford.edu/notes/cs229-notes2.pdf">Stanford CS229 Lecture notes</a>, and <a href="https://github.com/scikit-learn/scikit-learn/blob/7389dba/sklearn/discriminant_analysis.py">the scikit-learn code</a>. Research was done while working at the KTH mathematics department.</p>
+<p><em>If you are reading on a mobile device, you may need to “request desktop site” for the equations to be properly displayed. This post is licensed under CC BY-SA and GNU FDL.</em></p>
+<h2 id="theory">Theory</h2>
+<p>Quadratic discriminant analysis (QDA) is a classical classification algorithm. It assumes that the data is generated by Gaussian distributions, where each class has its own mean and covariance:</p>
+<p><span class="math display">\[(x | y = i) \sim N(\mu_i, \Sigma_i).\]</span></p>
+<p>It also assumes a categorical class prior:</p>
+<p><span class="math display">\[\mathbb P(y = i) = \pi_i.\]</span></p>
+<p>The log of the posterior probability is thus</p>
+<p><span class="math display">\[\begin{aligned}
+\log \mathbb P(y = i | x) &= \log \mathbb P(x | y = i) + \log \mathbb P(y = i) + C\\
+&= - {1 \over 2} \log \det \Sigma_i - {1 \over 2} (x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) + \log \pi_i + C', \qquad (0)
+\end{aligned}\]</span></p>
+<p>where <span class="math inline">\(C\)</span> and <span class="math inline">\(C'\)</span> are constants.</p>
+<p>Thus the prediction is done by taking the argmax of the above formula over <span class="math inline">\(i\)</span>.</p>
+<p>In training, let <span class="math inline">\(X\)</span>, <span class="math inline">\(y\)</span> be the input data, where <span class="math inline">\(X\)</span> is of shape <span class="math inline">\(m \times n\)</span>, and <span class="math inline">\(y\)</span> of shape <span class="math inline">\(m\)</span>. We adopt the convention that each row of <span class="math inline">\(X\)</span> is a sample <span class="math inline">\(x^{(i)T}\)</span>. So there are <span class="math inline">\(m\)</span> samples and <span class="math inline">\(n\)</span> features. Denote by <span class="math inline">\(m_i = \#\{j: y_j = i\}\)</span> the number of samples in class <span class="math inline">\(i\)</span>, and let <span class="math inline">\(n_c\)</span> be the number of classes.</p>
+<p>We estimate <span class="math inline">\(\mu_i\)</span> by the sample means, and <span class="math inline">\(\pi_i\)</span> by the frequencies:</p>
+<p><span class="math display">\[\begin{aligned}
+\mu_i &:= {1 \over m_i} \sum_{j: y_j = i} x^{(j)}, \\
+\pi_i &:= \mathbb P(y = i) = {m_i \over m}.
+\end{aligned}\]</span></p>
+<p>Linear discriminant analysis (LDA) is a specialisation of QDA: it assumes that all classes share the same covariance, i.e. <span class="math inline">\(\Sigma_i = \Sigma\)</span> for all <span class="math inline">\(i\)</span>.</p>
+<p>Gaussian Naive Bayes is a different specialisation of QDA: it assumes that all <span class="math inline">\(\Sigma_i\)</span> are diagonal, since all the features are assumed to be independent.</p>
+<h3 id="qda">QDA</h3>
+<p>We look at QDA first.</p>
+<p>We estimate <span class="math inline">\(\Sigma_i\)</span> by the sample covariance of class <span class="math inline">\(i\)</span>:</p>
+<p><span class="math display">\[\begin{aligned}
+\Sigma_i &= {1 \over m_i - 1} \sum_{j: y_j = i} \hat x^{(j)} \hat x^{(j)T},
+\end{aligned}\]</span></p>
+<p>where <span class="math inline">\(\hat x^{(j)} = x^{(j)} - \mu_{y_j}\)</span> are the centred <span class="math inline">\(x^{(j)}\)</span>. Plugging this into (0) we are done.</p>
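+<p>To make the estimation and prediction procedures concrete, here is a minimal NumPy sketch of QDA along the lines of (0). It is an illustration rather than the implementation linked below; the function names <code>qda_fit</code> and <code>qda_predict</code> are made up for this example.</p>
+<pre><code class="python">import numpy as np

def qda_fit(X, y):
    """Estimate the class means, priors and per-class covariances (Bessel-corrected)."""
    classes = np.unique(y)
    m, n = X.shape
    means, covs, priors = [], [], []
    for c in classes:
        Xi = X[y == c]
        m_i = len(Xi)
        if m_i == 1:
            raise ValueError("covariance is ill-defined with a single sample in a class")
        mu = Xi.mean(axis=0)
        Xi_centred = Xi - mu
        means.append(mu)
        covs.append(Xi_centred.T @ Xi_centred / (m_i - 1))
        priors.append(m_i / m)
    return classes, np.array(means), np.array(covs), np.array(priors)

def qda_predict(X, classes, means, covs, priors):
    """Take the argmax over classes of the log posterior (0), up to a constant."""
    scores = []
    for mu, sigma, pi in zip(means, covs, priors):
        diff = X - mu                                   # shape (m, n)
        _, logdet = np.linalg.slogdet(sigma)
        maha = np.sum(diff @ np.linalg.inv(sigma) * diff, axis=1)
        scores.append(-0.5 * logdet - 0.5 * maha + np.log(pi))
    return classes[np.argmax(np.stack(scores, axis=1), axis=1)]
</code></pre>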
+<p>There are two problems that can break the algorithm. First, if one of the <span class="math inline">\(m_i\)</span> is <span class="math inline">\(1\)</span>, then <span class="math inline">\(\Sigma_i\)</span> is ill-defined. Second, one of the <span class="math inline">\(\Sigma_i\)</span>'s might be singular.</p>
+<p>In either case, there is no way around it, and the implementation should throw an exception.</p>
+<p>This won't be a problem for LDA, though, unless there is only one sample for each class.</p>
+<h3 id="vanilla-lda">Vanilla LDA</h3>
+<p>Now let us look at LDA.</p>
+<p>Since all classes share the same covariance, we estimate <span class="math inline">\(\Sigma\)</span> using the pooled sample covariance</p>
+<p><span class="math display">\[\begin{aligned}
+\Sigma &= {1 \over m - n_c} \sum_j \hat x^{(j)} \hat x^{(j)T},
+\end{aligned}\]</span></p>
+<p>where <span class="math inline">\(\hat x^{(j)} = x^{(j)} - \mu_{y_j}\)</span> and <span class="math inline">\({1 \over m - n_c}\)</span> comes from <a href="https://en.wikipedia.org/wiki/Bessel%27s_correction">Bessel's Correction</a>.</p>
+<p>Let us write down the decision function (0). We can remove the first term on the right hand side, since all <span class="math inline">\(\Sigma_i\)</span> are the same and we only care about the argmax of that equation. Thus it becomes</p>
+<p><span class="math display">\[- {1 \over 2} (x - \mu_i)^T \Sigma^{-1} (x - \mu_i) + \log\pi_i. \qquad (1)\]</span></p>
+<p>Notice that we have just walked around the problem of computing <span class="math inline">\(\log \det \Sigma\)</span> when <span class="math inline">\(\Sigma\)</span> is singular.</p>
+<p>But how about <span class="math inline">\(\Sigma^{-1}\)</span>?</p>
+<p>We sidestep this problem by using the pseudoinverse of <span class="math inline">\(\Sigma\)</span> instead. This can be seen as applying a linear transformation to <span class="math inline">\(X\)</span> to turn its covariance matrix into the identity. And thus the model becomes a sort of nearest neighbour classifier.</p>
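+<p>A minimal sketch of this version of LDA, evaluating (1) with the Moore–Penrose pseudoinverse via <code>np.linalg.pinv</code>, might look as follows. As before, the names <code>lda_fit</code> and <code>lda_decision_function</code> are made up for this example.</p>
+<pre><code class="python">import numpy as np

def lda_fit(X, y):
    """Estimate class means, priors and the shared (pooled, Bessel-corrected) covariance."""
    classes = np.unique(y)
    m, n = X.shape
    means = np.stack([X[y == c].mean(axis=0) for c in classes])
    priors = np.array([np.mean(y == c) for c in classes])
    Xc = X - means[np.searchsorted(classes, y)]    # centre each sample by its class mean
    sigma = Xc.T @ Xc / (m - len(classes))
    return classes, means, priors, sigma

def lda_decision_function(X, means, priors, sigma):
    """Evaluate (1) for each class, using the pseudoinverse of sigma."""
    sigma_pinv = np.linalg.pinv(sigma)
    scores = []
    for mu, pi in zip(means, priors):
        diff = X - mu
        scores.append(-0.5 * np.sum(diff @ sigma_pinv * diff, axis=1) + np.log(pi))
    return np.stack(scores, axis=1)    # shape (m, n_c); the argmax along axis 1 is the prediction
</code></pre>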
+<h3 id="nearest-neighbour-classifier">Nearest neighbour classifier</h3>
+<p>More specifically, we want to transform the first term of (0) to a norm to get a classifier based on nearest neighbour modulo <span class="math inline">\(\log \pi_i\)</span>:</p>
+<p><span class="math display">\[- {1 \over 2} \|A(x - \mu_i)\|^2 + \log\pi_i\]</span></p>
+<p>To compute <span class="math inline">\(A\)</span>, we denote</p>
+<p><span class="math display">\[X_c = X - M,\]</span></p>
+<p>where the <span class="math inline">\(i\)</span>th row of <span class="math inline">\(M\)</span> is <span class="math inline">\(\mu_{y_i}^T\)</span>, the mean of the class <span class="math inline">\(x_i\)</span> belongs to, so that <span class="math inline">\(\Sigma = {1 \over m - n_c} X_c^T X_c\)</span>.</p>
+<p>Let</p>
+<p><span class="math display">\[{1 \over \sqrt{m - n_c}} X_c = U_x \Sigma_x V_x^T\]</span></p>
+<p>be the SVD of <span class="math inline">\({1 \over \sqrt{m - n_c}}X_c\)</span>. Let <span class="math inline">\(D_x = \text{diag} (s_1, ..., s_r)\)</span> be the diagonal matrix of all the nonzero singular values, and rewrite <span class="math inline">\(V_x\)</span> as the <span class="math inline">\(n \times r\)</span> matrix consisting of the first <span class="math inline">\(r\)</span> columns of <span class="math inline">\(V_x\)</span>.</p>
+<p>Then, with an abuse of notation, the pseudoinverse of <span class="math inline">\(\Sigma\)</span> is</p>
+<p><span class="math display">\[\Sigma^{-1} = V_x D_x^{-2} V_x^T.\]</span></p>
+<p>So we just need to take <span class="math inline">\(A = D_x^{-1} V_x^T\)</span>. When it comes to prediction, just transform <span class="math inline">\(x\)</span> with <span class="math inline">\(A\)</span>, find the nearest centroid <span class="math inline">\(A \mu_i\)</span> (again, modulo <span class="math inline">\(\log \pi_i\)</span>) and label the input with <span class="math inline">\(i\)</span>.</p>
+<h3 id="dimensionality-reduction">Dimensionality reduction</h3>
+<p>We can further simplify the prediction by dimensionality reduction. Assume <span class="math inline">\(n_c \le n\)</span>. Then the centroids span an affine space of dimension <span class="math inline">\(p\)</span>, which is at most <span class="math inline">\(n_c - 1\)</span>. So what we can do is to project both the transformed sample <span class="math inline">\(Ax\)</span> and the centroids <span class="math inline">\(A\mu_i\)</span> onto the linear subspace parallel to the affine space, and do the nearest neighbour classification there.</p>
+<p>So we can perform SVD on the matrix <span class="math inline">\((M - \bar x) V_x D_x^{-1}\)</span>, where <span class="math inline">\(\bar x\)</span>, a row vector, is the sample mean of all data, i.e. the average of the rows of <span class="math inline">\(X\)</span>:</p>
+<p><span class="math display">\[(M - \bar x) V_x D_x^{-1} = U_m \Sigma_m V_m^T.\]</span></p>
+<p>Again, we let <span class="math inline">\(V_m\)</span> be the <span class="math inline">\(r \times p\)</span> matrix obtained by keeping the first <span class="math inline">\(p\)</span> columns of <span class="math inline">\(V_m\)</span>.</p>
+<p>The projection operator is thus <span class="math inline">\(V_m\)</span>. And so the final transformation is <span class="math inline">\(V_m^T D_x^{-1} V_x^T\)</span>.</p>
+<p>There is no reason to stop here, and we can set <span class="math inline">\(p\)</span> even smaller, which will result in a lossy compression / regularisation equivalent to doing <a href="https://en.wikipedia.org/wiki/Principal_component_analysis">principal component analysis</a> on <span class="math inline">\((M - \bar x) V_x D_x^{-1}\)</span>.</p>
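+<p>Here is a sketch of this whole pipeline, under the same assumptions and made-up names as above: it computes <span class="math inline">\(A = D_x^{-1} V_x^T\)</span> from the SVD, the projection <span class="math inline">\(V_m\)</span>, the final transformation <span class="math inline">\(B = V_m^T D_x^{-1} V_x^T\)</span>, and then predicts by nearest centroid modulo <span class="math inline">\(\log \pi_i\)</span>.</p>
+<pre><code class="python">import numpy as np

def lda_transform_and_centroids(X, y, classes, means, priors, p=None):
    """Compute B = V_m^T D_x^{-1} V_x^T and the transformed centroids B mu_i."""
    m, n = X.shape
    n_c = len(classes)
    M = means[np.searchsorted(classes, y)]        # row j is the mean of the class sample j belongs to
    Xc = (X - M) / np.sqrt(m - n_c)
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    r = np.linalg.matrix_rank(Xc)                 # number of nonzero singular values
    A = Vt[:r] / s[:r, None]                      # A = D_x^{-1} V_x^T, shape (r, n)
    xbar = X.mean(axis=0)
    _, _, Vmt = np.linalg.svd((M - xbar) @ A.T, full_matrices=False)
    if p is None:
        p = min(r, n_c - 1)                       # dimension of the span of the centroids
    B = Vmt[:p] @ A                               # final transformation, shape (p, n)
    return B, means @ B.T                         # transformed centroids, shape (n_c, p)

def lda_predict_nearest(X_new, B, centroids_t, priors, classes):
    """Nearest transformed centroid, modulo the log priors."""
    Z = X_new @ B.T                               # shape (m_new, p)
    d2 = ((Z[:, None, :] - centroids_t[None, :, :]) ** 2).sum(axis=2)
    return classes[np.argmax(-0.5 * d2 + np.log(priors), axis=1)]
</code></pre>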
+<p>Note that as of 2019-01-04, in the <a href="https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/discriminant_analysis.py">scikit-learn implementation of LDA</a>, the prediction is done without any lossy compression, even if the parameter <code>n_components</code> is set to be smaller than the dimension of the affine space spanned by the centroids. In other words, the prediction does not change regardless of <code>n_components</code>.</p>
+<h3 id="fisher-discriminant-analysis">Fisher discriminant analysis</h3>
+<p>Fisher discriminant analysis involves finding an <span class="math inline">\(n\)</span>-dimensional vector <span class="math inline">\(a\)</span> that maximises the between-class covariance relative to the within-class covariance:</p>
+<p><span class="math display">\[{a^T M_c^T M_c a \over a^T X_c^T X_c a},\]</span></p>
+<p>where <span class="math inline">\(M_c = M - \bar x\)</span> is the centred sample mean matrix.</p>
+<p>As it turns out, this is (almost) equivalent to the derivation above, modulo a constant. In particular, <span class="math inline">\(a = c V_x D_x^{-1} V_m\)</span> with <span class="math inline">\(p = 1\)</span>, for an arbitrary constant <span class="math inline">\(c\)</span>.</p>
+<p>To see this, we can first multiply the denominator by the constant <span class="math inline">\({1 \over m - n_c}\)</span> so that the matrix in the denominator becomes the covariance estimate <span class="math inline">\(\Sigma\)</span>.</p>
+<p>We decompose <span class="math inline">\(a\)</span>: <span class="math inline">\(a = V_x D_x^{-1} b + \tilde V_x \tilde b\)</span>, where <span class="math inline">\(\tilde V_x\)</span> consists of column vectors orthogonal to the column space of <span class="math inline">\(V_x\)</span>.</p>
+<p>We ignore the second term in the decomposition. In other words, we only consider <span class="math inline">\(a\)</span> in the column space of <span class="math inline">\(V_x\)</span>.</p>
+<p>Then the problem is to find an <span class="math inline">\(r\)</span>-dimensional vector <span class="math inline">\(b\)</span> that maximises</p>
+<p><span class="math display">\[{b^T (M_c V_x D_x^{-1})^T (M_c V_x D_x^{-1}) b \over b^T b}.\]</span></p>
+<p>This is the problem of principal component analysis, and so <span class="math inline">\(b\)</span> is the first column of <span class="math inline">\(V_m\)</span>.</p>
+<p>Therefore, the solution to Fisher discriminant analysis is <span class="math inline">\(a = c V_x D_x^{-1} V_m\)</span> with <span class="math inline">\(p = 1\)</span>.</p>
+<h3 id="linear-model">Linear model</h3>
+<p>The model is called linear discriminant analysis because it is a linear model. To see this, let <span class="math inline">\(B = V_m^T D_x^{-1} V_x^T\)</span> be the matrix of transformation. Now we are comparing</p>
+<p><span class="math display">\[- {1 \over 2} \| B x - B \mu_k\|^2 + \log \pi_k\]</span></p>
+<p>across all <span class="math inline">\(k\)</span>s. Expanding the norm and removing the common term <span class="math inline">\(\|B x\|^2\)</span>, we see a linear form:</p>
+<p><span class="math display">\[\mu_k^T B^T B x - {1 \over 2} \|B \mu_k\|^2 + \log\pi_k.\]</span></p>
+<p>So, writing <span class="math inline">\(K\)</span> for the matrix whose <span class="math inline">\(k\)</span>th row is <span class="math inline">\(\mu_k^T\)</span>, the prediction for <span class="math inline">\(X_{\text{new}}\)</span> is</p>
+<p><span class="math display">\[\text{argmax}_{\text{axis}=0} \left(K B^T B X_{\text{new}}^T - {1 \over 2} \|K B^T\|_{\text{axis}=1}^2 + \log \pi\right),\]</span></p>
+<p>thus the decision boundaries are linear.</p>
+<p>This is how scikit-learn implements LDA, by inheriting from <code>LinearClassifierMixin</code> and redirecting the classification there.</p>
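+<p>In code, the reduction to a linear classifier could look like the following sketch (with the same made-up names as above, reusing <code>B</code>, <code>means</code> and <code>priors</code>); it is not how the post's repository or scikit-learn spell it, but it computes the same linear decision rule.</p>
+<pre><code class="python">import numpy as np

def lda_linear_coefficients(B, means, priors):
    """Turn the nearest-centroid rule into a linear classifier.

    Returns coef of shape (n_c, n) and intercept of shape (n_c,) such that
    the prediction is the argmax over classes of X @ coef.T + intercept.
    """
    K = means                                            # the k-th row of K is mu_k^T
    coef = K @ B.T @ B                                   # rows are mu_k^T B^T B
    intercept = -0.5 * np.sum((K @ B.T) ** 2, axis=1) + np.log(priors)
    return coef, intercept

def lda_predict_linear(X_new, coef, intercept, classes):
    """Linear decision rule; the decision boundaries are hyperplanes."""
    return classes[np.argmax(X_new @ coef.T + intercept, axis=1)]
</code></pre>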
+<h2 id="implementation">Implementation</h2>
+<p>This is where things get interesting. How do I validate my understanding of the theory? By implementing and testing the algorithm.</p>
+<p>I try to implement it as closely as possible to the natural language / mathematical descriptions of the model, which means clarity over performance.</p>
+<p>How about testing? Numerical experiments are harder to test than combinatorial / discrete algorithms in general, because the output is less verifiable by hand. My shortcut solution to this problem is to test against the output of the scikit-learn package.</p>
+<p>It turned out to be harder than expected, as I had to dig into the code of scikit-learn whenever the outputs did not match. Their code is quite well-written though.</p>
+<p>The result is <a href="https://github.com/ycpei/machine-learning/tree/master/discriminant-analysis">here</a>.</p>
+<h3 id="fun-facts-about-lda">Fun facts about LDA</h3>
+<p>One property that can be used to test the LDA implementation is the fact that the scatter matrix <span class="math inline">\(B(X - \bar x)^T (X - \bar x) B^T\)</span> of the transformed centred sample is diagonal.</p>
+<p>This can be derived using another fun fact: the sum of the in-class scatter matrix and the between-class scatter matrix is the sample scatter matrix,</p>
+<p><span class="math display">\[X_c^T X_c + M_c^T M_c = (X - \bar x)^T (X - \bar x) = (X_c + M_c)^T (X_c + M_c).\]</span></p>
+<p>The verification is not very hard and is left as an exercise.</p>
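+<p>A quick numerical check of this property, as a sketch built on the hypothetical helpers from the earlier examples:</p>
+<pre><code class="python">import numpy as np

# Check that the scatter matrix of the transformed centred sample is diagonal,
# here on random data with three classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))
y = rng.integers(0, 3, size=60)

classes, means, priors, sigma = lda_fit(X, y)                       # sketched earlier
B, _ = lda_transform_and_centroids(X, y, classes, means, priors)    # sketched earlier

Xbar = X - X.mean(axis=0)
S = B @ Xbar.T @ Xbar @ B.T
assert np.allclose(S, np.diag(np.diag(S))), "off-diagonal entries should vanish"
</code></pre>
+</body>
+</html>
+
+  </div>
+  <section id="isso-thread"></section>
+  </div>
+  </body>
+</html>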