aboutsummaryrefslogtreecommitdiff
path: root/site-from-md/posts/2019-02-14-raise-your-elbo.html
diff options
context:
space:
mode:
Diffstat (limited to 'site-from-md/posts/2019-02-14-raise-your-elbo.html')
-rw-r--r--site-from-md/posts/2019-02-14-raise-your-elbo.html562
1 files changed, 562 insertions, 0 deletions
diff --git a/site-from-md/posts/2019-02-14-raise-your-elbo.html b/site-from-md/posts/2019-02-14-raise-your-elbo.html
new file mode 100644
index 0000000..a40ede8
--- /dev/null
+++ b/site-from-md/posts/2019-02-14-raise-your-elbo.html
@@ -0,0 +1,562 @@
+<!doctype html>
+<html lang="en">
+ <head>
+ <meta charset="utf-8">
+ <title>Raise your ELBO</title>
+ <link rel="stylesheet" href="../assets/css/default.css" />
+ <script data-isso="/comments/"
+ data-isso-css="true"
+ data-isso-lang="en"
+ data-isso-reply-to-self="false"
+ data-isso-require-author="true"
+ data-isso-require-email="true"
+ data-isso-max-comments-top="10"
+ data-isso-max-comments-nested="5"
+ data-isso-reveal-on-click="5"
+ data-isso-avatar="true"
+ data-isso-avatar-bg="#f0f0f0"
+ data-isso-avatar-fg="#9abf88 #5698c4 #e279a3 #9163b6 ..."
+ data-isso-vote="true"
+ data-vote-levels=""
+ src="/comments/js/embed.min.js"></script>
+ <script src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML" type="text/javascript"></script>
+ <script src="../assets/js/analytics.js" type="text/javascript"></script>
+ </head>
+ <body>
+ <header>
+ <span class="logo">
+ <a href="../blog.html">Yuchen's Blog</a>
+ </span>
+ <nav>
+ <a href="../index.html">About</a><a href="../postlist.html">All posts</a><a href="../blog-feed.xml">Feed</a>
+ </nav>
+ </header>
+
+ <div class="main">
+ <div class="bodyitem">
+ <h2> Raise your ELBO </h2>
+ <p>Posted on 2019-02-14 | <a href="/posts/2019-02-14-raise-your-elbo.html#isso-thread">Comments</a> </p>
+ <!DOCTYPE html>
+<html xmlns="http://www.w3.org/1999/xhtml" lang="" xml:lang="">
+<head>
+ <meta charset="utf-8" />
+ <meta name="generator" content="pandoc" />
+ <meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=yes" />
+ <title>Untitled</title>
+ <style>
+ code{white-space: pre-wrap;}
+ span.smallcaps{font-variant: small-caps;}
+ span.underline{text-decoration: underline;}
+ div.column{display: inline-block; vertical-align: top; width: 50%;}
+ </style>
+ <script src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/MathJax.js?config=TeX-AMS_CHTML-full" type="text/javascript"></script>
+ <!--[if lt IE 9]>
+ <script src="//cdnjs.cloudflare.com/ajax/libs/html5shiv/3.7.3/html5shiv-printshiv.min.js"></script>
+ <![endif]-->
+</head>
+<body>
+<nav id="TOC">
+<ul>
+<li><a href="#kl-divergence-and-elbo">KL divergence and ELBO</a></li>
+<li><a href="#em">EM</a><ul>
+<li><a href="#gmm">GMM</a></li>
+<li><a href="#smm">SMM</a></li>
+<li><a href="#plsa">pLSA</a></li>
+<li><a href="#hmm">HMM</a></li>
+</ul></li>
+<li><a href="#fully-bayesian-em-mfa">Fully Bayesian EM / MFA</a><ul>
+<li><a href="#application-to-mixture-models">Application to mixture models</a></li>
+<li><a href="#fully-bayesian-gmm">Fully Bayesian GMM</a></li>
+<li><a href="#lda">LDA</a></li>
+<li><a href="#dpmm">DPMM</a></li>
+</ul></li>
+<li><a href="#svi">SVI</a></li>
+<li><a href="#aevb">AEVB</a><ul>
+<li><a href="#vae">VAE</a></li>
+<li><a href="#fully-bayesian-aevb">Fully Bayesian AEVB</a></li>
+</ul></li>
+<li><a href="#references">References</a></li>
+</ul>
+</nav>
+<p>In this post I give an introduction to variational inference, which is about maximising the evidence lower bound (ELBO).</p>
+<p>I use a top-down approach, starting with the KL divergence and the ELBO, to lay the mathematical framework of all the models in this post.</p>
+<p>Then I define mixture models and the EM algorithm, with Gaussian mixture model (GMM), probabilistic latent semantic analysis (pLSA) and the hidden markov model (HMM) as examples.</p>
+<p>After that I present the fully Bayesian version of EM, also known as mean field approximation (MFA), and apply it to fully Bayesian mixture models, with fully Bayesian GMM (also known as variational GMM), latent Dirichlet allocation (LDA) and Dirichlet process mixture model (DPMM) as examples.</p>
+<p>Then I explain stochastic variational inference, a modification of EM and MFA to improve efficiency.</p>
+<p>Finally I talk about autoencoding variational Bayes (AEVB), a Monte-Carlo + neural network approach to raising the ELBO, exemplified by the variational autoencoder (VAE). I also show its fully Bayesian version.</p>
+<p><strong>Acknowledgement</strong>. The following texts and resources were illuminating during the writing of this post: the Stanford CS228 notes (<a href="https://ermongroup.github.io/cs228-notes/inference/variational/">1</a>,<a href="https://ermongroup.github.io/cs228-notes/learning/latent/">2</a>), the <a href="https://www.cs.tau.ac.il/~rshamir/algmb/presentations/EM-BW-Ron-16%20.pdf">Tel Aviv Algorithms in Molecular Biology slides</a> (clear explanations of the connection between EM and Baum-Welch), Chapter 10 of <a href="https://www.springer.com/us/book/9780387310732">Bishop's book</a> (brilliant introduction to variational GMM), Section 2.5 of <a href="http://cs.brown.edu/~sudderth/papers/sudderthPhD.pdf">Sudderth's thesis</a> and <a href="https://metacademy.org">metacademy</a>. Also thanks to Josef Lindman Hörnlund for discussions. The research was done while working at KTH mathematics department.</p>
+<p><em>If you are reading on a mobile device, you may need to "request desktop site" for the equations to be properly displayed. This post is licensed under CC BY-SA and GNU FDL.</em></p>
+<h2 id="kl-divergence-and-elbo">KL divergence and ELBO</h2>
+<p>Let <span class="math inline">\(p\)</span> and <span class="math inline">\(q\)</span> be two probability measures. The Kullback-Leibler (KL) divergence is defined as</p>
+<p><span class="math display">\[D(q||p) = E_q \log{q \over p}.\]</span></p>
+<p>It achieves minimum <span class="math inline">\(0\)</span> when <span class="math inline">\(p = q\)</span>.</p>
+<p>If <span class="math inline">\(p\)</span> can be further written as</p>
+<p><span class="math display">\[p(x) = {w(x) \over Z}, \qquad (0)\]</span></p>
+<p>where <span class="math inline">\(Z\)</span> is a normaliser, then</p>
+<p><span class="math display">\[\log Z = D(q||p) + L(w, q), \qquad(1)\]</span></p>
+<p>where <span class="math inline">\(L(w, q)\)</span> is called the evidence lower bound (ELBO), defined by</p>
+<p><span class="math display">\[L(w, q) = E_q \log{w \over q}. \qquad (1.25)\]</span></p>
+<p>From (1), we see that to minimise the nonnegative term <span class="math inline">\(D(q || p)\)</span>, one can maximise the ELBO.</p>
+<p>To this end, we can simply discard <span class="math inline">\(D(q || p)\)</span> in (1) and obtain:</p>
+<p><span class="math display">\[\log Z \ge L(w, q) \qquad (1.3)\]</span></p>
+<p>and keep in mind that the inequality becomes an equality when <span class="math inline">\(q = {w \over Z}\)</span>.</p>
+<p>It is time to define the task of variational inference (VI), also known as variational Bayes (VB).</p>
+<p><strong>Definition</strong>. Variational inference is concerned with maximising the ELBO <span class="math inline">\(L(w, q)\)</span>.</p>
+<p>There are mainly two versions of VI, the half Bayesian and the fully Bayesian cases. Half Bayesian VI, to which expectation-maximisation algorithms (EM) apply, instantiates (1.3) with</p>
+<p><span class="math display">\[\begin{aligned}
+Z &amp;= p(x; \theta)\\
+w &amp;= p(x, z; \theta)\\
+q &amp;= q(z)
+\end{aligned}\]</span></p>
+<p>and the dummy variable <span class="math inline">\(x\)</span> in Equation (0) is substituted with <span class="math inline">\(z\)</span>.</p>
+<p>Fully Bayesian VI, often just called VI, has the following instantiations:</p>
+<p><span class="math display">\[\begin{aligned}
+Z &amp;= p(x) \\
+w &amp;= p(x, z, \theta) \\
+q &amp;= q(z, \theta)
+\end{aligned}\]</span></p>
+<p>and <span class="math inline">\(x\)</span> in Equation (0) is substituted with <span class="math inline">\((z, \theta)\)</span>.</p>
+<p>In both cases <span class="math inline">\(\theta\)</span> are parameters and <span class="math inline">\(z\)</span> are latent variables.</p>
+<p><strong>Remark on the naming of things</strong>. The term "variational" comes from the fact that we perform calculus of variations: maximise some functional (<span class="math inline">\(L(w, q)\)</span>) over a set of functions (<span class="math inline">\(q\)</span>). Note however, most of the VI / VB algorithms do not concern any techniques in calculus of variations, but only uses Jensen's inequality / the fact the <span class="math inline">\(D(q||p)\)</span> reaches minimum when <span class="math inline">\(p = q\)</span>. Due to this reasoning of the naming, EM is also a kind of VI, even though in the literature VI often referes to its fully Bayesian version.</p>
+<h2 id="em">EM</h2>
+<p>To illustrate the EM algorithms, we first define the mixture model.</p>
+<p><strong>Definition (mixture model)</strong>. Given dataset <span class="math inline">\(x_{1 : m}\)</span>, we assume the data has some underlying latent variable <span class="math inline">\(z_{1 : m}\)</span> that may take a value from a finite set <span class="math inline">\(\{1, 2, ..., n_z\}\)</span>. Let <span class="math inline">\(p(z_{i}; \pi)\)</span> be categorically distributed according to the probability vector <span class="math inline">\(\pi\)</span>. That is, <span class="math inline">\(p(z_{i} = k; \pi) = \pi_k\)</span>. Also assume <span class="math inline">\(p(x_{i} | z_{i} = k; \eta) = p(x_{i}; \eta_k)\)</span>. Find <span class="math inline">\(\theta = (\pi, \eta)\)</span> that maximises the likelihood <span class="math inline">\(p(x_{1 : m}; \theta)\)</span>.</p>
+<p>Represented as a DAG (a.k.a the plate notations), the model looks like this:</p>
+<p><img src="/assets/resources/mixture-model.png" style="width:250px" /></p>
+<p>where the boxes with <span class="math inline">\(m\)</span> mean repitition for <span class="math inline">\(m\)</span> times, since there <span class="math inline">\(m\)</span> indepdent pairs of <span class="math inline">\((x, z)\)</span>, and the same goes for <span class="math inline">\(\eta\)</span>.</p>
+<p>The direct maximisation</p>
+<p><span class="math display">\[\max_\theta \sum_i \log p(x_{i}; \theta) = \max_\theta \sum_i \log \int p(x_{i} | z_i; \theta) p(z_i; \theta) dz_i\]</span></p>
+<p>is hard because of the integral in the log.</p>
+<p>We can fit this problem in (1.3) by having <span class="math inline">\(Z = p(x_{1 : m}; \theta)\)</span> and <span class="math inline">\(w = p(z_{1 : m}, x_{1 : m}; \theta)\)</span>. The plan is to update <span class="math inline">\(\theta\)</span> repeatedly so that <span class="math inline">\(L(p(z, x; \theta_t), q(z))\)</span> is non decreasing over time <span class="math inline">\(t\)</span>.</p>
+<p>Equation (1.3) at time <span class="math inline">\(t\)</span> for the <span class="math inline">\(i\)</span>th datapoint is</p>
+<p><span class="math display">\[\log p(x_{i}; \theta_t) \ge L(p(z_i, x_{i}; \theta_t), q(z_i)) \qquad (2)\]</span></p>
+<p>Each timestep consists of two steps, the E-step and the M-step.</p>
+<p>At E-step, we set</p>
+<p><span class="math display">\[q(z_{i}) = p(z_{i}|x_{i}; \theta_t), \]</span></p>
+<p>to turn the inequality into equality. We denote <span class="math inline">\(r_{ik} = q(z_i = k)\)</span> and call them responsibilities, so the posterior <span class="math inline">\(q(z_i)\)</span> is categorical distribution with parameter <span class="math inline">\(r_i = r_{i, 1 : n_z}\)</span>.</p>
+<p>At M-step, we maximise <span class="math inline">\(\sum_i L(p(x_{i}, z_{i}; \theta), q(z_{i}))\)</span> over <span class="math inline">\(\theta\)</span>:</p>
+<p><span class="math display">\[\begin{aligned}
+\theta_{t + 1} &amp;= \text{argmax}_\theta \sum_i L(p(x_{i}, z_{i}; \theta), p(z_{i} | x_{i}; \theta_t)) \\
+&amp;= \text{argmax}_\theta \sum_i \mathbb E_{p(z_{i} | x_{i}; \theta_t)} \log p(x_{i}, z_{i}; \theta) \qquad (2.3)
+\end{aligned}\]</span></p>
+<p>So <span class="math inline">\(\sum_i L(p(x_{i}, z_{i}; \theta), q(z_i))\)</span> is non-decreasing at both the E-step and the M-step.</p>
+<p>We can see from this derivation that EM is half-Bayesian. The E-step is Bayesian it computes the posterior of the latent variables and the M-step is frequentist because it performs maximum likelihood estimate of <span class="math inline">\(\theta\)</span>.</p>
+<p>It is clear that the ELBO sum coverges as it is nondecreasing with an upper bound, but it is not clear whether the sum converges to the correct value, i.e. <span class="math inline">\(\max_\theta p(x_{1 : m}; \theta)\)</span>. In fact it is said that the EM does get stuck in local maximum sometimes.</p>
+<p>A different way of describing EM, which will be useful in hidden Markov model is:</p>
+<ul>
+<li><p>At E-step, one writes down the formula <span class="math display">\[\sum_i \mathbb E_{p(z_i | x_{i}; \theta_t)} \log p(x_{i}, z_i; \theta). \qquad (2.5)\]</span></p></li>
+<li><p>At M-setp, one finds <span class="math inline">\(\theta_{t + 1}\)</span> to be the <span class="math inline">\(\theta\)</span> that maximises the above formula.</p></li>
+</ul>
+<h3 id="gmm">GMM</h3>
+<p>Gaussian mixture model (GMM) is an example of mixture models.</p>
+<p>The space of the data is <span class="math inline">\(\mathbb R^n\)</span>. We use the hypothesis that the data is Gaussian conditioned on the latent variable:</p>
+<p><span class="math display">\[(x_i; \eta_k) \sim N(\mu_k, \Sigma_k),\]</span></p>
+<p>so we write <span class="math inline">\(\eta_k = (\mu_k, \Sigma_k)\)</span>,</p>
+<p>During E-step, the <span class="math inline">\(q(z_i)\)</span> can be directly computed using Bayes’ theorem:</p>
+<p><span class="math display">\[r_{ik} = q(z_i = k) = \mathbb P(z_i = k | x_{i}; \theta_t)
+= {g_{\mu_{t, k}, \Sigma_{t, k}} (x_{i}) \pi_{t, k} \over \sum_{j = 1 : n_z} g_{\mu_{t, j}, \Sigma_{t, j}} (x_{i}) \pi_{t, j}},\]</span></p>
+<p>where <span class="math inline">\(g_{\mu, \Sigma} (x) = (2 \pi)^{- n / 2} \det \Sigma^{-1 / 2} \exp(- {1 \over 2} (x - \mu)^T \Sigma^{-1} (x - \mu))\)</span> is the pdf of the Gaussian distribution <span class="math inline">\(N(\mu, \Sigma)\)</span>.</p>
+<p>During M-step, we need to compute</p>
+<p><span class="math display">\[\text{argmax}_{\Sigma, \mu, \pi} \sum_{i = 1 : m} \sum_{k = 1 : n_z} r_{ik} \log (g_{\mu_k, \Sigma_k}(x_{i}) \pi_k).\]</span></p>
+<p>This is similar to the quadratic discriminant analysis, and the solution is</p>
+<p><span class="math display">\[\begin{aligned}
+\pi_{k} &amp;= {1 \over m} \sum_{i = 1 : m} r_{ik}, \\
+\mu_{k} &amp;= {\sum_i r_{ik} x_{i} \over \sum_i r_{ik}}, \\
+\Sigma_{k} &amp;= {\sum_i r_{ik} (x_{i} - \mu_{t, k}) (x_{i} - \mu_{t, k})^T \over \sum_i r_{ik}}.
+\end{aligned}\]</span></p>
+<p><strong>Remark</strong>. The k-means algorithm is the <span class="math inline">\(\epsilon \to 0\)</span> limit of the GMM with constraints <span class="math inline">\(\Sigma_k = \epsilon I\)</span>. See Section 9.3.2 of Bishop 2006 for derivation. It is also briefly mentioned there that a variant in this setting where the covariance matrix is not restricted to <span class="math inline">\(\epsilon I\)</span> is called elliptical k-means algorithm.</p>
+<h3 id="smm">SMM</h3>
+<p>As a transition to the next models to study, let us consider a simpler mixture model obtained by making one modification to GMM: change <span class="math inline">\((x; \eta_k) \sim N(\mu_k, \Sigma_k)\)</span> to <span class="math inline">\(\mathbb P(x = w; \eta_k) = \eta_{kw}\)</span> where <span class="math inline">\(\eta\)</span> is a stochastic matrix and <span class="math inline">\(w\)</span> is an arbitrary element of the space for <span class="math inline">\(x\)</span>. So now the space for both <span class="math inline">\(x\)</span> and <span class="math inline">\(z\)</span> are finite. We call this model the simple mixture model (SMM).</p>
+<p>As in GMM, at E-step <span class="math inline">\(r_{ik}\)</span> can be explicitly computed using Bayes' theorem.</p>
+<p>It is not hard to write down the solution to the M-step in this case:</p>
+<p><span class="math display">\[\begin{aligned}
+\pi_{k} &amp;= {1 \over m} \sum_i r_{ik}, \qquad (2.7)\\
+\eta_{k, w} &amp;= {\sum_i r_{ik} 1_{x_i = w} \over \sum_i r_{ik}}. \qquad (2.8)
+\end{aligned}\]</span></p>
+<p>where <span class="math inline">\(1_{x_i = w}\)</span> is the <a href="https://en.wikipedia.org/wiki/Indicator_function">indicator function</a>, and evaluates to <span class="math inline">\(1\)</span> if <span class="math inline">\(x_i = w\)</span> and <span class="math inline">\(0\)</span> otherwise.</p>
+<p>Two trivial variants of the SMM are the two versions of probabilistic latent semantic analysis (pLSA), which we call pLSA1 and pLSA2.</p>
+<p>The model pLSA1 is a probabilistic version of latent semantic analysis, which is basically a simple matrix factorisation model in collaborative filtering, whereas pLSA2 has a fully Bayesian version called latent Dirichlet allocation (LDA), not to be confused with the other LDA (linear discriminant analysis).</p>
+<h3 id="plsa">pLSA</h3>
+<p>The pLSA model (Hoffman 2000) is a mixture model, where the dataset is now pairs <span class="math inline">\((d_i, x_i)_{i = 1 : m}\)</span>. In natural language processing, <span class="math inline">\(x\)</span> are words and <span class="math inline">\(d\)</span> are documents, and a pair <span class="math inline">\((d, x)\)</span> represent an ocurrance of word <span class="math inline">\(x\)</span> in document <span class="math inline">\(d\)</span>.</p>
+<p>For each datapoint <span class="math inline">\((d_{i}, x_{i})\)</span>,</p>
+<p><span class="math display">\[\begin{aligned}
+p(d_i, x_i; \theta) &amp;= \sum_{z_i} p(z_i; \theta) p(d_i | z_i; \theta) p(x_i | z_i; \theta) \qquad (2.91)\\
+&amp;= p(d_i; \theta) \sum_z p(x_i | z_i; \theta) p (z_i | d_i; \theta) \qquad (2.92).
+\end{aligned}\]</span></p>
+<p>Of the two formulations, (2.91) corresponds to pLSA type 1, and (2.92) corresponds to type 2.</p>
+<h4 id="plsa1">pLSA1</h4>
+<p>The pLSA1 model (Hoffman 2000) is basically SMM with <span class="math inline">\(x_i\)</span> substituted with <span class="math inline">\((d_i, x_i)\)</span>, which conditioned on <span class="math inline">\(z_i\)</span> are independently categorically distributed:</p>
+<p><span class="math display">\[p(d_i = u, x_i = w | z_i = k; \theta) = p(d_i ; \xi_k) p(x_i; \eta_k) = \xi_{ku} \eta_{kw}.\]</span></p>
+<p>The model can be illustrated in the plate notations:</p>
+<p><img src="/assets/resources/plsa1.png" style="width:350px" /></p>
+<p>So the solution of the M-step is</p>
+<p><span class="math display">\[\begin{aligned}
+\pi_{k} &amp;= {1 \over m} \sum_i r_{ik} \\
+\xi_{k, u} &amp;= {\sum_i r_{ik} 1_{d_{i} = u} \over \sum_i r_{ik}} \\
+\eta_{k, w} &amp;= {\sum_i r_{ik} 1_{x_{i} = w} \over \sum_i r_{ik}}.
+\end{aligned}\]</span></p>
+<p><strong>Remark</strong>. pLSA1 is the probabilistic version of LSA, also known as matrix factorisation.</p>
+<p>Let <span class="math inline">\(n_d\)</span> and <span class="math inline">\(n_x\)</span> be the number of values <span class="math inline">\(d_i\)</span> and <span class="math inline">\(x_i\)</span> can take.</p>
+<p><strong>Problem</strong> (LSA). Let <span class="math inline">\(R\)</span> be a <span class="math inline">\(n_d \times n_x\)</span> matrix, fix <span class="math inline">\(s \le \min\{n_d, n_x\}\)</span>. Find <span class="math inline">\(n_d \times s\)</span> matrix <span class="math inline">\(D\)</span> and <span class="math inline">\(n_x \times s\)</span> matrix <span class="math inline">\(X\)</span> that minimises</p>
+<p><span class="math display">\[J(D, X) = \|R - D X^T\|_F.\]</span></p>
+<p>where <span class="math inline">\(\|\cdot\|_F\)</span> is the Frobenius norm.</p>
+<p><strong>Claim</strong>. Let <span class="math inline">\(R = U \Sigma V^T\)</span> be the SVD of <span class="math inline">\(R\)</span>, then the solution to the above problem is <span class="math inline">\(D = U_s \Sigma_s^{{1 \over 2}}\)</span> and <span class="math inline">\(X = V_s \Sigma_s^{{1 \over 2}}\)</span>, where <span class="math inline">\(U_s\)</span> (resp. <span class="math inline">\(V_s\)</span>) is the matrix of the first <span class="math inline">\(s\)</span> columns of <span class="math inline">\(U\)</span> (resp. <span class="math inline">\(V\)</span>) and <span class="math inline">\(\Sigma_s\)</span> is the <span class="math inline">\(s \times s\)</span> submatrix of <span class="math inline">\(\Sigma\)</span>.</p>
+<p>One can compare pLSA1 with LSA. Both procedures produce embeddings of <span class="math inline">\(d\)</span> and <span class="math inline">\(x\)</span>: in pLSA we obtain <span class="math inline">\(n_z\)</span> dimensional embeddings <span class="math inline">\(\xi_{\cdot, u}\)</span> and <span class="math inline">\(\eta_{\cdot, w}\)</span>, whereas in LSA we obtain <span class="math inline">\(s\)</span> dimensional embeddings <span class="math inline">\(D_{u, \cdot}\)</span> and <span class="math inline">\(X_{w, \cdot}\)</span>.</p>
+<h4 id="plsa2">pLSA2</h4>
+<p>Let us turn to pLSA2 (Hoffman 2004), corresponding to (2.92). We rewrite it as</p>
+<p><span class="math display">\[p(x_i | d_i; \theta) = \sum_{z_i} p(x_i | z_i; \theta) p(z_i | d_i; \theta).\]</span></p>
+<p>To simplify notations, we collect all the <span class="math inline">\(x_i\)</span>s with the corresponding <span class="math inline">\(d_i\)</span> equal to 1 (suppose there are <span class="math inline">\(m_1\)</span> of them), and write them as <span class="math inline">\((x_{1, j})_{j = 1 : m_1}\)</span>. In the same fashion we construct <span class="math inline">\(x_{2, 1 : m_2}, x_{3, 1 : m_3}, ... x_{n_d, 1 : m_{n_d}}\)</span>. Similarly, we relabel the corresponding <span class="math inline">\(d_i\)</span> and <span class="math inline">\(z_i\)</span> accordingly.</p>
+<p>With almost no loss of generality, we assume all <span class="math inline">\(m_\ell\)</span>s are equal and write them as <span class="math inline">\(m\)</span>.</p>
+<p>Now the model becomes</p>
+<p><span class="math display">\[p(x_{\ell, i} | d_{\ell, i} = \ell; \theta) = \sum_k p(x_{\ell, i} | z_{\ell, i} = k; \theta) p(z_{\ell, i} = k | d_{\ell, i} = \ell; \theta).\]</span></p>
+<p>Since we have regrouped the <span class="math inline">\(x\)</span>’s and <span class="math inline">\(z\)</span>’s whose indices record the values of the <span class="math inline">\(d\)</span>’s, we can remove the <span class="math inline">\(d\)</span>’s from the equation altogether:</p>
+<p><span class="math display">\[p(x_{\ell, i}; \theta) = \sum_k p(x_{\ell, i} | z_{\ell, i} = k; \theta) p(z_{\ell, i} = k; \theta).\]</span></p>
+<p>It is effectively a modification of SMM by making <span class="math inline">\(n_d\)</span> copies of <span class="math inline">\(\pi\)</span>. More specifically the parameters are <span class="math inline">\(\theta = (\pi_{1 : n_d, 1 : n_z}, \eta_{1 : n_z, 1 : n_x})\)</span>, where we model <span class="math inline">\((z | d = \ell) \sim \text{Cat}(\pi_{\ell, \cdot})\)</span> and, as in pLSA1, <span class="math inline">\((x | z = k) \sim \text{Cat}(\eta_{k, \cdot})\)</span>.</p>
+<p>Illustrated in the plate notations, pLSA2 is:</p>
+<p><img src="/assets/resources/plsa2.png" style="width:350px" /></p>
+<p>The computation is basically adding an index <span class="math inline">\(\ell\)</span> to the computation of SMM wherever applicable.</p>
+<p>The updates at the E-step is</p>
+<p><span class="math display">\[r_{\ell i k} = p(z_{\ell i} = k | x_{\ell i}; \theta) \propto \pi_{\ell k} \eta_{k, x_{\ell i}}.\]</span></p>
+<p>And at the M-step</p>
+<p><span class="math display">\[\begin{aligned}
+\pi_{\ell k} &amp;= {1 \over m} \sum_i r_{\ell i k} \\
+\eta_{k w} &amp;= {\sum_{\ell, i} r_{\ell i k} 1_{x_{\ell i} = w} \over \sum_{\ell, i} r_{\ell i k}}.
+\end{aligned}\]</span></p>
+<h3 id="hmm">HMM</h3>
+<p>The hidden markov model (HMM) is a sequential version of SMM, in the same sense that recurrent neural networks are sequential versions of feed-forward neural networks.</p>
+<p>HMM is an example where the posterior <span class="math inline">\(p(z_i | x_i; \theta)\)</span> is not easy to compute, and one has to utilise properties of the underlying Bayesian network to go around it.</p>
+<p>Now each sample is a sequence <span class="math inline">\(x_i = (x_{ij})_{j = 1 : T}\)</span>, and so are the latent variables <span class="math inline">\(z_i = (z_{ij})_{j = 1 : T}\)</span>.</p>
+<p>The latent variables are assumed to form a Markov chain with transition matrix <span class="math inline">\((\xi_{k \ell})_{k \ell}\)</span>, and <span class="math inline">\(x_{ij}\)</span> is completely dependent on <span class="math inline">\(z_{ij}\)</span>:</p>
+<p><span class="math display">\[\begin{aligned}
+p(z_{ij} | z_{i, j - 1}) &amp;= \xi_{z_{i, j - 1}, z_{ij}},\\
+p(x_{ij} | z_{ij}) &amp;= \eta_{z_{ij}, x_{ij}}.
+\end{aligned}\]</span></p>
+<p>Also, the distribution of <span class="math inline">\(z_{i1}\)</span> is again categorical with parameter <span class="math inline">\(\pi\)</span>:</p>
+<p><span class="math display">\[p(z_{i1}) = \pi_{z_{i1}}\]</span></p>
+<p>So the parameters are <span class="math inline">\(\theta = (\pi, \xi, \eta)\)</span>. And HMM can be shown in plate notations as:</p>
+<p><img src="/assets/resources/hmm.png" style="width:350px" /></p>
+<p>Now we apply EM to HMM, which is called the <a href="https://en.wikipedia.org/wiki/Baum%E2%80%93Welch_algorithm">Baum-Welch algorithm</a>. Unlike the previous examples, it is too messy to compute <span class="math inline">\(p(z_i | x_{i}; \theta)\)</span>, so during the E-step we instead write down formula (2.5) directly in hope of simplifying it:</p>
+<p><span class="math display">\[\begin{aligned}
+\mathbb E_{p(z_i | x_i; \theta_t)} \log p(x_i, z_i; \theta_t) &amp;=\mathbb E_{p(z_i | x_i; \theta_t)} \left(\log \pi_{z_{i1}} + \sum_{j = 2 : T} \log \xi_{z_{i, j - 1}, z_{ij}} + \sum_{j = 1 : T} \log \eta_{z_{ij}, x_{ij}}\right). \qquad (3)
+\end{aligned}\]</span></p>
+<p>Let us compute the summand in second term:</p>
+<p><span class="math display">\[\begin{aligned}
+\mathbb E_{p(z_i | x_{i}; \theta_t)} \log \xi_{z_{i, j - 1}, z_{ij}} &amp;= \sum_{k, \ell} (\log \xi_{k, \ell}) \mathbb E_{p(z_{i} | x_{i}; \theta_t)} 1_{z_{i, j - 1} = k, z_{i, j} = \ell} \\
+&amp;= \sum_{k, \ell} p(z_{i, j - 1} = k, z_{ij} = \ell | x_{i}; \theta_t) \log \xi_{k, \ell}. \qquad (4)
+\end{aligned}\]</span></p>
+<p>Similarly, one can write down the first term and the summand in the third term to obtain</p>
+<p><span class="math display">\[\begin{aligned}
+\mathbb E_{p(z_i | x_{i}; \theta_t)} \log \pi_{z_{i1}} &amp;= \sum_k p(z_{i1} = k | x_{i}; \theta_t), \qquad (5) \\
+\mathbb E_{p(z_i | x_{i}; \theta_t)} \log \eta_{z_{i, j}, x_{i, j}} &amp;= \sum_{k, w} 1_{x_{ij} = w} p(z_{i, j} = k | x_i; \theta_t) \log \eta_{k, w}. \qquad (6)
+\end{aligned}\]</span></p>
+<p>plugging (4)(5)(6) back into (3) and summing over <span class="math inline">\(j\)</span>, we obtain the formula to maximise over <span class="math inline">\(\theta\)</span> on:</p>
+<p><span class="math display">\[\sum_k \sum_i r_{i1k} \log \pi_k + \sum_{k, \ell} \sum_{j = 2 : T, i} s_{ijk\ell} \log \xi_{k, \ell} + \sum_{k, w} \sum_{j = 1 : T, i} r_{ijk} 1_{x_{ij} = w} \log \eta_{k, w},\]</span></p>
+<p>where</p>
+<p><span class="math display">\[\begin{aligned}
+r_{ijk} &amp;:= p(z_{ij} = k | x_{i}; \theta_t), \\
+s_{ijk\ell} &amp;:= p(z_{i, j - 1} = k, z_{ij} = \ell | x_{i}; \theta_t).
+\end{aligned}\]</span></p>
+<p>Now we proceed to the M-step. Since each of the <span class="math inline">\(\pi_k, \xi_{k, \ell}, \eta_{k, w}\)</span> is nicely confined in the inner sum of each term, together with the constraint <span class="math inline">\(\sum_k \pi_k = \sum_\ell \xi_{k, \ell} = \sum_w \eta_{k, w} = 1\)</span> it is not hard to find the argmax at time <span class="math inline">\(t + 1\)</span> (the same way one finds the MLE for any categorical distribution):</p>
+<p><span class="math display">\[\begin{aligned}
+\pi_{k} &amp;= {1 \over m} \sum_i r_{i1k}, \qquad (6.1) \\
+\xi_{k, \ell} &amp;= {\sum_{j = 2 : T, i} s_{ijk\ell} \over \sum_{j = 1 : T - 1, i} r_{ijk}}, \qquad(6.2) \\
+\eta_{k, w} &amp;= {\sum_{ij} 1_{x_{ij} = w} r_{ijk} \over \sum_{ij} r_{ijk}}. \qquad(6.3)
+\end{aligned}\]</span></p>
+<p>Note that (6.1)(6.3) are almost identical to (2.7)(2.8). This makes sense as the only modification HMM makes over SMM is the added dependencies between the latent variables.</p>
+<p>What remains is to compute <span class="math inline">\(r\)</span> and <span class="math inline">\(s\)</span>.</p>
+<p>This is done by using the forward and backward procedures which takes advantage of the conditioned independence / topology of the underlying Bayesian network. It is out of scope of this post, but for the sake of completeness I include it here.</p>
+<p>Let</p>
+<p><span class="math display">\[\begin{aligned}
+\alpha_k(i, j) &amp;:= p(x_{i, 1 : j}, z_{ij} = k; \theta_t), \\
+\beta_k(i, j) &amp;:= p(x_{i, j + 1 : T} | z_{ij} = k; \theta_t).
+\end{aligned}\]</span></p>
+<p>They can be computed recursively as</p>
+<p><span class="math display">\[\begin{aligned}
+\alpha_k(i, j) &amp;= \begin{cases}
+\eta_{k, x_{1j}} \pi_k, &amp; j = 1; \\
+\eta_{k, x_{ij}} \sum_\ell \alpha_\ell(j - 1, i) \xi_{k\ell}, &amp; j \ge 2.
+\end{cases}\\
+\beta_k(i, j) &amp;= \begin{cases}
+1, &amp; j = T;\\
+\sum_\ell \xi_{k\ell} \beta_\ell(j + 1, i) \eta_{\ell, x_{i, j + 1}}, &amp; j &lt; T.
+\end{cases}
+\end{aligned}\]</span></p>
+<p>Then</p>
+<p><span class="math display">\[\begin{aligned}
+p(z_{ij} = k, x_{i}; \theta_t) &amp;= \alpha_k(i, j) \beta_k(i, j), \qquad (7)\\
+p(x_{i}; \theta_t) &amp;= \sum_k \alpha_k(i, j) \beta_k(i, j),\forall j = 1 : T \qquad (8)\\
+p(z_{i, j - 1} = k, z_{i, j} = \ell, x_{i}; \theta_t) &amp;= \alpha_k(i, j) \xi_{k\ell} \beta_\ell(i, j + 1) \eta_{\ell, x_{j + 1, i}}. \qquad (9)
+\end{aligned}\]</span></p>
+<p>And this yields <span class="math inline">\(r_{ijk}\)</span> and <span class="math inline">\(s_{ijk\ell}\)</span> since they can be computed as <span class="math inline">\({(7) \over (8)}\)</span> and <span class="math inline">\({(9) \over (8)}\)</span> respectively.</p>
+<h2 id="fully-bayesian-em-mfa">Fully Bayesian EM / MFA</h2>
+<p>Let us now venture into the realm of full Bayesian.</p>
+<p>In EM we aim to maximise the ELBO</p>
+<p><span class="math display">\[\int q(z) \log {p(x, z; \theta) \over q(z)} dz d\theta\]</span></p>
+<p>alternately over <span class="math inline">\(q\)</span> and <span class="math inline">\(\theta\)</span>. As mentioned before, the E-step of maximising over <span class="math inline">\(q\)</span> is Bayesian, in that it computes the posterior of <span class="math inline">\(z\)</span>, whereas the M-step of maximising over <span class="math inline">\(\theta\)</span> is maximum likelihood and frequentist.</p>
+<p>The fully Bayesian EM makes the M-step Bayesian by making <span class="math inline">\(\theta\)</span> a random variable, so the ELBO becomes</p>
+<p><span class="math display">\[L(p(x, z, \theta), q(z, \theta)) = \int q(z, \theta) \log {p(x, z, \theta) \over q(z, \theta)} dz d\theta\]</span></p>
+<p>We further assume <span class="math inline">\(q\)</span> can be factorised into distributions on <span class="math inline">\(z\)</span> and <span class="math inline">\(\theta\)</span>: <span class="math inline">\(q(z, \theta) = q_1(z) q_2(\theta)\)</span>. So the above formula is rewritten as</p>
+<p><span class="math display">\[L(p(x, z, \theta), q(z, \theta)) = \int q_1(z) q_2(\theta) \log {p(x, z, \theta) \over q_1(z) q_2(\theta)} dz d\theta\]</span></p>
+<p>To find argmax over <span class="math inline">\(q_1\)</span>, we rewrite</p>
+<p><span class="math display">\[\begin{aligned}
+L(p(x, z, \theta), q(z, \theta)) &amp;= \int q_1(z) \left(\int q_2(\theta) \log p(x, z, \theta) d\theta\right) dz - \int q_1(z) \log q_1(z) dz - \int q_2(\theta) \log q_2(\theta) \\&amp;= - D(q_1(z) || p_x(z)) + C,
+\end{aligned}\]</span></p>
+<p>where <span class="math inline">\(p_x\)</span> is a density in <span class="math inline">\(z\)</span> with</p>
+<p><span class="math display">\[\log p_x(z) = \mathbb E_{q_2(\theta)} \log p(x, z, \theta) + C.\]</span></p>
+<p>So the <span class="math inline">\(q_1\)</span> that maximises the ELBO is <span class="math inline">\(q_1^* = p_x\)</span>.</p>
+<p>Similarly, the optimal <span class="math inline">\(q_2\)</span> is such that</p>
+<p><span class="math display">\[\log q_2^*(\theta) = \mathbb E_{q_1(z)} \log p(x, z, \theta) + C.\]</span></p>
+<p>The fully Bayesian EM thus alternately evaluates <span class="math inline">\(q_1^*\)</span> (E-step) and <span class="math inline">\(q_2^*\)</span> (M-step).</p>
+<p>It is also called mean field approximation (MFA), and can be easily generalised to models with more than two groups of latent variables, see e.g. Section 10.1 of Bishop 2006.</p>
+<h3 id="application-to-mixture-models">Application to mixture models</h3>
+<p><strong>Definition (Fully Bayesian mixture model)</strong>. The relations between <span class="math inline">\(\pi\)</span>, <span class="math inline">\(\eta\)</span>, <span class="math inline">\(x\)</span>, <span class="math inline">\(z\)</span> are the same as in the definition of mixture models. Furthermore, we assume the distribution of <span class="math inline">\((x | \eta_k)\)</span> belongs to the <a href="https://en.wikipedia.org/wiki/Exponential_family">exponential family</a> (the definition of the exponential family is briefly touched at the end of this section). But now both <span class="math inline">\(\pi\)</span> and <span class="math inline">\(\eta\)</span> are random variables. Let the prior distribution <span class="math inline">\(p(\pi)\)</span> is Dirichlet with parameter <span class="math inline">\((\alpha, \alpha, ..., \alpha)\)</span>. Let the prior <span class="math inline">\(p(\eta_k)\)</span> be the conjugate prior of <span class="math inline">\((x | \eta_k)\)</span>, with parameter <span class="math inline">\(\beta\)</span>, we will see later in this section that the posterior <span class="math inline">\(q(\eta_k)\)</span> belongs to the same family as <span class="math inline">\(p(\eta_k)\)</span>. Represented in a plate notations, a fully Bayesian mixture model looks like:</p>
+<p><img src="/assets/resources/fully-bayesian-mm.png" style="width:450px" /></p>
+<p>Given this structure we can write down the mean-field approximation:</p>
+<p><span class="math display">\[\log q(z) = \mathbb E_{q(\eta)q(\pi)} (\log(x | z, \eta) + \log(z | \pi)) + C.\]</span></p>
+<p>Both sides can be factored into per-sample expressions, giving us</p>
+<p><span class="math display">\[\log q(z_i) = \mathbb E_{q(\eta)} \log p(x_i | z_i, \eta) + \mathbb E_{q(\pi)} \log p(z_i | \pi) + C\]</span></p>
+<p>Therefore</p>
+<p><span class="math display">\[\log r_{ik} = \log q(z_i = k) = \mathbb E_{q(\eta_k)} \log p(x_i | \eta_k) + \mathbb E_{q(\pi)} \log \pi_k + C. \qquad (9.1)\]</span></p>
+<p>So the posterior of each <span class="math inline">\(z_i\)</span> is categorical regardless of the <span class="math inline">\(p\)</span>s and <span class="math inline">\(q\)</span>s.</p>
+<p>Computing the posterior of <span class="math inline">\(\pi\)</span> and <span class="math inline">\(\eta\)</span>:</p>
+<p><span class="math display">\[\log q(\pi) + \log q(\eta) = \log p(\pi) + \log p(\eta) + \sum_i \mathbb E_{q(z_i)} p(x_i | z_i, \eta) + \sum_i \mathbb E_{q(z_i)} p(z_i | \pi) + C.\]</span></p>
+<p>So we can separate the terms involving <span class="math inline">\(\pi\)</span> and those involving <span class="math inline">\(\eta\)</span>. First compute the posterior of <span class="math inline">\(\pi\)</span>:</p>
+<p><span class="math display">\[\log q(\pi) = \log p(\pi) + \sum_i \mathbb E_{q(z_i)} \log p(z_i | \pi) = \log p(\pi) + \sum_i \sum_k r_{ik} \log \pi_k + C.\]</span></p>
+<p>The right hand side is the log of a Dirichlet modulus the constant <span class="math inline">\(C\)</span>, from which we can update the posterior parameter <span class="math inline">\(\phi^\pi\)</span>:</p>
+<p><span class="math display">\[\phi^\pi_k = \alpha + \sum_i r_{ik}. \qquad (9.3)\]</span></p>
+<p>Similarly we can obtain the posterior of <span class="math inline">\(\eta\)</span>:</p>
+<p><span class="math display">\[\log q(\eta) = \log p(\eta) + \sum_i \sum_k r_{ik} \log p(x_i | \eta_k) + C.\]</span></p>
+<p>Again we can factor the terms with respect to <span class="math inline">\(k\)</span> and get:</p>
+<p><span class="math display">\[\log q(\eta_k) = \log p(\eta_k) + \sum_i r_{ik} \log p(x_i | \eta_k) + C. \qquad (9.5)\]</span></p>
+<p>Here we can see why conjugate prior works. Mathematically, given a probability distribution <span class="math inline">\(p(x | \theta)\)</span>, the distribution <span class="math inline">\(p(\theta)\)</span> is called conjugate prior of <span class="math inline">\(p(x | \theta)\)</span> if <span class="math inline">\(\log p(\theta) + \log p(x | \theta)\)</span> has the same form as <span class="math inline">\(\log p(\theta)\)</span>.</p>
+<p>For example, the conjugate prior for the exponential family <span class="math inline">\(p(x | \theta) = h(x) \exp(\theta \cdot T(x) - A(\theta))\)</span> where <span class="math inline">\(T\)</span>, <span class="math inline">\(A\)</span> and <span class="math inline">\(h\)</span> are some functions is <span class="math inline">\(p(\theta; \chi, \nu) \propto \exp(\chi \cdot \theta - \nu A(\theta))\)</span>.</p>
+<p>Here what we want is a bit different from conjugate priors because of the coefficients <span class="math inline">\(r_{ik}\)</span>. But the computation carries over to the conjugate priors of the exponential family (try it yourself and you'll see). That is, if <span class="math inline">\(p(x_i | \eta_k)\)</span> belongs to the exponential family</p>
+<p><span class="math display">\[p(x_i | \eta_k) = h(x) \exp(\eta_k \cdot T(x) - A(\eta_k))\]</span></p>
+<p>and if <span class="math inline">\(p(\eta_k)\)</span> is the conjugate prior of <span class="math inline">\(p(x_i | \eta_k)\)</span></p>
+<p><span class="math display">\[p(\eta_k) \propto \exp(\chi \cdot \eta_k - \nu A(\eta_k))\]</span></p>
+<p>then <span class="math inline">\(q(\eta_k)\)</span> has the same form as <span class="math inline">\(p(\eta_k)\)</span>, and from (9.5) we can compute the updates of <span class="math inline">\(\phi^{\eta_k}\)</span>:</p>
+<p><span class="math display">\[\begin{aligned}
+\phi^{\eta_k}_1 &amp;= \chi + \sum_i r_{ik} T(x_i), \qquad (9.7) \\
+\phi^{\eta_k}_2 &amp;= \nu + \sum_i r_{ik}. \qquad (9.9)
+\end{aligned}\]</span></p>
+<p>So the mean field approximation for the fully Bayesian mixture model is the alternate iteration of (9.1) (E-step) and (9.3)(9.7)(9.9) (M-step) until convergence.</p>
+<h3 id="fully-bayesian-gmm">Fully Bayesian GMM</h3>
+<p>A typical example of fully Bayesian mixture models is the fully Bayesian Gaussian mixture model (Attias 2000, also called variational GMM in the literature). It is defined by applying the same modification to GMM as the difference between Fully Bayesian mixture model and the vanilla mixture model.</p>
+<p>More specifically:</p>
+<ul>
+<li><span class="math inline">\(p(z_{i}) = \text{Cat}(\pi)\)</span> as in vanilla GMM</li>
+<li><span class="math inline">\(p(\pi) = \text{Dir}(\alpha, \alpha, ..., \alpha)\)</span> has Dirichlet distribution, the conjugate prior to the parameters of the categorical distribution.</li>
+<li><span class="math inline">\(p(x_i | z_i = k) = p(x_i | \eta_k) = N(\mu_{k}, \Sigma_{k})\)</span> as in vanilla GMM</li>
+<li><span class="math inline">\(p(\mu_k, \Sigma_k) = \text{NIW} (\mu_0, \lambda, \Psi, \nu)\)</span> is the normal-inverse-Wishart distribution, the conjugate prior to the mean and covariance matrix of the Gaussian distribution.</li>
+</ul>
+<p>The E-step and M-step can be computed using (9.1) and (9.3)(9.7)(9.9) in the previous section. The details of the computation can be found in Chapter 10.2 of Bishop 2006 or Attias 2000.</p>
+<h3 id="lda">LDA</h3>
+<p>As the second example of fully Bayesian mixture models, Latent Dirichlet allocation (LDA) (Blei-Ng-Jordan 2003) is the fully Bayesian version of pLSA2, with the following plate notations:</p>
+<p><img src="/assets/resources/lda.png" style="width:450px" /></p>
+<p>It is the smoothed version in the paper.</p>
+<p>More specifically, on the basis of pLSA2, we add prior distributions to <span class="math inline">\(\eta\)</span> and <span class="math inline">\(\pi\)</span>:</p>
+<p><span class="math display">\[\begin{aligned}
+p(\eta_k) &amp;= \text{Dir} (\beta, ..., \beta), \qquad k = 1 : n_z \\
+p(\pi_\ell) &amp;= \text{Dir} (\alpha, ..., \alpha), \qquad \ell = 1 : n_d \\
+\end{aligned}\]</span></p>
+<p>And as before, the prior of <span class="math inline">\(z\)</span> is</p>
+<p><span class="math display">\[p(z_{\ell, i}) = \text{Cat} (\pi_\ell), \qquad \ell = 1 : n_d, i = 1 : m\]</span></p>
+<p>We also denote posterior distribution</p>
+<p><span class="math display">\[\begin{aligned}
+q(\eta_k) &amp;= \text{Dir} (\phi^{\eta_k}), \qquad k = 1 : n_z \\
+q(\pi_\ell) &amp;= \text{Dir} (\phi^{\pi_\ell}), \qquad \ell = 1 : n_d \\
+p(z_{\ell, i}) &amp;= \text{Cat} (r_{\ell, i}), \qquad \ell = 1 : n_d, i = 1 : m
+\end{aligned}\]</span></p>
+<p>As before, in E-step we update <span class="math inline">\(r\)</span>, and in M-step we update <span class="math inline">\(\lambda\)</span> and <span class="math inline">\(\gamma\)</span>.</p>
+<p>But in the LDA paper, one treats optimisation over <span class="math inline">\(r\)</span>, <span class="math inline">\(\lambda\)</span> and <span class="math inline">\(\gamma\)</span> as a E-step, and treats <span class="math inline">\(\alpha\)</span> and <span class="math inline">\(\beta\)</span> as parameters, which is optmised over at M-step. This makes it more akin to the classical EM where the E-step is Bayesian and M-step MLE. This is more complicated, and we do not consider it this way here.</p>
+<p>Plugging in (9.1) we obtain the updates at E-step</p>
+<p><span class="math display">\[r_{\ell i k} \propto \exp(\psi(\phi^{\pi_\ell}_k) + \psi(\phi^{\eta_k}_{x_{\ell i}}) - \psi(\sum_w \phi^{\eta_k}_w)), \qquad (10)\]</span></p>
+<p>where <span class="math inline">\(\psi\)</span> is the digamma function. Similarly, plugging in (9.3)(9.7)(9.9), at M-step, we update the posterior of <span class="math inline">\(\pi\)</span> and <span class="math inline">\(\eta\)</span>:</p>
+<p><span class="math display">\[\begin{aligned}
+\phi^{\pi_\ell}_k &amp;= \alpha + \sum_i r_{\ell i k}. \qquad (11)\\
+%%}}$
+%%As for $\eta$, we have
+%%{{$%align%
+%%\log q(\eta) &amp;= \sum_k \log p(\eta_k) + \sum_{\ell, i} \mathbb E_{q(z_{\ell i})} \log p(x_{\ell i} | z_{\ell i}, \eta) \\
+%%&amp;= \sum_{k, j} (\sum_{\ell, i} r_{\ell i k} 1_{x_{\ell i} = j} + \beta - 1) \log \eta_{k j}
+%%}}$
+%%which gives us
+%%{{$
+\phi^{\eta_k}_w &amp;= \beta + \sum_{\ell, i} r_{\ell i k} 1_{x_{\ell i} = w}. \qquad (12)
+\end{aligned}\]</span></p>
+<p>So the algorithm iterates over (10) and (11)(12) until convergence.</p>
+<h3 id="dpmm">DPMM</h3>
+<p>The Dirichlet process mixture model (DPMM) is like the fully Bayesian mixture model except <span class="math inline">\(n_z = \infty\)</span>, i.e. <span class="math inline">\(z\)</span> can take any positive integer value.</p>
+<p>The probability of <span class="math inline">\(z_i = k\)</span> is defined using the so called stick-breaking process: let <span class="math inline">\(v_i \sim \text{Beta} (\alpha, \beta)\)</span> be i.i.d. random variables with Beta distributions, then</p>
+<p><span class="math display">\[\mathbb P(z_i = k | v_{1:\infty}) = (1 - v_1) (1 - v_2) ... (1 - v_{k - 1}) v_k.\]</span></p>
+<p>So <span class="math inline">\(v\)</span> plays a similar role to <span class="math inline">\(\pi\)</span> in the previous models.</p>
+<p>As before, we have that the distribution of <span class="math inline">\(x\)</span> belongs to the exponential family:</p>
+<p><span class="math display">\[p(x | z = k, \eta) = p(x | \eta_k) = h(x) \exp(\eta_k \cdot T(x) - A(\eta_k))\]</span></p>
+<p>so the prior of <span class="math inline">\(\eta_k\)</span> is</p>
+<p><span class="math display">\[p(\eta_k) \propto \exp(\chi \cdot \eta_k - \nu A(\eta_k)).\]</span></p>
+<p>Because of the infinities we can't directly apply the formulas in the general fully Bayesian mixture models. So let us carefully derive the whole thing again.</p>
+<p>As before, we can write down the ELBO:</p>
+<p><span class="math display">\[L(p(x, z, \theta), q(z, \theta)) = \mathbb E_{q(\theta)} \log {p(\theta) \over q(\theta)} + \mathbb E_{q(\theta) q(z)} \log {p(x, z | \theta) \over q(z)}.\]</span></p>
+<p>Both terms are infinite series:</p>
+<p><span class="math display">\[L(p, q) = \sum_{k = 1 : \infty} \mathbb E_{q(\theta_k)} \log {p(\theta_k) \over q(\theta_k)} + \sum_{i = 1 : m} \sum_{k = 1 : \infty} q(z_i = k) \mathbb E_{q(\theta)} \log {p(x_i, z_i = k | \theta) \over q(z_i = k)}.\]</span></p>
+<p>There are several ways to deal with the infinities. One is to fix some level <span class="math inline">\(T &gt; 0\)</span> and set <span class="math inline">\(v_T = 1\)</span> almost surely (Blei-Jordan 2006). This effectively turns the model into a finite one, and both terms become finite sums over <span class="math inline">\(k = 1 : T\)</span>.</p>
+<p>Another walkaround (Kurihara-Welling-Vlassis 2007) is also a kind of truncation, but less heavy-handed: setting the posterior <span class="math inline">\(q(\theta) = q(\eta) q(v)\)</span> to be:</p>
+<p><span class="math display">\[q(\theta) = q(\theta_{1 : T}) p(\theta_{T + 1 : \infty}) =: q(\theta_{\le T}) p(\theta_{&gt; T}).\]</span></p>
+<p>That is, tie the posterior after <span class="math inline">\(T\)</span> to the prior. This effectively turns the first term in the ELBO to a finite sum over <span class="math inline">\(k = 1 : T\)</span>, while keeping the second sum an infinite series:</p>
+<p><span class="math display">\[L(p, q) = \sum_{k = 1 : T} \mathbb E_{q(\theta_k)} \log {p(\theta_k) \over q(\theta_k)} + \sum_i \sum_{k = 1 : \infty} q(z_i = k) \mathbb E_{q(\theta)} \log {p(x_i, z_i = k | \theta) \over q(z_i = k)}. \qquad (13)\]</span></p>
+<p>The plate notation of this model is:</p>
+<p><img src="/assets/resources/dpmm.png" style="width:450px" /></p>
+<p>As it turns out, the infinities can be tamed in this case.</p>
+<p>As before, the optimal <span class="math inline">\(q(z_i)\)</span> is computed as</p>
+<p><span class="math display">\[r_{ik} = q(z_i = k) = s_{ik} / S_i\]</span></p>
+<p>where</p>
+<p><span class="math display">\[\begin{aligned}
+s_{ik} &amp;= \exp(\mathbb E_{q(\theta)} \log p(x_i, z_i = k | \theta)) \\
+S_i &amp;= \sum_{k = 1 : \infty} s_{ik}.
+\end{aligned}\]</span></p>
+<p>Plugging this back to (13) we have</p>
+<p><span class="math display">\[\begin{aligned}
+\sum_{k = 1 : \infty} r_{ik} &amp;\mathbb E_{q(\theta)} \log {p(x_i, z_i = k | \theta) \over r_{ik}} \\
+&amp;= \sum_{k = 1 : \infty} r_{ik} \mathbb E_{q(\theta)} (\log p(x_i, z_i = k | \theta) - \mathbb E_{q(\theta)} \log p(x_i, z_i = k | \theta) + \log S_i) = \log S_i.
+\end{aligned}\]</span></p>
+<p>So it all rests upon <span class="math inline">\(S_i\)</span> being finite.</p>
+<p>For <span class="math inline">\(k \le T + 1\)</span>, we compute the quantity <span class="math inline">\(s_{ik}\)</span> directly. For <span class="math inline">\(k &gt; T\)</span>, it is not hard to show that</p>
+<p><span class="math display">\[s_{ik} = s_{i, T + 1} \exp((k - T - 1) \mathbb E_{p(w)} \log (1 - w)),\]</span></p>
+<p>where <span class="math inline">\(w\)</span> is a random variable with same distribution as <span class="math inline">\(p(v_k)\)</span>, i.e. <span class="math inline">\(\text{Beta}(\alpha, \beta)\)</span>.</p>
+<p>Hence</p>
+<p><span class="math display">\[S_i = \sum_{k = 1 : T} s_{ik} + {s_{i, T + 1} \over 1 - \exp(\psi(\beta) - \psi(\alpha + \beta))}\]</span></p>
+<p>is indeed finite. Similarly we also obtain</p>
+<p><span class="math display">\[q(z_i &gt; k) = S^{-1} \left(\sum_{\ell = k + 1 : T} s_\ell + {s_{i, T + 1} \over 1 - \exp(\psi(\beta) - \psi(\alpha + \beta))}\right), k \le T \qquad (14)\]</span></p>
+<p>Now let us compute the posterior of <span class="math inline">\(\theta_{\le T}\)</span>. In the following we exchange the integrals without justifying them (c.f. Fubini's Theorem). Equation (13) can be rewritten as</p>
+<p><span class="math display">\[L(p, q) = \mathbb E_{q(\theta_{\le T})} \left(\log p(\theta_{\le T}) + \sum_i \mathbb E_{q(z_i) p(\theta_{&gt; T})} \log {p(x_i, z_i | \theta) \over q(z_i)} - \log q(\theta_{\le T})\right).\]</span></p>
+<p>Note that unlike the derivation of the mean-field approximation, we keep the <span class="math inline">\(- \mathbb E_{q(z)} \log q(z)\)</span> term even though we are only interested in <span class="math inline">\(\theta\)</span> at this stage. This is again due to the problem of infinities: as in the computation of <span class="math inline">\(S\)</span>, we would like to cancel out some undesirable unbounded terms using <span class="math inline">\(q(z)\)</span>. We now have</p>
+<p><span class="math display">\[\log q(\theta_{\le T}) = \log p(\theta_{\le T}) + \sum_i \mathbb E_{q(z_i) p(\theta_{&gt; T})} \log {p(x_i, z_i | \theta) \over q(z_i)} + C.\]</span></p>
+<p>By plugging in <span class="math inline">\(q(z = k)\)</span> we obtain</p>
+<p><span class="math display">\[\log q(\theta_{\le T}) = \log p(\theta_{\le T}) + \sum_{k = 1 : \infty} q(z_i = k) \left(\mathbb E_{p(\theta_{&gt; T})} \log {p(x_i, z_i = k | \theta) \over q(z_i = k)} - \mathbb E_{q(\theta)} \log {p(x_i, z_i = k | \theta) \over q(z_i = k)}\right) + C.\]</span></p>
+<p>Again, we separate the <span class="math inline">\(v_k\)</span>'s and the <span class="math inline">\(\eta_k\)</span>'s to obtain</p>
+<p><span class="math display">\[q(v_{\le T}) = \log p(v_{\le T}) + \sum_i \sum_k q(z_i = k) \left(\mathbb E_{p(v_{&gt; T})} \log p(z_i = k | v) - \mathbb E_{q(v)} \log p (z_i = k | v)\right).\]</span></p>
+<p>Denote by <span class="math inline">\(D_k\)</span> the difference between the two expetations on the right hand side. It is easy to show that</p>
+<p><span class="math display">\[D_k = \begin{cases}
+\log(1 - v_1) ... (1 - v_{k - 1}) v_k - \mathbb E_{q(v)} \log (1 - v_1) ... (1 - v_{k - 1}) v_k &amp; k \le T\\
+\log(1 - v_1) ... (1 - v_T) - \mathbb E_{q(v)} \log (1 - v_1) ... (1 - v_T) &amp; k &gt; T
+\end{cases}\]</span></p>
+<p>so <span class="math inline">\(D_k\)</span> is bounded. With this we can derive the update for <span class="math inline">\(\phi^{v, 1}\)</span> and <span class="math inline">\(\phi^{v, 2}\)</span>:</p>
+<p><span class="math display">\[\begin{aligned}
+\phi^{v, 1}_k &amp;= \alpha + \sum_i q(z_i = k) \\
+\phi^{v, 2}_k &amp;= \beta + \sum_i q(z_i &gt; k),
+\end{aligned}\]</span></p>
+<p>where <span class="math inline">\(q(z_i &gt; k)\)</span> can be computed as in (14).</p>
+<p>When it comes to <span class="math inline">\(\eta\)</span>, we have</p>
+<p><span class="math display">\[\log q(\eta_{\le T}) = \log p(\eta_{\le T}) + \sum_i \sum_{k = 1 : \infty} q(z_i = k) (\mathbb E_{p(\eta_k)} \log p(x_i | \eta_k) - \mathbb E_{q(\eta_k)} \log p(x_i | \eta_k)).\]</span></p>
+<p>Since <span class="math inline">\(q(\eta_k) = p(\eta_k)\)</span> for <span class="math inline">\(k &gt; T\)</span>, the inner sum on the right hand side is a finite sum over <span class="math inline">\(k = 1 : T\)</span>. By factorising <span class="math inline">\(q(\eta_{\le T})\)</span> and <span class="math inline">\(p(\eta_{\le T})\)</span>, we have</p>
+<p><span class="math display">\[\log q(\eta_k) = \log p(\eta_k) + \sum_i q(z_i = k) \log (x_i | \eta_k) + C,\]</span></p>
+<p>which gives us</p>
+<p><span class="math display">\[\begin{aligned}
+\phi^{\eta, 1}_k &amp;= \xi + \sum_i q(z_i = k) T(x_i) \\
+\phi^{\eta, 2}_k &amp;= \nu + \sum_i q(z_i = k).
+\end{aligned}\]</span></p>
+<h2 id="svi">SVI</h2>
+<p>In variational inference, the computation of some parameters are more expensive than others.</p>
+<p>For example, the computation of M-step is often much more expensive than that of E-step:</p>
+<ul>
+<li>In the vanilla mixture models with the EM algorithm, the update of <span class="math inline">\(\theta\)</span> requires the computation of <span class="math inline">\(r_{ik}\)</span> for all <span class="math inline">\(i = 1 : m\)</span>, see Eq (2.3).</li>
+<li>In the fully Bayesian mixture model with mean field approximation, the updates of <span class="math inline">\(\phi^\pi\)</span> and <span class="math inline">\(\phi^\eta\)</span> require the computation of a sum over all samples (see Eq (9.3)(9.7)(9.9)).</li>
+</ul>
+<p>Similarly, in pLSA2 (resp. LDA), the updates of <span class="math inline">\(\eta_k\)</span> (resp. <span class="math inline">\(\phi^{\eta_k}\)</span>) requires a sum over <span class="math inline">\(\ell = 1 : n_d\)</span>, whereas the updates of other parameters do not.</p>
+<p>In these cases, the parameter that requires more computations are called global and the other ones local.</p>
+<p>Stochastic variational inference (SVI, Hoffman-Blei-Wang-Paisley 2012) addresses this problem in the same way as stochastic gradient descent improves efficiency of gradient descent.</p>
+<p>Each time SVI picks a sample, updates the corresponding local parameters, and computes the update of the global parameters as if all the <span class="math inline">\(m\)</span> samples are identical to the picked sample. Finally it incorporates this global parameter value into previous computations of the global parameters, by means of an exponential moving average.</p>
+<p>As an example, here's SVI applied to LDA:</p>
+<ol type="1">
+<li>Set <span class="math inline">\(t = 1\)</span>.</li>
+<li>Pick <span class="math inline">\(\ell\)</span> uniformly from <span class="math inline">\(\{1, 2, ..., n_d\}\)</span>.</li>
+<li>Repeat until convergence:
+<ol type="1">
+<li>Compute <span class="math inline">\((r_{\ell i k})_{i = 1 : m, k = 1 : n_z}\)</span> using (10).</li>
+<li>Compute <span class="math inline">\((\phi^{\pi_\ell}_k)_{k = 1 : n_z}\)</span> using (11).</li>
+</ol></li>
+<li><p>Compute <span class="math inline">\((\tilde \phi^{\eta_k}_w)_{k = 1 : n_z, w = 1 : n_x}\)</span> using the following formula (compare with (12)) <span class="math display">\[\tilde \phi^{\eta_k}_w = \beta + n_d \sum_{i} r_{\ell i k} 1_{x_{\ell i} = w}\]</span></p></li>
+<li><p>Update the exponential moving average <span class="math inline">\((\phi^{\eta_k}_w)_{k = 1 : n_z, w = 1 : n_x}\)</span>: <span class="math display">\[\phi^{\eta_k}_w = (1 - \rho_t) \phi^{\eta_k}_w + \rho_t \tilde \phi^{\eta_k}_w\]</span></p></li>
+<li><p>Increment <span class="math inline">\(t\)</span> and go back to Step 2.</p></li>
+</ol>
+<p>In the original paper, <span class="math inline">\(\rho_t\)</span> needs to satisfy some conditions that guarantees convergence of the global parameters:</p>
+<p><span class="math display">\[\begin{aligned}
+\sum_t \rho_t = \infty \\
+\sum_t \rho_t^2 &lt; \infty
+\end{aligned}\]</span></p>
+<p>and the choice made there is</p>
+<p><span class="math display">\[\rho_t = (t + \tau)^{-\kappa}\]</span></p>
+<p>for some <span class="math inline">\(\kappa \in (.5, 1]\)</span> and <span class="math inline">\(\tau \ge 0\)</span>.</p>
+<h2 id="aevb">AEVB</h2>
+<p>SVI adds to variational inference stochastic updates similar to stochastic gradient descent. Why not just use neural networks with stochastic gradient descent while we are at it? Autoencoding variational Bayes (AEVB) (Kingma-Welling 2013) is such an algorithm.</p>
+<p>Let's look back to the original problem of maximising the ELBO:</p>
+<p><span class="math display">\[\max_{\theta, q} \sum_{i = 1 : m} L(p(x_i | z_i; \theta) p(z_i; \theta), q(z_i))\]</span></p>
+<p>Since for any given <span class="math inline">\(\theta\)</span>, the optimal <span class="math inline">\(q(z_i)\)</span> is the posterior <span class="math inline">\(p(z_i | x_i; \theta)\)</span>, the problem reduces to</p>
+<p><span class="math display">\[\max_{\theta} \sum_i L(p(x_i | z_i; \theta) p(z_i; \theta), p(z_i | x_i; \theta))\]</span></p>
+<p>Let us assume <span class="math inline">\(p(z_i; \theta) = p(z_i)\)</span> is independent of <span class="math inline">\(\theta\)</span> to simplify the problem. In the old mixture models, we have <span class="math inline">\(p(x_i | z_i; \theta) = p(x_i; \eta_{z_i})\)</span>, which we can generalise to <span class="math inline">\(p(x_i; f(\theta, z_i))\)</span> for some function <span class="math inline">\(f\)</span>. Using Beyes' theorem we can also write down <span class="math inline">\(p(z_i | x_i; \theta) = q(z_i; g(\theta, x_i))\)</span> for some function <span class="math inline">\(g\)</span>. So the problem becomes</p>
+<p><span class="math display">\[\max_{\theta} \sum_i L(p(x_i; f(\theta, z_i)) p(z_i), q(z_i; g(\theta, x_i)))\]</span></p>
+<p>In some cases <span class="math inline">\(g\)</span> can be hard to write down or compute. AEVB addresses this problem by replacing <span class="math inline">\(g(\theta, x_i)\)</span> with a neural network <span class="math inline">\(g_\phi(x_i)\)</span> with input <span class="math inline">\(x_i\)</span> and some separate parameters <span class="math inline">\(\phi\)</span>. It also replaces <span class="math inline">\(f(\theta, z_i)\)</span> with a neural network <span class="math inline">\(f_\theta(z_i)\)</span> with input <span class="math inline">\(z_i\)</span> and parameters <span class="math inline">\(\theta\)</span>. And now the problem becomes</p>
+<p><span class="math display">\[\max_{\theta, \phi} \sum_i L(p(x_i; f_\theta(z_i)) p(z_i), q(z_i; g_\phi(x_i))).\]</span></p>
+<p>The objective function can be written as</p>
+<p><span class="math display">\[\sum_i \mathbb E_{q(z_i; g_\phi(x_i))} \log p(x_i; f_\theta(z_i)) - D(q(z_i; g_\phi(x_i)) || p(z_i)).\]</span></p>
+<p>The first term is called the negative reconstruction error, like the <span class="math inline">\(- \|decoder(encoder(x)) - x\|\)</span> in autoencoders, which is where the "autoencoder" in the name comes from.</p>
+<p>The second term is a regularisation term that penalises the posterior <span class="math inline">\(q(z_i)\)</span> that is very different from the prior <span class="math inline">\(p(z_i)\)</span>. We assume this term can be computed analytically.</p>
+<p>So only the first term requires computing.</p>
+<p>We can approximate the sum over <span class="math inline">\(i\)</span> in a similar fashion as SVI: pick <span class="math inline">\(j\)</span> uniformly randomly from <span class="math inline">\(\{1 ... m\}\)</span> and treat the whole dataset as <span class="math inline">\(m\)</span> replicates of <span class="math inline">\(x_j\)</span>, and approximate the expectation using Monte-Carlo:</p>
+<p><span class="math display">\[U(x_i, \theta, \phi) := \sum_i \mathbb E_{q(z_i; g_\phi(x_i))} \log p(x_i; f_\theta(z_i)) \approx m \mathbb E_{q(z_j; g_\phi(x_j))} \log p(x_j; f_\theta(z_j)) \approx {m \over L} \sum_{\ell = 1}^L \log p(x_j; f_\theta(z_{j, \ell})),\]</span></p>
+<p>where each <span class="math inline">\(z_{j, \ell}\)</span> is sampled from <span class="math inline">\(q(z_j; g_\phi(x_j))\)</span>.</p>
+<p>But then it is not easy to approximate the gradient over <span class="math inline">\(\phi\)</span>. One can use the log trick as in policy gradients, but it has the problem of high variance. In policy gradients this is overcome by using baseline subtractions. In the AEVB paper it is tackled with the reparameterisation trick.</p>
+<p>Assume there exists a transformation <span class="math inline">\(T_\phi\)</span> and a random variable <span class="math inline">\(\epsilon\)</span> with distribution independent of <span class="math inline">\(\phi\)</span> or <span class="math inline">\(\theta\)</span>, such that <span class="math inline">\(T_\phi(x_i, \epsilon)\)</span> has distribution <span class="math inline">\(q(z_i; g_\phi(x_i))\)</span>. In this case we can rewrite <span class="math inline">\(U(x, \phi, \theta)\)</span> as</p>
+<p><span class="math display">\[\sum_i \mathbb E_{\epsilon \sim p(\epsilon)} \log p(x_i; f_\theta(T_\phi(x_i, \epsilon))),\]</span></p>
+<p>This way one can use Monte-Carlo to approximate <span class="math inline">\(\nabla_\phi U(x, \phi, \theta)\)</span>:</p>
+<p><span class="math display">\[\nabla_\phi U(x, \phi, \theta) \approx {m \over L} \sum_{\ell = 1 : L} \nabla_\phi \log p(x_j; f_\theta(T_\phi(x_j, \epsilon_\ell))),\]</span></p>
+<p>where each <span class="math inline">\(\epsilon_{\ell}\)</span> is sampled from <span class="math inline">\(p(\epsilon)\)</span>. The approximation of <span class="math inline">\(U(x, \phi, \theta)\)</span> itself can be done similarly.</p>
+<h3 id="vae">VAE</h3>
+<p>As an example of AEVB, the paper introduces variational autoencoder (VAE), with the following instantiations:</p>
+<ul>
+<li>The prior <span class="math inline">\(p(z_i) = N(0, I)\)</span> is standard normal, thus independent of <span class="math inline">\(\theta\)</span>.</li>
+<li>The distribution <span class="math inline">\(p(x_i; \eta)\)</span> is either Gaussian or categorical.</li>
+<li>The distribution <span class="math inline">\(q(z_i; \mu, \Sigma)\)</span> is Gaussian with diagonal covariance matrix. So <span class="math inline">\(g_\phi(z_i) = (\mu_\phi(x_i), \text{diag}(\sigma^2_\phi(x_i)_{1 : d}))\)</span>. Thus in the reparameterisation trick <span class="math inline">\(\epsilon \sim N(0, I)\)</span> and <span class="math inline">\(T_\phi(x_i, \epsilon) = \epsilon \odot \sigma_\phi(x_i) + \mu_\phi(x_i)\)</span>, where <span class="math inline">\(\odot\)</span> is elementwise multiplication.</li>
+<li>The KL divergence can be easily computed analytically as <span class="math inline">\(- D(q(z_i; g_\phi(x_i)) || p(z_i)) = {d \over 2} + \sum_{j = 1 : d} \log\sigma_\phi(x_i)_j - {1 \over 2} \sum_{j = 1 : d} (\mu_\phi(x_i)_j^2 + \sigma_\phi(x_i)_j^2)\)</span>.</li>
+</ul>
+<p>With this, one can use backprop to maximise the ELBO.</p>
+<h3 id="fully-bayesian-aevb">Fully Bayesian AEVB</h3>
+<p>Let us turn to fully Bayesian version of AEVB. Again, we first recall the ELBO of the fully Bayesian mixture models:</p>
+<p><span class="math display">\[L(p(x, z, \pi, \eta; \alpha, \beta), q(z, \pi, \eta; r, \phi)) = L(p(x | z, \eta) p(z | \pi) p(\pi; \alpha) p(\eta; \beta), q(z; r) q(\eta; \phi^\eta) q(\pi; \phi^\pi)).\]</span></p>
+<p>We write <span class="math inline">\(\theta = (\pi, \eta)\)</span>, rewrite <span class="math inline">\(\alpha := (\alpha, \beta)\)</span>, <span class="math inline">\(\phi := r\)</span>, and <span class="math inline">\(\gamma := (\phi^\eta, \phi^\pi)\)</span>. Furthermore, as in the half-Bayesian version we assume <span class="math inline">\(p(z | \theta) = p(z)\)</span>, i.e. <span class="math inline">\(z\)</span> does not depend on <span class="math inline">\(\theta\)</span>. Similarly we also assume <span class="math inline">\(p(\theta; \alpha) = p(\theta)\)</span>. Now we have</p>
+<p><span class="math display">\[L(p(x, z, \theta; \alpha), q(z, \theta; \phi, \gamma)) = L(p(x | z, \theta) p(z) p(\theta), q(z; \phi) q(\theta; \gamma)).\]</span></p>
+<p>And the objective is to maximise it over <span class="math inline">\(\phi\)</span> and <span class="math inline">\(\gamma\)</span>. We no longer maximise over <span class="math inline">\(\theta\)</span>, because it is now a random variable, like <span class="math inline">\(z\)</span>. Now let us transform it to a neural network model, as in the half-Bayesian case:</p>
+<p><span class="math display">\[L\left(\left(\prod_{i = 1 : m} p(x_i; f_\theta(z_i))\right) \left(\prod_{i = 1 : m} p(z_i) \right) p(\theta), \left(\prod_{i = 1 : m} q(z_i; g_\phi(x_i))\right) q(\theta; h_\gamma(x))\right).\]</span></p>
+<p>where <span class="math inline">\(f_\theta\)</span>, <span class="math inline">\(g_\phi\)</span> and <span class="math inline">\(h_\gamma\)</span> are neural networks. Again, by separating out KL-divergence terms, the above formula becomes</p>
+<p><span class="math display">\[\sum_i \mathbb E_{q(\theta; h_\gamma(x))q(z_i; g_\phi(x_i))} \log p(x_i; f_\theta(z_i)) - \sum_i D(q(z_i; g_\phi(x_i)) || p(z_i)) - D(q(\theta; h_\gamma(x)) || p(\theta)).\]</span></p>
+<p>Again, we assume the latter two terms can be computed analytically. Using reparameterisation trick, we write</p>
+<p><span class="math display">\[\begin{aligned}
+\theta &amp;= R_\gamma(\zeta, x) \\
+z_i &amp;= T_\phi(\epsilon, x_i)
+\end{aligned}\]</span></p>
+<p>for some transformations <span class="math inline">\(R_\gamma\)</span> and <span class="math inline">\(T_\phi\)</span> and random variables <span class="math inline">\(\zeta\)</span> and <span class="math inline">\(\epsilon\)</span> so that the output has the desired distributions.</p>
+<p>Then the first term can be written as</p>
+<p><span class="math display">\[\mathbb E_{\zeta, \epsilon} \log p(x_i; f_{R_\gamma(\zeta, x)} (T_\phi(\epsilon, x_i))),\]</span></p>
+<p>so that the gradients can be computed accordingly.</p>
+<p>Again, one may use Monte-Carlo to approximate this expetation.</p>
+<h2 id="references">References</h2>
+<ul>
+<li>Attias, Hagai. "A variational baysian framework for graphical models." In Advances in neural information processing systems, pp. 209-215. 2000.</li>
+<li>Bishop, Christopher M. Neural networks for pattern recognition. Springer. 2006.</li>
+<li>Blei, David M., and Michael I. Jordan. “Variational Inference for Dirichlet Process Mixtures.” Bayesian Analysis 1, no. 1 (March 2006): 121–43. <a href="https://doi.org/10.1214/06-BA104" class="uri">https://doi.org/10.1214/06-BA104</a>.</li>
+<li>Blei, David M., Andrew Y. Ng, and Michael I. Jordan. “Latent Dirichlet Allocation.” Journal of Machine Learning Research 3, no. Jan (2003): 993–1022.</li>
+<li>Hofmann, Thomas. “Latent Semantic Models for Collaborative Filtering.” ACM Transactions on Information Systems 22, no. 1 (January 1, 2004): 89–115. <a href="https://doi.org/10.1145/963770.963774" class="uri">https://doi.org/10.1145/963770.963774</a>.</li>
+<li>Hofmann, Thomas. "Learning the similarity of documents: An information-geometric approach to document retrieval and categorization." In Advances in neural information processing systems, pp. 914-920. 2000.</li>
+<li>Hoffman, Matt, David M. Blei, Chong Wang, and John Paisley. “Stochastic Variational Inference.” ArXiv:1206.7051 [Cs, Stat], June 29, 2012. <a href="http://arxiv.org/abs/1206.7051" class="uri">http://arxiv.org/abs/1206.7051</a>.</li>
+<li>Kingma, Diederik P., and Max Welling. “Auto-Encoding Variational Bayes.” ArXiv:1312.6114 [Cs, Stat], December 20, 2013. <a href="http://arxiv.org/abs/1312.6114" class="uri">http://arxiv.org/abs/1312.6114</a>.</li>
+<li>Kurihara, Kenichi, Max Welling, and Nikos Vlassis. "Accelerated variational Dirichlet process mixtures." In Advances in neural information processing systems, pp. 761-768. 2007.</li>
+<li>Sudderth, Erik Blaine. "Graphical models for visual object recognition and tracking." PhD diss., Massachusetts Institute of Technology, 2006.</li>
+</ul>
+</body>
+</html>
+
+ </div>
+ <section id="isso-thread"></section>
+ </div>
+ </body>
+</html>