---
title: Raise your ELBO
date: 2019-02-14
template: post
comments: true
---

In this post I give an introduction to variational inference, which is about maximising the evidence lower bound (ELBO).

I use a top-down approach, starting with the KL divergence and the ELBO, to lay the mathematical framework of all the models in this post.

Then I define mixture models and the EM algorithm, with the Gaussian mixture model (GMM), probabilistic latent semantic analysis (pLSA) and the hidden Markov model (HMM) as examples.

After that I present the fully Bayesian version of EM, also known as mean field approximation (MFA), and apply it to fully Bayesian mixture models, with fully Bayesian GMM (also known as variational GMM), latent Dirichlet allocation (LDA) and the Dirichlet process mixture model (DPMM) as examples.

Then I explain stochastic variational inference, a modification of EM and MFA to improve efficiency.

Finally I talk about autoencoding variational Bayes (AEVB), a Monte-Carlo + neural network approach to raising the ELBO, exemplified by the variational autoencoder (VAE). I also show its fully Bayesian version.

**Acknowledgement**. The following texts and resources were illuminating during the writing of this post: the Stanford CS228 notes ([1](https://ermongroup.github.io/cs228-notes/inference/variational/), [2](https://ermongroup.github.io/cs228-notes/learning/latent/)), the [Tel Aviv Algorithms in Molecular Biology slides](https://www.cs.tau.ac.il/~rshamir/algmb/presentations/EM-BW-Ron-16%20.pdf) (clear explanations of the connection between EM and Baum-Welch), Chapter 10 of [Bishop's book](https://www.springer.com/us/book/9780387310732) (brilliant introduction to variational GMM), Section 2.5 of [Sudderth's thesis](http://cs.brown.edu/~sudderth/papers/sudderthPhD.pdf) and [metacademy](https://metacademy.org). Also thanks to Josef Lindman Hörnlund for discussions. The research was done while working at the KTH mathematics department.

*If you are reading on a mobile device, you may need to "request desktop site" for the equations to be properly displayed. This post is licensed under CC BY-SA and GNU FDL.*

KL divergence and ELBO
----------------------

Let $p$ and $q$ be two probability measures. The Kullback-Leibler (KL) divergence is defined as

$$D(q||p) = E_q \log{q \over p}.$$

It is nonnegative and achieves its minimum $0$ when $p = q$.

If $p$ can be further written as

$$p(x) = {w(x) \over Z}, \qquad (0)$$

where $Z$ is a normaliser, then

$$\log Z = D(q||p) + L(w, q), \qquad(1)$$

where $L(w, q)$ is called the evidence lower bound (ELBO), defined by

$$L(w, q) = E_q \log{w \over q}. \qquad (1.25)$$

From (1), since $\log Z$ does not depend on $q$, we see that to minimise the nonnegative term $D(q || p)$, one can maximise the ELBO.
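As a quick sanity check, identity (1) can be verified numerically on a small discrete space. A minimal sketch in Python (the particular $w$ and $q$ below are made up for illustration):

```python
import numpy as np

# An unnormalised density w on a three-point space, its normaliser Z,
# the normalised p = w / Z, and an arbitrary distribution q.
w = np.array([2.0, 1.0, 3.0])
Z = w.sum()
p = w / Z
q = np.array([0.5, 0.2, 0.3])

kl = np.sum(q * np.log(q / p))    # D(q || p)
elbo = np.sum(q * np.log(w / q))  # L(w, q)

assert np.isclose(np.log(Z), kl + elbo)  # equation (1)
assert np.log(Z) >= elbo                 # equation (1.3)
```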
To this end, we can simply discard $D(q || p)$ in (1) and obtain:

$$\log Z \ge L(w, q) \qquad (1.3)$$

and keep in mind that the inequality becomes an equality when $q = {w \over Z}$.

It is time to define the task of variational inference (VI), also known as variational Bayes (VB).

**Definition**. Variational inference is concerned with maximising the ELBO $L(w, q)$.

There are mainly two versions of VI, the half Bayesian and the fully Bayesian cases. Half Bayesian VI, to which expectation-maximisation algorithms (EM) apply, instantiates (1.3) with

$$\begin{aligned}
Z &= p(x; \theta)\\
w &= p(x, z; \theta)\\
q &= q(z)
\end{aligned}$$

and the dummy variable $x$ in Equation (0) is substituted with $z$.

Fully Bayesian VI, often just called VI, has the following instantiations:

$$\begin{aligned}
Z &= p(x) \\
w &= p(x, z, \theta) \\
q &= q(z, \theta)
\end{aligned}$$

and $x$ in Equation (0) is substituted with $(z, \theta)$.

In both cases $\theta$ are parameters and $z$ are latent variables.

**Remark on the naming of things**. The term "variational" comes from the fact that we perform calculus of variations: maximise some functional ($L(w, q)$) over a set of functions ($q$). Note however that most VI / VB algorithms do not involve any techniques from the calculus of variations; they only use Jensen's inequality / the fact that $D(q||p)$ reaches its minimum when $p = q$. By this reasoning, EM is also a kind of VI, even though in the literature VI often refers to its fully Bayesian version.

EM
--

To illustrate the EM algorithms, we first define the mixture model.

**Definition (mixture model)**. Given dataset $x_{1 : m}$, we assume the data has some underlying latent variables $z_{1 : m}$ that may take a value from a finite set $\{1, 2, ..., n_z\}$. Let $p(z_{i}; \pi)$ be categorically distributed according to the probability vector $\pi$. That is, $p(z_{i} = k; \pi) = \pi_k$. Also assume $p(x_{i} | z_{i} = k; \eta) = p(x_{i}; \eta_k)$. Find $\theta = (\pi, \eta)$ that maximises the likelihood $p(x_{1 : m}; \theta)$.

Represented as a DAG (a.k.a. in plate notation), the model looks like this:

![](/assets/mixture-model.png){style="width:250px"}

where the box labelled $m$ means repetition $m$ times, since there are $m$ independent pairs of $(x, z)$, and the same goes for $\eta$.

The direct maximisation

$$\max_\theta \sum_i \log p(x_{i}; \theta) = \max_\theta \sum_i \log \int p(x_{i} | z_i; \theta) p(z_i; \theta) dz_i$$

is hard because of the integral in the log.

We can fit this problem in (1.3) by having $Z = p(x_{1 : m}; \theta)$ and $w = p(z_{1 : m}, x_{1 : m}; \theta)$. The plan is to update $\theta$ repeatedly so that $L(p(z, x; \theta_t), q(z))$ is non-decreasing over time $t$.

Equation (1.3) at time $t$ for the $i$th datapoint is

$$\log p(x_{i}; \theta_t) \ge L(p(z_i, x_{i}; \theta_t), q(z_i)) \qquad (2)$$

Each timestep consists of two steps, the E-step and the M-step.

At the E-step, we set

$$q(z_{i}) = p(z_{i}|x_{i}; \theta_t)$$

to turn the inequality into an equality. We denote $r_{ik} = q(z_i = k)$ and call them responsibilities, so the posterior $q(z_i)$ is a categorical distribution with parameter $r_i = r_{i, 1 : n_z}$.
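In code, the E-step is just Bayes' theorem followed by a normalisation. A minimal sketch, assuming the per-component log likelihoods have already been evaluated (the function and argument names here are mine, not from any particular library):

```python
import numpy as np
from scipy.special import logsumexp

def e_step(log_lik, log_pi):
    """Responsibilities r[i, k] = q(z_i = k) = p(z_i = k | x_i; theta_t).

    log_lik: (m, n_z) array of log p(x_i | z_i = k; theta_t),
    log_pi:  (n_z,) array of log p(z_i = k; theta_t)."""
    log_joint = log_lik + log_pi  # log p(x_i, z_i = k; theta_t)
    return np.exp(log_joint - logsumexp(log_joint, axis=1, keepdims=True))
```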
At the M-step, we maximise $\sum_i L(p(x_{i}, z_{i}; \theta), q(z_{i}))$ over $\theta$ (the entropy term of $q$ does not depend on $\theta$, so it can be dropped):

$$\begin{aligned}
\theta_{t + 1} &= \text{argmax}_\theta \sum_i L(p(x_{i}, z_{i}; \theta), p(z_{i} | x_{i}; \theta_t)) \\
&= \text{argmax}_\theta \sum_i \mathbb E_{p(z_{i} | x_{i}; \theta_t)} \log p(x_{i}, z_{i}; \theta) \qquad (2.3)
\end{aligned}$$

So $\sum_i L(p(x_{i}, z_{i}; \theta), q(z_i))$ is non-decreasing at both the E-step and the M-step.

We can see from this derivation that EM is half-Bayesian: the E-step is Bayesian in that it computes the posterior of the latent variables, and the M-step is frequentist because it performs a maximum likelihood estimate of $\theta$.

It is clear that the ELBO sum converges, as it is non-decreasing with an upper bound, but it is not clear whether the sum converges to the correct value, i.e. $\max_\theta p(x_{1 : m}; \theta)$. In fact EM is known to get stuck in local maxima sometimes.

A different way of describing EM, which will be useful in the hidden Markov model, is:

- At the E-step, one writes down the formula
  $$\sum_i \mathbb E_{p(z_i | x_{i}; \theta_t)} \log p(x_{i}, z_i; \theta). \qquad (2.5)$$

- At the M-step, one finds $\theta_{t + 1}$ to be the $\theta$ that maximises the above formula.

### GMM

The Gaussian mixture model (GMM) is an example of mixture models.

The space of the data is $\mathbb R^n$. We use the hypothesis that the data is Gaussian conditioned on the latent variable:

$$(x_i; \eta_k) \sim N(\mu_k, \Sigma_k),$$

so we write $\eta_k = (\mu_k, \Sigma_k)$.

During the E-step, the $q(z_i)$ can be directly computed using Bayes' theorem:

$$r_{ik} = q(z_i = k) = \mathbb P(z_i = k | x_{i}; \theta_t) = {g_{\mu_{t, k}, \Sigma_{t, k}} (x_{i}) \pi_{t, k} \over \sum_{j = 1 : n_z} g_{\mu_{t, j}, \Sigma_{t, j}} (x_{i}) \pi_{t, j}},$$

where $g_{\mu, \Sigma} (x) = (2 \pi)^{- n / 2} (\det \Sigma)^{-1 / 2} \exp(- {1 \over 2} (x - \mu)^T \Sigma^{-1} (x - \mu))$ is the pdf of the Gaussian distribution $N(\mu, \Sigma)$.

During the M-step, we need to compute

$$\text{argmax}_{\Sigma, \mu, \pi} \sum_{i = 1 : m} \sum_{k = 1 : n_z} r_{ik} \log (g_{\mu_k, \Sigma_k}(x_{i}) \pi_k).$$

This is similar to quadratic discriminant analysis, and the solution is

$$\begin{aligned}
\pi_{k} &= {1 \over m} \sum_{i = 1 : m} r_{ik}, \\
\mu_{k} &= {\sum_i r_{ik} x_{i} \over \sum_i r_{ik}}, \\
\Sigma_{k} &= {\sum_i r_{ik} (x_{i} - \mu_{k}) (x_{i} - \mu_{k})^T \over \sum_i r_{ik}}.
\end{aligned}$$

**Remark**. The k-means algorithm is the $\epsilon \to 0$ limit of the GMM with constraints $\Sigma_k = \epsilon I$. See Section 9.3.2 of Bishop 2006 for a derivation. It is also briefly mentioned there that a variant in this setting where the covariance matrix is not restricted to $\epsilon I$ is called the elliptical k-means algorithm.
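Putting the two steps together, here is a bare-bones sketch of EM for GMM, meant to mirror the formulas above rather than be a robust implementation (fixed iteration count, crude initialisation, and a small ridge term to keep the covariances invertible):

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_em(x, n_z, n_iter=100, seed=0):
    """EM for GMM; x has shape (m, n) with n >= 2."""
    m, n = x.shape
    rng = np.random.default_rng(seed)
    pi = np.full(n_z, 1 / n_z)
    mu = x[rng.choice(m, size=n_z, replace=False)]  # init means at data points
    sigma = np.tile(np.atleast_2d(np.cov(x.T)) + 1e-6 * np.eye(n), (n_z, 1, 1))
    for _ in range(n_iter):
        # E-step: responsibilities by Bayes' theorem
        r = np.stack([pi[k] * multivariate_normal.pdf(x, mu[k], sigma[k])
                      for k in range(n_z)], axis=1)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: weighted maximum likelihood
        nk = r.sum(axis=0)
        pi = nk / m
        mu = (r.T @ x) / nk[:, None]
        for k in range(n_z):
            d = x - mu[k]
            sigma[k] = (r[:, k, None] * d).T @ d / nk[k] + 1e-6 * np.eye(n)
    return pi, mu, sigma, r
```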
### SMM

As a transition to the next models to study, let us consider a simpler mixture model obtained by making one modification to GMM: change $(x; \eta_k) \sim N(\mu_k, \Sigma_k)$ to $\mathbb P(x = w; \eta_k) = \eta_{kw}$, where $\eta$ is a stochastic matrix and $w$ is an arbitrary element of the space for $x$. So now the spaces for both $x$ and $z$ are finite. We call this model the simple mixture model (SMM).

As in GMM, at the E-step $r_{ik}$ can be explicitly computed using Bayes' theorem.

It is not hard to write down the solution to the M-step in this case:

$$\begin{aligned}
\pi_{k} &= {1 \over m} \sum_i r_{ik}, \qquad (2.7)\\
\eta_{k, w} &= {\sum_i r_{ik} 1_{x_i = w} \over \sum_i r_{ik}}, \qquad (2.8)
\end{aligned}$$

where $1_{x_i = w}$ is the [indicator function](https://en.wikipedia.org/wiki/Indicator_function), which evaluates to $1$ if $x_i = w$ and $0$ otherwise.

Two trivial variants of the SMM are the two versions of probabilistic latent semantic analysis (pLSA), which we call pLSA1 and pLSA2.

The model pLSA1 is a probabilistic version of latent semantic analysis, which is basically a simple matrix factorisation model in collaborative filtering, whereas pLSA2 has a fully Bayesian version called latent Dirichlet allocation (LDA), not to be confused with the other LDA (linear discriminant analysis).

### pLSA

The pLSA model (Hofmann 2000) is a mixture model, where the dataset consists of pairs $(d_i, x_i)_{i = 1 : m}$. In natural language processing, the $x$ are words and the $d$ are documents, and a pair $(d, x)$ represents an occurrence of word $x$ in document $d$.

For each datapoint $(d_{i}, x_{i})$,

$$\begin{aligned}
p(d_i, x_i; \theta) &= \sum_{z_i} p(z_i; \theta) p(d_i | z_i; \theta) p(x_i | z_i; \theta) \qquad (2.91)\\
&= p(d_i; \theta) \sum_{z_i} p(x_i | z_i; \theta) p (z_i | d_i; \theta). \qquad (2.92)
\end{aligned}$$

Of the two formulations, (2.91) corresponds to pLSA type 1, and (2.92) corresponds to type 2.

#### pLSA1

The pLSA1 model (Hofmann 2000) is basically SMM with $x_i$ substituted with $(d_i, x_i)$, which conditioned on $z_i$ are independently categorically distributed:

$$p(d_i = u, x_i = w | z_i = k; \theta) = p(d_i = u; \xi_k) p(x_i = w; \eta_k) = \xi_{ku} \eta_{kw}.$$

The model can be illustrated in plate notation:

![](/assets/plsa1.png){style="width:350px"}

So the solution of the M-step is

$$\begin{aligned}
\pi_{k} &= {1 \over m} \sum_i r_{ik} \\
\xi_{k, u} &= {\sum_i r_{ik} 1_{d_{i} = u} \over \sum_i r_{ik}} \\
\eta_{k, w} &= {\sum_i r_{ik} 1_{x_{i} = w} \over \sum_i r_{ik}}.
\end{aligned}$$

**Remark**. pLSA1 is the probabilistic version of LSA, also known as matrix factorisation.

Let $n_d$ and $n_x$ be the number of values $d_i$ and $x_i$ can take.

**Problem** (LSA). Let $R$ be an $n_d \times n_x$ matrix, and fix $s \le \min\{n_d, n_x\}$. Find the $n_d \times s$ matrix $D$ and the $n_x \times s$ matrix $X$ that minimise

$$J(D, X) = \|R - D X^T\|_F,$$

where $\|\cdot\|_F$ is the Frobenius norm.

**Claim**. Let $R = U \Sigma V^T$ be the SVD of $R$. Then the solution to the above problem is $D = U_s \Sigma_s^{{1 \over 2}}$ and $X = V_s \Sigma_s^{{1 \over 2}}$, where $U_s$ (resp. $V_s$) is the matrix of the first $s$ columns of $U$ (resp. $V$) and $\Sigma_s$ is the $s \times s$ submatrix of $\Sigma$.

One can compare pLSA1 with LSA. Both procedures produce embeddings of $d$ and $x$: in pLSA we obtain $n_z$-dimensional embeddings $\xi_{\cdot, u}$ and $\eta_{\cdot, w}$, whereas in LSA we obtain $s$-dimensional embeddings $D_{u, \cdot}$ and $X_{w, \cdot}$.
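The claim is essentially the Eckart-Young theorem, and is easy to check numerically; a small sketch:

```python
import numpy as np

def lsa(R, s):
    """Rank-s factorisation R ~ D @ X.T minimising the Frobenius norm."""
    U, S, Vt = np.linalg.svd(R, full_matrices=False)
    D = U[:, :s] * np.sqrt(S[:s])  # n_d x s document embeddings
    X = Vt[:s].T * np.sqrt(S[:s])  # n_x x s word embeddings
    return D, X

R = np.random.default_rng(0).random((8, 5))
D, X = lsa(R, 2)
U, S, Vt = np.linalg.svd(R, full_matrices=False)
assert np.allclose(D @ X.T, (U[:, :2] * S[:2]) @ Vt[:2])  # best rank-2 approx
```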
#### pLSA2

Let us turn to pLSA2 (Hofmann 2004), corresponding to (2.92). We rewrite it as

$$p(x_i | d_i; \theta) = \sum_{z_i} p(x_i | z_i; \theta) p(z_i | d_i; \theta).$$

To simplify notation, we collect all the $x_i$s whose corresponding $d_i$ equals 1 (suppose there are $m_1$ of them), and write them as $(x_{1, j})_{j = 1 : m_1}$. In the same fashion we construct $x_{2, 1 : m_2}, x_{3, 1 : m_3}, ..., x_{n_d, 1 : m_{n_d}}$. Similarly, we relabel the corresponding $d_i$ and $z_i$ accordingly.

With almost no loss of generality, we assume all the $m_\ell$s are equal and write them as $m$.

Now the model becomes

$$p(x_{\ell, i} | d_{\ell, i} = \ell; \theta) = \sum_k p(x_{\ell, i} | z_{\ell, i} = k; \theta) p(z_{\ell, i} = k | d_{\ell, i} = \ell; \theta).$$

Since we have regrouped the $x$'s and $z$'s whose indices record the values of the $d$'s, we can remove the $d$'s from the equation altogether:

$$p(x_{\ell, i}; \theta) = \sum_k p(x_{\ell, i} | z_{\ell, i} = k; \theta) p(z_{\ell, i} = k; \theta).$$

It is effectively a modification of SMM by making $n_d$ copies of $\pi$. More specifically the parameters are $\theta = (\pi_{1 : n_d, 1 : n_z}, \eta_{1 : n_z, 1 : n_x})$, where we model $(z | d = \ell) \sim \text{Cat}(\pi_{\ell, \cdot})$ and, as in pLSA1, $(x | z = k) \sim \text{Cat}(\eta_{k, \cdot})$.

Illustrated in plate notation, pLSA2 is:

![](/assets/plsa2.png){style="width:350px"}

The computation is basically adding an index $\ell$ to the computation of SMM wherever applicable.

The updates at the E-step are

$$r_{\ell i k} = p(z_{\ell i} = k | x_{\ell i}; \theta) \propto \pi_{\ell k} \eta_{k, x_{\ell i}}.$$

And at the M-step

$$\begin{aligned}
\pi_{\ell k} &= {1 \over m} \sum_i r_{\ell i k} \\
\eta_{k w} &= {\sum_{\ell, i} r_{\ell i k} 1_{x_{\ell i} = w} \over \sum_{\ell, i} r_{\ell i k}}.
\end{aligned}$$

### HMM

The hidden Markov model (HMM) is a sequential version of SMM, in the same sense that recurrent neural networks are sequential versions of feed-forward neural networks.

HMM is an example where the posterior $p(z_i | x_i; \theta)$ is not easy to compute, and one has to utilise properties of the underlying Bayesian network to get around it.

Now each sample is a sequence $x_i = (x_{ij})_{j = 1 : T}$, and so are the latent variables $z_i = (z_{ij})_{j = 1 : T}$.

The latent variables are assumed to form a Markov chain with transition matrix $(\xi_{k \ell})_{k \ell}$, and $x_{ij}$ depends only on $z_{ij}$:

$$\begin{aligned}
p(z_{ij} | z_{i, j - 1}) &= \xi_{z_{i, j - 1}, z_{ij}},\\
p(x_{ij} | z_{ij}) &= \eta_{z_{ij}, x_{ij}}.
\end{aligned}$$

Also, the distribution of $z_{i1}$ is again categorical with parameter $\pi$:

$$p(z_{i1}) = \pi_{z_{i1}}$$

So the parameters are $\theta = (\pi, \xi, \eta)$. And HMM can be shown in plate notation as:

![](/assets/hmm.png){style="width:350px"}

Now we apply EM to HMM, which is called the [Baum-Welch algorithm](https://en.wikipedia.org/wiki/Baum%E2%80%93Welch_algorithm). Unlike in the previous examples, it is too messy to compute $p(z_i | x_{i}; \theta)$, so during the E-step we instead write down formula (2.5) directly in the hope of simplifying it:

$$\begin{aligned}
\mathbb E_{p(z_i | x_i; \theta_t)} \log p(x_i, z_i; \theta) &=\mathbb E_{p(z_i | x_i; \theta_t)} \left(\log \pi_{z_{i1}} + \sum_{j = 2 : T} \log \xi_{z_{i, j - 1}, z_{ij}} + \sum_{j = 1 : T} \log \eta_{z_{ij}, x_{ij}}\right). \qquad (3)
\end{aligned}$$

Let us compute the summand in the second term:

$$\begin{aligned}
\mathbb E_{p(z_i | x_{i}; \theta_t)} \log \xi_{z_{i, j - 1}, z_{ij}} &= \sum_{k, \ell} (\log \xi_{k, \ell}) \mathbb E_{p(z_{i} | x_{i}; \theta_t)} 1_{z_{i, j - 1} = k, z_{i, j} = \ell} \\
&= \sum_{k, \ell} p(z_{i, j - 1} = k, z_{ij} = \ell | x_{i}; \theta_t) \log \xi_{k, \ell}. \qquad (4)
\end{aligned}$$
Similarly, one can write down the first term and the summand in the third term to obtain

$$\begin{aligned}
\mathbb E_{p(z_i | x_{i}; \theta_t)} \log \pi_{z_{i1}} &= \sum_k p(z_{i1} = k | x_{i}; \theta_t) \log \pi_k, \qquad (5) \\
\mathbb E_{p(z_i | x_{i}; \theta_t)} \log \eta_{z_{i, j}, x_{i, j}} &= \sum_{k, w} 1_{x_{ij} = w} p(z_{i, j} = k | x_i; \theta_t) \log \eta_{k, w}. \qquad (6)
\end{aligned}$$

Plugging (4)(5)(6) back into (3) and summing over $j$ and $i$, we obtain the formula to maximise over $\theta$:

$$\sum_k \sum_i r_{i1k} \log \pi_k + \sum_{k, \ell} \sum_{j = 2 : T, i} s_{ijk\ell} \log \xi_{k, \ell} + \sum_{k, w} \sum_{j = 1 : T, i} r_{ijk} 1_{x_{ij} = w} \log \eta_{k, w},$$

where

$$\begin{aligned}
r_{ijk} &:= p(z_{ij} = k | x_{i}; \theta_t), \\
s_{ijk\ell} &:= p(z_{i, j - 1} = k, z_{ij} = \ell | x_{i}; \theta_t).
\end{aligned}$$

Now we proceed to the M-step. Since each of the $\pi_k, \xi_{k, \ell}, \eta_{k, w}$ is nicely confined in the inner sum of each term, together with the constraint $\sum_k \pi_k = \sum_\ell \xi_{k, \ell} = \sum_w \eta_{k, w} = 1$ it is not hard to find the argmax at time $t + 1$ (the same way one finds the MLE for any categorical distribution):

$$\begin{aligned}
\pi_{k} &= {1 \over m} \sum_i r_{i1k}, \qquad (6.1) \\
\xi_{k, \ell} &= {\sum_{j = 2 : T, i} s_{ijk\ell} \over \sum_{j = 1 : T - 1, i} r_{ijk}}, \qquad(6.2) \\
\eta_{k, w} &= {\sum_{ij} 1_{x_{ij} = w} r_{ijk} \over \sum_{ij} r_{ijk}}. \qquad(6.3)
\end{aligned}$$

Note that (6.1)(6.3) are almost identical to (2.7)(2.8). This makes sense, as the only modification HMM makes over SMM is the added dependencies between the latent variables.

What remains is to compute $r$ and $s$.

This is done using the forward and backward procedures, which take advantage of the conditional independence structure / topology of the underlying Bayesian network. It is out of the scope of this post, but for the sake of completeness I include it here.

Let

$$\begin{aligned}
\alpha_k(i, j) &:= p(x_{i, 1 : j}, z_{ij} = k; \theta_t), \\
\beta_k(i, j) &:= p(x_{i, j + 1 : T} | z_{ij} = k; \theta_t).
\end{aligned}$$

They can be computed recursively as

$$\begin{aligned}
\alpha_k(i, j) &= \begin{cases}
\eta_{k, x_{i1}} \pi_k, & j = 1; \\
\eta_{k, x_{ij}} \sum_\ell \alpha_\ell(i, j - 1) \xi_{\ell k}, & j \ge 2.
\end{cases}\\
\beta_k(i, j) &= \begin{cases}
1, & j = T;\\
\sum_\ell \xi_{k\ell} \beta_\ell(i, j + 1) \eta_{\ell, x_{i, j + 1}}, & j < T.
\end{cases}
\end{aligned}$$

Then

$$\begin{aligned}
p(z_{ij} = k, x_{i}; \theta_t) &= \alpha_k(i, j) \beta_k(i, j), \qquad (7)\\
p(x_{i}; \theta_t) &= \sum_k \alpha_k(i, j) \beta_k(i, j), \forall j = 1 : T, \qquad (8)\\
p(z_{i, j - 1} = k, z_{i, j} = \ell, x_{i}; \theta_t) &= \alpha_k(i, j - 1) \xi_{k\ell} \eta_{\ell, x_{ij}} \beta_\ell(i, j). \qquad (9)
\end{aligned}$$

And this yields $r_{ijk}$ and $s_{ijk\ell}$, since they can be computed as ${(7) \over (8)}$ and ${(9) \over (8)}$ respectively.
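For concreteness, here is a sketch of the two procedures for a single sequence, mirroring the recursions and (7)(8)(9) above. A serious implementation would rescale $\alpha$ and $\beta$ (or work in log space) to avoid underflow on long sequences:

```python
import numpy as np

def forward_backward(x, pi, xi, eta):
    """Posteriors for one sequence x of T integer observations.

    pi: (n_z,) initial distribution; xi: (n_z, n_z) transition matrix;
    eta: (n_z, n_x) emission matrix. Returns r[j, k] = p(z_j = k | x) and
    s[j, k, l] = p(z_j = k, z_{j+1} = l | x) for j = 0 .. T-2."""
    T, n_z = len(x), len(pi)
    alpha, beta = np.zeros((T, n_z)), np.zeros((T, n_z))
    alpha[0] = eta[:, x[0]] * pi
    for j in range(1, T):
        alpha[j] = eta[:, x[j]] * (alpha[j - 1] @ xi)
    beta[T - 1] = 1.0
    for j in range(T - 2, -1, -1):
        beta[j] = xi @ (eta[:, x[j + 1]] * beta[j + 1])
    p_x = alpha[-1].sum()    # equation (8)
    r = alpha * beta / p_x   # equation (7) / (8)
    s = (alpha[:-1, :, None] * xi[None]
         * (eta[:, x[1:]].T * beta[1:])[:, None, :]) / p_x  # (9) / (8)
    return r, s
```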
Fully Bayesian EM / MFA
-----------------------

Let us now venture into the realm of the fully Bayesian.

In EM we aim to maximise the ELBO

$$\int q(z) \log {p(x, z; \theta) \over q(z)} dz$$

alternately over $q$ and $\theta$. As mentioned before, the E-step of maximising over $q$ is Bayesian, in that it computes the posterior of $z$, whereas the M-step of maximising over $\theta$ is maximum likelihood and frequentist.

The fully Bayesian EM makes the M-step Bayesian as well, by making $\theta$ a random variable, so the ELBO becomes

$$L(p(x, z, \theta), q(z, \theta)) = \int q(z, \theta) \log {p(x, z, \theta) \over q(z, \theta)} dz d\theta$$

We further assume $q$ can be factorised into distributions on $z$ and $\theta$: $q(z, \theta) = q_1(z) q_2(\theta)$. So the above formula is rewritten as

$$L(p(x, z, \theta), q(z, \theta)) = \int q_1(z) q_2(\theta) \log {p(x, z, \theta) \over q_1(z) q_2(\theta)} dz d\theta$$

To find the argmax over $q_1$, we rewrite

$$\begin{aligned}
L(p(x, z, \theta), q(z, \theta)) &= \int q_1(z) \left(\int q_2(\theta) \log p(x, z, \theta) d\theta\right) dz - \int q_1(z) \log q_1(z) dz - \int q_2(\theta) \log q_2(\theta) d\theta \\&= - D(q_1(z) || p_x(z)) + C,
\end{aligned}$$

where $p_x$ is a density in $z$ with

$$\log p_x(z) = \mathbb E_{q_2(\theta)} \log p(x, z, \theta) + C.$$

So the $q_1$ that maximises the ELBO is $q_1^* = p_x$.

Similarly, the optimal $q_2$ is such that

$$\log q_2^*(\theta) = \mathbb E_{q_1(z)} \log p(x, z, \theta) + C.$$

The fully Bayesian EM thus alternately evaluates $q_1^*$ (E-step) and $q_2^*$ (M-step).

This scheme is also called mean field approximation (MFA), and it can be easily generalised to models with more than two groups of latent variables, see e.g. Section 10.1 of Bishop 2006.
### Application to mixture models

**Definition (fully Bayesian mixture model)**. The relations between $\pi$, $\eta$, $x$, $z$ are the same as in the definition of mixture models. Furthermore, we assume the distribution of $(x | \eta_k)$ belongs to the [exponential family](https://en.wikipedia.org/wiki/Exponential_family) (the definition of the exponential family is briefly touched upon at the end of this section). But now both $\pi$ and $\eta$ are random variables. Let the prior distribution $p(\pi)$ be Dirichlet with parameter $(\alpha, \alpha, ..., \alpha)$. Let the prior $p(\eta_k)$ be the conjugate prior of $(x | \eta_k)$, with parameter $\beta$; we will see later in this section that the posterior $q(\eta_k)$ belongs to the same family as $p(\eta_k)$. Represented in plate notation, a fully Bayesian mixture model looks like:

![](/assets/fully-bayesian-mm.png){style="width:450px"}

Given this structure we can write down the mean-field approximation:

$$\log q(z) = \mathbb E_{q(\eta)q(\pi)} (\log p(x | z, \eta) + \log p(z | \pi)) + C.$$

Both sides can be factorised into per-sample expressions, giving us

$$\log q(z_i) = \mathbb E_{q(\eta)} \log p(x_i | z_i, \eta) + \mathbb E_{q(\pi)} \log p(z_i | \pi) + C$$

Therefore

$$\log r_{ik} = \log q(z_i = k) = \mathbb E_{q(\eta_k)} \log p(x_i | \eta_k) + \mathbb E_{q(\pi)} \log \pi_k + C. \qquad (9.1)$$

So the posterior of each $z_i$ is categorical regardless of the $p$s and $q$s.

Computing the posterior of $\pi$ and $\eta$:

$$\log q(\pi) + \log q(\eta) = \log p(\pi) + \log p(\eta) + \sum_i \mathbb E_{q(z_i)} \log p(x_i | z_i, \eta) + \sum_i \mathbb E_{q(z_i)} \log p(z_i | \pi) + C.$$

So we can separate the terms involving $\pi$ and those involving $\eta$. First compute the posterior of $\pi$:

$$\log q(\pi) = \log p(\pi) + \sum_i \mathbb E_{q(z_i)} \log p(z_i | \pi) + C = \log p(\pi) + \sum_i \sum_k r_{ik} \log \pi_k + C.$$

The right hand side is the log of a Dirichlet modulo the constant $C$, from which we can update the posterior parameter $\phi^\pi$:

$$\phi^\pi_k = \alpha + \sum_i r_{ik}. \qquad (9.3)$$

Similarly we can obtain the posterior of $\eta$:

$$\log q(\eta) = \log p(\eta) + \sum_i \sum_k r_{ik} \log p(x_i | \eta_k) + C.$$

Again we can factorise the terms with respect to $k$ and get:

$$\log q(\eta_k) = \log p(\eta_k) + \sum_i r_{ik} \log p(x_i | \eta_k) + C. \qquad (9.5)$$

Here we can see why the conjugate prior works. Mathematically, given a probability distribution $p(x | \theta)$, the distribution $p(\theta)$ is called the conjugate prior of $p(x | \theta)$ if $\log p(\theta) + \log p(x | \theta)$ has the same form as $\log p(\theta)$.

For example, the conjugate prior for the exponential family $p(x | \theta) = h(x) \exp(\theta \cdot T(x) - A(\theta))$, where $T$, $A$ and $h$ are some functions, is $p(\theta; \chi, \nu) \propto \exp(\chi \cdot \theta - \nu A(\theta))$.

Here what we want is a bit different from conjugate priors because of the coefficients $r_{ik}$. But the computation carries over to the conjugate priors of the exponential family (try it yourself and you'll see). That is, if $p(x_i | \eta_k)$ belongs to the exponential family

$$p(x_i | \eta_k) = h(x) \exp(\eta_k \cdot T(x) - A(\eta_k))$$

and if $p(\eta_k)$ is the conjugate prior of $p(x_i | \eta_k)$

$$p(\eta_k) \propto \exp(\chi \cdot \eta_k - \nu A(\eta_k))$$

then $q(\eta_k)$ has the same form as $p(\eta_k)$, and from (9.5) we can compute the updates of $\phi^{\eta_k}$:

$$\begin{aligned}
\phi^{\eta_k}_1 &= \chi + \sum_i r_{ik} T(x_i), \qquad (9.7) \\
\phi^{\eta_k}_2 &= \nu + \sum_i r_{ik}. \qquad (9.9)
\end{aligned}$$

So the mean field approximation for the fully Bayesian mixture model is the alternate iteration of (9.1) (E-step) and (9.3)(9.7)(9.9) (M-step) until convergence.

### Fully Bayesian GMM

A typical example of fully Bayesian mixture models is the fully Bayesian Gaussian mixture model (Attias 2000, also called variational GMM in the literature). It is defined by applying the same modification to GMM as the difference between the fully Bayesian mixture model and the vanilla mixture model.

More specifically:

- $p(z_{i}) = \text{Cat}(\pi)$ as in vanilla GMM
- $p(\pi) = \text{Dir}(\alpha, \alpha, ..., \alpha)$ has a Dirichlet distribution, the conjugate prior to the parameters of the categorical distribution.
- $p(x_i | z_i = k) = p(x_i | \eta_k) = N(\mu_{k}, \Sigma_{k})$ as in vanilla GMM
- $p(\mu_k, \Sigma_k) = \text{NIW} (\mu_0, \lambda, \Psi, \nu)$ is the normal-inverse-Wishart distribution, the conjugate prior to the mean and covariance matrix of the Gaussian distribution.

The E-step and M-step can be computed using (9.1) and (9.3)(9.7)(9.9) in the previous section. The details of the computation can be found in Chapter 10.2 of Bishop 2006 or Attias 2000.
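Variational GMM is also available off the shelf: scikit-learn implements it as `BayesianGaussianMixture`, where, if I read the documentation correctly, `weight_concentration_prior` plays the role of the $\alpha$ above. A usage sketch:

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)
x = np.vstack([rng.normal(-2, 0.5, (100, 2)), rng.normal(2, 0.5, (100, 2))])

vgmm = BayesianGaussianMixture(
    n_components=10,  # deliberately more components than clusters
    weight_concentration_prior_type="dirichlet_distribution",  # Dir(a, ..., a)
    weight_concentration_prior=1e-2,  # the alpha above
    max_iter=500,
).fit(x)
print(np.round(vgmm.weights_, 3))  # surplus components get near-zero weight
```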
### LDA

As the second example of fully Bayesian mixture models, latent Dirichlet allocation (LDA) (Blei-Ng-Jordan 2003) is the fully Bayesian version of pLSA2, with the following plate notation:

![](/assets/lda.png){style="width:450px"}

It is the smoothed version in the paper.

More specifically, on the basis of pLSA2, we add prior distributions to $\eta$ and $\pi$:

$$\begin{aligned}
p(\eta_k) &= \text{Dir} (\beta, ..., \beta), \qquad k = 1 : n_z \\
p(\pi_\ell) &= \text{Dir} (\alpha, ..., \alpha), \qquad \ell = 1 : n_d \\
\end{aligned}$$

And as before, the prior of $z$ is

$$p(z_{\ell, i}) = \text{Cat} (\pi_\ell), \qquad \ell = 1 : n_d, i = 1 : m$$

We also denote the posterior distributions

$$\begin{aligned}
q(\eta_k) &= \text{Dir} (\phi^{\eta_k}), \qquad k = 1 : n_z \\
q(\pi_\ell) &= \text{Dir} (\phi^{\pi_\ell}), \qquad \ell = 1 : n_d \\
q(z_{\ell, i}) &= \text{Cat} (r_{\ell, i}), \qquad \ell = 1 : n_d, i = 1 : m
\end{aligned}$$

As before, at the E-step we update $r$, and at the M-step we update $\phi^\pi$ and $\phi^\eta$.

In the LDA paper, however, one treats the optimisation over $r$, $\phi^\pi$ and $\phi^\eta$ as the E-step, and treats $\alpha$ and $\beta$ as parameters, which are optimised over at the M-step. This makes it more akin to the classical EM where the E-step is Bayesian and the M-step MLE. This is more complicated, and we do not consider it this way here.

Plugging in (9.1) we obtain the updates at the E-step

$$r_{\ell i k} \propto \exp(\psi(\phi^{\pi_\ell}_k) + \psi(\phi^{\eta_k}_{x_{\ell i}}) - \psi(\sum_w \phi^{\eta_k}_w)), \qquad (10)$$

where $\psi$ is the digamma function. Similarly, plugging in (9.3)(9.7)(9.9), at the M-step we update the posteriors of $\pi$ and $\eta$:

$$\begin{aligned}
\phi^{\pi_\ell}_k &= \alpha + \sum_i r_{\ell i k}, \qquad (11)\\
\phi^{\eta_k}_w &= \beta + \sum_{\ell, i} r_{\ell i k} 1_{x_{\ell i} = w}. \qquad (12)
\end{aligned}$$

So the algorithm iterates over (10) and (11)(12) until convergence.
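A sketch of one sweep of (10) and (11)(12), with the corpus flattened into an $(n_d, m)$ array of word ids as in the setup above (the vectorisation details are mine):

```python
import numpy as np
from scipy.special import digamma

def lda_step(x, phi_pi, phi_eta, alpha, beta):
    """x: (n_d, m) word ids; phi_pi: (n_d, n_z) and phi_eta: (n_z, n_x)
    posterior Dirichlet parameters. Returns updated (r, phi_pi, phi_eta)."""
    n_z, n_x = phi_eta.shape
    # E-step, equation (10): log r up to an additive constant
    log_r = (digamma(phi_pi)[:, None, :]                  # psi(phi^pi_lk)
             + digamma(phi_eta[:, x]).transpose(1, 2, 0)  # psi(phi^eta_k,x_li)
             - digamma(phi_eta.sum(axis=1)))              # psi(sum_w phi^eta_kw)
    r = np.exp(log_r - log_r.max(axis=2, keepdims=True))
    r /= r.sum(axis=2, keepdims=True)
    # M-step, equations (11) and (12)
    phi_pi = alpha + r.sum(axis=1)
    phi_eta = beta + np.array(
        [[r[:, :, k][x == w].sum() for w in range(n_x)] for k in range(n_z)])
    return r, phi_pi, phi_eta
```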
### DPMM

The Dirichlet process mixture model (DPMM) is like the fully Bayesian mixture model, except $n_z = \infty$, i.e. $z$ can take any positive integer value.

The probability of $z_i = k$ is defined using the so-called stick-breaking process: let $v_i \sim \text{Beta} (\alpha, \beta)$ be i.i.d. random variables with Beta distributions, then

$$\mathbb P(z_i = k | v_{1:\infty}) = (1 - v_1) (1 - v_2) ... (1 - v_{k - 1}) v_k.$$

So $v$ plays a similar role to $\pi$ in the previous models.

As before, we assume that the distribution of $x$ belongs to the exponential family:

$$p(x | z = k, \eta) = p(x | \eta_k) = h(x) \exp(\eta_k \cdot T(x) - A(\eta_k))$$

so the prior of $\eta_k$ is

$$p(\eta_k) \propto \exp(\chi \cdot \eta_k - \nu A(\eta_k)).$$

Because of the infinities we can't directly apply the formulas in the general fully Bayesian mixture models. So let us carefully derive the whole thing again.

As before, we can write down the ELBO:

$$L(p(x, z, \theta), q(z, \theta)) = \mathbb E_{q(\theta)} \log {p(\theta) \over q(\theta)} + \mathbb E_{q(\theta) q(z)} \log {p(x, z | \theta) \over q(z)}.$$

Both terms are infinite series:

$$L(p, q) = \sum_{k = 1 : \infty} \mathbb E_{q(\theta_k)} \log {p(\theta_k) \over q(\theta_k)} + \sum_{i = 1 : m} \sum_{k = 1 : \infty} q(z_i = k) \mathbb E_{q(\theta)} \log {p(x_i, z_i = k | \theta) \over q(z_i = k)}.$$

There are several ways to deal with the infinities. One is to fix some level $T > 0$ and set $v_T = 1$ almost surely (Blei-Jordan 2006). This effectively turns the model into a finite one, and both terms become finite sums over $k = 1 : T$.

Another workaround (Kurihara-Welling-Vlassis 2007) is also a kind of truncation, but less heavy-handed: set the posterior $q(\theta) = q(\eta) q(v)$ to be

$$q(\theta) = q(\theta_{1 : T}) p(\theta_{T + 1 : \infty}) =: q(\theta_{\le T}) p(\theta_{> T}).$$

That is, tie the posterior after $T$ to the prior. This effectively turns the first term in the ELBO into a finite sum over $k = 1 : T$, while keeping the second term an infinite series:

$$L(p, q) = \sum_{k = 1 : T} \mathbb E_{q(\theta_k)} \log {p(\theta_k) \over q(\theta_k)} + \sum_i \sum_{k = 1 : \infty} q(z_i = k) \mathbb E_{q(\theta)} \log {p(x_i, z_i = k | \theta) \over q(z_i = k)}. \qquad (13)$$

The plate notation of this model is:

![](/assets/dpmm.png){style="width:450px"}

As it turns out, the infinities can be tamed in this case.

As before, the optimal $q(z_i)$ is computed as

$$r_{ik} = q(z_i = k) = s_{ik} / S_i$$

where

$$\begin{aligned}
s_{ik} &= \exp(\mathbb E_{q(\theta)} \log p(x_i, z_i = k | \theta)) \\
S_i &= \sum_{k = 1 : \infty} s_{ik}.
\end{aligned}$$

Plugging this back into (13), and using $\log r_{ik} = \mathbb E_{q(\theta)} \log p(x_i, z_i = k | \theta) - \log S_i$, we have

$$\begin{aligned}
\sum_{k = 1 : \infty} r_{ik} &\mathbb E_{q(\theta)} \log {p(x_i, z_i = k | \theta) \over r_{ik}} \\
&= \sum_{k = 1 : \infty} r_{ik} \left(\mathbb E_{q(\theta)} \log p(x_i, z_i = k | \theta) - \mathbb E_{q(\theta)} \log p(x_i, z_i = k | \theta) + \log S_i\right) = \log S_i.
\end{aligned}$$

So it all rests upon $S_i$ being finite.

For $k \le T + 1$, we compute the quantity $s_{ik}$ directly. For $k > T$, it is not hard to show that

$$s_{ik} = s_{i, T + 1} \exp((k - T - 1) \mathbb E_{p(w)} \log (1 - w)),$$

where $w$ is a random variable with the same distribution as the $v_k$'s under the prior, i.e. $\text{Beta}(\alpha, \beta)$.

Hence the tail of the series is geometric, and using $\mathbb E_{p(w)} \log (1 - w) = \psi(\beta) - \psi(\alpha + \beta)$,

$$S_i = \sum_{k = 1 : T} s_{ik} + {s_{i, T + 1} \over 1 - \exp(\psi(\beta) - \psi(\alpha + \beta))}$$

is indeed finite. Similarly we also obtain

$$q(z_i > k) = S_i^{-1} \left(\sum_{\ell = k + 1 : T} s_{i\ell} + {s_{i, T + 1} \over 1 - \exp(\psi(\beta) - \psi(\alpha + \beta))}\right), \qquad k \le T. \qquad (14)$$
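A sketch of this tail computation, taking the already-computed $s_{i, 1 : T + 1}$ for one sample as input (in the code, `s` has length $T + 1$ and `s[T]` holds $s_{i, T + 1}$):

```python
import numpy as np
from scipy.special import digamma

def truncated_responsibilities(s, alpha, beta):
    """Return S_i, (r_ik)_{k <= T} and (q(z_i > k))_{k <= T}."""
    T = len(s) - 1
    ratio = np.exp(digamma(beta) - digamma(alpha + beta))  # < 1
    tail = s[T] / (1 - ratio)  # geometric sum of s_ik over k > T
    S = s[:T].sum() + tail
    r = s[:T] / S
    q_gt = np.array([(s[k + 1:T].sum() + tail) / S for k in range(T)])  # (14)
    return S, r, q_gt
```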
Now let us compute the posterior of $\theta_{\le T}$. In the following we exchange the integrals without justifying the exchange (cf. Fubini's theorem). Equation (13) can be rewritten as

$$L(p, q) = \mathbb E_{q(\theta_{\le T})} \left(\log p(\theta_{\le T}) + \sum_i \mathbb E_{q(z_i) p(\theta_{> T})} \log {p(x_i, z_i | \theta) \over q(z_i)} - \log q(\theta_{\le T})\right).$$

Note that unlike in the derivation of the mean-field approximation, we keep the $- \mathbb E_{q(z)} \log q(z)$ term even though we are only interested in $\theta$ at this stage. This is again due to the problem of infinities: as in the computation of $S$, we would like to cancel out some undesirable unbounded terms using $q(z)$. We now have

$$\log q(\theta_{\le T}) = \log p(\theta_{\le T}) + \sum_i \mathbb E_{q(z_i) p(\theta_{> T})} \log {p(x_i, z_i | \theta) \over q(z_i)} + C.$$

By plugging in $q(z_i = k)$ we obtain

$$\log q(\theta_{\le T}) = \log p(\theta_{\le T}) + \sum_i \sum_{k = 1 : \infty} q(z_i = k) \left(\mathbb E_{p(\theta_{> T})} \log {p(x_i, z_i = k | \theta) \over q(z_i = k)} - \mathbb E_{q(\theta)} \log {p(x_i, z_i = k | \theta) \over q(z_i = k)}\right) + C.$$

Again, we separate the $v_k$'s and the $\eta_k$'s to obtain

$$\log q(v_{\le T}) = \log p(v_{\le T}) + \sum_i \sum_k q(z_i = k) \left(\mathbb E_{p(v_{> T})} \log p(z_i = k | v) - \mathbb E_{q(v)} \log p (z_i = k | v)\right).$$

Denote by $D_k$ the difference between the two expectations on the right hand side. It is easy to show that

$$D_k = \begin{cases}
\log(1 - v_1) ... (1 - v_{k - 1}) v_k - \mathbb E_{q(v)} \log (1 - v_1) ... (1 - v_{k - 1}) v_k, & k \le T\\
\log(1 - v_1) ... (1 - v_T) - \mathbb E_{q(v)} \log (1 - v_1) ... (1 - v_T), & k > T
\end{cases}$$

so $D_k$ is bounded. With this we can derive the updates for $\phi^{v, 1}$ and $\phi^{v, 2}$:

$$\begin{aligned}
\phi^{v, 1}_k &= \alpha + \sum_i q(z_i = k) \\
\phi^{v, 2}_k &= \beta + \sum_i q(z_i > k),
\end{aligned}$$

where $q(z_i > k)$ can be computed as in (14).

When it comes to $\eta$, we have

$$\log q(\eta_{\le T}) = \log p(\eta_{\le T}) + \sum_i \sum_{k = 1 : \infty} q(z_i = k) (\mathbb E_{p(\eta_k)} \log p(x_i | \eta_k) - \mathbb E_{q(\eta_k)} \log p(x_i | \eta_k)).$$

Since $q(\eta_k) = p(\eta_k)$ for $k > T$, the summand vanishes for $k > T$ and the inner sum on the right hand side is a finite sum over $k = 1 : T$. By factorising $q(\eta_{\le T})$ and $p(\eta_{\le T})$, we have

$$\log q(\eta_k) = \log p(\eta_k) + \sum_i q(z_i = k) \log p(x_i | \eta_k) + C,$$

which gives us

$$\begin{aligned}
\phi^{\eta, 1}_k &= \chi + \sum_i q(z_i = k) T(x_i) \\
\phi^{\eta, 2}_k &= \nu + \sum_i q(z_i = k).
\end{aligned}$$
SVI
---

In variational inference, the computation of some parameters is more expensive than that of others.

For example, the computation at the M-step is often much more expensive than that at the E-step:

- In the vanilla mixture models with the EM algorithm, the update of $\theta$ requires the computation of $r_{ik}$ for all $i = 1 : m$, see Eq (2.3).
- In the fully Bayesian mixture model with mean field approximation, the updates of $\phi^\pi$ and $\phi^\eta$ require the computation of a sum over all samples (see Eq (9.3)(9.7)(9.9)).

Similarly, in pLSA2 (resp. LDA), the update of $\eta_k$ (resp. $\phi^{\eta_k}$) requires a sum over $\ell = 1 : n_d$, whereas the updates of the other parameters do not.

In these cases, the parameters that require more computation are called global and the other ones local.

Stochastic variational inference (SVI, Hoffman-Blei-Wang-Paisley 2012) addresses this problem in the same way as stochastic gradient descent improves the efficiency of gradient descent.

Each time SVI picks a sample, updates the corresponding local parameters, and computes the update of the global parameters as if all the $m$ samples were identical to the picked sample. Finally it incorporates this global parameter value into previous computations of the global parameters, by means of an exponential moving average.

As an example, here's SVI applied to LDA:

1. Set $t = 1$.
2. Pick $\ell$ uniformly from $\{1, 2, ..., n_d\}$.
3. Repeat until convergence:
   1. Compute $(r_{\ell i k})_{i = 1 : m, k = 1 : n_z}$ using (10).
   2. Compute $(\phi^{\pi_\ell}_k)_{k = 1 : n_z}$ using (11).
4. Compute $(\tilde \phi^{\eta_k}_w)_{k = 1 : n_z, w = 1 : n_x}$ using the following formula (compare with (12)):
   $$\tilde \phi^{\eta_k}_w = \beta + n_d \sum_{i} r_{\ell i k} 1_{x_{\ell i} = w}$$
5. Update the exponential moving average $(\phi^{\eta_k}_w)_{k = 1 : n_z, w = 1 : n_x}$:
   $$\phi^{\eta_k}_w = (1 - \rho_t) \phi^{\eta_k}_w + \rho_t \tilde \phi^{\eta_k}_w$$
6. Increment $t$ and go back to Step 2.

In the original paper, $\rho_t$ needs to satisfy some conditions that guarantee convergence of the global parameters:

$$\begin{aligned}
\sum_t \rho_t = \infty \\
\sum_t \rho_t^2 < \infty
\end{aligned}$$

and the choice made there is

$$\rho_t = (t + \tau)^{-\kappa}$$

for some $\kappa \in (.5, 1]$ and $\tau \ge 0$.
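A sketch of the step-size schedule and the moving-average update in steps 5-6 (the local updates in step 3 are just (10) and (11)):

```python
def rho(t, tau=1.0, kappa=0.7):
    """Robbins-Monro step size rho_t = (t + tau)^(-kappa)."""
    return (t + tau) ** -kappa

def svi_global_update(phi_eta, phi_eta_hat, t):
    """Blend the one-document estimate phi_eta_hat (computed as if the
    corpus were n_d copies of the sampled document) into the running
    value of the global parameter phi_eta."""
    return (1 - rho(t)) * phi_eta + rho(t) * phi_eta_hat
```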
AEVB
----

SVI adds to variational inference stochastic updates similar to those of stochastic gradient descent. Why not just use neural networks with stochastic gradient descent while we are at it? Autoencoding variational Bayes (AEVB) (Kingma-Welling 2013) is such an algorithm.

Let's look back at the original problem of maximising the ELBO:

$$\max_{\theta, q} \sum_{i = 1 : m} L(p(x_i | z_i; \theta) p(z_i; \theta), q(z_i))$$

Since for any given $\theta$, the optimal $q(z_i)$ is the posterior $p(z_i | x_i; \theta)$, the problem reduces to

$$\max_{\theta} \sum_i L(p(x_i | z_i; \theta) p(z_i; \theta), p(z_i | x_i; \theta))$$

Let us assume $p(z_i; \theta) = p(z_i)$ is independent of $\theta$ to simplify the problem. In the old mixture models, we have $p(x_i | z_i; \theta) = p(x_i; \eta_{z_i})$, which we can generalise to $p(x_i; f(\theta, z_i))$ for some function $f$. Using Bayes' theorem we can also write down $p(z_i | x_i; \theta) = q(z_i; g(\theta, x_i))$ for some function $g$. So the problem becomes

$$\max_{\theta} \sum_i L(p(x_i; f(\theta, z_i)) p(z_i), q(z_i; g(\theta, x_i)))$$

In some cases $g$ can be hard to write down or compute. AEVB addresses this problem by replacing $g(\theta, x_i)$ with a neural network $g_\phi(x_i)$ with input $x_i$ and some separate parameters $\phi$. It also replaces $f(\theta, z_i)$ with a neural network $f_\theta(z_i)$ with input $z_i$ and parameters $\theta$. And now the problem becomes

$$\max_{\theta, \phi} \sum_i L(p(x_i; f_\theta(z_i)) p(z_i), q(z_i; g_\phi(x_i))).$$

The objective function can be written as

$$\sum_i \mathbb E_{q(z_i; g_\phi(x_i))} \log p(x_i; f_\theta(z_i)) - D(q(z_i; g_\phi(x_i)) || p(z_i)).$$

The first term is called the negative reconstruction error, like the $- \|decoder(encoder(x)) - x\|$ in autoencoders, which is where the "autoencoder" in the name comes from.

The second term is a regularisation term that penalises a posterior $q(z_i)$ that is very different from the prior $p(z_i)$. We assume this term can be computed analytically.

So only the first term requires computing.

We can approximate the sum over $i$ in a similar fashion to SVI: pick $j$ uniformly randomly from $\{1 ... m\}$ and treat the whole dataset as $m$ replicates of $x_j$, and approximate the expectation using Monte-Carlo:

$$U(x, \theta, \phi) := \sum_i \mathbb E_{q(z_i; g_\phi(x_i))} \log p(x_i; f_\theta(z_i)) \approx m \mathbb E_{q(z_j; g_\phi(x_j))} \log p(x_j; f_\theta(z_j)) \approx {m \over L} \sum_{\ell = 1}^L \log p(x_j; f_\theta(z_{j, \ell})),$$

where each $z_{j, \ell}$ is sampled from $q(z_j; g_\phi(x_j))$.

But then it is not easy to approximate the gradient over $\phi$. One can use the log trick as in policy gradients, but it has the problem of high variance. In policy gradients this is overcome by using baseline subtractions. In the AEVB paper it is tackled with the reparameterisation trick.

Assume there exists a transformation $T_\phi$ and a random variable $\epsilon$ with distribution independent of $\phi$ and $\theta$, such that $T_\phi(x_i, \epsilon)$ has distribution $q(z_i; g_\phi(x_i))$. In this case we can rewrite $U(x, \theta, \phi)$ as

$$\sum_i \mathbb E_{\epsilon \sim p(\epsilon)} \log p(x_i; f_\theta(T_\phi(x_i, \epsilon))),$$

so that one can use Monte-Carlo to approximate $\nabla_\phi U(x, \theta, \phi)$:

$$\nabla_\phi U(x, \theta, \phi) \approx {m \over L} \sum_{\ell = 1 : L} \nabla_\phi \log p(x_j; f_\theta(T_\phi(x_j, \epsilon_\ell))),$$

where each $\epsilon_{\ell}$ is sampled from $p(\epsilon)$. The approximation of $U(x, \theta, \phi)$ itself can be done similarly.

### VAE

As an example of AEVB, the paper introduces the variational autoencoder (VAE), with the following instantiations:

- The prior $p(z_i) = N(0, I)$ is standard normal, thus independent of $\theta$.
- The distribution $p(x_i; \eta)$ is either Gaussian or categorical.
- The distribution $q(z_i; \mu, \Sigma)$ is Gaussian with diagonal covariance matrix. So $g_\phi(x_i) = (\mu_\phi(x_i), \text{diag}(\sigma^2_\phi(x_i)_{1 : d}))$. Thus in the reparameterisation trick $\epsilon \sim N(0, I)$ and $T_\phi(x_i, \epsilon) = \epsilon \odot \sigma_\phi(x_i) + \mu_\phi(x_i)$, where $\odot$ is elementwise multiplication.
- The KL divergence can be easily computed analytically as $- D(q(z_i; g_\phi(x_i)) || p(z_i)) = {d \over 2} + \sum_{j = 1 : d} \log\sigma_\phi(x_i)_j - {1 \over 2} \sum_{j = 1 : d} (\mu_\phi(x_i)_j^2 + \sigma_\phi(x_i)_j^2)$.

With this, one can use backprop to maximise the ELBO.
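A minimal PyTorch sketch of a VAE with a Gaussian decoder of fixed unit variance, mirroring the bullets above (the architecture sizes are arbitrary):

```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, n=784, d=20, hidden=400):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(n, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, d)      # mu_phi(x)
        self.logvar = nn.Linear(hidden, d)  # log sigma^2_phi(x)
        self.dec = nn.Sequential(nn.Linear(d, hidden), nn.ReLU(),
                                 nn.Linear(hidden, n))  # f_theta(z)

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        eps = torch.randn_like(mu)              # epsilon ~ N(0, I)
        z = mu + eps * torch.exp(0.5 * logvar)  # T_phi(x, epsilon)
        return self.dec(z), mu, logvar

def neg_elbo(x, x_hat, mu, logvar):
    # Reconstruction error of a unit-variance Gaussian decoder (up to an
    # additive constant), plus the analytic KL divergence from the last bullet.
    rec = 0.5 * ((x - x_hat) ** 2).sum()
    kl = -0.5 * (1 + logvar - mu ** 2 - logvar.exp()).sum()
    return rec + kl
```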
### Fully Bayesian AEVB

Let us turn to the fully Bayesian version of AEVB. Again, we first recall the ELBO of the fully Bayesian mixture models:

$$L(p(x, z, \pi, \eta; \alpha, \beta), q(z, \pi, \eta; r, \phi)) = L(p(x | z, \eta) p(z | \pi) p(\pi; \alpha) p(\eta; \beta), q(z; r) q(\eta; \phi^\eta) q(\pi; \phi^\pi)).$$

We write $\theta = (\pi, \eta)$, rewrite $\alpha := (\alpha, \beta)$, $\phi := r$, and $\gamma := (\phi^\eta, \phi^\pi)$. Furthermore, as in the half-Bayesian version we assume $p(z | \theta) = p(z)$, i.e. $z$ does not depend on $\theta$. Similarly we also assume $p(\theta; \alpha) = p(\theta)$. Now we have

$$L(p(x, z, \theta; \alpha), q(z, \theta; \phi, \gamma)) = L(p(x | z, \theta) p(z) p(\theta), q(z; \phi) q(\theta; \gamma)).$$

And the objective is to maximise it over $\phi$ and $\gamma$. We no longer maximise over $\theta$, because it is now a random variable, like $z$. Now let us transform it into a neural network model, as in the half-Bayesian case:

$$L\left(\left(\prod_{i = 1 : m} p(x_i; f_\theta(z_i))\right) \left(\prod_{i = 1 : m} p(z_i) \right) p(\theta), \left(\prod_{i = 1 : m} q(z_i; g_\phi(x_i))\right) q(\theta; h_\gamma(x))\right),$$

where $f_\theta$, $g_\phi$ and $h_\gamma$ are neural networks. Again, by separating out the KL-divergence terms, the above formula becomes

$$\sum_i \mathbb E_{q(\theta; h_\gamma(x))q(z_i; g_\phi(x_i))} \log p(x_i; f_\theta(z_i)) - \sum_i D(q(z_i; g_\phi(x_i)) || p(z_i)) - D(q(\theta; h_\gamma(x)) || p(\theta)).$$

Again, we assume the latter two terms can be computed analytically. Using the reparameterisation trick, we write

$$\begin{aligned}
\theta &= R_\gamma(\zeta, x) \\
z_i &= T_\phi(\epsilon, x_i)
\end{aligned}$$

for some transformations $R_\gamma$ and $T_\phi$ and random variables $\zeta$ and $\epsilon$, so that the outputs have the desired distributions.

Then the first term can be written as

$$\sum_i \mathbb E_{\zeta, \epsilon} \log p(x_i; f_{R_\gamma(\zeta, x)} (T_\phi(\epsilon, x_i))),$$

so that the gradients can be computed accordingly.

Again, one may use Monte-Carlo to approximate this expectation.

References
----------

- Attias, Hagai. "A variational Bayesian framework for graphical models." In Advances in Neural Information Processing Systems, pp. 209-215. 2000.
- Bishop, Christopher M. Pattern Recognition and Machine Learning. Springer. 2006.
- Blei, David M., and Michael I. Jordan. "Variational Inference for Dirichlet Process Mixtures." Bayesian Analysis 1, no. 1 (March 2006): 121-143.
- Blei, David M., Andrew Y. Ng, and Michael I. Jordan. "Latent Dirichlet Allocation." Journal of Machine Learning Research 3, no. Jan (2003): 993-1022.
- Hofmann, Thomas. "Latent Semantic Models for Collaborative Filtering." ACM Transactions on Information Systems 22, no. 1 (January 1, 2004): 89-115.
- Hofmann, Thomas. "Learning the similarity of documents: An information-geometric approach to document retrieval and categorization." In Advances in Neural Information Processing Systems, pp. 914-920. 2000.
- Hoffman, Matt, David M. Blei, Chong Wang, and John Paisley. "Stochastic Variational Inference." arXiv:1206.7051 [cs, stat], June 29, 2012.
- Kingma, Diederik P., and Max Welling. "Auto-Encoding Variational Bayes." arXiv:1312.6114 [cs, stat], December 20, 2013.
- Kurihara, Kenichi, Max Welling, and Nikos Vlassis. "Accelerated variational Dirichlet process mixtures." In Advances in Neural Information Processing Systems, pp. 761-768. 2007.
- Sudderth, Erik Blaine. "Graphical models for visual object recognition and tracking." PhD diss., Massachusetts Institute of Technology, 2006.