From 70420c04d3b32866275a0353a6c818ab9d1213b9 Mon Sep 17 00:00:00 2001
From: Yuchen Pei
Date: Thu, 14 Feb 2019 09:57:49 +0100
Subject: added images for the new post; fixed format

---
 posts/2019-02-14-raise-your-elbo.md | 68 ++++++++++++++++++-------------------
 1 file changed, 34 insertions(+), 34 deletions(-)

diff --git a/posts/2019-02-14-raise-your-elbo.md b/posts/2019-02-14-raise-your-elbo.md
index 969946d..e862007 100644
--- a/posts/2019-02-14-raise-your-elbo.md
+++ b/posts/2019-02-14-raise-your-elbo.md
@@ -29,7 +29,7 @@ Monte-Carlo + neural network approach to raising the ELBO, exemplified by
 the variational autoencoder (VAE). I also show its fully Bayesian version.

-[]{#Acknowledgement}**Acknowledgement**. The following texts and
+**Acknowledgement**. The following texts and
 resources were illuminating during the writing of this post: the
 Stanford CS228 notes
 ([1](https://ermongroup.github.io/cs228-notes/inference/variational/),[2](https://ermongroup.github.io/cs228-notes/learning/latent/)),
@@ -48,7 +48,7 @@ mathematics department.
 desktop site\" for the equations to be properly displayed. This post
 is licensed under CC BY-SA and GNU FDL.*

-KL divergence and ELBO {#KL divergence and ELBO}
+KL divergence and ELBO
 ----------------------

 Let $p$ and $q$ be two probability measures. The Kullback-Leibler (KL)
@@ -83,7 +83,7 @@ $q = {w \over Z}$.
 It is time to define the task of variational inference (VI), also known
 as variational Bayes (VB).

-[]{#Definition}**Definition**. Variational inference is concerned with
+**Definition**. Variational inference is concerned with
 maximising the ELBO $L(w, q)$.

 There are mainly two versions of VI, the half Bayesian and the fully
@@ -107,7 +107,7 @@ q &= q(z, \theta)
 In both cases $\theta$ are parameters and $z$ are latent variables.

-[]{#Remark on the naming of things}**Remark on the naming of things**.
+**Remark on the naming of things**.
 The term \"variational\" comes from the fact that we perform calculus of
 variations: maximise some functional ($L(w, q)$) over a set of functions
 ($q$). Note, however, that most of the VI / VB algorithms do not concern any
@@ -116,12 +116,12 @@ techniques in calculus of variations, but only uses Jensen\'s inequality
 reasoning of the naming, EM is also a kind of VI, even though in the
 literature VI often refers to its fully Bayesian version.

-EM {#EM}
+EM
 --

 To illustrate the EM algorithm, we first define the mixture model.

-[]{#Definition (mixture model)}**Definition (mixture model)**. Given
+**Definition (mixture model)**. Given
 dataset $x_{1 : m}$, we assume the data has some underlying latent
 variable $z_{1 : m}$ that may take a value from a finite set
 $\{1, 2, ..., n_z\}$. Let $p(z_{i}; \pi)$ be categorically distributed
@@ -134,7 +134,7 @@ $p(x_{1 : m}; \theta)$.
 Represented as a DAG (a.k.a. the plate notations), the model looks like
 this:

-![](file:../resource/mixture-model.png){style="width:250px"}
+![](/assets/resources/mixture-model.png){style="width:250px"}

 where the boxes with $m$ mean repetition for $m$ times, since there are $m$
 independent pairs of $(x, z)$, and the same goes for $\eta$.
@@ -194,7 +194,7 @@ model is:
 - At M-step, one finds $\theta_{t + 1}$ to be the $\theta$ that
   maximises the above formula.

-### GMM {#GMM}
+### GMM

 The Gaussian mixture model (GMM) is an example of a mixture model.

@@ -228,13 +228,13 @@ $$\begin{aligned}
 \Sigma_{k} &= {\sum_i r_{ik} (x_{i} - \mu_{t, k}) (x_{i} - \mu_{t, k})^T \over \sum_i r_{ik}}.
 \end{aligned}$$
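As an aside, the GMM E-step / M-step pair is easy to prototype. The sketch below is only an illustration and is not part of the post or of this patch: it assumes the standard E-step $r_{ik} \propto \pi_k N(x_i; \mu_k, \Sigma_k)$ (the part of the section elided by the hunk), uses numpy/scipy, and all function and variable names are made up.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_em_step(x, pi, mu, sigma):
    """One EM iteration for a GMM (illustrative sketch).

    x: (m, d) data; pi: (n_z,) mixing weights; mu: (n_z, d) means;
    sigma: (n_z, d, d) covariance matrices.
    """
    m = x.shape[0]
    n_z = len(pi)
    # E-step: responsibilities r[i, k] = q(z_i = k) ∝ pi_k N(x_i; mu_k, Sigma_k)
    r = np.column_stack([pi[k] * multivariate_normal.pdf(x, mu[k], sigma[k])
                         for k in range(n_z)])
    r /= r.sum(axis=1, keepdims=True)
    # M-step: closed-form maximisers, matching the updates displayed in this section
    nk = r.sum(axis=0)                     # effective number of points per component
    pi_new = nk / m
    mu_new = (r.T @ x) / nk[:, None]
    sigma_new = np.empty_like(sigma)
    for k in range(n_z):
        centred = x - mu_new[k]            # covariance computed with the freshly updated means
        sigma_new[k] = (r[:, k, None] * centred).T @ centred / nk[k]
    return r, pi_new, mu_new, sigma_new
```

Iterating `gmm_em_step` until the ELBO stops improving gives the usual EM behaviour for a GMM.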

-[]{#Remark}**Remark**. The k-means algorithm is the $\epsilon \to 0$
+**Remark**. The k-means algorithm is the $\epsilon \to 0$
 limit of the GMM with constraints $\Sigma_k = \epsilon I$. See Section
 9.3.2 of Bishop 1995 for a derivation. It is also briefly mentioned there
 that a variant in this setting where the covariance matrix is not
 restricted to $\epsilon I$ is called the elliptical k-means algorithm.

-### SMM {#SMM}
+### SMM

 As a transition to the next models to study, let us consider a simpler
 mixture model obtained by making one modification to GMM: change
@@ -262,7 +262,7 @@ filtering, whereas pLSA2 has a fully Bayesian version called latent
 Dirichlet allocation (LDA), not to be confused with the other LDA
 (linear discriminant analysis).

-### pLSA {#pLSA}
+### pLSA

 The pLSA model (Hoffman 2000) is a mixture model, where the dataset is
 now pairs $(d_i, x_i)_{i = 1 : m}$. In natural language processing, $x$
@@ -279,7 +279,7 @@ p(d_i, x_i; \theta) &= \sum_z p(z; \theta) p(d_i | z; \theta) p(x_i | z; \theta)
 Of the two formulations, (2.91) corresponds to pLSA type 1, and (2.92)
 corresponds to type 2.

-#### pLSA1 {#pLSA1}
+#### pLSA1

 The pLSA1 model (Hoffman 2000) is basically SMM with $x_i$ substituted
 with $(d_i, x_i)$, which conditioned on $z$ are independently
@@ -289,7 +289,7 @@ $$p(d_i = u, x_i = w | z = k) = p(d_i | \xi_k) p(x_i; \eta_k) = \xi_{ku} \eta_{k
 The model can be illustrated in the plate notations:

-![](file:../resource/plsa1.png){style="width:350px"}
+![](/assets/resources/plsa1.png){style="width:350px"}

 So the solution of the M-step is
@@ -299,12 +299,12 @@ $$\begin{aligned}
 \eta_{k, w} &= {\sum_i r_{ik} 1_{x_{i} = w} \over \sum_i r_{ik}}.
 \end{aligned}$$

-[]{#Remark}**Remark**. pLSA1 is the probabilistic version of LSA, also
+**Remark**. pLSA1 is the probabilistic version of LSA, also
 known as matrix factorisation.

 Let $n_d$ and $n_x$ be the number of values $d_i$ and $x_i$ can take.

-[]{#Problem}**Problem** (LSA). Let $R$ be a $n_d \times n_x$ matrix, fix
+**Problem** (LSA). Let $R$ be a $n_d \times n_x$ matrix, fix
 $s \le \min\{n_d, n_x\}$. Find $n_d \times s$ matrix $D$ and
 $n_x \times s$ matrix $X$ that minimise
@@ -312,7 +312,7 @@ $$J(D, X) = \|R - D X^T\|_F.$$
 where $\|\cdot\|_F$ is the Frobenius norm.

-[]{#Claim}**Claim**. Let $R = U \Sigma V^T$ be the SVD of $R$, then the
+**Claim**. Let $R = U \Sigma V^T$ be the SVD of $R$, then the
 solution to the above problem is $D = U_s \Sigma_s^{{1 \over 2}}$ and
 $X = V_s \Sigma_s^{{1 \over 2}}$, where $U_s$ (resp. $V_s$) is the
 matrix of the first $s$ columns of $U$ (resp. $V$) and $\Sigma_s$ is the
@@ -323,7 +323,7 @@ $d$ and $x$: in pLSA we obtain $n_z$ dimensional embeddings
 $\xi_{\cdot, u}$ and $\eta_{\cdot, w}$, whereas in LSA we obtain $s$
 dimensional embeddings $D_{u, \cdot}$ and $X_{w, \cdot}$.

-#### pLSA2 {#pLSA2}
+#### pLSA2

 Let us turn to pLSA2 (Hoffman 2004), corresponding to (2.92). We
 rewrite it as
@@ -350,7 +350,7 @@ pLSA1, $(x | z = k) \sim \text{Cat}(\eta_{k, \cdot})$.
 Illustrated in the plate notations, pLSA2 is:

-![](file:../resource/plsa2.png){style="width:350px"}
+![](/assets/resources/plsa2.png){style="width:350px"}

 The computation is basically adding an index $\ell$ to the computation
 of SMM wherever applicable.
@@ -366,7 +366,7 @@ $$\begin{aligned}
 \eta_{k w} &= {\sum_{\ell, i} r_{\ell i k} 1_{x_{\ell i} = w} \over \sum_{\ell, i} r_{\ell i k}}.
 \end{aligned}$$
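The "add an index $\ell$" recipe translates almost line by line into code. The following sketch of one EM pass for pLSA2 is again only an illustration, not part of the post or of this patch: it assumes the standard formulation in which each document $\ell$ has its own topic weights $\pi_{\ell \cdot}$ (the elided hunks fix the exact notation), and the names `docs`, `pi`, `eta` are hypothetical.

```python
import numpy as np

def plsa2_em(docs, n_z, n_x, n_iter=100, seed=0):
    """docs[l][i] = x_{l i}, an integer word id in {0, ..., n_x - 1} (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    pi = rng.dirichlet(np.ones(n_z), size=len(docs))   # pi[l, k]: topic weights of document l
    eta = rng.dirichlet(np.ones(n_x), size=n_z)        # eta[k, w]: word distribution of topic k
    for _ in range(n_iter):
        acc = np.zeros((n_x, n_z))                     # accumulates sum_{l, i} r_{l i k} 1[x_{l i} = w]
        for l, doc in enumerate(docs):
            doc = np.asarray(doc)
            # E-step: r[i, k] = r_{l i k} ∝ pi[l, k] * eta[k, x_{l i}]
            r = pi[l] * eta[:, doc].T
            r /= r.sum(axis=1, keepdims=True)
            # M-step, document part: pi[l, k] = sum_i r_{l i k} / len(doc)
            pi[l] = r.sum(axis=0) / len(doc)
            np.add.at(acc, doc, r)                     # acc[w, k] += r[i, k] for every position i with x_{l i} = w
        # M-step, topic part: the eta_{k w} update displayed above
        # (eta is held fixed for the whole pass, so this is one exact EM iteration)
        eta = (acc / (acc.sum(axis=0, keepdims=True) + 1e-12)).T
    return pi, eta
```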

-### HMM {#HMM}
+### HMM

 The hidden Markov model (HMM) is a sequential version of SMM, in the
 same sense that recurrent neural networks are sequential versions of
@@ -396,7 +396,7 @@ $$p(z_{i1}) = \pi_{z_{i1}}$$
 So the parameters are $\theta = (\pi, \xi, \eta)$. And HMM can be shown
 in plate notations as:

-![](file:../resource/hmm.png){style="width:350px"}
+![](/assets/resources/hmm.png){style="width:350px"}

 Now we apply EM to HMM, which is called the [Baum-Welch
 algorithm](https://en.wikipedia.org/wiki/Baum%E2%80%93Welch_algorithm).
@@ -490,7 +490,7 @@ p(z_{i, j - 1} = k, z_{i, j} = \ell, x_{i}; \theta_t) &= \alpha_k(i, j) \xi_{k\e
 And this yields $r_{ijk}$ and $s_{ijk\ell}$ since they can be computed
 as ${(7) \over (8)}$ and ${(9) \over (8)}$ respectively.

-Fully Bayesian EM / MFA {#Fully Bayesian EM / MFA}
+Fully Bayesian EM / MFA
 -----------------------

 Let us now venture into the fully Bayesian realm.
@@ -538,9 +538,9 @@ It is also called mean field approximation (MFA), and can be easily
 generalised to models with more than two groups of latent variables,
 see e.g. Section 10.1 of Bishop 1995.

-### Application to mixture models {#Application to mixture models}
+### Application to mixture models

-[]{#Definition (Fully Bayesian mixture model)}**Definition (Fully
+**Definition (Fully
 Bayesian mixture model)**. The relations between $\pi$, $\eta$, $x$,
 $z$ are the same as in the definition of mixture models. Furthermore,
 we assume the distribution of $(x | \eta_k)$ belongs to the [exponential
@@ -554,7 +554,7 @@ later in this section that the posterior $q(\eta_k)$ has belongs to the
 same family as $p(\eta_k)$. Represented in plate notations, a fully
 Bayesian mixture model looks like:

-![](file:../resource/fully-bayesian-mm.png){style="width:450px"}
+![](/assets/resources/fully-bayesian-mm.png){style="width:450px"}

 Given this structure we can write down the mean-field approximation:
@@ -627,7 +627,7 @@ So the mean field approximation for the fully Bayesian mixture model is
 the alternating iteration of (9.1) (E-step) and (9.3)(9.7)(9.9) (M-step)
 until convergence.

-### Fully Bayesian GMM {#Fully Bayesian GMM}
+### Fully Bayesian GMM

 A typical example of fully Bayesian mixture models is the fully Bayesian
 Gaussian mixture model (Attias 2000, also called variational GMM in the
@@ -651,13 +651,13 @@ The E-step and M-step can be computed using (9.1) and (9.3)(9.7)(9.9) in
 the previous section. The details of the computation can be found in
 Chapter 10.2 of Bishop or the Attias paper.

-### LDA {#LDA}
+### LDA

 As the second example of fully Bayesian mixture models, Latent Dirichlet
 allocation (LDA) (Blei-Ng-Jordan 2003) is the fully Bayesian version of
 pLSA2, with the following plate notations:

-![](file:../resource/lda.png){style="width:450px"}
+![](/assets/resources/lda.png){style="width:450px"}

 It is the smoothed version in the paper.
@@ -712,7 +712,7 @@ $$\begin{aligned}
 So the algorithm iterates over (10) and (11)(12) until convergence.

-### DPMM {#DPMM}
+### DPMM

 The Dirichlet process mixture model (DPMM) is like the fully Bayesian
 mixture model except $n_z = \infty$, i.e. $z$ can take any positive
@@ -766,7 +766,7 @@ $$L(p, q) = \sum_{k = 1 : T} \mathbb E_{q(\theta_k)} \log {p(\theta_k) \over q(\
 The plate notation of this model is:

-![](file:../resource/dpmm.png){style="width:450px"}
+![](/assets/resources/dpmm.png){style="width:450px"}

 As it turns out, the infinities can be tamed in this case.
@@ -863,7 +863,7 @@ $$\begin{aligned}
 \phi^{\eta, 2}_k &= \nu + \sum_i q(z_i = k).
 \end{aligned}$$
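One ingredient of the truncated DPMM updates that is easy to show in isolation is the expected log mixture weight $\mathbb E_q[\log \pi_k]$ needed by the E-step. The sketch below is my own illustration, not part of the post or of this patch; it assumes Beta variational factors $q(\beta_k) = \text{Beta}(a_k, b_k)$ in the truncated stick-breaking representation, and the names `a`, `b`, `T` are mine and only loosely correspond to the $\phi^\pi$ parameters of this section.

```python
import numpy as np
from scipy.special import digamma

def expected_log_pi(a, b):
    """E_q[log pi_k] for k = 1..T under q(beta_k) = Beta(a[k], b[k]), k < T.

    a and b have length T - 1; the truncation fixes beta_T = 1, so the last
    component soaks up the remaining stick.  These expectations are what the
    E-step plugs into q(z_i = k) ∝ exp(E[log pi_k] + E[log p(x_i | eta_k)]).
    """
    a = np.asarray(a, float)
    b = np.asarray(b, float)
    e_log_beta = digamma(a) - digamma(a + b)        # E[log beta_k]
    e_log_rest = digamma(b) - digamma(a + b)        # E[log (1 - beta_k)]
    prefix = np.concatenate(([0.0], np.cumsum(e_log_rest)))   # sum_{j < k} E[log(1 - beta_j)]
    return np.concatenate((e_log_beta + prefix[:-1], prefix[-1:]))
```

Fixing $\beta_T = 1$ is what keeps the variational family finite even though the Dirichlet process itself has infinitely many sticks, which is the sense in which the infinities are tamed here.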

-SVI {#SVI}
+SVI
 ---

 In variational inference, the computation of some parameters is more
@@ -927,7 +927,7 @@ $$\rho_t = (t + \tau)^{-\kappa}$$
 for some $\kappa \in (.5, 1]$ and $\tau \ge 0$.

-AEVB {#AEVB}
+AEVB
 ----

 SVI adds to variational inference stochastic updates similar to
@@ -1005,7 +1005,7 @@ $$\nabla_\phi U(x, \phi, \theta) \approx {m \over L} \sum_{\ell = 1 : L} \nabla_
 where each $\epsilon_{\ell}$ is sampled from $p(\epsilon)$. The
 approximation of $U(x, \phi, \theta)$ itself can be done similarly.

-### VAE {#VAE}
+### VAE

 As an example of AEVB, the paper introduces the variational autoencoder
 (VAE), with the following instantiations:
@@ -1024,7 +1024,7 @@ As an example of AEVB, the paper introduces variational autoencoder
 With this, one can use backprop to maximise the ELBO.

-### Fully Bayesian AEVB {#Fully Bayesian AEVB}
+### Fully Bayesian AEVB

 Let us turn to the fully Bayesian version of AEVB. Again, we first recall
 the ELBO of the fully Bayesian mixture models:
@@ -1070,7 +1070,7 @@ so that the gradients can be computed accordingly.
 Again, one may use Monte-Carlo to approximate this expectation.

-References {#References}
+References
 ----------

 - Attias, Hagai. \"A variational baysian framework for graphical
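As a closing illustration of the AEVB section above (my own sketch, not part of the post, the patch, or its references): for a Gaussian encoder the VAE objective combines a reparameterised Monte-Carlo estimate of the reconstruction term with the closed-form KL to the standard normal prior. All names below are hypothetical; in an actual VAE, `mu` and `sig` would be produced by the encoder network and `log_p_x_given_z` by the decoder.

```python
import numpy as np

def gaussian_kl_to_std_normal(mu, sig):
    """KL( N(mu, diag(sig^2)) || N(0, I) ), the analytic term of the VAE ELBO."""
    return 0.5 * np.sum(sig ** 2 + mu ** 2 - 1.0 - 2.0 * np.log(sig))

def reparameterised_elbo(x, mu, sig, log_p_x_given_z, n_samples=8, seed=0):
    """Monte-Carlo ELBO estimate for one datapoint x with q(z | x) = N(mu, diag(sig^2))."""
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal((n_samples, mu.size))    # eps_l ~ N(0, I)
    z = mu + sig * eps                                 # z_l = mu + sig ⊙ eps_l  (reparameterisation)
    recon = np.mean([log_p_x_given_z(x, z_l) for z_l in z])
    return recon - gaussian_kl_to_std_normal(mu, sig)
```

Because $z = \mu + \sigma \odot \epsilon$ is a differentiable function of the encoder outputs, gradients flow through the sampling step, which is exactly the point of the reparameterisation trick discussed in the AEVB section.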