From 70420c04d3b32866275a0353a6c818ab9d1213b9 Mon Sep 17 00:00:00 2001
From: Yuchen Pei
Date: Thu, 14 Feb 2019 09:57:49 +0100
Subject: added images for the new post; fixed format

---
 posts/2019-02-14-raise-your-elbo.md | 68 ++++++++++++++++++-------------------
 1 file changed, 34 insertions(+), 34 deletions(-)

diff --git a/posts/2019-02-14-raise-your-elbo.md b/posts/2019-02-14-raise-your-elbo.md
index 969946d..e862007 100644
--- a/posts/2019-02-14-raise-your-elbo.md
+++ b/posts/2019-02-14-raise-your-elbo.md
@@ -29,7 +29,7 @@ Monte-Carlo + neural network approach to raising the ELBO, exemplified by
 the variational autoencoder (VAE). I also show its fully Bayesian version.

-[]{#Acknowledgement}**Acknowledgement**. The following texts and
+**Acknowledgement**. The following texts and
 resources were illuminating during the writing of this post: the
 Stanford CS228 notes
 ([1](https://ermongroup.github.io/cs228-notes/inference/variational/),[2](https://ermongroup.github.io/cs228-notes/learning/latent/)),
@@ -48,7 +48,7 @@ mathematics department.
 desktop site\" for the equations to be properly displayed. This post
 is licensed under CC BY-SA and GNU FDL.*

-KL divergence and ELBO {#KL divergence and ELBO}
+KL divergence and ELBO
 ----------------------

 Let $p$ and $q$ be two probability measures. The Kullback-Leibler (KL)
@@ -83,7 +83,7 @@ $q = {w \over Z}$.
 It is time to define the task of variational inference (VI), also known
 as variational Bayes (VB).

-[]{#Definition}**Definition**. Variational inference is concerned with
+**Definition**. Variational inference is concerned with
 maximising the ELBO $L(w, q)$.

 There are mainly two versions of VI, the half Bayesian and the fully
@@ -107,7 +107,7 @@ q &= q(z, \theta)
 In both cases $\theta$ are parameters and $z$ are latent variables.

-[]{#Remark on the naming of things}**Remark on the naming of things**.
+**Remark on the naming of things**.
 The term \"variational\" comes from the fact that we perform calculus of
 variations: maximise some functional ($L(w, q)$) over a set of functions
 ($q$). Note, however, that most of the VI / VB algorithms do not concern any
@@ -116,12 +116,12 @@ techniques in calculus of variations, but only uses Jensen\'s inequality
 reasoning of the naming, EM is also a kind of VI, even though in the
 literature VI often refers to its fully Bayesian version.

-EM {#EM}
+EM
 --

 To illustrate the EM algorithm, we first define the mixture model.

-[]{#Definition (mixture model)}**Definition (mixture model)**. Given
+**Definition (mixture model)**. Given
 dataset $x_{1 : m}$, we assume the data has some underlying latent
 variable $z_{1 : m}$ that may take a value from a finite set
 $\{1, 2, ..., n_z\}$. Let $p(z_{i}; \pi)$ be categorically distributed
@@ -134,7 +134,7 @@ $p(x_{1 : m}; \theta)$.
 Represented as a DAG (a.k.a. the plate notations), the model looks like
 this:

-![](file:../resource/mixture-model.png){style="width:250px"}
+![](/assets/resources/mixture-model.png){style="width:250px"}

 where the boxes with $m$ mean repetition for $m$ times, since there are $m$
 independent pairs of $(x, z)$, and the same goes for $\eta$.
@@ -194,7 +194,7 @@ model is:
 - At M-step, one finds $\theta_{t + 1}$ to be the $\theta$ that
   maximises the above formula.

-### GMM {#GMM}
+### GMM

 The Gaussian mixture model (GMM) is an example of a mixture model.

@@ -228,13 +228,13 @@ $$\begin{aligned}
 \Sigma_{k} &= {\sum_i r_{ik} (x_{i} - \mu_{t, k}) (x_{i} - \mu_{t, k})^T \over \sum_i r_{ik}}.
 \end{aligned}$$
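As an aside, the GMM E-step / M-step pair is easy to prototype. The sketch below is only an illustration and is not part of the post or of this patch: it assumes the standard E-step $r_{ik} \propto \pi_k N(x_i; \mu_k, \Sigma_k)$ (the part of the section elided by the hunk), uses numpy/scipy, and all function and variable names are made up.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_em_step(x, pi, mu, sigma):
    """One EM iteration for a GMM (illustrative sketch).

    x: (m, d) data; pi: (n_z,) mixing weights; mu: (n_z, d) means;
    sigma: (n_z, d, d) covariance matrices.
    """
    m = x.shape[0]
    n_z = len(pi)
    # E-step: responsibilities r[i, k] = q(z_i = k) ∝ pi_k N(x_i; mu_k, Sigma_k)
    r = np.column_stack([pi[k] * multivariate_normal.pdf(x, mu[k], sigma[k])
                         for k in range(n_z)])
    r /= r.sum(axis=1, keepdims=True)
    # M-step: closed-form maximisers, matching the updates displayed in this section
    nk = r.sum(axis=0)                     # effective number of points per component
    pi_new = nk / m
    mu_new = (r.T @ x) / nk[:, None]
    sigma_new = np.empty_like(sigma)
    for k in range(n_z):
        centred = x - mu_new[k]            # covariance computed with the freshly updated means
        sigma_new[k] = (r[:, k, None] * centred).T @ centred / nk[k]
    return r, pi_new, mu_new, sigma_new
```

Iterating `gmm_em_step` until the ELBO stops improving gives the usual EM behaviour for a GMM.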

-[]{#Remark}**Remark**. The k-means algorithm is the $\epsilon \to 0$
+**Remark**. The k-means algorithm is the $\epsilon \to 0$
 limit of the GMM with constraints $\Sigma_k = \epsilon I$. See Section
 9.3.2 of Bishop 1995 for a derivation. It is also briefly mentioned there
 that a variant in this setting where the covariance matrix is not
 restricted to $\epsilon I$ is called the elliptical k-means algorithm.

-### SMM {#SMM}
+### SMM

 As a transition to the next models to study, let us consider a simpler
 mixture model obtained by making one modification to GMM: change
@@ -262,7 +262,7 @@ filtering, whereas pLSA2 has a fully Bayesian version called latent
 Dirichlet allocation (LDA), not to be confused with the other LDA
 (linear discriminant analysis).

-### pLSA {#pLSA}
+### pLSA

 The pLSA model (Hoffman 2000) is a mixture model, where the dataset is
 now pairs $(d_i, x_i)_{i = 1 : m}$. In natural language processing, $x$
@@ -279,7 +279,7 @@ p(d_i, x_i; \theta) &= \sum_z p(z; \theta) p(d_i | z; \theta) p(x_i | z; \theta)
 Of the two formulations, (2.91) corresponds to pLSA type 1, and (2.92)
 corresponds to type 2.

-#### pLSA1 {#pLSA1}
+#### pLSA1

 The pLSA1 model (Hoffman 2000) is basically SMM with $x_i$ substituted
 with $(d_i, x_i)$, which conditioned on $z$ are independently
@@ -289,7 +289,7 @@ $$p(d_i = u, x_i = w | z = k) = p(d_i | \xi_k) p(x_i; \eta_k) = \xi_{ku} \eta_{k
 The model can be illustrated in the plate notations:

-![](file:../resource/plsa1.png){style="width:350px"}
+![](/assets/resources/plsa1.png){style="width:350px"}

 So the solution of the M-step is
@@ -299,12 +299,12 @@ $$\begin{aligned}
 \eta_{k, w} &= {\sum_i r_{ik} 1_{x_{i} = w} \over \sum_i r_{ik}}.
 \end{aligned}$$

-[]{#Remark}**Remark**. pLSA1 is the probabilistic version of LSA, also
+**Remark**. pLSA1 is the probabilistic version of LSA, also
 known as matrix factorisation.

 Let $n_d$ and $n_x$ be the number of values $d_i$ and $x_i$ can take.

-[]{#Problem}**Problem** (LSA). Let $R$ be a $n_d \times n_x$ matrix, fix
+**Problem** (LSA). Let $R$ be a $n_d \times n_x$ matrix, fix
 $s \le \min\{n_d, n_x\}$. Find $n_d \times s$ matrix $D$ and
 $n_x \times s$ matrix $X$ that minimise
@@ -312,7 +312,7 @@ $$J(D, X) = \|R - D X^T\|_F.$$
 where $\|\cdot\|_F$ is the Frobenius norm.

-[]{#Claim}**Claim**. Let $R = U \Sigma V^T$ be the SVD of $R$, then the
+**Claim**. Let $R = U \Sigma V^T$ be the SVD of $R$, then the
 solution to the above problem is $D = U_s \Sigma_s^{{1 \over 2}}$ and
 $X = V_s \Sigma_s^{{1 \over 2}}$, where $U_s$ (resp. $V_s$) is the
 matrix of the first $s$ columns of $U$ (resp. $V$) and $\Sigma_s$ is the
@@ -323,7 +323,7 @@ $d$ and $x$: in pLSA we obtain $n_z$ dimensional embeddings
 $\xi_{\cdot, u}$ and $\eta_{\cdot, w}$, whereas in LSA we obtain $s$
 dimensional embeddings $D_{u, \cdot}$ and $X_{w, \cdot}$.

-#### pLSA2 {#pLSA2}
+#### pLSA2

 Let us turn to pLSA2 (Hoffman 2004), corresponding to (2.92). We
 rewrite it as
@@ -350,7 +350,7 @@ pLSA1, $(x | z = k) \sim \text{Cat}(\eta_{k, \cdot})$.
 Illustrated in the plate notations, pLSA2 is:

-![](file:../resource/plsa2.png){style="width:350px"}
+![](/assets/resources/plsa2.png){style="width:350px"}

 The computation is basically adding an index $\ell$ to the computation
 of SMM wherever applicable.
@@ -366,7 +366,7 @@ $$\begin{aligned}
 \eta_{k w} &= {\sum_{\ell, i} r_{\ell i k} 1_{x_{\ell i} = w} \over \sum_{\ell, i} r_{\ell i k}}.
 \end{aligned}$$
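The "add an index $\ell$" recipe translates almost line by line into code. The following sketch of one EM pass for pLSA2 is again only an illustration, not part of the post or of this patch: it assumes the standard formulation in which each document $\ell$ has its own topic weights $\pi_{\ell \cdot}$ (the elided hunks fix the exact notation), and the names `docs`, `pi`, `eta` are hypothetical.

```python
import numpy as np

def plsa2_em(docs, n_z, n_x, n_iter=100, seed=0):
    """docs[l][i] = x_{l i}, an integer word id in {0, ..., n_x - 1} (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    pi = rng.dirichlet(np.ones(n_z), size=len(docs))   # pi[l, k]: topic weights of document l
    eta = rng.dirichlet(np.ones(n_x), size=n_z)        # eta[k, w]: word distribution of topic k
    for _ in range(n_iter):
        acc = np.zeros((n_x, n_z))                     # accumulates sum_{l, i} r_{l i k} 1[x_{l i} = w]
        for l, doc in enumerate(docs):
            doc = np.asarray(doc)
            # E-step: r[i, k] = r_{l i k} ∝ pi[l, k] * eta[k, x_{l i}]
            r = pi[l] * eta[:, doc].T
            r /= r.sum(axis=1, keepdims=True)
            # M-step, document part: pi[l, k] = sum_i r_{l i k} / len(doc)
            pi[l] = r.sum(axis=0) / len(doc)
            np.add.at(acc, doc, r)                     # acc[w, k] += r[i, k] for every position i with x_{l i} = w
        # M-step, topic part: the eta_{k w} update displayed above
        # (eta is held fixed for the whole pass, so this is one exact EM iteration)
        eta = (acc / (acc.sum(axis=0, keepdims=True) + 1e-12)).T
    return pi, eta
```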

-### HMM {#HMM}
+### HMM

 The hidden Markov model (HMM) is a sequential version of SMM, in the
 same sense that recurrent neural networks are sequential versions of
@@ -396,7 +396,7 @@ $$p(z_{i1}) = \pi_{z_{i1}}$$
 So the parameters are $\theta = (\pi, \xi, \eta)$. And HMM can be shown
 in plate notations as:

-![](file:../resource/hmm.png){style="width:350px"}
+![](/assets/resources/hmm.png){style="width:350px"}

 Now we apply EM to HMM, which is called the [Baum-Welch
 algorithm](https://en.wikipedia.org/wiki/Baum%E2%80%93Welch_algorithm).
@@ -490,7 +490,7 @@ p(z_{i, j - 1} = k, z_{i, j} = \ell, x_{i}; \theta_t) &= \alpha_k(i, j) \xi_{k\e
 And this yields $r_{ijk}$ and $s_{ijk\ell}$ since they can be computed
 as ${(7) \over (8)}$ and ${(9) \over (8)}$ respectively.

-Fully Bayesian EM / MFA {#Fully Bayesian EM / MFA}
+Fully Bayesian EM / MFA
 -----------------------

 Let us now venture into the fully Bayesian realm.
@@ -538,9 +538,9 @@ It is also called mean field approximation (MFA), and can be easily
 generalised to models with more than two groups of latent variables,
 see e.g. Section 10.1 of Bishop 1995.

-### Application to mixture models {#Application to mixture models}
+### Application to mixture models

-[]{#Definition (Fully Bayesian mixture model)}**Definition (Fully
+**Definition (Fully
 Bayesian mixture model)**. The relations between $\pi$, $\eta$, $x$,
 $z$ are the same as in the definition of mixture models. Furthermore,
 we assume the distribution of $(x | \eta_k)$ belongs to the [exponential
@@ -554,7 +554,7 @@ later in this section that the posterior $q(\eta_k)$ has belongs to the
 same family as $p(\eta_k)$. Represented in plate notations, a fully
 Bayesian mixture model looks like:

-![](file:../resource/fully-bayesian-mm.png){style="width:450px"}
+![](/assets/resources/fully-bayesian-mm.png){style="width:450px"}

 Given this structure we can write down the mean-field approximation:
@@ -627,7 +627,7 @@ So the mean field approximation for the fully Bayesian mixture model is
 the alternating iteration of (9.1) (E-step) and (9.3)(9.7)(9.9) (M-step)
 until convergence.

-### Fully Bayesian GMM {#Fully Bayesian GMM}
+### Fully Bayesian GMM

 A typical example of fully Bayesian mixture models is the fully Bayesian
 Gaussian mixture model (Attias 2000, also called variational GMM in the
@@ -651,13 +651,13 @@ The E-step and M-step can be computed using (9.1) and (9.3)(9.7)(9.9) in
 the previous section. The details of the computation can be found in
 Chapter 10.2 of Bishop or the Attias paper.

-### LDA {#LDA}
+### LDA

 As the second example of fully Bayesian mixture models, Latent Dirichlet
 allocation (LDA) (Blei-Ng-Jordan 2003) is the fully Bayesian version of
 pLSA2, with the following plate notations:

-![](file:../resource/lda.png){style="width:450px"}
+![](/assets/resources/lda.png){style="width:450px"}

 It is the smoothed version in the paper.
@@ -712,7 +712,7 @@ $$\begin{aligned}
 So the algorithm iterates over (10) and (11)(12) until convergence.

-### DPMM {#DPMM}
+### DPMM

 The Dirichlet process mixture model (DPMM) is like the fully Bayesian
 mixture model except $n_z = \infty$, i.e. $z$ can take any positive
@@ -766,7 +766,7 @@ $$L(p, q) = \sum_{k = 1 : T} \mathbb E_{q(\theta_k)} \log {p(\theta_k) \over q(\
 The plate notation of this model is:

-![](file:../resource/dpmm.png){style="width:450px"}
+![](/assets/resources/dpmm.png){style="width:450px"}

 As it turns out, the infinities can be tamed in this case.
@@ -863,7 +863,7 @@ $$\begin{aligned}
 \phi^{\eta, 2}_k &= \nu + \sum_i q(z_i = k).
 \end{aligned}$$
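One ingredient of the truncated DPMM updates that is easy to show in isolation is the expected log mixture weight $\mathbb E_q[\log \pi_k]$ needed by the E-step. The sketch below is my own illustration, not part of the post or of this patch; it assumes Beta variational factors $q(\beta_k) = \text{Beta}(a_k, b_k)$ in the truncated stick-breaking representation, and the names `a`, `b`, `T` are mine and only loosely correspond to the $\phi^\pi$ parameters of this section.

```python
import numpy as np
from scipy.special import digamma

def expected_log_pi(a, b):
    """E_q[log pi_k] for k = 1..T under q(beta_k) = Beta(a[k], b[k]), k < T.

    a and b have length T - 1; the truncation fixes beta_T = 1, so the last
    component soaks up the remaining stick.  These expectations are what the
    E-step plugs into q(z_i = k) ∝ exp(E[log pi_k] + E[log p(x_i | eta_k)]).
    """
    a = np.asarray(a, float)
    b = np.asarray(b, float)
    e_log_beta = digamma(a) - digamma(a + b)        # E[log beta_k]
    e_log_rest = digamma(b) - digamma(a + b)        # E[log (1 - beta_k)]
    prefix = np.concatenate(([0.0], np.cumsum(e_log_rest)))   # sum_{j < k} E[log(1 - beta_j)]
    return np.concatenate((e_log_beta + prefix[:-1], prefix[-1:]))
```

Fixing $\beta_T = 1$ is what keeps the variational family finite even though the Dirichlet process itself has infinitely many sticks, which is the sense in which the infinities are tamed here.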

-SVI {#SVI}
+SVI
 ---

 In variational inference, the computation of some parameters is more
@@ -927,7 +927,7 @@ $$\rho_t = (t + \tau)^{-\kappa}$$
 for some $\kappa \in (.5, 1]$ and $\tau \ge 0$.

-AEVB {#AEVB}
+AEVB
 ----

 SVI adds to variational inference stochastic updates similar to
@@ -1005,7 +1005,7 @@ $$\nabla_\phi U(x, \phi, \theta) \approx {m \over L} \sum_{\ell = 1 : L} \nabla_
 where each $\epsilon_{\ell}$ is sampled from $p(\epsilon)$. The
 approximation of $U(x, \phi, \theta)$ itself can be done similarly.

-### VAE {#VAE}
+### VAE

 As an example of AEVB, the paper introduces the variational autoencoder
 (VAE), with the following instantiations:
@@ -1024,7 +1024,7 @@ As an example of AEVB, the paper introduces variational autoencoder
 With this, one can use backprop to maximise the ELBO.

-### Fully Bayesian AEVB {#Fully Bayesian AEVB}
+### Fully Bayesian AEVB

 Let us turn to the fully Bayesian version of AEVB. Again, we first recall
 the ELBO of the fully Bayesian mixture models:
@@ -1070,7 +1070,7 @@ so that the gradients can be computed accordingly.
 Again, one may use Monte-Carlo to approximate this expectation.

-References {#References}
+References
 ----------

 - Attias, Hagai. \"A variational baysian framework for graphical
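As a closing illustration of the AEVB section above (my own sketch, not part of the post, the patch, or its references): for a Gaussian encoder the VAE objective combines a reparameterised Monte-Carlo estimate of the reconstruction term with the closed-form KL to the standard normal prior. All names below are hypothetical; in an actual VAE, `mu` and `sig` would be produced by the encoder network and `log_p_x_given_z` by the decoder.

```python
import numpy as np

def gaussian_kl_to_std_normal(mu, sig):
    """KL( N(mu, diag(sig^2)) || N(0, I) ), the analytic term of the VAE ELBO."""
    return 0.5 * np.sum(sig ** 2 + mu ** 2 - 1.0 - 2.0 * np.log(sig))

def reparameterised_elbo(x, mu, sig, log_p_x_given_z, n_samples=8, seed=0):
    """Monte-Carlo ELBO estimate for one datapoint x with q(z | x) = N(mu, diag(sig^2))."""
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal((n_samples, mu.size))    # eps_l ~ N(0, I)
    z = mu + sig * eps                                 # z_l = mu + sig ⊙ eps_l  (reparameterisation)
    recon = np.mean([log_p_x_given_z(x, z_l) for z_l in z])
    return recon - gaussian_kl_to_std_normal(mu, sig)
```

Because $z = \mu + \sigma \odot \epsilon$ is a differentiable function of the encoder outputs, gradients flow through the sampling step, which is exactly the point of the reparameterisation trick discussed in the AEVB section.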