path: root/posts/2019-02-14-raise-your-elbo.org



# Copyright (C) 2013-2021 Yuchen Pei.

# Permission is granted to copy, distribute and/or modify this
# document under the terms of the GNU Free Documentation License,
# Version 1.3 or any later version published by the Free Software
# Foundation; with no Invariant Sections, no Front-Cover Texts, and
# no Back-Cover Texts. You should have received a copy of the GNU
# Free Documentation License. If not, see <https://www.gnu.org/licenses/>.

# This work is licensed under the Creative Commons Attribution-ShareAlike 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/4.0/ or send a letter to Creative Commons, PO Box 1866, Mountain View, CA 94042, USA.

#+title: Raise your ELBO

#+date: <2019-02-14>

In this post I give an introduction to variational inference, which is
about maximising the evidence lower bound (ELBO).

I use a top-down approach, starting with the KL divergence and the ELBO,
to lay the mathematical framework of all the models in this post.

Then I define mixture models and the EM algorithm, with Gaussian mixture
model (GMM), probabilistic latent semantic analysis (pLSA) and the
hidden markov model (HMM) as examples.

After that I present the fully Bayesian version of EM, also known as
mean field approximation (MFA), and apply it to fully Bayesian mixture
models, with fully Bayesian GMM (also known as variational GMM), latent
Dirichlet allocation (LDA) and Dirichlet process mixture model (DPMM) as
examples.

Then I explain stochastic variational inference, a modification of EM
and MFA to improve efficiency.

Finally I talk about autoencoding variational Bayes (AEVB), a
Monte-Carlo + neural network approach to raising the ELBO, exemplified
by the variational autoencoder (VAE). I also show its fully Bayesian
version.

*Acknowledgement*. The following texts and resources were illuminating
during the writing of this post: the Stanford CS228 notes
([[https://ermongroup.github.io/cs228-notes/inference/variational/][1]],[[https://ermongroup.github.io/cs228-notes/learning/latent/][2]]),
the
[[https://www.cs.tau.ac.il/~rshamir/algmb/presentations/EM-BW-Ron-16%20.pdf][Tel
Aviv Algorithms in Molecular Biology slides]] (clear explanations of the
connection between EM and Baum-Welch), Chapter 10 of
[[https://www.springer.com/us/book/9780387310732][Bishop's book]]
(brilliant introduction to variational GMM), Section 2.5 of
[[http://cs.brown.edu/~sudderth/papers/sudderthPhD.pdf][Sudderth's
thesis]] and [[https://metacademy.org][metacademy]]. Also thanks to
Josef Lindman Hörnlund for discussions. The research was done while
working at KTH mathematics department.

/If you are reading on a mobile device, you may need to "request desktop
site" for the equations to be properly displayed. This post is licensed
under CC BY-SA and GNU FDL./

** KL divergence and ELBO
   :PROPERTIES:
   :CUSTOM_ID: kl-divergence-and-elbo
   :ID:       2bb0d405-f6b4-483f-9f2d-c0e945faa3ac
   :END:
Let $p$ and $q$ be two probability measures. The Kullback-Leibler (KL)
divergence is defined as

$$D(q||p) = E_q \log{q \over p}.$$

It achieves minimum $0$ when $p = q$.

If $p$ can be further written as

$$p(x) = {w(x) \over Z}, \qquad (0)$$

where $Z$ is a normaliser, then

$$\log Z = D(q||p) + L(w, q), \qquad(1)$$

where $L(w, q)$ is called the evidence lower bound (ELBO), defined by

$$L(w, q) = E_q \log{w \over q}. \qquad (1.25)$$

From (1), we see that to minimise the nonnegative term $D(q || p)$, one
can maximise the ELBO.

To this end, we can simply discard $D(q || p)$ in (1) and obtain:

$$\log Z \ge L(w, q) \qquad (1.3)$$

and keep in mind that the inequality becomes an equality when
$q = {w \over Z}$.

It is time to define the task of variational inference (VI), also known
as variational Bayes (VB).

*Definition*. Variational inference is concerned with maximising the
ELBO $L(w, q)$.

There are mainly two versions of VI, the half Bayesian and the fully
Bayesian cases. Half Bayesian VI, to which expectation-maximisation
algorithms (EM) apply, instantiates (1.3) with

$$\begin{aligned}
Z &= p(x; \theta)\\
w &= p(x, z; \theta)\\
q &= q(z)
\end{aligned}$$

and the dummy variable $x$ in Equation (0) is substituted with $z$.

Fully Bayesian VI, often just called VI, has the following
instantiations:

$$\begin{aligned}
Z &= p(x) \\
w &= p(x, z, \theta) \\
q &= q(z, \theta)
\end{aligned}$$

and $x$ in Equation (0) is substituted with $(z, \theta)$.

In both cases $\theta$ are parameters and $z$ are latent variables.

*Remark on the naming of things*. The term "variational" comes from the
fact that we perform calculus of variations: maximise some functional
($L(w, q)$) over a set of functions ($q$). Note however, most of the VI
/ VB algorithms do not concern any techniques in calculus of variations,
but only uses Jensen's inequality / the fact the $D(q||p)$ reaches
minimum when $p = q$. Due to this reasoning of the naming, EM is also a
kind of VI, even though in the literature VI often referes to its fully
Bayesian version.

** EM
   :PROPERTIES:
   :CUSTOM_ID: em
   :ID:       6d694b38-56c2-4e10-8a1f-1f82e309073f
   :END:
To illustrate the EM algorithms, we first define the mixture model.

*Definition (mixture model)*. Given dataset $x_{1 : m}$, we assume the
data has some underlying latent variable $z_{1 : m}$ that may take a
value from a finite set $\{1, 2, ..., n_z\}$. Let $p(z_{i}; \pi)$ be
categorically distributed according to the probability vector $\pi$.
That is, $p(z_{i} = k; \pi) = \pi_k$. Also assume
$p(x_{i} | z_{i} = k; \eta) = p(x_{i}; \eta_k)$. Find
$\theta = (\pi, \eta)$ that maximises the likelihood
$p(x_{1 : m}; \theta)$.

Represented as a DAG (a.k.a the plate notations), the model looks like
this:

[[/assets/mixture-model.png]]

where the boxes with $m$ mean repitition for $m$ times, since there $m$
indepdent pairs of $(x, z)$, and the same goes for $\eta$.

The direct maximisation

$$\max_\theta \sum_i \log p(x_{i}; \theta) = \max_\theta \sum_i \log \int p(x_{i} | z_i; \theta) p(z_i; \theta) dz_i$$

is hard because of the integral in the log.

We can fit this problem in (1.3) by having $Z = p(x_{1 : m}; \theta)$
and $w = p(z_{1 : m}, x_{1 : m}; \theta)$. The plan is to update
$\theta$ repeatedly so that $L(p(z, x; \theta_t), q(z))$ is non
decreasing over time $t$.

Equation (1.3) at time $t$ for the $i$th datapoint is

$$\log p(x_{i}; \theta_t) \ge L(p(z_i, x_{i}; \theta_t), q(z_i)) \qquad (2)$$

Each timestep consists of two steps, the E-step and the M-step.

At E-step, we set

$$q(z_{i}) = p(z_{i}|x_{i}; \theta_t), $$

to turn the inequality into equality. We denote $r_{ik} = q(z_i = k)$
and call them responsibilities, so the posterior $q(z_i)$ is categorical
distribution with parameter $r_i = r_{i, 1 : n_z}$.

At M-step, we maximise $\sum_i L(p(x_{i}, z_{i}; \theta), q(z_{i}))$
over $\theta$:

$$\begin{aligned}
\theta_{t + 1} &= \text{argmax}_\theta \sum_i L(p(x_{i}, z_{i}; \theta), p(z_{i} | x_{i}; \theta_t)) \\
&= \text{argmax}_\theta \sum_i \mathbb E_{p(z_{i} | x_{i}; \theta_t)} \log p(x_{i}, z_{i}; \theta) \qquad (2.3)
\end{aligned}$$

So $\sum_i L(p(x_{i}, z_{i}; \theta), q(z_i))$ is non-decreasing at both
the E-step and the M-step.

We can see from this derivation that EM is half-Bayesian. The E-step is
Bayesian it computes the posterior of the latent variables and the
M-step is frequentist because it performs maximum likelihood estimate of
$\theta$.

It is clear that the ELBO sum coverges as it is nondecreasing with an
upper bound, but it is not clear whether the sum converges to the
correct value, i.e. $\max_\theta p(x_{1 : m}; \theta)$. In fact it is
said that the EM does get stuck in local maximum sometimes.

A different way of describing EM, which will be useful in hidden Markov
model is:

- At E-step, one writes down the formula
  $$\sum_i \mathbb E_{p(z_i | x_{i}; \theta_t)} \log p(x_{i}, z_i; \theta). \qquad (2.5)$$

- At M-setp, one finds $\theta_{t + 1}$ to be the $\theta$ that
  maximises the above formula.

*** GMM
    :PROPERTIES:
    :CUSTOM_ID: gmm
    :ID:       5d5265f6-c2b9-42f1-a4a1-0d87417f0b02
    :END:
Gaussian mixture model (GMM) is an example of mixture models.

The space of the data is $\mathbb R^n$. We use the hypothesis that the
data is Gaussian conditioned on the latent variable:

$$(x_i; \eta_k) \sim N(\mu_k, \Sigma_k),$$

so we write $\eta_k = (\mu_k, \Sigma_k)$,

During E-step, the $q(z_i)$ can be directly computed using Bayes'
theorem:

$$r_{ik} = q(z_i = k) = \mathbb P(z_i = k | x_{i}; \theta_t)
= {g_{\mu_{t, k}, \Sigma_{t, k}} (x_{i}) \pi_{t, k} \over \sum_{j = 1 : n_z} g_{\mu_{t, j}, \Sigma_{t, j}} (x_{i}) \pi_{t, j}},$$

where
$g_{\mu, \Sigma} (x) = (2 \pi)^{- n / 2} \det \Sigma^{-1 / 2} \exp(- {1 \over 2} (x - \mu)^T \Sigma^{-1} (x - \mu))$
is the pdf of the Gaussian distribution $N(\mu, \Sigma)$.

During M-step, we need to compute

$$\text{argmax}_{\Sigma, \mu, \pi} \sum_{i = 1 : m} \sum_{k = 1 : n_z} r_{ik} \log (g_{\mu_k, \Sigma_k}(x_{i}) \pi_k).$$

This is similar to the quadratic discriminant analysis, and the solution
is

$$\begin{aligned}
\pi_{k} &= {1 \over m} \sum_{i = 1 : m} r_{ik}, \\
\mu_{k} &= {\sum_i r_{ik} x_{i} \over \sum_i r_{ik}}, \\
\Sigma_{k} &= {\sum_i r_{ik} (x_{i} - \mu_{t, k}) (x_{i} - \mu_{t, k})^T \over \sum_i r_{ik}}.
\end{aligned}$$

*Remark*. The k-means algorithm is the $\epsilon \to 0$ limit of the GMM
with constraints $\Sigma_k = \epsilon I$. See Section 9.3.2 of Bishop
2006 for derivation. It is also briefly mentioned there that a variant
in this setting where the covariance matrix is not restricted to
$\epsilon I$ is called elliptical k-means algorithm.

*** SMM
    :PROPERTIES:
    :CUSTOM_ID: smm
    :ID:       f4b3a462-8ae7-44f2-813c-58b007eaa047
    :END:
As a transition to the next models to study, let us consider a simpler
mixture model obtained by making one modification to GMM: change
$(x; \eta_k) \sim N(\mu_k, \Sigma_k)$ to
$\mathbb P(x = w; \eta_k) = \eta_{kw}$ where $\eta$ is a stochastic
matrix and $w$ is an arbitrary element of the space for $x$. So now the
space for both $x$ and $z$ are finite. We call this model the simple
mixture model (SMM).

As in GMM, at E-step $r_{ik}$ can be explicitly computed using Bayes'
theorem.

It is not hard to write down the solution to the M-step in this case:

$$\begin{aligned}
\pi_{k} &= {1 \over m} \sum_i r_{ik}, \qquad (2.7)\\
\eta_{k, w} &= {\sum_i r_{ik} 1_{x_i = w} \over \sum_i r_{ik}}. \qquad (2.8)
\end{aligned}$$

where $1_{x_i = w}$ is the
[[https://en.wikipedia.org/wiki/Indicator_function][indicator
function]], and evaluates to $1$ if $x_i = w$ and $0$ otherwise.

Two trivial variants of the SMM are the two versions of probabilistic
latent semantic analysis (pLSA), which we call pLSA1 and pLSA2.

The model pLSA1 is a probabilistic version of latent semantic analysis,
which is basically a simple matrix factorisation model in collaborative
filtering, whereas pLSA2 has a fully Bayesian version called latent
Dirichlet allocation (LDA), not to be confused with the other LDA
(linear discriminant analysis).

*** pLSA
    :PROPERTIES:
    :CUSTOM_ID: plsa
    :ID:       d4f58158-dcb6-4ba1-a9e2-bf53bff6012e
    :END:
The pLSA model (Hoffman 2000) is a mixture model, where the dataset is
now pairs $(d_i, x_i)_{i = 1 : m}$. In natural language processing, $x$
are words and $d$ are documents, and a pair $(d, x)$ represent an
ocurrance of word $x$ in document $d$.

For each datapoint $(d_{i}, x_{i})$,

$$\begin{aligned}
p(d_i, x_i; \theta) &= \sum_{z_i} p(z_i; \theta) p(d_i | z_i; \theta) p(x_i | z_i; \theta) \qquad (2.91)\\
&= p(d_i; \theta) \sum_z p(x_i | z_i; \theta) p (z_i | d_i; \theta) \qquad (2.92).
\end{aligned}$$

Of the two formulations, (2.91) corresponds to pLSA type 1, and (2.92)
corresponds to type 2.

**** pLSA1
     :PROPERTIES:
     :CUSTOM_ID: plsa1
     :ID:       969f470e-5bbe-464e-a3b7-f996c8f04de3
     :END:
The pLSA1 model (Hoffman 2000) is basically SMM with $x_i$ substituted
with $(d_i, x_i)$, which conditioned on $z_i$ are independently
categorically distributed:

$$p(d_i = u, x_i = w | z_i = k; \theta) = p(d_i ; \xi_k) p(x_i; \eta_k) = \xi_{ku} \eta_{kw}.$$

The model can be illustrated in the plate notations:

[[/assets/plsa1.png]]

So the solution of the M-step is

$$\begin{aligned}
\pi_{k} &= {1 \over m} \sum_i r_{ik} \\
\xi_{k, u} &= {\sum_i r_{ik} 1_{d_{i} = u} \over \sum_i r_{ik}} \\
\eta_{k, w} &= {\sum_i r_{ik} 1_{x_{i} = w} \over \sum_i r_{ik}}.
\end{aligned}$$

*Remark*. pLSA1 is the probabilistic version of LSA, also known as
matrix factorisation.

Let $n_d$ and $n_x$ be the number of values $d_i$ and $x_i$ can take.

*Problem* (LSA). Let $R$ be a $n_d \times n_x$ matrix, fix
$s \le \min\{n_d, n_x\}$. Find $n_d \times s$ matrix $D$ and
$n_x \times s$ matrix $X$ that minimises

$$J(D, X) = \|R - D X^T\|_F.$$

where $\|\cdot\|_F$ is the Frobenius norm.

*Claim*. Let $R = U \Sigma V^T$ be the SVD of $R$, then the solution to
the above problem is $D = U_s \Sigma_s^{{1 \over 2}}$ and
$X = V_s \Sigma_s^{{1 \over 2}}$, where $U_s$ (resp. $V_s$) is the
matrix of the first $s$ columns of $U$ (resp. $V$) and $\Sigma_s$ is the
$s \times s$ submatrix of $\Sigma$.

One can compare pLSA1 with LSA. Both procedures produce embeddings of
$d$ and $x$: in pLSA we obtain $n_z$ dimensional embeddings
$\xi_{\cdot, u}$ and $\eta_{\cdot, w}$, whereas in LSA we obtain $s$
dimensional embeddings $D_{u, \cdot}$ and $X_{w, \cdot}$.

**** pLSA2
     :PROPERTIES:
     :CUSTOM_ID: plsa2
     :ID:       eef3249a-c45d-4a07-876f-68b2a2e957e5
     :END:
Let us turn to pLSA2 (Hoffman 2004), corresponding to (2.92). We rewrite
it as

$$p(x_i | d_i; \theta) = \sum_{z_i} p(x_i | z_i; \theta) p(z_i | d_i; \theta).$$

To simplify notations, we collect all the $x_i$s with the corresponding
$d_i$ equal to 1 (suppose there are $m_1$ of them), and write them as
$(x_{1, j})_{j = 1 : m_1}$. In the same fashion we construct
$x_{2, 1 : m_2}, x_{3, 1 : m_3}, ... x_{n_d, 1 : m_{n_d}}$. Similarly,
we relabel the corresponding $d_i$ and $z_i$ accordingly.

With almost no loss of generality, we assume all $m_\ell$s are equal and
write them as $m$.

Now the model becomes

$$p(x_{\ell, i} | d_{\ell, i} = \ell; \theta) = \sum_k p(x_{\ell, i} | z_{\ell, i} = k; \theta) p(z_{\ell, i} = k | d_{\ell, i} = \ell; \theta).$$

Since we have regrouped the $x$'s and $z$'s whose indices record the
values of the $d$'s, we can remove the $d$'s from the equation
altogether:

$$p(x_{\ell, i}; \theta) = \sum_k p(x_{\ell, i} | z_{\ell, i} = k; \theta) p(z_{\ell, i} = k; \theta).$$

It is effectively a modification of SMM by making $n_d$ copies of $\pi$.
More specifically the parameters are
$\theta = (\pi_{1 : n_d, 1 : n_z}, \eta_{1 : n_z, 1 : n_x})$, where we
model $(z | d = \ell) \sim \text{Cat}(\pi_{\ell, \cdot})$ and, as in
pLSA1, $(x | z = k) \sim \text{Cat}(\eta_{k, \cdot})$.

Illustrated in the plate notations, pLSA2 is:

[[/assets/plsa2.png]]

The computation is basically adding an index $\ell$ to the computation
of SMM wherever applicable.

The updates at the E-step is

$$r_{\ell i k} = p(z_{\ell i} = k | x_{\ell i}; \theta) \propto \pi_{\ell k} \eta_{k, x_{\ell i}}.$$

And at the M-step

$$\begin{aligned}
\pi_{\ell k} &= {1 \over m} \sum_i r_{\ell  i k} \\
\eta_{k w} &= {\sum_{\ell, i} r_{\ell i k} 1_{x_{\ell i} = w} \over \sum_{\ell, i} r_{\ell i k}}.
\end{aligned}$$

*** HMM
    :PROPERTIES:
    :CUSTOM_ID: hmm
    :ID:       16d00eda-7136-49f5-8427-c775c7a91317
    :END:
The hidden markov model (HMM) is a sequential version of SMM, in the
same sense that recurrent neural networks are sequential versions of
feed-forward neural networks.

HMM is an example where the posterior $p(z_i | x_i; \theta)$ is not easy
to compute, and one has to utilise properties of the underlying Bayesian
network to go around it.

Now each sample is a sequence $x_i = (x_{ij})_{j = 1 : T}$, and so are
the latent variables $z_i = (z_{ij})_{j = 1 : T}$.

The latent variables are assumed to form a Markov chain with transition
matrix $(\xi_{k \ell})_{k \ell}$, and $x_{ij}$ is completely dependent
on $z_{ij}$:

$$\begin{aligned}
p(z_{ij} | z_{i, j - 1}) &= \xi_{z_{i, j - 1}, z_{ij}},\\
p(x_{ij} | z_{ij}) &= \eta_{z_{ij}, x_{ij}}.
\end{aligned}$$

Also, the distribution of $z_{i1}$ is again categorical with parameter
$\pi$:

$$p(z_{i1}) = \pi_{z_{i1}}$$

So the parameters are $\theta = (\pi, \xi, \eta)$. And HMM can be shown
in plate notations as:

[[/assets/hmm.png]]

Now we apply EM to HMM, which is called the
[[https://en.wikipedia.org/wiki/Baum%E2%80%93Welch_algorithm][Baum-Welch
algorithm]]. Unlike the previous examples, it is too messy to compute
$p(z_i | x_{i}; \theta)$, so during the E-step we instead write down
formula (2.5) directly in hope of simplifying it:

$$\begin{aligned}
\mathbb E_{p(z_i | x_i; \theta_t)} \log p(x_i, z_i; \theta_t) &=\mathbb E_{p(z_i | x_i; \theta_t)} \left(\log \pi_{z_{i1}} + \sum_{j = 2 : T} \log \xi_{z_{i, j - 1}, z_{ij}} + \sum_{j = 1 : T} \log \eta_{z_{ij}, x_{ij}}\right). \qquad (3)
\end{aligned}$$

Let us compute the summand in second term:

$$\begin{aligned}
\mathbb E_{p(z_i | x_{i}; \theta_t)} \log \xi_{z_{i, j - 1}, z_{ij}} &= \sum_{k, \ell} (\log \xi_{k, \ell}) \mathbb E_{p(z_{i} | x_{i}; \theta_t)} 1_{z_{i, j - 1} = k, z_{i, j} = \ell} \\
&= \sum_{k, \ell} p(z_{i, j - 1} = k, z_{ij} = \ell | x_{i}; \theta_t) \log \xi_{k, \ell}. \qquad (4)
\end{aligned}$$

Similarly, one can write down the first term and the summand in the
third term to obtain

$$\begin{aligned}
\mathbb E_{p(z_i | x_{i}; \theta_t)} \log \pi_{z_{i1}} &= \sum_k p(z_{i1} = k | x_{i}; \theta_t), \qquad (5) \\
\mathbb E_{p(z_i | x_{i}; \theta_t)} \log \eta_{z_{i, j}, x_{i, j}} &= \sum_{k, w} 1_{x_{ij} = w} p(z_{i, j} = k | x_i; \theta_t) \log \eta_{k, w}. \qquad (6)
\end{aligned}$$

plugging (4)(5)(6) back into (3) and summing over $j$, we obtain the
formula to maximise over $\theta$ on:

$$\sum_k \sum_i r_{i1k} \log \pi_k + \sum_{k, \ell} \sum_{j = 2 : T, i} s_{ijk\ell} \log \xi_{k, \ell} + \sum_{k, w} \sum_{j = 1 : T, i} r_{ijk} 1_{x_{ij} = w} \log \eta_{k, w},$$

where

$$\begin{aligned}
r_{ijk} &:= p(z_{ij} = k | x_{i}; \theta_t), \\
s_{ijk\ell} &:= p(z_{i, j - 1} = k, z_{ij} = \ell | x_{i}; \theta_t).
\end{aligned}$$

Now we proceed to the M-step. Since each of the
$\pi_k, \xi_{k, \ell}, \eta_{k, w}$ is nicely confined in the inner sum
of each term, together with the constraint
$\sum_k \pi_k = \sum_\ell \xi_{k, \ell} = \sum_w \eta_{k, w} = 1$ it is
not hard to find the argmax at time $t + 1$ (the same way one finds the
MLE for any categorical distribution):

$$\begin{aligned}
\pi_{k} &= {1 \over m} \sum_i r_{i1k}, \qquad (6.1) \\
\xi_{k, \ell} &= {\sum_{j = 2 : T, i} s_{ijk\ell} \over \sum_{j = 1 : T - 1, i} r_{ijk}}, \qquad(6.2) \\
\eta_{k, w} &= {\sum_{ij} 1_{x_{ij} = w} r_{ijk} \over \sum_{ij} r_{ijk}}. \qquad(6.3)
\end{aligned}$$

Note that (6.1)(6.3) are almost identical to (2.7)(2.8). This makes
sense as the only modification HMM makes over SMM is the added
dependencies between the latent variables.

What remains is to compute $r$ and $s$.

This is done by using the forward and backward procedures which takes
advantage of the conditioned independence / topology of the underlying
Bayesian network. It is out of scope of this post, but for the sake of
completeness I include it here.

Let

$$\begin{aligned}
\alpha_k(i, j) &:= p(x_{i, 1 : j}, z_{ij} = k; \theta_t), \\
\beta_k(i, j) &:= p(x_{i, j + 1 : T} | z_{ij} = k; \theta_t).
\end{aligned}$$

They can be computed recursively as

$$\begin{aligned}
\alpha_k(i, j) &= \begin{cases}
\eta_{k, x_{1j}} \pi_k, & j = 1; \\
\eta_{k, x_{ij}} \sum_\ell \alpha_\ell(j - 1, i) \xi_{k\ell}, & j \ge 2.
\end{cases}\\
\beta_k(i, j) &= \begin{cases}
1, & j = T;\\
\sum_\ell \xi_{k\ell} \beta_\ell(j + 1, i) \eta_{\ell, x_{i, j + 1}}, & j < T.
\end{cases}
\end{aligned}$$

Then

$$\begin{aligned}
p(z_{ij} = k, x_{i}; \theta_t) &= \alpha_k(i, j) \beta_k(i, j), \qquad (7)\\
p(x_{i}; \theta_t) &= \sum_k \alpha_k(i, j) \beta_k(i, j),\forall j = 1 : T \qquad (8)\\
p(z_{i, j - 1} = k, z_{i, j} = \ell, x_{i}; \theta_t) &= \alpha_k(i, j) \xi_{k\ell} \beta_\ell(i, j + 1) \eta_{\ell, x_{j + 1, i}}. \qquad (9)
\end{aligned}$$

And this yields $r_{ijk}$ and $s_{ijk\ell}$ since they can be computed
as ${(7) \over (8)}$ and ${(9) \over (8)}$ respectively.

** Fully Bayesian EM / MFA
   :PROPERTIES:
   :CUSTOM_ID: fully-bayesian-em-mfa
   :ID:       77f1d7ae-3785-45d4-b88f-18478e41f3b9
   :END:
Let us now venture into the realm of full Bayesian.

In EM we aim to maximise the ELBO

$$\int q(z) \log {p(x, z; \theta) \over q(z)} dz d\theta$$

alternately over $q$ and $\theta$. As mentioned before, the E-step of
maximising over $q$ is Bayesian, in that it computes the posterior of
$z$, whereas the M-step of maximising over $\theta$ is maximum
likelihood and frequentist.

The fully Bayesian EM makes the M-step Bayesian by making $\theta$ a
random variable, so the ELBO becomes

$$L(p(x, z, \theta), q(z, \theta)) = \int q(z, \theta) \log {p(x, z, \theta) \over q(z, \theta)} dz d\theta$$

We further assume $q$ can be factorised into distributions on $z$ and
$\theta$: $q(z, \theta) = q_1(z) q_2(\theta)$. So the above formula is
rewritten as

$$L(p(x, z, \theta), q(z, \theta)) = \int q_1(z) q_2(\theta) \log {p(x, z, \theta) \over q_1(z) q_2(\theta)} dz d\theta$$

To find argmax over $q_1$, we rewrite

$$\begin{aligned}
L(p(x, z, \theta), q(z, \theta)) &= \int q_1(z) \left(\int q_2(\theta) \log p(x, z, \theta) d\theta\right) dz - \int q_1(z) \log q_1(z) dz - \int q_2(\theta) \log q_2(\theta) \\&= - D(q_1(z) || p_x(z)) + C,
\end{aligned}$$

where $p_x$ is a density in $z$ with

$$\log p_x(z) = \mathbb E_{q_2(\theta)} \log p(x, z, \theta) + C.$$

So the $q_1$ that maximises the ELBO is $q_1^* = p_x$.

Similarly, the optimal $q_2$ is such that

$$\log q_2^*(\theta) = \mathbb E_{q_1(z)} \log p(x, z, \theta) + C.$$

The fully Bayesian EM thus alternately evaluates $q_1^*$ (E-step) and
$q_2^*$ (M-step).

It is also called mean field approximation (MFA), and can be easily
generalised to models with more than two groups of latent variables, see
e.g. Section 10.1 of Bishop 2006.

*** Application to mixture models
    :PROPERTIES:
    :CUSTOM_ID: application-to-mixture-models
    :ID:       52bf6025-1180-44dc-8272-e6af6e228bf3
    :END:
*Definition (Fully Bayesian mixture model)*. The relations between
$\pi$, $\eta$, $x$, $z$ are the same as in the definition of mixture
models. Furthermore, we assume the distribution of $(x | \eta_k)$
belongs to the
[[https://en.wikipedia.org/wiki/Exponential_family][exponential family]]
(the definition of the exponential family is briefly touched at the end
of this section). But now both $\pi$ and $\eta$ are random variables.
Let the prior distribution $p(\pi)$ is Dirichlet with parameter
$(\alpha, \alpha, ..., \alpha)$. Let the prior $p(\eta_k)$ be the
conjugate prior of $(x | \eta_k)$, with parameter $\beta$, we will see
later in this section that the posterior $q(\eta_k)$ belongs to the same
family as $p(\eta_k)$. Represented in a plate notations, a fully
Bayesian mixture model looks like:

[[/assets/fully-bayesian-mm.png]]

Given this structure we can write down the mean-field approximation:

$$\log q(z) = \mathbb E_{q(\eta)q(\pi)} (\log(x | z, \eta) + \log(z | \pi)) + C.$$

Both sides can be factored into per-sample expressions, giving us

$$\log q(z_i) = \mathbb E_{q(\eta)} \log p(x_i | z_i, \eta) + \mathbb E_{q(\pi)} \log p(z_i | \pi) + C$$

Therefore

$$\log r_{ik} = \log q(z_i = k) = \mathbb E_{q(\eta_k)} \log p(x_i | \eta_k) + \mathbb E_{q(\pi)} \log \pi_k + C. \qquad (9.1)$$

So the posterior of each $z_i$ is categorical regardless of the $p$s and
$q$s.

Computing the posterior of $\pi$ and $\eta$:

$$\log q(\pi) + \log q(\eta) = \log p(\pi) + \log p(\eta) + \sum_i \mathbb E_{q(z_i)} p(x_i | z_i, \eta) + \sum_i \mathbb E_{q(z_i)} p(z_i | \pi) + C.$$

So we can separate the terms involving $\pi$ and those involving $\eta$.
First compute the posterior of $\pi$:

$$\log q(\pi) = \log p(\pi) + \sum_i \mathbb E_{q(z_i)} \log p(z_i | \pi) = \log p(\pi) + \sum_i \sum_k r_{ik} \log \pi_k + C.$$

The right hand side is the log of a Dirichlet modulus the constant $C$,
from which we can update the posterior parameter $\phi^\pi$:

$$\phi^\pi_k = \alpha + \sum_i r_{ik}. \qquad (9.3)$$

Similarly we can obtain the posterior of $\eta$:

$$\log q(\eta) = \log p(\eta) + \sum_i \sum_k r_{ik} \log p(x_i | \eta_k) + C.$$

Again we can factor the terms with respect to $k$ and get:

$$\log q(\eta_k) = \log p(\eta_k) + \sum_i r_{ik} \log p(x_i | \eta_k) + C. \qquad (9.5)$$

Here we can see why conjugate prior works. Mathematically, given a
probability distribution $p(x | \theta)$, the distribution $p(\theta)$
is called conjugate prior of $p(x | \theta)$ if
$\log p(\theta) + \log p(x | \theta)$ has the same form as
$\log p(\theta)$.

For example, the conjugate prior for the exponential family
$p(x | \theta) = h(x) \exp(\theta \cdot T(x) - A(\theta))$ where $T$,
$A$ and $h$ are some functions is
$p(\theta; \chi, \nu) \propto \exp(\chi \cdot \theta - \nu A(\theta))$.

Here what we want is a bit different from conjugate priors because of
the coefficients $r_{ik}$. But the computation carries over to the
conjugate priors of the exponential family (try it yourself and you'll
see). That is, if $p(x_i | \eta_k)$ belongs to the exponential family

$$p(x_i | \eta_k) = h(x) \exp(\eta_k \cdot T(x) - A(\eta_k))$$

and if $p(\eta_k)$ is the conjugate prior of $p(x_i | \eta_k)$

$$p(\eta_k) \propto \exp(\chi \cdot \eta_k - \nu A(\eta_k))$$

then $q(\eta_k)$ has the same form as $p(\eta_k)$, and from (9.5) we can
compute the updates of $\phi^{\eta_k}$:

$$\begin{aligned}
\phi^{\eta_k}_1 &= \chi + \sum_i r_{ik} T(x_i), \qquad (9.7) \\
\phi^{\eta_k}_2 &= \nu + \sum_i r_{ik}. \qquad (9.9)
\end{aligned}$$

So the mean field approximation for the fully Bayesian mixture model is
the alternate iteration of (9.1) (E-step) and (9.3)(9.7)(9.9) (M-step)
until convergence.

*** Fully Bayesian GMM
    :PROPERTIES:
    :CUSTOM_ID: fully-bayesian-gmm
    :ID:       814289c0-2527-42a0-914b-d64ad62ecd05
    :END:
A typical example of fully Bayesian mixture models is the fully Bayesian
Gaussian mixture model (Attias 2000, also called variational GMM in the
literature). It is defined by applying the same modification to GMM as
the difference between Fully Bayesian mixture model and the vanilla
mixture model.

More specifically:

- $p(z_{i}) = \text{Cat}(\pi)$ as in vanilla GMM
- $p(\pi) = \text{Dir}(\alpha, \alpha, ..., \alpha)$ has Dirichlet
  distribution, the conjugate prior to the parameters of the categorical
  distribution.
- $p(x_i | z_i = k) = p(x_i | \eta_k) = N(\mu_{k}, \Sigma_{k})$ as in
  vanilla GMM
- $p(\mu_k, \Sigma_k) = \text{NIW} (\mu_0, \lambda, \Psi, \nu)$ is the
  normal-inverse-Wishart distribution, the conjugate prior to the mean
  and covariance matrix of the Gaussian distribution.

The E-step and M-step can be computed using (9.1) and (9.3)(9.7)(9.9) in
the previous section. The details of the computation can be found in
Chapter 10.2 of Bishop 2006 or Attias 2000.

*** LDA
    :PROPERTIES:
    :CUSTOM_ID: lda
    :ID:       7d752891-ef33-4b58-9dc3-d6a61325bfa6
    :END:
As the second example of fully Bayesian mixture models, Latent Dirichlet
allocation (LDA) (Blei-Ng-Jordan 2003) is the fully Bayesian version of
pLSA2, with the following plate notations:

[[/assets/lda.png]]

It is the smoothed version in the paper.

More specifically, on the basis of pLSA2, we add prior distributions to
$\eta$ and $\pi$:

$$\begin{aligned}
p(\eta_k) &= \text{Dir} (\beta, ..., \beta), \qquad k = 1 : n_z \\
p(\pi_\ell) &= \text{Dir} (\alpha, ..., \alpha), \qquad \ell = 1 : n_d \\
\end{aligned}$$

And as before, the prior of $z$ is

$$p(z_{\ell, i}) = \text{Cat} (\pi_\ell), \qquad \ell = 1 : n_d, i = 1 : m$$

We also denote posterior distribution

$$\begin{aligned}
q(\eta_k) &= \text{Dir} (\phi^{\eta_k}), \qquad k = 1 : n_z \\
q(\pi_\ell) &= \text{Dir} (\phi^{\pi_\ell}), \qquad \ell = 1 : n_d \\
p(z_{\ell, i}) &= \text{Cat} (r_{\ell, i}), \qquad \ell = 1 : n_d, i = 1 : m
\end{aligned}$$

As before, in E-step we update $r$, and in M-step we update $\lambda$
and $\gamma$.

But in the LDA paper, one treats optimisation over $r$, $\lambda$ and
$\gamma$ as a E-step, and treats $\alpha$ and $\beta$ as parameters,
which is optmised over at M-step. This makes it more akin to the
classical EM where the E-step is Bayesian and M-step MLE. This is more
complicated, and we do not consider it this way here.

Plugging in (9.1) we obtain the updates at E-step

$$r_{\ell i k} \propto \exp(\psi(\phi^{\pi_\ell}_k) + \psi(\phi^{\eta_k}_{x_{\ell i}}) - \psi(\sum_w \phi^{\eta_k}_w)), \qquad (10)$$

where $\psi$ is the digamma function. Similarly, plugging in
(9.3)(9.7)(9.9), at M-step, we update the posterior of $\pi$ and $\eta$:

$$\begin{aligned}
\phi^{\pi_\ell}_k &= \alpha + \sum_i r_{\ell i k}. \qquad (11)\\
%%}}$
%%As for $\eta$, we have
%%{{$%align%
%%\log q(\eta) &= \sum_k \log p(\eta_k) + \sum_{\ell, i} \mathbb E_{q(z_{\ell i})} \log p(x_{\ell i} | z_{\ell i}, \eta) \\
%%&= \sum_{k, j} (\sum_{\ell, i} r_{\ell i k} 1_{x_{\ell i} = j} + \beta - 1) \log \eta_{k j}
%%}}$
%%which gives us
%%{{$
\phi^{\eta_k}_w &= \beta + \sum_{\ell, i} r_{\ell i k} 1_{x_{\ell i} = w}. \qquad (12)
\end{aligned}$$

So the algorithm iterates over (10) and (11)(12) until convergence.

*** DPMM
    :PROPERTIES:
    :CUSTOM_ID: dpmm
    :ID:       187cb168-b3f8-428e-962a-80ad5966f844
    :END:
The Dirichlet process mixture model (DPMM) is like the fully Bayesian
mixture model except $n_z = \infty$, i.e. $z$ can take any positive
integer value.

The probability of $z_i = k$ is defined using the so called
stick-breaking process: let $v_i \sim \text{Beta} (\alpha, \beta)$ be
i.i.d. random variables with Beta distributions, then

$$\mathbb P(z_i = k | v_{1:\infty}) = (1 - v_1) (1 - v_2) ... (1 - v_{k - 1}) v_k.$$

So $v$ plays a similar role to $\pi$ in the previous models.

As before, we have that the distribution of $x$ belongs to the
exponential family:

$$p(x | z = k, \eta) = p(x | \eta_k) = h(x) \exp(\eta_k \cdot T(x) - A(\eta_k))$$

so the prior of $\eta_k$ is

$$p(\eta_k) \propto \exp(\chi \cdot \eta_k - \nu A(\eta_k)).$$

Because of the infinities we can't directly apply the formulas in the
general fully Bayesian mixture models. So let us carefully derive the
whole thing again.

As before, we can write down the ELBO:

$$L(p(x, z, \theta), q(z, \theta)) = \mathbb E_{q(\theta)} \log {p(\theta) \over q(\theta)} + \mathbb E_{q(\theta) q(z)} \log {p(x, z | \theta) \over q(z)}.$$

Both terms are infinite series:

$$L(p, q) = \sum_{k = 1 : \infty} \mathbb E_{q(\theta_k)} \log {p(\theta_k) \over q(\theta_k)} + \sum_{i = 1 : m} \sum_{k = 1 : \infty} q(z_i = k) \mathbb E_{q(\theta)} \log {p(x_i, z_i = k | \theta) \over q(z_i = k)}.$$

There are several ways to deal with the infinities. One is to fix some
level $T > 0$ and set $v_T = 1$ almost surely (Blei-Jordan 2006). This
effectively turns the model into a finite one, and both terms become
finite sums over $k = 1 : T$.

Another walkaround (Kurihara-Welling-Vlassis 2007) is also a kind of
truncation, but less heavy-handed: setting the posterior
$q(\theta) = q(\eta) q(v)$ to be:

$$q(\theta) = q(\theta_{1 : T}) p(\theta_{T + 1 : \infty}) =: q(\theta_{\le T}) p(\theta_{> T}).$$

That is, tie the posterior after $T$ to the prior. This effectively
turns the first term in the ELBO to a finite sum over $k = 1 : T$, while
keeping the second sum an infinite series:

$$L(p, q) = \sum_{k = 1 : T} \mathbb E_{q(\theta_k)} \log {p(\theta_k) \over q(\theta_k)} + \sum_i \sum_{k = 1 : \infty} q(z_i = k) \mathbb E_{q(\theta)} \log {p(x_i, z_i = k | \theta) \over q(z_i = k)}. \qquad (13)$$

The plate notation of this model is:

[[/assets/dpmm.png]]

As it turns out, the infinities can be tamed in this case.

As before, the optimal $q(z_i)$ is computed as

$$r_{ik} = q(z_i = k) = s_{ik} / S_i$$

where

$$\begin{aligned}
s_{ik} &= \exp(\mathbb E_{q(\theta)} \log p(x_i, z_i = k | \theta)) \\
S_i &= \sum_{k = 1 : \infty} s_{ik}.
\end{aligned}$$

Plugging this back to (13) we have

$$\begin{aligned}
\sum_{k = 1 : \infty} r_{ik} &\mathbb E_{q(\theta)} \log {p(x_i, z_i = k | \theta) \over r_{ik}} \\
&= \sum_{k = 1 : \infty} r_{ik} \mathbb E_{q(\theta)} (\log p(x_i, z_i = k | \theta) - \mathbb E_{q(\theta)} \log p(x_i, z_i = k | \theta) + \log S_i) = \log S_i.
\end{aligned}$$

So it all rests upon $S_i$ being finite.

For $k \le T + 1$, we compute the quantity $s_{ik}$ directly. For
$k > T$, it is not hard to show that

$$s_{ik} = s_{i, T + 1} \exp((k - T - 1) \mathbb E_{p(w)} \log (1 - w)),$$

where $w$ is a random variable with same distribution as $p(v_k)$, i.e.
$\text{Beta}(\alpha, \beta)$.

Hence

$$S_i = \sum_{k = 1 : T} s_{ik} + {s_{i, T + 1} \over 1 - \exp(\psi(\beta) - \psi(\alpha + \beta))}$$

is indeed finite. Similarly we also obtain

$$q(z_i > k) = S^{-1} \left(\sum_{\ell = k + 1 : T} s_\ell + {s_{i, T + 1} \over 1 - \exp(\psi(\beta) - \psi(\alpha + \beta))}\right), k \le T \qquad (14)$$

Now let us compute the posterior of $\theta_{\le T}$. In the following
we exchange the integrals without justifying them (c.f. Fubini's
Theorem). Equation (13) can be rewritten as

$$L(p, q) = \mathbb E_{q(\theta_{\le T})} \left(\log p(\theta_{\le T}) + \sum_i \mathbb E_{q(z_i) p(\theta_{> T})} \log {p(x_i, z_i | \theta) \over q(z_i)} - \log q(\theta_{\le T})\right).$$

Note that unlike the derivation of the mean-field approximation, we keep
the $- \mathbb E_{q(z)} \log q(z)$ term even though we are only
interested in $\theta$ at this stage. This is again due to the problem
of infinities: as in the computation of $S$, we would like to cancel out
some undesirable unbounded terms using $q(z)$. We now have

$$\log q(\theta_{\le T}) = \log p(\theta_{\le T}) + \sum_i \mathbb E_{q(z_i) p(\theta_{> T})} \log {p(x_i, z_i | \theta) \over q(z_i)} + C.$$

By plugging in $q(z = k)$ we obtain

$$\log q(\theta_{\le T}) = \log p(\theta_{\le T}) + \sum_{k = 1 : \infty} q(z_i = k) \left(\mathbb E_{p(\theta_{> T})} \log {p(x_i, z_i = k | \theta) \over q(z_i = k)} - \mathbb E_{q(\theta)} \log {p(x_i, z_i = k | \theta) \over q(z_i = k)}\right) + C.$$

Again, we separate the $v_k$'s and the $\eta_k$'s to obtain

$$q(v_{\le T}) = \log p(v_{\le T}) + \sum_i \sum_k q(z_i = k) \left(\mathbb E_{p(v_{> T})} \log p(z_i = k | v) - \mathbb E_{q(v)} \log p (z_i = k | v)\right).$$

Denote by $D_k$ the difference between the two expetations on the right
hand side. It is easy to show that

$$D_k = \begin{cases}
\log(1 - v_1) ... (1 - v_{k - 1}) v_k - \mathbb E_{q(v)} \log (1 - v_1) ... (1 - v_{k - 1}) v_k & k \le T\\
\log(1 - v_1) ... (1 - v_T) - \mathbb E_{q(v)} \log (1 - v_1) ... (1 - v_T) & k > T
\end{cases}$$

so $D_k$ is bounded. With this we can derive the update for
$\phi^{v, 1}$ and $\phi^{v, 2}$:

$$\begin{aligned}
\phi^{v, 1}_k &= \alpha + \sum_i q(z_i = k) \\
\phi^{v, 2}_k &= \beta + \sum_i q(z_i > k),
\end{aligned}$$

where $q(z_i > k)$ can be computed as in (14).

When it comes to $\eta$, we have

$$\log q(\eta_{\le T}) = \log p(\eta_{\le T}) + \sum_i \sum_{k = 1 : \infty} q(z_i = k) (\mathbb E_{p(\eta_k)} \log p(x_i | \eta_k) - \mathbb E_{q(\eta_k)} \log p(x_i | \eta_k)).$$

Since $q(\eta_k) = p(\eta_k)$ for $k > T$, the inner sum on the right
hand side is a finite sum over $k = 1 : T$. By factorising
$q(\eta_{\le T})$ and $p(\eta_{\le T})$, we have

$$\log q(\eta_k) = \log p(\eta_k) + \sum_i q(z_i = k) \log (x_i | \eta_k) + C,$$

which gives us

$$\begin{aligned}
\phi^{\eta, 1}_k &= \xi + \sum_i q(z_i = k) T(x_i) \\
\phi^{\eta, 2}_k &= \nu + \sum_i q(z_i = k).
\end{aligned}$$

** SVI
   :PROPERTIES:
   :CUSTOM_ID: svi
   :ID:       47efee6c-67ac-44eb-92fb-4d576ae2ec99
   :END:
In variational inference, the computation of some parameters are more
expensive than others.

For example, the computation of M-step is often much more expensive than
that of E-step:

- In the vanilla mixture models with the EM algorithm, the update of
  $\theta$ requires the computation of $r_{ik}$ for all $i = 1 : m$, see
  Eq (2.3).
- In the fully Bayesian mixture model with mean field approximation, the
  updates of $\phi^\pi$ and $\phi^\eta$ require the computation of a sum
  over all samples (see Eq (9.3)(9.7)(9.9)).

Similarly, in pLSA2 (resp. LDA), the updates of $\eta_k$ (resp.
$\phi^{\eta_k}$) requires a sum over $\ell = 1 : n_d$, whereas the
updates of other parameters do not.

In these cases, the parameter that requires more computations are called
global and the other ones local.

Stochastic variational inference (SVI, Hoffman-Blei-Wang-Paisley 2012)
addresses this problem in the same way as stochastic gradient descent
improves efficiency of gradient descent.

Each time SVI picks a sample, updates the corresponding local
parameters, and computes the update of the global parameters as if all
the $m$ samples are identical to the picked sample. Finally it
incorporates this global parameter value into previous computations of
the global parameters, by means of an exponential moving average.

As an example, here's SVI applied to LDA:

1. Set $t = 1$.

2. Pick $\ell$ uniformly from $\{1, 2, ..., n_d\}$.

3. Repeat until convergence:

   1. Compute $(r_{\ell i k})_{i = 1 : m, k = 1 : n_z}$ using (10).
   2. Compute $(\phi^{\pi_\ell}_k)_{k = 1 : n_z}$ using (11).

4. Compute $(\tilde \phi^{\eta_k}_w)_{k = 1 : n_z, w = 1 : n_x}$ using
   the following formula (compare with (12))
   $$\tilde \phi^{\eta_k}_w = \beta + n_d \sum_{i} r_{\ell i k} 1_{x_{\ell i} = w}$$

5. Update the exponential moving average
   $(\phi^{\eta_k}_w)_{k = 1 : n_z, w = 1 : n_x}$:
   $$\phi^{\eta_k}_w = (1 - \rho_t) \phi^{\eta_k}_w + \rho_t \tilde \phi^{\eta_k}_w$$

6. Increment $t$ and go back to Step 2.

In the original paper, $\rho_t$ needs to satisfy some conditions that
guarantees convergence of the global parameters:

$$\begin{aligned}
\sum_t \rho_t = \infty \\
\sum_t \rho_t^2 < \infty
\end{aligned}$$

and the choice made there is

$$\rho_t = (t + \tau)^{-\kappa}$$

for some $\kappa \in (.5, 1]$ and $\tau \ge 0$.

** AEVB
   :PROPERTIES:
   :CUSTOM_ID: aevb
   :ID:       a196df8f-1574-4390-83a4-dd22d8fcecaf
   :END:
SVI adds to variational inference stochastic updates similar to
stochastic gradient descent. Why not just use neural networks with
stochastic gradient descent while we are at it? Autoencoding variational
Bayes (AEVB) (Kingma-Welling 2013) is such an algorithm.

Let's look back to the original problem of maximising the ELBO:

$$\max_{\theta, q} \sum_{i = 1 : m} L(p(x_i | z_i; \theta) p(z_i; \theta), q(z_i))$$

Since for any given $\theta$, the optimal $q(z_i)$ is the posterior
$p(z_i | x_i; \theta)$, the problem reduces to

$$\max_{\theta} \sum_i L(p(x_i | z_i; \theta) p(z_i; \theta), p(z_i | x_i; \theta))$$

Let us assume $p(z_i; \theta) = p(z_i)$ is independent of $\theta$ to
simplify the problem. In the old mixture models, we have
$p(x_i | z_i; \theta) = p(x_i; \eta_{z_i})$, which we can generalise to
$p(x_i; f(\theta, z_i))$ for some function $f$. Using Beyes' theorem we
can also write down $p(z_i | x_i; \theta) = q(z_i; g(\theta, x_i))$ for
some function $g$. So the problem becomes

$$\max_{\theta} \sum_i L(p(x_i; f(\theta, z_i)) p(z_i), q(z_i; g(\theta, x_i)))$$

In some cases $g$ can be hard to write down or compute. AEVB addresses
this problem by replacing $g(\theta, x_i)$ with a neural network
$g_\phi(x_i)$ with input $x_i$ and some separate parameters $\phi$. It
also replaces $f(\theta, z_i)$ with a neural network $f_\theta(z_i)$
with input $z_i$ and parameters $\theta$. And now the problem becomes

$$\max_{\theta, \phi} \sum_i L(p(x_i; f_\theta(z_i)) p(z_i), q(z_i; g_\phi(x_i))).$$

The objective function can be written as

$$\sum_i \mathbb E_{q(z_i; g_\phi(x_i))} \log p(x_i; f_\theta(z_i)) - D(q(z_i; g_\phi(x_i)) || p(z_i)).$$

The first term is called the negative reconstruction error, like the
$- \|decoder(encoder(x)) - x\|$ in autoencoders, which is where the
"autoencoder" in the name comes from.

The second term is a regularisation term that penalises the posterior
$q(z_i)$ that is very different from the prior $p(z_i)$. We assume this
term can be computed analytically.

So only the first term requires computing.

We can approximate the sum over $i$ in a similar fashion as SVI: pick
$j$ uniformly randomly from $\{1 ... m\}$ and treat the whole dataset as
$m$ replicates of $x_j$, and approximate the expectation using
Monte-Carlo:

$$U(x_i, \theta, \phi) := \sum_i \mathbb E_{q(z_i; g_\phi(x_i))} \log p(x_i; f_\theta(z_i)) \approx m \mathbb E_{q(z_j; g_\phi(x_j))} \log p(x_j; f_\theta(z_j)) \approx {m \over L} \sum_{\ell = 1}^L \log p(x_j; f_\theta(z_{j, \ell})),$$

where each $z_{j, \ell}$ is sampled from $q(z_j; g_\phi(x_j))$.

But then it is not easy to approximate the gradient over $\phi$. One can
use the log trick as in policy gradients, but it has the problem of high
variance. In policy gradients this is overcome by using baseline
subtractions. In the AEVB paper it is tackled with the
reparameterisation trick.

Assume there exists a transformation $T_\phi$ and a random variable
$\epsilon$ with distribution independent of $\phi$ or $\theta$, such
that $T_\phi(x_i, \epsilon)$ has distribution $q(z_i; g_\phi(x_i))$. In
this case we can rewrite $U(x, \phi, \theta)$ as

$$\sum_i \mathbb E_{\epsilon \sim p(\epsilon)} \log p(x_i; f_\theta(T_\phi(x_i, \epsilon))),$$

This way one can use Monte-Carlo to approximate
$\nabla_\phi U(x, \phi, \theta)$:

$$\nabla_\phi U(x, \phi, \theta) \approx {m \over L} \sum_{\ell = 1 : L} \nabla_\phi \log p(x_j; f_\theta(T_\phi(x_j, \epsilon_\ell))),$$

where each $\epsilon_{\ell}$ is sampled from $p(\epsilon)$. The
approximation of $U(x, \phi, \theta)$ itself can be done similarly.

*** VAE
    :PROPERTIES:
    :CUSTOM_ID: vae
    :ID:       59e07ae5-a4d3-4b95-949f-0b4348f2b70b
    :END:
As an example of AEVB, the paper introduces variational autoencoder
(VAE), with the following instantiations:

- The prior $p(z_i) = N(0, I)$ is standard normal, thus independent of
  $\theta$.
- The distribution $p(x_i; \eta)$ is either Gaussian or categorical.
- The distribution $q(z_i; \mu, \Sigma)$ is Gaussian with diagonal
  covariance matrix. So
  $g_\phi(z_i) = (\mu_\phi(x_i), \text{diag}(\sigma^2_\phi(x_i)_{1 : d}))$.
  Thus in the reparameterisation trick $\epsilon \sim N(0, I)$ and
  $T_\phi(x_i, \epsilon) = \epsilon \odot \sigma_\phi(x_i) + \mu_\phi(x_i)$,
  where $\odot$ is elementwise multiplication.
- The KL divergence can be easily computed analytically as
  $- D(q(z_i; g_\phi(x_i)) || p(z_i)) = {d \over 2} + \sum_{j = 1 : d} \log\sigma_\phi(x_i)_j - {1 \over 2} \sum_{j = 1 : d} (\mu_\phi(x_i)_j^2 + \sigma_\phi(x_i)_j^2)$.

With this, one can use backprop to maximise the ELBO.

*** Fully Bayesian AEVB
    :PROPERTIES:
    :CUSTOM_ID: fully-bayesian-aevb
    :ID:       0fb4f75b-4b62-440f-adc7-996b2d7f718a
    :END:
Let us turn to fully Bayesian version of AEVB. Again, we first recall
the ELBO of the fully Bayesian mixture models:

$$L(p(x, z, \pi, \eta; \alpha, \beta), q(z, \pi, \eta; r, \phi)) = L(p(x | z, \eta) p(z | \pi) p(\pi; \alpha) p(\eta; \beta), q(z; r) q(\eta; \phi^\eta) q(\pi; \phi^\pi)).$$

We write $\theta = (\pi, \eta)$, rewrite $\alpha := (\alpha, \beta)$,
$\phi := r$, and $\gamma := (\phi^\eta, \phi^\pi)$. Furthermore, as in
the half-Bayesian version we assume $p(z | \theta) = p(z)$, i.e. $z$
does not depend on $\theta$. Similarly we also assume
$p(\theta; \alpha) = p(\theta)$. Now we have

$$L(p(x, z, \theta; \alpha), q(z, \theta; \phi, \gamma)) = L(p(x | z, \theta) p(z) p(\theta), q(z; \phi) q(\theta; \gamma)).$$

And the objective is to maximise it over $\phi$ and $\gamma$. We no
longer maximise over $\theta$, because it is now a random variable, like
$z$. Now let us transform it to a neural network model, as in the
half-Bayesian case:

$$L\left(\left(\prod_{i = 1 : m} p(x_i; f_\theta(z_i))\right) \left(\prod_{i = 1 : m} p(z_i) \right) p(\theta), \left(\prod_{i = 1 : m} q(z_i; g_\phi(x_i))\right) q(\theta; h_\gamma(x))\right).$$

where $f_\theta$, $g_\phi$ and $h_\gamma$ are neural networks. Again, by
separating out KL-divergence terms, the above formula becomes

$$\sum_i \mathbb E_{q(\theta; h_\gamma(x))q(z_i; g_\phi(x_i))} \log p(x_i; f_\theta(z_i)) - \sum_i D(q(z_i; g_\phi(x_i)) || p(z_i)) - D(q(\theta; h_\gamma(x)) || p(\theta)).$$

Again, we assume the latter two terms can be computed analytically.
Using reparameterisation trick, we write

$$\begin{aligned}
\theta &= R_\gamma(\zeta, x) \\
z_i &= T_\phi(\epsilon, x_i)
\end{aligned}$$

for some transformations $R_\gamma$ and $T_\phi$ and random variables
$\zeta$ and $\epsilon$ so that the output has the desired distributions.

Then the first term can be written as

$$\mathbb E_{\zeta, \epsilon} \log p(x_i; f_{R_\gamma(\zeta, x)} (T_\phi(\epsilon, x_i))),$$

so that the gradients can be computed accordingly.

Again, one may use Monte-Carlo to approximate this expetation.

** References
   :PROPERTIES:
   :CUSTOM_ID: references
   :ID:       df1567c9-b0e1-499f-a9d1-c0c915b2b98d
   :END:

- Attias, Hagai. "A variational baysian framework for graphical models."
  In Advances in neural information processing systems, pp.
  209-215. 2000.
- Bishop, Christopher M. Neural networks for pattern recognition.
  Springer. 2006.
- Blei, David M., and Michael I. Jordan. "Variational Inference for
  Dirichlet Process Mixtures." Bayesian Analysis 1, no. 1 (March 2006):
  121--43. [[https://doi.org/10.1214/06-BA104]].
- Blei, David M., Andrew Y. Ng, and Michael I. Jordan. "Latent Dirichlet
  Allocation." Journal of Machine Learning Research 3, no. Jan (2003):
  993--1022.
- Hofmann, Thomas. "Latent Semantic Models for Collaborative Filtering."
  ACM Transactions on Information Systems 22, no. 1 (January 1, 2004):
  89--115. [[https://doi.org/10.1145/963770.963774]].
- Hofmann, Thomas. "Learning the similarity of documents: An
  information-geometric approach to document retrieval and
  categorization." In Advances in neural information processing systems,
  pp. 914-920. 2000.
- Hoffman, Matt, David M. Blei, Chong Wang, and John Paisley.
  "Stochastic Variational Inference." ArXiv:1206.7051 [Cs, Stat], June
  29, 2012. [[http://arxiv.org/abs/1206.7051]].
- Kingma, Diederik P., and Max Welling. "Auto-Encoding Variational
  Bayes." ArXiv:1312.6114 [Cs, Stat], December 20, 2013.
  [[http://arxiv.org/abs/1312.6114]].
- Kurihara, Kenichi, Max Welling, and Nikos Vlassis. "Accelerated
  variational Dirichlet process mixtures." In Advances in neural
  information processing systems, pp. 761-768. 2007.
- Sudderth, Erik Blaine. "Graphical models for visual object recognition
  and tracking." PhD diss., Massachusetts Institute of Technology, 2006.
# Copyright (C) 2013-2021 Yuchen Pei.

# Permission is granted to copy, distribute and/or modify this
# document under the terms of the GNU Free Documentation License,
# Version 1.3 or any later version published by the Free Software
# Foundation; with no Invariant Sections, no Front-Cover Texts, and
# no Back-Cover Texts. You should have received a copy of the GNU
# Free Documentation License. If not, see <https://www.gnu.org/licenses/>.

# This work is licensed under the Creative Commons Attribution-ShareAlike 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/4.0/ or send a letter to Creative Commons, PO Box 1866, Mountain View, CA 94042, USA.

#+title: Raise your ELBO

#+date: <2019-02-14>

In this post I give an introduction to variational inference, which is
about maximising the evidence lower bound (ELBO).

I use a top-down approach, starting with the KL divergence and the ELBO,
to lay the mathematical framework of all the models in this post.

Then I define mixture models and the EM algorithm, with Gaussian mixture
model (GMM), probabilistic latent semantic analysis (pLSA) and the
hidden markov model (HMM) as examples.

After that I present the fully Bayesian version of EM, also known as
mean field approximation (MFA), and apply it to fully Bayesian mixture
models, with fully Bayesian GMM (also known as variational GMM), latent
Dirichlet allocation (LDA) and Dirichlet process mixture model (DPMM) as
examples.

Then I explain stochastic variational inference, a modification of EM
and MFA to improve efficiency.

Finally I talk about autoencoding variational Bayes (AEVB), a
Monte-Carlo + neural network approach to raising the ELBO, exemplified
by the variational autoencoder (VAE). I also show its fully Bayesian
version.

*Acknowledgement*. The following texts and resources were illuminating
during the writing of this post: the Stanford CS228 notes
([[https://ermongroup.github.io/cs228-notes/inference/variational/][1]],[[https://ermongroup.github.io/cs228-notes/learning/latent/][2]]),
the
[[https://www.cs.tau.ac.il/~rshamir/algmb/presentations/EM-BW-Ron-16%20.pdf][Tel
Aviv Algorithms in Molecular Biology slides]] (clear explanations of the
connection between EM and Baum-Welch), Chapter 10 of
[[https://www.springer.com/us/book/9780387310732][Bishop's book]]
(brilliant introduction to variational GMM), Section 2.5 of
[[http://cs.brown.edu/~sudderth/papers/sudderthPhD.pdf][Sudderth's
thesis]] and [[https://metacademy.org][metacademy]]. Also thanks to
Josef Lindman Hörnlund for discussions. The research was done while
working at KTH mathematics department.

/If you are reading on a mobile device, you may need to "request desktop
site" for the equations to be properly displayed. This post is licensed
under CC BY-SA and GNU FDL./

** KL divergence and ELBO
   :PROPERTIES:
   :CUSTOM_ID: kl-divergence-and-elbo
   :ID:       2bb0d405-f6b4-483f-9f2d-c0e945faa3ac
   :END:
Let $p$ and $q$ be two probability measures. The Kullback-Leibler (KL)
divergence is defined as

$$D(q||p) = E_q \log{q \over p}.$$

It achieves minimum $0$ when $p = q$.

If $p$ can be further written as

$$p(x) = {w(x) \over Z}, \qquad (0)$$

where $Z$ is a normaliser, then

$$\log Z = D(q||p) + L(w, q), \qquad(1)$$

where $L(w, q)$ is called the evidence lower bound (ELBO), defined by

$$L(w, q) = E_q \log{w \over q}. \qquad (1.25)$$

From (1), we see that to minimise the nonnegative term $D(q || p)$, one
can maximise the ELBO.

To this end, we can simply discard $D(q || p)$ in (1) and obtain:

$$\log Z \ge L(w, q) \qquad (1.3)$$

and keep in mind that the inequality becomes an equality when
$q = {w \over Z}$.

It is time to define the task of variational inference (VI), also known
as variational Bayes (VB).

*Definition*. Variational inference is concerned with maximising the
ELBO $L(w, q)$.

There are mainly two versions of VI, the half Bayesian and the fully
Bayesian cases. Half Bayesian VI, to which expectation-maximisation
algorithms (EM) apply, instantiates (1.3) with

$$\begin{aligned}
Z &= p(x; \theta)\\
w &= p(x, z; \theta)\\
q &= q(z)
\end{aligned}$$

and the dummy variable $x$ in Equation (0) is substituted with $z$.

Fully Bayesian VI, often just called VI, has the following
instantiations:

$$\begin{aligned}
Z &= p(x) \\
w &= p(x, z, \theta) \\
q &= q(z, \theta)
\end{aligned}$$

and $x$ in Equation (0) is substituted with $(z, \theta)$.

In both cases $\theta$ are parameters and $z$ are latent variables.

*Remark on the naming of things*. The term "variational" comes from the
fact that we perform calculus of variations: maximise some functional
($L(w, q)$) over a set of functions ($q$). Note however, most of the VI
/ VB algorithms do not concern any techniques in calculus of variations,
but only uses Jensen's inequality / the fact the $D(q||p)$ reaches
minimum when $p = q$. Due to this reasoning of the naming, EM is also a
kind of VI, even though in the literature VI often referes to its fully
Bayesian version.

** EM
   :PROPERTIES:
   :CUSTOM_ID: em
   :ID:       6d694b38-56c2-4e10-8a1f-1f82e309073f
   :END:
To illustrate the EM algorithms, we first define the mixture model.

*Definition (mixture model)*. Given dataset $x_{1 : m}$, we assume the
data has some underlying latent variable $z_{1 : m}$ that may take a
value from a finite set $\{1, 2, ..., n_z\}$. Let $p(z_{i}; \pi)$ be
categorically distributed according to the probability vector $\pi$.
That is, $p(z_{i} = k; \pi) = \pi_k$. Also assume
$p(x_{i} | z_{i} = k; \eta) = p(x_{i}; \eta_k)$. Find
$\theta = (\pi, \eta)$ that maximises the likelihood
$p(x_{1 : m}; \theta)$.

Represented as a DAG (a.k.a the plate notations), the model looks like
this:

[[/assets/mixture-model.png]]

where the boxes with $m$ mean repitition for $m$ times, since there $m$
indepdent pairs of $(x, z)$, and the same goes for $\eta$.

The direct maximisation

$$\max_\theta \sum_i \log p(x_{i}; \theta) = \max_\theta \sum_i \log \int p(x_{i} | z_i; \theta) p(z_i; \theta) dz_i$$

is hard because of the integral in the log.

We can fit this problem in (1.3) by having $Z = p(x_{1 : m}; \theta)$
and $w = p(z_{1 : m}, x_{1 : m}; \theta)$. The plan is to update
$\theta$ repeatedly so that $L(p(z, x; \theta_t), q(z))$ is non
decreasing over time $t$.

Equation (1.3) at time $t$ for the $i$th datapoint is

$$\log p(x_{i}; \theta_t) \ge L(p(z_i, x_{i}; \theta_t), q(z_i)) \qquad (2)$$

Each timestep consists of two steps, the E-step and the M-step.

At E-step, we set

$$q(z_{i}) = p(z_{i}|x_{i}; \theta_t), $$

to turn the inequality into equality. We denote $r_{ik} = q(z_i = k)$
and call them responsibilities, so the posterior $q(z_i)$ is categorical
distribution with parameter $r_i = r_{i, 1 : n_z}$.

At M-step, we maximise $\sum_i L(p(x_{i}, z_{i}; \theta), q(z_{i}))$
over $\theta$:

$$\begin{aligned}
\theta_{t + 1} &= \text{argmax}_\theta \sum_i L(p(x_{i}, z_{i}; \theta), p(z_{i} | x_{i}; \theta_t)) \\
&= \text{argmax}_\theta \sum_i \mathbb E_{p(z_{i} | x_{i}; \theta_t)} \log p(x_{i}, z_{i}; \theta) \qquad (2.3)
\end{aligned}$$

So $\sum_i L(p(x_{i}, z_{i}; \theta), q(z_i))$ is non-decreasing at both
the E-step and the M-step.

We can see from this derivation that EM is half-Bayesian. The E-step is
Bayesian it computes the posterior of the latent variables and the
M-step is frequentist because it performs maximum likelihood estimate of
$\theta$.

It is clear that the ELBO sum coverges as it is nondecreasing with an
upper bound, but it is not clear whether the sum converges to the
correct value, i.e. $\max_\theta p(x_{1 : m}; \theta)$. In fact it is
said that the EM does get stuck in local maximum sometimes.

A different way of describing EM, which will be useful in hidden Markov
model is:

- At E-step, one writes down the formula
  $$\sum_i \mathbb E_{p(z_i | x_{i}; \theta_t)} \log p(x_{i}, z_i; \theta). \qquad (2.5)$$

- At M-setp, one finds $\theta_{t + 1}$ to be the $\theta$ that
  maximises the above formula.

*** GMM
    :PROPERTIES:
    :CUSTOM_ID: gmm
    :ID:       5d5265f6-c2b9-42f1-a4a1-0d87417f0b02
    :END:
Gaussian mixture model (GMM) is an example of mixture models.

The space of the data is $\mathbb R^n$. We use the hypothesis that the
data is Gaussian conditioned on the latent variable:

$$(x_i; \eta_k) \sim N(\mu_k, \Sigma_k),$$

so we write $\eta_k = (\mu_k, \Sigma_k)$,

During E-step, the $q(z_i)$ can be directly computed using Bayes'
theorem:

$$r_{ik} = q(z_i = k) = \mathbb P(z_i = k | x_{i}; \theta_t)
= {g_{\mu_{t, k}, \Sigma_{t, k}} (x_{i}) \pi_{t, k} \over \sum_{j = 1 : n_z} g_{\mu_{t, j}, \Sigma_{t, j}} (x_{i}) \pi_{t, j}},$$

where
$g_{\mu, \Sigma} (x) = (2 \pi)^{- n / 2} \det \Sigma^{-1 / 2} \exp(- {1 \over 2} (x - \mu)^T \Sigma^{-1} (x - \mu))$
is the pdf of the Gaussian distribution $N(\mu, \Sigma)$.

During M-step, we need to compute

$$\text{argmax}_{\Sigma, \mu, \pi} \sum_{i = 1 : m} \sum_{k = 1 : n_z} r_{ik} \log (g_{\mu_k, \Sigma_k}(x_{i}) \pi_k).$$

This is similar to the quadratic discriminant analysis, and the solution
is

$$\begin{aligned}
\pi_{k} &= {1 \over m} \sum_{i = 1 : m} r_{ik}, \\
\mu_{k} &= {\sum_i r_{ik} x_{i} \over \sum_i r_{ik}}, \\
\Sigma_{k} &= {\sum_i r_{ik} (x_{i} - \mu_{t, k}) (x_{i} - \mu_{t, k})^T \over \sum_i r_{ik}}.
\end{aligned}$$

*Remark*. The k-means algorithm is the $\epsilon \to 0$ limit of the GMM
with constraints $\Sigma_k = \epsilon I$. See Section 9.3.2 of Bishop
2006 for derivation. It is also briefly mentioned there that a variant
in this setting where the covariance matrix is not restricted to
$\epsilon I$ is called elliptical k-means algorithm.

*** SMM
    :PROPERTIES:
    :CUSTOM_ID: smm
    :ID:       f4b3a462-8ae7-44f2-813c-58b007eaa047
    :END:
As a transition to the next models to study, let us consider a simpler
mixture model obtained by making one modification to GMM: change
$(x; \eta_k) \sim N(\mu_k, \Sigma_k)$ to
$\mathbb P(x = w; \eta_k) = \eta_{kw}$ where $\eta$ is a stochastic
matrix and $w$ is an arbitrary element of the space for $x$. So now the
space for both $x$ and $z$ are finite. We call this model the simple
mixture model (SMM).

As in GMM, at E-step $r_{ik}$ can be explicitly computed using Bayes'
theorem.

It is not hard to write down the solution to the M-step in this case:

$$\begin{aligned}
\pi_{k} &= {1 \over m} \sum_i r_{ik}, \qquad (2.7)\\
\eta_{k, w} &= {\sum_i r_{ik} 1_{x_i = w} \over \sum_i r_{ik}}. \qquad (2.8)
\end{aligned}$$

where $1_{x_i = w}$ is the
[[https://en.wikipedia.org/wiki/Indicator_function][indicator
function]], and evaluates to $1$ if $x_i = w$ and $0$ otherwise.

Two trivial variants of the SMM are the two versions of probabilistic
latent semantic analysis (pLSA), which we call pLSA1 and pLSA2.

The model pLSA1 is a probabilistic version of latent semantic analysis,
which is basically a simple matrix factorisation model in collaborative
filtering, whereas pLSA2 has a fully Bayesian version called latent
Dirichlet allocation (LDA), not to be confused with the other LDA
(linear discriminant analysis).

*** pLSA
    :PROPERTIES:
    :CUSTOM_ID: plsa
    :ID:       d4f58158-dcb6-4ba1-a9e2-bf53bff6012e
    :END:
The pLSA model (Hoffman 2000) is a mixture model, where the dataset is
now pairs $(d_i, x_i)_{i = 1 : m}$. In natural language processing, $x$
are words and $d$ are documents, and a pair $(d, x)$ represent an
ocurrance of word $x$ in document $d$.

For each datapoint $(d_{i}, x_{i})$,

$$\begin{aligned}
p(d_i, x_i; \theta) &= \sum_{z_i} p(z_i; \theta) p(d_i | z_i; \theta) p(x_i | z_i; \theta) \qquad (2.91)\\
&= p(d_i; \theta) \sum_z p(x_i | z_i; \theta) p (z_i | d_i; \theta) \qquad (2.92).
\end{aligned}$$

Of the two formulations, (2.91) corresponds to pLSA type 1, and (2.92)
corresponds to type 2.

**** pLSA1
     :PROPERTIES:
     :CUSTOM_ID: plsa1
     :ID:       969f470e-5bbe-464e-a3b7-f996c8f04de3
     :END:
The pLSA1 model (Hoffman 2000) is basically SMM with $x_i$ substituted
with $(d_i, x_i)$, which conditioned on $z_i$ are independently
categorically distributed:

$$p(d_i = u, x_i = w | z_i = k; \theta) = p(d_i ; \xi_k) p(x_i; \eta_k) = \xi_{ku} \eta_{kw}.$$

The model can be illustrated in the plate notations:

[[/assets/plsa1.png]]

So the solution of the M-step is

$$\begin{aligned}
\pi_{k} &= {1 \over m} \sum_i r_{ik} \\
\xi_{k, u} &= {\sum_i r_{ik} 1_{d_{i} = u} \over \sum_i r_{ik}} \\
\eta_{k, w} &= {\sum_i r_{ik} 1_{x_{i} = w} \over \sum_i r_{ik}}.
\end{aligned}$$

*Remark*. pLSA1 is the probabilistic version of LSA, also known as
matrix factorisation.

Let $n_d$ and $n_x$ be the number of values $d_i$ and $x_i$ can take.

*Problem* (LSA). Let $R$ be a $n_d \times n_x$ matrix, fix
$s \le \min\{n_d, n_x\}$. Find $n_d \times s$ matrix $D$ and
$n_x \times s$ matrix $X$ that minimises

$$J(D, X) = \|R - D X^T\|_F.$$

where $\|\cdot\|_F$ is the Frobenius norm.

*Claim*. Let $R = U \Sigma V^T$ be the SVD of $R$, then the solution to
the above problem is $D = U_s \Sigma_s^{{1 \over 2}}$ and
$X = V_s \Sigma_s^{{1 \over 2}}$, where $U_s$ (resp. $V_s$) is the
matrix of the first $s$ columns of $U$ (resp. $V$) and $\Sigma_s$ is the
$s \times s$ submatrix of $\Sigma$.

One can compare pLSA1 with LSA. Both procedures produce embeddings of
$d$ and $x$: in pLSA we obtain $n_z$ dimensional embeddings
$\xi_{\cdot, u}$ and $\eta_{\cdot, w}$, whereas in LSA we obtain $s$
dimensional embeddings $D_{u, \cdot}$ and $X_{w, \cdot}$.

**** pLSA2
     :PROPERTIES:
     :CUSTOM_ID: plsa2
     :ID:       eef3249a-c45d-4a07-876f-68b2a2e957e5
     :END:
Let us turn to pLSA2 (Hoffman 2004), corresponding to (2.92). We rewrite
it as

$$p(x_i | d_i; \theta) = \sum_{z_i} p(x_i | z_i; \theta) p(z_i | d_i; \theta).$$

To simplify notations, we collect all the $x_i$s with the corresponding
$d_i$ equal to 1 (suppose there are $m_1$ of them), and write them as
$(x_{1, j})_{j = 1 : m_1}$. In the same fashion we construct
$x_{2, 1 : m_2}, x_{3, 1 : m_3}, ... x_{n_d, 1 : m_{n_d}}$. Similarly,
we relabel the corresponding $d_i$ and $z_i$ accordingly.

With almost no loss of generality, we assume all $m_\ell$s are equal and
write them as $m$.

Now the model becomes

$$p(x_{\ell, i} | d_{\ell, i} = \ell; \theta) = \sum_k p(x_{\ell, i} | z_{\ell, i} = k; \theta) p(z_{\ell, i} = k | d_{\ell, i} = \ell; \theta).$$

Since we have regrouped the $x$'s and $z$'s whose indices record the
values of the $d$'s, we can remove the $d$'s from the equation
altogether:

$$p(x_{\ell, i}; \theta) = \sum_k p(x_{\ell, i} | z_{\ell, i} = k; \theta) p(z_{\ell, i} = k; \theta).$$

It is effectively a modification of SMM by making $n_d$ copies of $\pi$.
More specifically the parameters are
$\theta = (\pi_{1 : n_d, 1 : n_z}, \eta_{1 : n_z, 1 : n_x})$, where we
model $(z | d = \ell) \sim \text{Cat}(\pi_{\ell, \cdot})$ and, as in
pLSA1, $(x | z = k) \sim \text{Cat}(\eta_{k, \cdot})$.

Illustrated in the plate notations, pLSA2 is:

[[/assets/plsa2.png]]

The computation is basically adding an index $\ell$ to the computation
of SMM wherever applicable.

The updates at the E-step is

$$r_{\ell i k} = p(z_{\ell i} = k | x_{\ell i}; \theta) \propto \pi_{\ell k} \eta_{k, x_{\ell i}}.$$

And at the M-step

$$\begin{aligned}
\pi_{\ell k} &= {1 \over m} \sum_i r_{\ell  i k} \\
\eta_{k w} &= {\sum_{\ell, i} r_{\ell i k} 1_{x_{\ell i} = w} \over \sum_{\ell, i} r_{\ell i k}}.
\end{aligned}$$

*** HMM
    :PROPERTIES:
    :CUSTOM_ID: hmm
    :ID:       16d00eda-7136-49f5-8427-c775c7a91317
    :END:
The hidden markov model (HMM) is a sequential version of SMM, in the
same sense that recurrent neural networks are sequential versions of
feed-forward neural networks.

HMM is an example where the posterior $p(z_i | x_i; \theta)$ is not easy
to compute, and one has to utilise properties of the underlying Bayesian
network to go around it.

Now each sample is a sequence $x_i = (x_{ij})_{j = 1 : T}$, and so are
the latent variables $z_i = (z_{ij})_{j = 1 : T}$.

The latent variables are assumed to form a Markov chain with transition
matrix $(\xi_{k \ell})_{k \ell}$, and $x_{ij}$ is completely dependent
on $z_{ij}$:

$$\begin{aligned}
p(z_{ij} | z_{i, j - 1}) &= \xi_{z_{i, j - 1}, z_{ij}},\\
p(x_{ij} | z_{ij}) &= \eta_{z_{ij}, x_{ij}}.
\end{aligned}$$

Also, the distribution of $z_{i1}$ is again categorical with parameter
$\pi$:

$$p(z_{i1}) = \pi_{z_{i1}}$$

So the parameters are $\theta = (\pi, \xi, \eta)$. And HMM can be shown
in plate notations as:

[[/assets/hmm.png]]

Now we apply EM to HMM, which is called the
[[https://en.wikipedia.org/wiki/Baum%E2%80%93Welch_algorithm][Baum-Welch
algorithm]]. Unlike the previous examples, it is too messy to compute
$p(z_i | x_{i}; \theta)$, so during the E-step we instead write down
formula (2.5) directly in hope of simplifying it:

$$\begin{aligned}
\mathbb E_{p(z_i | x_i; \theta_t)} \log p(x_i, z_i; \theta_t) &=\mathbb E_{p(z_i | x_i; \theta_t)} \left(\log \pi_{z_{i1}} + \sum_{j = 2 : T} \log \xi_{z_{i, j - 1}, z_{ij}} + \sum_{j = 1 : T} \log \eta_{z_{ij}, x_{ij}}\right). \qquad (3)
\end{aligned}$$

Let us compute the summand in second term:

$$\begin{aligned}
\mathbb E_{p(z_i | x_{i}; \theta_t)} \log \xi_{z_{i, j - 1}, z_{ij}} &= \sum_{k, \ell} (\log \xi_{k, \ell}) \mathbb E_{p(z_{i} | x_{i}; \theta_t)} 1_{z_{i, j - 1} = k, z_{i, j} = \ell} \\
&= \sum_{k, \ell} p(z_{i, j - 1} = k, z_{ij} = \ell | x_{i}; \theta_t) \log \xi_{k, \ell}. \qquad (4)
\end{aligned}$$

Similarly, one can write down the first term and the summand in the
third term to obtain

$$\begin{aligned}
\mathbb E_{p(z_i | x_{i}; \theta_t)} \log \pi_{z_{i1}} &= \sum_k p(z_{i1} = k | x_{i}; \theta_t), \qquad (5) \\
\mathbb E_{p(z_i | x_{i}; \theta_t)} \log \eta_{z_{i, j}, x_{i, j}} &= \sum_{k, w} 1_{x_{ij} = w} p(z_{i, j} = k | x_i; \theta_t) \log \eta_{k, w}. \qquad (6)
\end{aligned}$$

plugging (4)(5)(6) back into (3) and summing over $j$, we obtain the
formula to maximise over $\theta$ on:

$$\sum_k \sum_i r_{i1k} \log \pi_k + \sum_{k, \ell} \sum_{j = 2 : T, i} s_{ijk\ell} \log \xi_{k, \ell} + \sum_{k, w} \sum_{j = 1 : T, i} r_{ijk} 1_{x_{ij} = w} \log \eta_{k, w},$$

where

$$\begin{aligned}
r_{ijk} &:= p(z_{ij} = k | x_{i}; \theta_t), \\
s_{ijk\ell} &:= p(z_{i, j - 1} = k, z_{ij} = \ell | x_{i}; \theta_t).
\end{aligned}$$

Now we proceed to the M-step. Since each of the
$\pi_k, \xi_{k, \ell}, \eta_{k, w}$ is nicely confined in the inner sum
of each term, together with the constraint
$\sum_k \pi_k = \sum_\ell \xi_{k, \ell} = \sum_w \eta_{k, w} = 1$ it is
not hard to find the argmax at time $t + 1$ (the same way one finds the
MLE for any categorical distribution):

$$\begin{aligned}
\pi_{k} &= {1 \over m} \sum_i r_{i1k}, \qquad (6.1) \\
\xi_{k, \ell} &= {\sum_{j = 2 : T, i} s_{ijk\ell} \over \sum_{j = 1 : T - 1, i} r_{ijk}}, \qquad(6.2) \\
\eta_{k, w} &= {\sum_{ij} 1_{x_{ij} = w} r_{ijk} \over \sum_{ij} r_{ijk}}. \qquad(6.3)
\end{aligned}$$

Note that (6.1)(6.3) are almost identical to (2.7)(2.8). This makes
sense as the only modification HMM makes over SMM is the added
dependencies between the latent variables.

What remains is to compute $r$ and $s$.

This is done by using the forward and backward procedures which takes
advantage of the conditioned independence / topology of the underlying
Bayesian network. It is out of scope of this post, but for the sake of
completeness I include it here.

Let

$$\begin{aligned}
\alpha_k(i, j) &:= p(x_{i, 1 : j}, z_{ij} = k; \theta_t), \\
\beta_k(i, j) &:= p(x_{i, j + 1 : T} | z_{ij} = k; \theta_t).
\end{aligned}$$

They can be computed recursively as

$$\begin{aligned}
\alpha_k(i, j) &= \begin{cases}
\eta_{k, x_{1j}} \pi_k, & j = 1; \\
\eta_{k, x_{ij}} \sum_\ell \alpha_\ell(j - 1, i) \xi_{k\ell}, & j \ge 2.
\end{cases}\\
\beta_k(i, j) &= \begin{cases}
1, & j = T;\\
\sum_\ell \xi_{k\ell} \beta_\ell(j + 1, i) \eta_{\ell, x_{i, j + 1}}, & j < T.
\end{cases}
\end{aligned}$$

Then

$$\begin{aligned}
p(z_{ij} = k, x_{i}; \theta_t) &= \alpha_k(i, j) \beta_k(i, j), \qquad (7)\\
p(x_{i}; \theta_t) &= \sum_k \alpha_k(i, j) \beta_k(i, j),\forall j = 1 : T \qquad (8)\\
p(z_{i, j - 1} = k, z_{i, j} = \ell, x_{i}; \theta_t) &= \alpha_k(i, j) \xi_{k\ell} \beta_\ell(i, j + 1) \eta_{\ell, x_{j + 1, i}}. \qquad (9)
\end{aligned}$$

And this yields $r_{ijk}$ and $s_{ijk\ell}$ since they can be computed
as ${(7) \over (8)}$ and ${(9) \over (8)}$ respectively.

** Fully Bayesian EM / MFA
   :PROPERTIES:
   :CUSTOM_ID: fully-bayesian-em-mfa
   :ID:       77f1d7ae-3785-45d4-b88f-18478e41f3b9
   :END:
Let us now venture into the realm of full Bayesian.

In EM we aim to maximise the ELBO

$$\int q(z) \log {p(x, z; \theta) \over q(z)} dz d\theta$$

alternately over $q$ and $\theta$. As mentioned before, the E-step of
maximising over $q$ is Bayesian, in that it computes the posterior of
$z$, whereas the M-step of maximising over $\theta$ is maximum
likelihood and frequentist.

The fully Bayesian EM makes the M-step Bayesian by making $\theta$ a
random variable, so the ELBO becomes

$$L(p(x, z, \theta), q(z, \theta)) = \int q(z, \theta) \log {p(x, z, \theta) \over q(z, \theta)} dz d\theta$$

We further assume $q$ can be factorised into distributions on $z$ and
$\theta$: $q(z, \theta) = q_1(z) q_2(\theta)$. So the above formula is
rewritten as

$$L(p(x, z, \theta), q(z, \theta)) = \int q_1(z) q_2(\theta) \log {p(x, z, \theta) \over q_1(z) q_2(\theta)} dz d\theta$$

To find argmax over $q_1$, we rewrite

$$\begin{aligned}
L(p(x, z, \theta), q(z, \theta)) &= \int q_1(z) \left(\int q_2(\theta) \log p(x, z, \theta) d\theta\right) dz - \int q_1(z) \log q_1(z) dz - \int q_2(\theta) \log q_2(\theta) \\&= - D(q_1(z) || p_x(z)) + C,
\end{aligned}$$

where $p_x$ is a density in $z$ with

$$\log p_x(z) = \mathbb E_{q_2(\theta)} \log p(x, z, \theta) + C.$$

So the $q_1$ that maximises the ELBO is $q_1^* = p_x$.

Similarly, the optimal $q_2$ is such that

$$\log q_2^*(\theta) = \mathbb E_{q_1(z)} \log p(x, z, \theta) + C.$$

The fully Bayesian EM thus alternately evaluates $q_1^*$ (E-step) and
$q_2^*$ (M-step).

It is also called mean field approximation (MFA), and can be easily
generalised to models with more than two groups of latent variables, see
e.g. Section 10.1 of Bishop 2006.

*** Application to mixture models
    :PROPERTIES:
    :CUSTOM_ID: application-to-mixture-models
    :ID:       52bf6025-1180-44dc-8272-e6af6e228bf3
    :END:
*Definition (Fully Bayesian mixture model)*. The relations between
$\pi$, $\eta$, $x$, $z$ are the same as in the definition of mixture
models. Furthermore, we assume the distribution of $(x | \eta_k)$
belongs to the
[[https://en.wikipedia.org/wiki/Exponential_family][exponential family]]
(the definition of the exponential family is briefly touched at the end
of this section). But now both $\pi$ and $\eta$ are random variables.
Let the prior distribution $p(\pi)$ is Dirichlet with parameter
$(\alpha, \alpha, ..., \alpha)$. Let the prior $p(\eta_k)$ be the
conjugate prior of $(x | \eta_k)$, with parameter $\beta$, we will see
later in this section that the posterior $q(\eta_k)$ belongs to the same
family as $p(\eta_k)$. Represented in a plate notations, a fully
Bayesian mixture model looks like:

[[/assets/fully-bayesian-mm.png]]

Given this structure we can write down the mean-field approximation:

$$\log q(z) = \mathbb E_{q(\eta)q(\pi)} (\log(x | z, \eta) + \log(z | \pi)) + C.$$

Both sides can be factored into per-sample expressions, giving us

$$\log q(z_i) = \mathbb E_{q(\eta)} \log p(x_i | z_i, \eta) + \mathbb E_{q(\pi)} \log p(z_i | \pi) + C$$

Therefore

$$\log r_{ik} = \log q(z_i = k) = \mathbb E_{q(\eta_k)} \log p(x_i | \eta_k) + \mathbb E_{q(\pi)} \log \pi_k + C. \qquad (9.1)$$

So the posterior of each $z_i$ is categorical regardless of the $p$s and
$q$s.

Computing the posterior of $\pi$ and $\eta$:

$$\log q(\pi) + \log q(\eta) = \log p(\pi) + \log p(\eta) + \sum_i \mathbb E_{q(z_i)} p(x_i | z_i, \eta) + \sum_i \mathbb E_{q(z_i)} p(z_i | \pi) + C.$$

So we can separate the terms involving $\pi$ and those involving $\eta$.
First compute the posterior of $\pi$:

$$\log q(\pi) = \log p(\pi) + \sum_i \mathbb E_{q(z_i)} \log p(z_i | \pi) = \log p(\pi) + \sum_i \sum_k r_{ik} \log \pi_k + C.$$

The right hand side is the log of a Dirichlet modulus the constant $C$,
from which we can update the posterior parameter $\phi^\pi$:

$$\phi^\pi_k = \alpha + \sum_i r_{ik}. \qquad (9.3)$$

Similarly we can obtain the posterior of $\eta$:

$$\log q(\eta) = \log p(\eta) + \sum_i \sum_k r_{ik} \log p(x_i | \eta_k) + C.$$

Again we can factor the terms with respect to $k$ and get:

$$\log q(\eta_k) = \log p(\eta_k) + \sum_i r_{ik} \log p(x_i | \eta_k) + C. \qquad (9.5)$$

Here we can see why conjugate prior works. Mathematically, given a
probability distribution $p(x | \theta)$, the distribution $p(\theta)$
is called conjugate prior of $p(x | \theta)$ if
$\log p(\theta) + \log p(x | \theta)$ has the same form as
$\log p(\theta)$.

For example, the conjugate prior for the exponential family
$p(x | \theta) = h(x) \exp(\theta \cdot T(x) - A(\theta))$ where $T$,
$A$ and $h$ are some functions is
$p(\theta; \chi, \nu) \propto \exp(\chi \cdot \theta - \nu A(\theta))$.

Here what we want is a bit different from conjugate priors because of
the coefficients $r_{ik}$. But the computation carries over to the
conjugate priors of the exponential family (try it yourself and you'll
see). That is, if $p(x_i | \eta_k)$ belongs to the exponential family

$$p(x_i | \eta_k) = h(x) \exp(\eta_k \cdot T(x) - A(\eta_k))$$

and if $p(\eta_k)$ is the conjugate prior of $p(x_i | \eta_k)$

$$p(\eta_k) \propto \exp(\chi \cdot \eta_k - \nu A(\eta_k))$$

then $q(\eta_k)$ has the same form as $p(\eta_k)$, and from (9.5) we can
compute the updates of $\phi^{\eta_k}$:

$$\begin{aligned}
\phi^{\eta_k}_1 &= \chi + \sum_i r_{ik} T(x_i), \qquad (9.7) \\
\phi^{\eta_k}_2 &= \nu + \sum_i r_{ik}. \qquad (9.9)
\end{aligned}$$

So the mean field approximation for the fully Bayesian mixture model is
the alternate iteration of (9.1) (E-step) and (9.3)(9.7)(9.9) (M-step)
until convergence.

*** Fully Bayesian GMM
    :PROPERTIES:
    :CUSTOM_ID: fully-bayesian-gmm
    :ID:       814289c0-2527-42a0-914b-d64ad62ecd05
    :END:
A typical example of fully Bayesian mixture models is the fully Bayesian
Gaussian mixture model (Attias 2000, also called variational GMM in the
literature). It is defined by applying the same modification to GMM as
the difference between Fully Bayesian mixture model and the vanilla
mixture model.

More specifically:

- $p(z_{i}) = \text{Cat}(\pi)$ as in vanilla GMM
- $p(\pi) = \text{Dir}(\alpha, \alpha, ..., \alpha)$ has Dirichlet
  distribution, the conjugate prior to the parameters of the categorical
  distribution.
- $p(x_i | z_i = k) = p(x_i | \eta_k) = N(\mu_{k}, \Sigma_{k})$ as in
  vanilla GMM
- $p(\mu_k, \Sigma_k) = \text{NIW} (\mu_0, \lambda, \Psi, \nu)$ is the
  normal-inverse-Wishart distribution, the conjugate prior to the mean
  and covariance matrix of the Gaussian distribution.

The E-step and M-step can be computed using (9.1) and (9.3)(9.7)(9.9) in
the previous section. The details of the computation can be found in
Chapter 10.2 of Bishop 2006 or Attias 2000.

*** LDA
    :PROPERTIES:
    :CUSTOM_ID: lda
    :ID:       7d752891-ef33-4b58-9dc3-d6a61325bfa6
    :END:
As the second example of fully Bayesian mixture models, Latent Dirichlet
allocation (LDA) (Blei-Ng-Jordan 2003) is the fully Bayesian version of
pLSA2, with the following plate notations:

[[/assets/lda.png]]

It is the smoothed version in the paper.

More specifically, on the basis of pLSA2, we add prior distributions to
$\eta$ and $\pi$:

$$\begin{aligned}
p(\eta_k) &= \text{Dir} (\beta, ..., \beta), \qquad k = 1 : n_z \\
p(\pi_\ell) &= \text{Dir} (\alpha, ..., \alpha), \qquad \ell = 1 : n_d \\
\end{aligned}$$

And as before, the prior of $z$ is

$$p(z_{\ell, i}) = \text{Cat} (\pi_\ell), \qquad \ell = 1 : n_d, i = 1 : m$$

We also denote posterior distribution

$$\begin{aligned}
q(\eta_k) &= \text{Dir} (\phi^{\eta_k}), \qquad k = 1 : n_z \\
q(\pi_\ell) &= \text{Dir} (\phi^{\pi_\ell}), \qquad \ell = 1 : n_d \\
p(z_{\ell, i}) &= \text{Cat} (r_{\ell, i}), \qquad \ell = 1 : n_d, i = 1 : m
\end{aligned}$$

As before, in E-step we update $r$, and in M-step we update $\lambda$
and $\gamma$.

But in the LDA paper, one treats optimisation over $r$, $\lambda$ and
$\gamma$ as a E-step, and treats $\alpha$ and $\beta$ as parameters,
which is optmised over at M-step. This makes it more akin to the
classical EM where the E-step is Bayesian and M-step MLE. This is more
complicated, and we do not consider it this way here.

Plugging in (9.1) we obtain the updates at E-step

$$r_{\ell i k} \propto \exp(\psi(\phi^{\pi_\ell}_k) + \psi(\phi^{\eta_k}_{x_{\ell i}}) - \psi(\sum_w \phi^{\eta_k}_w)), \qquad (10)$$

where $\psi$ is the digamma function. Similarly, plugging in
(9.3)(9.7)(9.9), at M-step, we update the posterior of $\pi$ and $\eta$:

$$\begin{aligned}
\phi^{\pi_\ell}_k &= \alpha + \sum_i r_{\ell i k}. \qquad (11)\\
%%}}$
%%As for $\eta$, we have
%%{{$%align%
%%\log q(\eta) &= \sum_k \log p(\eta_k) + \sum_{\ell, i} \mathbb E_{q(z_{\ell i})} \log p(x_{\ell i} | z_{\ell i}, \eta) \\
%%&= \sum_{k, j} (\sum_{\ell, i} r_{\ell i k} 1_{x_{\ell i} = j} + \beta - 1) \log \eta_{k j}
%%}}$
%%which gives us
%%{{$
\phi^{\eta_k}_w &= \beta + \sum_{\ell, i} r_{\ell i k} 1_{x_{\ell i} = w}. \qquad (12)
\end{aligned}$$

So the algorithm iterates over (10) and (11)(12) until convergence.

*** DPMM
    :PROPERTIES:
    :CUSTOM_ID: dpmm
    :ID:       187cb168-b3f8-428e-962a-80ad5966f844
    :END:
The Dirichlet process mixture model (DPMM) is like the fully Bayesian
mixture model except $n_z = \infty$, i.e. $z$ can take any positive
integer value.

The probability of $z_i = k$ is defined using the so called
stick-breaking process: let $v_i \sim \text{Beta} (\alpha, \beta)$ be
i.i.d. random variables with Beta distributions, then

$$\mathbb P(z_i = k | v_{1:\infty}) = (1 - v_1) (1 - v_2) ... (1 - v_{k - 1}) v_k.$$

So $v$ plays a similar role to $\pi$ in the previous models.

As before, we have that the distribution of $x$ belongs to the
exponential family:

$$p(x | z = k, \eta) = p(x | \eta_k) = h(x) \exp(\eta_k \cdot T(x) - A(\eta_k))$$

so the prior of $\eta_k$ is

$$p(\eta_k) \propto \exp(\chi \cdot \eta_k - \nu A(\eta_k)).$$

Because of the infinities we can't directly apply the formulas in the
general fully Bayesian mixture models. So let us carefully derive the
whole thing again.

As before, we can write down the ELBO:

$$L(p(x, z, \theta), q(z, \theta)) = \mathbb E_{q(\theta)} \log {p(\theta) \over q(\theta)} + \mathbb E_{q(\theta) q(z)} \log {p(x, z | \theta) \over q(z)}.$$

Both terms are infinite series:

$$L(p, q) = \sum_{k = 1 : \infty} \mathbb E_{q(\theta_k)} \log {p(\theta_k) \over q(\theta_k)} + \sum_{i = 1 : m} \sum_{k = 1 : \infty} q(z_i = k) \mathbb E_{q(\theta)} \log {p(x_i, z_i = k | \theta) \over q(z_i = k)}.$$

There are several ways to deal with the infinities. One is to fix some
level $T > 0$ and set $v_T = 1$ almost surely (Blei-Jordan 2006). This
effectively turns the model into a finite one, and both terms become
finite sums over $k = 1 : T$.

Another walkaround (Kurihara-Welling-Vlassis 2007) is also a kind of
truncation, but less heavy-handed: setting the posterior
$q(\theta) = q(\eta) q(v)$ to be:

$$q(\theta) = q(\theta_{1 : T}) p(\theta_{T + 1 : \infty}) =: q(\theta_{\le T}) p(\theta_{> T}).$$

That is, tie the posterior after $T$ to the prior. This effectively
turns the first term in the ELBO to a finite sum over $k = 1 : T$, while
keeping the second sum an infinite series:

$$L(p, q) = \sum_{k = 1 : T} \mathbb E_{q(\theta_k)} \log {p(\theta_k) \over q(\theta_k)} + \sum_i \sum_{k = 1 : \infty} q(z_i = k) \mathbb E_{q(\theta)} \log {p(x_i, z_i = k | \theta) \over q(z_i = k)}. \qquad (13)$$

The plate notation of this model is:

[[/assets/dpmm.png]]

As it turns out, the infinities can be tamed in this case.

As before, the optimal $q(z_i)$ is computed as

$$r_{ik} = q(z_i = k) = s_{ik} / S_i$$

where

$$\begin{aligned}
s_{ik} &= \exp(\mathbb E_{q(\theta)} \log p(x_i, z_i = k | \theta)) \\
S_i &= \sum_{k = 1 : \infty} s_{ik}.
\end{aligned}$$

Plugging this back to (13) we have

$$\begin{aligned}
\sum_{k = 1 : \infty} r_{ik} &\mathbb E_{q(\theta)} \log {p(x_i, z_i = k | \theta) \over r_{ik}} \\
&= \sum_{k = 1 : \infty} r_{ik} \mathbb E_{q(\theta)} (\log p(x_i, z_i = k | \theta) - \mathbb E_{q(\theta)} \log p(x_i, z_i = k | \theta) + \log S_i) = \log S_i.
\end{aligned}$$

So it all rests upon $S_i$ being finite.

For $k \le T + 1$, we compute the quantity $s_{ik}$ directly. For
$k > T$, it is not hard to show that

$$s_{ik} = s_{i, T + 1} \exp((k - T - 1) \mathbb E_{p(w)} \log (1 - w)),$$

where $w$ is a random variable with same distribution as $p(v_k)$, i.e.
$\text{Beta}(\alpha, \beta)$.

Hence

$$S_i = \sum_{k = 1 : T} s_{ik} + {s_{i, T + 1} \over 1 - \exp(\psi(\beta) - \psi(\alpha + \beta))}$$

is indeed finite. Similarly we also obtain

$$q(z_i > k) = S^{-1} \left(\sum_{\ell = k + 1 : T} s_\ell + {s_{i, T + 1} \over 1 - \exp(\psi(\beta) - \psi(\alpha + \beta))}\right), k \le T \qquad (14)$$

Now let us compute the posterior of $\theta_{\le T}$. In the following
we exchange the integrals without justifying them (c.f. Fubini's
Theorem). Equation (13) can be rewritten as

$$L(p, q) = \mathbb E_{q(\theta_{\le T})} \left(\log p(\theta_{\le T}) + \sum_i \mathbb E_{q(z_i) p(\theta_{> T})} \log {p(x_i, z_i | \theta) \over q(z_i)} - \log q(\theta_{\le T})\right).$$

Note that unlike the derivation of the mean-field approximation, we keep
the $- \mathbb E_{q(z)} \log q(z)$ term even though we are only
interested in $\theta$ at this stage. This is again due to the problem
of infinities: as in the computation of $S$, we would like to cancel out
some undesirable unbounded terms using $q(z)$. We now have

$$\log q(\theta_{\le T}) = \log p(\theta_{\le T}) + \sum_i \mathbb E_{q(z_i) p(\theta_{> T})} \log {p(x_i, z_i | \theta) \over q(z_i)} + C.$$

By plugging in $q(z = k)$ we obtain

$$\log q(\theta_{\le T}) = \log p(\theta_{\le T}) + \sum_{k = 1 : \infty} q(z_i = k) \left(\mathbb E_{p(\theta_{> T})} \log {p(x_i, z_i = k | \theta) \over q(z_i = k)} - \mathbb E_{q(\theta)} \log {p(x_i, z_i = k | \theta) \over q(z_i = k)}\right) + C.$$

Again, we separate the $v_k$'s and the $\eta_k$'s to obtain

$$q(v_{\le T}) = \log p(v_{\le T}) + \sum_i \sum_k q(z_i = k) \left(\mathbb E_{p(v_{> T})} \log p(z_i = k | v) - \mathbb E_{q(v)} \log p (z_i = k | v)\right).$$

Denote by $D_k$ the difference between the two expetations on the right
hand side. It is easy to show that

$$D_k = \begin{cases}
\log(1 - v_1) ... (1 - v_{k - 1}) v_k - \mathbb E_{q(v)} \log (1 - v_1) ... (1 - v_{k - 1}) v_k & k \le T\\
\log(1 - v_1) ... (1 - v_T) - \mathbb E_{q(v)} \log (1 - v_1) ... (1 - v_T) & k > T
\end{cases}$$

so $D_k$ is bounded. With this we can derive the update for
$\phi^{v, 1}$ and $\phi^{v, 2}$:

$$\begin{aligned}
\phi^{v, 1}_k &= \alpha + \sum_i q(z_i = k) \\
\phi^{v, 2}_k &= \beta + \sum_i q(z_i > k),
\end{aligned}$$

where $q(z_i > k)$ can be computed as in (14).

When it comes to $\eta$, we have

$$\log q(\eta_{\le T}) = \log p(\eta_{\le T}) + \sum_i \sum_{k = 1 : \infty} q(z_i = k) (\mathbb E_{p(\eta_k)} \log p(x_i | \eta_k) - \mathbb E_{q(\eta_k)} \log p(x_i | \eta_k)).$$

Since $q(\eta_k) = p(\eta_k)$ for $k > T$, the inner sum on the right
hand side is a finite sum over $k = 1 : T$. By factorising
$q(\eta_{\le T})$ and $p(\eta_{\le T})$, we have

$$\log q(\eta_k) = \log p(\eta_k) + \sum_i q(z_i = k) \log (x_i | \eta_k) + C,$$

which gives us

$$\begin{aligned}
\phi^{\eta, 1}_k &= \xi + \sum_i q(z_i = k) T(x_i) \\
\phi^{\eta, 2}_k &= \nu + \sum_i q(z_i = k).
\end{aligned}$$

** SVI
   :PROPERTIES:
   :CUSTOM_ID: svi
   :ID:       47efee6c-67ac-44eb-92fb-4d576ae2ec99
   :END:
In variational inference, the computation of some parameters are more
expensive than others.

For example, the computation of M-step is often much more expensive than
that of E-step:

- In the vanilla mixture models with the EM algorithm, the update of
  $\theta$ requires the computation of $r_{ik}$ for all $i = 1 : m$, see
  Eq (2.3).
- In the fully Bayesian mixture model with mean field approximation, the
  updates of $\phi^\pi$ and $\phi^\eta$ require the computation of a sum
  over all samples (see Eq (9.3)(9.7)(9.9)).

Similarly, in pLSA2 (resp. LDA), the updates of $\eta_k$ (resp.
$\phi^{\eta_k}$) requires a sum over $\ell = 1 : n_d$, whereas the
updates of other parameters do not.

In these cases, the parameter that requires more computations are called
global and the other ones local.

Stochastic variational inference (SVI, Hoffman-Blei-Wang-Paisley 2012)
addresses this problem in the same way as stochastic gradient descent
improves efficiency of gradient descent.

Each time SVI picks a sample, updates the corresponding local
parameters, and computes the update of the global parameters as if all
the $m$ samples are identical to the picked sample. Finally it
incorporates this global parameter value into previous computations of
the global parameters, by means of an exponential moving average.

As an example, here's SVI applied to LDA:

1. Set $t = 1$.

2. Pick $\ell$ uniformly from $\{1, 2, ..., n_d\}$.

3. Repeat until convergence:

   1. Compute $(r_{\ell i k})_{i = 1 : m, k = 1 : n_z}$ using (10).
   2. Compute $(\phi^{\pi_\ell}_k)_{k = 1 : n_z}$ using (11).

4. Compute $(\tilde \phi^{\eta_k}_w)_{k = 1 : n_z, w = 1 : n_x}$ using
   the following formula (compare with (12))
   $$\tilde \phi^{\eta_k}_w = \beta + n_d \sum_{i} r_{\ell i k} 1_{x_{\ell i} = w}$$

5. Update the exponential moving average
   $(\phi^{\eta_k}_w)_{k = 1 : n_z, w = 1 : n_x}$:
   $$\phi^{\eta_k}_w = (1 - \rho_t) \phi^{\eta_k}_w + \rho_t \tilde \phi^{\eta_k}_w$$

6. Increment $t$ and go back to Step 2.

In the original paper, $\rho_t$ needs to satisfy some conditions that
guarantees convergence of the global parameters:

$$\begin{aligned}
\sum_t \rho_t = \infty \\
\sum_t \rho_t^2 < \infty
\end{aligned}$$

and the choice made there is

$$\rho_t = (t + \tau)^{-\kappa}$$

for some $\kappa \in (.5, 1]$ and $\tau \ge 0$.

** AEVB
   :PROPERTIES:
   :CUSTOM_ID: aevb
   :ID:       a196df8f-1574-4390-83a4-dd22d8fcecaf
   :END:
SVI adds to variational inference stochastic updates similar to
stochastic gradient descent. Why not just use neural networks with
stochastic gradient descent while we are at it? Autoencoding variational
Bayes (AEVB) (Kingma-Welling 2013) is such an algorithm.

Let's look back to the original problem of maximising the ELBO:

$$\max_{\theta, q} \sum_{i = 1 : m} L(p(x_i | z_i; \theta) p(z_i; \theta), q(z_i))$$

Since for any given $\theta$, the optimal $q(z_i)$ is the posterior
$p(z_i | x_i; \theta)$, the problem reduces to

$$\max_{\theta} \sum_i L(p(x_i | z_i; \theta) p(z_i; \theta), p(z_i | x_i; \theta))$$

Let us assume $p(z_i; \theta) = p(z_i)$ is independent of $\theta$ to
simplify the problem. In the old mixture models, we have
$p(x_i | z_i; \theta) = p(x_i; \eta_{z_i})$, which we can generalise to
$p(x_i; f(\theta, z_i))$ for some function $f$. Using Beyes' theorem we
can also write down $p(z_i | x_i; \theta) = q(z_i; g(\theta, x_i))$ for
some function $g$. So the problem becomes

$$\max_{\theta} \sum_i L(p(x_i; f(\theta, z_i)) p(z_i), q(z_i; g(\theta, x_i)))$$

In some cases $g$ can be hard to write down or compute. AEVB addresses
this problem by replacing $g(\theta, x_i)$ with a neural network
$g_\phi(x_i)$ with input $x_i$ and some separate parameters $\phi$. It
also replaces $f(\theta, z_i)$ with a neural network $f_\theta(z_i)$
with input $z_i$ and parameters $\theta$. And now the problem becomes

$$\max_{\theta, \phi} \sum_i L(p(x_i; f_\theta(z_i)) p(z_i), q(z_i; g_\phi(x_i))).$$

The objective function can be written as

$$\sum_i \mathbb E_{q(z_i; g_\phi(x_i))} \log p(x_i; f_\theta(z_i)) - D(q(z_i; g_\phi(x_i)) || p(z_i)).$$

The first term is called the negative reconstruction error, like the
$- \|decoder(encoder(x)) - x\|$ in autoencoders, which is where the
"autoencoder" in the name comes from.

The second term is a regularisation term that penalises the posterior
$q(z_i)$ that is very different from the prior $p(z_i)$. We assume this
term can be computed analytically.

So only the first term requires computing.

We can approximate the sum over $i$ in a similar fashion as SVI: pick
$j$ uniformly randomly from $\{1 ... m\}$ and treat the whole dataset as
$m$ replicates of $x_j$, and approximate the expectation using
Monte-Carlo:

$$U(x_i, \theta, \phi) := \sum_i \mathbb E_{q(z_i; g_\phi(x_i))} \log p(x_i; f_\theta(z_i)) \approx m \mathbb E_{q(z_j; g_\phi(x_j))} \log p(x_j; f_\theta(z_j)) \approx {m \over L} \sum_{\ell = 1}^L \log p(x_j; f_\theta(z_{j, \ell})),$$

where each $z_{j, \ell}$ is sampled from $q(z_j; g_\phi(x_j))$.

But then it is not easy to approximate the gradient over $\phi$. One can
use the log trick as in policy gradients, but it has the problem of high
variance. In policy gradients this is overcome by using baseline
subtractions. In the AEVB paper it is tackled with the
reparameterisation trick.

Assume there exists a transformation $T_\phi$ and a random variable
$\epsilon$ with distribution independent of $\phi$ or $\theta$, such
that $T_\phi(x_i, \epsilon)$ has distribution $q(z_i; g_\phi(x_i))$. In
this case we can rewrite $U(x, \phi, \theta)$ as

$$\sum_i \mathbb E_{\epsilon \sim p(\epsilon)} \log p(x_i; f_\theta(T_\phi(x_i, \epsilon))),$$

This way one can use Monte-Carlo to approximate
$\nabla_\phi U(x, \phi, \theta)$:

$$\nabla_\phi U(x, \phi, \theta) \approx {m \over L} \sum_{\ell = 1 : L} \nabla_\phi \log p(x_j; f_\theta(T_\phi(x_j, \epsilon_\ell))),$$

where each $\epsilon_{\ell}$ is sampled from $p(\epsilon)$. The
approximation of $U(x, \phi, \theta)$ itself can be done similarly.

*** VAE
    :PROPERTIES:
    :CUSTOM_ID: vae
    :ID:       59e07ae5-a4d3-4b95-949f-0b4348f2b70b
    :END:
As an example of AEVB, the paper introduces variational autoencoder
(VAE), with the following instantiations:

- The prior $p(z_i) = N(0, I)$ is standard normal, thus independent of
  $\theta$.
- The distribution $p(x_i; \eta)$ is either Gaussian or categorical.
- The distribution $q(z_i; \mu, \Sigma)$ is Gaussian with diagonal
  covariance matrix. So
  $g_\phi(z_i) = (\mu_\phi(x_i), \text{diag}(\sigma^2_\phi(x_i)_{1 : d}))$.
  Thus in the reparameterisation trick $\epsilon \sim N(0, I)$ and
  $T_\phi(x_i, \epsilon) = \epsilon \odot \sigma_\phi(x_i) + \mu_\phi(x_i)$,
  where $\odot$ is elementwise multiplication.
- The KL divergence can be easily computed analytically as
  $- D(q(z_i; g_\phi(x_i)) || p(z_i)) = {d \over 2} + \sum_{j = 1 : d} \log\sigma_\phi(x_i)_j - {1 \over 2} \sum_{j = 1 : d} (\mu_\phi(x_i)_j^2 + \sigma_\phi(x_i)_j^2)$.

With this, one can use backprop to maximise the ELBO.

*** Fully Bayesian AEVB
    :PROPERTIES:
    :CUSTOM_ID: fully-bayesian-aevb
    :ID:       0fb4f75b-4b62-440f-adc7-996b2d7f718a
    :END:
Let us turn to fully Bayesian version of AEVB. Again, we first recall
the ELBO of the fully Bayesian mixture models:

$$L(p(x, z, \pi, \eta; \alpha, \beta), q(z, \pi, \eta; r, \phi)) = L(p(x | z, \eta) p(z | \pi) p(\pi; \alpha) p(\eta; \beta), q(z; r) q(\eta; \phi^\eta) q(\pi; \phi^\pi)).$$

We write $\theta = (\pi, \eta)$, rewrite $\alpha := (\alpha, \beta)$,
$\phi := r$, and $\gamma := (\phi^\eta, \phi^\pi)$. Furthermore, as in
the half-Bayesian version we assume $p(z | \theta) = p(z)$, i.e. $z$
does not depend on $\theta$. Similarly we also assume
$p(\theta; \alpha) = p(\theta)$. Now we have

$$L(p(x, z, \theta; \alpha), q(z, \theta; \phi, \gamma)) = L(p(x | z, \theta) p(z) p(\theta), q(z; \phi) q(\theta; \gamma)).$$

And the objective is to maximise it over $\phi$ and $\gamma$. We no
longer maximise over $\theta$, because it is now a random variable, like
$z$. Now let us transform it to a neural network model, as in the
half-Bayesian case:

$$L\left(\left(\prod_{i = 1 : m} p(x_i; f_\theta(z_i))\right) \left(\prod_{i = 1 : m} p(z_i) \right) p(\theta), \left(\prod_{i = 1 : m} q(z_i; g_\phi(x_i))\right) q(\theta; h_\gamma(x))\right).$$

where $f_\theta$, $g_\phi$ and $h_\gamma$ are neural networks. Again, by
separating out KL-divergence terms, the above formula becomes

$$\sum_i \mathbb E_{q(\theta; h_\gamma(x))q(z_i; g_\phi(x_i))} \log p(x_i; f_\theta(z_i)) - \sum_i D(q(z_i; g_\phi(x_i)) || p(z_i)) - D(q(\theta; h_\gamma(x)) || p(\theta)).$$

Again, we assume the latter two terms can be computed analytically.
Using reparameterisation trick, we write

$$\begin{aligned}
\theta &= R_\gamma(\zeta, x) \\
z_i &= T_\phi(\epsilon, x_i)
\end{aligned}$$

for some transformations $R_\gamma$ and $T_\phi$ and random variables
$\zeta$ and $\epsilon$ so that the output has the desired distributions.

Then the first term can be written as

$$\mathbb E_{\zeta, \epsilon} \log p(x_i; f_{R_\gamma(\zeta, x)} (T_\phi(\epsilon, x_i))),$$

so that the gradients can be computed accordingly.

Again, one may use Monte-Carlo to approximate this expetation.

** References
   :PROPERTIES:
   :CUSTOM_ID: references
   :ID:       df1567c9-b0e1-499f-a9d1-c0c915b2b98d
   :END:

- Attias, Hagai. "A variational baysian framework for graphical models."
  In Advances in neural information processing systems, pp.
  209-215. 2000.
- Bishop, Christopher M. Neural networks for pattern recognition.
  Springer. 2006.
- Blei, David M., and Michael I. Jordan. "Variational Inference for
  Dirichlet Process Mixtures." Bayesian Analysis 1, no. 1 (March 2006):
  121--43. [[https://doi.org/10.1214/06-BA104]].
- Blei, David M., Andrew Y. Ng, and Michael I. Jordan. "Latent Dirichlet
  Allocation." Journal of Machine Learning Research 3, no. Jan (2003):
  993--1022.
- Hofmann, Thomas. "Latent Semantic Models for Collaborative Filtering."
  ACM Transactions on Information Systems 22, no. 1 (January 1, 2004):
  89--115. [[https://doi.org/10.1145/963770.963774]].
- Hofmann, Thomas. "Learning the similarity of documents: An
  information-geometric approach to document retrieval and
  categorization." In Advances in neural information processing systems,
  pp. 914-920. 2000.
- Hoffman, Matt, David M. Blei, Chong Wang, and John Paisley.
  "Stochastic Variational Inference." ArXiv:1206.7051 [Cs, Stat], June
  29, 2012. [[http://arxiv.org/abs/1206.7051]].
- Kingma, Diederik P., and Max Welling. "Auto-Encoding Variational
  Bayes." ArXiv:1312.6114 [Cs, Stat], December 20, 2013.
  [[http://arxiv.org/abs/1312.6114]].
- Kurihara, Kenichi, Max Welling, and Nikos Vlassis. "Accelerated
  variational Dirichlet process mixtures." In Advances in neural
  information processing systems, pp. 761-768. 2007.
- Sudderth, Erik Blaine. "Graphical models for visual object recognition
  and tracking." PhD diss., Massachusetts Institute of Technology, 2006.