Diffstat (limited to 'posts/2019-02-14-raise-your-elbo.org')
 posts/2019-02-14-raise-your-elbo.org | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+), 0 deletions(-)
diff --git a/posts/2019-02-14-raise-your-elbo.org b/posts/2019-02-14-raise-your-elbo.org
index 9e15552..f0de7d1 100644
--- a/posts/2019-02-14-raise-your-elbo.org
+++ b/posts/2019-02-14-raise-your-elbo.org
@@ -47,6 +47,7 @@ under CC BY-SA and GNU FDL./
** KL divergence and ELBO
:PROPERTIES:
:CUSTOM_ID: kl-divergence-and-elbo
+ :ID: 2bb0d405-f6b4-483f-9f2d-c0e945faa3ac
:END:
Let $p$ and $q$ be two probability measures. The Kullback-Leibler (KL)
divergence is defined as
@@ -120,6 +121,7 @@ Bayesian version.
** EM
:PROPERTIES:
:CUSTOM_ID: em
+ :ID: 6d694b38-56c2-4e10-8a1f-1f82e309073f
:END:
To illustrate the EM algorithm, we first define the mixture model.
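As a quick reminder (stated generically here; the post's own notation may differ), EM alternates between an E-step, which sets the variational distribution to the exact posterior over the latent variables under the current parameters, and an M-step, which maximises the expected complete-data log likelihood:

$$\begin{aligned}
\text{E-step:}\quad & q(z) := p(z | x; \theta_t), \\
\text{M-step:}\quad & \theta_{t + 1} := \operatorname{argmax}_\theta E_{q(z)} \log p(x, z; \theta).
\end{aligned}$$

Both steps can be read as coordinate ascent on the ELBO.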
@@ -198,6 +200,7 @@ model is:
*** GMM
:PROPERTIES:
:CUSTOM_ID: gmm
+ :ID: 5d5265f6-c2b9-42f1-a4a1-0d87417f0b02
:END:
The Gaussian mixture model (GMM) is an example of a mixture model.
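To make the E- and M-steps concrete in this case, here is a minimal numpy sketch of one EM iteration for a GMM with full covariances (the helper name em_step and the variable names are illustrative, not taken from the post):

#+begin_src python
import numpy as np
from scipy.stats import multivariate_normal

def em_step(x, pi, mu, sigma):
    """One EM iteration for a Gaussian mixture model.

    x:     (m, d) data points
    pi:    (k,) mixing weights
    mu:    (k, d) component means
    sigma: (k, d, d) component covariances
    """
    m, _ = x.shape
    k = len(pi)

    # E-step: responsibilities r[i, j] = q(z_i = j), proportional to
    # pi_j * N(x_i; mu_j, sigma_j), normalised over j.
    r = np.column_stack([
        pi[j] * multivariate_normal.pdf(x, mean=mu[j], cov=sigma[j])
        for j in range(k)
    ])
    r /= r.sum(axis=1, keepdims=True)

    # M-step: re-estimate weights, means and covariances from the
    # responsibility-weighted data.
    nk = r.sum(axis=0)                          # effective counts per component
    pi_new = nk / m
    mu_new = (r.T @ x) / nk[:, None]
    sigma_new = np.stack([
        (r[:, j, None] * (x - mu_new[j])).T @ (x - mu_new[j]) / nk[j]
        for j in range(k)
    ])
    return pi_new, mu_new, sigma_new
#+end_src

A full EM run simply iterates em_step until the ELBO stops improving.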
@@ -240,6 +243,7 @@ $\epsilon I$ is called elliptical k-means algorithm.
*** SMM
:PROPERTIES:
:CUSTOM_ID: smm
+ :ID: f4b3a462-8ae7-44f2-813c-58b007eaa047
:END:
As a transition to the next models to study, let us consider a simpler
mixture model obtained by making one modification to GMM: change
@@ -275,6 +279,7 @@ Dirichlet allocation (LDA), not to be confused with the other LDA
*** pLSA
:PROPERTIES:
:CUSTOM_ID: plsa
+ :ID: d4f58158-dcb6-4ba1-a9e2-bf53bff6012e
:END:
The pLSA model (Hofmann 2000) is a mixture model, where the dataset is
now pairs $(d_i, x_i)_{i = 1 : m}$. In natural language processing, $x$
@@ -294,6 +299,7 @@ corresponds to type 2.
**** pLSA1
:PROPERTIES:
:CUSTOM_ID: plsa1
+ :ID: 969f470e-5bbe-464e-a3b7-f996c8f04de3
:END:
The pLSA1 model (Hofmann 2000) is basically SMM with $x_i$ substituted
with $(d_i, x_i)$, which conditioned on $z_i$ are independently
@@ -340,6 +346,7 @@ dimensional embeddings $D_{u, \cdot}$ and $X_{w, \cdot}$.
**** pLSA2
:PROPERTIES:
:CUSTOM_ID: plsa2
+ :ID: eef3249a-c45d-4a07-876f-68b2a2e957e5
:END:
Let us turn to pLSA2 (Hofmann 2004), corresponding to (2.92). We rewrite
it as
@@ -392,6 +399,7 @@ $$\begin{aligned}
*** HMM
:PROPERTIES:
:CUSTOM_ID: hmm
+ :ID: 16d00eda-7136-49f5-8427-c775c7a91317
:END:
The hidden Markov model (HMM) is a sequential version of SMM, in the
same sense that recurrent neural networks are sequential versions of
@@ -518,6 +526,7 @@ as ${(7) \over (8)}$ and ${(9) \over (8)}$ respectively.
** Fully Bayesian EM / MFA
:PROPERTIES:
:CUSTOM_ID: fully-bayesian-em-mfa
+ :ID: 77f1d7ae-3785-45d4-b88f-18478e41f3b9
:END:
Let us now venture into the realm of fully Bayesian inference.
@@ -567,6 +576,7 @@ e.g. Section 10.1 of Bishop 2006.
*** Application to mixture models
:PROPERTIES:
:CUSTOM_ID: application-to-mixture-models
+ :ID: 52bf6025-1180-44dc-8272-e6af6e228bf3
:END:
*Definition (Fully Bayesian mixture model)*. The relations between
$\pi$, $\eta$, $x$, $z$ are the same as in the definition of mixture
@@ -658,6 +668,7 @@ until convergence.
*** Fully Bayesian GMM
:PROPERTIES:
:CUSTOM_ID: fully-bayesian-gmm
+ :ID: 814289c0-2527-42a0-914b-d64ad62ecd05
:END:
A typical example of fully Bayesian mixture models is the fully Bayesian
Gaussian mixture model (Attias 2000, also called variational GMM in the
@@ -684,6 +695,7 @@ Chapter 10.2 of Bishop 2006 or Attias 2000.
*** LDA
:PROPERTIES:
:CUSTOM_ID: lda
+ :ID: 7d752891-ef33-4b58-9dc3-d6a61325bfa6
:END:
As the second example of fully Bayesian mixture models, Latent Dirichlet
allocation (LDA) (Blei-Ng-Jordan 2003) is the fully Bayesian version of
@@ -747,6 +759,7 @@ So the algorithm iterates over (10) and (11)(12) until convergence.
*** DPMM
:PROPERTIES:
:CUSTOM_ID: dpmm
+ :ID: 187cb168-b3f8-428e-962a-80ad5966f844
:END:
The Dirichlet process mixture model (DPMM) is like the fully Bayesian
mixture model except $n_z = \infty$, i.e. $z$ can take any positive
@@ -900,6 +913,7 @@ $$\begin{aligned}
** SVI
:PROPERTIES:
:CUSTOM_ID: svi
+ :ID: 47efee6c-67ac-44eb-92fb-4d576ae2ec99
:END:
In variational inference, the computation of some parameters is more
expensive than that of others.
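To sketch the flavour of the resulting algorithm (a hypothetical skeleton, not the post's pseudocode: intermediate_global stands for the model-specific closed-form update computed from a single sampled data point as if it were repeated over the whole dataset), SVI blends such noisy estimates into the global parameter with a decaying step size, typically $\rho_t = (t + \tau)^{-\kappa}$:

#+begin_src python
import numpy as np

def svi(data, lam0, intermediate_global, n_iter, kappa=0.7, tau=1.0, seed=0):
    """Skeleton of stochastic variational inference for a global parameter lam.

    intermediate_global(x_i, m) must return the optimal global parameter
    computed as if the sampled point x_i were repeated m times; it is a
    model-specific placeholder supplied by the caller.
    """
    rng = np.random.default_rng(seed)
    m = len(data)
    lam = lam0
    for t in range(1, n_iter + 1):
        i = rng.integers(m)                        # sample one data point
        lam_hat = intermediate_global(data[i], m)  # noisy full-data estimate
        rho = (t + tau) ** (-kappa)                # decaying step size
        lam = (1 - rho) * lam + rho * lam_hat      # move towards the estimate
    return lam
#+end_src

The conditions $\kappa \in (.5, 1]$ and $\tau \ge 0$ ensure the step sizes satisfy the usual Robbins-Monro conditions.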
@@ -969,6 +983,7 @@ for some $\kappa \in (.5, 1]$ and $\tau \ge 0$.
** AEVB
:PROPERTIES:
:CUSTOM_ID: aevb
+ :ID: a196df8f-1574-4390-83a4-dd22d8fcecaf
:END:
SVI adds to variational inference stochastic updates similar to
stochastic gradient descent. Why not just use neural networks with
@@ -1048,6 +1063,7 @@ approximation of $U(x, \phi, \theta)$ itself can be done similarly.
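To make the reparameterisation concrete, here is a minimal numpy sketch for a Gaussian $q_\phi(z|x)$ (the integrand f is an arbitrary placeholder, not the post's $U$):

#+begin_src python
import numpy as np

def reparam_expectation(f, mu, sigma, n_samples=1000, seed=0):
    """Monte-Carlo estimate of E_{z ~ N(mu, diag(sigma^2))} f(z).

    Writing z = mu + sigma * eps with eps ~ N(0, I) moves the randomness
    out of the variational parameters (mu, sigma), so in an autodiff
    framework the same estimator could be differentiated with respect to
    them; here we only compute the value.
    """
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal((n_samples,) + np.shape(mu))
    z = mu + sigma * eps                  # reparameterised samples
    return np.mean([f(zi) for zi in z])
#+end_src

With a standard normal prior, the KL part of the ELBO even has a closed form, ${1 \over 2} \sum_j (\mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1)$, so only the reconstruction term needs this kind of estimator.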
*** VAE
:PROPERTIES:
:CUSTOM_ID: vae
+ :ID: 59e07ae5-a4d3-4b95-949f-0b4348f2b70b
:END:
As an example of AEVB, the paper introduces the variational autoencoder
(VAE), with the following instantiations:
@@ -1069,6 +1085,7 @@ With this, one can use backprop to maximise the ELBO.
*** Fully Bayesian AEVB
:PROPERTIES:
:CUSTOM_ID: fully-bayesian-aevb
+ :ID: 0fb4f75b-4b62-440f-adc7-996b2d7f718a
:END:
Let us turn to the fully Bayesian version of AEVB. Again, we first recall
the ELBO of the fully Bayesian mixture models:
@@ -1117,6 +1134,7 @@ Again, one may use Monte-Carlo to approximate this expectation.
** References
:PROPERTIES:
:CUSTOM_ID: references
+ :ID: df1567c9-b0e1-499f-a9d1-c0c915b2b98d
:END:
- Attias, Hagai. "A variational Bayesian framework for graphical models."