From e9795c6b134eed858ddb73c036ff5c941d7e9838 Mon Sep 17 00:00:00 2001
From: Yuchen Pei
Date: Fri, 18 Jun 2021 17:47:12 +1000
Subject: Updated.

---
 posts/2019-02-14-raise-your-elbo.org | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

(limited to 'posts/2019-02-14-raise-your-elbo.org')

diff --git a/posts/2019-02-14-raise-your-elbo.org b/posts/2019-02-14-raise-your-elbo.org
index 9e15552..f0de7d1 100644
--- a/posts/2019-02-14-raise-your-elbo.org
+++ b/posts/2019-02-14-raise-your-elbo.org
@@ -47,6 +47,7 @@ under CC BY-SA and GNU FDL./
 ** KL divergence and ELBO
 :PROPERTIES:
 :CUSTOM_ID: kl-divergence-and-elbo
+:ID: 2bb0d405-f6b4-483f-9f2d-c0e945faa3ac
 :END:
 Let $p$ and $q$ be two probability measures. The Kullback-Leibler (KL)
 divergence is defined as
@@ -120,6 +121,7 @@ Bayesian version.
 ** EM
 :PROPERTIES:
 :CUSTOM_ID: em
+:ID: 6d694b38-56c2-4e10-8a1f-1f82e309073f
 :END:
 To illustrate the EM algorithms, we first define the mixture model.
@@ -198,6 +200,7 @@ model is:
 *** GMM
 :PROPERTIES:
 :CUSTOM_ID: gmm
+:ID: 5d5265f6-c2b9-42f1-a4a1-0d87417f0b02
 :END:
 Gaussian mixture model (GMM) is an example of mixture models.
@@ -240,6 +243,7 @@ $\epsilon I$ is called elliptical k-means algorithm.
 *** SMM
 :PROPERTIES:
 :CUSTOM_ID: smm
+:ID: f4b3a462-8ae7-44f2-813c-58b007eaa047
 :END:
 As a transition to the next models to study, let us consider a simpler
 mixture model obtained by making one modification to GMM: change
@@ -275,6 +279,7 @@ Dirichlet allocation (LDA), not to be confused with the other LDA
 *** pLSA
 :PROPERTIES:
 :CUSTOM_ID: plsa
+:ID: d4f58158-dcb6-4ba1-a9e2-bf53bff6012e
 :END:
 The pLSA model (Hoffman 2000) is a mixture model, where the dataset is
 now pairs $(d_i, x_i)_{i = 1 : m}$. In natural language processing, $x$
@@ -294,6 +299,7 @@ corresponds to type 2.
 **** pLSA1
 :PROPERTIES:
 :CUSTOM_ID: plsa1
+:ID: 969f470e-5bbe-464e-a3b7-f996c8f04de3
 :END:
 The pLSA1 model (Hoffman 2000) is basically SMM with $x_i$ substituted
 with $(d_i, x_i)$, which conditioned on $z_i$ are independently
@@ -340,6 +346,7 @@ dimensional embeddings $D_{u, \cdot}$ and $X_{w, \cdot}$.
 **** pLSA2
 :PROPERTIES:
 :CUSTOM_ID: plsa2
+:ID: eef3249a-c45d-4a07-876f-68b2a2e957e5
 :END:
 Let us turn to pLSA2 (Hoffman 2004), corresponding to (2.92). We
 rewrite it as
@@ -392,6 +399,7 @@ $$\begin{aligned}
 *** HMM
 :PROPERTIES:
 :CUSTOM_ID: hmm
+:ID: 16d00eda-7136-49f5-8427-c775c7a91317
 :END:
 The hidden markov model (HMM) is a sequential version of SMM, in the
 same sense that recurrent neural networks are sequential versions of
@@ -518,6 +526,7 @@ as ${(7) \over (8)}$ and ${(9) \over (8)}$ respectively.
 ** Fully Bayesian EM / MFA
 :PROPERTIES:
 :CUSTOM_ID: fully-bayesian-em-mfa
+:ID: 77f1d7ae-3785-45d4-b88f-18478e41f3b9
 :END:
 Let us now venture into the realm of full Bayesian.
@@ -567,6 +576,7 @@ e.g. Section 10.1 of Bishop 2006.
 *** Application to mixture models
 :PROPERTIES:
 :CUSTOM_ID: application-to-mixture-models
+:ID: 52bf6025-1180-44dc-8272-e6af6e228bf3
 :END:
 *Definition (Fully Bayesian mixture model)*. The relations between
 $\pi$, $\eta$, $x$, $z$ are the same as in the definition of mixture
@@ -658,6 +668,7 @@ until convergence.
 *** Fully Bayesian GMM
 :PROPERTIES:
 :CUSTOM_ID: fully-bayesian-gmm
+:ID: 814289c0-2527-42a0-914b-d64ad62ecd05
 :END:
 A typical example of fully Bayesian mixture models is the fully Bayesian
 Gaussian mixture model (Attias 2000, also called variational GMM in the
@@ -684,6 +695,7 @@ Chapter 10.2 of Bishop 2006 or Attias 2000.
 *** LDA
 :PROPERTIES:
 :CUSTOM_ID: lda
+:ID: 7d752891-ef33-4b58-9dc3-d6a61325bfa6
 :END:
 As the second example of fully Bayesian mixture models, Latent Dirichlet
 allocation (LDA) (Blei-Ng-Jordan 2003) is the fully Bayesian version of
@@ -747,6 +759,7 @@ So the algorithm iterates over (10) and (11)(12) until convergence.
 *** DPMM
 :PROPERTIES:
 :CUSTOM_ID: dpmm
+:ID: 187cb168-b3f8-428e-962a-80ad5966f844
 :END:
 The Dirichlet process mixture model (DPMM) is like the fully Bayesian
 mixture model except $n_z = \infty$, i.e. $z$ can take any positive
@@ -900,6 +913,7 @@ $$\begin{aligned}
 ** SVI
 :PROPERTIES:
 :CUSTOM_ID: svi
+:ID: 47efee6c-67ac-44eb-92fb-4d576ae2ec99
 :END:
 In variational inference, the computation of some parameters are more
 expensive than others.
@@ -969,6 +983,7 @@ for some $\kappa \in (.5, 1]$ and $\tau \ge 0$.
 ** AEVB
 :PROPERTIES:
 :CUSTOM_ID: aevb
+:ID: a196df8f-1574-4390-83a4-dd22d8fcecaf
 :END:
 SVI adds to variational inference stochastic updates similar to
 stochastic gradient descent. Why not just use neural networks with
@@ -1048,6 +1063,7 @@ approximation of $U(x, \phi, \theta)$ itself can be done similarly.
 *** VAE
 :PROPERTIES:
 :CUSTOM_ID: vae
+:ID: 59e07ae5-a4d3-4b95-949f-0b4348f2b70b
 :END:
 As an example of AEVB, the paper introduces variational autoencoder
 (VAE), with the following instantiations:
@@ -1069,6 +1085,7 @@ With this, one can use backprop to maximise the ELBO.
 *** Fully Bayesian AEVB
 :PROPERTIES:
 :CUSTOM_ID: fully-bayesian-aevb
+:ID: 0fb4f75b-4b62-440f-adc7-996b2d7f718a
 :END:
 Let us turn to fully Bayesian version of AEVB. Again, we first recall
 the ELBO of the fully Bayesian mixture models:
@@ -1117,6 +1134,7 @@ Again, one may use Monte-Carlo to approximate this expetation.
 ** References
 :PROPERTIES:
 :CUSTOM_ID: references
+:ID: df1567c9-b0e1-499f-a9d1-c0c915b2b98d
 :END:
 - Attias, Hagai. "A variational baysian framework for graphical models."
--
cgit v1.2.3