---
title: Raise your ELBO
date: 2019-02-14
template: post
comments: true
---

In this post I give an introduction to variational inference, which is
about maximising the evidence lower bound (ELBO).

I use a top-down approach, starting with the KL divergence and the ELBO,
to lay the mathematical framework of all the models in this post.

Then I define mixture models and the EM algorithm, with Gaussian mixture
model (GMM), probabilistic latent semantic analysis (pLSA) and the hidden
Markov model (HMM) as examples.

After that I present the fully Bayesian version of EM, also known as
mean field approximation (MFA), and apply it to fully Bayesian mixture
models, with fully Bayesian GMM (also known as variational GMM), latent
Dirichlet allocation (LDA) and Dirichlet process mixture model (DPMM) as
examples.

Then I explain stochastic variational inference, a modification of EM
and MFA to improve efficiency.

Finally I talk about autoencoding variational Bayes (AEVB), a
Monte-Carlo + neural network approach to raising the ELBO, exemplified
by the variational autoencoder (VAE). I also show its fully Bayesian
version.

**Acknowledgement**. The following texts and
resources were illuminating during the writing of this post: the
Stanford CS228 notes
([1](https://ermongroup.github.io/cs228-notes/inference/variational/),[2](https://ermongroup.github.io/cs228-notes/learning/latent/)),
the [Tel Aviv Algorithms in Molecular Biology
slides](https://www.cs.tau.ac.il/~rshamir/algmb/presentations/EM-BW-Ron-16%20.pdf)
(clear explanations of the connection between EM and Baum-Welch),
Chapter 10 of [Bishop\'s
book](https://www.springer.com/us/book/9780387310732) (brilliant
introduction to variational GMM), Section 2.5 of [Sudderth\'s
thesis](http://cs.brown.edu/~sudderth/papers/sudderthPhD.pdf) and
[metacademy](https://metacademy.org). Also thanks to Josef Lindman
Hörnlund for discussions. The research was done while working at KTH
mathematics department.

*If you are reading on a mobile device, you may need to \"request
desktop site\" for the equations to be properly displayed. This post is
licensed under CC BY-SA and GNU FDL.*

KL divergence and ELBO 
----------------------

Let $p$ and $q$ be two probability measures. The Kullback-Leibler (KL)
divergence is defined as

$$D(q||p) = E_q \log{q \over p}.$$

It achieves minimum $0$ when $p = q$.

If $p$ can be further written as

$$p(x) = {w(x) \over Z}, \qquad (0)$$

where $Z$ is a normaliser, then

$$\log Z = D(q||p) + L(w, q), \qquad(1)$$

where $L(w, q)$ is called the evidence lower bound (ELBO), defined by

$$L(w, q) = E_q \log{w \over q}. \qquad (1.25)$$

From (1), we see that to minimise the nonnegative term $D(q || p)$, one
can maximise the ELBO.

To this end, we can simply discard $D(q || p)$ in (1) and obtain:

$$\log Z \ge L(w, q) \qquad (1.3)$$

and keep in mind that the inequality becomes an equality when
$q = {w \over Z}$.
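
As a sanity check of (1), here is a small numerical verification of the identity $\log Z = D(q||p) + L(w, q)$ for discrete distributions (a NumPy sketch; the particular $w$ and $q$ are arbitrary made-up examples):

```python
import numpy as np

# An unnormalised weight w over 4 states, its normaliser Z, and the target p.
w = np.array([2.0, 1.0, 0.5, 0.5])
Z = w.sum()
p = w / Z

# An arbitrary variational distribution q over the same states.
q = np.array([0.4, 0.3, 0.2, 0.1])

kl = np.sum(q * np.log(q / p))        # D(q || p)
elbo = np.sum(q * np.log(w / q))      # L(w, q)

print(np.log(Z), kl + elbo)           # the two numbers agree
assert np.isclose(np.log(Z), kl + elbo)
```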

It is time to define the task of variational inference (VI), also known
as variational Bayes (VB).

**Definition**. Variational inference is concerned with
maximising the ELBO $L(w, q)$.

There are mainly two versions of VI, the half Bayesian and the fully
Bayesian cases. Half Bayesian VI, to which expectation-maximisation
algorithms (EM) apply, instantiates (1.3) with

$$\begin{aligned}
Z &= p(x; \theta)\\
w &= p(x, z; \theta)\\
q &= q(z)
\end{aligned}$$

and the dummy variable $x$ in Equation (0) is substituted with $z$.

Fully Bayesian VI, often just called VI, has the following
instantiations:

$$\begin{aligned}
Z &= p(x) \\
w &= p(x, z, \theta) \\
q &= q(z, \theta)
\end{aligned}$$

and $x$ in Equation (0) is substituted with $(z, \theta)$.

In both cases $\theta$ are parameters and $z$ are latent variables.

**Remark on the naming of things**.
The term \"variational\" comes from the fact that we perform calculus of
variations: maximise some functional ($L(w, q)$) over a set of functions
($q$). Note however that most VI / VB algorithms do not involve any
techniques from the calculus of variations, but only use Jensen\'s inequality
/ the fact that $D(q||p)$ reaches its minimum when $p = q$. By this
reasoning about the naming, EM is also a kind of VI, even though in the
literature VI often refers to its fully Bayesian version.

EM 
--

To illustrate the EM algorithms, we first define the mixture model.

**Definition (mixture model)**. Given
dataset $x_{1 : m}$, we assume the data has some underlying latent
variable $z_{1 : m}$ that may take a value from a finite set
$\{1, 2, ..., n_z\}$. Let $p(z_{i}; \pi)$ be categorically distributed
according to the probability vector $\pi$. That is,
$p(z_{i} = k; \pi) = \pi_k$. Also assume
$p(x_{i} | z_{i} = k; \eta) = p(x_{i}; \eta_k)$. Find
$\theta = (\pi, \eta)$ that maximises the likelihood
$p(x_{1 : m}; \theta)$.

Represented as a DAG (a.k.a. the plate notations), the model looks like
this:

![](/assets/resources/mixture-model.png){style="width:250px"}

where the boxes with $m$ mean repetition $m$ times, since there are $m$
independent pairs of $(x, z)$, and the same goes for $\eta$.

The direct maximisation

$$\max_\theta \sum_i \log p(x_{i}; \theta) = \max_\theta \sum_i \log \int p(x_{i} | z_i; \theta) p(z_i; \theta) dz_i$$

is hard because of the integral in the log.

We can fit this problem in (1.3) by having $Z = p(x_{1 : m}; \theta)$
and $w = p(z_{1 : m}, x_{1 : m}; \theta)$. The plan is to update
$\theta$ repeatedly so that $L(p(z, x; \theta_t), q(z))$ is non
decreasing over time $t$.

Equation (1.3) at time $t$ for the $i$th datapoint is

$$\log p(x_{i}; \theta_t) \ge L(p(z_i, x_{i}; \theta_t), q(z_i)) \qquad (2)$$

Each timestep consists of two steps, the E-step and the M-step.

At E-step, we set

$$q(z_{i}) = p(z_{i}|x_{i}; \theta_t), $$

to turn the inequality into equality. We denote $r_{ik} = q(z_i = k)$
and call them responsibilities, so the posterior $q(z_i)$ is a categorical
distribution with parameter $r_i = r_{i, 1 : n_z}$.

At M-step, we maximise $\sum_i L(p(x_{i}, z_{i}; \theta), q(z_{i}))$
over $\theta$:

$$\begin{aligned}
\theta_{t + 1} &= \text{argmax}_\theta \sum_i L(p(x_{i}, z_{i}; \theta), p(z_{i} | x_{i}; \theta_t)) \\
&= \text{argmax}_\theta \sum_i \mathbb E_{p(z_{i} | x_{i}; \theta_t)} \log p(x_{i}, z_{i}; \theta) \qquad (2.3)
\end{aligned}$$

So $\sum_i L(p(x_{i}, z_{i}; \theta), q(z_i))$ is non-decreasing at both the
E-step and the M-step.

We can see from this derivation that EM is half-Bayesian. The E-step is
Bayesian because it computes the posterior of the latent variables, and the
M-step is frequentist because it performs maximum likelihood estimation of
$\theta$.

It is clear that the ELBO sum converges, as it is nondecreasing and has an
upper bound, but it is not clear whether the sum converges to the
correct value, i.e. $\max_\theta p(x_{1 : m}; \theta)$. In fact, EM is
known to sometimes get stuck in local maxima.

A different way of describing EM, which will be useful in the hidden Markov
model, is:

-   At E-step, one writes down the formula
    $$\sum_i \mathbb E_{p(z_i | x_{i}; \theta_t)} \log p(x_{i}, z_i; \theta). \qquad (2.5)$$

-   At M-step, one finds $\theta_{t + 1}$ to be the $\theta$ that
    maximises the above formula.

### GMM 

Gaussian mixture model (GMM) is an example of mixture models.

The space of the data is $\mathbb R^n$. We use the hypothesis that the
data is Gaussian conditioned on the latent variable:

$$(x_i; \eta_k) \sim N(\mu_k, \Sigma_k),$$

so we write $\eta_k = (\mu_k, \Sigma_k)$.

During E-step, the $q(z_i)$ can be directly computed using Bayes'
theorem:

$$r_{ik} = q(z_i = k) = \mathbb P(z_i = k | x_{i}; \theta_t)
= {g_{\mu_{t, k}, \Sigma_{t, k}} (x_{i}) \pi_{t, k} \over \sum_{j = 1 : n_z} g_{\mu_{t, j}, \Sigma_{t, j}} (x_{i}) \pi_{t, j}},$$

where
$g_{\mu, \Sigma} (x) = (2 \pi)^{- n / 2} \det \Sigma^{-1 / 2} \exp(- {1 \over 2} (x - \mu)^T \Sigma^{-1} (x - \mu))$
is the pdf of the Gaussian distribution $N(\mu, \Sigma)$.

During M-step, we need to compute

$$\text{argmax}_{\Sigma, \mu, \pi} \sum_{i = 1 : m} \sum_{k = 1 : n_z} r_{ik} \log (g_{\mu_k, \Sigma_k}(x_{i}) \pi_k).$$

This is similar to the quadratic discriminant analysis, and the solution
is

$$\begin{aligned}
\pi_{k} &= {1 \over m} \sum_{i = 1 : m} r_{ik}, \\
\mu_{k} &= {\sum_i r_{ik} x_{i} \over \sum_i r_{ik}}, \\
\Sigma_{k} &= {\sum_i r_{ik} (x_{i} - \mu_{k}) (x_{i} - \mu_{k})^T \over \sum_i r_{ik}}.
\end{aligned}$$

**Remark**. The k-means algorithm is the $\epsilon \to 0$
limit of the GMM with constraints $\Sigma_k = \epsilon I$. See Section
9.3.2 of Bishop 2006 for derivation. It is also briefly mentioned there
that a variant in this setting where the covariance matrix is not
restricted to $\epsilon I$ is called elliptical k-means algorithm.
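
Below is a minimal NumPy sketch of the above E-step and M-step for the GMM. The initialisation, the number of iterations and the small diagonal term added to the covariances for numerical stability are my own choices, not part of the derivation:

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(x, n_z, n_iter=100, seed=0):
    """EM for the GMM. x: (m, n) data matrix. Returns (pi, mu, sigma, r)."""
    m, n = x.shape
    rng = np.random.default_rng(seed)
    pi = np.full(n_z, 1 / n_z)
    mu = x[rng.choice(m, n_z, replace=False)]          # init means at random datapoints
    sigma = np.stack([np.eye(n)] * n_z)
    for _ in range(n_iter):
        # E-step: responsibilities r_ik proportional to g_{mu_k, sigma_k}(x_i) pi_k
        r = np.stack([pi[k] * multivariate_normal.pdf(x, mu[k], sigma[k])
                      for k in range(n_z)], axis=1)    # (m, n_z)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: closed-form updates
        nk = r.sum(axis=0)                             # (n_z,)
        pi = nk / m
        mu = (r.T @ x) / nk[:, None]
        for k in range(n_z):
            d = x - mu[k]
            sigma[k] = (r[:, k, None] * d).T @ d / nk[k] + 1e-6 * np.eye(n)
    return pi, mu, sigma, r
```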

### SMM 

As a transition to the next models to study, let us consider a simpler
mixture model obtained by making one modification to GMM: change
$(x; \eta_k) \sim N(\mu_k, \Sigma_k)$ to
$\mathbb P(x = w; \eta_k) = \eta_{kw}$ where $\eta$ is a stochastic matrix
and $w$ is an arbitrary element of the space for $x$.
So now the spaces of both $x$ and $z$ are finite. We call this model the
simple mixture model (SMM).


As in GMM, at E-step $r_{ik}$ can be explicitly computed using
Bayes\' theorem.

It is not hard to write down the solution to the M-step in this case:

$$\begin{aligned}
\pi_{k} &= {1 \over m} \sum_i r_{ik}, \qquad (2.7)\\
\eta_{k, w} &= {\sum_i r_{ik} 1_{x_i = w} \over \sum_i r_{ik}}. \qquad (2.8)
\end{aligned}$$

where $1_{x_i = w}$ is the [indicator function](https://en.wikipedia.org/wiki/Indicator_function),
and evaluates to $1$ if $x_i = w$ and $0$ otherwise.
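
A NumPy sketch of this EM iteration for the SMM, assuming the data are integer-coded so that $x_i \in \{0, ..., n_x - 1\}$ (the initialisation is an arbitrary choice):

```python
import numpy as np

def em_smm(x, n_x, n_z, n_iter=100, seed=0):
    """EM for the simple mixture model. x: (m,) integer array."""
    m = len(x)
    rng = np.random.default_rng(seed)
    pi = np.full(n_z, 1 / n_z)
    eta = rng.dirichlet(np.ones(n_x), size=n_z)        # (n_z, n_x) stochastic matrix
    onehot = np.eye(n_x)[x]                            # (m, n_x), rows are 1_{x_i = w}
    for _ in range(n_iter):
        r = pi * eta[:, x].T                           # E-step: r_ik proportional to pi_k eta_{k, x_i}
        r /= r.sum(axis=1, keepdims=True)
        pi = r.mean(axis=0)                            # (2.7)
        eta = (r.T @ onehot) / r.sum(axis=0)[:, None]  # (2.8)
    return pi, eta, r
```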

Two trivial variants of the SMM are the two versions of probabilistic
latent semantic analysis (pLSA), which we call pLSA1 and pLSA2.

The model pLSA1 is a probabilistic version of latent semantic analysis,
which is basically a simple matrix factorisation model in collaborative
filtering, whereas pLSA2 has a fully Bayesian version called latent
Dirichlet allocation (LDA), not to be confused with the other LDA
(linear discriminant analysis).

### pLSA 

The pLSA model (Hofmann 2000) is a mixture model, where the dataset now
consists of pairs $(d_i, x_i)_{i = 1 : m}$. In natural language processing, $x$
are words and $d$ are documents, and a pair $(d, x)$ represents an
occurrence of word $x$ in document $d$.

For each datapoint $(d_{i}, x_{i})$,

$$\begin{aligned}
p(d_i, x_i; \theta) &= \sum_{z_i} p(z_i; \theta) p(d_i | z_i; \theta) p(x_i | z_i; \theta) \qquad (2.91)\\
&= p(d_i; \theta) \sum_{z_i} p(x_i | z_i; \theta) p (z_i | d_i; \theta) \qquad (2.92).
\end{aligned}$$

Of the two formulations, (2.91) corresponds to pLSA type 1, and (2.92)
corresponds to type 2.

#### pLSA1 

The pLSA1 model (Hofmann 2000) is basically SMM with $x_i$ substituted
with $(d_i, x_i)$, which conditioned on $z_i$ are independently
categorically distributed:

$$p(d_i = u, x_i = w | z_i = k; \theta) = p(d_i ; \xi_k) p(x_i; \eta_k) = \xi_{ku} \eta_{kw}.$$

The model can be illustrated in the plate notations:

![](/assets/resources/plsa1.png){style="width:350px"}

So the solution of the M-step is

$$\begin{aligned}
\pi_{k} &= {1 \over m} \sum_i r_{ik} \\
\xi_{k, u} &= {\sum_i r_{ik} 1_{d_{i} = u} \over \sum_i r_{ik}} \\
\eta_{k, w} &= {\sum_i r_{ik} 1_{x_{i} = w} \over \sum_i r_{ik}}.
\end{aligned}$$

**Remark**. pLSA1 is the probabilistic version of LSA, also
known as matrix factorisation.

Let $n_d$ and $n_x$ be the number of values $d_i$ and $x_i$ can take.

**Problem** (LSA). Let $R$ be a $n_d \times n_x$ matrix, fix
$s \le \min\{n_d, n_x\}$. Find $n_d \times s$ matrix $D$ and
$n_x \times s$ matrix $X$ that minimise

$$J(D, X) = \|R - D X^T\|_F,$$

where $\|\cdot\|_F$ is the Frobenius norm.

**Claim**. Let $R = U \Sigma V^T$ be the SVD of $R$, then the
solution to the above problem is $D = U_s \Sigma_s^{{1 \over 2}}$ and
$X = V_s \Sigma_s^{{1 \over 2}}$, where $U_s$ (resp. $V_s$) is the
matrix of the first $s$ columns of $U$ (resp. $V$) and $\Sigma_s$ is the
$s \times s$ submatrix of $\Sigma$.

One can compare pLSA1 with LSA. Both procedures produce embeddings of
$d$ and $x$: in pLSA we obtain $n_z$ dimensional embeddings
$\xi_{\cdot, u}$ and $\eta_{\cdot, w}$, whereas in LSA we obtain $s$
dimensional embeddings $D_{u, \cdot}$ and $X_{w, \cdot}$.

#### pLSA2 

Let us turn to pLSA2 (Hofmann 2004), corresponding to (2.92). We rewrite
it as

$$p(x_i | d_i; \theta) = \sum_{z_i} p(x_i | z_i; \theta) p(z_i | d_i; \theta).$$

To simplify notations, we collect all the $x_i$s with the corresponding
$d_i$ equal to 1 (suppose there are $m_1$ of them), and write them as
$(x_{1, j})_{j = 1 : m_1}$. In the same fashion we construct
$x_{2, 1 : m_2}, x_{3, 1 : m_3}, ... x_{n_d, 1 : m_{n_d}}$.
Similarly, we relabel the corresponding $d_i$ and $z_i$ accordingly.

With almost no loss of generality, we assume all $m_\ell$s are equal and
write them as $m$.

Now the model becomes

$$p(x_{\ell, i} | d_{\ell, i} = \ell; \theta) = \sum_k p(x_{\ell, i} | z_{\ell, i} = k; \theta) p(z_{\ell, i} = k | d_{\ell, i} = \ell; \theta).$$

Since we have regrouped the $x$'s and $z$'s whose indices record the values
of the $d$'s, we can remove the $d$'s from the equation altogether:

$$p(x_{\ell, i}; \theta) = \sum_k p(x_{\ell, i} | z_{\ell, i} = k; \theta) p(z_{\ell, i} = k; \theta).$$

It is effectively a modification of SMM by making $n_d$ copies of $\pi$.
More specifically the parameters are
$\theta = (\pi_{1 : n_d, 1 : n_z}, \eta_{1 : n_z, 1 : n_x})$, where we
model $(z | d = \ell) \sim \text{Cat}(\pi_{\ell, \cdot})$ and, as in
pLSA1, $(x | z = k) \sim \text{Cat}(\eta_{k, \cdot})$.

Illustrated in the plate notations, pLSA2 is:

![](/assets/resources/plsa2.png){style="width:350px"}

The computation is basically adding an index $\ell$ to the computation
of SMM wherever applicable.

The update at the E-step is

$$r_{\ell i k} = p(z_{\ell i} = k | x_{\ell i}; \theta) \propto \pi_{\ell k} \eta_{k, x_{\ell i}}.$$

And at the M-step

$$\begin{aligned}
\pi_{\ell k} &= {1 \over m} \sum_i r_{\ell  i k} \\
\eta_{k w} &= {\sum_{\ell, i} r_{\ell i k} 1_{x_{\ell i} = w} \over \sum_{\ell, i} r_{\ell i k}}.
\end{aligned}$$
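
Concretely, with the words integer-coded and the documents laid out as an $n_d \times m$ array (using the padding-to-equal-length assumption made above), the pLSA2 iteration might look like this NumPy sketch:

```python
import numpy as np

def em_plsa2(x, n_x, n_z, n_iter=100, seed=0):
    """EM for pLSA2. x: (n_d, m) integer array, x[l, i] = i-th word of document l."""
    n_d, m = x.shape
    rng = np.random.default_rng(seed)
    pi = np.full((n_d, n_z), 1 / n_z)                        # pi_{l k}
    eta = rng.dirichlet(np.ones(n_x), size=n_z)              # eta_{k w}
    for _ in range(n_iter):
        # E-step: r_{l i k} proportional to pi_{l k} eta_{k, x_{l i}}
        r = pi[:, None, :] * eta[:, x].transpose(1, 2, 0)    # (n_d, m, n_z)
        r /= r.sum(axis=2, keepdims=True)
        # M-step
        pi = r.mean(axis=1)
        counts = np.zeros((n_x, n_z))                        # sum_{l, i} r_{l i k} 1_{x_{l i} = w}
        np.add.at(counts, x.ravel(), r.reshape(-1, n_z))
        eta = counts.T / counts.sum(axis=0)[:, None]
    return pi, eta, r
```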

### HMM 

The hidden Markov model (HMM) is a sequential version of SMM, in the
same sense that recurrent neural networks are sequential versions of
feed-forward neural networks.

HMM is an example where the posterior $p(z_i | x_i; \theta)$ is not easy
to compute, and one has to utilise properties of the underlying Bayesian
network to get around it.

Now each sample is a sequence $x_i = (x_{ij})_{j = 1 : T}$, and so are
the latent variables $z_i = (z_{ij})_{j = 1 : T}$.

The latent variables are assumed to form a Markov chain with transition
matrix $(\xi_{k \ell})_{k \ell}$, and $x_{ij}$ depends only on $z_{ij}$:

$$\begin{aligned}
p(z_{ij} | z_{i, j - 1}) &= \xi_{z_{i, j - 1}, z_{ij}},\\
p(x_{ij} | z_{ij}) &= \eta_{z_{ij}, x_{ij}}.
\end{aligned}$$

Also, the distribution of $z_{i1}$ is again categorical with parameter
$\pi$:

$$p(z_{i1}) = \pi_{z_{i1}}$$

So the parameters are $\theta = (\pi, \xi, \eta)$. And HMM can be shown
in plate notations as:

![](/assets/resources/hmm.png){style="width:350px"}

Now we apply EM to HMM, which is called the [Baum-Welch
algorithm](https://en.wikipedia.org/wiki/Baum%E2%80%93Welch_algorithm).
Unlike the previous examples, it is too messy to compute
$p(z_i | x_{i}; \theta)$, so during the E-step we instead write down
formula (2.5) directly in the hope of simplifying it:

$$\begin{aligned}
\mathbb E_{p(z_i | x_i; \theta_t)} \log p(x_i, z_i; \theta) &=\mathbb E_{p(z_i | x_i; \theta_t)} \left(\log \pi_{z_{i1}} + \sum_{j = 2 : T} \log \xi_{z_{i, j - 1}, z_{ij}} + \sum_{j = 1 : T} \log \eta_{z_{ij}, x_{ij}}\right). \qquad (3)
\end{aligned}$$

Let us compute the summand in the second term:

$$\begin{aligned}
\mathbb E_{p(z_i | x_{i}; \theta_t)} \log \xi_{z_{i, j - 1}, z_{ij}} &= \sum_{k, \ell} (\log \xi_{k, \ell}) \mathbb E_{p(z_{i} | x_{i}; \theta_t)} 1_{z_{i, j - 1} = k, z_{i, j} = \ell} \\
&= \sum_{k, \ell} p(z_{i, j - 1} = k, z_{ij} = \ell | x_{i}; \theta_t) \log \xi_{k, \ell}. \qquad (4)
\end{aligned}$$

Similarly, one can write down the first term and the summand in the
third term to obtain

$$\begin{aligned}
\mathbb E_{p(z_i | x_{i}; \theta_t)} \log \pi_{z_{i1}} &= \sum_k p(z_{i1} = k | x_{i}; \theta_t) \log \pi_k, \qquad (5) \\
\mathbb E_{p(z_i | x_{i}; \theta_t)} \log \eta_{z_{i, j}, x_{i, j}} &= \sum_{k, w} 1_{x_{ij} = w} p(z_{i, j} = k | x_i; \theta_t) \log \eta_{k, w}. \qquad (6)
\end{aligned}$$

Plugging (4)(5)(6) back into (3) and summing over $j$, we obtain the
formula to maximise over $\theta$:

$$\sum_k \sum_i r_{i1k} \log \pi_k + \sum_{k, \ell} \sum_{j = 2 : T, i} s_{ijk\ell} \log \xi_{k, \ell} + \sum_{k, w} \sum_{j = 1 : T, i} r_{ijk} 1_{x_{ij} = w} \log \eta_{k, w},$$

where

$$\begin{aligned}
r_{ijk} &:= p(z_{ij} = k | x_{i}; \theta_t), \\
s_{ijk\ell} &:= p(z_{i, j - 1} = k, z_{ij} = \ell | x_{i}; \theta_t).
\end{aligned}$$

Now we proceed to the M-step. Since each of the
$\pi_k, \xi_{k, \ell}, \eta_{k, w}$ is nicely confined in the inner sum
of each term, together with the constraint
$\sum_k \pi_k = \sum_\ell \xi_{k, \ell} = \sum_w \eta_{k, w} = 1$ it is
not hard to find the argmax at time $t + 1$ (the same way one finds the
MLE for any categorical distribution):

$$\begin{aligned}
\pi_{k} &= {1 \over m} \sum_i r_{i1k}, \qquad (6.1) \\
\xi_{k, \ell} &= {\sum_{j = 2 : T, i} s_{ijk\ell} \over \sum_{j = 1 : T - 1, i} r_{ijk}}, \qquad(6.2) \\
\eta_{k, w} &= {\sum_{ij} 1_{x_{ij} = w} r_{ijk} \over \sum_{ij} r_{ijk}}. \qquad(6.3)
\end{aligned}$$

Note that (6.1)(6.3) are almost identical to (2.7)(2.8). This makes
sense as the only modification HMM makes over SMM is the added
dependencies between the latent variables.

What remains is to compute $r$ and $s$.

This is done by using the forward and backward procedures, which take
advantage of the conditional independence / topology of the underlying
Bayesian network. It is beyond the scope of this post, but for the sake of
completeness I include it here.

Let

$$\begin{aligned}
\alpha_k(i, j) &:= p(x_{i, 1 : j}, z_{ij} = k; \theta_t), \\
\beta_k(i, j) &:= p(x_{i, j + 1 : T} | z_{ij} = k; \theta_t).
\end{aligned}$$

They can be computed recursively as

$$\begin{aligned}
\alpha_k(i, j) &= \begin{cases}
\eta_{k, x_{i1}} \pi_k, & j = 1; \\
\eta_{k, x_{ij}} \sum_\ell \alpha_\ell(i, j - 1) \xi_{\ell k}, & j \ge 2.
\end{cases}\\
\beta_k(i, j) &= \begin{cases}
1, & j = T;\\
\sum_\ell \xi_{k\ell} \eta_{\ell, x_{i, j + 1}} \beta_\ell(i, j + 1), & j < T.
\end{cases}
\end{aligned}$$

Then

$$\begin{aligned}
p(z_{ij} = k, x_{i}; \theta_t) &= \alpha_k(i, j) \beta_k(i, j), \qquad (7)\\
p(x_{i}; \theta_t) &= \sum_k \alpha_k(i, j) \beta_k(i, j), \quad \forall j = 1 : T \qquad (8)\\
p(z_{i, j - 1} = k, z_{i, j} = \ell, x_{i}; \theta_t) &= \alpha_k(i, j - 1) \xi_{k\ell} \eta_{\ell, x_{ij}} \beta_\ell(i, j). \qquad (9)
\end{aligned}$$

And this yields $r_{ijk}$ and $s_{ijk\ell}$ since they can be computed
as ${(7) \over (8)}$ and ${(9) \over (8)}$ respectively.
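
For completeness, here is a NumPy sketch of the forward-backward computation of $r$ and $s$ for a single sequence, following the recursions above. In practice one would rescale $\alpha$ and $\beta$ (or work in log space) to avoid underflow, which this sketch omits:

```python
import numpy as np

def forward_backward(x, pi, xi, eta):
    """E-step of Baum-Welch for one sequence.
    x: (T,) integer observations; pi: (n_z,); xi: (n_z, n_z) with
    xi[k, l] = p(z_j = l | z_{j-1} = k); eta: (n_z, n_x)."""
    T, n_z = len(x), len(pi)
    alpha = np.zeros((T, n_z))
    beta = np.zeros((T, n_z))
    alpha[0] = pi * eta[:, x[0]]
    for j in range(1, T):
        alpha[j] = eta[:, x[j]] * (alpha[j - 1] @ xi)          # sum_l alpha_l xi_{l k}
    beta[T - 1] = 1.0
    for j in range(T - 2, -1, -1):
        beta[j] = xi @ (eta[:, x[j + 1]] * beta[j + 1])        # sum_l xi_{k l} eta_{l, x_{j+1}} beta_l
    px = alpha[-1].sum()                                       # p(x; theta), eq (8) at j = T
    r = alpha * beta / px                                      # posteriors r, eq (7) / (8)
    # pairwise posteriors s, eq (9) / (8); s[j, k, l] is for the pair (z_{j+1}, z_{j+2}) in 1-indexing
    s = (alpha[:-1, :, None] * xi[None, :, :]
         * (eta[:, x[1:]].T * beta[1:])[:, None, :]) / px
    return r, s, px
```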

Fully Bayesian EM / MFA 
-----------------------

Let us now venture into the fully Bayesian realm.

In EM we aim to maximise the ELBO

$$\int q(z) \log {p(x, z; \theta) \over q(z)} dz$$

alternately over $q$ and $\theta$. As mentioned before, the E-step of
maximising over $q$ is Bayesian, in that it computes the posterior of
$z$, whereas the M-step of maximising over $\theta$ is maximum
likelihood and frequentist.

The fully Bayesian EM makes the M-step Bayesian by making $\theta$ a
random variable, so the ELBO becomes

$$L(p(x, z, \theta), q(z, \theta)) = \int q(z, \theta) \log {p(x, z, \theta) \over q(z, \theta)} dz d\theta$$

We further assume $q$ can be factorised into distributions on $z$ and
$\theta$: $q(z, \theta) = q_1(z) q_2(\theta)$. So the above formula is
rewritten as

$$L(p(x, z, \theta), q(z, \theta)) = \int q_1(z) q_2(\theta) \log {p(x, z, \theta) \over q_1(z) q_2(\theta)} dz d\theta$$

To find argmax over $q_1$, we rewrite

$$\begin{aligned}
L(p(x, z, \theta), q(z, \theta)) &= \int q_1(z) \left(\int q_2(\theta) \log p(x, z, \theta) d\theta\right) dz - \int q_1(z) \log q_1(z) dz - \int q_2(\theta) \log q_2(\theta) d\theta \\&= - D(q_1(z) || p_x(z)) + C,
\end{aligned}$$

where $p_x$ is a density in $z$ with

$$\log p_x(z) = \mathbb E_{q_2(\theta)} \log p(x, z, \theta) + C.$$

So the $q_1$ that maximises the ELBO is $q_1^* = p_x$.

Similarly, the optimal $q_2$ is such that

$$\log q_2^*(\theta) = \mathbb E_{q_1(z)} \log p(x, z, \theta) + C.$$

The fully Bayesian EM thus alternately evaluates $q_1^*$ (E-step) and
$q_2^*$ (M-step).

It is also called mean field approximation (MFA), and can be easily
generalised to models with more than two groups of latent variables, see
e.g. Section 10.1 of Bishop 2006.

### Application to mixture models 

**Definition (Fully
Bayesian mixture model)**. The relations between $\pi$, $\eta$, $x$, $z$
are the same as in the definition of mixture models. Furthermore, we
assume the distribution of $(x | \eta_k)$ belongs to the [exponential
family](https://en.wikipedia.org/wiki/Exponential_family) (the
definition of the exponential family is briefly touched at the end of
this section). But now both $\pi$ and $\eta$ are random variables. Let
the prior distribution $p(\pi)$ be Dirichlet with parameter
$(\alpha, \alpha, ..., \alpha)$. Let the prior $p(\eta_k)$ be the
conjugate prior of $(x | \eta_k)$, with parameter $\beta$; we will see
later in this section that the posterior $q(\eta_k)$ belongs to the
same family as $p(\eta_k)$. Represented in plate notations, a fully
Bayesian mixture model looks like:

![](/assets/resources/fully-bayesian-mm.png){style="width:450px"}

Given this structure we can write down the mean-field approximation:

$$\log q(z) = \mathbb E_{q(\eta)q(\pi)} (\log p(x | z, \eta) + \log p(z | \pi)) + C.$$

Both sides can be factored into per-sample expressions, giving us

$$\log q(z_i) = \mathbb E_{q(\eta)} \log p(x_i | z_i, \eta) + \mathbb E_{q(\pi)} \log p(z_i | \pi) + C$$

Therefore

$$\log r_{ik} = \log q(z_i = k) = \mathbb E_{q(\eta_k)} \log p(x_i | \eta_k) + \mathbb E_{q(\pi)} \log \pi_k + C. \qquad (9.1)$$

So the posterior of each $z_i$ is categorical regardless of the $p$s and
$q$s.

Computing the posterior of $\pi$ and $\eta$:

$$\log q(\pi) + \log q(\eta) = \log p(\pi) + \log p(\eta) + \sum_i \mathbb E_{q(z_i)} \log p(x_i | z_i, \eta) + \sum_i \mathbb E_{q(z_i)} \log p(z_i | \pi) + C.$$

So we can separate the terms involving $\pi$ and those involving $\eta$.
First compute the posterior of $\pi$:

$$\log q(\pi) = \log p(\pi) + \sum_i \mathbb E_{q(z_i)} \log p(z_i | \pi) = \log p(\pi) + \sum_i \sum_k r_{ik} \log \pi_k + C.$$

The right hand side is the log of a Dirichlet distribution modulo the constant $C$,
from which we can update the posterior parameter $\phi^\pi$:

$$\phi^\pi_k = \alpha + \sum_i r_{ik}. \qquad (9.3)$$

Similarly we can obtain the posterior of $\eta$:

$$\log q(\eta) = \log p(\eta) + \sum_i \sum_k r_{ik} \log p(x_i | \eta_k) + C.$$

Again we can factor the terms with respect to $k$ and get:

$$\log q(\eta_k) = \log p(\eta_k) + \sum_i r_{ik} \log p(x_i | \eta_k) + C. \qquad (9.5)$$

Here we can see why conjugate prior works. Mathematically, given a
probability distribution $p(x | \theta)$, the distribution $p(\theta)$
is called conjugate prior of $p(x | \theta)$ if
$\log p(\theta) + \log p(x | \theta)$ has the same form as
$\log p(\theta)$.

For example, the conjugate prior for the exponential family
$p(x | \theta) = h(x) \exp(\theta \cdot T(x) - A(\theta))$ where $T$,
$A$ and $h$ are some functions is
$p(\theta; \chi, \nu) \propto \exp(\chi \cdot \theta - \nu A(\theta))$.

Here what we want is a bit different from conjugate priors because of
the coefficients $r_{ik}$. But the computation carries over to the
conjugate priors of the exponential family (try it yourself and you\'ll
see). That is, if $p(x_i | \eta_k)$ belongs to the exponential family

$$p(x_i | \eta_k) = h(x) \exp(\eta_k \cdot T(x) - A(\eta_k))$$

and if $p(\eta_k)$ is the conjugate prior of $p(x_i | \eta_k)$

$$p(\eta_k) \propto \exp(\chi \cdot \eta_k - \nu A(\eta_k))$$

then $q(\eta_k)$ has the same form as $p(\eta_k)$, and from (9.5) we can
compute the updates of $\phi^{\eta_k}$:

$$\begin{aligned}
\phi^{\eta_k}_1 &= \chi + \sum_i r_{ik} T(x_i), \qquad (9.7) \\
\phi^{\eta_k}_2 &= \nu + \sum_i r_{ik}. \qquad (9.9)
\end{aligned}$$

So the mean field approximation for the fully Bayesian mixture model is
the alternate iteration of (9.1) (E-step) and (9.3)(9.7)(9.9) (M-step)
until convergence.

### Fully Bayesian GMM 

A typical example of fully Bayesian mixture models is the fully Bayesian
Gaussian mixture model (Attias 2000, also called variational GMM in the
literature). It is obtained by applying to the GMM the same modification
that turns the vanilla mixture model into the fully Bayesian mixture
model.

More specifically:

-   $p(z_{i}) = \text{Cat}(\pi)$ as in vanilla GMM
-   $p(\pi) = \text{Dir}(\alpha, \alpha, ..., \alpha)$ has Dirichlet
    distribution, the conjugate prior to the parameters of the
    categorical distribution.
-   $p(x_i | z_i = k) = p(x_i | \eta_k) = N(\mu_{k}, \Sigma_{k})$ as in
    vanilla GMM
-   $p(\mu_k, \Sigma_k) = \text{NIW} (\mu_0, \lambda, \Psi, \nu)$ is the
    normal-inverse-Wishart distribution, the conjugate prior to the mean
    and covariance matrix of the Gaussian distribution.

The E-step and M-step can be computed using (9.1) and (9.3)(9.7)(9.9) in
the previous section. The details of the computation can be found in
Section 10.2 of Bishop 2006 or Attias 2000.

### LDA 

As the second example of fully Bayesian mixture models, Latent Dirichlet
allocation (LDA) (Blei-Ng-Jordan 2003) is the fully Bayesian version of
pLSA2, with the following plate notations:

![](/assets/resources/lda.png){style="width:450px"}

It is the smoothed version in the paper.

More specifically, on the basis of pLSA2, we add prior distributions to
$\eta$ and $\pi$:

$$\begin{aligned}
p(\eta_k) &= \text{Dir} (\beta, ..., \beta), \qquad k = 1 : n_z \\
p(\pi_\ell) &= \text{Dir} (\alpha, ..., \alpha), \qquad \ell = 1 : n_d \\
\end{aligned}$$

And as before, the prior of $z$ is

$$p(z_{\ell, i}) = \text{Cat} (\pi_\ell), \qquad \ell = 1 : n_d, i = 1 : m$$

We also denote the posterior distributions

$$\begin{aligned}
q(\eta_k) &= \text{Dir} (\phi^{\eta_k}), \qquad k = 1 : n_z \\
q(\pi_\ell) &= \text{Dir} (\phi^{\pi_\ell}), \qquad \ell = 1 : n_d \\
q(z_{\ell, i}) &= \text{Cat} (r_{\ell, i}), \qquad \ell = 1 : n_d, i = 1 : m
\end{aligned}$$

As before, in the E-step we update $r$, and in the M-step we update
$\phi^\pi$ and $\phi^\eta$.

But in the LDA paper, one treats the optimisation over $r$, $\phi^\pi$ and
$\phi^\eta$ (which the paper calls $\phi$, $\gamma$ and $\lambda$) as the
E-step, and treats $\alpha$ and $\beta$ as parameters, which are optimised
over at the M-step. This makes it more akin to the classical EM where the
E-step is Bayesian and the M-step MLE. This is more complicated, and we do
not consider it this way here.

Plugging in (9.1) we obtain the updates at E-step

$$r_{\ell i k} \propto \exp(\psi(\phi^{\pi_\ell}_k) + \psi(\phi^{\eta_k}_{x_{\ell i}}) - \psi(\sum_w \phi^{\eta_k}_w)), \qquad (10)$$

where $\psi$ is the digamma function.
Similarly, plugging in (9.3)(9.7)(9.9), at M-step, we update the
posterior of $\pi$ and $\eta$:

$$\begin{aligned}
\phi^{\pi_\ell}_k &= \alpha + \sum_i r_{\ell i k}, \qquad (11)\\
\phi^{\eta_k}_w &= \beta + \sum_{\ell, i} r_{\ell i k} 1_{x_{\ell i} = w}. \qquad (12)
\end{aligned}$$

So the algorithm iterates over (10) and (11)(12) until convergence.
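
Here is a NumPy sketch of this iteration, using the same padded $n_d \times m$ layout as in the pLSA2 example above; it is an illustration of (10)(11)(12), not the LDA paper's own implementation:

```python
import numpy as np
from scipy.special import digamma

def mean_field_lda(x, n_x, n_z, alpha, beta, n_iter=100, seed=0):
    """Mean-field updates for (smoothed) LDA. x: (n_d, m) integer array."""
    n_d, m = x.shape
    rng = np.random.default_rng(seed)
    phi_pi = np.ones((n_d, n_z))                     # posterior Dirichlet params for each pi_l
    phi_eta = rng.gamma(1.0, 1.0, (n_z, n_x))        # posterior Dirichlet params for each eta_k
    for _ in range(n_iter):
        # E-step, eq (10); psi(sum_k phi^pi_{l k}) is constant in k and normalises out
        e_log_pi = digamma(phi_pi)                   # (n_d, n_z)
        e_log_eta = digamma(phi_eta) - digamma(phi_eta.sum(axis=1, keepdims=True))
        log_r = e_log_pi[:, None, :] + e_log_eta[:, x].transpose(1, 2, 0)
        r = np.exp(log_r - log_r.max(axis=2, keepdims=True))
        r /= r.sum(axis=2, keepdims=True)            # (n_d, m, n_z)
        # M-step, eq (11) and (12)
        phi_pi = alpha + r.sum(axis=1)
        counts = np.zeros((n_x, n_z))
        np.add.at(counts, x.ravel(), r.reshape(-1, n_z))
        phi_eta = beta + counts.T
    return r, phi_pi, phi_eta
```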

### DPMM 

The Dirichlet process mixture model (DPMM) is like the fully Bayesian
mixture model except $n_z = \infty$, i.e. $z$ can take any positive
integer value.

The probability of $z_i = k$ is defined using the so-called
stick-breaking process: let $v_k \sim \text{Beta} (\alpha, \beta)$,
$k = 1, 2, ...$, be i.i.d. random variables with Beta distributions, then

$$\mathbb P(z_i = k | v_{1:\infty}) = (1 - v_1) (1 - v_2) ... (1 - v_{k - 1}) v_k.$$

So $v$ plays a similar role to $\pi$ in the previous models.

As before, we have that the distribution of $x$ belongs to the
exponential family:

$$p(x | z = k, \eta) = p(x | \eta_k) = h(x) \exp(\eta_k \cdot T(x) - A(\eta_k))$$

so the prior of $\eta_k$ is

$$p(\eta_k) \propto \exp(\chi \cdot \eta_k - \nu A(\eta_k)).$$

Because of the infinities we can\'t directly apply the formulas in the
general fully Bayesian mixture models. So let us carefully derive the
whole thing again.

As before, we can write down the ELBO:

$$L(p(x, z, \theta), q(z, \theta)) = \mathbb E_{q(\theta)} \log {p(\theta) \over q(\theta)} + \mathbb E_{q(\theta) q(z)} \log {p(x, z | \theta) \over q(z)}.$$

Both terms are infinite series:

$$L(p, q) = \sum_{k = 1 : \infty} \mathbb E_{q(\theta_k)} \log {p(\theta_k) \over q(\theta_k)} + \sum_{i = 1 : m} \sum_{k = 1 : \infty} q(z_i = k) \mathbb E_{q(\theta)} \log {p(x_i, z_i = k | \theta) \over q(z_i = k)}.$$

There are several ways to deal with the infinities. One is to fix some level $T > 0$ and set
$v_T = 1$ almost surely (Blei-Jordan 2006). This effectively turns the
model into a finite one, and both terms become finite sums over
$k = 1 : T$.

Another workaround (Kurihara-Welling-Vlassis 2007) is also a kind of
truncation, but less heavy-handed: setting the posterior
$q(\theta) = q(\eta) q(v)$ to be:

$$q(\theta) = q(\theta_{1 : T}) p(\theta_{T + 1 : \infty}) =: q(\theta_{\le T}) p(\theta_{> T}).$$

That is, tie the posterior after $T$ to the prior. This effectively
turns the first term in the ELBO to a finite sum over $k = 1 : T$, while
keeping the second sum an infinite series:

$$L(p, q) = \sum_{k = 1 : T} \mathbb E_{q(\theta_k)} \log {p(\theta_k) \over q(\theta_k)} + \sum_i \sum_{k = 1 : \infty} q(z_i = k) \mathbb E_{q(\theta)} \log {p(x_i, z_i = k | \theta) \over q(z_i = k)}. \qquad (13)$$

The plate notation of this model is:

![](/assets/resources/dpmm.png){style="width:450px"}

As it turns out, the infinities can be tamed in this case.

As before, the optimal $q(z_i)$ is computed as

$$r_{ik} = q(z_i = k) = s_{ik} / S_i$$

where

$$\begin{aligned}
s_{ik} &= \exp(\mathbb E_{q(\theta)} \log p(x_i, z_i = k | \theta)) \\
S_i &= \sum_{k = 1 : \infty} s_{ik}.
\end{aligned}$$

Plugging this back to (13) we have

$$\begin{aligned}
\sum_{k = 1 : \infty} r_{ik} &\mathbb E_{q(\theta)} \log {p(x_i, z_i = k | \theta) \over r_{ik}} \\
&= \sum_{k = 1 : \infty} r_{ik} \mathbb E_{q(\theta)} (\log p(x_i, z_i = k | \theta) - \mathbb E_{q(\theta)} \log p(x_i, z_i = k | \theta) + \log S_i) = \log S_i.
\end{aligned}$$

So it all rests upon $S_i$ being finite.

For $k \le T + 1$, we compute the quantity $s_{ik}$ directly. For
$k > T$, it is not hard to show that

$$s_{ik} = s_{i, T + 1} \exp((k - T - 1) \mathbb E_{p(w)} \log (1 - w)),$$

where $w$ is a random variable with same distribution as $p(v_k)$, i.e.
$\text{Beta}(\alpha, \beta)$.

Hence

$$S_i = \sum_{k = 1 : T} s_{ik} + {s_{i, T + 1} \over 1 - \exp(\psi(\beta) - \psi(\alpha + \beta))}$$

is indeed finite. Similarly we also obtain

$$q(z_i > k) = S_i^{-1} \left(\sum_{\ell = k + 1 : T} s_{i\ell} + {s_{i, T + 1} \over 1 - \exp(\psi(\beta) - \psi(\alpha + \beta))}\right), \qquad k \le T \qquad (14)$$
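
In code, given the directly computed $s_{i, 1 : T + 1}$ for one datapoint, $S_i$, the responsibilities $r_{ik}$ and the tail probabilities $q(z_i > k)$ of (14) can be obtained as in the following NumPy sketch:

```python
import numpy as np
from scipy.special import digamma

def dpmm_responsibilities(s, alpha, beta):
    """s: (T + 1,) array of s_{i, 1 : T + 1} for one datapoint.
    Returns r_{i, 1 : T}, q(z_i > k) for k = 1 : T, and S_i."""
    T = len(s) - 1
    # geometric tail: sum_{k > T} s_{ik} = s_{i, T+1} / (1 - exp(E log(1 - w)))
    ratio = np.exp(digamma(beta) - digamma(alpha + beta))
    tail = s[T] / (1 - ratio)
    S = s[:T].sum() + tail
    r = s[:T] / S                                    # q(z_i = k), k = 1 : T
    # eq (14): q(z_i > k) = (sum_{l = k+1 : T} s_{il} + tail) / S
    q_gt = (np.cumsum(s[:T][::-1])[::-1] - s[:T] + tail) / S
    return r, q_gt, S
```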

Now let us compute the posterior of $\theta_{\le T}$. In the following
we exchange the integrals without justifying them (c.f. Fubini\'s
Theorem). Equation (13) can be rewritten as

$$L(p, q) = \mathbb E_{q(\theta_{\le T})} \left(\log p(\theta_{\le T}) + \sum_i \mathbb E_{q(z_i) p(\theta_{> T})} \log {p(x_i, z_i | \theta) \over q(z_i)} - \log q(\theta_{\le T})\right).$$

Note that unlike the derivation of the mean-field approximation, we keep
the $- \mathbb E_{q(z)} \log q(z)$ term even though we are only
interested in $\theta$ at this stage. This is again due to the problem
of infinities: as in the computation of $S$, we would like to cancel out
some undesirable unbounded terms using $q(z)$. We now have

$$\log q(\theta_{\le T}) = \log p(\theta_{\le T}) + \sum_i \mathbb E_{q(z_i) p(\theta_{> T})} \log {p(x_i, z_i | \theta) \over q(z_i)} + C.$$

By plugging in $q(z = k)$ we obtain

$$\log q(\theta_{\le T}) = \log p(\theta_{\le T}) + \sum_{k = 1 : \infty} q(z_i = k) \left(\mathbb E_{p(\theta_{> T})} \log {p(x_i, z_i = k | \theta) \over q(z_i = k)} - \mathbb E_{q(\theta)} \log {p(x_i, z_i = k | \theta) \over q(z_i = k)}\right) + C.$$

Again, we separate the $v_k$\'s and the $\eta_k$\'s to obtain

$$\log q(v_{\le T}) = \log p(v_{\le T}) + \sum_i \sum_k q(z_i = k) \left(\mathbb E_{p(v_{> T})} \log p(z_i = k | v) - \mathbb E_{q(v)} \log p (z_i = k | v)\right) + C.$$

Denote by $D_k$ the difference between the two expectations on the right
hand side. It is easy to show that

$$D_k = \begin{cases}
\log(1 - v_1) ... (1 - v_{k - 1}) v_k - \mathbb E_{q(v)} \log (1 - v_1) ... (1 - v_{k - 1}) v_k & k \le T\\
\log(1 - v_1) ... (1 - v_T) - \mathbb E_{q(v)} \log (1 - v_1) ... (1 - v_T) & k > T
\end{cases}$$

so $D_k$ is bounded. With this we can derive the update for
$\phi^{v, 1}$ and $\phi^{v, 2}$:

$$\begin{aligned}
\phi^{v, 1}_k &= \alpha + \sum_i q(z_i = k) \\
\phi^{v, 2}_k &= \beta + \sum_i q(z_i > k),
\end{aligned}$$

where $q(z_i > k)$ can be computed as in (14).

When it comes to $\eta$, we have

$$\log q(\eta_{\le T}) = \log p(\eta_{\le T}) + \sum_i \sum_{k = 1 : \infty} q(z_i = k) (\mathbb E_{p(\eta_k)} \log p(x_i | \eta_k) - \mathbb E_{q(\eta_k)} \log p(x_i | \eta_k)).$$

Since $q(\eta_k) = p(\eta_k)$ for $k > T$, the inner sum on the right
hand side is a finite sum over $k = 1 : T$. By factorising
$q(\eta_{\le T})$ and $p(\eta_{\le T})$, we have

$$\log q(\eta_k) = \log p(\eta_k) + \sum_i q(z_i = k) \log p(x_i | \eta_k) + C,$$

which gives us

$$\begin{aligned}
\phi^{\eta, 1}_k &= \chi + \sum_i q(z_i = k) T(x_i) \\
\phi^{\eta, 2}_k &= \nu + \sum_i q(z_i = k).
\end{aligned}$$

SVI 
---

In variational inference, the computation of some parameters is more
expensive than that of others.

For example, the computation of M-step is often much more expensive than
that of E-step:

-   In the vanilla mixture models with the EM algorithm, the update of
    $\theta$ requires the computation of $r_{ik}$ for all $i = 1 : m$,
    see Eq (2.3).
-   In the fully Bayesian mixture model with mean field approximation,
    the updates of $\phi^\pi$ and $\phi^\eta$ require the computation of
    a sum over all samples (see Eq (9.3)(9.7)(9.9)).

Similarly, in pLSA2 (resp. LDA), the updates of $\eta_k$ (resp.
$\phi^{\eta_k}$) requires a sum over $\ell = 1 : n_d$, whereas the
updates of other parameters do not.

In these cases, the parameters that require more computation are called
global and the others local.

Stochastic variational inference (SVI, Hoffman-Blei-Wang-Paisley 2012)
addresses this problem in the same way as stochastic gradient descent
improves efficiency of gradient descent.

Each time SVI picks a sample, updates the corresponding local
parameters, and computes the update of the global parameters as if all
the $m$ samples are identical to the picked sample. Finally it
incorporates this global parameter value into previous computations of
the global parameters, by means of an exponential moving average.

As an example, here\'s SVI applied to LDA:

1.  Set $t = 1$.
2.  Pick $\ell$ uniformly from $\{1, 2, ..., n_d\}$.
3.  Repeat until convergence:
    1.  Compute $(r_{\ell i k})_{i = 1 : m, k = 1 : n_z}$ using (10).
    2.  Compute $(\phi^{\pi_\ell}_k)_{k = 1 : n_z}$ using (11).
4.  Compute $(\tilde \phi^{\eta_k}_w)_{k = 1 : n_z, w = 1 : n_x}$ using
    the following formula (compare with (12))
    $$\tilde \phi^{\eta_k}_w = \beta + n_d \sum_{i} r_{\ell i k} 1_{x_{\ell i} = w}$$

5.  Update the exponential moving average
    $(\phi^{\eta_k}_w)_{k = 1 : n_z, w = 1 : n_x}$:
    $$\phi^{\eta_k}_w = (1 - \rho_t) \phi^{\eta_k}_w + \rho_t \tilde \phi^{\eta_k}_w$$

6.  Increment $t$ and go back to Step 2.

In the original paper, $\rho_t$ needs to satisfy some conditions that
guarantee convergence of the global parameters:

$$\begin{aligned}
\sum_t \rho_t = \infty \\
\sum_t \rho_t^2 < \infty
\end{aligned}$$

and the choice made there is

$$\rho_t = (t + \tau)^{-\kappa}$$

for some $\kappa \in (.5, 1]$ and $\tau \ge 0$.
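
A NumPy sketch of this loop, reusing the padded layout of the LDA example above; the initialisation, the fixed number of inner iterations and the values of $\tau$ and $\kappa$ are arbitrary choices for illustration:

```python
import numpy as np
from scipy.special import digamma

def svi_lda(x, n_x, n_z, alpha, beta, n_steps=10000, tau=1.0, kappa=0.7, seed=0):
    """Stochastic variational inference for LDA. x: (n_d, m) integer array."""
    n_d, m = x.shape
    rng = np.random.default_rng(seed)
    phi_eta = rng.gamma(1.0, 1.0, (n_z, n_x))            # global parameters
    for t in range(1, n_steps + 1):
        ell = rng.integers(n_d)                          # pick a document
        e_log_eta = digamma(phi_eta) - digamma(phi_eta.sum(axis=1, keepdims=True))
        phi_pi = np.ones(n_z)                            # local parameters for document ell
        for _ in range(50):                              # inner loop: (10) and (11)
            log_r = digamma(phi_pi) + e_log_eta[:, x[ell]].T     # (m, n_z)
            r = np.exp(log_r - log_r.max(axis=1, keepdims=True))
            r /= r.sum(axis=1, keepdims=True)
            phi_pi = alpha + r.sum(axis=0)
        # intermediate global parameter, as if all n_d documents were document ell
        counts = np.zeros((n_x, n_z))
        np.add.at(counts, x[ell], r)
        phi_eta_tilde = beta + n_d * counts.T
        rho = (t + tau) ** (-kappa)                      # step size
        phi_eta = (1 - rho) * phi_eta + rho * phi_eta_tilde
    return phi_eta
```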

AEVB 
----

SVI adds to variational inference stochastic updates similar to
stochastic gradient descent. Why not just use neural networks with
stochastic gradient descent while we are at it? Autoencoding variational
Bayes (AEVB) (Kingma-Welling 2013) is such an algorithm.

Let\'s look back to the original problem of maximising the ELBO:

$$\max_{\theta, q} \sum_{i = 1 : m} L(p(x_i | z_i; \theta) p(z_i; \theta), q(z_i))$$

Since for any given $\theta$, the optimal $q(z_i)$ is the posterior
$p(z_i | x_i; \theta)$, the problem reduces to

$$\max_{\theta} \sum_i L(p(x_i | z_i; \theta) p(z_i; \theta), p(z_i | x_i; \theta))$$

Let us assume $p(z_i; \theta) = p(z_i)$ is independent of $\theta$ to
simplify the problem. In the old mixture models, we have
$p(x_i | z_i; \theta) = p(x_i; \eta_{z_i})$, which we can generalise to
$p(x_i; f(\theta, z_i))$ for some function $f$. Using Bayes\' theorem we
can also write down $p(z_i | x_i; \theta) = q(z_i; g(\theta, x_i))$ for
some function $g$. So the problem becomes

$$\max_{\theta} \sum_i L(p(x_i; f(\theta, z_i)) p(z_i), q(z_i; g(\theta, x_i)))$$

In some cases $g$ can be hard to write down or compute. AEVB addresses
this problem by replacing $g(\theta, x_i)$ with a neural network
$g_\phi(x_i)$ with input $x_i$ and some separate parameters $\phi$. It
also replaces $f(\theta, z_i)$ with a neural network $f_\theta(z_i)$
with input $z_i$ and parameters $\theta$. And now the problem becomes

$$\max_{\theta, \phi} \sum_i L(p(x_i; f_\theta(z_i)) p(z_i), q(z_i; g_\phi(x_i))).$$

The objective function can be written as

$$\sum_i \mathbb E_{q(z_i; g_\phi(x_i))} \log p(x_i; f_\theta(z_i)) - D(q(z_i; g_\phi(x_i)) || p(z_i)).$$

The first term is called the negative reconstruction error, like the
$- \|decoder(encoder(x)) - x\|$ in autoencoders, which is where the
\"autoencoder\" in the name comes from.

The second term is a regularisation term that penalises the posterior
$q(z_i)$ that is very different from the prior $p(z_i)$. We assume this
term can be computed analytically.

So only the first term requires computing.

We can approximate the sum over $i$ in a similar fashion as SVI: pick
$j$ uniformly randomly from $\{1 ... m\}$ and treat the whole dataset as
$m$ replicates of $x_j$, and approximate the expectation using
Monte-Carlo:

$$U(x, \theta, \phi) := \sum_i \mathbb E_{q(z_i; g_\phi(x_i))} \log p(x_i; f_\theta(z_i)) \approx m \mathbb E_{q(z_j; g_\phi(x_j))} \log p(x_j; f_\theta(z_j)) \approx {m \over L} \sum_{\ell = 1}^L \log p(x_j; f_\theta(z_{j, \ell})),$$

where each $z_{j, \ell}$ is sampled from $q(z_j; g_\phi(x_j))$.

But then it is not easy to approximate the gradient over $\phi$. One can
use the log trick as in policy gradients, but it has the problem of high
variance. In policy gradients this is overcome by using baseline
subtractions. In the AEVB paper it is tackled with the
reparameterisation trick.

Assume there exists a transformation $T_\phi$ and a random variable
$\epsilon$ with distribution independent of $\phi$ or $\theta$, such
that $T_\phi(x_i, \epsilon)$ has distribution $q(z_i; g_\phi(x_i))$. In
this case we can rewrite $U(x, \phi, \theta)$ as

$$\sum_i \mathbb E_{\epsilon \sim p(\epsilon)} \log p(x_i; f_\theta(T_\phi(x_i, \epsilon))).$$

This way one can use Monte-Carlo to approximate
$\nabla_\phi U(x, \phi, \theta)$:

$$\nabla_\phi U(x, \phi, \theta) \approx {m \over L} \sum_{\ell = 1 : L} \nabla_\phi \log p(x_j; f_\theta(T_\phi(x_j, \epsilon_\ell))),$$

where each $\epsilon_{\ell}$ is sampled from $p(\epsilon)$. The
approximation of $U(x, \phi, \theta)$ itself can be done similarly.
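
As a toy illustration of the trick, take $q(z; \mu) = N(\mu, 1)$ and the objective $\mathbb E_q z^2$, whose gradient in $\mu$ is $2\mu$; reparameterising $z = \mu + \epsilon$ with $\epsilon \sim N(0, 1)$, the Monte-Carlo gradient estimate is straightforward to check (a self-contained NumPy snippet, not tied to any model above):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, L = 1.5, 100000

# Reparameterise z ~ N(mu, 1) as z = mu + eps with eps ~ N(0, 1),
# so grad_mu E[z^2] = E[d/dmu (mu + eps)^2] = E[2 (mu + eps)].
eps = rng.standard_normal(L)
grad_estimate = np.mean(2 * (mu + eps))

print(grad_estimate, 2 * mu)   # the estimate is close to the exact gradient 2 mu
```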

### VAE 

As an example of AEVB, the paper introduces variational autoencoder
(VAE), with the following instantiations:

-   The prior $p(z_i) = N(0, I)$ is standard normal, thus independent of
    $\theta$.
-   The distribution $p(x_i; \eta)$ is either Gaussian or categorical.
-   The distribution $q(z_i; \mu, \Sigma)$ is Gaussian with diagonal
    covariance matrix. So
    $g_\phi(x_i) = (\mu_\phi(x_i), \text{diag}(\sigma^2_\phi(x_i)_{1 : d}))$.
    Thus in the reparameterisation trick $\epsilon \sim N(0, I)$ and
    $T_\phi(x_i, \epsilon) = \epsilon \odot \sigma_\phi(x_i) + \mu_\phi(x_i)$,
    where $\odot$ is elementwise multiplication.
-   The KL divergence can be easily computed analytically as
    $- D(q(z_i; g_\phi(x_i)) || p(z_i)) = {d \over 2} + \sum_{j = 1 : d} \log\sigma_\phi(x_i)_j - {1 \over 2} \sum_{j = 1 : d} (\mu_\phi(x_i)_j^2 + \sigma_\phi(x_i)_j^2)$.

With this, one can use backprop to maximise the ELBO.
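
For concreteness, here is a minimal PyTorch sketch of such a VAE for binarised data; the layer sizes, the Bernoulli likelihood and the use of $\log \sigma^2$ as the encoder output are assumptions of this illustration, not prescriptions of the paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, n_x=784, n_h=400, d=20):
        super().__init__()
        self.enc = nn.Linear(n_x, n_h)          # encoder g_phi
        self.enc_mu = nn.Linear(n_h, d)
        self.enc_logvar = nn.Linear(n_h, d)     # predicts log sigma^2 for stability
        self.dec = nn.Linear(d, n_h)            # decoder f_theta
        self.dec_out = nn.Linear(n_h, n_x)

    def forward(self, x):
        h = F.relu(self.enc(x))
        mu, logvar = self.enc_mu(h), self.enc_logvar(h)
        eps = torch.randn_like(mu)              # eps ~ N(0, I)
        z = mu + eps * torch.exp(0.5 * logvar)  # reparameterisation: T_phi(x, eps)
        logits = self.dec_out(F.relu(self.dec(z)))
        return logits, mu, logvar

def neg_elbo(logits, x, mu, logvar):
    # negative reconstruction error, with a Bernoulli p(x; f_theta(z))
    rec = F.binary_cross_entropy_with_logits(logits, x, reduction='sum')
    # analytic D(q(z_i; g_phi(x_i)) || p(z_i)) for a diagonal Gaussian vs N(0, I)
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + kl
```

Minimising `neg_elbo` over minibatches with a stochastic gradient optimiser then maximises the ELBO; the reparameterised sampling of $z$ is what lets the gradients flow through $\mu_\phi$ and $\sigma_\phi$.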

### Fully Bayesian AEVB 

Let us turn to fully Bayesian version of AEVB. Again, we first recall
the ELBO of the fully Bayesian mixture models:

$$L(p(x, z, \pi, \eta; \alpha, \beta), q(z, \pi, \eta; r, \phi)) = L(p(x | z, \eta) p(z | \pi) p(\pi; \alpha) p(\eta; \beta), q(z; r) q(\eta; \phi^\eta) q(\pi; \phi^\pi)).$$

We write $\theta = (\pi, \eta)$, rewrite $\alpha := (\alpha, \beta)$,
$\phi := r$, and $\gamma := (\phi^\eta, \phi^\pi)$. Furthermore, as in
the half-Bayesian version we assume $p(z | \theta) = p(z)$, i.e. $z$
does not depend on $\theta$. Similarly we also assume
$p(\theta; \alpha) = p(\theta)$. Now we have

$$L(p(x, z, \theta; \alpha), q(z, \theta; \phi, \gamma)) = L(p(x | z, \theta) p(z) p(\theta), q(z; \phi) q(\theta; \gamma)).$$

And the objective is to maximise it over $\phi$ and $\gamma$. We no
longer maximise over $\theta$, because it is now a random variable, like
$z$. Now let us transform it to a neural network model, as in the
half-Bayesian case:

$$L\left(\left(\prod_{i = 1 : m} p(x_i; f_\theta(z_i))\right) \left(\prod_{i = 1 : m} p(z_i) \right) p(\theta), \left(\prod_{i = 1 : m} q(z_i; g_\phi(x_i))\right) q(\theta; h_\gamma(x))\right).$$

where $f_\theta$, $g_\phi$ and $h_\gamma$ are neural networks. Again, by
separating out KL-divergence terms, the above formula becomes

$$\sum_i \mathbb E_{q(\theta; h_\gamma(x))q(z_i; g_\phi(x_i))} \log p(x_i; f_\theta(z_i)) - \sum_i D(q(z_i; g_\phi(x_i)) || p(z_i)) - D(q(\theta; h_\gamma(x)) || p(\theta)).$$

Again, we assume the latter two terms can be computed analytically.
Using the reparameterisation trick, we write

$$\begin{aligned}
\theta &= R_\gamma(\zeta, x) \\
z_i &= T_\phi(\epsilon, x_i)
\end{aligned}$$

for some transformations $R_\gamma$ and $T_\phi$ and random variables
$\zeta$ and $\epsilon$ so that the output has the desired distributions.

Then the first term can be written as

$$\mathbb E_{\zeta, \epsilon} \log p(x_i; f_{R_\gamma(\zeta, x)} (T_\phi(\epsilon, x_i))),$$

so that the gradients can be computed accordingly.

Again, one may use Monte-Carlo to approximate this expectation.

References 
----------

-   Attias, Hagai. \"A variational Bayesian framework for graphical
    models.\" In Advances in neural information processing systems, pp.
    209-215. 2000.
-   Bishop, Christopher M. Pattern Recognition and Machine Learning.
    Springer. 2006.
-   Blei, David M., and Michael I. Jordan. "Variational Inference for
    Dirichlet Process Mixtures." Bayesian Analysis 1, no. 1 (March
    2006): 121--43. <https://doi.org/10.1214/06-BA104>.
-   Blei, David M., Andrew Y. Ng, and Michael I. Jordan. "Latent
    Dirichlet Allocation." Journal of Machine Learning Research 3, no.
    Jan (2003): 993--1022.
-   Hofmann, Thomas. "Latent Semantic Models for Collaborative
    Filtering." ACM Transactions on Information Systems 22, no. 1
    (January 1, 2004): 89--115. <https://doi.org/10.1145/963770.963774>.
-   Hofmann, Thomas. \"Learning the similarity of documents: An
    information-geometric approach to document retrieval and
    categorization.\" In Advances in neural information processing
    systems, pp. 914-920. 2000.
-   Hoffman, Matt, David M. Blei, Chong Wang, and John Paisley.
    "Stochastic Variational Inference." ArXiv:1206.7051 \[Cs, Stat\],
    June 29, 2012. <http://arxiv.org/abs/1206.7051>.
-   Kingma, Diederik P., and Max Welling. "Auto-Encoding Variational
    Bayes." ArXiv:1312.6114 \[Cs, Stat\], December 20, 2013.
    <http://arxiv.org/abs/1312.6114>.
-   Kurihara, Kenichi, Max Welling, and Nikos Vlassis. \"Accelerated
    variational Dirichlet process mixtures.\" In Advances in neural
    information processing systems, pp. 761-768. 2007.
-   Sudderth, Erik Blaine. \"Graphical models for visual object
    recognition and tracking.\" PhD diss., Massachusetts Institute of
    Technology, 2006.