aboutsummaryrefslogtreecommitdiff
path: root/posts
diff options
context:
space:
mode:
Diffstat (limited to 'posts')
-rw-r--r--posts/2013-06-01-q-robinson-schensted-paper.org28
-rw-r--r--posts/2014-04-01-q-robinson-schensted-symmetry-paper.org16
-rw-r--r--posts/2015-01-20-weighted-interpretation-super-catalan-numbers.org39
-rw-r--r--posts/2015-04-01-unitary-double-products.org10
-rw-r--r--posts/2015-04-02-juggling-skill-tree.org28
-rw-r--r--posts/2015-05-30-infinite-binary-words-containing-repetitions-odd-periods.org67
-rw-r--r--posts/2015-07-01-causal-quantum-product-levy-area.org26
-rw-r--r--posts/2015-07-15-double-macdonald-polynomials-macdonald-superpolynomials.org64
-rw-r--r--posts/2016-10-13-q-robinson-schensted-knuth-polymer.org50
-rw-r--r--posts/2017-04-25-open_research_toywiki.org21
-rw-r--r--posts/2017-08-07-mathematical_bazaar.org213
-rw-r--r--posts/2018-04-10-update-open-research.org185
-rw-r--r--posts/2018-06-03-automatic_differentiation.org100
-rw-r--r--posts/2018-12-02-lime-shapley.org362
-rw-r--r--posts/2019-01-03-discriminant-analysis.org293
-rw-r--r--posts/2019-02-14-raise-your-elbo.org1150
-rw-r--r--posts/2019-03-13-a-tail-of-two-densities.org1304
-rw-r--r--posts/2019-03-14-great-but-manageable-expectations.org836
-rw-r--r--posts/blog.html21
19 files changed, 4813 insertions, 0 deletions
diff --git a/posts/2013-06-01-q-robinson-schensted-paper.org b/posts/2013-06-01-q-robinson-schensted-paper.org
new file mode 100644
index 0000000..27a6b0e
--- /dev/null
+++ b/posts/2013-06-01-q-robinson-schensted-paper.org
@@ -0,0 +1,28 @@
+#+title: A \(q\)-weighted Robinson-Schensted algorithm
+
+#+date: <2013-06-01>
+
+In [[https://projecteuclid.org/euclid.ejp/1465064320][this paper]] with
+[[http://www.bristol.ac.uk/maths/people/neil-m-oconnell/][Neil]] we
+construct a \(q\)-version of the Robinson-Schensted algorithm with
+column insertion. Like the
+[[http://en.wikipedia.org/wiki/Robinson–Schensted_correspondence][usual
+RS correspondence]] with column insertion, this algorithm could take
+words as input. Unlike the usual RS algorithm, the output is a set of
+weighted pairs of semistandard and standard Young tableaux \((P,Q)\)
+with the same shape. The weights are rational functions of indeterminant
+\(q\).
+
+If \(q\in[0,1]\), the algorithm can be considered as a randomised RS
+algorithm, with 0 and 1 being two interesting cases. When \(q\to0\), it
+is reduced to the latter usual RS algorithm; while when \(q\to1\) with
+proper scaling it should scale to directed random polymer model in
+[[http://arxiv.org/abs/0910.0069][(O'Connell 2012)]]. When the input
+word \(w\) is a random walk:
+
+\begin{align*}\mathbb
+P(w=v)=\prod_{i=1}^na_{v_i},\qquad\sum_ja_j=1\end{align*}
+
+the shape of output evolves as a Markov chain with kernel related to
+\(q\)-Whittaker functions, which are Macdonald functions when \(t=0\)
+with a factor.
diff --git a/posts/2014-04-01-q-robinson-schensted-symmetry-paper.org b/posts/2014-04-01-q-robinson-schensted-symmetry-paper.org
new file mode 100644
index 0000000..b1c967d
--- /dev/null
+++ b/posts/2014-04-01-q-robinson-schensted-symmetry-paper.org
@@ -0,0 +1,16 @@
+#+title: Symmetry property of \(q\)-weighted Robinson-Schensted algorithms and branching algorithms
+#+date: <2014-04-01>
+
+In [[http://link.springer.com/article/10.1007/s10801-014-0505-x][this
+paper]] a symmetry property analogous to the well known symmetry
+property of the normal Robinson-Schensted algorithm has been shown for
+the \(q\)-weighted Robinson-Schensted algorithm. The proof uses a
+generalisation of the growth diagram approach introduced by Fomin. This
+approach, which uses "growth graphs", can also be applied to a wider
+class of insertion algorithms which have a branching structure.
+
+#+caption: Growth graph of q-RS for 1423
+[[../assets/resources/1423graph.jpg]]
+
+Above is the growth graph of the \(q\)-weighted Robinson-Schensted
+algorithm for the permutation \({1 2 3 4\choose1 4 2 3}\).
diff --git a/posts/2015-01-20-weighted-interpretation-super-catalan-numbers.org b/posts/2015-01-20-weighted-interpretation-super-catalan-numbers.org
new file mode 100644
index 0000000..9cde382
--- /dev/null
+++ b/posts/2015-01-20-weighted-interpretation-super-catalan-numbers.org
@@ -0,0 +1,39 @@
+#+title: AMS review of 'A weighted interpretation for the super Catalan
+numbers' by Allen and Gheorghiciuc
+
+#+date: <2015-01-20>
+
+The super Catalan numbers are defined as $$ T(m,n) = {(2 m)! (2 n)!
+\over 2 m! n! (m + n)!}. $$
+
+   This paper has two main results. First a combinatorial interpretation
+of the super Catalan numbers is given: $$ T(m,n) = P(m,n) - N(m,n) $$
+where \(P(m,n)\) enumerates the number of 2-Motzkin paths whose \(m\)
+-th step begins at an even level (called \(m\)-positive paths) and
+\(N(m,n)\) those with \(m\)-th step beginning at an odd level
+(\(m\)-negative paths). The proof uses a recursive argument on the
+number of \(m\)-positive and -negative paths, based on a recursion of
+the super Catalan numbers appearing in [I. M. Gessel, J. Symbolic
+Comput. *14* (1992), no. 2-3, 179--194;
+[[http://www.ams.org/mathscinet/search/publdoc.html?r=1&pg1=MR&s1=1187230&loc=fromrevtext][MR1187230]]]:
+$$ 4T(m,n) = T(m+1, n) + T(m, n+1). $$ This result gives an expression
+for the super Catalan numbers in terms of numbers counting the so-called
+ballot paths. The latter sometimes are also referred to as the
+generalised Catalan numbers forming the entries of the Catalan triangle.
+
+   Based on the first result, the second result is a combinatorial
+interpretation of the super Catalan numbers \(T(2,n)\) in terms of
+counting certain Dyck paths. This is equivalent to a theorem, which
+represents \(T(2,n)\) as counting of certain pairs of Dyck paths, in [I.
+M. Gessel and G. Xin, J. Integer Seq. *8* (2005), no. 2, Article 05.2.3,
+13 pp.;
+[[http://www.ams.org/mathscinet/search/publdoc.html?r=1&pg1=MR&s1=2134162&loc=fromrevtext][MR2134162]]],
+and the equivalence is explained at the end of the paper by a bijection
+between the Dyck paths and the pairs of Dyck paths. The proof of the
+theorem itself is also done by constructing two bijections between Dyck
+paths satisfying certain conditions. All the three bijections are
+formulated by locating, removing and adding steps.
+
+Copyright notice: This review is published at
+http://www.ams.org/mathscinet-getitem?mr=3275875, its copyright owned by
+the AMS.
diff --git a/posts/2015-04-01-unitary-double-products.org b/posts/2015-04-01-unitary-double-products.org
new file mode 100644
index 0000000..d545b3a
--- /dev/null
+++ b/posts/2015-04-01-unitary-double-products.org
@@ -0,0 +1,10 @@
+#+title: Unitary causal quantum stochastic double products as universal
+interactions I
+
+#+date: <2015-04-01>
+
+In
+[[http://www.actaphys.uj.edu.pl/findarticle?series=Reg&vol=46&page=1851][this
+paper]] with [[http://homepages.lboro.ac.uk/~marh3/][Robin]] we show the
+explicit formulae for a family of unitary triangular and rectangular
+double product integrals which can be described as second quantisations.
diff --git a/posts/2015-04-02-juggling-skill-tree.org b/posts/2015-04-02-juggling-skill-tree.org
new file mode 100644
index 0000000..79b35ad
--- /dev/null
+++ b/posts/2015-04-02-juggling-skill-tree.org
@@ -0,0 +1,28 @@
+#+title: jst
+
+#+date: <2015-04-02>
+
+jst = juggling skill tree
+
+If you have ever played a computer role playing game, you may have
+noticed the protagonist sometimes has a skill "tree" (most of the time
+it is actually a directed acyclic graph), where certain skills leads to
+others. For example,
+[[http://hydra-media.cursecdn.com/diablo.gamepedia.com/3/37/Sorceress_Skill_Trees_%28Diablo_II%29.png?version=b74b3d4097ef7ad4e26ebee0dcf33d01][here]]
+is the skill tree of sorceress in
+[[https://en.wikipedia.org/wiki/Diablo_II][Diablo II]].
+
+Now suppose our hero embarks on a quest for learning all the possible
+juggling patterns. Everyone would agree she should start with cascade,
+the simplest nontrivial 3-ball pattern, but what afterwards? A few other
+accessible patterns for beginners are juggler's tennis, two in one and
+even reverse cascade, but what to learn after that? The encyclopeadic
+[[http://libraryofjuggling.com/][Library of Juggling]] serves as a good
+guide, as it records more than 160 patterns, some of which very
+aesthetically appealing. On this website almost all the patterns have a
+"prerequisite" section, indicating what one should learn beforehand. I
+have therefore written a script using [[http://python.org][Python]],
+[[http://www.crummy.com/software/BeautifulSoup/][BeautifulSoup]] and
+[[http://pygraphviz.github.io/][pygraphviz]] to generate a jst (graded
+by difficulties, which is the leftmost column) from the Library of
+Juggling (click the image for the full size):
diff --git a/posts/2015-05-30-infinite-binary-words-containing-repetitions-odd-periods.org b/posts/2015-05-30-infinite-binary-words-containing-repetitions-odd-periods.org
new file mode 100644
index 0000000..b632c03
--- /dev/null
+++ b/posts/2015-05-30-infinite-binary-words-containing-repetitions-odd-periods.org
@@ -0,0 +1,67 @@
+#+title: AMS review of 'Infinite binary words containing repetitions of
+#+title: odd period' by Badkobeh and Crochemore
+
+#+date: <2015-05-30>
+
+This paper is about the existence of pattern-avoiding infinite binary
+words, where the patterns are squares, cubes and \(3^+\)-powers.   
+There are mainly two kinds of results, positive (existence of an
+infinite binary word avoiding a certain pattern) and negative
+(non-existence of such a word). Each positive result is proved by the
+construction of a word with finitely many squares and cubes which are
+listed explicitly. First a synchronising (also known as comma-free)
+uniform morphism \(g\: \Sigma_3^* \to \Sigma_2^*\)
+
+is constructed. Then an argument is given to show that the length of
+squares in the code \(g(w)\) for a squarefree \(w\) is bounded, hence
+all the squares can be obtained by examining all \(g(s)\) for \(s\) of
+bounded lengths. The argument resembles that of the proof of, e.g.,
+Theorem 1, Lemma 2, Theorem 3 and Lemma 4 in [N. Rampersad, J. O.
+Shallit and M. Wang, Theoret. Comput. Sci. *339* (2005), no. 1, 19--34;
+[[http://www.ams.org/mathscinet/search/publdoc.html?r=1&pg1=MR&s1=2142071&loc=fromrevtext][MR2142071]]].
+The negative results are proved by traversing all possible finite words
+satisfying the conditions.
+
+   Let \(L(n_2, n_3, S)\) be the maximum length of a word with \(n_2\)
+distinct squares, \(n_3\) distinct cubes and that the periods of the
+squares can take values only in \(S\) , where \(n_2, n_3 \in \Bbb N \cup
+\{\infty, \omega\}\) and \(S \subset \Bbb N_+\) .    \(n_k = 0\)
+corresponds to \(k\)-free, \(n_k = \infty\) means no restriction on the
+number of distinct \(k\)-powers, and \(n_k = \omega\) means
+\(k^+\)-free.
+
+   Below is the summary of the positive and negative results:
+
+1) (Negative) \(L(\infty, \omega, 2 \Bbb N) < \infty\) : \(\nexists\) an
+ infinite \(3^+\) -free binary word avoiding all squares of odd
+ periods. (Proposition 1)
+
+2) (Negative) \(L(\infty, 0, 2 \Bbb N + 1) \le 23\) : \(\nexists\) an
+ infinite 3-free binary word, avoiding squares of even periods. The
+ longest one has length \(\le 23\) (Proposition 2).
+
+3) (Positive) \(L(\infty, \omega, 2 \Bbb N +
+
+ 1) - = \infty\) :: \(\exists\) an infinite \(3^+\) -free binary word
+ avoiding squares of even periods (Theorem 1).
+
+4) (Positive) \(L(\infty, \omega, \{1, 3\}) = \infty\) : \(\exists\) an
+ infinite \(3^+\) -free binary word containing only squares of period
+ 1 or 3 (Theorem 2).
+
+5) (Negative) \(L(6, 1, 2 \Bbb N + 1) = 57\) : \(\nexists\) an infinite
+ binary word avoiding squares of even period containing \(< 7\)
+ squares and \(< 2\) cubes. The longest one containing 6 squares and 1
+ cube has length 57 (Proposition 6).
+
+6) (Positive) \(L(7, 1, 2 \Bbb N + 1) = \infty\) : \(\exists\) an
+ infinite \(3^+\) -free binary word avoiding squares of even period
+ with 1 cube and 7 squares (Theorem 3).
+
+7) (Positive) \(L(4, 2, 2 \Bbb N + 1) = \infty\) : \(\exists\) an
+ infinite \(3^+\) -free binary words avoiding squares of even period
+ and containing 2 cubes and 4 squares (Theorem 4).
+
+Copyright notice: This review is published at
+http://www.ams.org/mathscinet-getitem?mr=3313467, its copyright owned by
+the AMS.
diff --git a/posts/2015-07-01-causal-quantum-product-levy-area.org b/posts/2015-07-01-causal-quantum-product-levy-area.org
new file mode 100644
index 0000000..528b9b7
--- /dev/null
+++ b/posts/2015-07-01-causal-quantum-product-levy-area.org
@@ -0,0 +1,26 @@
+#+title: On a causal quantum double product integral related to Lévy
+#+title: stochastic area.
+
+#+date: <2015-07-01>
+
+In [[https://arxiv.org/abs/1506.04294][this paper]] with
+[[http://homepages.lboro.ac.uk/~marh3/][Robin]] we study the family of
+causal double product integrals \[ \prod_{a < x < y < b}\left(1 +
+i{\lambda \over 2}(dP_x dQ_y - dQ_x dP_y) + i {\mu \over 2}(dP_x dP_y +
+dQ_x dQ_y)\right) \]
+
+where $P$ and $Q$ are the mutually noncommuting momentum and position
+Brownian motions of quantum stochastic calculus. The evaluation is
+motivated heuristically by approximating the continuous double product
+by a discrete product in which infinitesimals are replaced by finite
+increments. The latter is in turn approximated by the second
+quantisation of a discrete double product of rotation-like operators in
+different planes due to a result in
+[[http://www.actaphys.uj.edu.pl/findarticle?series=Reg&vol=46&page=1851][(Hudson-Pei2015)]].
+The main problem solved in this paper is the explicit evaluation of the
+continuum limit $W$ of the latter, and showing that $W$ is a unitary
+operator. The kernel of $W$ is written in terms of Bessel functions, and
+the evaluation is achieved by working on a lattice path model and
+enumerating linear extensions of related partial orderings, where the
+enumeration turns out to be heavily related to Dyck paths and
+generalisations of Catalan numbers.
diff --git a/posts/2015-07-15-double-macdonald-polynomials-macdonald-superpolynomials.org b/posts/2015-07-15-double-macdonald-polynomials-macdonald-superpolynomials.org
new file mode 100644
index 0000000..cda6967
--- /dev/null
+++ b/posts/2015-07-15-double-macdonald-polynomials-macdonald-superpolynomials.org
@@ -0,0 +1,64 @@
+#+title: AMS review of 'Double Macdonald polynomials as the stable limit
+#+title: of Macdonald superpolynomials' by Blondeau-Fournier, Lapointe and
+#+title: Mathieu
+
+#+date: <2015-07-15>
+
+A Macdonald superpolynomial (introduced in [O. Blondeau-Fournier et al.,
+Lett. Math. Phys. 101 (2012), no. 1, 27--47;
+[[http://www.ams.org/mathscinet/search/publdoc.html?pg1=MR&s1=2935476&loc=fromrevtext][MR2935476]];
+J. Comb. 3 (2012), no. 3, 495--561;
+[[http://www.ams.org/mathscinet/search/publdoc.html?pg1=MR&s1=3029444&loc=fromrevtext][MR3029444]]])
+in \(N\) Grassmannian variables indexed by a superpartition \(\Lambda\)
+is said to be stable if \({m (m + 1) \over 2} \ge |\Lambda|\) and \(N
+\ge |\Lambda| - {m (m - 3) \over 2}\) , where \(m\) is the fermionic
+degree. A stable Macdonald superpolynomial (corresponding to a
+bisymmetric polynomial) is also called a double Macdonald polynomial
+(dMp). The main result of this paper is the factorisation of a dMp into
+plethysms of two classical Macdonald polynomials (Theorem 5). Based on
+this result, this paper
+
+1) shows that the dMp has a unique decomposition into bisymmetric
+ monomials;
+
+2) calculates the norm of the dMp;
+
+3) calculates the kernel of the Cauchy-Littlewood-type identity of the
+ dMp;
+
+4) shows the specialisation of the aforementioned factorisation to the
+ Jack, Hall-Littlewood and Schur cases. One of the three Schur
+ specialisations, denoted as \(s_{\lambda, \mu}\), also appears in (7)
+ and (9) below;
+
+5) defines the \(\omega\) -automorphism in this setting, which was used
+ to prove an identity involving products of four Littlewood-Richardson
+ coefficients;
+
+6) shows an explicit evaluation of the dMp motivated by the most general
+ evaluation of the usual Macdonald polynomials;
+
+7) relates dMps with the representation theory of the hyperoctahedral
+ group \(B_n\) via the double Kostka coefficients (which are defined
+ as the entries of the transition matrix from the bisymmetric Schur
+ functions \(s_{\lambda, \mu}\) to the modified dMps);
+
+8) shows that the double Kostka coefficients have the positivity and the
+ symmetry property, and can be written as sums of products of the
+ usual Kostka coefficients;
+
+9) defines an operator \(\nabla^B\) as an analogue of the nabla operator
+ \(\nabla\) introduced in [F. Bergeron and A. M. Garsia, in /Algebraic
+ methods and \(q\)-special functions/ (Montréal, QC, 1996), 1--52, CRM
+ Proc. Lecture Notes, 22, Amer. Math. Soc., Providence, RI, 1999;
+ [[http://www.ams.org/mathscinet/search/publdoc.html?r=1&pg1=MR&s1=1726826&loc=fromrevtext][MR1726826]]].
+ The action of \(\nabla^B\) on the bisymmetric Schur function
+ \(s_{\lambda, \mu}\) yields the dimension formula \((h + 1)^r\) for
+ the corresponding representation of \(B_n\) , where \(h\) and \(r\)
+ are the Coxeter number and the rank of \(B_n\) , in the same way that
+ the action of \(\nabla\) on the \(n\) th elementary symmetric
+ function leads to the same formula for the group of type \(A_n\) .
+
+Copyright notice: This review is published at
+http://www.ams.org/mathscinet-getitem?mr=3306078, its copyright owned by
+the AMS.
diff --git a/posts/2016-10-13-q-robinson-schensted-knuth-polymer.org b/posts/2016-10-13-q-robinson-schensted-knuth-polymer.org
new file mode 100644
index 0000000..93da639
--- /dev/null
+++ b/posts/2016-10-13-q-robinson-schensted-knuth-polymer.org
@@ -0,0 +1,50 @@
+#+title: A \(q\)-Robinson-Schensted-Knuth algorithm and a \(q\)-polymer
+
+#+date: <2016-10-13>
+
+(Latest update: 2017-01-12) In
+[[http://arxiv.org/abs/1504.00666][Matveev-Petrov 2016]] a
+\(q\)-deformed Robinson-Schensted-Knuth algorithm (\(q\)RSK) was
+introduced. In this article we give reformulations of this algorithm in
+terms of Noumi-Yamada description, growth diagrams and local moves. We
+show that the algorithm is symmetric, namely the output tableaux pair
+are swapped in a sense of distribution when the input matrix is
+transposed. We also formulate a \(q\)-polymer model based on the
+\(q\)RSK and prove the corresponding Burke property, which we use to
+show a strong law of large numbers for the partition function given
+stationary boundary conditions and \(q\)-geometric weights. We use the
+\(q\)-local moves to define a generalisation of the \(q\)RSK taking a
+Young diagram-shape of array as the input. We write down the joint
+distribution of partition functions in the space-like direction of the
+\(q\)-polymer in \(q\)-geometric environment, formulate a \(q\)-version
+of the multilayer polynuclear growth model (\(q\)PNG) and write down the
+joint distribution of the \(q\)-polymer partition functions at a fixed
+time.
+
+This article is available at
+[[https://arxiv.org/abs/1610.03692][arXiv]]. It seems to me that one
+difference between arXiv and Github is that on arXiv each preprint has a
+few versions only. In Github many projects have a "dev" branch hosting
+continuous updates, whereas the master branch is where the stable
+releases live.
+
+[[file:%7B%7B%20site.url%20%7D%7D/assets/resources/qrsklatest.pdf][Here]]
+is a "dev" version of the article, which I shall push to arXiv when it
+stablises. Below is the changelog.
+
+- 2017-01-12: Typos and grammar, arXiv v2.
+- 2016-12-20: Added remarks on the geometric \(q\)-pushTASEP. Added
+ remarks on the converse of the Burke property. Added natural language
+ description of the \(q\)RSK. Fixed typos.
+- 2016-11-13: Fixed some typos in the proof of Theorem 3.
+- 2016-11-07: Fixed some typos. The \(q\)-Burke property is now stated
+ in a more symmetric way, so is the law of large numbers Theorem 2.
+- 2016-10-20: Fixed a few typos. Updated some references. Added a
+ reference: [[http://web.mit.edu/~shopkins/docs/rsk.pdf][a set of notes
+ titled "RSK via local transformations"]]. It is written by
+ [[http://web.mit.edu/~shopkins/][Sam Hopkins]] in 2014 as an
+ expository article based on MIT combinatorics preseminar presentations
+ of Alex Postnikov. It contains some idea (applying local moves to a
+ general Young-diagram shaped array in the order that matches any
+ growth sequence of the underlying Young diagram) which I thought I was
+ the first one to write down.
diff --git a/posts/2017-04-25-open_research_toywiki.org b/posts/2017-04-25-open_research_toywiki.org
new file mode 100644
index 0000000..1e672b0
--- /dev/null
+++ b/posts/2017-04-25-open_research_toywiki.org
@@ -0,0 +1,21 @@
+#+title: Open mathematical research and launching toywiki
+
+#+date: <2017-04-25>
+
+As an experimental project, I am launching toywiki.
+
+It hosts a collection of my research notes.
+
+It takes some ideas from the open source culture and apply them to
+mathematical research: 1. It uses a very permissive license (CC-BY-SA).
+For example anyone can fork the project and make their own version if
+they have a different vision and want to build upon the project. 2. All
+edits will done with maximum transparency, and discussions of any of
+notes should also be as public as possible (e.g. Github issues) 3.
+Anyone can suggest changes by opening issues and submitting pull
+requests
+
+Here are the links: [[http://toywiki.xyz][toywiki]] and
+[[https://github.com/ycpei/toywiki][github repo]].
+
+Feedbacks are welcome by email.
diff --git a/posts/2017-08-07-mathematical_bazaar.org b/posts/2017-08-07-mathematical_bazaar.org
new file mode 100644
index 0000000..64bf335
--- /dev/null
+++ b/posts/2017-08-07-mathematical_bazaar.org
@@ -0,0 +1,213 @@
+#+title: The Mathematical Bazaar
+
+#+date: <2017-08-07>
+
+In this essay I describe some problems in academia of mathematics and
+propose an open source model, which I call open research in mathematics.
+
+This essay is a work in progress - comments and criticisms are
+welcome! [fn:1]
+
+Before I start I should point out that
+
+1. Open research is /not/ open access. In fact the latter is a
+ prerequisite to the former.
+2. I am not proposing to replace the current academic model with the
+ open model - I know academia works well for many people and I am
+ happy for them, but I think an open research community is long
+ overdue since the wide adoption of the World Wide Web more than two
+ decades ago. In fact, I fail to see why an open model can not run in
+ tandem with the academia, just like open source and closed source
+ software development coexist today.
+
+** problems of academia
+ :PROPERTIES:
+ :CUSTOM_ID: problems-of-academia
+ :END:
+Open source projects are characterised by publicly available source
+codes as well as open invitations for public collaborations, whereas
+closed source projects do not make source codes accessible to the
+public. How about mathematical academia then, is it open source or
+closed source? The answer is neither.
+
+Compared to some other scientific disciplines, mathematics does not
+require expensive equipments or resources to replicate results; compared
+to programming in conventional software industry, mathematical findings
+are not meant to be commercial, as credits and reputation rather than
+money are the direct incentives (even though the former are commonly
+used to trade for the latter). It is also a custom and common belief
+that mathematical derivations and theorems shouldn't be patented.
+Because of this, mathematical research is an open source activity in the
+sense that proofs to new results are all available in papers, and thanks
+to open access e.g. the arXiv preprint repository most of the new
+mathematical knowledge is accessible for free.
+
+Then why, you may ask, do I claim that maths research is not open
+sourced? Well, this is because 1. mathematical arguments are not easily
+replicable and 2. mathematical research projects are mostly not open for
+public participation.
+
+Compared to computer programs, mathematical arguments are not written in
+an unambiguous language, and they are terse and not written in maximum
+verbosity (this is especially true in research papers as journals
+encourage limiting the length of submissions), so the understanding of a
+proof depends on whether the reader is equipped with the right
+background knowledge, and the completeness of a proof is highly
+subjective. More generally speaking, computer programs are mostly
+portable because all machines with the correct configurations can
+understand and execute a piece of program, whereas humans are subject to
+their environment, upbringings, resources etc. to have a brain ready to
+comprehend a proof that interests them. (these barriers are softer than
+the expensive equipments and resources in other scientific fields
+mentioned before because it is all about having access to the right
+information)
+
+On the other hand, as far as the pursuit of reputation and prestige
+(which can be used to trade for the scarce resource of research
+positions and grant money) goes, there is often little practical
+motivation for career mathematicians to explain their results to the
+public carefully. And so the weird reality of the mathematical academia
+is that it is not an uncommon practice to keep trade secrets in order to
+protect one's territory and maintain a monopoly. This is doable because
+as long as a paper passes the opaque and sometimes political peer review
+process and is accepted by a journal, it is considered work done,
+accepted by the whole academic community and adds to the reputation of
+the author(s). Just like in the software industry, trade secrets and
+monopoly hinder the development of research as a whole, as well as
+demoralise outsiders who are interested in participating in related
+research.
+
+Apart from trade secrets and territoriality, another reason to the
+nonexistence of open research community is an elitist tradition in the
+mathematical academia, which goes as follows:
+
+- Whoever is not good at mathematics or does not possess a degree in
+ maths is not eligible to do research, or else they run high risks of
+ being labelled a crackpot.
+- Mistakes made by established mathematicians are more tolerable than
+ those less established.
+- Good mathematical writings should be deep, and expositions of
+ non-original results are viewed as inferior work and do not add to
+ (and in some cases may even damage) one's reputation.
+
+All these customs potentially discourage public participations in
+mathematical research, and I do not see them easily go away unless an
+open source community gains momentum.
+
+To solve the above problems, I propose a open source model of
+mathematical research, which has high levels of openness and
+transparency and also has some added benefits listed in the last section
+of this essay. This model tries to achieve two major goals:
+
+- Open and public discussions and collaborations of mathematical
+ research projects online
+- Open review to validate results, where author name, reviewer name,
+ comments and responses are all publicly available online.
+
+To this end, a Github model is fitting. Let me first describe how open
+source collaboration works on Github.
+
+** open source collaborations on Github
+ :PROPERTIES:
+ :CUSTOM_ID: open-source-collaborations-on-github
+ :END:
+On [[https://github.com][Github]], every project is publicly available
+in a repository (we do not consider private repos). The owner can update
+the project by "committing" changes, which include a message of what has
+been changed, the author of the changes and a timestamp. Each project
+has an issue tracker, which is basically a discussion forum about the
+project, where anyone can open an issue (start a discussion), and the
+owner of the project as well as the original poster of the issue can
+close it if it is resolved, e.g. bug fixed, feature added, or out of the
+scope of the project. Closing the issue is like ending the discussion,
+except that the thread is still open to more posts for anyone
+interested. People can react to each issue post, e.g. upvote, downvote,
+celebration, and importantly, all the reactions are public too, so you
+can find out who upvoted or downvoted your post.
+
+When one is interested in contributing code to a project, they fork it,
+i.e. make a copy of the project, and make the changes they like in the
+fork. Once they are happy with the changes, they submit a pull request
+to the original project. The owner of the original project may accept or
+reject the request, and they can comment on the code in the pull
+request, asking for clarification, pointing out problematic part of the
+code etc and the author of the pull request can respond to the comments.
+Anyone, not just the owner can participate in this review process,
+turning it into a public discussion. In fact, a pull request is a
+special issue thread. Once the owner is happy with the pull request,
+they accept it and the changes are merged into the original project. The
+author of the changes will show up in the commit history of the original
+project, so they get the credits.
+
+As an alternative to forking, if one is interested in a project but has
+a different vision, or that the maintainer has stopped working on it,
+they can clone it and make their own version. This is a more independent
+kind of fork because there is no longer intention to contribute back to
+the original project.
+
+Moreover, on Github there is no way to send private messages, which
+forces people to interact publicly. If say you want someone to see and
+reply to your comment in an issue post or pull request, you simply
+mention them by =@someone=.
+
+** open research in mathematics
+ :PROPERTIES:
+ :CUSTOM_ID: open-research-in-mathematics
+ :END:
+All this points to a promising direction of open research. A maths
+project may have a wiki / collection of notes, the paper being written,
+computer programs implementing the results etc. The issue tracker can
+serve as a discussion forum about the project as well as a platform for
+open review (bugs are analogous to mistakes, enhancements are possible
+ways of improving the main results etc.), and anyone can make their own
+version of the project, and (optionally) contribute back by making pull
+requests, which will also be openly reviewed. One may want to add an
+extra "review this project" functionality, so that people can comment on
+the original project like they do in a pull request. This may or may not
+be necessary, as anyone can make comments or point out mistakes in the
+issue tracker.
+
+One may doubt this model due to concerns of credits because work in
+progress is available to anyone. Well, since all the contributions are
+trackable in project commit history and public discussions in issues and
+pull request reviews, there is in fact /less/ room for cheating than the
+current model in academia, where scooping can happen without any
+witnesses. What we need is a platform with a good amount of trust like
+arXiv, so that the open research community honours (and can not ignore)
+the commit history, and the chance of mis-attribution can be reduced to
+minimum.
+
+Compared to the academic model, open research also has the following
+advantages:
+
+- Anyone in the world with Internet access will have a chance to
+ participate in research, whether they are affiliated to a university,
+ have the financial means to attend conferences, or are colleagues of
+ one of the handful experts in a specific field.
+- The problem of replicating / understanding maths results will be
+ solved, as people help each other out. This will also remove the
+ burden of answering queries about one's research. For example, say one
+ has a project "Understanding the fancy results in [paper name]", they
+ write up some initial notes but get stuck understanding certain
+ arguments. In this case they can simply post the questions on the
+ issue tracker, and anyone who knows the answer, or just has a
+ speculation can participate in the discussion. In the end the problem
+ may be resolved without the authors of the paper being bothered, who
+ may be too busy to answer.
+- Similarly, the burden of peer review can also be shifted from a few
+ appointed reviewers to the crowd.
+
+** related readings
+ :PROPERTIES:
+ :CUSTOM_ID: related-readings
+ :END:
+
+- [[http://www.catb.org/esr/writings/cathedral-bazaar/][The Cathedral
+ and the Bazaar by Eric Raymond]]
+- [[http://michaelnielsen.org/blog/doing-science-online/][Doing sience
+ online by Michael Nielson]]
+- [[https://gowers.wordpress.com/2009/01/27/is-massively-collaborative-mathematics-possible/][Is
+ massively collaborative mathematics possible? by Timothy Gowers]]
+
+[fn:1] Please send your comments to my email address - I am still
+ looking for ways to add a comment functionality to this website.
diff --git a/posts/2018-04-10-update-open-research.org b/posts/2018-04-10-update-open-research.org
new file mode 100644
index 0000000..4b078d5
--- /dev/null
+++ b/posts/2018-04-10-update-open-research.org
@@ -0,0 +1,185 @@
+#+title: Updates on open research
+
+#+date: <2018-04-29>
+
+It has been 9 months since I last wrote about open (maths) research.
+Since then two things happened which prompted me to write an update.
+
+As always I discuss open research only in mathematics, not because I
+think it should not be applied to other disciplines, but simply because
+I do not have experience nor sufficient interests in non-mathematical
+subjects.
+
+First, I read about Richard Stallman the founder of the free software
+movement, in [[http://shop.oreilly.com/product/9780596002879.do][his
+biography by Sam Williams]] and his own collection of essays
+[[https://shop.fsf.org/books-docs/free-software-free-society-selected-essays-richard-m-stallman-3rd-edition][/Free
+software, free society/]], from which I learned a bit more about the
+context and philosophy of free software and its relation to that of open
+source software. For anyone interested in open research, I highly
+recommend having a look at these two books. I am also reading Levy's
+[[http://www.stevenlevy.com/index.php/books/hackers][Hackers]], which
+documented the development of the hacker culture predating Stallman. I
+can see the connection of ideas from the hacker ethic to the free
+software philosophy and to the open source philosophy. My guess is that
+the software world is fortunate to have pioneers who advocated for
+various kinds of freedom and openness from the beginning, whereas for
+academia which has a much longer history, credit protection has always
+been a bigger concern.
+
+Also a month ago I attended a workshop called
+[[https://www.perimeterinstitute.ca/conferences/open-research-rethinking-scientific-collaboration][Open
+research: rethinking scientific collaboration]]. That was the first time
+I met a group of people (mostly physicists) who also want open research
+to happen, and we had some stimulating discussions. Many thanks to the
+organisers at Perimeter Institute for organising the event, and special
+thanks to
+[[https://www.perimeterinstitute.ca/people/matteo-smerlak][Matteo
+Smerlak]] and
+[[https://www.perimeterinstitute.ca/people/ashley-milsted][Ashley
+Milsted]] for invitation and hosting.
+
+From both of these I feel like I should write an updated post on open
+research.
+
+*** Freedom and community
+ :PROPERTIES:
+ :CUSTOM_ID: freedom-and-community
+ :END:
+Ideals matter. Stallman's struggles stemmed from the frustration of
+denied request of source code (a frustration I shared in academia except
+source code is replaced by maths knowledge), and revolved around two
+things that underlie the free software movement: freedom and community.
+That is, the freedom to use, modify and share a work, and by sharing, to
+help the community.
+
+Likewise, as for open research, apart from the utilitarian view that
+open research is more efficient / harder for credit theft, we should not
+ignore the ethical aspect that open research is right and fair. In
+particular, I think freedom and community can also serve as principles
+in open research. One way to make this argument more concrete is to
+describe what I feel are the central problems: NDAs (non-disclosure
+agreements) and reproducibility.
+
+*NDAs*. It is assumed that when establishing a research collaboration,
+or just having a discussion, all those involved own the joint work in
+progress, and no one has the freedom to disclose any information
+e.g. intermediate results without getting permission from all
+collaborators. In effect this amounts to signing an NDA. NDAs are
+harmful because they restrict people's freedom from sharing information
+that can benefit their own or others' research. Considering that in
+contrast to the private sector, the primary goal of academia is
+knowledge but not profit, NDAs in research are unacceptable.
+
+*Reproducibility*. Research papers written down are not necessarily
+reproducible, even though they appear on peer-reviewed journals. This is
+because the peer-review process is opaque and the proofs in the papers
+may not be clear to everyone. To make things worse, there are no open
+channels to discuss results in these papers and one may have to rely on
+interacting with the small circle of the informed. One example is folk
+theorems. Another is trade secrets required to decipher published works.
+
+I should clarify that freedom works both ways. One should have the
+freedom to disclose maths knowledge, but they should also be free to
+withhold any information that does not hamper the reproducibility of
+published works (e.g. results in ongoing research yet to be published),
+even though it may not be nice to do so when such information can help
+others with their research.
+
+Similar to the solution offered by the free software movement, we need a
+community that promotes and respects free flow of maths knowledge, in
+the spirit of the [[https://www.gnu.org/philosophy/][four essential
+freedoms]], a community that rejects NDAs and upholds reproducibility.
+
+Here are some ideas on how to tackle these two problems and build the
+community:
+
+1. Free licensing. It solves NDA problem - free licenses permit
+ redistribution and modification of works, so if you adopt them in
+ your joint work, then you have the freedom to modify and distribute
+ the work; it also helps with reproducibility - if a paper is not
+ clear, anyone can write their own version and publish it. Bonus
+ points with the use of copyleft licenses like
+ [[https://creativecommons.org/licenses/by-sa/4.0/][Creative Commons
+ Share-Alike]] or the [[https://www.gnu.org/licenses/fdl.html][GNU
+ Free Documentation License]].
+2. A forum for discussions of mathematics. It helps solve the
+ reproducibility problem - public interaction may help quickly clarify
+ problems. By the way, Math Overflow is not a forum.
+3. An infrastructure of mathematical knowledge. Like the GNU system, a
+ mathematics encyclopedia under a copyleft license maintained in the
+ Github-style rather than Wikipedia-style by a "Free Mathematics
+ Foundation", and drawing contributions from the public (inside or
+ outside of the academia). To begin with, crowd-source (again,
+ Github-style) the proofs of say 1000 foundational theorems covered in
+ the curriculum of a bachelor's degree. Perhaps start with taking
+ contributions from people with some credentials (e.g. having a
+ bachelor degree in maths) and then expand the contribution permission
+ to the public, or taking advantage of existing corpus under free
+ license like Wikipedia.
+4. Citing with care: if a work is considered authorative but you
+ couldn't reproduce the results, whereas another paper which tries to
+ explain or discuss similar results makes the first paper
+ understandable to you, give both papers due attribution (something
+ like: see [1], but I couldn't reproduce the proof in [1], and the
+ proofs in [2] helped clarify it). No one should be offended if you
+ say you can not reproduce something - there may be causes on both
+ sides, whereas citing [2] is fairer and helps readers with a similar
+ background.
+
+*** Tools for open research
+ :PROPERTIES:
+ :CUSTOM_ID: tools-for-open-research
+ :END:
+The open research workshop revolved around how to lead academia towards
+a more open culture. There were discussions on open research tools,
+improving credit attributions, the peer-review process and the path to
+adoption.
+
+During the workshop many efforts for open research were mentioned, and
+afterwards I was also made aware by more of them, like the following:
+
+- [[https://osf.io][OSF]], an online research platform. It has a clean
+ and simple interface with commenting, wiki, citation generation, DOI
+ generation, tags, license generation etc. Like Github it supports
+ private and public repositories (but defaults to private), version
+ control, with the ability to fork or bookmark a project.
+- [[https://scipost.org/][SciPost]], physics journals whose peer review
+ reports and responses are public (peer-witnessed refereeing), and
+ allows comments (post-publication evaluation). Like arXiv, it requires
+ some academic credential (PhD or above) to register.
+- [[https://knowen.org/][Knowen]], a platform to organise knowledge in
+ directed acyclic graphs. Could be useful for building the
+ infrastructure of mathematical knowledge.
+- [[https://fermatslibrary.com/][Fermat's Library]], the journal club
+ website that crowd-annotates one notable paper per week released a
+ Chrome extension [[https://fermatslibrary.com/librarian][Librarian]]
+ that overlays a commenting interface on arXiv. As an example Ian
+ Goodfellow did an
+ [[https://fermatslibrary.com/arxiv_comments?url=https://arxiv.org/pdf/1406.2661.pdf][AMA
+ (ask me anything) on his GAN paper]].
+- [[https://polymathprojects.org/][The Polymath project]], the famous
+ massive collaborative mathematical project. Not exactly new, the
+ Polymath project is the only open maths research project that has
+ gained some traction and recognition. However, it does not have many
+ active projects
+ ([[http://michaelnielsen.org/polymath1/index.php?title=Main_Page][currently
+ only one active project]]).
+- [[https://stacks.math.columbia.edu/][The Stacks Project]]. I was made
+ aware of this project by [[https://people.kth.se/~yitingl/][Yiting]].
+ Its data is hosted on github and accepts contributions via pull
+ requests and is licensed under the GNU Free Documentation License,
+ ticking many boxes of the free and open source model.
+
+*** An anecdote from the workshop
+ :PROPERTIES:
+ :CUSTOM_ID: an-anecdote-from-the-workshop
+ :END:
+In a conversation during the workshop, one of the participants called
+open science "normal science", because reproducibility, open access,
+collaborations, and fair attributions are all what science is supposed
+to be, and practices like treating the readers as buyers rather than
+users should be called "bad science", rather than "closed science".
+
+To which an organiser replied: maybe we should rename the workshop
+"Not-bad science".
diff --git a/posts/2018-06-03-automatic_differentiation.org b/posts/2018-06-03-automatic_differentiation.org
new file mode 100644
index 0000000..cebcf8c
--- /dev/null
+++ b/posts/2018-06-03-automatic_differentiation.org
@@ -0,0 +1,100 @@
+#+title: Automatic differentiation
+
+#+date: <2018-06-03>
+
+This post serves as a note and explainer of autodiff. It is licensed
+under [[https://www.gnu.org/licenses/fdl.html][GNU FDL]].
+
+For my learning I benefited a lot from
+[[http://www.cs.toronto.edu/%7Ergrosse/courses/csc321_2018/slides/lec10.pdf][Toronto
+CSC321 slides]] and the
+[[https://github.com/mattjj/autodidact/][autodidact]] project which is a
+pedagogical implementation of
+[[https://github.com/hips/autograd][Autograd]]. That said, any mistakes
+in this note are mine (especially since some of the knowledge is
+obtained from interpreting slides!), and if you do spot any I would be
+grateful if you can let me know.
+
+Automatic differentiation (AD) is a way to compute derivatives. It does
+so by traversing through a computational graph using the chain rule.
+
+There are the forward mode AD and reverse mode AD, which are kind of
+symmetric to each other and understanding one of them results in little
+to no difficulty in understanding the other.
+
+In the language of neural networks, one can say that the forward mode AD
+is used when one wants to compute the derivatives of functions at all
+layers with respect to input layer weights, whereas the reverse mode AD
+is used to compute the derivatives of output functions with respect to
+weights at all layers. Therefore reverse mode AD (rmAD) is the one to
+use for gradient descent, which is the one we focus in this post.
+
+Basically rmAD requires the computation to be sufficiently decomposed,
+so that in the computational graph, each node as a function of its
+parent nodes is an elementary function that the AD engine has knowledge
+about.
+
+For example, the Sigmoid activation $a' = \sigma(w a + b)$ is quite
+simple, but it should be decomposed to simpler computations:
+
+- $a' = 1 / t_1$
+- $t_1 = 1 + t_2$
+- $t_2 = \exp(t_3)$
+- $t_3 = - t_4$
+- $t_4 = t_5 + b$
+- $t_5 = w a$
+
+Thus the function $a'(a)$ is decomposed to elementary operations like
+addition, subtraction, multiplication, reciprocitation, exponentiation,
+logarithm etc. And the rmAD engine stores the Jacobian of these
+elementary operations.
+
+Since in neural networks we want to find derivatives of a single loss
+function $L(x; \theta)$, we can omit $L$ when writing derivatives and
+denote, say $\bar \theta_k := \partial_{\theta_k} L$.
+
+In implementations of rmAD, one can represent the Jacobian as a
+transformation $j: (x \to y) \to (y, \bar y, x) \to \bar x$. $j$ is
+called the /Vector Jacobian Product/ (VJP). For example,
+$j(\exp)(y, \bar y, x) = y \bar y$ since given $y = \exp(x)$,
+
+$\partial_x L = \partial_x y \cdot \partial_y L = \partial_x \exp(x) \cdot \partial_y L = y \bar y$
+
+as another example, $j(+)(y, \bar y, x_1, x_2) = (\bar y, \bar y)$ since
+given $y = x_1 + x_2$, $\bar{x_1} = \bar{x_2} = \bar y$.
+
+Similarly,
+
+1. $j(/)(y, \bar y, x_1, x_2) = (\bar y / x_2, - \bar y x_1 / x_2^2)$
+2. $j(\log)(y, \bar y, x) = \bar y / x$
+3. $j((A, \beta) \mapsto A \beta)(y, \bar y, A, \beta) = (\bar y \otimes \beta, A^T \bar y)$.
+4. etc...
+
+In the third one, the function is a matrix $A$ multiplied on the right
+by a column vector $\beta$, and $\bar y \otimes \beta$ is the tensor
+product which is a fancy way of writing $\bar y \beta^T$. See
+[[https://github.com/mattjj/autodidact/blob/master/autograd/numpy/numpy_vjps.py][numpy_vjps.py]]
+for the implementation in autodidact.
+
+So, given a node say $y = y(x_1, x_2, ..., x_n)$, and given the value of
+$y$, $x_{1 : n}$ and $\bar y$, rmAD computes the values of
+$\bar x_{1 : n}$ by using the Jacobians.
+
+This is the gist of rmAD. It stores the values of each node in a forward
+pass, and computes the derivatives of each node exactly once in a
+backward pass.
+
+It is a nice exercise to derive the backpropagation in the fully
+connected feedforward neural networks
+(e.g. [[http://neuralnetworksanddeeplearning.com/chap2.html#the_four_fundamental_equations_behind_backpropagation][the
+one for MNIST in Neural Networks and Deep Learning]]) using rmAD.
+
+AD is an approach lying between the extremes of numerical approximation
+(e.g. finite difference) and symbolic evaluation. It uses exact formulas
+(VJP) at each elementary operation like symbolic evaluation, while
+evaluates each VJP numerically rather than lumping all the VJPs into an
+unwieldy symbolic formula.
+
+Things to look further into: the higher-order functional currying form
+$j: (x \to y) \to (y, \bar y, x) \to \bar x$ begs for a functional
+programming implementation.
diff --git a/posts/2018-12-02-lime-shapley.org b/posts/2018-12-02-lime-shapley.org
new file mode 100644
index 0000000..05ef4ee
--- /dev/null
+++ b/posts/2018-12-02-lime-shapley.org
@@ -0,0 +1,362 @@
+#+title: Shapley, LIME and SHAP
+
+#+date: <2018-12-02>
+
+In this post I explain LIME (Ribeiro et. al. 2016), the Shapley values
+(Shapley, 1953) and the SHAP values (Strumbelj-Kononenko, 2014;
+Lundberg-Lee, 2017).
+
+*Acknowledgement*. Thanks to Josef Lindman Hörnlund for bringing the
+LIME and SHAP papers to my attention. The research was done while
+working at KTH mathematics department.
+
+/If you are reading on a mobile device, you may need to "request desktop
+site" for the equations to be properly displayed. This post is licensed
+under CC BY-SA and GNU FDL./
+
+** Shapley values
+ :PROPERTIES:
+ :CUSTOM_ID: shapley-values
+ :END:
+A coalitional game $(v, N)$ of $n$ players involves
+
+- The set $N = \{1, 2, ..., n\}$ that represents the players.
+- A function $v: 2^N \to \mathbb R$, where $v(S)$ is the worth of
+ coalition $S \subset N$.
+
+The Shapley values $\phi_i(v)$ of such a game specify a fair way to
+distribute the total worth $v(N)$ to the players. It is defined as (in
+the following, for a set $S \subset N$ we use the convention $s = |S|$
+to be the number of elements of set $S$ and the shorthand
+$S - i := S \setminus \{i\}$ and $S + i := S \cup \{i\}$)
+
+$$\phi_i(v) = \sum_{S: i \in S} {(n - s)! (s - 1)! \over n!} (v(S) - v(S - i)).$$
+
+It is not hard to see that $\phi_i(v)$ can be viewed as an expectation:
+
+$$\phi_i(v) = \mathbb E_{S \sim \nu_i} (v(S) - v(S - i))$$
+
+where $\nu_i(S) = n^{-1} {n - 1 \choose s - 1}^{-1} 1_{i \in S}$, that
+is, first pick the size $s$ uniformly from $\{1, 2, ..., n\}$, then pick
+$S$ uniformly from the subsets of $N$ that has size $s$ and contains
+$i$.
+
+The Shapley values satisfy some nice properties which are readily
+verified, including:
+
+- *Efficiency*. $\sum_i \phi_i(v) = v(N) - v(\emptyset)$.
+- *Symmetry*. If for some $i, j \in N$, for all $S \subset N$, we have
+ $v(S + i) = v(S + j)$, then $\phi_i(v) = \phi_j(v)$.
+- *Null player*. If for some $i \in N$, for all $S \subset N$, we have
+ $v(S + i) = v(S)$, then $\phi_i(v) = 0$.
+- *Linearity*. $\phi_i$ is linear in games. That is
+ $\phi_i(v) + \phi_i(w) = \phi_i(v + w)$, where $v + w$ is defined by
+ $(v + w)(S) := v(S) + w(S)$.
+
+In the literature, an added assumption $v(\emptyset) = 0$ is often
+given, in which case the Efficiency property is defined as
+$\sum_i \phi_i(v) = v(N)$. Here I discard this assumption to avoid minor
+inconsistencies across different sources. For example, in the LIME
+paper, the local model is defined without an intercept, even though the
+underlying $v(\emptyset)$ may not be $0$. In the SHAP paper, an
+intercept $\phi_0 = v(\emptyset)$ is added which fixes this problem when
+making connections to the Shapley values.
+
+Conversely, according to Strumbelj-Kononenko (2010), it was shown in
+Shapley's original paper (Shapley, 1953) that these four properties
+together with $v(\emptyset) = 0$ defines the Shapley values.
+
+** LIME
+ :PROPERTIES:
+ :CUSTOM_ID: lime
+ :END:
+LIME (Ribeiro et. al. 2016) is a model that offers a way to explain
+feature contributions of supervised learning models locally.
+
+Let $f: X_1 \times X_2 \times ... \times X_n \to \mathbb R$ be a
+function. We can think of $f$ as a model, where $X_j$ is the space of
+$j$th feature. For example, in a language model, $X_j$ may correspond to
+the count of the $j$th word in the vocabulary, i.e. the bag-of-words
+model.
+
+The output may be something like housing price, or log-probability of
+something.
+
+LIME tries to assign a value to each feature /locally/. By locally, we
+mean that given a specific sample $x \in X := \prod_{i = 1}^n X_i$, we
+want to fit a model around it.
+
+More specifically, let $h_x: 2^N \to X$ be a function defined by
+
+$$(h_x(S))_i =
+\begin{cases}
+x_i, & \text{if }i \in S; \\
+0, & \text{otherwise.}
+\end{cases}$$
+
+That is, $h_x(S)$ masks the features that are not in $S$, or in other
+words, we are perturbing the sample $x$. Specifically, $h_x(N) = x$.
+Alternatively, the $0$ in the "otherwise" case can be replaced by some
+kind of default value (see the section titled SHAP in this post).
+
+For a set $S \subset N$, let us denote $1_S \in \{0, 1\}^n$ to be an
+$n$-bit where the $k$th bit is $1$ if and only if $k \in S$.
+
+Basically, LIME samples $S_1, S_2, ..., S_m \subset N$ to obtain a set
+of perturbed samples $x_i = h_x(S_i)$ in the $X$ space, and then fits a
+linear model $g$ using $1_{S_i}$ as the input samples and $f(h_x(S_i))$
+as the output samples:
+
+*Problem*(LIME). Find $w = (w_1, w_2, ..., w_n)$ that minimises
+
+$$\sum_i (w \cdot 1_{S_i} - f(h_x(S_i)))^2 \pi_x(h_x(S_i))$$
+
+where $\pi_x(x')$ is a function that penalises $x'$s that are far away
+from $x$. In the LIME paper the Gaussian kernel was used:
+
+$$\pi_x(x') = \exp\left({- \|x - x'\|^2 \over \sigma^2}\right).$$
+
+Then $w_i$ represents the importance of the $i$th feature.
+
+The LIME model has a more general framework, but the specific model
+considered in the paper is the one described above, with a Lasso for
+feature selection.
+
+*Remark*. One difference between our account here and the one in the
+LIME paper is: the dimension of the data space may differ from $n$ (see
+Section 3.1 of that paper). But in the case of text data, they do use
+bag-of-words (our $X$) for an "intermediate" representation. So my
+understanding is, in their context, there is an "original" data space
+(let's call it $X'$). And there is a one-one correspondence between $X'$
+and $X$ (let's call it $r: X' \to X$), so that given a sample
+$x' \in X'$, we can compute the output of $S$ in the local model with
+$f(r^{-1}(h_{r(x')}(S)))$. As an example, in the example of $X$ being
+the bag of words, $X'$ may be the embedding vector space, so that
+$r(x') = A^{-1} x'$, where $A$ is the word embedding matrix. Therefore,
+without loss of generality, we assume the input space to be $X$ which is
+of dimension $n$.
+
+** Shapley values and LIME
+ :PROPERTIES:
+ :CUSTOM_ID: shapley-values-and-lime
+ :END:
+The connection between the Shapley values and LIME is noted in
+Lundberg-Lee (2017), but the underlying connection goes back to 1988
+(Charnes et. al.).
+
+To see the connection, we need to modify LIME a bit.
+
+First, we need to make LIME less efficient by considering /all/ the
+$2^n$ subsets instead of the $m$ samples $S_1, S_2, ..., S_{m}$.
+
+Then we need to relax the definition of $\pi_x$. It no longer needs to
+penalise samples that are far away from $x$. In fact, we will see later
+than the choice of $\pi_x(x')$ that yields the Shapley values is high
+when $x'$ is very close or very far away from $x$, and low otherwise. We
+further add the restriction that $\pi_x(h_x(S))$ only depends on the
+size of $S$, thus we rewrite it as $q(s)$ instead.
+
+We also denote $v(S) := f(h_x(S))$ and $w(S) = \sum_{i \in S} w_i$.
+
+Finally, we add the Efficiency property as a constraint:
+$\sum_{i = 1}^n w_i = f(x) - f(h_x(\emptyset)) = v(N) - v(\emptyset)$.
+
+Then the problem becomes a weighted linear regression:
+
+*Problem*. minimises $\sum_{S \subset N} (w(S) - v(S))^2 q(s)$ over $w$
+subject to $w(N) = v(N) - v(\emptyset)$.
+
+*Claim* (Charnes et. al. 1988). The solution to this problem is
+
+$$w_i = {1 \over n} (v(N) - v(\emptyset)) + \left(\sum_{s = 1}^{n - 1} {n - 2 \choose s - 1} q(s)\right)^{-1} \sum_{S \subset N: i \in S} \left({n - s \over n} q(s) v(S) - {s - 1 \over n} q(s - 1) v(S - i)\right). \qquad (-1)$$
+
+Specifically, if we choose
+
+$$q(s) = c {n - 2 \choose s - 1}^{-1}$$
+
+for any constant $c$, then $w_i = \phi_i(v)$ are the Shapley values.
+
+*Remark*. Don't worry about this specific choice of $q(s)$ when $s = 0$
+or $n$, because $q(0)$ and $q(n)$ do not appear on the right hand side
+of (-1). Therefore they can be defined to be of any value. A common
+convention of the binomial coefficients is to set ${\ell \choose k} = 0$
+if $k < 0$ or $k > \ell$, in which case $q(0) = q(n) = \infty$.
+
+In Lundberg-Lee (2017), $c$ is chosen to be $1 / n$, see Theorem 2
+there.
+
+In Charnes et. al. 1988, the $w_i$s defined in (-1) are called the
+generalised Shapley values.
+
+*Proof*. The Lagrangian is
+
+$$L(w, \lambda) = \sum_{S \subset N} (v(S) - w(S))^2 q(s) - \lambda(w(N) - v(N) + v(\emptyset)).$$
+
+and by making $\partial_{w_i} L(w, \lambda) = 0$ we have
+
+$${1 \over 2} \lambda = \sum_{S \subset N: i \in S} (w(S) - v(S)) q(s). \qquad (0)$$
+
+Summing (0) over $i$ and divide by $n$, we have
+
+$${1 \over 2} \lambda = {1 \over n} \sum_i \sum_{S: i \in S} (w(S) q(s) - v(S) q(s)). \qquad (1)$$
+
+We examine each of the two terms on the right hand side.
+
+Counting the terms involving $w_i$ and $w_j$ for $j \neq i$, and using
+$w(N) = v(N) - v(\emptyset)$ we have
+
+$$\begin{aligned}
+&\sum_{S \subset N: i \in S} w(S) q(s) \\
+&= \sum_{s = 1}^n {n - 1 \choose s - 1} q(s) w_i + \sum_{j \neq i}\sum_{s = 2}^n {n - 2 \choose s - 2} q(s) w_j \\
+&= q(1) w_i + \sum_{s = 2}^n q(s) \left({n - 1 \choose s - 1} w_i + \sum_{j \neq i} {n - 2 \choose s - 2} w_j\right) \\
+&= q(1) w_i + \sum_{s = 2}^n \left({n - 2 \choose s - 1} w_i + {n - 2 \choose s - 2} (v(N) - v(\emptyset))\right) q(s) \\
+&= \sum_{s = 1}^{n - 1} {n - 2 \choose s - 1} q(s) w_i + \sum_{s = 2}^n {n - 2 \choose s - 2} q(s) (v(N) - v(\emptyset)). \qquad (2)
+\end{aligned}$$
+
+Summing (2) over $i$, we obtain
+
+$$\begin{aligned}
+&\sum_i \sum_{S: i \in S} w(S) q(s)\\
+&= \sum_{s = 1}^{n - 1} {n - 2 \choose s - 1} q(s) (v(N) - v(\emptyset)) + \sum_{s = 2}^n n {n - 2 \choose s - 2} q(s) (v(N) - v(\emptyset))\\
+&= \sum_{s = 1}^n s{n - 1 \choose s - 1} q(s) (v(N) - v(\emptyset)). \qquad (3)
+\end{aligned}$$
+
+For the second term in (1), we have
+
+$$\sum_i \sum_{S: i \in S} v(S) q(s) = \sum_{S \subset N} s v(S) q(s). \qquad (4)$$
+
+Plugging (3)(4) in (1), we have
+
+$${1 \over 2} \lambda = {1 \over n} \left(\sum_{s = 1}^n s {n - 1 \choose s - 1} q(s) (v(N) - v(\emptyset)) - \sum_{S \subset N} s q(s) v(S) \right). \qquad (5)$$
+
+Plugging (5)(2) in (0) and solve for $w_i$, we have
+
+$$w_i = {1 \over n} (v(N) - v(\emptyset)) + \left(\sum_{s = 1}^{n - 1} {n - 2 \choose s - 1} q(s) \right)^{-1} \left( \sum_{S: i \in S} q(s) v(S) - {1 \over n} \sum_{S \subset N} s q(s) v(S) \right). \qquad (6)$$
+
+By splitting all subsets of $N$ into ones that contain $i$ and ones that
+do not and pair them up, we have
+
+$$\sum_{S \subset N} s q(s) v(S) = \sum_{S: i \in S} (s q(s) v(S) + (s - 1) q(s - 1) v(S - i)).$$
+
+Plugging this back into (6) we get the desired result. $\square$
+
+** SHAP
+ :PROPERTIES:
+ :CUSTOM_ID: shap
+ :END:
+The paper that coined the term "SHAP values" (Lundberg-Lee 2017) is not
+clear in its definition of the "SHAP values" and its relation to LIME,
+so the following is my interpretation of their interpretation model,
+which coincide with a model studied in Strumbelj-Kononenko 2014.
+
+Recall that we want to calculate feature contributions to a model $f$ at
+a sample $x$.
+
+Let $\mu$ be a probability density function over the input space
+$X = X_1 \times ... \times X_n$. A natural choice would be the density
+that generates the data, or one that approximates such density (e.g.
+empirical distribution).
+
+The feature contribution (SHAP value) is thus defined as the Shapley
+value $\phi_i(v)$, where
+
+$$v(S) = \mathbb E_{z \sim \mu} (f(z) | z_S = x_S). \qquad (7)$$
+
+So it is a conditional expectation where $z_i$ is clamped for $i \in S$.
+In fact, the definition of feature contributions in this form predates
+Lundberg-Lee 2017. For example, it can be found in
+Strumbelj-Kononenko 2014.
+
+One simplification is to assume the $n$ features are independent, thus
+$\mu = \mu_1 \times \mu_2 \times ... \times \mu_n$. In this case, (7)
+becomes
+
+$$v(S) = \mathbb E_{z_{N \setminus S} \sim \mu_{N \setminus S}} f(x_S, z_{N \setminus S}) \qquad (8)$$
+
+For example, Strumbelj-Kononenko (2010) considers this scenario where
+$\mu$ is the uniform distribution over $X$, see Definition 4 there.
+
+A further simplification is model linearity, which means $f$ is linear.
+In this case, (8) becomes
+
+$$v(S) = f(x_S, \mathbb E_{\mu_{N \setminus S}} z_{N \setminus S}). \qquad (9)$$
+
+It is worth noting that to make the modified LIME model considered in
+the previous section fall under the linear SHAP framework (9), we need
+to make two further specialisations, the first is rather cosmetic: we
+need to change the definition of $h_x(S)$ to
+
+$$(h_x(S))_i =
+\begin{cases}
+x_i, & \text{if }i \in S; \\
+\mathbb E_{\mu_i} z_i, & \text{otherwise.}
+\end{cases}$$
+
+But we also need to boldly assume the original $f$ to be linear, which
+in my view, defeats the purpose of interpretability, because linear
+models are interpretable by themselves.
+
+One may argue that perhaps we do not need linearity to define $v(S)$ as
+in (9). If we do so, however, then (9) loses mathematical meaning. A
+bigger question is: how effective is SHAP? An even bigger question: in
+general, how do we evaluate models of interpretation?
+
+** Evaluating SHAP
+ :PROPERTIES:
+ :CUSTOM_ID: evaluating-shap
+ :END:
+The quest of the SHAP paper can be decoupled into two independent
+components: showing the niceties of Shapley values and choosing the
+coalitional game $v$.
+
+The SHAP paper argues that Shapley values $\phi_i(v)$ are a good
+measurement because they are the only values satisfying the some nice
+properties including the Efficiency property mentioned at the beginning
+of the post, invariance under permutation and monotonicity, see the
+paragraph below Theorem 1 there, which refers to Theorem 2 of Young
+(1985).
+
+Indeed, both efficiency (the "additive feature attribution methods" in
+the paper) and monotonicity are meaningful when considering $\phi_i(v)$
+as the feature contribution of the $i$th feature.
+
+The question is thus reduced to the second component: what constitutes a
+nice choice of $v$?
+
+The SHAP paper answers this question with 3 options with increasing
+simplification: (7)(8)(9) in the previous section of this post
+(corresponding to (9)(11)(12) in the paper). They are intuitive, but it
+will be interesting to see more concrete (or even mathematical)
+justifications of such choices.
+
+** References
+ :PROPERTIES:
+ :CUSTOM_ID: references
+ :END:
+
+- Charnes, A., B. Golany, M. Keane, and J. Rousseau. "Extremal Principle
+ Solutions of Games in Characteristic Function Form: Core, Chebychev
+ and Shapley Value Generalizations." In Econometrics of Planning and
+ Efficiency, edited by Jati K. Sengupta and Gopal K. Kadekodi, 123--33.
+ Dordrecht: Springer Netherlands, 1988.
+ [[https://doi.org/10.1007/978-94-009-3677-5_7]].
+- Lundberg, Scott, and Su-In Lee. "A Unified Approach to Interpreting
+ Model Predictions." ArXiv:1705.07874 [Cs, Stat], May 22, 2017.
+ [[http://arxiv.org/abs/1705.07874]].
+- Ribeiro, Marco Tulio, Sameer Singh, and Carlos Guestrin. "'Why Should
+ I Trust You?': Explaining the Predictions of Any Classifier."
+ ArXiv:1602.04938 [Cs, Stat], February 16, 2016.
+ [[http://arxiv.org/abs/1602.04938]].
+- Shapley, L. S. "17. A Value for n-Person Games." In Contributions to
+ the Theory of Games (AM-28), Volume II, Vol. 2. Princeton: Princeton
+ University Press, 1953. [[https://doi.org/10.1515/9781400881970-018]].
+- Strumbelj, Erik, and Igor Kononenko. "An Efficient Explanation of
+ Individual Classifications Using Game Theory." J. Mach. Learn. Res. 11
+ (March 2010): 1--18.
+- Strumbelj, Erik, and Igor Kononenko. "Explaining Prediction Models and
+ Individual Predictions with Feature Contributions." Knowledge and
+ Information Systems 41, no. 3 (December 2014): 647--65.
+ [[https://doi.org/10.1007/s10115-013-0679-x]].
+- Young, H. P. "Monotonic Solutions of Cooperative Games." International
+ Journal of Game Theory 14, no. 2 (June 1, 1985): 65--72.
+ [[https://doi.org/10.1007/BF01769885]].
diff --git a/posts/2019-01-03-discriminant-analysis.org b/posts/2019-01-03-discriminant-analysis.org
new file mode 100644
index 0000000..34c16bf
--- /dev/null
+++ b/posts/2019-01-03-discriminant-analysis.org
@@ -0,0 +1,293 @@
+#+title: Discriminant analysis
+
+#+DATE: <2019-01-03>
+
+In this post I talk about the theory and implementation of linear and
+quadratic discriminant analysis, classical methods in statistical
+learning.
+
+*Acknowledgement*. Various sources were of great help to my
+understanding of the subject, including Chapter 4 of
+[[https://web.stanford.edu/~hastie/ElemStatLearn/][The Elements of
+Statistical Learning]],
+[[http://cs229.stanford.edu/notes/cs229-notes2.pdf][Stanford CS229
+Lecture notes]], and
+[[https://github.com/scikit-learn/scikit-learn/blob/7389dba/sklearn/discriminant_analysis.py][the
+scikit-learn code]]. Research was done while working at KTH mathematics
+department.
+
+/If you are reading on a mobile device, you may need to "request desktop
+site" for the equations to be properly displayed. This post is licensed
+under CC BY-SA and GNU FDL./
+
+** Theory
+ :PROPERTIES:
+ :CUSTOM_ID: theory
+ :END:
+Quadratic discriminant analysis (QDA) is a classical classification
+algorithm. It assumes that the data is generated by Gaussian
+distributions, where each class has its own mean and covariance.
+
+$$(x | y = i) \sim N(\mu_i, \Sigma_i).$$
+
+It also assumes a categorical class prior:
+
+$$\mathbb P(y = i) = \pi_i$$
+
+The log-likelihood is thus
+
+$$\begin{aligned}
+\log \mathbb P(y = i | x) &= \log \mathbb P(x | y = i) \log \mathbb P(y = i) + C\\
+&= - {1 \over 2} \log \det \Sigma_i - {1 \over 2} (x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) + \log \pi_i + C', \qquad (0)
+\end{aligned}$$
+
+where $C$ and $C'$ are constants.
+
+Thus the prediction is done by taking argmax of the above formula.
+
+In training, let $X$, $y$ be the input data, where $X$ is of shape
+$m \times n$, and $y$ of shape $m$. We adopt the convention that each
+row of $X$ is a sample $x^{(i)T}$. So there are $m$ samples and $n$
+features. Denote by $m_i = \#\{j: y_j = i\}$ be the number of samples in
+class $i$. Let $n_c$ be the number of classes.
+
+We estimate $\mu_i$ by the sample means, and $\pi_i$ by the frequencies:
+
+$$\begin{aligned}
+\mu_i &:= {1 \over m_i} \sum_{j: y_j = i} x^{(j)}, \\
+\pi_i &:= \mathbb P(y = i) = {m_i \over m}.
+\end{aligned}$$
+
+Linear discriminant analysis (LDA) is a specialisation of QDA: it
+assumes all classes share the same covariance, i.e. $\Sigma_i = \Sigma$
+for all $i$.
+
+Guassian Naive Bayes is a different specialisation of QDA: it assumes
+that all $\Sigma_i$ are diagonal, since all the features are assumed to
+be independent.
+
+*** QDA
+ :PROPERTIES:
+ :CUSTOM_ID: qda
+ :END:
+We look at QDA.
+
+We estimate $\Sigma_i$ by the variance mean:
+
+$$\begin{aligned}
+\Sigma_i &= {1 \over m_i - 1} \sum_{j: y_j = i} \hat x^{(j)} \hat x^{(j)T}.
+\end{aligned}$$
+
+where $\hat x^{(j)} = x^{(j)} - \mu_{y_j}$ are the centred $x^{(j)}$.
+Plugging this into (0) we are done.
+
+There are two problems that can break the algorithm. First, if one of
+the $m_i$ is $1$, then $\Sigma_i$ is ill-defined. Second, one of
+$\Sigma_i$'s might be singular.
+
+In either case, there is no way around it, and the implementation should
+throw an exception.
+
+This won't be a problem of the LDA, though, unless there is only one
+sample for each class.
+
+*** Vanilla LDA
+ :PROPERTIES:
+ :CUSTOM_ID: vanilla-lda
+ :END:
+Now let us look at LDA.
+
+Since all classes share the same covariance, we estimate $\Sigma$ using
+sample variance
+
+$$\begin{aligned}
+\Sigma &= {1 \over m - n_c} \sum_j \hat x^{(j)} \hat x^{(j)T},
+\end{aligned}$$
+
+where $\hat x^{(j)} = x^{(j)} - \mu_{y_j}$ and ${1 \over m - n_c}$ comes
+from [[https://en.wikipedia.org/wiki/Bessel%27s_correction][Bessel's
+Correction]].
+
+Let us write down the decision function (0). We can remove the first
+term on the right hand side, since all $\Sigma_i$ are the same, and we
+only care about argmax of that equation. Thus it becomes
+
+$$- {1 \over 2} (x - \mu_i)^T \Sigma^{-1} (x - \mu_i) + \log\pi_i. \qquad (1)$$
+
+Notice that we just walked around the problem of figuring out
+$\log \det \Sigma$ when $\Sigma$ is singular.
+
+But how about $\Sigma^{-1}$?
+
+We sidestep this problem by using the pseudoinverse of $\Sigma$ instead.
+This can be seen as applying a linear transformation to $X$ to turn its
+covariance matrix to identity. And thus the model becomes a sort of a
+nearest neighbour classifier.
+
+*** Nearest neighbour classifier
+ :PROPERTIES:
+ :CUSTOM_ID: nearest-neighbour-classifier
+ :END:
+More specifically, we want to transform the first term of (0) to a norm
+to get a classifier based on nearest neighbour modulo $\log \pi_i$:
+
+$$- {1 \over 2} \|A(x - \mu_i)\|^2 + \log\pi_i$$
+
+To compute $A$, we denote
+
+$$X_c = X - M,$$
+
+where the $i$th row of $M$ is $\mu_{y_i}^T$, the mean of the class $x_i$
+belongs to, so that $\Sigma = {1 \over m - n_c} X_c^T X_c$.
+
+Let
+
+$${1 \over \sqrt{m - n_c}} X_c = U_x \Sigma_x V_x^T$$
+
+be the SVD of ${1 \over \sqrt{m - n_c}}X_c$. Let
+$D_x = \text{diag} (s_1, ..., s_r)$ be the diagonal matrix with all the
+nonzero singular values, and rewrite $V_x$ as an $n \times r$ matrix
+consisting of the first $r$ columns of $V_x$.
+
+Then with an abuse of notation, the pseudoinverse of $\Sigma$ is
+
+$$\Sigma^{-1} = V_x D_x^{-2} V_x^T.$$
+
+So we just need to make $A = D_x^{-1} V_x^T$. When it comes to
+prediction, just transform $x$ with $A$, and find the nearest centroid
+$A \mu_i$ (again, modulo $\log \pi_i$) and label the input with $i$.
+
+*** Dimensionality reduction
+ :PROPERTIES:
+ :CUSTOM_ID: dimensionality-reduction
+ :END:
+We can further simplify the prediction by dimensionality reduction.
+Assume $n_c \le n$. Then the centroid spans an affine space of dimension
+$p$ which is at most $n_c - 1$. So what we can do is to project both the
+transformed sample $Ax$ and centroids $A\mu_i$ to the linear subspace
+parallel to the affine space, and do the nearest neighbour
+classification there.
+
+So we can perform SVD on the matrix $(M - \bar x) V_x D_x^{-1}$ where
+$\bar x$, a row vector, is the sample mean of all data i.e. average of
+rows of $X$:
+
+$$(M - \bar x) V_x D_x^{-1} = U_m \Sigma_m V_m^T.$$
+
+Again, we let $V_m$ be the $r \times p$ matrix by keeping the first $p$
+columns of $V_m$.
+
+The projection operator is thus $V_m$. And so the final transformation
+is $V_m^T D_x^{-1} V_x^T$.
+
+There is no reason to stop here, and we can set $p$ even smaller, which
+will result in a lossy compression / regularisation equivalent to doing
+[[https://en.wikipedia.org/wiki/Principal_component_analysis][principle
+component analysis]] on $(M - \bar x) V_x D_x^{-1}$.
+
+Note that as of 2019-01-04, in the
+[[https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/discriminant_analysis.py][scikit-learn
+implementation of LDA]], the prediction is done without any lossy
+compression, even if the parameter =n_components= is set to be smaller
+than dimension of the affine space spanned by the centroids. In other
+words, the prediction does not change regardless of =n_components=.
+
+*** Fisher discriminant analysis
+ :PROPERTIES:
+ :CUSTOM_ID: fisher-discriminant-analysis
+ :END:
+The Fisher discriminant analysis involves finding an $n$-dimensional
+vector $a$ that maximises between-class covariance with respect to
+within-class covariance:
+
+$${a^T M_c^T M_c a \over a^T X_c^T X_c a},$$
+
+where $M_c = M - \bar x$ is the centred sample mean matrix.
+
+As it turns out, this is (almost) equivalent to the derivation above,
+modulo a constant. In particular, $a = c V_m^T D_x^{-1} V_x^T$ where
+$p = 1$ for arbitrary constant $c$.
+
+To see this, we can first multiply the denominator with a constant
+${1 \over m - n_c}$ so that the matrix in the denominator becomes the
+covariance estimate $\Sigma$.
+
+We decompose $a$: $a = V_x D_x^{-1} b + \tilde V_x \tilde b$, where
+$\tilde V_x$ consists of column vectors orthogonal to the column space
+of $V_x$.
+
+We ignore the second term in the decomposition. In other words, we only
+consider $a$ in the column space of $V_x$.
+
+Then the problem is to find an $r$-dimensional vector $b$ to maximise
+
+$${b^T (M_c V_x D_x^{-1})^T (M_c V_x D_x^{-1}) b \over b^T b}.$$
+
+This is the problem of principle component analysis, and so $b$ is first
+column of $V_m$.
+
+Therefore, the solution to Fisher discriminant analysis is
+$a = c V_x D_x^{-1} V_m$ with $p = 1$.
+
+*** Linear model
+ :PROPERTIES:
+ :CUSTOM_ID: linear-model
+ :END:
+The model is called linear discriminant analysis because it is a linear
+model. To see this, let $B = V_m^T D_x^{-1} V_x^T$ be the matrix of
+transformation. Now we are comparing
+
+$$- {1 \over 2} \| B x - B \mu_k\|^2 + \log \pi_k$$
+
+across all $k$s. Expanding the norm and removing the common term
+$\|B x\|^2$, we see a linear form:
+
+$$\mu_k^T B^T B x - {1 \over 2} \|B \mu_k\|^2 + \log\pi_k$$
+
+So the prediction for $X_{\text{new}}$ is
+
+$$\text{argmax}_{\text{axis}=0} \left(K B^T B X_{\text{new}}^T - {1 \over 2} \|K B^T\|_{\text{axis}=1}^2 + \log \pi\right)$$
+
+thus the decision boundaries are linear.
+
+This is how scikit-learn implements LDA, by inheriting from
+=LinearClassifierMixin= and redirecting the classification there.
+
+** Implementation
+ :PROPERTIES:
+ :CUSTOM_ID: implementation
+ :END:
+This is where things get interesting. How do I validate my understanding
+of the theory? By implementing and testing the algorithm.
+
+I try to implement it as close as possible to the natural language /
+mathematical descriptions of the model, which means clarity over
+performance.
+
+How about testing? Numerical experiments are harder to test than
+combinatorial / discrete algorithms in general because the output is
+less verifiable by hand. My shortcut solution to this problem is to test
+against output from the scikit-learn package.
+
+It turned out to be harder than expected, as I had to dig into the code
+of scikit-learn when the outputs mismatch. Their code is quite
+well-written though.
+
+The result is
+[[https://github.com/ycpei/machine-learning/tree/master/discriminant-analysis][here]].
+
+*** Fun facts about LDA
+ :PROPERTIES:
+ :CUSTOM_ID: fun-facts-about-lda
+ :END:
+One property that can be used to test the LDA implementation is the fact
+that the scatter matrix $B(X - \bar x)^T (X - \bar X) B^T$ of the
+transformed centred sample is diagonal.
+
+This can be derived by using another fun fact that the sum of the
+in-class scatter matrix and the between-class scatter matrix is the
+sample scatter matrix:
+
+$$X_c^T X_c + M_c^T M_c = (X - \bar x)^T (X - \bar x) = (X_c + M_c)^T (X_c + M_c).$$
+
+The verification is not very hard and left as an exercise.
diff --git a/posts/2019-02-14-raise-your-elbo.org b/posts/2019-02-14-raise-your-elbo.org
new file mode 100644
index 0000000..9e15552
--- /dev/null
+++ b/posts/2019-02-14-raise-your-elbo.org
@@ -0,0 +1,1150 @@
+#+title: Raise your ELBO
+
+#+date: <2019-02-14>
+
+In this post I give an introduction to variational inference, which is
+about maximising the evidence lower bound (ELBO).
+
+I use a top-down approach, starting with the KL divergence and the ELBO,
+to lay the mathematical framework of all the models in this post.
+
+Then I define mixture models and the EM algorithm, with Gaussian mixture
+model (GMM), probabilistic latent semantic analysis (pLSA) and the
+hidden markov model (HMM) as examples.
+
+After that I present the fully Bayesian version of EM, also known as
+mean field approximation (MFA), and apply it to fully Bayesian mixture
+models, with fully Bayesian GMM (also known as variational GMM), latent
+Dirichlet allocation (LDA) and Dirichlet process mixture model (DPMM) as
+examples.
+
+Then I explain stochastic variational inference, a modification of EM
+and MFA to improve efficiency.
+
+Finally I talk about autoencoding variational Bayes (AEVB), a
+Monte-Carlo + neural network approach to raising the ELBO, exemplified
+by the variational autoencoder (VAE). I also show its fully Bayesian
+version.
+
+*Acknowledgement*. The following texts and resources were illuminating
+during the writing of this post: the Stanford CS228 notes
+([[https://ermongroup.github.io/cs228-notes/inference/variational/][1]],[[https://ermongroup.github.io/cs228-notes/learning/latent/][2]]),
+the
+[[https://www.cs.tau.ac.il/~rshamir/algmb/presentations/EM-BW-Ron-16%20.pdf][Tel
+Aviv Algorithms in Molecular Biology slides]] (clear explanations of the
+connection between EM and Baum-Welch), Chapter 10 of
+[[https://www.springer.com/us/book/9780387310732][Bishop's book]]
+(brilliant introduction to variational GMM), Section 2.5 of
+[[http://cs.brown.edu/~sudderth/papers/sudderthPhD.pdf][Sudderth's
+thesis]] and [[https://metacademy.org][metacademy]]. Also thanks to
+Josef Lindman Hörnlund for discussions. The research was done while
+working at KTH mathematics department.
+
+/If you are reading on a mobile device, you may need to "request desktop
+site" for the equations to be properly displayed. This post is licensed
+under CC BY-SA and GNU FDL./
+
+** KL divergence and ELBO
+ :PROPERTIES:
+ :CUSTOM_ID: kl-divergence-and-elbo
+ :END:
+Let $p$ and $q$ be two probability measures. The Kullback-Leibler (KL)
+divergence is defined as
+
+$$D(q||p) = E_q \log{q \over p}.$$
+
+It achieves minimum $0$ when $p = q$.
+
+If $p$ can be further written as
+
+$$p(x) = {w(x) \over Z}, \qquad (0)$$
+
+where $Z$ is a normaliser, then
+
+$$\log Z = D(q||p) + L(w, q), \qquad(1)$$
+
+where $L(w, q)$ is called the evidence lower bound (ELBO), defined by
+
+$$L(w, q) = E_q \log{w \over q}. \qquad (1.25)$$
+
+From (1), we see that to minimise the nonnegative term $D(q || p)$, one
+can maximise the ELBO.
+
+To this end, we can simply discard $D(q || p)$ in (1) and obtain:
+
+$$\log Z \ge L(w, q) \qquad (1.3)$$
+
+and keep in mind that the inequality becomes an equality when
+$q = {w \over Z}$.
+
+It is time to define the task of variational inference (VI), also known
+as variational Bayes (VB).
+
+*Definition*. Variational inference is concerned with maximising the
+ELBO $L(w, q)$.
+
+There are mainly two versions of VI, the half Bayesian and the fully
+Bayesian cases. Half Bayesian VI, to which expectation-maximisation
+algorithms (EM) apply, instantiates (1.3) with
+
+$$\begin{aligned}
+Z &= p(x; \theta)\\
+w &= p(x, z; \theta)\\
+q &= q(z)
+\end{aligned}$$
+
+and the dummy variable $x$ in Equation (0) is substituted with $z$.
+
+Fully Bayesian VI, often just called VI, has the following
+instantiations:
+
+$$\begin{aligned}
+Z &= p(x) \\
+w &= p(x, z, \theta) \\
+q &= q(z, \theta)
+\end{aligned}$$
+
+and $x$ in Equation (0) is substituted with $(z, \theta)$.
+
+In both cases $\theta$ are parameters and $z$ are latent variables.
+
+*Remark on the naming of things*. The term "variational" comes from the
+fact that we perform calculus of variations: maximise some functional
+($L(w, q)$) over a set of functions ($q$). Note however, most of the VI
+/ VB algorithms do not concern any techniques in calculus of variations,
+but only uses Jensen's inequality / the fact the $D(q||p)$ reaches
+minimum when $p = q$. Due to this reasoning of the naming, EM is also a
+kind of VI, even though in the literature VI often referes to its fully
+Bayesian version.
+
+** EM
+ :PROPERTIES:
+ :CUSTOM_ID: em
+ :END:
+To illustrate the EM algorithms, we first define the mixture model.
+
+*Definition (mixture model)*. Given dataset $x_{1 : m}$, we assume the
+data has some underlying latent variable $z_{1 : m}$ that may take a
+value from a finite set $\{1, 2, ..., n_z\}$. Let $p(z_{i}; \pi)$ be
+categorically distributed according to the probability vector $\pi$.
+That is, $p(z_{i} = k; \pi) = \pi_k$. Also assume
+$p(x_{i} | z_{i} = k; \eta) = p(x_{i}; \eta_k)$. Find
+$\theta = (\pi, \eta)$ that maximises the likelihood
+$p(x_{1 : m}; \theta)$.
+
+Represented as a DAG (a.k.a the plate notations), the model looks like
+this:
+
+[[/assets/resources/mixture-model.png]]
+
+where the boxes with $m$ mean repitition for $m$ times, since there $m$
+indepdent pairs of $(x, z)$, and the same goes for $\eta$.
+
+The direct maximisation
+
+$$\max_\theta \sum_i \log p(x_{i}; \theta) = \max_\theta \sum_i \log \int p(x_{i} | z_i; \theta) p(z_i; \theta) dz_i$$
+
+is hard because of the integral in the log.
+
+We can fit this problem in (1.3) by having $Z = p(x_{1 : m}; \theta)$
+and $w = p(z_{1 : m}, x_{1 : m}; \theta)$. The plan is to update
+$\theta$ repeatedly so that $L(p(z, x; \theta_t), q(z))$ is non
+decreasing over time $t$.
+
+Equation (1.3) at time $t$ for the $i$th datapoint is
+
+$$\log p(x_{i}; \theta_t) \ge L(p(z_i, x_{i}; \theta_t), q(z_i)) \qquad (2)$$
+
+Each timestep consists of two steps, the E-step and the M-step.
+
+At E-step, we set
+
+$$q(z_{i}) = p(z_{i}|x_{i}; \theta_t), $$
+
+to turn the inequality into equality. We denote $r_{ik} = q(z_i = k)$
+and call them responsibilities, so the posterior $q(z_i)$ is categorical
+distribution with parameter $r_i = r_{i, 1 : n_z}$.
+
+At M-step, we maximise $\sum_i L(p(x_{i}, z_{i}; \theta), q(z_{i}))$
+over $\theta$:
+
+$$\begin{aligned}
+\theta_{t + 1} &= \text{argmax}_\theta \sum_i L(p(x_{i}, z_{i}; \theta), p(z_{i} | x_{i}; \theta_t)) \\
+&= \text{argmax}_\theta \sum_i \mathbb E_{p(z_{i} | x_{i}; \theta_t)} \log p(x_{i}, z_{i}; \theta) \qquad (2.3)
+\end{aligned}$$
+
+So $\sum_i L(p(x_{i}, z_{i}; \theta), q(z_i))$ is non-decreasing at both
+the E-step and the M-step.
+
+We can see from this derivation that EM is half-Bayesian. The E-step is
+Bayesian it computes the posterior of the latent variables and the
+M-step is frequentist because it performs maximum likelihood estimate of
+$\theta$.
+
+It is clear that the ELBO sum coverges as it is nondecreasing with an
+upper bound, but it is not clear whether the sum converges to the
+correct value, i.e. $\max_\theta p(x_{1 : m}; \theta)$. In fact it is
+said that the EM does get stuck in local maximum sometimes.
+
+A different way of describing EM, which will be useful in hidden Markov
+model is:
+
+- At E-step, one writes down the formula
+ $$\sum_i \mathbb E_{p(z_i | x_{i}; \theta_t)} \log p(x_{i}, z_i; \theta). \qquad (2.5)$$
+
+- At M-setp, one finds $\theta_{t + 1}$ to be the $\theta$ that
+ maximises the above formula.
+
+*** GMM
+ :PROPERTIES:
+ :CUSTOM_ID: gmm
+ :END:
+Gaussian mixture model (GMM) is an example of mixture models.
+
+The space of the data is $\mathbb R^n$. We use the hypothesis that the
+data is Gaussian conditioned on the latent variable:
+
+$$(x_i; \eta_k) \sim N(\mu_k, \Sigma_k),$$
+
+so we write $\eta_k = (\mu_k, \Sigma_k)$,
+
+During E-step, the $q(z_i)$ can be directly computed using Bayes'
+theorem:
+
+$$r_{ik} = q(z_i = k) = \mathbb P(z_i = k | x_{i}; \theta_t)
+= {g_{\mu_{t, k}, \Sigma_{t, k}} (x_{i}) \pi_{t, k} \over \sum_{j = 1 : n_z} g_{\mu_{t, j}, \Sigma_{t, j}} (x_{i}) \pi_{t, j}},$$
+
+where
+$g_{\mu, \Sigma} (x) = (2 \pi)^{- n / 2} \det \Sigma^{-1 / 2} \exp(- {1 \over 2} (x - \mu)^T \Sigma^{-1} (x - \mu))$
+is the pdf of the Gaussian distribution $N(\mu, \Sigma)$.
+
+During M-step, we need to compute
+
+$$\text{argmax}_{\Sigma, \mu, \pi} \sum_{i = 1 : m} \sum_{k = 1 : n_z} r_{ik} \log (g_{\mu_k, \Sigma_k}(x_{i}) \pi_k).$$
+
+This is similar to the quadratic discriminant analysis, and the solution
+is
+
+$$\begin{aligned}
+\pi_{k} &= {1 \over m} \sum_{i = 1 : m} r_{ik}, \\
+\mu_{k} &= {\sum_i r_{ik} x_{i} \over \sum_i r_{ik}}, \\
+\Sigma_{k} &= {\sum_i r_{ik} (x_{i} - \mu_{t, k}) (x_{i} - \mu_{t, k})^T \over \sum_i r_{ik}}.
+\end{aligned}$$
+
+*Remark*. The k-means algorithm is the $\epsilon \to 0$ limit of the GMM
+with constraints $\Sigma_k = \epsilon I$. See Section 9.3.2 of Bishop
+2006 for derivation. It is also briefly mentioned there that a variant
+in this setting where the covariance matrix is not restricted to
+$\epsilon I$ is called elliptical k-means algorithm.
+
+*** SMM
+ :PROPERTIES:
+ :CUSTOM_ID: smm
+ :END:
+As a transition to the next models to study, let us consider a simpler
+mixture model obtained by making one modification to GMM: change
+$(x; \eta_k) \sim N(\mu_k, \Sigma_k)$ to
+$\mathbb P(x = w; \eta_k) = \eta_{kw}$ where $\eta$ is a stochastic
+matrix and $w$ is an arbitrary element of the space for $x$. So now the
+space for both $x$ and $z$ are finite. We call this model the simple
+mixture model (SMM).
+
+As in GMM, at E-step $r_{ik}$ can be explicitly computed using Bayes'
+theorem.
+
+It is not hard to write down the solution to the M-step in this case:
+
+$$\begin{aligned}
+\pi_{k} &= {1 \over m} \sum_i r_{ik}, \qquad (2.7)\\
+\eta_{k, w} &= {\sum_i r_{ik} 1_{x_i = w} \over \sum_i r_{ik}}. \qquad (2.8)
+\end{aligned}$$
+
+where $1_{x_i = w}$ is the
+[[https://en.wikipedia.org/wiki/Indicator_function][indicator
+function]], and evaluates to $1$ if $x_i = w$ and $0$ otherwise.
+
+Two trivial variants of the SMM are the two versions of probabilistic
+latent semantic analysis (pLSA), which we call pLSA1 and pLSA2.
+
+The model pLSA1 is a probabilistic version of latent semantic analysis,
+which is basically a simple matrix factorisation model in collaborative
+filtering, whereas pLSA2 has a fully Bayesian version called latent
+Dirichlet allocation (LDA), not to be confused with the other LDA
+(linear discriminant analysis).
+
+*** pLSA
+ :PROPERTIES:
+ :CUSTOM_ID: plsa
+ :END:
+The pLSA model (Hoffman 2000) is a mixture model, where the dataset is
+now pairs $(d_i, x_i)_{i = 1 : m}$. In natural language processing, $x$
+are words and $d$ are documents, and a pair $(d, x)$ represent an
+ocurrance of word $x$ in document $d$.
+
+For each datapoint $(d_{i}, x_{i})$,
+
+$$\begin{aligned}
+p(d_i, x_i; \theta) &= \sum_{z_i} p(z_i; \theta) p(d_i | z_i; \theta) p(x_i | z_i; \theta) \qquad (2.91)\\
+&= p(d_i; \theta) \sum_z p(x_i | z_i; \theta) p (z_i | d_i; \theta) \qquad (2.92).
+\end{aligned}$$
+
+Of the two formulations, (2.91) corresponds to pLSA type 1, and (2.92)
+corresponds to type 2.
+
+**** pLSA1
+ :PROPERTIES:
+ :CUSTOM_ID: plsa1
+ :END:
+The pLSA1 model (Hoffman 2000) is basically SMM with $x_i$ substituted
+with $(d_i, x_i)$, which conditioned on $z_i$ are independently
+categorically distributed:
+
+$$p(d_i = u, x_i = w | z_i = k; \theta) = p(d_i ; \xi_k) p(x_i; \eta_k) = \xi_{ku} \eta_{kw}.$$
+
+The model can be illustrated in the plate notations:
+
+[[/assets/resources/plsa1.png]]
+
+So the solution of the M-step is
+
+$$\begin{aligned}
+\pi_{k} &= {1 \over m} \sum_i r_{ik} \\
+\xi_{k, u} &= {\sum_i r_{ik} 1_{d_{i} = u} \over \sum_i r_{ik}} \\
+\eta_{k, w} &= {\sum_i r_{ik} 1_{x_{i} = w} \over \sum_i r_{ik}}.
+\end{aligned}$$
+
+*Remark*. pLSA1 is the probabilistic version of LSA, also known as
+matrix factorisation.
+
+Let $n_d$ and $n_x$ be the number of values $d_i$ and $x_i$ can take.
+
+*Problem* (LSA). Let $R$ be a $n_d \times n_x$ matrix, fix
+$s \le \min\{n_d, n_x\}$. Find $n_d \times s$ matrix $D$ and
+$n_x \times s$ matrix $X$ that minimises
+
+$$J(D, X) = \|R - D X^T\|_F.$$
+
+where $\|\cdot\|_F$ is the Frobenius norm.
+
+*Claim*. Let $R = U \Sigma V^T$ be the SVD of $R$, then the solution to
+the above problem is $D = U_s \Sigma_s^{{1 \over 2}}$ and
+$X = V_s \Sigma_s^{{1 \over 2}}$, where $U_s$ (resp. $V_s$) is the
+matrix of the first $s$ columns of $U$ (resp. $V$) and $\Sigma_s$ is the
+$s \times s$ submatrix of $\Sigma$.
+
+One can compare pLSA1 with LSA. Both procedures produce embeddings of
+$d$ and $x$: in pLSA we obtain $n_z$ dimensional embeddings
+$\xi_{\cdot, u}$ and $\eta_{\cdot, w}$, whereas in LSA we obtain $s$
+dimensional embeddings $D_{u, \cdot}$ and $X_{w, \cdot}$.
+
+**** pLSA2
+ :PROPERTIES:
+ :CUSTOM_ID: plsa2
+ :END:
+Let us turn to pLSA2 (Hoffman 2004), corresponding to (2.92). We rewrite
+it as
+
+$$p(x_i | d_i; \theta) = \sum_{z_i} p(x_i | z_i; \theta) p(z_i | d_i; \theta).$$
+
+To simplify notations, we collect all the $x_i$s with the corresponding
+$d_i$ equal to 1 (suppose there are $m_1$ of them), and write them as
+$(x_{1, j})_{j = 1 : m_1}$. In the same fashion we construct
+$x_{2, 1 : m_2}, x_{3, 1 : m_3}, ... x_{n_d, 1 : m_{n_d}}$. Similarly,
+we relabel the corresponding $d_i$ and $z_i$ accordingly.
+
+With almost no loss of generality, we assume all $m_\ell$s are equal and
+write them as $m$.
+
+Now the model becomes
+
+$$p(x_{\ell, i} | d_{\ell, i} = \ell; \theta) = \sum_k p(x_{\ell, i} | z_{\ell, i} = k; \theta) p(z_{\ell, i} = k | d_{\ell, i} = \ell; \theta).$$
+
+Since we have regrouped the $x$'s and $z$'s whose indices record the
+values of the $d$'s, we can remove the $d$'s from the equation
+altogether:
+
+$$p(x_{\ell, i}; \theta) = \sum_k p(x_{\ell, i} | z_{\ell, i} = k; \theta) p(z_{\ell, i} = k; \theta).$$
+
+It is effectively a modification of SMM by making $n_d$ copies of $\pi$.
+More specifically the parameters are
+$\theta = (\pi_{1 : n_d, 1 : n_z}, \eta_{1 : n_z, 1 : n_x})$, where we
+model $(z | d = \ell) \sim \text{Cat}(\pi_{\ell, \cdot})$ and, as in
+pLSA1, $(x | z = k) \sim \text{Cat}(\eta_{k, \cdot})$.
+
+Illustrated in the plate notations, pLSA2 is:
+
+[[/assets/resources/plsa2.png]]
+
+The computation is basically adding an index $\ell$ to the computation
+of SMM wherever applicable.
+
+The updates at the E-step is
+
+$$r_{\ell i k} = p(z_{\ell i} = k | x_{\ell i}; \theta) \propto \pi_{\ell k} \eta_{k, x_{\ell i}}.$$
+
+And at the M-step
+
+$$\begin{aligned}
+\pi_{\ell k} &= {1 \over m} \sum_i r_{\ell i k} \\
+\eta_{k w} &= {\sum_{\ell, i} r_{\ell i k} 1_{x_{\ell i} = w} \over \sum_{\ell, i} r_{\ell i k}}.
+\end{aligned}$$
+
+*** HMM
+ :PROPERTIES:
+ :CUSTOM_ID: hmm
+ :END:
+The hidden markov model (HMM) is a sequential version of SMM, in the
+same sense that recurrent neural networks are sequential versions of
+feed-forward neural networks.
+
+HMM is an example where the posterior $p(z_i | x_i; \theta)$ is not easy
+to compute, and one has to utilise properties of the underlying Bayesian
+network to go around it.
+
+Now each sample is a sequence $x_i = (x_{ij})_{j = 1 : T}$, and so are
+the latent variables $z_i = (z_{ij})_{j = 1 : T}$.
+
+The latent variables are assumed to form a Markov chain with transition
+matrix $(\xi_{k \ell})_{k \ell}$, and $x_{ij}$ is completely dependent
+on $z_{ij}$:
+
+$$\begin{aligned}
+p(z_{ij} | z_{i, j - 1}) &= \xi_{z_{i, j - 1}, z_{ij}},\\
+p(x_{ij} | z_{ij}) &= \eta_{z_{ij}, x_{ij}}.
+\end{aligned}$$
+
+Also, the distribution of $z_{i1}$ is again categorical with parameter
+$\pi$:
+
+$$p(z_{i1}) = \pi_{z_{i1}}$$
+
+So the parameters are $\theta = (\pi, \xi, \eta)$. And HMM can be shown
+in plate notations as:
+
+[[/assets/resources/hmm.png]]
+
+Now we apply EM to HMM, which is called the
+[[https://en.wikipedia.org/wiki/Baum%E2%80%93Welch_algorithm][Baum-Welch
+algorithm]]. Unlike the previous examples, it is too messy to compute
+$p(z_i | x_{i}; \theta)$, so during the E-step we instead write down
+formula (2.5) directly in hope of simplifying it:
+
+$$\begin{aligned}
+\mathbb E_{p(z_i | x_i; \theta_t)} \log p(x_i, z_i; \theta_t) &=\mathbb E_{p(z_i | x_i; \theta_t)} \left(\log \pi_{z_{i1}} + \sum_{j = 2 : T} \log \xi_{z_{i, j - 1}, z_{ij}} + \sum_{j = 1 : T} \log \eta_{z_{ij}, x_{ij}}\right). \qquad (3)
+\end{aligned}$$
+
+Let us compute the summand in second term:
+
+$$\begin{aligned}
+\mathbb E_{p(z_i | x_{i}; \theta_t)} \log \xi_{z_{i, j - 1}, z_{ij}} &= \sum_{k, \ell} (\log \xi_{k, \ell}) \mathbb E_{p(z_{i} | x_{i}; \theta_t)} 1_{z_{i, j - 1} = k, z_{i, j} = \ell} \\
+&= \sum_{k, \ell} p(z_{i, j - 1} = k, z_{ij} = \ell | x_{i}; \theta_t) \log \xi_{k, \ell}. \qquad (4)
+\end{aligned}$$
+
+Similarly, one can write down the first term and the summand in the
+third term to obtain
+
+$$\begin{aligned}
+\mathbb E_{p(z_i | x_{i}; \theta_t)} \log \pi_{z_{i1}} &= \sum_k p(z_{i1} = k | x_{i}; \theta_t), \qquad (5) \\
+\mathbb E_{p(z_i | x_{i}; \theta_t)} \log \eta_{z_{i, j}, x_{i, j}} &= \sum_{k, w} 1_{x_{ij} = w} p(z_{i, j} = k | x_i; \theta_t) \log \eta_{k, w}. \qquad (6)
+\end{aligned}$$
+
+plugging (4)(5)(6) back into (3) and summing over $j$, we obtain the
+formula to maximise over $\theta$ on:
+
+$$\sum_k \sum_i r_{i1k} \log \pi_k + \sum_{k, \ell} \sum_{j = 2 : T, i} s_{ijk\ell} \log \xi_{k, \ell} + \sum_{k, w} \sum_{j = 1 : T, i} r_{ijk} 1_{x_{ij} = w} \log \eta_{k, w},$$
+
+where
+
+$$\begin{aligned}
+r_{ijk} &:= p(z_{ij} = k | x_{i}; \theta_t), \\
+s_{ijk\ell} &:= p(z_{i, j - 1} = k, z_{ij} = \ell | x_{i}; \theta_t).
+\end{aligned}$$
+
+Now we proceed to the M-step. Since each of the
+$\pi_k, \xi_{k, \ell}, \eta_{k, w}$ is nicely confined in the inner sum
+of each term, together with the constraint
+$\sum_k \pi_k = \sum_\ell \xi_{k, \ell} = \sum_w \eta_{k, w} = 1$ it is
+not hard to find the argmax at time $t + 1$ (the same way one finds the
+MLE for any categorical distribution):
+
+$$\begin{aligned}
+\pi_{k} &= {1 \over m} \sum_i r_{i1k}, \qquad (6.1) \\
+\xi_{k, \ell} &= {\sum_{j = 2 : T, i} s_{ijk\ell} \over \sum_{j = 1 : T - 1, i} r_{ijk}}, \qquad(6.2) \\
+\eta_{k, w} &= {\sum_{ij} 1_{x_{ij} = w} r_{ijk} \over \sum_{ij} r_{ijk}}. \qquad(6.3)
+\end{aligned}$$
+
+Note that (6.1)(6.3) are almost identical to (2.7)(2.8). This makes
+sense as the only modification HMM makes over SMM is the added
+dependencies between the latent variables.
+
+What remains is to compute $r$ and $s$.
+
+This is done by using the forward and backward procedures which takes
+advantage of the conditioned independence / topology of the underlying
+Bayesian network. It is out of scope of this post, but for the sake of
+completeness I include it here.
+
+Let
+
+$$\begin{aligned}
+\alpha_k(i, j) &:= p(x_{i, 1 : j}, z_{ij} = k; \theta_t), \\
+\beta_k(i, j) &:= p(x_{i, j + 1 : T} | z_{ij} = k; \theta_t).
+\end{aligned}$$
+
+They can be computed recursively as
+
+$$\begin{aligned}
+\alpha_k(i, j) &= \begin{cases}
+\eta_{k, x_{1j}} \pi_k, & j = 1; \\
+\eta_{k, x_{ij}} \sum_\ell \alpha_\ell(j - 1, i) \xi_{k\ell}, & j \ge 2.
+\end{cases}\\
+\beta_k(i, j) &= \begin{cases}
+1, & j = T;\\
+\sum_\ell \xi_{k\ell} \beta_\ell(j + 1, i) \eta_{\ell, x_{i, j + 1}}, & j < T.
+\end{cases}
+\end{aligned}$$
+
+Then
+
+$$\begin{aligned}
+p(z_{ij} = k, x_{i}; \theta_t) &= \alpha_k(i, j) \beta_k(i, j), \qquad (7)\\
+p(x_{i}; \theta_t) &= \sum_k \alpha_k(i, j) \beta_k(i, j),\forall j = 1 : T \qquad (8)\\
+p(z_{i, j - 1} = k, z_{i, j} = \ell, x_{i}; \theta_t) &= \alpha_k(i, j) \xi_{k\ell} \beta_\ell(i, j + 1) \eta_{\ell, x_{j + 1, i}}. \qquad (9)
+\end{aligned}$$
+
+And this yields $r_{ijk}$ and $s_{ijk\ell}$ since they can be computed
+as ${(7) \over (8)}$ and ${(9) \over (8)}$ respectively.
+
+** Fully Bayesian EM / MFA
+ :PROPERTIES:
+ :CUSTOM_ID: fully-bayesian-em-mfa
+ :END:
+Let us now venture into the realm of full Bayesian.
+
+In EM we aim to maximise the ELBO
+
+$$\int q(z) \log {p(x, z; \theta) \over q(z)} dz d\theta$$
+
+alternately over $q$ and $\theta$. As mentioned before, the E-step of
+maximising over $q$ is Bayesian, in that it computes the posterior of
+$z$, whereas the M-step of maximising over $\theta$ is maximum
+likelihood and frequentist.
+
+The fully Bayesian EM makes the M-step Bayesian by making $\theta$ a
+random variable, so the ELBO becomes
+
+$$L(p(x, z, \theta), q(z, \theta)) = \int q(z, \theta) \log {p(x, z, \theta) \over q(z, \theta)} dz d\theta$$
+
+We further assume $q$ can be factorised into distributions on $z$ and
+$\theta$: $q(z, \theta) = q_1(z) q_2(\theta)$. So the above formula is
+rewritten as
+
+$$L(p(x, z, \theta), q(z, \theta)) = \int q_1(z) q_2(\theta) \log {p(x, z, \theta) \over q_1(z) q_2(\theta)} dz d\theta$$
+
+To find argmax over $q_1$, we rewrite
+
+$$\begin{aligned}
+L(p(x, z, \theta), q(z, \theta)) &= \int q_1(z) \left(\int q_2(\theta) \log p(x, z, \theta) d\theta\right) dz - \int q_1(z) \log q_1(z) dz - \int q_2(\theta) \log q_2(\theta) \\&= - D(q_1(z) || p_x(z)) + C,
+\end{aligned}$$
+
+where $p_x$ is a density in $z$ with
+
+$$\log p_x(z) = \mathbb E_{q_2(\theta)} \log p(x, z, \theta) + C.$$
+
+So the $q_1$ that maximises the ELBO is $q_1^* = p_x$.
+
+Similarly, the optimal $q_2$ is such that
+
+$$\log q_2^*(\theta) = \mathbb E_{q_1(z)} \log p(x, z, \theta) + C.$$
+
+The fully Bayesian EM thus alternately evaluates $q_1^*$ (E-step) and
+$q_2^*$ (M-step).
+
+It is also called mean field approximation (MFA), and can be easily
+generalised to models with more than two groups of latent variables, see
+e.g. Section 10.1 of Bishop 2006.
+
+*** Application to mixture models
+ :PROPERTIES:
+ :CUSTOM_ID: application-to-mixture-models
+ :END:
+*Definition (Fully Bayesian mixture model)*. The relations between
+$\pi$, $\eta$, $x$, $z$ are the same as in the definition of mixture
+models. Furthermore, we assume the distribution of $(x | \eta_k)$
+belongs to the
+[[https://en.wikipedia.org/wiki/Exponential_family][exponential family]]
+(the definition of the exponential family is briefly touched at the end
+of this section). But now both $\pi$ and $\eta$ are random variables.
+Let the prior distribution $p(\pi)$ is Dirichlet with parameter
+$(\alpha, \alpha, ..., \alpha)$. Let the prior $p(\eta_k)$ be the
+conjugate prior of $(x | \eta_k)$, with parameter $\beta$, we will see
+later in this section that the posterior $q(\eta_k)$ belongs to the same
+family as $p(\eta_k)$. Represented in a plate notations, a fully
+Bayesian mixture model looks like:
+
+[[/assets/resources/fully-bayesian-mm.png]]
+
+Given this structure we can write down the mean-field approximation:
+
+$$\log q(z) = \mathbb E_{q(\eta)q(\pi)} (\log(x | z, \eta) + \log(z | \pi)) + C.$$
+
+Both sides can be factored into per-sample expressions, giving us
+
+$$\log q(z_i) = \mathbb E_{q(\eta)} \log p(x_i | z_i, \eta) + \mathbb E_{q(\pi)} \log p(z_i | \pi) + C$$
+
+Therefore
+
+$$\log r_{ik} = \log q(z_i = k) = \mathbb E_{q(\eta_k)} \log p(x_i | \eta_k) + \mathbb E_{q(\pi)} \log \pi_k + C. \qquad (9.1)$$
+
+So the posterior of each $z_i$ is categorical regardless of the $p$s and
+$q$s.
+
+Computing the posterior of $\pi$ and $\eta$:
+
+$$\log q(\pi) + \log q(\eta) = \log p(\pi) + \log p(\eta) + \sum_i \mathbb E_{q(z_i)} p(x_i | z_i, \eta) + \sum_i \mathbb E_{q(z_i)} p(z_i | \pi) + C.$$
+
+So we can separate the terms involving $\pi$ and those involving $\eta$.
+First compute the posterior of $\pi$:
+
+$$\log q(\pi) = \log p(\pi) + \sum_i \mathbb E_{q(z_i)} \log p(z_i | \pi) = \log p(\pi) + \sum_i \sum_k r_{ik} \log \pi_k + C.$$
+
+The right hand side is the log of a Dirichlet modulus the constant $C$,
+from which we can update the posterior parameter $\phi^\pi$:
+
+$$\phi^\pi_k = \alpha + \sum_i r_{ik}. \qquad (9.3)$$
+
+Similarly we can obtain the posterior of $\eta$:
+
+$$\log q(\eta) = \log p(\eta) + \sum_i \sum_k r_{ik} \log p(x_i | \eta_k) + C.$$
+
+Again we can factor the terms with respect to $k$ and get:
+
+$$\log q(\eta_k) = \log p(\eta_k) + \sum_i r_{ik} \log p(x_i | \eta_k) + C. \qquad (9.5)$$
+
+Here we can see why conjugate prior works. Mathematically, given a
+probability distribution $p(x | \theta)$, the distribution $p(\theta)$
+is called conjugate prior of $p(x | \theta)$ if
+$\log p(\theta) + \log p(x | \theta)$ has the same form as
+$\log p(\theta)$.
+
+For example, the conjugate prior for the exponential family
+$p(x | \theta) = h(x) \exp(\theta \cdot T(x) - A(\theta))$ where $T$,
+$A$ and $h$ are some functions is
+$p(\theta; \chi, \nu) \propto \exp(\chi \cdot \theta - \nu A(\theta))$.
+
+Here what we want is a bit different from conjugate priors because of
+the coefficients $r_{ik}$. But the computation carries over to the
+conjugate priors of the exponential family (try it yourself and you'll
+see). That is, if $p(x_i | \eta_k)$ belongs to the exponential family
+
+$$p(x_i | \eta_k) = h(x) \exp(\eta_k \cdot T(x) - A(\eta_k))$$
+
+and if $p(\eta_k)$ is the conjugate prior of $p(x_i | \eta_k)$
+
+$$p(\eta_k) \propto \exp(\chi \cdot \eta_k - \nu A(\eta_k))$$
+
+then $q(\eta_k)$ has the same form as $p(\eta_k)$, and from (9.5) we can
+compute the updates of $\phi^{\eta_k}$:
+
+$$\begin{aligned}
+\phi^{\eta_k}_1 &= \chi + \sum_i r_{ik} T(x_i), \qquad (9.7) \\
+\phi^{\eta_k}_2 &= \nu + \sum_i r_{ik}. \qquad (9.9)
+\end{aligned}$$
+
+So the mean field approximation for the fully Bayesian mixture model is
+the alternate iteration of (9.1) (E-step) and (9.3)(9.7)(9.9) (M-step)
+until convergence.
+
+*** Fully Bayesian GMM
+ :PROPERTIES:
+ :CUSTOM_ID: fully-bayesian-gmm
+ :END:
+A typical example of fully Bayesian mixture models is the fully Bayesian
+Gaussian mixture model (Attias 2000, also called variational GMM in the
+literature). It is defined by applying the same modification to GMM as
+the difference between Fully Bayesian mixture model and the vanilla
+mixture model.
+
+More specifically:
+
+- $p(z_{i}) = \text{Cat}(\pi)$ as in vanilla GMM
+- $p(\pi) = \text{Dir}(\alpha, \alpha, ..., \alpha)$ has Dirichlet
+ distribution, the conjugate prior to the parameters of the categorical
+ distribution.
+- $p(x_i | z_i = k) = p(x_i | \eta_k) = N(\mu_{k}, \Sigma_{k})$ as in
+ vanilla GMM
+- $p(\mu_k, \Sigma_k) = \text{NIW} (\mu_0, \lambda, \Psi, \nu)$ is the
+ normal-inverse-Wishart distribution, the conjugate prior to the mean
+ and covariance matrix of the Gaussian distribution.
+
+The E-step and M-step can be computed using (9.1) and (9.3)(9.7)(9.9) in
+the previous section. The details of the computation can be found in
+Chapter 10.2 of Bishop 2006 or Attias 2000.
+
+*** LDA
+ :PROPERTIES:
+ :CUSTOM_ID: lda
+ :END:
+As the second example of fully Bayesian mixture models, Latent Dirichlet
+allocation (LDA) (Blei-Ng-Jordan 2003) is the fully Bayesian version of
+pLSA2, with the following plate notations:
+
+[[/assets/resources/lda.png]]
+
+It is the smoothed version in the paper.
+
+More specifically, on the basis of pLSA2, we add prior distributions to
+$\eta$ and $\pi$:
+
+$$\begin{aligned}
+p(\eta_k) &= \text{Dir} (\beta, ..., \beta), \qquad k = 1 : n_z \\
+p(\pi_\ell) &= \text{Dir} (\alpha, ..., \alpha), \qquad \ell = 1 : n_d \\
+\end{aligned}$$
+
+And as before, the prior of $z$ is
+
+$$p(z_{\ell, i}) = \text{Cat} (\pi_\ell), \qquad \ell = 1 : n_d, i = 1 : m$$
+
+We also denote posterior distribution
+
+$$\begin{aligned}
+q(\eta_k) &= \text{Dir} (\phi^{\eta_k}), \qquad k = 1 : n_z \\
+q(\pi_\ell) &= \text{Dir} (\phi^{\pi_\ell}), \qquad \ell = 1 : n_d \\
+p(z_{\ell, i}) &= \text{Cat} (r_{\ell, i}), \qquad \ell = 1 : n_d, i = 1 : m
+\end{aligned}$$
+
+As before, in E-step we update $r$, and in M-step we update $\lambda$
+and $\gamma$.
+
+But in the LDA paper, one treats optimisation over $r$, $\lambda$ and
+$\gamma$ as a E-step, and treats $\alpha$ and $\beta$ as parameters,
+which is optmised over at M-step. This makes it more akin to the
+classical EM where the E-step is Bayesian and M-step MLE. This is more
+complicated, and we do not consider it this way here.
+
+Plugging in (9.1) we obtain the updates at E-step
+
+$$r_{\ell i k} \propto \exp(\psi(\phi^{\pi_\ell}_k) + \psi(\phi^{\eta_k}_{x_{\ell i}}) - \psi(\sum_w \phi^{\eta_k}_w)), \qquad (10)$$
+
+where $\psi$ is the digamma function. Similarly, plugging in
+(9.3)(9.7)(9.9), at M-step, we update the posterior of $\pi$ and $\eta$:
+
+$$\begin{aligned}
+\phi^{\pi_\ell}_k &= \alpha + \sum_i r_{\ell i k}. \qquad (11)\\
+%%}}$
+%%As for $\eta$, we have
+%%{{$%align%
+%%\log q(\eta) &= \sum_k \log p(\eta_k) + \sum_{\ell, i} \mathbb E_{q(z_{\ell i})} \log p(x_{\ell i} | z_{\ell i}, \eta) \\
+%%&= \sum_{k, j} (\sum_{\ell, i} r_{\ell i k} 1_{x_{\ell i} = j} + \beta - 1) \log \eta_{k j}
+%%}}$
+%%which gives us
+%%{{$
+\phi^{\eta_k}_w &= \beta + \sum_{\ell, i} r_{\ell i k} 1_{x_{\ell i} = w}. \qquad (12)
+\end{aligned}$$
+
+So the algorithm iterates over (10) and (11)(12) until convergence.
+
+*** DPMM
+ :PROPERTIES:
+ :CUSTOM_ID: dpmm
+ :END:
+The Dirichlet process mixture model (DPMM) is like the fully Bayesian
+mixture model except $n_z = \infty$, i.e. $z$ can take any positive
+integer value.
+
+The probability of $z_i = k$ is defined using the so called
+stick-breaking process: let $v_i \sim \text{Beta} (\alpha, \beta)$ be
+i.i.d. random variables with Beta distributions, then
+
+$$\mathbb P(z_i = k | v_{1:\infty}) = (1 - v_1) (1 - v_2) ... (1 - v_{k - 1}) v_k.$$
+
+So $v$ plays a similar role to $\pi$ in the previous models.
+
+As before, we have that the distribution of $x$ belongs to the
+exponential family:
+
+$$p(x | z = k, \eta) = p(x | \eta_k) = h(x) \exp(\eta_k \cdot T(x) - A(\eta_k))$$
+
+so the prior of $\eta_k$ is
+
+$$p(\eta_k) \propto \exp(\chi \cdot \eta_k - \nu A(\eta_k)).$$
+
+Because of the infinities we can't directly apply the formulas in the
+general fully Bayesian mixture models. So let us carefully derive the
+whole thing again.
+
+As before, we can write down the ELBO:
+
+$$L(p(x, z, \theta), q(z, \theta)) = \mathbb E_{q(\theta)} \log {p(\theta) \over q(\theta)} + \mathbb E_{q(\theta) q(z)} \log {p(x, z | \theta) \over q(z)}.$$
+
+Both terms are infinite series:
+
+$$L(p, q) = \sum_{k = 1 : \infty} \mathbb E_{q(\theta_k)} \log {p(\theta_k) \over q(\theta_k)} + \sum_{i = 1 : m} \sum_{k = 1 : \infty} q(z_i = k) \mathbb E_{q(\theta)} \log {p(x_i, z_i = k | \theta) \over q(z_i = k)}.$$
+
+There are several ways to deal with the infinities. One is to fix some
+level $T > 0$ and set $v_T = 1$ almost surely (Blei-Jordan 2006). This
+effectively turns the model into a finite one, and both terms become
+finite sums over $k = 1 : T$.
+
+Another walkaround (Kurihara-Welling-Vlassis 2007) is also a kind of
+truncation, but less heavy-handed: setting the posterior
+$q(\theta) = q(\eta) q(v)$ to be:
+
+$$q(\theta) = q(\theta_{1 : T}) p(\theta_{T + 1 : \infty}) =: q(\theta_{\le T}) p(\theta_{> T}).$$
+
+That is, tie the posterior after $T$ to the prior. This effectively
+turns the first term in the ELBO to a finite sum over $k = 1 : T$, while
+keeping the second sum an infinite series:
+
+$$L(p, q) = \sum_{k = 1 : T} \mathbb E_{q(\theta_k)} \log {p(\theta_k) \over q(\theta_k)} + \sum_i \sum_{k = 1 : \infty} q(z_i = k) \mathbb E_{q(\theta)} \log {p(x_i, z_i = k | \theta) \over q(z_i = k)}. \qquad (13)$$
+
+The plate notation of this model is:
+
+[[/assets/resources/dpmm.png]]
+
+As it turns out, the infinities can be tamed in this case.
+
+As before, the optimal $q(z_i)$ is computed as
+
+$$r_{ik} = q(z_i = k) = s_{ik} / S_i$$
+
+where
+
+$$\begin{aligned}
+s_{ik} &= \exp(\mathbb E_{q(\theta)} \log p(x_i, z_i = k | \theta)) \\
+S_i &= \sum_{k = 1 : \infty} s_{ik}.
+\end{aligned}$$
+
+Plugging this back to (13) we have
+
+$$\begin{aligned}
+\sum_{k = 1 : \infty} r_{ik} &\mathbb E_{q(\theta)} \log {p(x_i, z_i = k | \theta) \over r_{ik}} \\
+&= \sum_{k = 1 : \infty} r_{ik} \mathbb E_{q(\theta)} (\log p(x_i, z_i = k | \theta) - \mathbb E_{q(\theta)} \log p(x_i, z_i = k | \theta) + \log S_i) = \log S_i.
+\end{aligned}$$
+
+So it all rests upon $S_i$ being finite.
+
+For $k \le T + 1$, we compute the quantity $s_{ik}$ directly. For
+$k > T$, it is not hard to show that
+
+$$s_{ik} = s_{i, T + 1} \exp((k - T - 1) \mathbb E_{p(w)} \log (1 - w)),$$
+
+where $w$ is a random variable with same distribution as $p(v_k)$, i.e.
+$\text{Beta}(\alpha, \beta)$.
+
+Hence
+
+$$S_i = \sum_{k = 1 : T} s_{ik} + {s_{i, T + 1} \over 1 - \exp(\psi(\beta) - \psi(\alpha + \beta))}$$
+
+is indeed finite. Similarly we also obtain
+
+$$q(z_i > k) = S^{-1} \left(\sum_{\ell = k + 1 : T} s_\ell + {s_{i, T + 1} \over 1 - \exp(\psi(\beta) - \psi(\alpha + \beta))}\right), k \le T \qquad (14)$$
+
+Now let us compute the posterior of $\theta_{\le T}$. In the following
+we exchange the integrals without justifying them (c.f. Fubini's
+Theorem). Equation (13) can be rewritten as
+
+$$L(p, q) = \mathbb E_{q(\theta_{\le T})} \left(\log p(\theta_{\le T}) + \sum_i \mathbb E_{q(z_i) p(\theta_{> T})} \log {p(x_i, z_i | \theta) \over q(z_i)} - \log q(\theta_{\le T})\right).$$
+
+Note that unlike the derivation of the mean-field approximation, we keep
+the $- \mathbb E_{q(z)} \log q(z)$ term even though we are only
+interested in $\theta$ at this stage. This is again due to the problem
+of infinities: as in the computation of $S$, we would like to cancel out
+some undesirable unbounded terms using $q(z)$. We now have
+
+$$\log q(\theta_{\le T}) = \log p(\theta_{\le T}) + \sum_i \mathbb E_{q(z_i) p(\theta_{> T})} \log {p(x_i, z_i | \theta) \over q(z_i)} + C.$$
+
+By plugging in $q(z = k)$ we obtain
+
+$$\log q(\theta_{\le T}) = \log p(\theta_{\le T}) + \sum_{k = 1 : \infty} q(z_i = k) \left(\mathbb E_{p(\theta_{> T})} \log {p(x_i, z_i = k | \theta) \over q(z_i = k)} - \mathbb E_{q(\theta)} \log {p(x_i, z_i = k | \theta) \over q(z_i = k)}\right) + C.$$
+
+Again, we separate the $v_k$'s and the $\eta_k$'s to obtain
+
+$$q(v_{\le T}) = \log p(v_{\le T}) + \sum_i \sum_k q(z_i = k) \left(\mathbb E_{p(v_{> T})} \log p(z_i = k | v) - \mathbb E_{q(v)} \log p (z_i = k | v)\right).$$
+
+Denote by $D_k$ the difference between the two expetations on the right
+hand side. It is easy to show that
+
+$$D_k = \begin{cases}
+\log(1 - v_1) ... (1 - v_{k - 1}) v_k - \mathbb E_{q(v)} \log (1 - v_1) ... (1 - v_{k - 1}) v_k & k \le T\\
+\log(1 - v_1) ... (1 - v_T) - \mathbb E_{q(v)} \log (1 - v_1) ... (1 - v_T) & k > T
+\end{cases}$$
+
+so $D_k$ is bounded. With this we can derive the update for
+$\phi^{v, 1}$ and $\phi^{v, 2}$:
+
+$$\begin{aligned}
+\phi^{v, 1}_k &= \alpha + \sum_i q(z_i = k) \\
+\phi^{v, 2}_k &= \beta + \sum_i q(z_i > k),
+\end{aligned}$$
+
+where $q(z_i > k)$ can be computed as in (14).
+
+When it comes to $\eta$, we have
+
+$$\log q(\eta_{\le T}) = \log p(\eta_{\le T}) + \sum_i \sum_{k = 1 : \infty} q(z_i = k) (\mathbb E_{p(\eta_k)} \log p(x_i | \eta_k) - \mathbb E_{q(\eta_k)} \log p(x_i | \eta_k)).$$
+
+Since $q(\eta_k) = p(\eta_k)$ for $k > T$, the inner sum on the right
+hand side is a finite sum over $k = 1 : T$. By factorising
+$q(\eta_{\le T})$ and $p(\eta_{\le T})$, we have
+
+$$\log q(\eta_k) = \log p(\eta_k) + \sum_i q(z_i = k) \log (x_i | \eta_k) + C,$$
+
+which gives us
+
+$$\begin{aligned}
+\phi^{\eta, 1}_k &= \xi + \sum_i q(z_i = k) T(x_i) \\
+\phi^{\eta, 2}_k &= \nu + \sum_i q(z_i = k).
+\end{aligned}$$
+
+** SVI
+ :PROPERTIES:
+ :CUSTOM_ID: svi
+ :END:
+In variational inference, the computation of some parameters are more
+expensive than others.
+
+For example, the computation of M-step is often much more expensive than
+that of E-step:
+
+- In the vanilla mixture models with the EM algorithm, the update of
+ $\theta$ requires the computation of $r_{ik}$ for all $i = 1 : m$, see
+ Eq (2.3).
+- In the fully Bayesian mixture model with mean field approximation, the
+ updates of $\phi^\pi$ and $\phi^\eta$ require the computation of a sum
+ over all samples (see Eq (9.3)(9.7)(9.9)).
+
+Similarly, in pLSA2 (resp. LDA), the updates of $\eta_k$ (resp.
+$\phi^{\eta_k}$) requires a sum over $\ell = 1 : n_d$, whereas the
+updates of other parameters do not.
+
+In these cases, the parameter that requires more computations are called
+global and the other ones local.
+
+Stochastic variational inference (SVI, Hoffman-Blei-Wang-Paisley 2012)
+addresses this problem in the same way as stochastic gradient descent
+improves efficiency of gradient descent.
+
+Each time SVI picks a sample, updates the corresponding local
+parameters, and computes the update of the global parameters as if all
+the $m$ samples are identical to the picked sample. Finally it
+incorporates this global parameter value into previous computations of
+the global parameters, by means of an exponential moving average.
+
+As an example, here's SVI applied to LDA:
+
+1. Set $t = 1$.
+
+2. Pick $\ell$ uniformly from $\{1, 2, ..., n_d\}$.
+
+3. Repeat until convergence:
+
+ 1. Compute $(r_{\ell i k})_{i = 1 : m, k = 1 : n_z}$ using (10).
+ 2. Compute $(\phi^{\pi_\ell}_k)_{k = 1 : n_z}$ using (11).
+
+4. Compute $(\tilde \phi^{\eta_k}_w)_{k = 1 : n_z, w = 1 : n_x}$ using
+ the following formula (compare with (12))
+ $$\tilde \phi^{\eta_k}_w = \beta + n_d \sum_{i} r_{\ell i k} 1_{x_{\ell i} = w}$$
+
+5. Update the exponential moving average
+ $(\phi^{\eta_k}_w)_{k = 1 : n_z, w = 1 : n_x}$:
+ $$\phi^{\eta_k}_w = (1 - \rho_t) \phi^{\eta_k}_w + \rho_t \tilde \phi^{\eta_k}_w$$
+
+6. Increment $t$ and go back to Step 2.
+
+In the original paper, $\rho_t$ needs to satisfy some conditions that
+guarantees convergence of the global parameters:
+
+$$\begin{aligned}
+\sum_t \rho_t = \infty \\
+\sum_t \rho_t^2 < \infty
+\end{aligned}$$
+
+and the choice made there is
+
+$$\rho_t = (t + \tau)^{-\kappa}$$
+
+for some $\kappa \in (.5, 1]$ and $\tau \ge 0$.
+
+** AEVB
+ :PROPERTIES:
+ :CUSTOM_ID: aevb
+ :END:
+SVI adds to variational inference stochastic updates similar to
+stochastic gradient descent. Why not just use neural networks with
+stochastic gradient descent while we are at it? Autoencoding variational
+Bayes (AEVB) (Kingma-Welling 2013) is such an algorithm.
+
+Let's look back to the original problem of maximising the ELBO:
+
+$$\max_{\theta, q} \sum_{i = 1 : m} L(p(x_i | z_i; \theta) p(z_i; \theta), q(z_i))$$
+
+Since for any given $\theta$, the optimal $q(z_i)$ is the posterior
+$p(z_i | x_i; \theta)$, the problem reduces to
+
+$$\max_{\theta} \sum_i L(p(x_i | z_i; \theta) p(z_i; \theta), p(z_i | x_i; \theta))$$
+
+Let us assume $p(z_i; \theta) = p(z_i)$ is independent of $\theta$ to
+simplify the problem. In the old mixture models, we have
+$p(x_i | z_i; \theta) = p(x_i; \eta_{z_i})$, which we can generalise to
+$p(x_i; f(\theta, z_i))$ for some function $f$. Using Beyes' theorem we
+can also write down $p(z_i | x_i; \theta) = q(z_i; g(\theta, x_i))$ for
+some function $g$. So the problem becomes
+
+$$\max_{\theta} \sum_i L(p(x_i; f(\theta, z_i)) p(z_i), q(z_i; g(\theta, x_i)))$$
+
+In some cases $g$ can be hard to write down or compute. AEVB addresses
+this problem by replacing $g(\theta, x_i)$ with a neural network
+$g_\phi(x_i)$ with input $x_i$ and some separate parameters $\phi$. It
+also replaces $f(\theta, z_i)$ with a neural network $f_\theta(z_i)$
+with input $z_i$ and parameters $\theta$. And now the problem becomes
+
+$$\max_{\theta, \phi} \sum_i L(p(x_i; f_\theta(z_i)) p(z_i), q(z_i; g_\phi(x_i))).$$
+
+The objective function can be written as
+
+$$\sum_i \mathbb E_{q(z_i; g_\phi(x_i))} \log p(x_i; f_\theta(z_i)) - D(q(z_i; g_\phi(x_i)) || p(z_i)).$$
+
+The first term is called the negative reconstruction error, like the
+$- \|decoder(encoder(x)) - x\|$ in autoencoders, which is where the
+"autoencoder" in the name comes from.
+
+The second term is a regularisation term that penalises the posterior
+$q(z_i)$ that is very different from the prior $p(z_i)$. We assume this
+term can be computed analytically.
+
+So only the first term requires computing.
+
+We can approximate the sum over $i$ in a similar fashion as SVI: pick
+$j$ uniformly randomly from $\{1 ... m\}$ and treat the whole dataset as
+$m$ replicates of $x_j$, and approximate the expectation using
+Monte-Carlo:
+
+$$U(x_i, \theta, \phi) := \sum_i \mathbb E_{q(z_i; g_\phi(x_i))} \log p(x_i; f_\theta(z_i)) \approx m \mathbb E_{q(z_j; g_\phi(x_j))} \log p(x_j; f_\theta(z_j)) \approx {m \over L} \sum_{\ell = 1}^L \log p(x_j; f_\theta(z_{j, \ell})),$$
+
+where each $z_{j, \ell}$ is sampled from $q(z_j; g_\phi(x_j))$.
+
+But then it is not easy to approximate the gradient over $\phi$. One can
+use the log trick as in policy gradients, but it has the problem of high
+variance. In policy gradients this is overcome by using baseline
+subtractions. In the AEVB paper it is tackled with the
+reparameterisation trick.
+
+Assume there exists a transformation $T_\phi$ and a random variable
+$\epsilon$ with distribution independent of $\phi$ or $\theta$, such
+that $T_\phi(x_i, \epsilon)$ has distribution $q(z_i; g_\phi(x_i))$. In
+this case we can rewrite $U(x, \phi, \theta)$ as
+
+$$\sum_i \mathbb E_{\epsilon \sim p(\epsilon)} \log p(x_i; f_\theta(T_\phi(x_i, \epsilon))),$$
+
+This way one can use Monte-Carlo to approximate
+$\nabla_\phi U(x, \phi, \theta)$:
+
+$$\nabla_\phi U(x, \phi, \theta) \approx {m \over L} \sum_{\ell = 1 : L} \nabla_\phi \log p(x_j; f_\theta(T_\phi(x_j, \epsilon_\ell))),$$
+
+where each $\epsilon_{\ell}$ is sampled from $p(\epsilon)$. The
+approximation of $U(x, \phi, \theta)$ itself can be done similarly.
+
+*** VAE
+ :PROPERTIES:
+ :CUSTOM_ID: vae
+ :END:
+As an example of AEVB, the paper introduces variational autoencoder
+(VAE), with the following instantiations:
+
+- The prior $p(z_i) = N(0, I)$ is standard normal, thus independent of
+ $\theta$.
+- The distribution $p(x_i; \eta)$ is either Gaussian or categorical.
+- The distribution $q(z_i; \mu, \Sigma)$ is Gaussian with diagonal
+ covariance matrix. So
+ $g_\phi(z_i) = (\mu_\phi(x_i), \text{diag}(\sigma^2_\phi(x_i)_{1 : d}))$.
+ Thus in the reparameterisation trick $\epsilon \sim N(0, I)$ and
+ $T_\phi(x_i, \epsilon) = \epsilon \odot \sigma_\phi(x_i) + \mu_\phi(x_i)$,
+ where $\odot$ is elementwise multiplication.
+- The KL divergence can be easily computed analytically as
+ $- D(q(z_i; g_\phi(x_i)) || p(z_i)) = {d \over 2} + \sum_{j = 1 : d} \log\sigma_\phi(x_i)_j - {1 \over 2} \sum_{j = 1 : d} (\mu_\phi(x_i)_j^2 + \sigma_\phi(x_i)_j^2)$.
+
+With this, one can use backprop to maximise the ELBO.
+
+*** Fully Bayesian AEVB
+ :PROPERTIES:
+ :CUSTOM_ID: fully-bayesian-aevb
+ :END:
+Let us turn to fully Bayesian version of AEVB. Again, we first recall
+the ELBO of the fully Bayesian mixture models:
+
+$$L(p(x, z, \pi, \eta; \alpha, \beta), q(z, \pi, \eta; r, \phi)) = L(p(x | z, \eta) p(z | \pi) p(\pi; \alpha) p(\eta; \beta), q(z; r) q(\eta; \phi^\eta) q(\pi; \phi^\pi)).$$
+
+We write $\theta = (\pi, \eta)$, rewrite $\alpha := (\alpha, \beta)$,
+$\phi := r$, and $\gamma := (\phi^\eta, \phi^\pi)$. Furthermore, as in
+the half-Bayesian version we assume $p(z | \theta) = p(z)$, i.e. $z$
+does not depend on $\theta$. Similarly we also assume
+$p(\theta; \alpha) = p(\theta)$. Now we have
+
+$$L(p(x, z, \theta; \alpha), q(z, \theta; \phi, \gamma)) = L(p(x | z, \theta) p(z) p(\theta), q(z; \phi) q(\theta; \gamma)).$$
+
+And the objective is to maximise it over $\phi$ and $\gamma$. We no
+longer maximise over $\theta$, because it is now a random variable, like
+$z$. Now let us transform it to a neural network model, as in the
+half-Bayesian case:
+
+$$L\left(\left(\prod_{i = 1 : m} p(x_i; f_\theta(z_i))\right) \left(\prod_{i = 1 : m} p(z_i) \right) p(\theta), \left(\prod_{i = 1 : m} q(z_i; g_\phi(x_i))\right) q(\theta; h_\gamma(x))\right).$$
+
+where $f_\theta$, $g_\phi$ and $h_\gamma$ are neural networks. Again, by
+separating out KL-divergence terms, the above formula becomes
+
+$$\sum_i \mathbb E_{q(\theta; h_\gamma(x))q(z_i; g_\phi(x_i))} \log p(x_i; f_\theta(z_i)) - \sum_i D(q(z_i; g_\phi(x_i)) || p(z_i)) - D(q(\theta; h_\gamma(x)) || p(\theta)).$$
+
+Again, we assume the latter two terms can be computed analytically.
+Using reparameterisation trick, we write
+
+$$\begin{aligned}
+\theta &= R_\gamma(\zeta, x) \\
+z_i &= T_\phi(\epsilon, x_i)
+\end{aligned}$$
+
+for some transformations $R_\gamma$ and $T_\phi$ and random variables
+$\zeta$ and $\epsilon$ so that the output has the desired distributions.
+
+Then the first term can be written as
+
+$$\mathbb E_{\zeta, \epsilon} \log p(x_i; f_{R_\gamma(\zeta, x)} (T_\phi(\epsilon, x_i))),$$
+
+so that the gradients can be computed accordingly.
+
+Again, one may use Monte-Carlo to approximate this expetation.
+
+** References
+ :PROPERTIES:
+ :CUSTOM_ID: references
+ :END:
+
+- Attias, Hagai. "A variational baysian framework for graphical models."
+ In Advances in neural information processing systems, pp.
+ 209-215. 2000.
+- Bishop, Christopher M. Neural networks for pattern recognition.
+ Springer. 2006.
+- Blei, David M., and Michael I. Jordan. "Variational Inference for
+ Dirichlet Process Mixtures." Bayesian Analysis 1, no. 1 (March 2006):
+ 121--43. [[https://doi.org/10.1214/06-BA104]].
+- Blei, David M., Andrew Y. Ng, and Michael I. Jordan. "Latent Dirichlet
+ Allocation." Journal of Machine Learning Research 3, no. Jan (2003):
+ 993--1022.
+- Hofmann, Thomas. "Latent Semantic Models for Collaborative Filtering."
+ ACM Transactions on Information Systems 22, no. 1 (January 1, 2004):
+ 89--115. [[https://doi.org/10.1145/963770.963774]].
+- Hofmann, Thomas. "Learning the similarity of documents: An
+ information-geometric approach to document retrieval and
+ categorization." In Advances in neural information processing systems,
+ pp. 914-920. 2000.
+- Hoffman, Matt, David M. Blei, Chong Wang, and John Paisley.
+ "Stochastic Variational Inference." ArXiv:1206.7051 [Cs, Stat], June
+ 29, 2012. [[http://arxiv.org/abs/1206.7051]].
+- Kingma, Diederik P., and Max Welling. "Auto-Encoding Variational
+ Bayes." ArXiv:1312.6114 [Cs, Stat], December 20, 2013.
+ [[http://arxiv.org/abs/1312.6114]].
+- Kurihara, Kenichi, Max Welling, and Nikos Vlassis. "Accelerated
+ variational Dirichlet process mixtures." In Advances in neural
+ information processing systems, pp. 761-768. 2007.
+- Sudderth, Erik Blaine. "Graphical models for visual object recognition
+ and tracking." PhD diss., Massachusetts Institute of Technology, 2006.
diff --git a/posts/2019-03-13-a-tail-of-two-densities.org b/posts/2019-03-13-a-tail-of-two-densities.org
new file mode 100644
index 0000000..783e0c5
--- /dev/null
+++ b/posts/2019-03-13-a-tail-of-two-densities.org
@@ -0,0 +1,1304 @@
+#+title: A Tail of Two Densities
+
+#+date: <2019-03-13>
+
+This is Part 1 of a two-part post where I give an introduction to the
+mathematics of differential privacy.
+
+Practically speaking,
+[[https://en.wikipedia.org/wiki/Differential_privacy][differential
+privacy]] is a technique of perturbing database queries so that query
+results do not leak too much information while still being relatively
+accurate.
+
+This post however focuses on the mathematical aspects of differential
+privacy, which is a study of
+[[https://en.wikipedia.org/wiki/Concentration_inequality][tail bounds]]
+of the divergence between two probability measures, with the end goal of
+applying it to
+[[https://en.wikipedia.org/wiki/Stochastic_gradient_descent][stochastic
+gradient descent]]. This post should be suitable for anyone familiar
+with probability theory.
+
+I start with the definition of \(\epsilon\)-differential privacy
+(corresponding to max divergence), followed by
+\((\epsilon, \delta)\)-differential privacy (a.k.a. approximate
+differential privacy, corresponding to the \(\delta\)-approximate max
+divergence). I show a characterisation of the
+\((\epsilon, \delta)\)-differential privacy as conditioned
+\(\epsilon\)-differential privacy. Also, as examples, I illustrate the
+\(\epsilon\)-dp with Laplace mechanism and, using some common tail bounds,
+the approximate dp with the Gaussian mechanism.
+
+Then I continue to show the effect of combinatorial and sequential
+compositions of randomised queries (called mechanisms) on privacy by
+stating and proving the composition theorems for differential privacy,
+as well as the effect of mixing mechanisms, by presenting the
+subsampling theorem (a.k.a. amplification theorem).
+
+In [[/posts/2019-03-14-great-but-manageable-expectations.html][Part 2]],
+I discuss the Rényi differential privacy, corresponding to the Rényi
+divergence, a study of the
+[[https://en.wikipedia.org/wiki/Moment-generating_function][moment
+generating functions]] of the divergence between probability measures to
+derive the tail bounds.
+
+Like in Part 1, I prove a composition theorem and a subsampling theorem.
+
+I also attempt to reproduce a seemingly better moment bound for the
+Gaussian mechanism with subsampling, with one intermediate step which I
+am not able to prove.
+
+After that I explain the Tensorflow implementation of differential
+privacy in its
+[[https://github.com/tensorflow/privacy/tree/master/privacy][Privacy]]
+module, which focuses on the differentially private stochastic gradient
+descent algorithm (DP-SGD).
+
+Finally I use the results from both Part 1 and Part 2 to obtain some
+privacy guarantees for composed subsampling queries in general, and for
+DP-SGD in particular. I also compare these privacy guarantees.
+
+*Acknowledgement*. I would like to thank
+[[http://stockholm.ai][Stockholm AI]] for introducing me to the subject
+of differential privacy. Thanks to Amir Hossein Rahnama for hosting the
+discussions at Stockholm AI. Thanks to (in chronological order) Reynaldo
+Boulogne, Martin Abedi, Ilya Mironov, Kurt Johansson, Mark Bun, Salil
+Vadhan, Jonathan Ullman, Yuanyuan Xu and Yiting Li for communication and
+discussions. Also thanks to the
+[[https://www.reddit.com/r/MachineLearning/][r/MachineLearning]]
+community for comments and suggestions which result in improvement of
+readability of this post. The research was done while working at
+[[https://www.kth.se/en/sci/institutioner/math][KTH Department of
+Mathematics]].
+
+/If you are confused by any notations, ask me or try
+[[/notations.html][this]]. This post (including both Part 1 and Part2)
+is licensed under [[https://creativecommons.org/licenses/by-sa/4.0/][CC
+BY-SA]] and [[https://www.gnu.org/licenses/fdl.html][GNU FDL]]./
+
+** The gist of differential privacy
+ :PROPERTIES:
+ :CUSTOM_ID: the-gist-of-differential-privacy
+ :END:
+If you only have one minute, here is what differential privacy is about:
+
+Let \(p\) and \(q\) be two probability densities, we define the /divergence
+variable/[fn:1] of \((p, q)\) to be
+
+\[L(p || q) := \log {p(\xi) \over q(\xi)}\]
+
+where \(\xi\) is a random variable distributed according to \(p\).
+
+Roughly speaking, differential privacy is the study of the tail bound of
+\(L(p || q)\): for certain \(p\)s and \(q\)s, and for \(\epsilon > 0\), find
+\(\delta(\epsilon)\) such that
+
+\[\mathbb P(L(p || q) > \epsilon) < \delta(\epsilon),\]
+
+where \(p\) and \(q\) are the laws of the outputs of a randomised functions
+on two very similar inputs. Moreover, to make matters even simpler, only
+three situations need to be considered:
+
+1. (General case) \(q\) is in the form of \(q(y) = p(y + \Delta)\) for some
+ bounded constant \(\Delta\).
+2. (Compositions) \(p\) and \(q\) are combinatorial or sequential
+ compositions of some simpler \(p_i\)'s and \(q_i\)'s respectively
+3. (Subsampling) \(p\) and \(q\) are mixtures / averages of some simpler
+ \(p_i\)'s and \(q_i\)'s respectively
+
+In applications, the inputs are databases and the randomised functions
+are queries with an added noise, and the tail bounds give privacy
+guarantees. When it comes to gradient descent, the input is the training
+dataset, and the query updates the parameters, and privacy is achieved
+by adding noise to the gradients.
+
+Now if you have an hour...
+
+** \(\epsilon\)-dp
+ :PROPERTIES:
+ :CUSTOM_ID: epsilon-dp
+ :END:
+*Definition (Mechanisms)*. Let \(X\) be a space with a metric
+\(d: X \times X \to \mathbb N\). A /mechanism/ \(M\) is a function that
+takes \(x \in X\) as input and outputs a random variable on \(Y\).
+
+In this post, \(X = Z^m\) is the space of datasets of \(m\) rows for some
+integer \(m\), where each item resides in some space \(Z\). In this case the
+distance \(d(x, x') := \#\{i: x_i \neq x'_i\}\) is the number of rows that
+differ between \(x\) and \(x'\).
+
+Normally we have a query \(f: X \to Y\), and construct the mechanism \(M\)
+from \(f\) by adding a noise:
+
+\[M(x) := f(x) + \text{noise}.\]
+
+Later, we will also consider mechanisms constructed from composition or
+mixture of other mechanisms.
+
+In this post \(Y = \mathbb R^d\) for some \(d\).
+
+*Definition (Sensitivity)*. Let \(f: X \to \mathbb R^d\) be a function.
+The /sensitivity/ \(S_f\) of \(f\) is defined as
+
+\[S_f := \sup_{x, x' \in X: d(x, x') = 1} \|f(x) - f(x')\|_2,\]
+
+where \(\|y\|_2 = \sqrt{y_1^2 + ... + y_d^2}\) is the \(\ell^2\)-norm.
+
+*Definition (Differential Privacy)*. A mechanism \(M\) is called
+\(\epsilon\)/-differential privacy/ (\(\epsilon\)-dp) if it satisfies the
+following condition: for all \(x, x' \in X\) with \(d(x, x') = 1\), and for
+all measureable set \(S \subset \mathbb R^n\),
+
+\[\mathbb P(M(x) \in S) \le e^\epsilon P(M(x') \in S). \qquad (1)\]
+
+Practically speaking, this means given the results from perturbed query
+on two known databases that differs by one row, it is hard to determine
+which result is from which database.
+
+An example of \(\epsilon\)-dp mechanism is the Laplace mechanism.
+
+*Definition*. The /Laplace distribution/ over \(\mathbb R\) with parameter
+\(b > 0\) has probability density function
+
+\[f_{\text{Lap}(b)}(x) = {1 \over 2 b} e^{- {|x| \over b}}.\]
+
+*Definition*. Let \(d = 1\). The /Laplace mechanism/ is defined by
+
+\[M(x) = f(x) + \text{Lap}(b).\]
+
+*Claim*. The Laplace mechanism with
+
+\[b \ge \epsilon^{-1} S_f \qquad (1.5)\]
+
+is \(\epsilon\)-dp.
+
+*Proof*. Quite straightforward. Let \(p\) and \(q\) be the laws of \(M(x)\)
+and \(M(x')\) respectively.
+
+\[{p (y) \over q (y)} = {f_{\text{Lap}(b)} (y - f(x)) \over f_{\text{Lap}(b)} (y - f(x'))} = \exp(b^{-1} (|y - f(x')| - |y - f(x)|))\]
+
+Using triangular inequality \(|A| - |B| \le |A - B|\) on the right hand
+side, we have
+
+\[{p (y) \over q (y)} \le \exp(b^{-1} (|f(x) - f(x')|)) \le \exp(\epsilon)\]
+
+where in the last step we use the condition (1.5). \(\square\)
+
+** Approximate differential privacy
+ :PROPERTIES:
+ :CUSTOM_ID: approximate-differential-privacy
+ :END:
+Unfortunately, \(\epsilon\)-dp does not apply to the most commonly used
+noise, the Gaussian noise. To fix this, we need to relax the definition
+a bit.
+
+*Definition*. A mechanism \(M\) is said to be
+\((\epsilon, \delta)\)/-differentially private/ if for all \(x, x' \in X\)
+with \(d(x, x') = 1\) and for all measureable \(S \subset \mathbb R^d\)
+
+\[\mathbb P(M(x) \in S) \le e^\epsilon P(M(x') \in S) + \delta. \qquad (2)\]
+
+Immediately we see that the \((\epsilon, \delta)\)-dp is meaningful only
+if \(\delta < 1\).
+
+*** Indistinguishability
+ :PROPERTIES:
+ :CUSTOM_ID: indistinguishability
+ :END:
+To understand \((\epsilon, \delta)\)-dp, it is helpful to study
+\((\epsilon, \delta)\)-indistinguishability.
+
+*Definition*. Two probability measures \(p\) and \(q\) on the same space are
+called \((\epsilon, \delta)\)/-ind(istinguishable)/ if for all measureable
+sets \(S\):
+
+$$\begin{aligned}
+p(S) \le e^\epsilon q(S) + \delta, \qquad (3) \\
+q(S) \le e^\epsilon p(S) + \delta. \qquad (4)
+\end{aligned}$$
+
+As before, we also call random variables \(\xi\) and \(\eta\) to be
+\((\epsilon, \delta)\)-ind if their laws are \((\epsilon, \delta)\)-ind.
+When \(\delta = 0\), we call it \(\epsilon\)-ind.
+
+Immediately we have
+
+*Claim 0*. \(M\) is \((\epsilon, \delta)\)-dp (resp. \(\epsilon\)-dp) iff
+\(M(x)\) and \(M(x')\) are \((\epsilon, \delta)\)-ind (resp. \(\epsilon\)-ind)
+for all \(x\) and \(x'\) with distance \(1\).
+
+*Definition (Divergence Variable)*. Let \(p\) and \(q\) be two probability
+measures. Let \(\xi\) be a random variable distributed according to \(p\),
+we define a random variable \(L(p || q)\) by
+
+\[L(p || q) := \log {p(\xi) \over q(\xi)},\]
+
+and call it the /divergence variable/ of \((p, q)\).
+
+One interesting and readily verifiable fact is
+
+\[\mathbb E L(p || q) = D(p || q)\]
+
+where \(D\) is the
+[[https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence][KL-divergence]].
+
+*Claim 1*. If
+
+$$\begin{aligned}
+\mathbb P(L(p || q) \le \epsilon) &\ge 1 - \delta, \qquad(5) \\
+\mathbb P(L(q || p) \le \epsilon) &\ge 1 - \delta
+\end{aligned}$$
+
+then \(p\) and \(q\) are \((\epsilon, \delta)\)-ind.
+
+*Proof*. We verify (3), and (4) can be shown in the same way. Let
+\(A := \{y \in Y: \log {p(y) \over q(y)} > \epsilon\}\), then by (5) we
+have
+
+\[p(A) < \delta.\]
+
+So
+
+\[p(S) = p(S \cap A) + p(S \setminus A) \le \delta + e^\epsilon q(S \setminus A) \le \delta + e^\epsilon q(S).\]
+
+\(\square\)
+
+This Claim translates differential privacy to the tail bound of
+divergence variables, and for the rest of this post all dp results are
+obtained by estimating this tail bound.
+
+In the following we discuss the converse of Claim 1. The discussions are
+rather technical, and readers can skip to the
+[[#back-to-approximate-differential-privacy][next subsection]] on first
+reading.
+
+The converse of Claim 1 is not true.
+
+*Claim 2*. There exists \(\epsilon, \delta > 0\), and \(p\) and \(q\) that are
+\((\epsilon, \delta)\)-ind, such that
+
+$$\begin{aligned}
+\mathbb P(L(p || q) \le \epsilon) &< 1 - \delta, \\
+\mathbb P(L(q || p) \le \epsilon) &< 1 - \delta
+\end{aligned}$$
+
+*Proof*. Here's a example. Let \(Y = \{0, 1\}\), and \(p(0) = q(1) = 2 / 5\)
+and \(p(1) = q(0) = 3 / 5\). Then it is not hard to verify that \(p\) and
+\(q\) are \((\log {4 \over 3}, {1 \over 3})\)-ind: just check (3) for all
+four possible \(S \subset Y\) and (4) holds by symmetry. On the other
+hand,
+
+\[\mathbb P(L(p || q) \le \log {4 \over 3}) = \mathbb P(L(q || p) \le \log {4 \over 3}) = {2 \over 5} < {2 \over 3}.\]
+
+\(\square\)
+
+A weaker version of the converse of Claim 1 is true
+(Kasiviswanathan-Smith 2015), though:
+
+*Claim 3*. Let \(\alpha > 1\). If \(p\) and \(q\) are
+\((\epsilon, \delta)\)-ind, then
+
+\[\mathbb P(L(p || q) > \alpha \epsilon) < {1 \over 1 - \exp((1 - \alpha) \epsilon)} \delta.\]
+
+*Proof*. Define
+
+\[S = \{y: p(y) > e^{\alpha \epsilon} q(y)\}.\]
+
+Then we have
+
+\[e^{\alpha \epsilon} q(S) < p(S) \le e^\epsilon q(S) + \delta,\]
+
+where the first inequality is due to the definition of \(S\), and the
+second due to the \((\epsilon, \delta)\)-ind. Therefore
+
+\[q(S) \le {\delta \over e^{\alpha \epsilon} - e^\epsilon}.\]
+
+Using the \((\epsilon, \delta)\)-ind again we have
+
+\[p(S) \le e^\epsilon q(S) + \delta = {1 \over 1 - e^{(1 - \alpha) \epsilon}} \delta.\]
+
+\(\square\)
+
+This can be quite bad if \(\epsilon\) is small.
+
+To prove the composition theorems in the next section, we need a
+condition better than that in Claim 1 so that we can go back and forth
+between indistinguishability and such condition. In other words, we need
+a /characterisation/ of indistinguishability.
+
+Let us take a careful look at the condition in Claim 1 and call it *C1*:
+
+*C1*. \(\mathbb P(L(p || q) \le \epsilon) \ge 1 - \delta\) and
+\(\mathbb P(L(q || p) \le \epsilon) \ge 1 - \delta\)
+
+It is equivalent to
+
+*C2*. there exist events \(A, B \subset Y\) with probabilities \(p(A)\) and
+\(q(B)\) at least \(1 - \delta\) such that
+\(\log p(y) - \log q(y) \le \epsilon\) for all \(y \in A\) and
+\(\log q(y) - \log p(y) \le \epsilon\) for all \(y \in B\).
+
+A similar-looking condition to *C2* is the following:
+
+*C3*. Let \(\Omega\) be the
+[[https://en.wikipedia.org/wiki/Probability_space#Definition][underlying
+probability space]]. There exist two events \(E, F \subset \Omega\) with
+\(\mathbb P(E), \mathbb P(F) \ge 1 - \delta\), such that
+\(|\log p_{|E}(y) - \log q_{|F}(y)| \le \epsilon\) for all \(y \in Y\).
+
+Here \(p_{|E}\) (resp. \(q_{|F}\)) is \(p\) (resp. \(q\)) conditioned on event
+\(E\) (resp. \(F\)).
+
+*Remark*. Note that the events in *C2* and *C3* are in different spaces,
+and therefore we can not write \(p_{|E}(S)\) as \(p(S | E)\) or \(q_{|F}(S)\)
+as \(q(S | F)\). In fact, if we let \(E\) and \(F\) in *C3* be subsets of \(Y\)
+with \(p(E), q(F) \ge 1 - \delta\) and assume \(p\) and \(q\) have the same
+supports, then *C3* degenerates to a stronger condition than *C2*.
+Indeed, in this case \(p_E(y) = p(y) 1_{y \in E}\) and
+\(q_F(y) = q(y) 1_{y \in F}\), and so \(p_E(y) \le e^\epsilon q_F(y)\)
+forces \(E \subset F\). We also obtain \(F \subset E\) in the same way. This
+gives us \(E = F\), and *C3* becomes *C2* with \(A = B = E = F\).
+
+As it turns out, *C3* is the condition we need.
+
+*Claim 4*. Two probability measures \(p\) and \(q\) are
+\((\epsilon, \delta)\)-ind if and only if *C3* holds.
+
+*Proof*(Murtagh-Vadhan 2018). The "if" direction is proved in the same
+way as Claim 1. Without loss of generality we may assume
+\(\mathbb P(E) = \mathbb P(F) \ge 1 - \delta\). To see this, suppose \(F\)
+has higher probability than \(E\), then we can substitute \(F\) with a
+subset of \(F\) that has the same probability as \(E\) (with possible
+enlargement of the probability space).
+
+Let \(\xi \sim p\) and \(\eta \sim q\) be two independent random variables,
+then
+
+$$\begin{aligned}
+p(S) &= \mathbb P(\xi \in S | E) \mathbb P(E) + \mathbb P(\xi \in S; E^c) \\
+&\le e^\epsilon \mathbb P(\eta \in S | F) \mathbb P(E) + \delta \\
+&= e^\epsilon \mathbb P(\eta \in S | F) \mathbb P(F) + \delta\\
+&\le e^\epsilon q(S) + \delta.
+\end{aligned}$$
+
+The "only-if" direction is more involved.
+
+We construct events \(E\) and \(F\) by constructing functions
+\(e, f: Y \to [0, \infty)\) satisfying the following conditions:
+
+1. \(0 \le e(y) \le p(y)\) and \(0 \le f(y) \le q(y)\) for all \(y \in Y\).
+2. \(|\log e(y) - \log f(y)| \le \epsilon\) for all \(y \in Y\).
+3. \(e(Y), f(Y) \ge 1 - \delta\).
+4. \(e(Y) = f(Y)\).
+
+Here for a set \(S \subset Y\), \(e(S) := \int_S e(y) dy\), and the same
+goes for \(f(S)\).
+
+Let \(\xi \sim p\) and \(\eta \sim q\). Then we define \(E\) and \(F\) by
+
+$$\mathbb P(E | \xi = y) = e(y) / p(y) \\
+\mathbb P(F | \eta = y) = f(y) / q(y).$$
+
+*Remark inside proof*. This can seem a bit confusing. Intuitively, we
+can think of it this way when \(Y\) is finite: Recall a random variable on
+\(Y\) is a function from the probability space \(\Omega\) to \(Y\). Let event
+\(G_y \subset \Omega\) be defined as \(G_y = \xi^{-1} (y)\). We cut \(G_y\)
+into the disjoint union of \(E_y\) and \(G_y \setminus E_y\) such that
+\(\mathbb P(E_y) = e(y)\). Then \(E = \bigcup_{y \in Y} E_y\). So \(e(y)\) can
+be seen as the "density" of \(E\).
+
+Indeed, given \(E\) and \(F\) defined this way, we have
+
+\[p_E(y) = {e(y) \over e(Y)} \le {\exp(\epsilon) f(y) \over e(Y)} = {\exp(\epsilon) f(y) \over f(Y)} = \exp(\epsilon) q_F(y).\]
+
+and
+
+\[\mathbb P(E) = \int \mathbb P(E | \xi = y) p(y) dy = e(Y) \ge 1 - \delta,\]
+
+and the same goes for \(\mathbb P(F)\).
+
+What remains is to construct \(e(y)\) and \(f(y)\) satisfying the four
+conditions.
+
+Like in the proof of Claim 1, let \(S, T \subset Y\) be defined as
+
+$$\begin{aligned}
+S := \{y: p(y) > \exp(\epsilon) q(y)\},\\
+T := \{y: q(y) > \exp(\epsilon) p(y)\}.
+\end{aligned}$$
+
+Let
+
+$$\begin{aligned}
+e(y) &:= \exp(\epsilon) q(y) 1_{y \in S} + p(y) 1_{y \notin S}\\
+f(y) &:= \exp(\epsilon) p(y) 1_{y \in T} + q(y) 1_{y \notin T}. \qquad (6)
+\end{aligned}$$
+
+By checking them on the three disjoint subsets \(S\), \(T\), \((S \cup T)^c\),
+it is not hard to verify that the \(e(y)\) and \(f(y)\) constructed this way
+satisfy the first two conditions. They also satisfy the third condition:
+
+$$\begin{aligned}
+e(Y) &= 1 - (p(S) - \exp(\epsilon) q(S)) \ge 1 - \delta, \\
+f(Y) &= 1 - (q(T) - \exp(\epsilon) p(T)) \ge 1 - \delta.
+\end{aligned}$$
+
+If \(e(Y) = f(Y)\) then we are done. Otherwise, without loss of
+generality, assume \(e(Y) < f(Y)\), then all it remains to do is to reduce
+the value of \(f(y)\) while preserving Condition 1, 2 and 3, until
+\(f(Y) = e(Y)\).
+
+As it turns out, this can be achieved by reducing \(f(y)\) on the set
+\(\{y \in Y: q(y) > p(y)\}\). To see this, let us rename the \(f(y)\)
+defined in (6) \(f_+(y)\), and construct \(f_-(y)\) by
+
+\[f_-(y) := p(y) 1_{y \in T} + (q(y) \wedge p(y)) 1_{y \notin T}.\]
+
+It is not hard to show that not only \(e(y)\) and \(f_-(y)\) also satisfy
+conditions 1-3, but
+
+\[e(y) \ge f_-(y), \forall y \in Y,\]
+
+and thus \(e(Y) \ge f_-(Y)\). Therefore there exists an \(f\) that
+interpolates between \(f_-\) and \(f_+\) with \(f(Y) = e(Y)\). \(\square\)
+
+To prove the adaptive composition theorem for approximate differential
+privacy, we need a similar claim (We use index shorthand
+\(\xi_{< i} = \xi_{1 : i - 1}\) and similarly for other notations):
+
+*Claim 5*. Let \(\xi_{1 : i}\) and \(\eta_{1 : i}\) be random variables. Let
+
+$$\begin{aligned}
+p_i(S | y_{1 : i - 1}) := \mathbb P(\xi_i \in S | \xi_{1 : i - 1} = y_{1 : i - 1})\\
+q_i(S | y_{1 : i - 1}) := \mathbb P(\eta_i \in S | \eta_{1 : i - 1} = y_{1 : i - 1})
+\end{aligned}$$
+
+be the conditional laws of \(\xi_i | \xi_{< i}\) and \(\eta_i | \eta_{< i}\)
+respectively. Then the following are equivalent:
+
+1. For any \(y_{< i} \in Y^{i - 1}\), \(p_i(\cdot | y_{< i})\) and
+ \(q_i(\cdot | y_{< i})\) are \((\epsilon, \delta)\)-ind
+
+2. There exists events \(E_i, F_i \subset \Omega\) with
+ \(\mathbb P(E_i | \xi_{<i} = y_{<i}) = \mathbb P(F_i | \eta_{<i} = y_{< i}) \ge 1 - \delta\)
+ for any \(y_{< i}\), such that \(p_{i | E_i}(\cdot | y_{< i})\) and
+ \(q_{i | E_i} (\cdot | y_{< i})\) are \(\epsilon\)-ind for any \(y_{< i}\),
+ where $$\begin{aligned}
+ p_{i | E_i}(S | y_{1 : i - 1}) := \mathbb P(\xi_i \in S | E_i, \xi_{1 : i - 1} = y_{1 : i - 1})\\
+ q_{i | F_i}(S | y_{1 : i - 1}) := \mathbb P(\eta_i \in S | F_i, \eta_{1 : i - 1} = y_{1 : i - 1})
+ \end{aligned}$$
+
+ are \(p_i\) and \(q_i\) conditioned on \(E_i\) and \(F_i\) respectively.
+
+*Proof*. Item 2 => Item 1: as in the Proof of Claim 4,
+
+$$\begin{aligned}
+p_i(S | y_{< i}) &= p_{i | E_i} (S | y_{< i}) \mathbb P(E_i | \xi_{< i} = y_{< i}) + p_{i | E_i^c}(S | y_{< i}) \mathbb P(E_i^c | \xi_{< i} = y_{< i}) \\
+&\le p_{i | E_i} (S | y_{< i}) \mathbb P(E_i | \xi_{< i} = y_{< i}) + \delta \\
+&= p_{i | E_i} (S | y_{< i}) \mathbb P(F_i | \xi_{< i} = y_{< i}) + \delta \\
+&\le e^\epsilon q_{i | F_i} (S | y_{< i}) \mathbb P(F_i | \xi_{< i} = y_{< i}) + \delta \\
+&= e^\epsilon q_i (S | y_{< i}) + \delta.
+\end{aligned}$$
+
+The direction from
+\(q_i(S | y_{< i}) \le e^\epsilon p_i(S | y_{< i}) + \delta\) can be shown
+in the same way.
+
+Item 1 => Item 2: as in the Proof of Claim 4 we construct \(e(y_{1 : i})\)
+and \(f(y_{1 : i})\) as "densities" of events \(E_i\) and \(F_i\).
+
+Let
+
+$$\begin{aligned}
+e(y_{1 : i}) &:= e^\epsilon q_i(y_i | y_{< i}) 1_{y_i \in S_i(y_{< i})} + p_i(y_i | y_{< i}) 1_{y_i \notin S_i(y_{< i})}\\
+f(y_{1 : i}) &:= e^\epsilon p_i(y_i | y_{< i}) 1_{y_i \in T_i(y_{< i})} + q_i(y_i | y_{< i}) 1_{y_i \notin T_i(y_{< i})}\\
+\end{aligned}$$
+
+where
+
+$$\begin{aligned}
+S_i(y_{< i}) = \{y_i \in Y: p_i(y_i | y_{< i}) > e^\epsilon q_i(y_i | y_{< i})\}\\
+T_i(y_{< i}) = \{y_i \in Y: q_i(y_i | y_{< i}) > e^\epsilon p_i(y_i | y_{< i})\}.
+\end{aligned}$$
+
+Then \(E_i\) and \(F_i\) are defined as
+
+$$\begin{aligned}
+\mathbb P(E_i | \xi_{\le i} = y_{\le i}) &= {e(y_{\le i}) \over p_i(y_{\le i})},\\
+\mathbb P(F_i | \xi_{\le i} = y_{\le i}) &= {f(y_{\le i}) \over q_i(y_{\le i})}.
+\end{aligned}$$
+
+The rest of the proof is almost the same as the proof of Claim 4.
+\(\square\)
+
+*** Back to approximate differential privacy
+ :PROPERTIES:
+ :CUSTOM_ID: back-to-approximate-differential-privacy
+ :END:
+By Claim 0 and 1 we have
+
+*Claim 6*. If for all \(x, x' \in X\) with distance \(1\)
+
+\[\mathbb P(L(M(x) || M(x')) \le \epsilon) \ge 1 - \delta,\]
+
+then \(M\) is \((\epsilon, \delta)\)-dp.
+
+Note that in the literature the divergence variable \(L(M(x) || M(x'))\)
+is also called the /privacy loss/.
+
+By Claim 0 and Claim 4 we have
+
+*Claim 7*. \(M\) is \((\epsilon, \delta)\)-dp if and only if for every
+\(x, x' \in X\) with distance \(1\), there exist events
+\(E, F \subset \Omega\) with \(\mathbb P(E) = \mathbb P(F) \ge 1 - \delta\),
+\(M(x) | E\) and \(M(x') | F\) are \(\epsilon\)-ind.
+
+We can further simplify the privacy loss \(L(M(x) || M(x'))\), by
+observing the translational and scaling invariance of \(L(\cdot||\cdot)\):
+
+$$\begin{aligned}
+L(\xi || \eta) &\overset{d}{=} L(\alpha \xi + \beta || \alpha \eta + \beta), \qquad \alpha \neq 0. \qquad (6.1)
+\end{aligned}$$
+
+With this and the definition
+
+\[M(x) = f(x) + \zeta\]
+
+for some random variable \(\zeta\), we have
+
+\[L(M(x) || M(x')) \overset{d}{=} L(\zeta || \zeta + f(x') - f(x)).\]
+
+Without loss of generality, we can consider \(f\) with sensitivity \(1\),
+for
+
+\[L(f(x) + S_f \zeta || f(x') + S_f \zeta) \overset{d}{=} L(S_f^{-1} f(x) + \zeta || S_f^{-1} f(x') + \zeta)\]
+
+so for any noise \(\zeta\) that achieves \((\epsilon, \delta)\)-dp for a
+function with sensitivity \(1\), we have the same privacy guarantee by for
+an arbitrary function with sensitivity \(S_f\) by adding a noise
+\(S_f \zeta\).
+
+With Claim 6 we can show that the Gaussian mechanism is approximately
+differentially private. But first we need to define it.
+
+*Definition (Gaussian mechanism)*. Given a query \(f: X \to Y\), the
+/Gaussian mechanism/ \(M\) adds a Gaussian noise to the query:
+
+\[M(x) = f(x) + N(0, \sigma^2 I).\]
+
+Some tail bounds for the Gaussian distribution will be useful.
+
+*Claim 8 (Gaussian tail bounds)*. Let \(\xi \sim N(0, 1)\) be a standard
+normal distribution. Then for \(t > 0\)
+
+\[\mathbb P(\xi > t) < {1 \over \sqrt{2 \pi} t} e^{- {t^2 \over 2}}, \qquad (6.3)\]
+
+and
+
+\[\mathbb P(\xi > t) < e^{- {t^2 \over 2}}. \qquad (6.5)\]
+
+*Proof*. Both bounds are well known. The first can be proved using
+
+\[\int_t^\infty e^{- {y^2 \over 2}} dy < \int_t^\infty {y \over t} e^{- {y^2 \over 2}} dy.\]
+
+The second is shown using
+[[https://en.wikipedia.org/wiki/Chernoff_bound][Chernoff bound]]. For
+any random variable \(\xi\),
+
+\[\mathbb P(\xi > t) < {\mathbb E \exp(\lambda \xi) \over \exp(\lambda t)} = \exp(\kappa_\xi(\lambda) - \lambda t), \qquad (6.7)\]
+
+where \(\kappa_\xi(\lambda) = \log \mathbb E \exp(\lambda \xi)\) is the
+cumulant of \(\xi\). Since (6.7) holds for any \(\lambda\), we can get the
+best bound by minimising \(\kappa_\xi(\lambda) - \lambda t\) (a.k.a. the
+[[https://en.wikipedia.org/wiki/Legendre_transformation][Legendre
+transformation]]). When \(\xi\) is standard normal, we get (6.5).
+\(\square\)
+
+*Remark*. We will use the Chernoff bound extensively in the second part
+of this post when considering Rényi differential privacy.
+
+*Claim 9*. The Gaussian mechanism on a query \(f\) is
+\((\epsilon, \delta)\)-dp, where
+
+\[\delta = \exp(- (\epsilon \sigma / S_f - (2 \sigma / S_f)^{-1})^2 / 2). \qquad (6.8)\]
+
+Conversely, to achieve give \((\epsilon, \delta)\)-dp, we may set
+
+\[\sigma > \left(\epsilon^{-1} \sqrt{2 \log \delta^{-1}} + (2 \epsilon)^{- {1 \over 2}}\right) S_f \qquad (6.81)\]
+
+or
+
+\[\sigma > (\epsilon^{-1} (1 \vee \sqrt{(\log (2 \pi)^{-1} \delta^{-2})_+}) + (2 \epsilon)^{- {1 \over 2}}) S_f \qquad (6.82)\]
+
+or
+
+\[\sigma > \epsilon^{-1} \sqrt{\log e^\epsilon \delta^{-2}} S_f \qquad (6.83)\]
+
+or
+
+\[\sigma > \epsilon^{-1} (\sqrt{1 + \epsilon} \vee \sqrt{(\log e^\epsilon (2 \pi)^{-1} \delta^{-2})_+}) S_f. \qquad (6.84)\]
+
+*Proof*. As discussed before we only need to consider the case where
+\(S_f = 1\). Fix arbitrary \(x, x' \in X\) with \(d(x, x') = 1\). Let
+\(\zeta = (\zeta_1, ..., \zeta_d) \sim N(0, I_d)\).
+
+By Claim 6 it suffices to bound
+
+\[\mathbb P(L(M(x) || M(x')) > \epsilon)\]
+
+We have by the linear invariance of \(L\),
+
+\[L(M(x) || M(x')) = L(f(x) + \sigma \zeta || f(x') + \sigma \zeta) \overset{d}{=} L(\zeta|| \zeta + \Delta / \sigma),\]
+
+where \(\Delta := f(x') - f(x)\).
+
+Plugging in the Gaussian density, we have
+
+\[L(M(x) || M(x')) \overset{d}{=} \sum_i {\Delta_i \over \sigma} \zeta_i + \sum_i {\Delta_i^2 \over 2 \sigma^2} \overset{d}{=} {\|\Delta\|_2 \over \sigma} \xi + {\|\Delta\|_2^2 \over 2 \sigma^2}.\]
+
+where \(\xi \sim N(0, 1)\).
+
+Hence
+
+\[\mathbb P(L(M(x) || M(x')) > \epsilon) = \mathbb P(\zeta > {\sigma \over \|\Delta\|_2} \epsilon - {\|\Delta\|_2 \over 2 \sigma}).\]
+
+Since \(\|\Delta\|_2 \le S_f = 1\), we have
+
+\[\mathbb P(L(M(x) || M(x')) > \epsilon) \le \mathbb P(\xi > \sigma \epsilon - (2 \sigma)^{-1}).\]
+
+Thus the problem is reduced to the tail bound of a standard normal
+distribution, so we can use Claim 8. Note that we implicitly require
+\(\sigma > (2 \epsilon)^{- 1 / 2}\) here so that
+\(\sigma \epsilon - (2 \sigma)^{-1} > 0\) and we can use the tail bounds.
+
+Using (6.3) we have
+
+\[\mathbb P(L(M(x) || M(x')) > \epsilon) < \exp(- (\epsilon \sigma - (2 \sigma)^{-1})^2 / 2).\]
+
+This gives us (6.8).
+
+To bound the right hand by \(\delta\), we require
+
+\[\epsilon \sigma - {1 \over 2 \sigma} > \sqrt{2 \log \delta^{-1}}. \qquad (6.91)\]
+
+Solving this inequality we have
+
+\[\sigma > {\sqrt{2 \log \delta^{-1}} + \sqrt{2 \log \delta^{-1} + 2 \epsilon} \over 2 \epsilon}.\]
+
+Using
+\(\sqrt{2 \log \delta^{-1} + 2 \epsilon} \le \sqrt{2 \log \delta^{-1}} + \sqrt{2 \epsilon}\),
+we can achieve the above inequality by having
+
+\[\sigma > \epsilon^{-1} \sqrt{2 \log \delta^{-1}} + (2 \epsilon)^{-{1 \over 2}}.\]
+
+This gives us (6.81).
+
+Alternatively, we can use the concavity of \(\sqrt{\cdot}\):
+
+\[(2 \epsilon)^{-1} (\sqrt{2 \log \delta^{-1}} + \sqrt{2 \log \delta^{-1} + 2 \epsilon}) \le \epsilon^{-1} \sqrt{\log e^\epsilon \delta^{-2}},\]
+
+which gives us (6.83)
+
+Back to (6.9), if we use (6.5) instead, we need
+
+\[\log t + {t^2 \over 2} > \log {(2 \pi)^{- 1 / 2} \delta^{-1}}\]
+
+where \(t = \epsilon \sigma - (2 \sigma)^{-1}\). This can be satisfied if
+
+$$\begin{aligned}
+t &> 1 \qquad (6.93)\\
+t &> \sqrt{\log (2 \pi)^{-1} \delta^{-2}}. \qquad (6.95)
+\end{aligned}$$
+
+We can solve both inequalities as before and obtain
+
+\[\sigma > \epsilon^{-1} (1 \vee \sqrt{(\log (2 \pi)^{-1} \delta^{-2})_+}) + (2 \epsilon)^{- {1 \over 2}},\]
+
+or
+
+\[\sigma > \epsilon^{-1}(\sqrt{1 + \epsilon} \vee \sqrt{(\log e^\epsilon (2 \pi)^{-1} \delta^{-2})_+}).\]
+
+This gives us (6.82)(6.84). \(\square\)
+
+When \(\epsilon \le \alpha\) is bounded, by (6.83) (6.84) we can require
+either
+
+\[\sigma > \epsilon^{-1} (\sqrt{\log e^\alpha \delta^{-2}}) S_f\]
+
+or
+
+\[\sigma > \epsilon^{-1} (\sqrt{1 + \alpha} \vee \sqrt{(\log (2 \pi)^{-1} e^\alpha \delta^{-2})_+}) S_f.\]
+
+The second bound is similar to and slightly better than the one in
+Theorem A.1 of Dwork-Roth 2013, where \(\alpha = 1\):
+
+\[\sigma > \epsilon^{-1} \left({3 \over 2} \vee \sqrt{(2 \log {5 \over 4} \delta^{-1})_+}\right) S_f.\]
+
+Note that the lower bound of \({3 \over 2}\) is implicitly required in the
+proof of Theorem A.1.
+
+** Composition theorems
+ :PROPERTIES:
+ :CUSTOM_ID: composition-theorems
+ :END:
+So far we have seen how a mechanism made of a single query plus a noise
+can be proved to be differentially private. But we need to understand
+the privacy when composing several mechanisms, combinatorially or
+sequentially. Let us first define the combinatorial case:
+
+*Definition (Independent composition)*. Let \(M_1, ..., M_k\) be \(k\)
+mechanisms with independent noises. The mechanism \(M = (M_1, ..., M_k)\)
+is called the /independent composition/ of \(M_{1 : k}\).
+
+To define the adaptive composition, let us motivate it with an example
+of gradient descent. Consider the loss function \(\ell(x; \theta)\) of a
+neural network, where \(\theta\) is the parameter and \(x\) the input,
+gradient descent updates its parameter \(\theta\) at each time \(t\):
+
+\[\theta_{t} = \theta_{t - 1} - \alpha m^{-1} \sum_{i = 1 : m} \nabla_\theta \ell(x_i; \theta) |_{\theta = \theta_{t - 1}}.\]
+
+We may add privacy by adding noise \(\zeta_t\) at each step:
+
+\[\theta_{t} = \theta_{t - 1} - \alpha m^{-1} \sum_{i = 1 : m} \nabla_\theta \ell(x_i; \theta) |_{\theta = \theta_{t - 1}} + \zeta_t. \qquad (6.97)\]
+
+Viewed as a sequence of mechanism, we have that at each time \(t\), the
+mechanism \(M_t\) takes input \(x\), and outputs \(\theta_t\). But \(M_t\) also
+depends on the output of the previous mechanism \(M_{t - 1}\). To this
+end, we define the adaptive composition.
+
+*Definition (Adaptive composition)*. Let
+\(({M_i(y_{1 : i - 1})})_{i = 1 : k}\) be \(k\) mechanisms with independent
+noises, where \(M_1\) has no parameter, \(M_2\) has one parameter in \(Y\),
+\(M_3\) has two parameters in \(Y\) and so on. For \(x \in X\), define \(\xi_i\)
+recursively by
+
+$$\begin{aligned}
+\xi_1 &:= M_1(x)\\
+\xi_i &:= M_i(\xi_1, \xi_2, ..., \xi_{i - 1}) (x).
+\end{aligned}$$
+
+The /adaptive composition/ of \(M_{1 : k}\) is defined by
+\(M(x) := (\xi_1, \xi_2, ..., \xi_k)\).
+
+The definition of adaptive composition may look a bit complicated, but
+the point is to describe \(k\) mechanisms such that for each \(i\), the
+output of the first, second, ..., \(i - 1\)th mechanisms determine the
+\(i\)th mechanism, like in the case of gradient descent.
+
+It is not hard to write down the differentially private gradient descent
+as a sequential composition:
+
+\[M_t(\theta_{1 : t - 1})(x) = \theta_{t - 1} - \alpha m^{-1} \sum_{i = 1 : m} \nabla_\theta \ell(x_i; \theta) |_{\theta = \theta_{t - 1}} + \zeta_t.\]
+
+In Dwork-Rothblum-Vadhan 2010 (see also Dwork-Roth 2013) the adaptive
+composition is defined in a more general way, but the definition is
+based on the same principle, and proofs in this post on adaptive
+compositions carry over.
+
+It is not hard to see that the adaptive composition degenerates to
+independent composition when each \(M_i(y_{1 : i})\) evaluates to the same
+mechanism regardless of \(y_{1 : i}\), in which case the \(\xi_i\)s are
+independent.
+
+In the following when discussing adaptive compositions we sometimes omit
+the parameters for convenience without risk of ambiguity, and write
+\(M_i(y_{1 : i})\) as \(M_i\), but keep in mind of the dependence on the
+parameters.
+
+It is time to state and prove the composition theorems. In this section
+we consider \(2 \times 2 \times 2 = 8\) cases, i.e. situations of three
+dimensions, where there are two choices in each dimension:
+
+1. Composition of \(\epsilon\)-dp or more generally
+ \((\epsilon, \delta)\)-dp mechanisms
+2. Composition of independent or more generally adaptive mechanisms
+3. Basic or advanced compositions
+
+Note that in the first two dimensions the second choice is more general
+than the first.
+
+The proofs of these composition theorems will be laid out as follows:
+
+1. Claim 10 - Basic composition theorem for \((\epsilon, \delta)\)-dp with
+ adaptive mechanisms: by a direct proof with an induction argument
+2. Claim 14 - Advanced composition theorem for \(\epsilon\)-dp with
+ independent mechanisms: by factorising privacy loss and using
+ [[https://en.wikipedia.org/wiki/Hoeffding%27s_inequality][Hoeffding's
+ Inequality]]
+3. Claim 16 - Advanced composition theorem for \(\epsilon\)-dp with
+ adaptive mechanisms: by factorising privacy loss and using
+ [[https://en.wikipedia.org/wiki/Azuma%27s_inequality][Azuma's
+ Inequality]]
+4. Claims 17 and 18 - Advanced composition theorem for
+ \((\epsilon, \delta)\)-dp with independent / adaptive mechanisms: by
+ using characterisations of \((\epsilon, \delta)\)-dp in Claims 4 and 5
+ as an approximation of \(\epsilon\)-dp and then using Proofs in Item 2
+ or 3.
+
+*Claim 10 (Basic composition theorem).* Let \(M_{1 : k}\) be \(k\)
+mechanisms with independent noises such that for each \(i\) and
+\(y_{1 : i - 1}\), \(M_i(y_{1 : i - 1})\) is \((\epsilon_i, \delta_i)\)-dp.
+Then the adpative composition of \(M_{1 : k}\) is
+\((\sum_i \epsilon_i, \sum_i \delta_i)\)-dp.
+
+*Proof (Dwork-Lei 2009, see also Dwork-Roth 2013 Appendix B.1)*. Let \(x\)
+and \(x'\) be neighbouring points in \(X\). Let \(M\) be the adaptive
+composition of \(M_{1 : k}\). Define
+
+\[\xi_{1 : k} := M(x), \qquad \eta_{1 : k} := M(x').\]
+
+Let \(p^i\) and \(q^i\) be the laws of \((\xi_{1 : i})\) and \((\eta_{1 : i})\)
+respectively.
+
+Let \(S_1, ..., S_k \subset Y\) and \(T_i := \prod_{j = 1 : i} S_j\). We use
+two tricks.
+
+1. Since \(\xi_i | \xi_{< i} = y_{< i}\) and
+ \(\eta_i | \eta_{< i} = y_{< i}\) are \((\epsilon_i, \delta_i)\)-ind, and
+ a probability is no greater than \(1\), $$\begin{aligned}
+ \mathbb P(\xi_i \in S_i | \xi_{< i} = y_{< i}) &\le (e^{\epsilon_i} \mathbb P(\eta_i \in S_i | \eta_{< i} = y_{< i}) + \delta_i) \wedge 1 \\
+ &\le (e^{\epsilon_i} \mathbb P(\eta_i \in S_i | \eta_{< i} = y_{< i}) + \delta_i) \wedge (1 + \delta_i) \\
+ &= (e^{\epsilon_i} \mathbb P(\eta_i \in S_i | \eta_{< i} = y_{< i}) \wedge 1) + \delta_i
+ \end{aligned}$$
+
+2. Given \(p\) and \(q\) that are \((\epsilon, \delta)\)-ind, define
+ \[\mu(x) = (p(x) - e^\epsilon q(x))_+.\]
+
+ We have \[\mu(S) \le \delta, \forall S\]
+
+ In the following we define
+ \(\mu^{i - 1} = (p^{i - 1} - e^\epsilon q^{i - 1})_+\) for the same
+ purpose.
+
+We use an inductive argument to prove the theorem:
+
+$$\begin{aligned}
+\mathbb P(\xi_{\le i} \in T_i) &= \int_{T_{i - 1}} \mathbb P(\xi_i \in S_i | \xi_{< i} = y_{< i}) p^{i - 1} (y_{< i}) dy_{< i} \\
+&\le \int_{T_{i - 1}} (e^{\epsilon_i} \mathbb P(\eta_i \in S_i | \eta_{< i} = y_{< i}) \wedge 1) p^{i - 1}(y_{< i}) dy_{< i} + \delta_i\\
+&\le \int_{T_{i - 1}} (e^{\epsilon_i} \mathbb P(\eta_i \in S_i | \eta_{< i} = y_{< i}) \wedge 1) (e^{\epsilon_1 + ... + \epsilon_{i - 1}} q^{i - 1}(y_{< i}) + \mu^{i - 1} (y_{< i})) dy_{< i} + \delta_i\\
+&\le \int_{T_{i - 1}} e^{\epsilon_i} \mathbb P(\eta_i \in S_i | \eta_{< i} = y_{< i}) e^{\epsilon_1 + ... + \epsilon_{i - 1}} q^{i - 1}(y_{< i}) dy_{< i} + \mu_{i - 1}(T_{i - 1}) + \delta_i\\
+&\le e^{\epsilon_1 + ... + \epsilon_i} \mathbb P(\eta_{\le i} \in T_i) + \delta_1 + ... + \delta_{i - 1} + \delta_i.\\
+\end{aligned}$$
+
+In the second line we use Trick 1; in the third line we use the
+induction assumption; in the fourth line we multiply the first term in
+the first braket with first term in the second braket, and the second
+term (i.e. \(1\)) in the first braket with the second term in the second
+braket (i.e. the \(\mu\) term); in the last line we use Trick 2.
+
+The base case \(i = 1\) is true since \(M_1\) is
+\((\epsilon_1, \delta_1)\)-dp. \(\square\)
+
+To prove the advanced composition theorem, we start with some lemmas.
+
+*Claim 11*. If \(p\) and \(q\) are \(\epsilon\)-ind, then
+
+\[D(p || q) + D(q || p) \le \epsilon(e^\epsilon - 1).\]
+
+*Proof*. Since \(p\) and \(q\) are \(\epsilon\)-ind, we have
+\(|\log p(x) - \log q(x)| \le \epsilon\) for all \(x\). Let
+\(S := \{x: p(x) > q(x)\}\). Then we have on
+
+$$\begin{aligned}
+D(p || q) + D(q || p) &= \int (p(x) - q(x)) (\log p(x) - \log q(x)) dx\\
+&= \int_S (p(x) - q(x)) (\log p(x) - \log q(x)) dx + \int_{S^c} (q(x) - p(x)) (\log q(x) - \log p(x)) dx\\
+&\le \epsilon(\int_S p(x) - q(x) dx + \int_{S^c} q(x) - p(x) dx)
+\end{aligned}$$
+
+Since on \(S\) we have \(q(x) \le p(x) \le e^\epsilon q(x)\), and on \(S^c\)
+we have \(p(x) \le q(x) \le e^\epsilon p(x)\), we obtain
+
+\[D(p || q) + D(q || p) \le \epsilon \int_S (1 - e^{-\epsilon}) p(x) dx + \epsilon \int_{S^c} (e^{\epsilon} - 1) p(x) dx \le \epsilon (e^{\epsilon} - 1),\]
+
+where in the last step we use \(e^\epsilon - 1 \ge 1 - e^{- \epsilon}\)
+and \(p(S) + p(S^c) = 1\). \(\square\)
+
+*Claim 12*. If \(p\) and \(q\) are \(\epsilon\)-ind, then
+
+\[D(p || q) \le a(\epsilon) \ge D(q || p),\]
+
+where
+
+\[a(\epsilon) = \epsilon (e^\epsilon - 1) 1_{\epsilon \le \log 2} + \epsilon 1_{\epsilon > \log 2} \le (\log 2)^{-1} \epsilon^2 1_{\epsilon \le \log 2} + \epsilon 1_{\epsilon > \log 2}. \qquad (6.98)\]
+
+*Proof*. Since \(p\) and \(q\) are \(\epsilon\)-ind, we have
+
+\[D(p || q) = \mathbb E_{\xi \sim p} \log {p(\xi) \over q(\xi)} \le \max_y {\log p(y) \over \log q(y)} \le \epsilon.\]
+
+Comparing the quantity in Claim 11 (\(\epsilon(e^\epsilon - 1)\)) with the
+quantity above (\(\epsilon\)), we arrive at the conclusion. \(\square\)
+
+*Claim 13
+([[https://en.wikipedia.org/wiki/Hoeffding%27s_inequality][Hoeffding's
+Inequality]])*. Let \(L_i\) be independent random variables with
+\(|L_i| \le b\), and let \(L = L_1 + ... + L_k\), then for \(t > 0\),
+
+\[\mathbb P(L - \mathbb E L \ge t) \le \exp(- {t^2 \over 2 k b^2}).\]
+
+*Claim 14 (Advanced Independent Composition Theorem)* (\(\delta = 0\)).
+Fix \(0 < \beta < 1\). Let \(M_1, ..., M_k\) be \(\epsilon\)-dp, then the
+independent composition \(M\) of \(M_{1 : k}\) is
+\((k a(\epsilon) + \sqrt{2 k \log \beta^{-1}} \epsilon, \beta)\)-dp.
+
+*Remark*. By (6.98) we know that
+\(k a(\epsilon) + \sqrt{2 k \log \beta^{-1}} \epsilon = \sqrt{2 k \log \beta^{-1}} \epsilon + k O(\epsilon^2)\)
+when \(\epsilon\) is sufficiently small, in which case the leading term is
+of order \(O(\sqrt k \epsilon)\) and we save a \(\sqrt k\) in the
+\(\epsilon\)-part compared to the Basic Composition Theorem (Claim 10).
+
+*Remark*. In practice one can try different choices of \(\beta\) and
+settle with the one that gives the best privacy guarantee. See the
+discussions at the end of
+[[/posts/2019-03-14-great-but-manageable-expectations.html][Part 2 of
+this post]].
+
+*Proof*. Let \(p_i\), \(q_i\), \(p\) and \(q\) be the laws of \(M_i(x)\),
+\(M_i(x')\), \(M(x)\) and \(M(x')\) respectively.
+
+\[\mathbb E L_i = D(p_i || q_i) \le a(\epsilon),\]
+
+where \(L_i := L(p_i || q_i)\). Due to \(\epsilon\)-ind also have
+
+\[|L_i| \le \epsilon.\]
+
+Therefore, by Hoeffding's Inequality,
+
+\[\mathbb P(L - k a(\epsilon) \ge t) \le \mathbb P(L - \mathbb E L \ge t) \le \exp(- t^2 / 2 k \epsilon^2),\]
+
+where \(L := \sum_i L_i = L(p || q)\).
+
+Plugging in \(t = \sqrt{2 k \epsilon^2 \log \beta^{-1}}\), we have
+
+\[\mathbb P(L(p || q) \le k a(\epsilon) + \sqrt{2 k \epsilon^2 \log \beta^{-1}}) \ge 1 - \beta.\]
+
+Similarly we also have
+
+\[\mathbb P(L(q || p) \le k a(\epsilon) + \sqrt{2 k \epsilon^2 \log \beta^{-1}}) \ge 1 - \beta.\]
+
+By Claim 1 we arrive at the conclusion. \(\square\)
+
+*Claim 15 ([[https://en.wikipedia.org/wiki/Azuma%27s_inequality][Azuma's
+Inequality]])*. Let \(X_{0 : k}\) be a
+[[https://en.wikipedia.org/wiki/Martingale_(probability_theory)][supermartingale]].
+If \(|X_i - X_{i - 1}| \le b\), then
+
+\[\mathbb P(X_k - X_0 \ge t) \le \exp(- {t^2 \over 2 k b^2}).\]
+
+Azuma's Inequality implies a slightly weaker version of Hoeffding's
+Inequality. To see this, let \(L_{1 : k}\) be independent variables with
+\(|L_i| \le b\). Let \(X_i = \sum_{j = 1 : i} L_j - \mathbb E L_j\). Then
+\(X_{0 : k}\) is a martingale, and
+
+\[| X_i - X_{i - 1} | = | L_i - \mathbb E L_i | \le 2 b,\]
+
+since \(\|L_i\|_1 \le \|L_i\|_\infty\). Hence by Azuma's Inequality,
+
+\[\mathbb P(L - \mathbb E L \ge t) \le \exp(- {t^2 \over 8 k b^2}).\]
+
+Of course here we have made no assumption on \(\mathbb E L_i\). If instead
+we have some bound for the expectation, say \(|\mathbb E L_i| \le a\),
+then by the same derivation we have
+
+\[\mathbb P(L - \mathbb E L \ge t) \le \exp(- {t^2 \over 2 k (a + b)^2}).\]
+
+It is not hard to see what Azuma is to Hoeffding is like adaptive
+composition to independent composition. Indeed, we can use Azuma's
+Inequality to prove the Advanced Adaptive Composition Theorem for
+\(\delta = 0\).
+
+*Claim 16 (Advanced Adaptive Composition Theorem)* (\(\delta = 0\)). Let
+\(\beta > 0\). Let \(M_{1 : k}\) be \(k\) mechanisms with independent noises
+such that for each \(i\) and \(y_{1 : i}\), \(M_i(y_{1 : i})\) is
+\((\epsilon, 0)\)-dp. Then the adpative composition of \(M_{1 : k}\) is
+\((k a(\epsilon) + \sqrt{2 k \log \beta^{-1}} (\epsilon + a(\epsilon)), \beta)\)-dp.
+
+*Proof*. As before, let \(\xi_{1 : k} \overset{d}{=} M(x)\) and
+\(\eta_{1 : k} \overset{d}{=} M(x')\), where \(M\) is the adaptive
+composition of \(M_{1 : k}\). Let \(p_i\) (resp. \(q_i\)) be the law of
+\(\xi_i | \xi_{< i}\) (resp. \(\eta_i | \eta_{< i}\)). Let \(p^i\) (resp.
+\(q^i\)) be the law of \(\xi_{\le i}\) (resp. \(\eta_{\le i}\)). We want to
+construct supermartingale \(X\). To this end, let
+
+\[X_i = \log {p^i(\xi_{\le i}) \over q^i(\xi_{\le i})} - i a(\epsilon) \]
+
+We show that \((X_i)\) is a supermartingale:
+
+$$\begin{aligned}
+\mathbb E(X_i - X_{i - 1} | X_{i - 1}) &= \mathbb E \left(\log {p_i (\xi_i | \xi_{< i}) \over q_i (\xi_i | \xi_{< i})} - a(\epsilon) | \log {p^{i - 1} (\xi_{< i}) \over q^{i - 1} (\xi_{< i})}\right) \\
+&= \mathbb E \left( \mathbb E \left(\log {p_i (\xi_i | \xi_{< i}) \over q_i (\xi_i | \xi_{< i})} | \xi_{< i}\right) | \log {p^{i - 1} (\xi_{< i}) \over q^{i - 1} (\xi_{< i})}\right) - a(\epsilon) \\
+&= \mathbb E \left( D(p_i (\cdot | \xi_{< i}) || q_i (\cdot | \xi_{< i})) | \log {p^{i - 1} (\xi_{< i}) \over q^{i - 1} (\xi_{< i})}\right) - a(\epsilon) \\
+&\le 0,
+\end{aligned}$$
+
+since by Claim 12
+\(D(p_i(\cdot | y_{< i}) || q_i(\cdot | y_{< i})) \le a(\epsilon)\) for
+all \(y_{< i}\).
+
+Since
+
+\[| X_i - X_{i - 1} | = | \log {p_i(\xi_i | \xi_{< i}) \over q_i(\xi_i | \xi_{< i})} - a(\epsilon) | \le \epsilon + a(\epsilon),\]
+
+by Azuma's Inequality,
+
+\[\mathbb P(\log {p^k(\xi_{1 : k}) \over q^k(\xi_{1 : k})} \ge k a(\epsilon) + t) \le \exp(- {t^2 \over 2 k (\epsilon + a(\epsilon))^2}). \qquad(6.99)\]
+
+Let \(t = \sqrt{2 k \log \beta^{-1}} (\epsilon + a(\epsilon))\) we are
+done. \(\square\)
+
+*Claim 17 (Advanced Independent Composition Theorem)*. Fix
+\(0 < \beta < 1\). Let \(M_1, ..., M_k\) be \((\epsilon, \delta)\)-dp, then
+the independent composition \(M\) of \(M_{1 : k}\) is
+\((k a(\epsilon) + \sqrt{2 k \log \beta^{-1}} \epsilon, k \delta + \beta)\)-dp.
+
+*Proof*. By Claim 4, there exist events \(E_{1 : k}\) and \(F_{1 : k}\) such
+that
+
+1. The laws \(p_{i | E_i}\) and \(q_{i | F_i}\) are \(\epsilon\)-ind.
+2. \(\mathbb P(E_i), \mathbb P(F_i) \ge 1 - \delta\).
+
+Let \(E := \bigcap E_i\) and \(F := \bigcap F_i\), then they both have
+probability at least \(1 - k \delta\), and \(p_{i | E}\) and \(q_{i | F}\) are
+\(\epsilon\)-ind.
+
+By Claim 14, \(p_{|E}\) and \(q_{|F}\) are
+\((\epsilon' := k a(\epsilon) + \sqrt{2 k \epsilon^2 \log \beta^{-1}}, \beta)\)-ind.
+Let us shrink the bigger event between \(E\) and \(F\) so that they have
+equal probabilities. Then
+
+$$\begin{aligned}
+p (S) &\le p_{|E}(S) \mathbb P(E) + \mathbb P(E^c) \\
+&\le (e^{\epsilon'} q_{|F}(S) + \beta) \mathbb P(F) + k \delta\\
+&\le e^{\epsilon'} q(S) + \beta + k \delta.
+\end{aligned}$$
+
+\(\square\)
+
+*Claim 18 (Advanced Adaptive Composition Theorem)*. Fix \(0 < \beta < 1\).
+Let \(M_{1 : k}\) be \(k\) mechanisms with independent noises such that for
+each \(i\) and \(y_{1 : i}\), \(M_i(y_{1 : i})\) is \((\epsilon, \delta)\)-dp.
+Then the adpative composition of \(M_{1 : k}\) is
+\((k a(\epsilon) + \sqrt{2 k \log \beta^{-1}} (\epsilon + a(\epsilon)), \beta + k \delta)\)-dp.
+
+*Remark*. This theorem appeared in Dwork-Rothblum-Vadhan 2010, but I
+could not find a proof there. A proof can be found in Dwork-Roth 2013
+(See Theorem 3.20 there). Here I prove it in a similar way, except that
+instead of the use of an intermediate random variable there, I use the
+conditional probability results from Claim 5, the approach mentioned in
+Vadhan 2017.
+
+*Proof*. By Claim 5, there exist events \(E_{1 : k}\) and \(F_{1 : k}\) such
+that
+
+1. The laws \(p_{i | E_i}(\cdot | y_{< i})\) and
+ \(q_{i | F_i}(\cdot | y_{< i})\) are \(\epsilon\)-ind for all \(y_{< i}\).
+2. \(\mathbb P(E_i | y_{< i}), \mathbb P(F_i | y_{< i}) \ge 1 - \delta\)
+ for all \(y_{< i}\).
+
+Let \(E := \bigcap E_i\) and \(F := \bigcap F_i\), then they both have
+probability at least \(1 - k \delta\), and \(p_{i | E}(\cdot | y_{< i}\) and
+\(q_{i | F}(\cdot | y_{< i})\) are \(\epsilon\)-ind.
+
+By Advanced Adaptive Composition Theorem (\(\delta = 0\)), \(p_{|E}\) and
+\(q_{|F}\) are
+\((\epsilon' := k a(\epsilon) + \sqrt{2 k \log \beta^{-1}} (\epsilon + a(\epsilon)), \beta)\)-ind.
+
+The rest is the same as in the proof of Claim 17. \(\square\)
+
+** Subsampling
+ :PROPERTIES:
+ :CUSTOM_ID: subsampling
+ :END:
+Stochastic gradient descent is like gradient descent, but with random
+subsampling.
+
+Recall we have been considering databases in the space \(Z^m\). Let
+\(n < m\) be a positive integer,
+\(\mathcal I := \{I \subset [m]: |I| = n\}\) be the set of subsets of
+\([m]\) of size \(n\), and \(\gamma\) a random subset sampled uniformly from
+\(\mathcal I\). Let \(r = {n \over m}\) which we call the subsampling rate.
+Then we may add a subsampling module to the noisy gradient descent
+algorithm (6.97) considered before
+
+\[\theta_{t} = \theta_{t - 1} - \alpha n^{-1} \sum_{i \in \gamma} \nabla_\theta h_\theta(x_i) |_{\theta = \theta_{t - 1}} + \zeta_t. \qquad (7)\]
+
+It turns out subsampling has an amplification effect on privacy.
+
+*Claim 19 (Ullman 2017)*. Fix \(r \in [0, 1]\). Let \(n \le m\) be two
+nonnegative integers with \(n = r m\). Let \(N\) be an
+\((\epsilon, \delta)\)-dp mechanism on \(Z^n\). Define mechanism \(M\) on
+\(Z^m\) by
+
+\[M(x) = N(x_\gamma)\]
+
+Then \(M\) is \((\log (1 + r(e^\epsilon - 1)), r \delta)\)-dp.
+
+*Remark*. Some seem to cite
+Kasiviswanathan-Lee-Nissim-Raskhodnikova-Smith 2005 for this result, but
+it is not clear to me how it appears there.
+
+*Proof*. Let \(x, x' \in Z^n\) such that they differ by one row
+\(x_i \neq x_i'\). Naturally we would like to consider the cases where the
+index \(i\) is picked and the ones where it is not separately. Let
+\(\mathcal I_\in\) and \(\mathcal I_\notin\) be these two cases:
+
+$$\begin{aligned}
+\mathcal I_\in = \{J \subset \mathcal I: i \in J\}\\
+\mathcal I_\notin = \{J \subset \mathcal I: i \notin J\}\\
+\end{aligned}$$
+
+We will use these notations later. Let \(A\) be the event
+\(\{\gamma \ni i\}\).
+
+Let \(p\) and \(q\) be the laws of \(M(x)\) and \(M(x')\) respectively. We
+collect some useful facts about them. First due to \(N\) being
+\((\epsilon, \delta)\)-dp,
+
+\[p_{|A}(S) \le e^\epsilon q_{|A}(S) + \delta.\]
+
+Also,
+
+\[p_{|A}(S) \le e^\epsilon p_{|A^c}(S) + \delta.\]
+
+To see this, note that being conditional laws, \(p_A\) and \(p_{A^c}\) are
+averages of laws over \(\mathcal I_\in\) and \(\mathcal I_\notin\)
+respectively:
+
+$$\begin{aligned}
+p_{|A}(S) = |\mathcal I_\in|^{-1} \sum_{I \in \mathcal I_\in} \mathbb P(N(x_I) \in S)\\
+p_{|A^c}(S) = |\mathcal I_\notin|^{-1} \sum_{J \in \mathcal I_\notin} \mathbb P(N(x_J) \in S).
+\end{aligned}$$
+
+Now we want to pair the \(I\)'s in \(\mathcal I_\in\) and \(J\)'s in
+\(\mathcal I_\notin\) so that they differ by one index only, which means
+\(d(x_I, x_J) = 1\). Formally, this means we want to consider the set:
+
+\[\mathcal D := \{(I, J) \in \mathcal I_\in \times \mathcal I_\notin: |I \cap J| = n - 1\}.\]
+
+We may observe by trying out some simple cases that every
+\(I \in \mathcal I_\in\) is paired with \(n\) elements in
+\(\mathcal I_\notin\), and every \(J \in \mathcal I_\notin\) is paired with
+\(m - n\) elements in \(\mathcal I_\in\). Therefore
+
+\[p_{|A}(S) = |\mathcal D|^{-1} \sum_{(I, J) \in \mathcal D} \mathbb P(N(x_I \in S)) \le |\mathcal D|^{-1} \sum_{(I, J) \in \mathcal D} (e^\epsilon \mathbb P(N(x_J \in S)) + \delta) = e^\epsilon p_{|A^c} (S) + \delta.\]
+
+Since each of the \(m\) indices is picked independently with probability
+\(r\), we have
+
+\[\mathbb P(A) = r.\]
+
+Let \(t \in [0, 1]\) to be determined. We may write
+
+$$\begin{aligned}
+p(S) &= r p_{|A} (S) + (1 - r) p_{|A^c} (S)\\
+&\le r(t e^\epsilon q_{|A}(S) + (1 - t) e^\epsilon q_{|A^c}(S) + \delta) + (1 - r) q_{|A^c} (S)\\
+&= rte^\epsilon q_{|A}(S) + (r(1 - t) e^\epsilon + (1 - r)) q_{|A^c} (S) + r \delta\\
+&= te^\epsilon r q_{|A}(S) + \left({r \over 1 - r}(1 - t) e^\epsilon + 1\right) (1 - r) q_{|A^c} (S) + r \delta \\
+&\le \left(t e^\epsilon \wedge \left({r \over 1 - r} (1 - t) e^\epsilon + 1\right)\right) q(S) + r \delta. \qquad (7.5)
+\end{aligned}$$
+
+We can see from the last line that the best bound we can get is when
+
+\[t e^\epsilon = {r \over 1 - r} (1 - t) e^\epsilon + 1.\]
+
+Solving this equation we obtain
+
+\[t = r + e^{- \epsilon} - r e^{- \epsilon}\]
+
+and plugging this in (7.5) we have
+
+\[p(S) \le (1 + r(e^\epsilon - 1)) q(S) + r \delta.\]
+
+\(\square\)
+
+Since \(\log (1 + x) < x\) for \(x > 0\), we can rewrite the conclusion of
+the Claim to \((r(e^\epsilon - 1), r \delta)\)-dp. Further more, if
+\(\epsilon < \alpha\) for some \(\alpha\), we can rewrite it as
+\((r \alpha^{-1} (e^\alpha - 1) \epsilon, r \delta)\)-dp or
+\((O(r \epsilon), r \delta)\)-dp.
+
+Let \(\epsilon < 1\). We see that if the mechanism \(N\) is
+\((\epsilon, \delta)\)-dp on \(Z^n\), then \(M\) is
+\((2 r \epsilon, r \delta)\)-dp, and if we run it over \(k / r\)
+minibatches, by Advanced Adaptive Composition theorem, we have
+\((\sqrt{2 k r \log \beta^{-1}} \epsilon + 2 k r \epsilon^2, k \delta + \beta)\)-dp.
+
+This is better than the privacy guarantee without subsampling, where we
+run over \(k\) iterations and obtain
+\((\sqrt{2 k \log \beta^{-1}} \epsilon + 2 k \epsilon^2, k \delta + \beta)\)-dp.
+So with subsampling we gain an extra \(\sqrt r\) in the \(\epsilon\)-part of
+the privacy guarantee. But, smaller subsampling rate means smaller
+minibatch size, which would result in bigger variance, so there is a
+trade-off here.
+
+Finally we define the differentially private stochastic gradient descent
+(DP-SGD) with the Gaussian mechanism
+(Abadi-Chu-Goodfellow-McMahan-Mironov-Talwar-Zhang 2016), which is (7)
+with the noise specialised to Gaussian and an added clipping operation
+to bound to sensitivity of the query to a chosen \(C\):
+
+\[\theta_{t} = \theta_{t - 1} - \alpha \left(n^{-1} \sum_{i \in \gamma} \nabla_\theta \ell(x_i; \theta) |_{\theta = \theta_{t - 1}}\right)_{\text{Clipped at }C / 2} + N(0, \sigma^2 C^2 I),\]
+
+where
+
+\[y_{\text{Clipped at } \alpha} := y / (1 \vee {\|y\|_2 \over \alpha})\]
+
+is \(y\) clipped to have norm at most \(\alpha\).
+
+Note that the clipping in DP-SGD is much stronger than making the query
+have sensitivity \(C\). It makes the difference between the query results
+of two /arbitrary/ inputs bounded by \(C\), rather than /neighbouring/
+inputs.
+
+In [[/posts/2019-03-14-great-but-manageable-expectations.html][Part 2 of
+this post]] we will use the tools developed above to discuss the privacy
+guarantee for DP-SGD, among other things.
+
+** References
+ :PROPERTIES:
+ :CUSTOM_ID: references
+ :END:
+
+- Abadi, Martín, Andy Chu, Ian Goodfellow, H. Brendan McMahan, Ilya
+ Mironov, Kunal Talwar, and Li Zhang. "Deep Learning with Differential
+ Privacy." Proceedings of the 2016 ACM SIGSAC Conference on Computer
+ and Communications Security - CCS'16, 2016, 308--18.
+ [[https://doi.org/10.1145/2976749.2978318]].
+- Dwork, Cynthia, and Aaron Roth. "The Algorithmic Foundations of
+ Differential Privacy." Foundations and Trends® in Theoretical Computer
+ Science 9, no. 3--4 (2013): 211--407.
+ [[https://doi.org/10.1561/0400000042]].
+- Dwork, Cynthia, Guy N. Rothblum, and Salil Vadhan. "Boosting and
+ Differential Privacy." In 2010 IEEE 51st Annual Symposium on
+ Foundations of Computer Science, 51--60. Las Vegas, NV, USA:
+ IEEE, 2010. [[https://doi.org/10.1109/FOCS.2010.12]].
+- Shiva Prasad Kasiviswanathan, Homin K. Lee, Kobbi Nissim, Sofya
+ Raskhodnikova, and Adam Smith. "What Can We Learn Privately?" In 46th
+ Annual IEEE Symposium on Foundations of Computer Science (FOCS'05).
+ Pittsburgh, PA, USA: IEEE, 2005.
+ [[https://doi.org/10.1109/SFCS.2005.1]].
+- Murtagh, Jack, and Salil Vadhan. "The Complexity of Computing the
+ Optimal Composition of Differential Privacy." In Theory of
+ Cryptography, edited by Eyal Kushilevitz and Tal Malkin, 9562:157--75.
+ Berlin, Heidelberg: Springer Berlin Heidelberg, 2016.
+ [[https://doi.org/10.1007/978-3-662-49096-9_7]].
+- Ullman, Jonathan. "Solution to CS7880 Homework 1.", 2017.
+ [[http://www.ccs.neu.edu/home/jullman/cs7880s17/HW1sol.pdf]]
+- Vadhan, Salil. "The Complexity of Differential Privacy." In Tutorials
+ on the Foundations of Cryptography, edited by Yehuda Lindell,
+ 347--450. Cham: Springer International Publishing, 2017.
+ [[https://doi.org/10.1007/978-3-319-57048-8_7]].
+
+[fn:1] For those who have read about differential privacy and never
+ heard of the term "divergence variable", it is closely related to
+ the notion of "privacy loss", see the paragraph under Claim 6 in
+ [[#back-to-approximate-differential-privacy][Back to approximate
+ differential privacy]]. I defined the term this way so that we
+ can focus on the more general stuff: compared to the privacy loss
+ \(L(M(x) || M(x'))\), the term \(L(p || q)\) removes the "distracting
+ information" that \(p\) and \(q\) are related to databases, queries,
+ mechanisms etc., but merely probability laws. By removing the
+ distraction, we simplify the analysis. And once we are done with
+ the analysis of \(L(p || q)\), we can apply the results obtained in
+ the general setting to the special case where \(p\) is the law of
+ \(M(x)\) and \(q\) is the law of \(M(x')\).
diff --git a/posts/2019-03-14-great-but-manageable-expectations.org b/posts/2019-03-14-great-but-manageable-expectations.org
new file mode 100644
index 0000000..6438090
--- /dev/null
+++ b/posts/2019-03-14-great-but-manageable-expectations.org
@@ -0,0 +1,836 @@
+#+title: Great but Manageable Expectations
+
+#+date: <2019-03-14>
+
+This is Part 2 of a two-part blog post on differential privacy.
+Continuing from [[file:2019-03-13-a-tail-of-two-densities.org][Part 1]], I discuss the Rényi differential privacy, corresponding to the
+Rényi divergence, a study of the
+[[https://en.wikipedia.org/wiki/Moment-generating_function][moment
+generating functions]] of the divergence between probability measures in
+order to derive the tail bounds.
+
+Like in Part 1, I prove a composition theorem and a subsampling theorem.
+
+I also attempt to reproduce a seemingly better moment bound for the
+Gaussian mechanism with subsampling, with one intermediate step which I
+am not able to prove.
+
+After that I explain the Tensorflow implementation of differential
+privacy in its
+[[https://github.com/tensorflow/privacy/tree/master/privacy][Privacy]]
+module, which focuses on the differentially private stochastic gradient
+descent algorithm (DP-SGD).
+
+Finally I use the results from both Part 1 and Part 2 to obtain some
+privacy guarantees for composed subsampling queries in general, and for
+DP-SGD in particular. I also compare these privacy guarantees.
+
+/If you are confused by any notations, ask me or try
+[[/notations.html][this]]./
+
+** Rényi divergence and differential privacy
+ :PROPERTIES:
+ :CUSTOM_ID: rényi-divergence-and-differential-privacy
+ :END:
+Recall in the proof of Gaussian mechanism privacy guarantee (Claim 8) we
+used the Chernoff bound for the Gaussian noise. Why not use the Chernoff
+bound for the divergence variable / privacy loss directly, since the
+latter is closer to the core subject than the noise? This leads us to
+the study of Rényi divergence.
+
+So far we have seen several notions of divergence used in differential
+privacy: the max divergence which is \(\epsilon\)-ind in disguise:
+
+\[D_\infty(p || q) := \max_y \log {p(y) \over q(y)},\]
+
+the \(\delta\)-approximate max divergence that defines the
+\((\epsilon, \delta)\)-ind:
+
+\[D_\infty^\delta(p || q) := \max_y \log{p(y) - \delta \over q(y)},\]
+
+and the KL-divergence which is the expectation of the divergence
+variable:
+
+\[D(p || q) = \mathbb E L(p || q) = \int \log {p(y) \over q(y)} p(y) dy.\]
+
+The Rényi divergence is an interpolation between the max divergence and
+the KL-divergence, defined as the log moment generating function /
+cumulants of the divergence variable:
+
+\[D_\lambda(p || q) = (\lambda - 1)^{-1} \log \mathbb E \exp((\lambda - 1) L(p || q)) = (\lambda - 1)^{-1} \log \int {p(y)^\lambda \over q(y)^{\lambda - 1}} dy.\]
+
+Indeed, when \(\lambda \to \infty\) we recover the max divergence, and
+when \(\lambda \to 1\), by recognising \(D_\lambda\) as a derivative in
+\(\lambda\) at \(\lambda = 1\), we recover the KL divergence. In this post
+we only consider \(\lambda > 1\).
+
+Using the Rényi divergence we may define:
+
+*Definition (Rényi differential privacy)* (Mironov 2017). An mechanism
+\(M\) is \((\lambda, \rho)\)/-Rényi differentially private/
+(\((\lambda, \rho)\)-rdp) if for all \(x\) and \(x'\) with distance \(1\),
+
+\[D_\lambda(M(x) || M(x')) \le \rho.\]
+
+For convenience we also define two related notions, \(G_\lambda (f || g)\)
+and \(\kappa_{f, g} (t)\) for \(\lambda > 1\), \(t > 0\) and positive
+functions \(f\) and \(g\):
+
+\[G_\lambda(f || g) = \int f(y)^{\lambda} g(y)^{1 - \lambda} dy; \qquad \kappa_{f, g} (t) = \log G_{t + 1}(f || g).\]
+
+For probability densities \(p\) and \(q\), \(G_{t + 1}(p || q)\) and
+\(\kappa_{p, q}(t)\) are the \(t\)th moment generating function and
+[[https://en.wikipedia.org/wiki/Cumulant][cumulant]] of the divergence
+variable \(L(p || q)\), and
+
+\[D_\lambda(p || q) = (\lambda - 1)^{-1} \kappa_{p, q}(\lambda - 1).\]
+
+In the following, whenever you see \(t\), think of it as \(\lambda - 1\).
+
+*Example 1 (RDP for the Gaussian mechanism)*. Using the scaling and
+translation invariance of \(L\) (6.1), we have that the divergence
+variable for two Gaussians with the same variance is
+
+\[L(N(\mu_1, \sigma^2 I) || N(\mu_2, \sigma^2 I)) \overset{d}{=} L(N(0, I) || N((\mu_2 - \mu_1) / \sigma, I)).\]
+
+With this we get
+
+\[D_\lambda(N(\mu_1, \sigma^2 I) || N(\mu_2, \sigma^2 I)) = {\lambda \|\mu_2 - \mu_1\|_2^2 \over 2 \sigma^2} = D_\lambda(N(\mu_2, \sigma^2 I) || N(\mu_1, \sigma^2 I)).\]
+
+Again due to the scaling invariance of \(L\), we only need to consider \(f\)
+with sensitivity \(1\), see the discussion under (6.1). The Gaussian
+mechanism on query \(f\) is thus \((\lambda, \lambda / 2 \sigma^2)\)-rdp for
+any \(\lambda > 1\).
+
+From the example of Gaussian mechanism, we see that the relation between
+\(\lambda\) and \(\rho\) is like that between \(\epsilon\) and \(\delta\). Given
+\(\lambda\) (resp. \(\rho\)) and parameters like variance of the noise and
+the sensitivity of the query, we can write \(\rho = \rho(\lambda)\) (resp.
+\(\lambda = \lambda(\rho)\)).
+
+Using the Chernoff bound (6.7), we can bound the divergence variable:
+
+\[\mathbb P(L(p || q) \ge \epsilon) \le {\mathbb E \exp(t L(p || q)) \over \exp(t \epsilon))} = \exp (\kappa_{p, q}(t) - \epsilon t). \qquad (7.7)\]
+
+For a function \(f: I \to \mathbb R\), denote its
+[[https://en.wikipedia.org/wiki/Legendre_transformation][Legendre
+transform]] by
+
+\[f^*(\epsilon) := \sup_{t \in I} (\epsilon t - f(t)).\]
+
+By taking infimum on the RHS of (7.7), we obtain
+
+*Claim 20*. Two probability densities \(p\) and \(q\) are
+\((\epsilon, \exp(-\kappa_{p, q}^*(\epsilon)))\)-ind.
+
+Given a mechanism \(M\), let \(\kappa_M(t)\) denote an upper bound for the
+cumulant of its privacy loss:
+
+\[\log \mathbb E \exp(t L(M(x) || M(x'))) \le \kappa_M(t), \qquad \forall x, x'\text{ with } d(x, x') = 1.\]
+
+For example, we can set \(\kappa_M(t) = t \rho(t + 1)\). Using the same
+argument we have the following:
+
+*Claim 21*. If \(M\) is \((\lambda, \rho)\)-rdp, then
+
+1. it is also \((\epsilon, \exp((\lambda - 1) (\rho - \epsilon)))\)-dp for
+ any \(\epsilon \ge \rho\).
+2. Alternatively, \(M\) is \((\epsilon, - \exp(\kappa_M^*(\epsilon)))\)-dp
+ for any \(\epsilon > 0\).
+3. Alternatively, for any \(0 < \delta \le 1\), \(M\) is
+ \((\rho + (\lambda - 1)^{-1} \log \delta^{-1}, \delta)\)-dp.
+
+*Example 2 (Gaussian mechanism)*. We can apply the above argument to the
+Gaussian mechanism on query \(f\) and get:
+
+\[\delta \le \inf_{\lambda > 1} \exp((\lambda - 1) ({\lambda \over 2 \sigma^2} - \epsilon))\]
+
+By assuming \(\sigma^2 > (2 \epsilon)^{-1}\) we have that the infimum is
+achieved when \(\lambda = (1 + 2 \epsilon / \sigma^2) / 2\) and
+
+\[\delta \le \exp(- ((2 \sigma)^{-1} - \epsilon \sigma)^2 / 2)\]
+
+which is the same result as (6.8), obtained using the Chernoff bound of
+the noise.
+
+However, as we will see later, compositions will yield different results
+from those obtained from methods in
+[[/posts/2019-03-13-a-tail-of-two-densities.html][Part 1]] when
+considering Rényi dp.
+
+*** Moment Composition
+ :PROPERTIES:
+ :CUSTOM_ID: moment-composition
+ :END:
+*Claim 22 (Moment Composition Theorem)*. Let \(M\) be the adaptive
+composition of \(M_{1 : k}\). Suppose for any \(y_{< i}\), \(M_i(y_{< i})\) is
+\((\lambda, \rho)\)-rdp. Then \(M\) is \((\lambda, k\rho)\)-rdp.
+
+*Proof*. Rather straightforward. As before let \(p_i\) and \(q_i\) be the
+conditional laws of adpative composition of \(M_{1 : i}\) at \(x\) and \(x'\)
+respectively, and \(p^i\) and \(q^i\) be the joint laws of \(M_{1 : i}\) at
+\(x\) and \(x'\) respectively. Denote
+
+\[D_i = \mathbb E \exp((\lambda - 1)\log {p^i(\xi_{1 : i}) \over q^i(\xi_{1 : i})})\]
+
+Then
+
+$$\begin{aligned}
+D_i &= \mathbb E\mathbb E \left(\exp((\lambda - 1)\log {p_i(\xi_i | \xi_{< i}) \over q_i(\xi_i | \xi_{< i})}) \exp((\lambda - 1)\log {p^{i - 1}(\xi_{< i}) \over q^{i - 1}(\xi_{< i})}) \big| \xi_{< i}\right) \\
+&= \mathbb E \mathbb E \left(\exp((\lambda - 1)\log {p_i(\xi_i | \xi_{< i}) \over q_i(\xi_i | \xi_{< i})}) | \xi_{< i}\right) \exp\left((\lambda - 1)\log {p^{i - 1}(\xi_{< i}) \over q^{i - 1}(\xi_{< i})}\right)\\
+&\le \mathbb E \exp((\lambda - 1) \rho) \exp\left((\lambda - 1)\log {p^{i - 1}(\xi_{< i}) \over q^{i - 1}(\xi_{< i})}\right)\\
+&= \exp((\lambda - 1) \rho) D_{i - 1}.
+\end{aligned}$$
+
+Applying this recursively we have
+
+\[D_k \le \exp(k(\lambda - 1) \rho),\]
+
+and so
+
+\[(\lambda - 1)^{-1} \log \mathbb E \exp((\lambda - 1)\log {p^k(\xi_{1 : i}) \over q^k(\xi_{1 : i})}) = (\lambda - 1)^{-1} \log D_k \le k \rho.\]
+
+Since this holds for all \(x\) and \(x'\), we are done. \(\square\)
+
+This, together with the scaling property of the legendre transformation:
+
+\[(k f)^*(x) = k f^*(x / k)\]
+
+yields
+
+*Claim 23*. The \(k\)-fold adaptive composition of
+\((\lambda, \rho(\lambda))\)-rdp mechanisms is
+\((\epsilon, \exp(- k \kappa^*(\epsilon / k)))\)-dp, where
+\(\kappa(t) := t \rho(t + 1)\).
+
+*Example 3 (Gaussian mechanism)*. We can apply the above claim to
+Gaussian mechanism. Again, without loss of generality we assume
+\(S_f = 1\). But let us do it manually to get the same results. If we
+apply the Moment Composition Theorem to the an adaptive composition of
+Gaussian mechanisms on the same query, then since each \(M_i\) is
+\((\lambda, (2 \sigma^2)^{-1} \lambda)\)-rdp, the composition \(M\) is
+\((\lambda, (2 \sigma^2)^{-1} k \lambda)\)-rdp. Processing this using the
+Chernoff bound as in the previous example, we have
+
+\[\delta = \exp(- ((2 \sigma / \sqrt k)^{-1} - \epsilon \sigma / \sqrt k)^2 / 2),\]
+
+Substituting \(\sigma\) with \(\sigma / \sqrt k\) in (6.81), we conclude
+that if
+
+\[\sigma > \sqrt k \left(\epsilon^{-1} \sqrt{2 \log \delta^{-1}} + (2 \epsilon)^{- {1 \over 2}}\right)\]
+
+then the composition \(M\) is \((\epsilon, \delta)\)-dp.
+
+As we will see in the discussions at the end of this post, this result
+is different from (and probably better than) the one obtained by using
+the Advanced Composition Theorem (Claim 18).
+
+*** Subsampling
+ :PROPERTIES:
+ :CUSTOM_ID: subsampling
+ :END:
+We also have a subsampling theorem for the Rényi dp.
+
+*Claim 24*. Fix \(r \in [0, 1]\). Let \(m \le n\) be two nonnegative
+integers with \(m = r n\). Let \(N\) be a \((\lambda, \rho)\)-rdp machanism on
+\(X^m\). Let \(\mathcal I := \{J \subset [n]: |J| = m\}\) be the set of
+subsets of \([n]\) of size \(m\). Define mechanism \(M\) on \(X^n\) by
+
+\[M(x) = N(x_\gamma)\]
+
+where \(\gamma\) is sampled uniformly from \(\mathcal I\). Then \(M\) is
+\((\lambda, {1 \over \lambda - 1} \log (1 + r(e^{(\lambda - 1) \rho} - 1)))\)-rdp.
+
+To prove Claim 24, we need a useful lemma:
+
+*Claim 25*. Let \(p_{1 : n}\) and \(q_{1 : n}\) be nonnegative integers, and
+\(\lambda > 1\). Then
+
+\[{(\sum p_i)^\lambda \over (\sum q_i)^{\lambda - 1}} \le \sum_i {p_i^\lambda \over q_i^{\lambda - 1}}. \qquad (8)\]
+
+*Proof*. Let
+
+\[r(i) := p_i / P, \qquad u(i) := q_i / Q\]
+
+where
+
+\[P := \sum p_i, \qquad Q := \sum q_i\]
+
+then \(r\) and \(u\) are probability mass functions. Plugging in
+\(p_i = r(i) P\) and \(q_i = u(i) Q\) into the objective (8), it suffices to
+show
+
+\[1 \le \sum_i {r(i)^\lambda \over u(i)^{\lambda - 1}} = \mathbb E_{\xi \sim u} \left({r(\xi) \over u(\xi)}\right)^\lambda\]
+
+This is true due to Jensen's Inequality:
+
+\[\mathbb E_{\xi \sim u} \left({r(\xi) \over u(\xi)}\right)^\lambda \ge \left(\mathbb E_{\xi \sim u} {r(\xi) \over u(\xi)} \right)^\lambda = 1.\]
+
+\(\square\)
+
+*Proof of Claim 24*. Define \(\mathcal I\) as before.
+
+Let \(p\) and \(q\) be the laws of \(M(x)\) and \(M(x')\) respectively. For any
+\(I \in \mathcal I\), let \(p_I\) and \(q_I\) be the laws of \(N(x_I)\) and
+\(N(x_I')\) respectively. Then we have
+
+$$\begin{aligned}
+p(y) &= n^{-1} \sum_{I \in \mathcal I} p_I(y) \\
+q(y) &= n^{-1} \sum_{I \in \mathcal I} q_I(y),
+\end{aligned}$$
+
+where \(n = |\mathcal I|\).
+
+The MGF of \(L(p || q)\) is thus
+
+\[\mathbb E((\lambda - 1) L(p || q)) = n^{-1} \int {(\sum_I p_I(y))^\lambda \over (\sum_I q_I(y))^{\lambda - 1}} dy \le n^{-1} \sum_I \int {p_I(y)^\lambda \over q_I(y)^{\lambda - 1}} dy \qquad (9)\]
+
+where in the last step we used Claim 25. As in the proof of Claim 19, we
+divide \(\mathcal I\) into disjoint sets \(\mathcal I_\in\) and
+\(\mathcal I_\notin\). Furthermore we denote by \(n_\in\) and \(n_\notin\)
+their cardinalities. Then the right hand side of (9) becomes
+
+\[n^{-1} \sum_{I \in \mathcal I_\in} \int {p_I(y)^\lambda \over q_I(y)^{\lambda - 1}} dy + n^{-1} \sum_{I \in \mathcal I_\notin} \int {p_I(y)^\lambda \over q_I(y)^{\lambda - 1}} dy\]
+
+The summands in the first are the MGF of \(L(p_I || q_I)\), and the
+summands in the second term are \(1\), so
+
+$$\begin{aligned}
+\mathbb E((\lambda - 1) L(p || q)) &\le n^{-1} \sum_{I \in \mathcal I_\in} \mathbb E \exp((\lambda - 1) L(p_I || q_I)) + (1 - r) \\
+&\le n^{-1} \sum_{I \in \mathcal I_\in} \exp((\lambda - 1) D_\lambda(p_I || q_I)) + (1 - r) \\
+&\le r \exp((\lambda - 1) \rho) + (1 - r).
+\end{aligned}$$
+
+Taking log and dividing by \((\lambda - 1)\) on both sides we have
+
+\[D_\lambda(p || q) \le (\lambda - 1)^{-1} \log (1 + r(\exp((\lambda - 1) \rho) - 1)).\]
+
+\(\square\)
+
+As before, we can rewrite the conclusion of Lemma 6 using
+\(1 + z \le e^z\) and obtain
+\((\lambda, (\lambda - 1)^{-1} r (e^{(\lambda - 1) \rho} - 1))\)-rdp,
+which further gives \((\lambda, \alpha^{-1} (e^\alpha - 1) r \rho)\)-rdp
+(or \((\lambda, O(r \rho))\)-rdp) if \((\lambda - 1) \rho < \alpha\) for
+some \(\alpha\).
+
+It is not hard to see that the subsampling theorem in moment method,
+even though similar to the results of that in the usual method, does not
+help due to lack of an analogue of advanced composition theorem of the
+moments.
+
+*Example 4 (Gaussian mechanism)*. Applying the moment subsampling
+theorem to the Gaussian mechanism, we obtain
+\((\lambda, O(r \lambda / \sigma^2))\)-rdp for a subsampled Gaussian
+mechanism with rate \(r\).
+Abadi-Chu-Goodfellow-McMahan-Mironov-Talwar-Zhang 2016 (ACGMMTZ16 in the
+following), however, gains an extra \(r\) in the bound given certain
+assumptions.
+
+** ACGMMTZ16
+ :PROPERTIES:
+ :CUSTOM_ID: acgmmtz16
+ :END:
+What follows is my understanding of this result. I call it a conjecture
+because there is a gap which I am not able to reproduce their proof or
+prove it myself. This does not mean the result is false. On the
+contrary, I am inclined to believe it is true.
+
+*Claim 26*. Assuming Conjecture 1 (see below) is true. For a subsampled
+Gaussian mechanism with ratio \(r\), if \(r = O(\sigma^{-1})\) and
+\(\lambda = O(\sigma^2)\), then we have
+\((\lambda, O(r^2 \lambda / \sigma^2))\)-rdp.
+
+Wait, why is there a conjecture? Well, I have tried but not been able to
+prove the following, which is a hidden assumption in the original proof:
+
+*Conjecture 1*. Let \(M\) be the Gaussian mechanism with subsampling rate
+\(r\), and \(p\) and \(q\) be the laws of \(M(x)\) and \(M(x')\) respectively,
+where \(d(x, x') = 1\). Then
+
+\[D_\lambda (p || q) \le D_\lambda (r \mu_1 + (1 - r) \mu_0 || \mu_0)\]
+
+where \(\mu_i = N(i, \sigma^2)\).
+
+*Remark*. Conjecture 1 is heuristically reasonable. To see this, let us
+use the notations \(p_I\) and \(q_I\) to be \(q\) and \(p\) conditioned on the
+subsampling index \(I\), just like in the proof of the subsampling
+theorems (Claim 19 and 24). Then for \(I \in \mathcal I_\in\),
+
+\[D_\lambda(p_I || q_I) \le D_\lambda(\mu_0 || \mu_1),\]
+
+and for \(I \in \mathcal I_\notin\),
+
+\[D_\lambda(p_I || q_I) = 0 = D_\lambda(\mu_0 || \mu_0).\]
+
+Since we are taking an average over \(\mathcal I\), of which
+\(r |\mathcal I|\) are in \(\mathcal I_\in\) and \((1 - r) |\mathcal I|\) are
+in \(\mathcal I_\notin\), (9.3) says "the inequalities carry over
+averaging".
+
+[[https://math.stackexchange.com/a/3152296/149540][A more general
+version of Conjecture 1 has been proven false]]. The counter example for
+the general version does not apply here, so it is still possible
+Conjecture 1 is true.
+
+Let \(p_\in\) (resp. \(q_\in\)) be the average of \(p_I\) (resp. \(q_I\)) over
+\(I \in \mathcal I_\in\), and \(p_\notin\) (resp. \(q_\notin\)) be the average
+of \(p_I\) (resp. \(q_I\)) over \(I \in \mathcal I_\notin\).
+
+Immediately we have \(p_\notin = q_\notin\), hence
+
+\[D_\lambda(p_\notin || q_\notin) = 0 = D_\lambda(\mu_0 || \mu_0). \qquad(9.7)\]
+
+By Claim 25, we have
+
+\[D_\lambda(p_\in || q_\in) \le D_\lambda (\mu_1 || \mu_0). \qquad(9.9) \]
+
+So one way to prove Conjecture 1 is perhaps prove a more specialised
+comparison theorem than the false conjecture:
+
+Given (9.7) and (9.9), show that
+
+\[D_\lambda(r p_\in + (1 - r) p_\notin || r q_\in + (1 - r) q_\notin) \le D_\lambda(r \mu_1 + (1 - r) \mu_0 || \mu_0).\]
+
+[End of Remark]
+
+#+begin_html
+ <!---
+ **Conjecture 1** \[Probably [FALSE](https://math.stackexchange.com/a/3152296/149540), to be removed\]. Let \(p_i\), \(q_i\), \(\mu_i\), \(\nu_i\) be
+ probability densities on the same space for \(i = 1 : n\). If
+ \(D_\lambda(p_i || q_i) \le D_\lambda(\mu_i || \nu_i)\) for all \(i\), then
+
+ \[D_\lambda(n^{-1} \sum_i p_i || n^{-1} \sum_i q_i) \le D_\lambda(n^{-1} \sum_i \mu_i || n^{-1} \sum_i \nu_i).\]
+
+ Basically, it is saying \"if for each \(i\), \(p_i\) and \(q_i\) are closer to
+ each other than \(\mu_i\) and \(\nu_i\), then so are their averages over
+ \(i\)\".
+ So it is heuristically reasonable.
+ But it is probably [**FALSE**](https://math.stackexchange.com/a/3152296/149540).
+ This does not mean Claim 26 is false, as it might still be possible that Conjecture 2 (see below) is true.
+
+ This conjecture is equivalent to its special case when \(n = 2\) by an induction argument
+ (replacing one pair of densities at a time).
+ -->
+#+end_html
+
+Recall the definition of \(G_\lambda\) under the definition of Rényi
+differential privacy. The following Claim will be useful.
+
+*Claim 27*. Let \(\lambda\) be a positive integer, then
+
+\[G_\lambda(r p + (1 - r) q || q) = \sum_{k = 1 : \lambda} {\lambda \choose k} r^k (1 - r)^{\lambda - k} G_k(p || q).\]
+
+*Proof*. Quite straightforward, by expanding the numerator
+\((r p + (1 - r) q)^\lambda\) using binomial expansion. \(\square\)
+
+*Proof of Claim 26*. By Conjecture 1, it suffices to prove the
+following:
+
+If \(r \le c_1 \sigma^{-1}\) and \(\lambda \le c_2 \sigma^2\) for some
+positive constant \(c_1\) and \(c_2\), then there exists \(C = C(c_1, c_2)\)
+such that \(G_\lambda (r \mu_1 + (1 - r) \mu_0 || \mu_0) \le C\) (since
+\(O(r^2 \lambda^2 / \sigma^2) = O(1)\)).
+
+*Remark in the proof*. Note that the choice of \(c_1\), \(c_2\) and the
+function \(C(c_1, c_2)\) are important to the practicality and usefulness
+of Claim 26.
+
+#+begin_html
+ <!---
+ Part 1 can be derived using Conjecture 1, but since Conjecture 1 is probably false,
+ let us rename Part 1 itself _Conjecture 2_, which needs to be verified by other means.
+ We use the notations \(p_I\) and \(q_I\) to be \(q\) and \(p\) conditioned on
+ the subsampling index \(I\), just like in the proof of the subsampling theorems (Claim 19 and 24).
+ Then
+
+ $$D_\lambda(q_I || p_I) = D_\lambda(p_I || q_I)
+ \begin{cases}
+ \le D_\lambda(\mu_0 || \mu_1) = D_\lambda(\mu_1 || \mu_0), & I \in \mathcal I_\in\\
+ = D_\lambda(\mu_0 || \mu_0) = D_\lambda(\mu_1 || \mu_1) = 0 & I \in \mathcal I_\notin
+ \end{cases}$$
+
+ Since \(p = |\mathcal I|^{-1} \sum_{I \in \mathcal I} p_I\) and
+ \(q = |\mathcal I|^{-1} \sum_{I \in \mathcal I} q_I\) and
+ \(|\mathcal I_\in| = r |\mathcal I|\), by Conjecture 1, we have Part 1.
+
+ **Remark in the proof**. As we can see here, instead of trying to prove Conjecture 1,
+ it suffices to prove a weaker version of it, by specialising on mixture of Gaussians,
+ in order to have a Claim 26 without any conjectural assumptions.
+ I have in fact posted the Conjecture on [Stackexchange](https://math.stackexchange.com/questions/3147963/an-inequality-related-to-the-renyi-divergence).
+
+ Now let us verify Part 2.
+ -->
+#+end_html
+
+Using Claim 27 and Example 1, we have
+
+$$\begin{aligned}
+G_\lambda(r \mu_1 + (1 - r) \mu_0 || \mu_0)) &= \sum_{j = 0 : \lambda} {\lambda \choose j} r^j (1 - r)^{\lambda - j} G_j(\mu_1 || \mu_0)\\
+&=\sum_{j = 0 : \lambda} {\lambda \choose j} r^j (1 - r)^{\lambda - j} \exp(j (j - 1) / 2 \sigma^2). \qquad (9.5)
+\end{aligned}$$
+
+Denote by \(n = \lceil \sigma^2 \rceil\). It suffices to show
+
+\[\sum_{j = 0 : n} {n \choose j} (c_1 n^{- 1 / 2})^j (1 - c_1 n^{- 1 / 2})^{n - j} \exp(c_2 j (j - 1) / 2 n) \le C\]
+
+Note that we can discard the linear term \(- c_2 j / \sigma^2\) in the
+exponential term since we want to bound the sum from above.
+
+We examine the asymptotics of this sum when \(n\) is large, and treat the
+sum as an approximation to an integration of a function
+\(\phi: [0, 1] \to \mathbb R\). For \(j = x n\), where \(x \in (0, 1)\),
+\(\phi\) is thus defined as (note we multiply the summand with \(n\) to
+compensate the uniform measure on \(1, ..., n\):
+
+$$\begin{aligned}
+\phi_n(x) &:= n {n \choose j} (c_1 n^{- 1 / 2})^j (1 - c_1 n^{- 1 / 2})^{n - j} \exp(c_2 j^2 / 2 n) \\
+&= n {n \choose x n} (c_1 n^{- 1 / 2})^{x n} (1 - c_1 n^{- 1 / 2})^{(1 - x) n} \exp(c_2 x^2 n / 2)
+\end{aligned}$$
+
+Using Stirling's approximation
+
+\[n! \approx \sqrt{2 \pi n} n^n e^{- n},\]
+
+we can approach the binomial coefficient:
+
+\[{n \choose x n} \approx (\sqrt{2 \pi x (1 - x)} x^{x n} (1 - x)^{(1 - x) n})^{-1}.\]
+
+We also approximate
+
+\[(1 - c_1 n^{- 1 / 2})^{(1 - x) n} \approx \exp(- c_1 \sqrt{n} (1 - x)).\]
+
+With these we have
+
+\[\phi_n(x) \approx {1 \over \sqrt{2 \pi x (1 - x)}} \exp\left(- {1 \over 2} x n \log n + (x \log c_1 - x \log x - (1 - x) \log (1 - x) + {1 \over 2} c_2 x^2) n + {1 \over 2} \log n\right).\]
+
+This vanishes as \(n \to \infty\), and since \(\phi_n(x)\) is bounded above
+by the integrable function \({1 \over \sqrt{2 \pi x (1 - x)}}\) (c.f. the
+arcsine law), and below by \(0\), we may invoke the dominant convergence
+theorem and exchange the integral with the limit and get
+
+$$\begin{aligned}
+\lim_{n \to \infty} &G_n (r \mu_1 + (1 - r) \mu_0 || \mu_0)) \\
+&\le \lim_{n \to \infty} \int \phi_n(x) dx = \int \lim_{n \to \infty} \phi_n(x) dx = 0.
+\end{aligned}$$
+
+Thus we have that the generating function of the divergence variable
+\(L(r \mu_1 + (1 - r) \mu_0 || \mu_0)\) is bounded.
+
+Can this be true for better orders
+
+\[r \le c_1 \sigma^{- d_r},\qquad \lambda \le c_2 \sigma^{d_\lambda}\]
+
+for some \(d_r \in (0, 1]\) and \(d_\lambda \in [2, \infty)\)? If we follow
+the same approximation using these exponents, then letting
+\(n = c_2 \sigma^{d_\lambda}\),
+
+$$\begin{aligned}
+{n \choose j} &r^j (1 - r)^{n - j} G_j(\mu_0 || \mu_1) \le \phi_n(x) \\
+&\approx {1 \over \sqrt{2 \pi x (1 - x)}} \exp\left({1 \over 2} c_2^{2 \over d_\lambda} x^2 n^{2 - {2 \over d_\lambda}} - {d_r \over 2} x n \log n + (x \log c_1 - x \log x - (1 - x) \log (1 - x)) n + {1 \over 2} \log n\right).
+\end{aligned}$$
+
+So we see that to keep the divergence moments bounded it is possible to
+have any \(r = O(\sigma^{- d_r})\) for \(d_r \in (0, 1)\), but relaxing
+\(\lambda\) may not be safe.
+
+If we relax \(r\), then we get
+
+\[G_\lambda(r \mu_1 + (1 - r) \mu_0 || \mu_0) = O(r^{2 / d_r} \lambda^2 \sigma^{-2}) = O(1).\]
+
+Note that now the constant \(C\) depends on \(d_r\) as well. Numerical
+experiments seem to suggest that \(C\) can increase quite rapidly as \(d_r\)
+decreases from \(1\). \(\square\)
+
+In the following for consistency we retain \(k\) as the number of epochs,
+and use \(T := k / r\) to denote the number of compositions / steps /
+minibatches. With Claim 26 we have:
+
+*Claim 28*. Assuming Conjecture 1 is true. Let \(\epsilon, c_1, c_2 > 0\),
+\(r \le c_1 \sigma^{-1}\),
+\(T = {c_2 \over 2 C(c_1, c_2)} \epsilon \sigma^2\). then the DP-SGD with
+subsampling rate \(r\), and \(T\) steps is \((\epsilon, \delta)\)-dp for
+
+\[\delta = \exp(- {1 \over 2} c_2 \sigma^2 \epsilon).\]
+
+In other words, for
+
+\[\sigma \ge \sqrt{2 c_2^{-1}} \epsilon^{- {1 \over 2}} \sqrt{\log \delta^{-1}},\]
+
+we can achieve \((\epsilon, \delta)\)-dp.
+
+*Proof*. By Claim 26 and the Moment Composition Theorem (Claim 22), for
+\(\lambda = c_2 \sigma^2\), substituting
+\(T = {c_2 \over 2 C(c_1, c_2)} \epsilon \sigma^2\), we have
+
+\[\mathbb P(L(p || q) \ge \epsilon) \le \exp(k C(c_1, c_2) - \lambda \epsilon) = \exp\left(- {1 \over 2} c_2 \sigma^2 \epsilon\right).\]
+
+\(\square\)
+
+*Remark*. Claim 28 is my understanding / version of Theorem 1 in
+[ACGMMTZ16], by using the same proof technique. Here I quote the
+original version of theorem with notions and notations altered for
+consistency with this post:
+
+#+begin_quote
+ There exists constants \(c_1', c_2' > 0\) so that for any
+ \(\epsilon < c_1' r^2 T\), DP-SGD is \((\epsilon, \delta)\)-differentially
+ private for any \(\delta > 0\) if we choose
+#+end_quote
+
+\[\sigma \ge c_2' {r \sqrt{T \log (1 / \delta)} \over \epsilon}. \qquad (10)\]
+
+I am however unable to reproduce this version, assuming Conjecture 1 is
+true, for the following reasons:
+
+1. In the proof in the paper, we have \(\epsilon = c_1' r^2 T\) instead of
+ "less than" in the statement of the Theorem. If we change it to
+ \(\epsilon < c_1' r^2 T\) then the direction of the inequality becomes
+ opposite to the direction we want to prove:
+ \[\exp(k C(c_1, c_2) - \lambda \epsilon) \ge ...\]
+
+2. The condition \(r = O(\sigma^{-1})\) of Claim 26 whose result is used
+ in the proof of this theorem is not mentioned in the statement of the
+ proof. The implication is that (10) becomes an ill-formed condition
+ as the right hand side also depends on \(\sigma\).
+
+** Tensorflow implementation
+ :PROPERTIES:
+ :CUSTOM_ID: tensorflow-implementation
+ :END:
+The DP-SGD is implemented in
+[[https://github.com/tensorflow/privacy][TensorFlow Privacy]]. In the
+following I discuss the package in the current state (2019-03-11). It is
+divided into two parts:
+[[https://github.com/tensorflow/privacy/tree/master/privacy/optimizers][=optimizers=]]
+which implements the actual differentially private algorithms, and
+[[https://github.com/tensorflow/privacy/tree/master/privacy/analysis][=analysis=]]
+which computes the privacy guarantee.
+
+The =analysis= part implements a privacy ledger that "keeps a record of
+all queries executed over a given dataset for the purpose of computing
+privacy guarantees". On the other hand, all the computation is done in
+[[https://github.com/tensorflow/privacy/blob/7e2d796bdee9b60dce21a82a397eefda35b0ac10/privacy/analysis/rdp_accountant.py][=rdp_accountant.py=]].
+At this moment, =rdp_accountant.py= only implements the computation of
+the privacy guarantees for DP-SGD with Gaussian mechanism. In the
+following I will briefly explain the code in this file.
+
+Some notational correspondences: their =alpha= is our \(\lambda\), their
+=q= is our \(r\), their =A_alpha= (in the comments) is our
+\(\kappa_{r N(1, \sigma^2) + (1 - r) N(0, \sigma^2)} (\lambda - 1)\), at
+least when \(\lambda\) is an integer.
+
+- The function =_compute_log_a= presumably computes the cumulants
+ \(\kappa_{r N(1, \sigma^2) + (1 - r) N(0, \sigma^2), N(0, \sigma^2)}(\lambda - 1)\).
+ It calls =_compute_log_a_int= or =_compute_log_a_frac= depending on
+ whether \(\lambda\) is an integer.
+- The function =_compute_log_a_int= computes the cumulant using (9.5).
+- When \(\lambda\) is not an integer, we can't use (9.5). I have yet to
+ decode how =_compute_log_a_frac= computes the cumulant (or an upper
+ bound of it) in this case
+- The function =_compute_delta= computes \(\delta\)s for a list of
+ \(\lambda\)s and \(\kappa\)s using Item 1 of Claim 25 and return the
+ smallest one, and the function =_compute_epsilon= computes epsilon
+ uses Item 3 in Claim 25 in the same way.
+
+In =optimizers=, among other things, the DP-SGD with Gaussian mechanism
+is implemented in =dp_optimizer.py= and =gaussian_query.py=. See the
+definition of =DPGradientDescentGaussianOptimizer= in =dp_optimizer.py=
+and trace the calls therein.
+
+At this moment, the privacy guarantee computation part and the optimizer
+part are separated, with =rdp_accountant.py= called in
+=compute_dp_sgd_privacy.py= with user-supplied parameters. I think this
+is due to the lack of implementation in =rdp_accountant.py= of any
+non-DPSGD-with-Gaussian privacy guarantee computation. There is already
+[[https://github.com/tensorflow/privacy/issues/23][an issue on this]],
+so hopefully it won't be long before the privacy guarantees can be
+automatically computed given a DP-SGD instance.
+
+** Comparison among different methods
+ :PROPERTIES:
+ :CUSTOM_ID: comparison-among-different-methods
+ :END:
+So far we have seen three routes to compute the privacy guarantees for
+DP-SGD with the Gaussian mechanism:
+
+1. Claim 9 (single Gaussian mechanism privacy guarantee) -> Claim 19
+ (Subsampling theorem) -> Claim 18 (Advanced Adaptive Composition
+ Theorem)
+2. Example 1 (RDP for the Gaussian mechanism) -> Claim 22 (Moment
+ Composition Theorem) -> Example 3 (Moment composition applied to the
+ Gaussian mechanism)
+3. Claim 26 (RDP for Gaussian mechanism with specific magnitudes for
+ subsampling rate) -> Claim 28 (Moment Composition Theorem and
+ translation to conventional DP)
+
+Which one is the best?
+
+To make fair comparison, we may use one parameter as the metric and set
+all others to be the same. For example, we can
+
+1. Given the same \(\epsilon\), \(r\) (in Route 1 and 3), \(k\), \(\sigma\),
+ compare the \(\delta\)s
+2. Given the same \(\epsilon\), \(r\) (in Route 1 and 3), \(k\), \(\delta\),
+ compare the \(\sigma\)s
+3. Given the same \(\delta\), \(r\) (in Route 1 and 3), \(k\), \(\sigma\),
+ compare the \(\epsilon\)s.
+
+I find that the first one, where \(\delta\) is used as a metric, the best.
+This is because we have the tightest bounds and the cleanest formula
+when comparing the \(\delta\). For example, the Azuma and Chernoff bounds
+are both expressed as a bound for \(\delta\). On the other hand, the
+inversion of these bounds either requires a cost in the tightness (Claim
+9, bounds on \(\sigma\)) or in the complexity of the formula (Claim 16
+Advanced Adaptive Composition Theorem, bounds on \(\epsilon\)).
+
+So if we use \(\sigma\) or \(\epsilon\) as a metric, either we get a less
+fair comparison, or have to use a much more complicated formula as the
+bounds.
+
+Let us first compare Route 1 and Route 2 without specialising to the
+Gaussian mechanism.
+
+*Warning*. What follows is a bit messy.
+
+Suppose each mechanism \(N_i\) satisfies
+\((\epsilon', \delta(\epsilon'))\)-dp. Let
+\(\tilde \epsilon := \log (1 + r (e^{\epsilon'} - 1))\), then we have the
+subsampled mechanism \(M_i(x) = N_i(x_\gamma)\) is
+\((\tilde \epsilon, r \tilde \delta(\tilde \epsilon))\)-dp, where
+
+\[\tilde \delta(\tilde \epsilon) = \delta(\log (r^{-1} (\exp(\tilde \epsilon) - 1) + 1))\]
+
+Using the Azuma bound in the proof of Advanced Adaptive Composition
+Theorem (6.99):
+
+\[\mathbb P(L(p^k || q^k) \ge \epsilon) \le \exp(- {(\epsilon - r^{-1} k a(\tilde\epsilon))^2 \over 2 r^{-1} k (\tilde\epsilon + a(\tilde\epsilon))^2}).\]
+
+So we have the final bound for Route 1:
+
+\[\delta_1(\epsilon) = \min_{\tilde \epsilon: \epsilon > r^{-1} k a(\tilde \epsilon)} \exp(- {(\epsilon - r^{-1} k a(\tilde\epsilon))^2 \over 2 r^{-1} k (\tilde\epsilon + a(\tilde\epsilon))^2}) + k \tilde \delta(\tilde \epsilon).\]
+
+As for Route 2, since we do not gain anything from subsampling in RDP,
+we do not subsample at all.
+
+By Claim 23, we have the bound for Route 2:
+
+\[\delta_2(\epsilon) = \exp(- k \kappa^* (\epsilon / k)).\]
+
+On one hand, one can compare \(\delta_1\) and \(\delta_2\) with numerical
+experiments. On the other hand, if we further specify
+\(\delta(\epsilon')\) in Route 1 as the Chernoff bound for the cumulants
+of divergence variable, i.e.
+
+\[\delta(\epsilon') = \exp(- \kappa^* (\epsilon')),\]
+
+we have
+
+\[\delta_1 (\epsilon) = \min_{\tilde \epsilon: a(\tilde \epsilon) < r k^{-1} \epsilon} \exp(- {(\epsilon - r^{-1} k a(\tilde\epsilon))^2 \over 2 r^{-1} k (\tilde\epsilon + a(\tilde\epsilon))^2}) + k \exp(- \kappa^* (b(\tilde\epsilon))),\]
+
+where
+
+\[b(\tilde \epsilon) := \log (r^{-1} (\exp(\tilde \epsilon) - 1) + 1) \le r^{-1} \tilde\epsilon.\]
+
+We note that since
+\(a(\tilde \epsilon) = \tilde\epsilon(e^{\tilde \epsilon} - 1) 1_{\tilde\epsilon < \log 2} + \tilde\epsilon 1_{\tilde\epsilon \ge \log 2}\),
+we may compare the two cases separately.
+
+Note that we have \(\kappa^*\) is a monotonously increasing function,
+therefore
+
+\[\kappa^* (b(\tilde\epsilon)) \le \kappa^*(r^{-1} \tilde\epsilon).\]
+
+So for \(\tilde \epsilon \ge \log 2\), we have
+
+\[k \exp(- \kappa^*(b(\tilde\epsilon))) \ge k \exp(- \kappa^*(r^{-1} \tilde \epsilon)) \ge k \exp(- \kappa^*(k^{-1} \epsilon)) \ge \delta_2(\epsilon).\]
+
+For \(\tilde\epsilon < \log 2\), it is harder to compare, as now
+
+\[k \exp(- \kappa^*(b(\tilde\epsilon))) \ge k \exp(- \kappa^*(\epsilon / \sqrt{r k})).\]
+
+It is tempting to believe that this should also be greater than
+\(\delta_2(\epsilon)\). But I can not say for sure. At least in the
+special case of Gaussian, we have
+
+\[k \exp(- \kappa^*(\epsilon / \sqrt{r k})) = k \exp(- (\sigma \sqrt{\epsilon / k r} - (2 \sigma)^{-1})^2) \ge \exp(- k ({\sigma \epsilon \over k} - (2 \sigma)^{-1})^2) = \delta_2(\epsilon)\]
+
+when \(\epsilon\) is sufficiently small. However we still need to consider
+the case where \(\epsilon\) is not too small. But overall it seems most
+likely Route 2 is superior than Route 1.
+
+So let us compare Route 2 with Route 3:
+
+Given the condition to obtain the Chernoff bound
+
+\[{\sigma \epsilon \over k} > (2 \sigma)^{-1}\]
+
+we have
+
+\[\delta_2(\epsilon) > \exp(- k (\sigma \epsilon / k)^2) = \exp(- \sigma^2 \epsilon^2 / k).\]
+
+For this to achieve the same bound
+
+\[\delta_3(\epsilon) = \exp\left(- {1 \over 2} c_2 \sigma^2 \epsilon\right)\]
+
+we need \(k < {2 \epsilon \over c_2}\). This is only possible if \(c_2\) is
+small or \(\epsilon\) is large, since \(k\) is a positive integer.
+
+So taking at face value, Route 3 seems to achieve the best results.
+However, it also has some similar implicit conditions that need to be
+satisfied: First \(T\) needs to be at least \(1\), meaning
+
+\[{c_2 \over C(c_1, c_2)} \epsilon \sigma^2 \ge 1.\]
+
+Second, \(k\) needs to be at least \(1\) as well, i.e.
+
+\[k = r T \ge {c_1 c_2 \over C(c_1, c_2)} \epsilon \sigma \ge 1.\]
+
+Both conditions rely on the magnitudes of \(\epsilon\), \(\sigma\), \(c_1\),
+\(c_2\), and the rate of growth of \(C(c_1, c_2)\). The biggest problem in
+this list is the last, because if we know how fast \(C\) grows then we'll
+have a better idea what are the constraints for the parameters to
+achieve the result in Route 3.
+
+** Further questions
+ :PROPERTIES:
+ :CUSTOM_ID: further-questions
+ :END:
+Here is a list of what I think may be interesting topics or potential
+problems to look at, with no guarantee that they are all awesome
+untouched research problems:
+
+1. Prove Conjecture 1
+2. Find a theoretically definitive answer whether the methods in Part 1
+ or Part 2 yield better privacy guarantees.
+3. Study the non-Gaussian cases, general or specific. Let \(p\) be some
+ probability density, what is the tail bound of
+ \(L(p(y) || p(y + \alpha))\) for \(|\alpha| \le 1\)? Can you find
+ anything better than Gaussian? For a start, perhaps the nice tables
+ of Rényi divergence in Gil-Alajaji-Linder 2013 may be useful?
+4. Find out how useful Claim 26 is. Perhaps start with computing the
+ constant \(C\) nemerically.
+5. Help with [[https://github.com/tensorflow/privacy/issues/23][the
+ aforementioned issue]] in the Tensorflow privacy package.
+
+** References
+ :PROPERTIES:
+ :CUSTOM_ID: references
+ :END:
+
+- Abadi, Martín, Andy Chu, Ian Goodfellow, H. Brendan McMahan, Ilya
+ Mironov, Kunal Talwar, and Li Zhang. "Deep Learning with Differential
+ Privacy." Proceedings of the 2016 ACM SIGSAC Conference on Computer
+ and Communications Security - CCS'16, 2016, 308--18.
+ [[https://doi.org/10.1145/2976749.2978318]].
+- Erven, Tim van, and Peter Harremoës. "R\'enyi Divergence and
+ Kullback-Leibler Divergence." IEEE Transactions on Information Theory
+ 60, no. 7 (July 2014): 3797--3820.
+ [[https://doi.org/10.1109/TIT.2014.2320500]].
+- Gil, M., F. Alajaji, and T. Linder. "Rényi Divergence Measures for
+ Commonly Used Univariate Continuous Distributions." Information
+ Sciences 249 (November 2013): 124--31.
+ [[https://doi.org/10.1016/j.ins.2013.06.018]].
+- Mironov, Ilya. "Renyi Differential Privacy." 2017 IEEE 30th Computer
+ Security Foundations Symposium (CSF), August 2017, 263--75.
+ [[https://doi.org/10.1109/CSF.2017.11]].
diff --git a/posts/blog.html b/posts/blog.html
new file mode 100644
index 0000000..80176e7
--- /dev/null
+++ b/posts/blog.html
@@ -0,0 +1,21 @@
+#+TITLE: All posts
+
+- *[[file:sitemap.org][All posts]]* - 2021-06-17
+- *[[file:2019-03-14-great-but-manageable-expectations.org][Great but Manageable Expectations]]* - 2019-03-14
+- *[[file:2019-03-13-a-tail-of-two-densities.org][A Tail of Two Densities]]* - 2019-03-13
+- *[[file:2019-02-14-raise-your-elbo.org][Raise your ELBO]]* - 2019-02-14
+- *[[file:2019-01-03-discriminant-analysis.org][Discriminant analysis]]* - 2019-01-03
+- *[[file:2018-12-02-lime-shapley.org][Shapley, LIME and SHAP]]* - 2018-12-02
+- *[[file:2018-06-03-automatic_differentiation.org][Automatic differentiation]]* - 2018-06-03
+- *[[file:2018-04-10-update-open-research.org][Updates on open research]]* - 2018-04-29
+- *[[file:2017-08-07-mathematical_bazaar.org][The Mathematical Bazaar]]* - 2017-08-07
+- *[[file:2017-04-25-open_research_toywiki.org][Open mathematical research and launching toywiki]]* - 2017-04-25
+- *[[file:2016-10-13-q-robinson-schensted-knuth-polymer.org][A \(q\)-Robinson-Schensted-Knuth algorithm and a \(q\)-polymer]]* - 2016-10-13
+- *[[file:2015-07-15-double-macdonald-polynomials-macdonald-superpolynomials.org][AMS review of 'Double Macdonald polynomials as the stable limit of Macdonald superpolynomials' by Blondeau-Fournier, Lapointe and Mathieu]]* - 2015-07-15
+- *[[file:2015-07-01-causal-quantum-product-levy-area.org][On a causal quantum double product integral related to Lévy stochastic area.]]* - 2015-07-01
+- *[[file:2015-05-30-infinite-binary-words-containing-repetitions-odd-periods.org][AMS review of 'Infinite binary words containing repetitions of odd period' by Badkobeh and Crochemore]]* - 2015-05-30
+- *[[file:2015-04-02-juggling-skill-tree.org][jst]]* - 2015-04-02
+- *[[file:2015-04-01-unitary-double-products.org][Unitary causal quantum stochastic double products as universal]]* - 2015-04-01
+- *[[file:2015-01-20-weighted-interpretation-super-catalan-numbers.org][AMS review of 'A weighted interpretation for the super Catalan]]* - 2015-01-20
+- *[[file:2014-04-01-q-robinson-schensted-symmetry-paper.org][Symmetry property of \(q\)-weighted Robinson-Schensted algorithms and branching algorithms]]* - 2014-04-01
+- *[[file:2013-06-01-q-robinson-schensted-paper.org][A \(q\)-weighted Robinson-Schensted algorithm]]* - 2013-06-01 \ No newline at end of file