From 147a19e84a743f1379f05bf2f444143b4afd7bd6 Mon Sep 17 00:00:00 2001 From: Yuchen Pei Date: Fri, 18 Jun 2021 12:58:44 +1000 Subject: Updated. --- posts/2013-06-01-q-robinson-schensted-paper.org | 28 + ...4-04-01-q-robinson-schensted-symmetry-paper.org | 16 + ...ighted-interpretation-super-catalan-numbers.org | 39 + posts/2015-04-01-unitary-double-products.org | 10 + posts/2015-04-02-juggling-skill-tree.org | 28 + ...ry-words-containing-repetitions-odd-periods.org | 67 + ...2015-07-01-causal-quantum-product-levy-area.org | 26 + ...nald-polynomials-macdonald-superpolynomials.org | 64 + ...16-10-13-q-robinson-schensted-knuth-polymer.org | 50 + posts/2017-04-25-open_research_toywiki.org | 21 + posts/2017-08-07-mathematical_bazaar.org | 213 ++++ posts/2018-04-10-update-open-research.org | 185 +++ posts/2018-06-03-automatic_differentiation.org | 100 ++ posts/2018-12-02-lime-shapley.org | 362 ++++++ posts/2019-01-03-discriminant-analysis.org | 293 +++++ posts/2019-02-14-raise-your-elbo.org | 1150 +++++++++++++++++ posts/2019-03-13-a-tail-of-two-densities.org | 1304 ++++++++++++++++++++ ...019-03-14-great-but-manageable-expectations.org | 836 +++++++++++++ posts/blog.html | 21 + 19 files changed, 4813 insertions(+) create mode 100644 posts/2013-06-01-q-robinson-schensted-paper.org create mode 100644 posts/2014-04-01-q-robinson-schensted-symmetry-paper.org create mode 100644 posts/2015-01-20-weighted-interpretation-super-catalan-numbers.org create mode 100644 posts/2015-04-01-unitary-double-products.org create mode 100644 posts/2015-04-02-juggling-skill-tree.org create mode 100644 posts/2015-05-30-infinite-binary-words-containing-repetitions-odd-periods.org create mode 100644 posts/2015-07-01-causal-quantum-product-levy-area.org create mode 100644 posts/2015-07-15-double-macdonald-polynomials-macdonald-superpolynomials.org create mode 100644 posts/2016-10-13-q-robinson-schensted-knuth-polymer.org create mode 100644 posts/2017-04-25-open_research_toywiki.org create mode 100644 posts/2017-08-07-mathematical_bazaar.org create mode 100644 posts/2018-04-10-update-open-research.org create mode 100644 posts/2018-06-03-automatic_differentiation.org create mode 100644 posts/2018-12-02-lime-shapley.org create mode 100644 posts/2019-01-03-discriminant-analysis.org create mode 100644 posts/2019-02-14-raise-your-elbo.org create mode 100644 posts/2019-03-13-a-tail-of-two-densities.org create mode 100644 posts/2019-03-14-great-but-manageable-expectations.org create mode 100644 posts/blog.html (limited to 'posts') diff --git a/posts/2013-06-01-q-robinson-schensted-paper.org b/posts/2013-06-01-q-robinson-schensted-paper.org new file mode 100644 index 0000000..27a6b0e --- /dev/null +++ b/posts/2013-06-01-q-robinson-schensted-paper.org @@ -0,0 +1,28 @@ +#+title: A \(q\)-weighted Robinson-Schensted algorithm + +#+date: <2013-06-01> + +In [[https://projecteuclid.org/euclid.ejp/1465064320][this paper]] with +[[http://www.bristol.ac.uk/maths/people/neil-m-oconnell/][Neil]] we +construct a \(q\)-version of the Robinson-Schensted algorithm with +column insertion. Like the +[[http://en.wikipedia.org/wiki/Robinson–Schensted_correspondence][usual +RS correspondence]] with column insertion, this algorithm could take +words as input. Unlike the usual RS algorithm, the output is a set of +weighted pairs of semistandard and standard Young tableaux \((P,Q)\) +with the same shape. The weights are rational functions of indeterminant +\(q\). + +If \(q\in[0,1]\), the algorithm can be considered as a randomised RS +algorithm, with 0 and 1 being two interesting cases. When \(q\to0\), it +is reduced to the latter usual RS algorithm; while when \(q\to1\) with +proper scaling it should scale to directed random polymer model in +[[http://arxiv.org/abs/0910.0069][(O'Connell 2012)]]. When the input +word \(w\) is a random walk: + +\begin{align*}\mathbb +P(w=v)=\prod_{i=1}^na_{v_i},\qquad\sum_ja_j=1\end{align*} + +the shape of output evolves as a Markov chain with kernel related to +\(q\)-Whittaker functions, which are Macdonald functions when \(t=0\) +with a factor. diff --git a/posts/2014-04-01-q-robinson-schensted-symmetry-paper.org b/posts/2014-04-01-q-robinson-schensted-symmetry-paper.org new file mode 100644 index 0000000..b1c967d --- /dev/null +++ b/posts/2014-04-01-q-robinson-schensted-symmetry-paper.org @@ -0,0 +1,16 @@ +#+title: Symmetry property of \(q\)-weighted Robinson-Schensted algorithms and branching algorithms +#+date: <2014-04-01> + +In [[http://link.springer.com/article/10.1007/s10801-014-0505-x][this +paper]] a symmetry property analogous to the well known symmetry +property of the normal Robinson-Schensted algorithm has been shown for +the \(q\)-weighted Robinson-Schensted algorithm. The proof uses a +generalisation of the growth diagram approach introduced by Fomin. This +approach, which uses "growth graphs", can also be applied to a wider +class of insertion algorithms which have a branching structure. + +#+caption: Growth graph of q-RS for 1423 +[[../assets/resources/1423graph.jpg]] + +Above is the growth graph of the \(q\)-weighted Robinson-Schensted +algorithm for the permutation \({1 2 3 4\choose1 4 2 3}\). diff --git a/posts/2015-01-20-weighted-interpretation-super-catalan-numbers.org b/posts/2015-01-20-weighted-interpretation-super-catalan-numbers.org new file mode 100644 index 0000000..9cde382 --- /dev/null +++ b/posts/2015-01-20-weighted-interpretation-super-catalan-numbers.org @@ -0,0 +1,39 @@ +#+title: AMS review of 'A weighted interpretation for the super Catalan +numbers' by Allen and Gheorghiciuc + +#+date: <2015-01-20> + +The super Catalan numbers are defined as $$ T(m,n) = {(2 m)! (2 n)! +\over 2 m! n! (m + n)!}. $$ + +   This paper has two main results. First a combinatorial interpretation +of the super Catalan numbers is given: $$ T(m,n) = P(m,n) - N(m,n) $$ +where \(P(m,n)\) enumerates the number of 2-Motzkin paths whose \(m\) +-th step begins at an even level (called \(m\)-positive paths) and +\(N(m,n)\) those with \(m\)-th step beginning at an odd level +(\(m\)-negative paths). The proof uses a recursive argument on the +number of \(m\)-positive and -negative paths, based on a recursion of +the super Catalan numbers appearing in [I. M. Gessel, J. Symbolic +Comput. *14* (1992), no. 2-3, 179--194; +[[http://www.ams.org/mathscinet/search/publdoc.html?r=1&pg1=MR&s1=1187230&loc=fromrevtext][MR1187230]]]: +$$ 4T(m,n) = T(m+1, n) + T(m, n+1). $$ This result gives an expression +for the super Catalan numbers in terms of numbers counting the so-called +ballot paths. The latter sometimes are also referred to as the +generalised Catalan numbers forming the entries of the Catalan triangle. + +   Based on the first result, the second result is a combinatorial +interpretation of the super Catalan numbers \(T(2,n)\) in terms of +counting certain Dyck paths. This is equivalent to a theorem, which +represents \(T(2,n)\) as counting of certain pairs of Dyck paths, in [I. +M. Gessel and G. Xin, J. Integer Seq. *8* (2005), no. 2, Article 05.2.3, +13 pp.; +[[http://www.ams.org/mathscinet/search/publdoc.html?r=1&pg1=MR&s1=2134162&loc=fromrevtext][MR2134162]]], +and the equivalence is explained at the end of the paper by a bijection +between the Dyck paths and the pairs of Dyck paths. The proof of the +theorem itself is also done by constructing two bijections between Dyck +paths satisfying certain conditions. All the three bijections are +formulated by locating, removing and adding steps. + +Copyright notice: This review is published at +http://www.ams.org/mathscinet-getitem?mr=3275875, its copyright owned by +the AMS. diff --git a/posts/2015-04-01-unitary-double-products.org b/posts/2015-04-01-unitary-double-products.org new file mode 100644 index 0000000..d545b3a --- /dev/null +++ b/posts/2015-04-01-unitary-double-products.org @@ -0,0 +1,10 @@ +#+title: Unitary causal quantum stochastic double products as universal +interactions I + +#+date: <2015-04-01> + +In +[[http://www.actaphys.uj.edu.pl/findarticle?series=Reg&vol=46&page=1851][this +paper]] with [[http://homepages.lboro.ac.uk/~marh3/][Robin]] we show the +explicit formulae for a family of unitary triangular and rectangular +double product integrals which can be described as second quantisations. diff --git a/posts/2015-04-02-juggling-skill-tree.org b/posts/2015-04-02-juggling-skill-tree.org new file mode 100644 index 0000000..79b35ad --- /dev/null +++ b/posts/2015-04-02-juggling-skill-tree.org @@ -0,0 +1,28 @@ +#+title: jst + +#+date: <2015-04-02> + +jst = juggling skill tree + +If you have ever played a computer role playing game, you may have +noticed the protagonist sometimes has a skill "tree" (most of the time +it is actually a directed acyclic graph), where certain skills leads to +others. For example, +[[http://hydra-media.cursecdn.com/diablo.gamepedia.com/3/37/Sorceress_Skill_Trees_%28Diablo_II%29.png?version=b74b3d4097ef7ad4e26ebee0dcf33d01][here]] +is the skill tree of sorceress in +[[https://en.wikipedia.org/wiki/Diablo_II][Diablo II]]. + +Now suppose our hero embarks on a quest for learning all the possible +juggling patterns. Everyone would agree she should start with cascade, +the simplest nontrivial 3-ball pattern, but what afterwards? A few other +accessible patterns for beginners are juggler's tennis, two in one and +even reverse cascade, but what to learn after that? The encyclopeadic +[[http://libraryofjuggling.com/][Library of Juggling]] serves as a good +guide, as it records more than 160 patterns, some of which very +aesthetically appealing. On this website almost all the patterns have a +"prerequisite" section, indicating what one should learn beforehand. I +have therefore written a script using [[http://python.org][Python]], +[[http://www.crummy.com/software/BeautifulSoup/][BeautifulSoup]] and +[[http://pygraphviz.github.io/][pygraphviz]] to generate a jst (graded +by difficulties, which is the leftmost column) from the Library of +Juggling (click the image for the full size): diff --git a/posts/2015-05-30-infinite-binary-words-containing-repetitions-odd-periods.org b/posts/2015-05-30-infinite-binary-words-containing-repetitions-odd-periods.org new file mode 100644 index 0000000..b632c03 --- /dev/null +++ b/posts/2015-05-30-infinite-binary-words-containing-repetitions-odd-periods.org @@ -0,0 +1,67 @@ +#+title: AMS review of 'Infinite binary words containing repetitions of +#+title: odd period' by Badkobeh and Crochemore + +#+date: <2015-05-30> + +This paper is about the existence of pattern-avoiding infinite binary +words, where the patterns are squares, cubes and \(3^+\)-powers.    +There are mainly two kinds of results, positive (existence of an +infinite binary word avoiding a certain pattern) and negative +(non-existence of such a word). Each positive result is proved by the +construction of a word with finitely many squares and cubes which are +listed explicitly. First a synchronising (also known as comma-free) +uniform morphism \(g\: \Sigma_3^* \to \Sigma_2^*\) + +is constructed. Then an argument is given to show that the length of +squares in the code \(g(w)\) for a squarefree \(w\) is bounded, hence +all the squares can be obtained by examining all \(g(s)\) for \(s\) of +bounded lengths. The argument resembles that of the proof of, e.g., +Theorem 1, Lemma 2, Theorem 3 and Lemma 4 in [N. Rampersad, J. O. +Shallit and M. Wang, Theoret. Comput. Sci. *339* (2005), no. 1, 19--34; +[[http://www.ams.org/mathscinet/search/publdoc.html?r=1&pg1=MR&s1=2142071&loc=fromrevtext][MR2142071]]]. +The negative results are proved by traversing all possible finite words +satisfying the conditions. + +   Let \(L(n_2, n_3, S)\) be the maximum length of a word with \(n_2\) +distinct squares, \(n_3\) distinct cubes and that the periods of the +squares can take values only in \(S\) , where \(n_2, n_3 \in \Bbb N \cup +\{\infty, \omega\}\) and \(S \subset \Bbb N_+\) .    \(n_k = 0\) +corresponds to \(k\)-free, \(n_k = \infty\) means no restriction on the +number of distinct \(k\)-powers, and \(n_k = \omega\) means +\(k^+\)-free. + +   Below is the summary of the positive and negative results: + +1) (Negative) \(L(\infty, \omega, 2 \Bbb N) < \infty\) : \(\nexists\) an + infinite \(3^+\) -free binary word avoiding all squares of odd + periods. (Proposition 1) + +2) (Negative) \(L(\infty, 0, 2 \Bbb N + 1) \le 23\) : \(\nexists\) an + infinite 3-free binary word, avoiding squares of even periods. The + longest one has length \(\le 23\) (Proposition 2). + +3) (Positive) \(L(\infty, \omega, 2 \Bbb N + + + 1) - = \infty\) :: \(\exists\) an infinite \(3^+\) -free binary word + avoiding squares of even periods (Theorem 1). + +4) (Positive) \(L(\infty, \omega, \{1, 3\}) = \infty\) : \(\exists\) an + infinite \(3^+\) -free binary word containing only squares of period + 1 or 3 (Theorem 2). + +5) (Negative) \(L(6, 1, 2 \Bbb N + 1) = 57\) : \(\nexists\) an infinite + binary word avoiding squares of even period containing \(< 7\) + squares and \(< 2\) cubes. The longest one containing 6 squares and 1 + cube has length 57 (Proposition 6). + +6) (Positive) \(L(7, 1, 2 \Bbb N + 1) = \infty\) : \(\exists\) an + infinite \(3^+\) -free binary word avoiding squares of even period + with 1 cube and 7 squares (Theorem 3). + +7) (Positive) \(L(4, 2, 2 \Bbb N + 1) = \infty\) : \(\exists\) an + infinite \(3^+\) -free binary words avoiding squares of even period + and containing 2 cubes and 4 squares (Theorem 4). + +Copyright notice: This review is published at +http://www.ams.org/mathscinet-getitem?mr=3313467, its copyright owned by +the AMS. diff --git a/posts/2015-07-01-causal-quantum-product-levy-area.org b/posts/2015-07-01-causal-quantum-product-levy-area.org new file mode 100644 index 0000000..528b9b7 --- /dev/null +++ b/posts/2015-07-01-causal-quantum-product-levy-area.org @@ -0,0 +1,26 @@ +#+title: On a causal quantum double product integral related to Lévy +#+title: stochastic area. + +#+date: <2015-07-01> + +In [[https://arxiv.org/abs/1506.04294][this paper]] with +[[http://homepages.lboro.ac.uk/~marh3/][Robin]] we study the family of +causal double product integrals \[ \prod_{a < x < y < b}\left(1 + +i{\lambda \over 2}(dP_x dQ_y - dQ_x dP_y) + i {\mu \over 2}(dP_x dP_y + +dQ_x dQ_y)\right) \] + +where $P$ and $Q$ are the mutually noncommuting momentum and position +Brownian motions of quantum stochastic calculus. The evaluation is +motivated heuristically by approximating the continuous double product +by a discrete product in which infinitesimals are replaced by finite +increments. The latter is in turn approximated by the second +quantisation of a discrete double product of rotation-like operators in +different planes due to a result in +[[http://www.actaphys.uj.edu.pl/findarticle?series=Reg&vol=46&page=1851][(Hudson-Pei2015)]]. +The main problem solved in this paper is the explicit evaluation of the +continuum limit $W$ of the latter, and showing that $W$ is a unitary +operator. The kernel of $W$ is written in terms of Bessel functions, and +the evaluation is achieved by working on a lattice path model and +enumerating linear extensions of related partial orderings, where the +enumeration turns out to be heavily related to Dyck paths and +generalisations of Catalan numbers. diff --git a/posts/2015-07-15-double-macdonald-polynomials-macdonald-superpolynomials.org b/posts/2015-07-15-double-macdonald-polynomials-macdonald-superpolynomials.org new file mode 100644 index 0000000..cda6967 --- /dev/null +++ b/posts/2015-07-15-double-macdonald-polynomials-macdonald-superpolynomials.org @@ -0,0 +1,64 @@ +#+title: AMS review of 'Double Macdonald polynomials as the stable limit +#+title: of Macdonald superpolynomials' by Blondeau-Fournier, Lapointe and +#+title: Mathieu + +#+date: <2015-07-15> + +A Macdonald superpolynomial (introduced in [O. Blondeau-Fournier et al., +Lett. Math. Phys. 101 (2012), no. 1, 27--47; +[[http://www.ams.org/mathscinet/search/publdoc.html?pg1=MR&s1=2935476&loc=fromrevtext][MR2935476]]; +J. Comb. 3 (2012), no. 3, 495--561; +[[http://www.ams.org/mathscinet/search/publdoc.html?pg1=MR&s1=3029444&loc=fromrevtext][MR3029444]]]) +in \(N\) Grassmannian variables indexed by a superpartition \(\Lambda\) +is said to be stable if \({m (m + 1) \over 2} \ge |\Lambda|\) and \(N +\ge |\Lambda| - {m (m - 3) \over 2}\) , where \(m\) is the fermionic +degree. A stable Macdonald superpolynomial (corresponding to a +bisymmetric polynomial) is also called a double Macdonald polynomial +(dMp). The main result of this paper is the factorisation of a dMp into +plethysms of two classical Macdonald polynomials (Theorem 5). Based on +this result, this paper + +1) shows that the dMp has a unique decomposition into bisymmetric + monomials; + +2) calculates the norm of the dMp; + +3) calculates the kernel of the Cauchy-Littlewood-type identity of the + dMp; + +4) shows the specialisation of the aforementioned factorisation to the + Jack, Hall-Littlewood and Schur cases. One of the three Schur + specialisations, denoted as \(s_{\lambda, \mu}\), also appears in (7) + and (9) below; + +5) defines the \(\omega\) -automorphism in this setting, which was used + to prove an identity involving products of four Littlewood-Richardson + coefficients; + +6) shows an explicit evaluation of the dMp motivated by the most general + evaluation of the usual Macdonald polynomials; + +7) relates dMps with the representation theory of the hyperoctahedral + group \(B_n\) via the double Kostka coefficients (which are defined + as the entries of the transition matrix from the bisymmetric Schur + functions \(s_{\lambda, \mu}\) to the modified dMps); + +8) shows that the double Kostka coefficients have the positivity and the + symmetry property, and can be written as sums of products of the + usual Kostka coefficients; + +9) defines an operator \(\nabla^B\) as an analogue of the nabla operator + \(\nabla\) introduced in [F. Bergeron and A. M. Garsia, in /Algebraic + methods and \(q\)-special functions/ (Montréal, QC, 1996), 1--52, CRM + Proc. Lecture Notes, 22, Amer. Math. Soc., Providence, RI, 1999; + [[http://www.ams.org/mathscinet/search/publdoc.html?r=1&pg1=MR&s1=1726826&loc=fromrevtext][MR1726826]]]. + The action of \(\nabla^B\) on the bisymmetric Schur function + \(s_{\lambda, \mu}\) yields the dimension formula \((h + 1)^r\) for + the corresponding representation of \(B_n\) , where \(h\) and \(r\) + are the Coxeter number and the rank of \(B_n\) , in the same way that + the action of \(\nabla\) on the \(n\) th elementary symmetric + function leads to the same formula for the group of type \(A_n\) . + +Copyright notice: This review is published at +http://www.ams.org/mathscinet-getitem?mr=3306078, its copyright owned by +the AMS. diff --git a/posts/2016-10-13-q-robinson-schensted-knuth-polymer.org b/posts/2016-10-13-q-robinson-schensted-knuth-polymer.org new file mode 100644 index 0000000..93da639 --- /dev/null +++ b/posts/2016-10-13-q-robinson-schensted-knuth-polymer.org @@ -0,0 +1,50 @@ +#+title: A \(q\)-Robinson-Schensted-Knuth algorithm and a \(q\)-polymer + +#+date: <2016-10-13> + +(Latest update: 2017-01-12) In +[[http://arxiv.org/abs/1504.00666][Matveev-Petrov 2016]] a +\(q\)-deformed Robinson-Schensted-Knuth algorithm (\(q\)RSK) was +introduced. In this article we give reformulations of this algorithm in +terms of Noumi-Yamada description, growth diagrams and local moves. We +show that the algorithm is symmetric, namely the output tableaux pair +are swapped in a sense of distribution when the input matrix is +transposed. We also formulate a \(q\)-polymer model based on the +\(q\)RSK and prove the corresponding Burke property, which we use to +show a strong law of large numbers for the partition function given +stationary boundary conditions and \(q\)-geometric weights. We use the +\(q\)-local moves to define a generalisation of the \(q\)RSK taking a +Young diagram-shape of array as the input. We write down the joint +distribution of partition functions in the space-like direction of the +\(q\)-polymer in \(q\)-geometric environment, formulate a \(q\)-version +of the multilayer polynuclear growth model (\(q\)PNG) and write down the +joint distribution of the \(q\)-polymer partition functions at a fixed +time. + +This article is available at +[[https://arxiv.org/abs/1610.03692][arXiv]]. It seems to me that one +difference between arXiv and Github is that on arXiv each preprint has a +few versions only. In Github many projects have a "dev" branch hosting +continuous updates, whereas the master branch is where the stable +releases live. + +[[file:%7B%7B%20site.url%20%7D%7D/assets/resources/qrsklatest.pdf][Here]] +is a "dev" version of the article, which I shall push to arXiv when it +stablises. Below is the changelog. + +- 2017-01-12: Typos and grammar, arXiv v2. +- 2016-12-20: Added remarks on the geometric \(q\)-pushTASEP. Added + remarks on the converse of the Burke property. Added natural language + description of the \(q\)RSK. Fixed typos. +- 2016-11-13: Fixed some typos in the proof of Theorem 3. +- 2016-11-07: Fixed some typos. The \(q\)-Burke property is now stated + in a more symmetric way, so is the law of large numbers Theorem 2. +- 2016-10-20: Fixed a few typos. Updated some references. Added a + reference: [[http://web.mit.edu/~shopkins/docs/rsk.pdf][a set of notes + titled "RSK via local transformations"]]. It is written by + [[http://web.mit.edu/~shopkins/][Sam Hopkins]] in 2014 as an + expository article based on MIT combinatorics preseminar presentations + of Alex Postnikov. It contains some idea (applying local moves to a + general Young-diagram shaped array in the order that matches any + growth sequence of the underlying Young diagram) which I thought I was + the first one to write down. diff --git a/posts/2017-04-25-open_research_toywiki.org b/posts/2017-04-25-open_research_toywiki.org new file mode 100644 index 0000000..1e672b0 --- /dev/null +++ b/posts/2017-04-25-open_research_toywiki.org @@ -0,0 +1,21 @@ +#+title: Open mathematical research and launching toywiki + +#+date: <2017-04-25> + +As an experimental project, I am launching toywiki. + +It hosts a collection of my research notes. + +It takes some ideas from the open source culture and apply them to +mathematical research: 1. It uses a very permissive license (CC-BY-SA). +For example anyone can fork the project and make their own version if +they have a different vision and want to build upon the project. 2. All +edits will done with maximum transparency, and discussions of any of +notes should also be as public as possible (e.g. Github issues) 3. +Anyone can suggest changes by opening issues and submitting pull +requests + +Here are the links: [[http://toywiki.xyz][toywiki]] and +[[https://github.com/ycpei/toywiki][github repo]]. + +Feedbacks are welcome by email. diff --git a/posts/2017-08-07-mathematical_bazaar.org b/posts/2017-08-07-mathematical_bazaar.org new file mode 100644 index 0000000..64bf335 --- /dev/null +++ b/posts/2017-08-07-mathematical_bazaar.org @@ -0,0 +1,213 @@ +#+title: The Mathematical Bazaar + +#+date: <2017-08-07> + +In this essay I describe some problems in academia of mathematics and +propose an open source model, which I call open research in mathematics. + +This essay is a work in progress - comments and criticisms are +welcome! [fn:1] + +Before I start I should point out that + +1. Open research is /not/ open access. In fact the latter is a + prerequisite to the former. +2. I am not proposing to replace the current academic model with the + open model - I know academia works well for many people and I am + happy for them, but I think an open research community is long + overdue since the wide adoption of the World Wide Web more than two + decades ago. In fact, I fail to see why an open model can not run in + tandem with the academia, just like open source and closed source + software development coexist today. + +** problems of academia + :PROPERTIES: + :CUSTOM_ID: problems-of-academia + :END: +Open source projects are characterised by publicly available source +codes as well as open invitations for public collaborations, whereas +closed source projects do not make source codes accessible to the +public. How about mathematical academia then, is it open source or +closed source? The answer is neither. + +Compared to some other scientific disciplines, mathematics does not +require expensive equipments or resources to replicate results; compared +to programming in conventional software industry, mathematical findings +are not meant to be commercial, as credits and reputation rather than +money are the direct incentives (even though the former are commonly +used to trade for the latter). It is also a custom and common belief +that mathematical derivations and theorems shouldn't be patented. +Because of this, mathematical research is an open source activity in the +sense that proofs to new results are all available in papers, and thanks +to open access e.g. the arXiv preprint repository most of the new +mathematical knowledge is accessible for free. + +Then why, you may ask, do I claim that maths research is not open +sourced? Well, this is because 1. mathematical arguments are not easily +replicable and 2. mathematical research projects are mostly not open for +public participation. + +Compared to computer programs, mathematical arguments are not written in +an unambiguous language, and they are terse and not written in maximum +verbosity (this is especially true in research papers as journals +encourage limiting the length of submissions), so the understanding of a +proof depends on whether the reader is equipped with the right +background knowledge, and the completeness of a proof is highly +subjective. More generally speaking, computer programs are mostly +portable because all machines with the correct configurations can +understand and execute a piece of program, whereas humans are subject to +their environment, upbringings, resources etc. to have a brain ready to +comprehend a proof that interests them. (these barriers are softer than +the expensive equipments and resources in other scientific fields +mentioned before because it is all about having access to the right +information) + +On the other hand, as far as the pursuit of reputation and prestige +(which can be used to trade for the scarce resource of research +positions and grant money) goes, there is often little practical +motivation for career mathematicians to explain their results to the +public carefully. And so the weird reality of the mathematical academia +is that it is not an uncommon practice to keep trade secrets in order to +protect one's territory and maintain a monopoly. This is doable because +as long as a paper passes the opaque and sometimes political peer review +process and is accepted by a journal, it is considered work done, +accepted by the whole academic community and adds to the reputation of +the author(s). Just like in the software industry, trade secrets and +monopoly hinder the development of research as a whole, as well as +demoralise outsiders who are interested in participating in related +research. + +Apart from trade secrets and territoriality, another reason to the +nonexistence of open research community is an elitist tradition in the +mathematical academia, which goes as follows: + +- Whoever is not good at mathematics or does not possess a degree in + maths is not eligible to do research, or else they run high risks of + being labelled a crackpot. +- Mistakes made by established mathematicians are more tolerable than + those less established. +- Good mathematical writings should be deep, and expositions of + non-original results are viewed as inferior work and do not add to + (and in some cases may even damage) one's reputation. + +All these customs potentially discourage public participations in +mathematical research, and I do not see them easily go away unless an +open source community gains momentum. + +To solve the above problems, I propose a open source model of +mathematical research, which has high levels of openness and +transparency and also has some added benefits listed in the last section +of this essay. This model tries to achieve two major goals: + +- Open and public discussions and collaborations of mathematical + research projects online +- Open review to validate results, where author name, reviewer name, + comments and responses are all publicly available online. + +To this end, a Github model is fitting. Let me first describe how open +source collaboration works on Github. + +** open source collaborations on Github + :PROPERTIES: + :CUSTOM_ID: open-source-collaborations-on-github + :END: +On [[https://github.com][Github]], every project is publicly available +in a repository (we do not consider private repos). The owner can update +the project by "committing" changes, which include a message of what has +been changed, the author of the changes and a timestamp. Each project +has an issue tracker, which is basically a discussion forum about the +project, where anyone can open an issue (start a discussion), and the +owner of the project as well as the original poster of the issue can +close it if it is resolved, e.g. bug fixed, feature added, or out of the +scope of the project. Closing the issue is like ending the discussion, +except that the thread is still open to more posts for anyone +interested. People can react to each issue post, e.g. upvote, downvote, +celebration, and importantly, all the reactions are public too, so you +can find out who upvoted or downvoted your post. + +When one is interested in contributing code to a project, they fork it, +i.e. make a copy of the project, and make the changes they like in the +fork. Once they are happy with the changes, they submit a pull request +to the original project. The owner of the original project may accept or +reject the request, and they can comment on the code in the pull +request, asking for clarification, pointing out problematic part of the +code etc and the author of the pull request can respond to the comments. +Anyone, not just the owner can participate in this review process, +turning it into a public discussion. In fact, a pull request is a +special issue thread. Once the owner is happy with the pull request, +they accept it and the changes are merged into the original project. The +author of the changes will show up in the commit history of the original +project, so they get the credits. + +As an alternative to forking, if one is interested in a project but has +a different vision, or that the maintainer has stopped working on it, +they can clone it and make their own version. This is a more independent +kind of fork because there is no longer intention to contribute back to +the original project. + +Moreover, on Github there is no way to send private messages, which +forces people to interact publicly. If say you want someone to see and +reply to your comment in an issue post or pull request, you simply +mention them by =@someone=. + +** open research in mathematics + :PROPERTIES: + :CUSTOM_ID: open-research-in-mathematics + :END: +All this points to a promising direction of open research. A maths +project may have a wiki / collection of notes, the paper being written, +computer programs implementing the results etc. The issue tracker can +serve as a discussion forum about the project as well as a platform for +open review (bugs are analogous to mistakes, enhancements are possible +ways of improving the main results etc.), and anyone can make their own +version of the project, and (optionally) contribute back by making pull +requests, which will also be openly reviewed. One may want to add an +extra "review this project" functionality, so that people can comment on +the original project like they do in a pull request. This may or may not +be necessary, as anyone can make comments or point out mistakes in the +issue tracker. + +One may doubt this model due to concerns of credits because work in +progress is available to anyone. Well, since all the contributions are +trackable in project commit history and public discussions in issues and +pull request reviews, there is in fact /less/ room for cheating than the +current model in academia, where scooping can happen without any +witnesses. What we need is a platform with a good amount of trust like +arXiv, so that the open research community honours (and can not ignore) +the commit history, and the chance of mis-attribution can be reduced to +minimum. + +Compared to the academic model, open research also has the following +advantages: + +- Anyone in the world with Internet access will have a chance to + participate in research, whether they are affiliated to a university, + have the financial means to attend conferences, or are colleagues of + one of the handful experts in a specific field. +- The problem of replicating / understanding maths results will be + solved, as people help each other out. This will also remove the + burden of answering queries about one's research. For example, say one + has a project "Understanding the fancy results in [paper name]", they + write up some initial notes but get stuck understanding certain + arguments. In this case they can simply post the questions on the + issue tracker, and anyone who knows the answer, or just has a + speculation can participate in the discussion. In the end the problem + may be resolved without the authors of the paper being bothered, who + may be too busy to answer. +- Similarly, the burden of peer review can also be shifted from a few + appointed reviewers to the crowd. + +** related readings + :PROPERTIES: + :CUSTOM_ID: related-readings + :END: + +- [[http://www.catb.org/esr/writings/cathedral-bazaar/][The Cathedral + and the Bazaar by Eric Raymond]] +- [[http://michaelnielsen.org/blog/doing-science-online/][Doing sience + online by Michael Nielson]] +- [[https://gowers.wordpress.com/2009/01/27/is-massively-collaborative-mathematics-possible/][Is + massively collaborative mathematics possible? by Timothy Gowers]] + +[fn:1] Please send your comments to my email address - I am still + looking for ways to add a comment functionality to this website. diff --git a/posts/2018-04-10-update-open-research.org b/posts/2018-04-10-update-open-research.org new file mode 100644 index 0000000..4b078d5 --- /dev/null +++ b/posts/2018-04-10-update-open-research.org @@ -0,0 +1,185 @@ +#+title: Updates on open research + +#+date: <2018-04-29> + +It has been 9 months since I last wrote about open (maths) research. +Since then two things happened which prompted me to write an update. + +As always I discuss open research only in mathematics, not because I +think it should not be applied to other disciplines, but simply because +I do not have experience nor sufficient interests in non-mathematical +subjects. + +First, I read about Richard Stallman the founder of the free software +movement, in [[http://shop.oreilly.com/product/9780596002879.do][his +biography by Sam Williams]] and his own collection of essays +[[https://shop.fsf.org/books-docs/free-software-free-society-selected-essays-richard-m-stallman-3rd-edition][/Free +software, free society/]], from which I learned a bit more about the +context and philosophy of free software and its relation to that of open +source software. For anyone interested in open research, I highly +recommend having a look at these two books. I am also reading Levy's +[[http://www.stevenlevy.com/index.php/books/hackers][Hackers]], which +documented the development of the hacker culture predating Stallman. I +can see the connection of ideas from the hacker ethic to the free +software philosophy and to the open source philosophy. My guess is that +the software world is fortunate to have pioneers who advocated for +various kinds of freedom and openness from the beginning, whereas for +academia which has a much longer history, credit protection has always +been a bigger concern. + +Also a month ago I attended a workshop called +[[https://www.perimeterinstitute.ca/conferences/open-research-rethinking-scientific-collaboration][Open +research: rethinking scientific collaboration]]. That was the first time +I met a group of people (mostly physicists) who also want open research +to happen, and we had some stimulating discussions. Many thanks to the +organisers at Perimeter Institute for organising the event, and special +thanks to +[[https://www.perimeterinstitute.ca/people/matteo-smerlak][Matteo +Smerlak]] and +[[https://www.perimeterinstitute.ca/people/ashley-milsted][Ashley +Milsted]] for invitation and hosting. + +From both of these I feel like I should write an updated post on open +research. + +*** Freedom and community + :PROPERTIES: + :CUSTOM_ID: freedom-and-community + :END: +Ideals matter. Stallman's struggles stemmed from the frustration of +denied request of source code (a frustration I shared in academia except +source code is replaced by maths knowledge), and revolved around two +things that underlie the free software movement: freedom and community. +That is, the freedom to use, modify and share a work, and by sharing, to +help the community. + +Likewise, as for open research, apart from the utilitarian view that +open research is more efficient / harder for credit theft, we should not +ignore the ethical aspect that open research is right and fair. In +particular, I think freedom and community can also serve as principles +in open research. One way to make this argument more concrete is to +describe what I feel are the central problems: NDAs (non-disclosure +agreements) and reproducibility. + +*NDAs*. It is assumed that when establishing a research collaboration, +or just having a discussion, all those involved own the joint work in +progress, and no one has the freedom to disclose any information +e.g. intermediate results without getting permission from all +collaborators. In effect this amounts to signing an NDA. NDAs are +harmful because they restrict people's freedom from sharing information +that can benefit their own or others' research. Considering that in +contrast to the private sector, the primary goal of academia is +knowledge but not profit, NDAs in research are unacceptable. + +*Reproducibility*. Research papers written down are not necessarily +reproducible, even though they appear on peer-reviewed journals. This is +because the peer-review process is opaque and the proofs in the papers +may not be clear to everyone. To make things worse, there are no open +channels to discuss results in these papers and one may have to rely on +interacting with the small circle of the informed. One example is folk +theorems. Another is trade secrets required to decipher published works. + +I should clarify that freedom works both ways. One should have the +freedom to disclose maths knowledge, but they should also be free to +withhold any information that does not hamper the reproducibility of +published works (e.g. results in ongoing research yet to be published), +even though it may not be nice to do so when such information can help +others with their research. + +Similar to the solution offered by the free software movement, we need a +community that promotes and respects free flow of maths knowledge, in +the spirit of the [[https://www.gnu.org/philosophy/][four essential +freedoms]], a community that rejects NDAs and upholds reproducibility. + +Here are some ideas on how to tackle these two problems and build the +community: + +1. Free licensing. It solves NDA problem - free licenses permit + redistribution and modification of works, so if you adopt them in + your joint work, then you have the freedom to modify and distribute + the work; it also helps with reproducibility - if a paper is not + clear, anyone can write their own version and publish it. Bonus + points with the use of copyleft licenses like + [[https://creativecommons.org/licenses/by-sa/4.0/][Creative Commons + Share-Alike]] or the [[https://www.gnu.org/licenses/fdl.html][GNU + Free Documentation License]]. +2. A forum for discussions of mathematics. It helps solve the + reproducibility problem - public interaction may help quickly clarify + problems. By the way, Math Overflow is not a forum. +3. An infrastructure of mathematical knowledge. Like the GNU system, a + mathematics encyclopedia under a copyleft license maintained in the + Github-style rather than Wikipedia-style by a "Free Mathematics + Foundation", and drawing contributions from the public (inside or + outside of the academia). To begin with, crowd-source (again, + Github-style) the proofs of say 1000 foundational theorems covered in + the curriculum of a bachelor's degree. Perhaps start with taking + contributions from people with some credentials (e.g. having a + bachelor degree in maths) and then expand the contribution permission + to the public, or taking advantage of existing corpus under free + license like Wikipedia. +4. Citing with care: if a work is considered authorative but you + couldn't reproduce the results, whereas another paper which tries to + explain or discuss similar results makes the first paper + understandable to you, give both papers due attribution (something + like: see [1], but I couldn't reproduce the proof in [1], and the + proofs in [2] helped clarify it). No one should be offended if you + say you can not reproduce something - there may be causes on both + sides, whereas citing [2] is fairer and helps readers with a similar + background. + +*** Tools for open research + :PROPERTIES: + :CUSTOM_ID: tools-for-open-research + :END: +The open research workshop revolved around how to lead academia towards +a more open culture. There were discussions on open research tools, +improving credit attributions, the peer-review process and the path to +adoption. + +During the workshop many efforts for open research were mentioned, and +afterwards I was also made aware by more of them, like the following: + +- [[https://osf.io][OSF]], an online research platform. It has a clean + and simple interface with commenting, wiki, citation generation, DOI + generation, tags, license generation etc. Like Github it supports + private and public repositories (but defaults to private), version + control, with the ability to fork or bookmark a project. +- [[https://scipost.org/][SciPost]], physics journals whose peer review + reports and responses are public (peer-witnessed refereeing), and + allows comments (post-publication evaluation). Like arXiv, it requires + some academic credential (PhD or above) to register. +- [[https://knowen.org/][Knowen]], a platform to organise knowledge in + directed acyclic graphs. Could be useful for building the + infrastructure of mathematical knowledge. +- [[https://fermatslibrary.com/][Fermat's Library]], the journal club + website that crowd-annotates one notable paper per week released a + Chrome extension [[https://fermatslibrary.com/librarian][Librarian]] + that overlays a commenting interface on arXiv. As an example Ian + Goodfellow did an + [[https://fermatslibrary.com/arxiv_comments?url=https://arxiv.org/pdf/1406.2661.pdf][AMA + (ask me anything) on his GAN paper]]. +- [[https://polymathprojects.org/][The Polymath project]], the famous + massive collaborative mathematical project. Not exactly new, the + Polymath project is the only open maths research project that has + gained some traction and recognition. However, it does not have many + active projects + ([[http://michaelnielsen.org/polymath1/index.php?title=Main_Page][currently + only one active project]]). +- [[https://stacks.math.columbia.edu/][The Stacks Project]]. I was made + aware of this project by [[https://people.kth.se/~yitingl/][Yiting]]. + Its data is hosted on github and accepts contributions via pull + requests and is licensed under the GNU Free Documentation License, + ticking many boxes of the free and open source model. + +*** An anecdote from the workshop + :PROPERTIES: + :CUSTOM_ID: an-anecdote-from-the-workshop + :END: +In a conversation during the workshop, one of the participants called +open science "normal science", because reproducibility, open access, +collaborations, and fair attributions are all what science is supposed +to be, and practices like treating the readers as buyers rather than +users should be called "bad science", rather than "closed science". + +To which an organiser replied: maybe we should rename the workshop +"Not-bad science". diff --git a/posts/2018-06-03-automatic_differentiation.org b/posts/2018-06-03-automatic_differentiation.org new file mode 100644 index 0000000..cebcf8c --- /dev/null +++ b/posts/2018-06-03-automatic_differentiation.org @@ -0,0 +1,100 @@ +#+title: Automatic differentiation + +#+date: <2018-06-03> + +This post serves as a note and explainer of autodiff. It is licensed +under [[https://www.gnu.org/licenses/fdl.html][GNU FDL]]. + +For my learning I benefited a lot from +[[http://www.cs.toronto.edu/%7Ergrosse/courses/csc321_2018/slides/lec10.pdf][Toronto +CSC321 slides]] and the +[[https://github.com/mattjj/autodidact/][autodidact]] project which is a +pedagogical implementation of +[[https://github.com/hips/autograd][Autograd]]. That said, any mistakes +in this note are mine (especially since some of the knowledge is +obtained from interpreting slides!), and if you do spot any I would be +grateful if you can let me know. + +Automatic differentiation (AD) is a way to compute derivatives. It does +so by traversing through a computational graph using the chain rule. + +There are the forward mode AD and reverse mode AD, which are kind of +symmetric to each other and understanding one of them results in little +to no difficulty in understanding the other. + +In the language of neural networks, one can say that the forward mode AD +is used when one wants to compute the derivatives of functions at all +layers with respect to input layer weights, whereas the reverse mode AD +is used to compute the derivatives of output functions with respect to +weights at all layers. Therefore reverse mode AD (rmAD) is the one to +use for gradient descent, which is the one we focus in this post. + +Basically rmAD requires the computation to be sufficiently decomposed, +so that in the computational graph, each node as a function of its +parent nodes is an elementary function that the AD engine has knowledge +about. + +For example, the Sigmoid activation $a' = \sigma(w a + b)$ is quite +simple, but it should be decomposed to simpler computations: + +- $a' = 1 / t_1$ +- $t_1 = 1 + t_2$ +- $t_2 = \exp(t_3)$ +- $t_3 = - t_4$ +- $t_4 = t_5 + b$ +- $t_5 = w a$ + +Thus the function $a'(a)$ is decomposed to elementary operations like +addition, subtraction, multiplication, reciprocitation, exponentiation, +logarithm etc. And the rmAD engine stores the Jacobian of these +elementary operations. + +Since in neural networks we want to find derivatives of a single loss +function $L(x; \theta)$, we can omit $L$ when writing derivatives and +denote, say $\bar \theta_k := \partial_{\theta_k} L$. + +In implementations of rmAD, one can represent the Jacobian as a +transformation $j: (x \to y) \to (y, \bar y, x) \to \bar x$. $j$ is +called the /Vector Jacobian Product/ (VJP). For example, +$j(\exp)(y, \bar y, x) = y \bar y$ since given $y = \exp(x)$, + +$\partial_x L = \partial_x y \cdot \partial_y L = \partial_x \exp(x) \cdot \partial_y L = y \bar y$ + +as another example, $j(+)(y, \bar y, x_1, x_2) = (\bar y, \bar y)$ since +given $y = x_1 + x_2$, $\bar{x_1} = \bar{x_2} = \bar y$. + +Similarly, + +1. $j(/)(y, \bar y, x_1, x_2) = (\bar y / x_2, - \bar y x_1 / x_2^2)$ +2. $j(\log)(y, \bar y, x) = \bar y / x$ +3. $j((A, \beta) \mapsto A \beta)(y, \bar y, A, \beta) = (\bar y \otimes \beta, A^T \bar y)$. +4. etc... + +In the third one, the function is a matrix $A$ multiplied on the right +by a column vector $\beta$, and $\bar y \otimes \beta$ is the tensor +product which is a fancy way of writing $\bar y \beta^T$. See +[[https://github.com/mattjj/autodidact/blob/master/autograd/numpy/numpy_vjps.py][numpy_vjps.py]] +for the implementation in autodidact. + +So, given a node say $y = y(x_1, x_2, ..., x_n)$, and given the value of +$y$, $x_{1 : n}$ and $\bar y$, rmAD computes the values of +$\bar x_{1 : n}$ by using the Jacobians. + +This is the gist of rmAD. It stores the values of each node in a forward +pass, and computes the derivatives of each node exactly once in a +backward pass. + +It is a nice exercise to derive the backpropagation in the fully +connected feedforward neural networks +(e.g. [[http://neuralnetworksanddeeplearning.com/chap2.html#the_four_fundamental_equations_behind_backpropagation][the +one for MNIST in Neural Networks and Deep Learning]]) using rmAD. + +AD is an approach lying between the extremes of numerical approximation +(e.g. finite difference) and symbolic evaluation. It uses exact formulas +(VJP) at each elementary operation like symbolic evaluation, while +evaluates each VJP numerically rather than lumping all the VJPs into an +unwieldy symbolic formula. + +Things to look further into: the higher-order functional currying form +$j: (x \to y) \to (y, \bar y, x) \to \bar x$ begs for a functional +programming implementation. diff --git a/posts/2018-12-02-lime-shapley.org b/posts/2018-12-02-lime-shapley.org new file mode 100644 index 0000000..05ef4ee --- /dev/null +++ b/posts/2018-12-02-lime-shapley.org @@ -0,0 +1,362 @@ +#+title: Shapley, LIME and SHAP + +#+date: <2018-12-02> + +In this post I explain LIME (Ribeiro et. al. 2016), the Shapley values +(Shapley, 1953) and the SHAP values (Strumbelj-Kononenko, 2014; +Lundberg-Lee, 2017). + +*Acknowledgement*. Thanks to Josef Lindman Hörnlund for bringing the +LIME and SHAP papers to my attention. The research was done while +working at KTH mathematics department. + +/If you are reading on a mobile device, you may need to "request desktop +site" for the equations to be properly displayed. This post is licensed +under CC BY-SA and GNU FDL./ + +** Shapley values + :PROPERTIES: + :CUSTOM_ID: shapley-values + :END: +A coalitional game $(v, N)$ of $n$ players involves + +- The set $N = \{1, 2, ..., n\}$ that represents the players. +- A function $v: 2^N \to \mathbb R$, where $v(S)$ is the worth of + coalition $S \subset N$. + +The Shapley values $\phi_i(v)$ of such a game specify a fair way to +distribute the total worth $v(N)$ to the players. It is defined as (in +the following, for a set $S \subset N$ we use the convention $s = |S|$ +to be the number of elements of set $S$ and the shorthand +$S - i := S \setminus \{i\}$ and $S + i := S \cup \{i\}$) + +$$\phi_i(v) = \sum_{S: i \in S} {(n - s)! (s - 1)! \over n!} (v(S) - v(S - i)).$$ + +It is not hard to see that $\phi_i(v)$ can be viewed as an expectation: + +$$\phi_i(v) = \mathbb E_{S \sim \nu_i} (v(S) - v(S - i))$$ + +where $\nu_i(S) = n^{-1} {n - 1 \choose s - 1}^{-1} 1_{i \in S}$, that +is, first pick the size $s$ uniformly from $\{1, 2, ..., n\}$, then pick +$S$ uniformly from the subsets of $N$ that has size $s$ and contains +$i$. + +The Shapley values satisfy some nice properties which are readily +verified, including: + +- *Efficiency*. $\sum_i \phi_i(v) = v(N) - v(\emptyset)$. +- *Symmetry*. If for some $i, j \in N$, for all $S \subset N$, we have + $v(S + i) = v(S + j)$, then $\phi_i(v) = \phi_j(v)$. +- *Null player*. If for some $i \in N$, for all $S \subset N$, we have + $v(S + i) = v(S)$, then $\phi_i(v) = 0$. +- *Linearity*. $\phi_i$ is linear in games. That is + $\phi_i(v) + \phi_i(w) = \phi_i(v + w)$, where $v + w$ is defined by + $(v + w)(S) := v(S) + w(S)$. + +In the literature, an added assumption $v(\emptyset) = 0$ is often +given, in which case the Efficiency property is defined as +$\sum_i \phi_i(v) = v(N)$. Here I discard this assumption to avoid minor +inconsistencies across different sources. For example, in the LIME +paper, the local model is defined without an intercept, even though the +underlying $v(\emptyset)$ may not be $0$. In the SHAP paper, an +intercept $\phi_0 = v(\emptyset)$ is added which fixes this problem when +making connections to the Shapley values. + +Conversely, according to Strumbelj-Kononenko (2010), it was shown in +Shapley's original paper (Shapley, 1953) that these four properties +together with $v(\emptyset) = 0$ defines the Shapley values. + +** LIME + :PROPERTIES: + :CUSTOM_ID: lime + :END: +LIME (Ribeiro et. al. 2016) is a model that offers a way to explain +feature contributions of supervised learning models locally. + +Let $f: X_1 \times X_2 \times ... \times X_n \to \mathbb R$ be a +function. We can think of $f$ as a model, where $X_j$ is the space of +$j$th feature. For example, in a language model, $X_j$ may correspond to +the count of the $j$th word in the vocabulary, i.e. the bag-of-words +model. + +The output may be something like housing price, or log-probability of +something. + +LIME tries to assign a value to each feature /locally/. By locally, we +mean that given a specific sample $x \in X := \prod_{i = 1}^n X_i$, we +want to fit a model around it. + +More specifically, let $h_x: 2^N \to X$ be a function defined by + +$$(h_x(S))_i = +\begin{cases} +x_i, & \text{if }i \in S; \\ +0, & \text{otherwise.} +\end{cases}$$ + +That is, $h_x(S)$ masks the features that are not in $S$, or in other +words, we are perturbing the sample $x$. Specifically, $h_x(N) = x$. +Alternatively, the $0$ in the "otherwise" case can be replaced by some +kind of default value (see the section titled SHAP in this post). + +For a set $S \subset N$, let us denote $1_S \in \{0, 1\}^n$ to be an +$n$-bit where the $k$th bit is $1$ if and only if $k \in S$. + +Basically, LIME samples $S_1, S_2, ..., S_m \subset N$ to obtain a set +of perturbed samples $x_i = h_x(S_i)$ in the $X$ space, and then fits a +linear model $g$ using $1_{S_i}$ as the input samples and $f(h_x(S_i))$ +as the output samples: + +*Problem*(LIME). Find $w = (w_1, w_2, ..., w_n)$ that minimises + +$$\sum_i (w \cdot 1_{S_i} - f(h_x(S_i)))^2 \pi_x(h_x(S_i))$$ + +where $\pi_x(x')$ is a function that penalises $x'$s that are far away +from $x$. In the LIME paper the Gaussian kernel was used: + +$$\pi_x(x') = \exp\left({- \|x - x'\|^2 \over \sigma^2}\right).$$ + +Then $w_i$ represents the importance of the $i$th feature. + +The LIME model has a more general framework, but the specific model +considered in the paper is the one described above, with a Lasso for +feature selection. + +*Remark*. One difference between our account here and the one in the +LIME paper is: the dimension of the data space may differ from $n$ (see +Section 3.1 of that paper). But in the case of text data, they do use +bag-of-words (our $X$) for an "intermediate" representation. So my +understanding is, in their context, there is an "original" data space +(let's call it $X'$). And there is a one-one correspondence between $X'$ +and $X$ (let's call it $r: X' \to X$), so that given a sample +$x' \in X'$, we can compute the output of $S$ in the local model with +$f(r^{-1}(h_{r(x')}(S)))$. As an example, in the example of $X$ being +the bag of words, $X'$ may be the embedding vector space, so that +$r(x') = A^{-1} x'$, where $A$ is the word embedding matrix. Therefore, +without loss of generality, we assume the input space to be $X$ which is +of dimension $n$. + +** Shapley values and LIME + :PROPERTIES: + :CUSTOM_ID: shapley-values-and-lime + :END: +The connection between the Shapley values and LIME is noted in +Lundberg-Lee (2017), but the underlying connection goes back to 1988 +(Charnes et. al.). + +To see the connection, we need to modify LIME a bit. + +First, we need to make LIME less efficient by considering /all/ the +$2^n$ subsets instead of the $m$ samples $S_1, S_2, ..., S_{m}$. + +Then we need to relax the definition of $\pi_x$. It no longer needs to +penalise samples that are far away from $x$. In fact, we will see later +than the choice of $\pi_x(x')$ that yields the Shapley values is high +when $x'$ is very close or very far away from $x$, and low otherwise. We +further add the restriction that $\pi_x(h_x(S))$ only depends on the +size of $S$, thus we rewrite it as $q(s)$ instead. + +We also denote $v(S) := f(h_x(S))$ and $w(S) = \sum_{i \in S} w_i$. + +Finally, we add the Efficiency property as a constraint: +$\sum_{i = 1}^n w_i = f(x) - f(h_x(\emptyset)) = v(N) - v(\emptyset)$. + +Then the problem becomes a weighted linear regression: + +*Problem*. minimises $\sum_{S \subset N} (w(S) - v(S))^2 q(s)$ over $w$ +subject to $w(N) = v(N) - v(\emptyset)$. + +*Claim* (Charnes et. al. 1988). The solution to this problem is + +$$w_i = {1 \over n} (v(N) - v(\emptyset)) + \left(\sum_{s = 1}^{n - 1} {n - 2 \choose s - 1} q(s)\right)^{-1} \sum_{S \subset N: i \in S} \left({n - s \over n} q(s) v(S) - {s - 1 \over n} q(s - 1) v(S - i)\right). \qquad (-1)$$ + +Specifically, if we choose + +$$q(s) = c {n - 2 \choose s - 1}^{-1}$$ + +for any constant $c$, then $w_i = \phi_i(v)$ are the Shapley values. + +*Remark*. Don't worry about this specific choice of $q(s)$ when $s = 0$ +or $n$, because $q(0)$ and $q(n)$ do not appear on the right hand side +of (-1). Therefore they can be defined to be of any value. A common +convention of the binomial coefficients is to set ${\ell \choose k} = 0$ +if $k < 0$ or $k > \ell$, in which case $q(0) = q(n) = \infty$. + +In Lundberg-Lee (2017), $c$ is chosen to be $1 / n$, see Theorem 2 +there. + +In Charnes et. al. 1988, the $w_i$s defined in (-1) are called the +generalised Shapley values. + +*Proof*. The Lagrangian is + +$$L(w, \lambda) = \sum_{S \subset N} (v(S) - w(S))^2 q(s) - \lambda(w(N) - v(N) + v(\emptyset)).$$ + +and by making $\partial_{w_i} L(w, \lambda) = 0$ we have + +$${1 \over 2} \lambda = \sum_{S \subset N: i \in S} (w(S) - v(S)) q(s). \qquad (0)$$ + +Summing (0) over $i$ and divide by $n$, we have + +$${1 \over 2} \lambda = {1 \over n} \sum_i \sum_{S: i \in S} (w(S) q(s) - v(S) q(s)). \qquad (1)$$ + +We examine each of the two terms on the right hand side. + +Counting the terms involving $w_i$ and $w_j$ for $j \neq i$, and using +$w(N) = v(N) - v(\emptyset)$ we have + +$$\begin{aligned} +&\sum_{S \subset N: i \in S} w(S) q(s) \\ +&= \sum_{s = 1}^n {n - 1 \choose s - 1} q(s) w_i + \sum_{j \neq i}\sum_{s = 2}^n {n - 2 \choose s - 2} q(s) w_j \\ +&= q(1) w_i + \sum_{s = 2}^n q(s) \left({n - 1 \choose s - 1} w_i + \sum_{j \neq i} {n - 2 \choose s - 2} w_j\right) \\ +&= q(1) w_i + \sum_{s = 2}^n \left({n - 2 \choose s - 1} w_i + {n - 2 \choose s - 2} (v(N) - v(\emptyset))\right) q(s) \\ +&= \sum_{s = 1}^{n - 1} {n - 2 \choose s - 1} q(s) w_i + \sum_{s = 2}^n {n - 2 \choose s - 2} q(s) (v(N) - v(\emptyset)). \qquad (2) +\end{aligned}$$ + +Summing (2) over $i$, we obtain + +$$\begin{aligned} +&\sum_i \sum_{S: i \in S} w(S) q(s)\\ +&= \sum_{s = 1}^{n - 1} {n - 2 \choose s - 1} q(s) (v(N) - v(\emptyset)) + \sum_{s = 2}^n n {n - 2 \choose s - 2} q(s) (v(N) - v(\emptyset))\\ +&= \sum_{s = 1}^n s{n - 1 \choose s - 1} q(s) (v(N) - v(\emptyset)). \qquad (3) +\end{aligned}$$ + +For the second term in (1), we have + +$$\sum_i \sum_{S: i \in S} v(S) q(s) = \sum_{S \subset N} s v(S) q(s). \qquad (4)$$ + +Plugging (3)(4) in (1), we have + +$${1 \over 2} \lambda = {1 \over n} \left(\sum_{s = 1}^n s {n - 1 \choose s - 1} q(s) (v(N) - v(\emptyset)) - \sum_{S \subset N} s q(s) v(S) \right). \qquad (5)$$ + +Plugging (5)(2) in (0) and solve for $w_i$, we have + +$$w_i = {1 \over n} (v(N) - v(\emptyset)) + \left(\sum_{s = 1}^{n - 1} {n - 2 \choose s - 1} q(s) \right)^{-1} \left( \sum_{S: i \in S} q(s) v(S) - {1 \over n} \sum_{S \subset N} s q(s) v(S) \right). \qquad (6)$$ + +By splitting all subsets of $N$ into ones that contain $i$ and ones that +do not and pair them up, we have + +$$\sum_{S \subset N} s q(s) v(S) = \sum_{S: i \in S} (s q(s) v(S) + (s - 1) q(s - 1) v(S - i)).$$ + +Plugging this back into (6) we get the desired result. $\square$ + +** SHAP + :PROPERTIES: + :CUSTOM_ID: shap + :END: +The paper that coined the term "SHAP values" (Lundberg-Lee 2017) is not +clear in its definition of the "SHAP values" and its relation to LIME, +so the following is my interpretation of their interpretation model, +which coincide with a model studied in Strumbelj-Kononenko 2014. + +Recall that we want to calculate feature contributions to a model $f$ at +a sample $x$. + +Let $\mu$ be a probability density function over the input space +$X = X_1 \times ... \times X_n$. A natural choice would be the density +that generates the data, or one that approximates such density (e.g. +empirical distribution). + +The feature contribution (SHAP value) is thus defined as the Shapley +value $\phi_i(v)$, where + +$$v(S) = \mathbb E_{z \sim \mu} (f(z) | z_S = x_S). \qquad (7)$$ + +So it is a conditional expectation where $z_i$ is clamped for $i \in S$. +In fact, the definition of feature contributions in this form predates +Lundberg-Lee 2017. For example, it can be found in +Strumbelj-Kononenko 2014. + +One simplification is to assume the $n$ features are independent, thus +$\mu = \mu_1 \times \mu_2 \times ... \times \mu_n$. In this case, (7) +becomes + +$$v(S) = \mathbb E_{z_{N \setminus S} \sim \mu_{N \setminus S}} f(x_S, z_{N \setminus S}) \qquad (8)$$ + +For example, Strumbelj-Kononenko (2010) considers this scenario where +$\mu$ is the uniform distribution over $X$, see Definition 4 there. + +A further simplification is model linearity, which means $f$ is linear. +In this case, (8) becomes + +$$v(S) = f(x_S, \mathbb E_{\mu_{N \setminus S}} z_{N \setminus S}). \qquad (9)$$ + +It is worth noting that to make the modified LIME model considered in +the previous section fall under the linear SHAP framework (9), we need +to make two further specialisations, the first is rather cosmetic: we +need to change the definition of $h_x(S)$ to + +$$(h_x(S))_i = +\begin{cases} +x_i, & \text{if }i \in S; \\ +\mathbb E_{\mu_i} z_i, & \text{otherwise.} +\end{cases}$$ + +But we also need to boldly assume the original $f$ to be linear, which +in my view, defeats the purpose of interpretability, because linear +models are interpretable by themselves. + +One may argue that perhaps we do not need linearity to define $v(S)$ as +in (9). If we do so, however, then (9) loses mathematical meaning. A +bigger question is: how effective is SHAP? An even bigger question: in +general, how do we evaluate models of interpretation? + +** Evaluating SHAP + :PROPERTIES: + :CUSTOM_ID: evaluating-shap + :END: +The quest of the SHAP paper can be decoupled into two independent +components: showing the niceties of Shapley values and choosing the +coalitional game $v$. + +The SHAP paper argues that Shapley values $\phi_i(v)$ are a good +measurement because they are the only values satisfying the some nice +properties including the Efficiency property mentioned at the beginning +of the post, invariance under permutation and monotonicity, see the +paragraph below Theorem 1 there, which refers to Theorem 2 of Young +(1985). + +Indeed, both efficiency (the "additive feature attribution methods" in +the paper) and monotonicity are meaningful when considering $\phi_i(v)$ +as the feature contribution of the $i$th feature. + +The question is thus reduced to the second component: what constitutes a +nice choice of $v$? + +The SHAP paper answers this question with 3 options with increasing +simplification: (7)(8)(9) in the previous section of this post +(corresponding to (9)(11)(12) in the paper). They are intuitive, but it +will be interesting to see more concrete (or even mathematical) +justifications of such choices. + +** References + :PROPERTIES: + :CUSTOM_ID: references + :END: + +- Charnes, A., B. Golany, M. Keane, and J. Rousseau. "Extremal Principle + Solutions of Games in Characteristic Function Form: Core, Chebychev + and Shapley Value Generalizations." In Econometrics of Planning and + Efficiency, edited by Jati K. Sengupta and Gopal K. Kadekodi, 123--33. + Dordrecht: Springer Netherlands, 1988. + [[https://doi.org/10.1007/978-94-009-3677-5_7]]. +- Lundberg, Scott, and Su-In Lee. "A Unified Approach to Interpreting + Model Predictions." ArXiv:1705.07874 [Cs, Stat], May 22, 2017. + [[http://arxiv.org/abs/1705.07874]]. +- Ribeiro, Marco Tulio, Sameer Singh, and Carlos Guestrin. "'Why Should + I Trust You?': Explaining the Predictions of Any Classifier." + ArXiv:1602.04938 [Cs, Stat], February 16, 2016. + [[http://arxiv.org/abs/1602.04938]]. +- Shapley, L. S. "17. A Value for n-Person Games." In Contributions to + the Theory of Games (AM-28), Volume II, Vol. 2. Princeton: Princeton + University Press, 1953. [[https://doi.org/10.1515/9781400881970-018]]. +- Strumbelj, Erik, and Igor Kononenko. "An Efficient Explanation of + Individual Classifications Using Game Theory." J. Mach. Learn. Res. 11 + (March 2010): 1--18. +- Strumbelj, Erik, and Igor Kononenko. "Explaining Prediction Models and + Individual Predictions with Feature Contributions." Knowledge and + Information Systems 41, no. 3 (December 2014): 647--65. + [[https://doi.org/10.1007/s10115-013-0679-x]]. +- Young, H. P. "Monotonic Solutions of Cooperative Games." International + Journal of Game Theory 14, no. 2 (June 1, 1985): 65--72. + [[https://doi.org/10.1007/BF01769885]]. diff --git a/posts/2019-01-03-discriminant-analysis.org b/posts/2019-01-03-discriminant-analysis.org new file mode 100644 index 0000000..34c16bf --- /dev/null +++ b/posts/2019-01-03-discriminant-analysis.org @@ -0,0 +1,293 @@ +#+title: Discriminant analysis + +#+DATE: <2019-01-03> + +In this post I talk about the theory and implementation of linear and +quadratic discriminant analysis, classical methods in statistical +learning. + +*Acknowledgement*. Various sources were of great help to my +understanding of the subject, including Chapter 4 of +[[https://web.stanford.edu/~hastie/ElemStatLearn/][The Elements of +Statistical Learning]], +[[http://cs229.stanford.edu/notes/cs229-notes2.pdf][Stanford CS229 +Lecture notes]], and +[[https://github.com/scikit-learn/scikit-learn/blob/7389dba/sklearn/discriminant_analysis.py][the +scikit-learn code]]. Research was done while working at KTH mathematics +department. + +/If you are reading on a mobile device, you may need to "request desktop +site" for the equations to be properly displayed. This post is licensed +under CC BY-SA and GNU FDL./ + +** Theory + :PROPERTIES: + :CUSTOM_ID: theory + :END: +Quadratic discriminant analysis (QDA) is a classical classification +algorithm. It assumes that the data is generated by Gaussian +distributions, where each class has its own mean and covariance. + +$$(x | y = i) \sim N(\mu_i, \Sigma_i).$$ + +It also assumes a categorical class prior: + +$$\mathbb P(y = i) = \pi_i$$ + +The log-likelihood is thus + +$$\begin{aligned} +\log \mathbb P(y = i | x) &= \log \mathbb P(x | y = i) \log \mathbb P(y = i) + C\\ +&= - {1 \over 2} \log \det \Sigma_i - {1 \over 2} (x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) + \log \pi_i + C', \qquad (0) +\end{aligned}$$ + +where $C$ and $C'$ are constants. + +Thus the prediction is done by taking argmax of the above formula. + +In training, let $X$, $y$ be the input data, where $X$ is of shape +$m \times n$, and $y$ of shape $m$. We adopt the convention that each +row of $X$ is a sample $x^{(i)T}$. So there are $m$ samples and $n$ +features. Denote by $m_i = \#\{j: y_j = i\}$ be the number of samples in +class $i$. Let $n_c$ be the number of classes. + +We estimate $\mu_i$ by the sample means, and $\pi_i$ by the frequencies: + +$$\begin{aligned} +\mu_i &:= {1 \over m_i} \sum_{j: y_j = i} x^{(j)}, \\ +\pi_i &:= \mathbb P(y = i) = {m_i \over m}. +\end{aligned}$$ + +Linear discriminant analysis (LDA) is a specialisation of QDA: it +assumes all classes share the same covariance, i.e. $\Sigma_i = \Sigma$ +for all $i$. + +Guassian Naive Bayes is a different specialisation of QDA: it assumes +that all $\Sigma_i$ are diagonal, since all the features are assumed to +be independent. + +*** QDA + :PROPERTIES: + :CUSTOM_ID: qda + :END: +We look at QDA. + +We estimate $\Sigma_i$ by the variance mean: + +$$\begin{aligned} +\Sigma_i &= {1 \over m_i - 1} \sum_{j: y_j = i} \hat x^{(j)} \hat x^{(j)T}. +\end{aligned}$$ + +where $\hat x^{(j)} = x^{(j)} - \mu_{y_j}$ are the centred $x^{(j)}$. +Plugging this into (0) we are done. + +There are two problems that can break the algorithm. First, if one of +the $m_i$ is $1$, then $\Sigma_i$ is ill-defined. Second, one of +$\Sigma_i$'s might be singular. + +In either case, there is no way around it, and the implementation should +throw an exception. + +This won't be a problem of the LDA, though, unless there is only one +sample for each class. + +*** Vanilla LDA + :PROPERTIES: + :CUSTOM_ID: vanilla-lda + :END: +Now let us look at LDA. + +Since all classes share the same covariance, we estimate $\Sigma$ using +sample variance + +$$\begin{aligned} +\Sigma &= {1 \over m - n_c} \sum_j \hat x^{(j)} \hat x^{(j)T}, +\end{aligned}$$ + +where $\hat x^{(j)} = x^{(j)} - \mu_{y_j}$ and ${1 \over m - n_c}$ comes +from [[https://en.wikipedia.org/wiki/Bessel%27s_correction][Bessel's +Correction]]. + +Let us write down the decision function (0). We can remove the first +term on the right hand side, since all $\Sigma_i$ are the same, and we +only care about argmax of that equation. Thus it becomes + +$$- {1 \over 2} (x - \mu_i)^T \Sigma^{-1} (x - \mu_i) + \log\pi_i. \qquad (1)$$ + +Notice that we just walked around the problem of figuring out +$\log \det \Sigma$ when $\Sigma$ is singular. + +But how about $\Sigma^{-1}$? + +We sidestep this problem by using the pseudoinverse of $\Sigma$ instead. +This can be seen as applying a linear transformation to $X$ to turn its +covariance matrix to identity. And thus the model becomes a sort of a +nearest neighbour classifier. + +*** Nearest neighbour classifier + :PROPERTIES: + :CUSTOM_ID: nearest-neighbour-classifier + :END: +More specifically, we want to transform the first term of (0) to a norm +to get a classifier based on nearest neighbour modulo $\log \pi_i$: + +$$- {1 \over 2} \|A(x - \mu_i)\|^2 + \log\pi_i$$ + +To compute $A$, we denote + +$$X_c = X - M,$$ + +where the $i$th row of $M$ is $\mu_{y_i}^T$, the mean of the class $x_i$ +belongs to, so that $\Sigma = {1 \over m - n_c} X_c^T X_c$. + +Let + +$${1 \over \sqrt{m - n_c}} X_c = U_x \Sigma_x V_x^T$$ + +be the SVD of ${1 \over \sqrt{m - n_c}}X_c$. Let +$D_x = \text{diag} (s_1, ..., s_r)$ be the diagonal matrix with all the +nonzero singular values, and rewrite $V_x$ as an $n \times r$ matrix +consisting of the first $r$ columns of $V_x$. + +Then with an abuse of notation, the pseudoinverse of $\Sigma$ is + +$$\Sigma^{-1} = V_x D_x^{-2} V_x^T.$$ + +So we just need to make $A = D_x^{-1} V_x^T$. When it comes to +prediction, just transform $x$ with $A$, and find the nearest centroid +$A \mu_i$ (again, modulo $\log \pi_i$) and label the input with $i$. + +*** Dimensionality reduction + :PROPERTIES: + :CUSTOM_ID: dimensionality-reduction + :END: +We can further simplify the prediction by dimensionality reduction. +Assume $n_c \le n$. Then the centroid spans an affine space of dimension +$p$ which is at most $n_c - 1$. So what we can do is to project both the +transformed sample $Ax$ and centroids $A\mu_i$ to the linear subspace +parallel to the affine space, and do the nearest neighbour +classification there. + +So we can perform SVD on the matrix $(M - \bar x) V_x D_x^{-1}$ where +$\bar x$, a row vector, is the sample mean of all data i.e. average of +rows of $X$: + +$$(M - \bar x) V_x D_x^{-1} = U_m \Sigma_m V_m^T.$$ + +Again, we let $V_m$ be the $r \times p$ matrix by keeping the first $p$ +columns of $V_m$. + +The projection operator is thus $V_m$. And so the final transformation +is $V_m^T D_x^{-1} V_x^T$. + +There is no reason to stop here, and we can set $p$ even smaller, which +will result in a lossy compression / regularisation equivalent to doing +[[https://en.wikipedia.org/wiki/Principal_component_analysis][principle +component analysis]] on $(M - \bar x) V_x D_x^{-1}$. + +Note that as of 2019-01-04, in the +[[https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/discriminant_analysis.py][scikit-learn +implementation of LDA]], the prediction is done without any lossy +compression, even if the parameter =n_components= is set to be smaller +than dimension of the affine space spanned by the centroids. In other +words, the prediction does not change regardless of =n_components=. + +*** Fisher discriminant analysis + :PROPERTIES: + :CUSTOM_ID: fisher-discriminant-analysis + :END: +The Fisher discriminant analysis involves finding an $n$-dimensional +vector $a$ that maximises between-class covariance with respect to +within-class covariance: + +$${a^T M_c^T M_c a \over a^T X_c^T X_c a},$$ + +where $M_c = M - \bar x$ is the centred sample mean matrix. + +As it turns out, this is (almost) equivalent to the derivation above, +modulo a constant. In particular, $a = c V_m^T D_x^{-1} V_x^T$ where +$p = 1$ for arbitrary constant $c$. + +To see this, we can first multiply the denominator with a constant +${1 \over m - n_c}$ so that the matrix in the denominator becomes the +covariance estimate $\Sigma$. + +We decompose $a$: $a = V_x D_x^{-1} b + \tilde V_x \tilde b$, where +$\tilde V_x$ consists of column vectors orthogonal to the column space +of $V_x$. + +We ignore the second term in the decomposition. In other words, we only +consider $a$ in the column space of $V_x$. + +Then the problem is to find an $r$-dimensional vector $b$ to maximise + +$${b^T (M_c V_x D_x^{-1})^T (M_c V_x D_x^{-1}) b \over b^T b}.$$ + +This is the problem of principle component analysis, and so $b$ is first +column of $V_m$. + +Therefore, the solution to Fisher discriminant analysis is +$a = c V_x D_x^{-1} V_m$ with $p = 1$. + +*** Linear model + :PROPERTIES: + :CUSTOM_ID: linear-model + :END: +The model is called linear discriminant analysis because it is a linear +model. To see this, let $B = V_m^T D_x^{-1} V_x^T$ be the matrix of +transformation. Now we are comparing + +$$- {1 \over 2} \| B x - B \mu_k\|^2 + \log \pi_k$$ + +across all $k$s. Expanding the norm and removing the common term +$\|B x\|^2$, we see a linear form: + +$$\mu_k^T B^T B x - {1 \over 2} \|B \mu_k\|^2 + \log\pi_k$$ + +So the prediction for $X_{\text{new}}$ is + +$$\text{argmax}_{\text{axis}=0} \left(K B^T B X_{\text{new}}^T - {1 \over 2} \|K B^T\|_{\text{axis}=1}^2 + \log \pi\right)$$ + +thus the decision boundaries are linear. + +This is how scikit-learn implements LDA, by inheriting from +=LinearClassifierMixin= and redirecting the classification there. + +** Implementation + :PROPERTIES: + :CUSTOM_ID: implementation + :END: +This is where things get interesting. How do I validate my understanding +of the theory? By implementing and testing the algorithm. + +I try to implement it as close as possible to the natural language / +mathematical descriptions of the model, which means clarity over +performance. + +How about testing? Numerical experiments are harder to test than +combinatorial / discrete algorithms in general because the output is +less verifiable by hand. My shortcut solution to this problem is to test +against output from the scikit-learn package. + +It turned out to be harder than expected, as I had to dig into the code +of scikit-learn when the outputs mismatch. Their code is quite +well-written though. + +The result is +[[https://github.com/ycpei/machine-learning/tree/master/discriminant-analysis][here]]. + +*** Fun facts about LDA + :PROPERTIES: + :CUSTOM_ID: fun-facts-about-lda + :END: +One property that can be used to test the LDA implementation is the fact +that the scatter matrix $B(X - \bar x)^T (X - \bar X) B^T$ of the +transformed centred sample is diagonal. + +This can be derived by using another fun fact that the sum of the +in-class scatter matrix and the between-class scatter matrix is the +sample scatter matrix: + +$$X_c^T X_c + M_c^T M_c = (X - \bar x)^T (X - \bar x) = (X_c + M_c)^T (X_c + M_c).$$ + +The verification is not very hard and left as an exercise. diff --git a/posts/2019-02-14-raise-your-elbo.org b/posts/2019-02-14-raise-your-elbo.org new file mode 100644 index 0000000..9e15552 --- /dev/null +++ b/posts/2019-02-14-raise-your-elbo.org @@ -0,0 +1,1150 @@ +#+title: Raise your ELBO + +#+date: <2019-02-14> + +In this post I give an introduction to variational inference, which is +about maximising the evidence lower bound (ELBO). + +I use a top-down approach, starting with the KL divergence and the ELBO, +to lay the mathematical framework of all the models in this post. + +Then I define mixture models and the EM algorithm, with Gaussian mixture +model (GMM), probabilistic latent semantic analysis (pLSA) and the +hidden markov model (HMM) as examples. + +After that I present the fully Bayesian version of EM, also known as +mean field approximation (MFA), and apply it to fully Bayesian mixture +models, with fully Bayesian GMM (also known as variational GMM), latent +Dirichlet allocation (LDA) and Dirichlet process mixture model (DPMM) as +examples. + +Then I explain stochastic variational inference, a modification of EM +and MFA to improve efficiency. + +Finally I talk about autoencoding variational Bayes (AEVB), a +Monte-Carlo + neural network approach to raising the ELBO, exemplified +by the variational autoencoder (VAE). I also show its fully Bayesian +version. + +*Acknowledgement*. The following texts and resources were illuminating +during the writing of this post: the Stanford CS228 notes +([[https://ermongroup.github.io/cs228-notes/inference/variational/][1]],[[https://ermongroup.github.io/cs228-notes/learning/latent/][2]]), +the +[[https://www.cs.tau.ac.il/~rshamir/algmb/presentations/EM-BW-Ron-16%20.pdf][Tel +Aviv Algorithms in Molecular Biology slides]] (clear explanations of the +connection between EM and Baum-Welch), Chapter 10 of +[[https://www.springer.com/us/book/9780387310732][Bishop's book]] +(brilliant introduction to variational GMM), Section 2.5 of +[[http://cs.brown.edu/~sudderth/papers/sudderthPhD.pdf][Sudderth's +thesis]] and [[https://metacademy.org][metacademy]]. Also thanks to +Josef Lindman Hörnlund for discussions. The research was done while +working at KTH mathematics department. + +/If you are reading on a mobile device, you may need to "request desktop +site" for the equations to be properly displayed. This post is licensed +under CC BY-SA and GNU FDL./ + +** KL divergence and ELBO + :PROPERTIES: + :CUSTOM_ID: kl-divergence-and-elbo + :END: +Let $p$ and $q$ be two probability measures. The Kullback-Leibler (KL) +divergence is defined as + +$$D(q||p) = E_q \log{q \over p}.$$ + +It achieves minimum $0$ when $p = q$. + +If $p$ can be further written as + +$$p(x) = {w(x) \over Z}, \qquad (0)$$ + +where $Z$ is a normaliser, then + +$$\log Z = D(q||p) + L(w, q), \qquad(1)$$ + +where $L(w, q)$ is called the evidence lower bound (ELBO), defined by + +$$L(w, q) = E_q \log{w \over q}. \qquad (1.25)$$ + +From (1), we see that to minimise the nonnegative term $D(q || p)$, one +can maximise the ELBO. + +To this end, we can simply discard $D(q || p)$ in (1) and obtain: + +$$\log Z \ge L(w, q) \qquad (1.3)$$ + +and keep in mind that the inequality becomes an equality when +$q = {w \over Z}$. + +It is time to define the task of variational inference (VI), also known +as variational Bayes (VB). + +*Definition*. Variational inference is concerned with maximising the +ELBO $L(w, q)$. + +There are mainly two versions of VI, the half Bayesian and the fully +Bayesian cases. Half Bayesian VI, to which expectation-maximisation +algorithms (EM) apply, instantiates (1.3) with + +$$\begin{aligned} +Z &= p(x; \theta)\\ +w &= p(x, z; \theta)\\ +q &= q(z) +\end{aligned}$$ + +and the dummy variable $x$ in Equation (0) is substituted with $z$. + +Fully Bayesian VI, often just called VI, has the following +instantiations: + +$$\begin{aligned} +Z &= p(x) \\ +w &= p(x, z, \theta) \\ +q &= q(z, \theta) +\end{aligned}$$ + +and $x$ in Equation (0) is substituted with $(z, \theta)$. + +In both cases $\theta$ are parameters and $z$ are latent variables. + +*Remark on the naming of things*. The term "variational" comes from the +fact that we perform calculus of variations: maximise some functional +($L(w, q)$) over a set of functions ($q$). Note however, most of the VI +/ VB algorithms do not concern any techniques in calculus of variations, +but only uses Jensen's inequality / the fact the $D(q||p)$ reaches +minimum when $p = q$. Due to this reasoning of the naming, EM is also a +kind of VI, even though in the literature VI often referes to its fully +Bayesian version. + +** EM + :PROPERTIES: + :CUSTOM_ID: em + :END: +To illustrate the EM algorithms, we first define the mixture model. + +*Definition (mixture model)*. Given dataset $x_{1 : m}$, we assume the +data has some underlying latent variable $z_{1 : m}$ that may take a +value from a finite set $\{1, 2, ..., n_z\}$. Let $p(z_{i}; \pi)$ be +categorically distributed according to the probability vector $\pi$. +That is, $p(z_{i} = k; \pi) = \pi_k$. Also assume +$p(x_{i} | z_{i} = k; \eta) = p(x_{i}; \eta_k)$. Find +$\theta = (\pi, \eta)$ that maximises the likelihood +$p(x_{1 : m}; \theta)$. + +Represented as a DAG (a.k.a the plate notations), the model looks like +this: + +[[/assets/resources/mixture-model.png]] + +where the boxes with $m$ mean repitition for $m$ times, since there $m$ +indepdent pairs of $(x, z)$, and the same goes for $\eta$. + +The direct maximisation + +$$\max_\theta \sum_i \log p(x_{i}; \theta) = \max_\theta \sum_i \log \int p(x_{i} | z_i; \theta) p(z_i; \theta) dz_i$$ + +is hard because of the integral in the log. + +We can fit this problem in (1.3) by having $Z = p(x_{1 : m}; \theta)$ +and $w = p(z_{1 : m}, x_{1 : m}; \theta)$. The plan is to update +$\theta$ repeatedly so that $L(p(z, x; \theta_t), q(z))$ is non +decreasing over time $t$. + +Equation (1.3) at time $t$ for the $i$th datapoint is + +$$\log p(x_{i}; \theta_t) \ge L(p(z_i, x_{i}; \theta_t), q(z_i)) \qquad (2)$$ + +Each timestep consists of two steps, the E-step and the M-step. + +At E-step, we set + +$$q(z_{i}) = p(z_{i}|x_{i}; \theta_t), $$ + +to turn the inequality into equality. We denote $r_{ik} = q(z_i = k)$ +and call them responsibilities, so the posterior $q(z_i)$ is categorical +distribution with parameter $r_i = r_{i, 1 : n_z}$. + +At M-step, we maximise $\sum_i L(p(x_{i}, z_{i}; \theta), q(z_{i}))$ +over $\theta$: + +$$\begin{aligned} +\theta_{t + 1} &= \text{argmax}_\theta \sum_i L(p(x_{i}, z_{i}; \theta), p(z_{i} | x_{i}; \theta_t)) \\ +&= \text{argmax}_\theta \sum_i \mathbb E_{p(z_{i} | x_{i}; \theta_t)} \log p(x_{i}, z_{i}; \theta) \qquad (2.3) +\end{aligned}$$ + +So $\sum_i L(p(x_{i}, z_{i}; \theta), q(z_i))$ is non-decreasing at both +the E-step and the M-step. + +We can see from this derivation that EM is half-Bayesian. The E-step is +Bayesian it computes the posterior of the latent variables and the +M-step is frequentist because it performs maximum likelihood estimate of +$\theta$. + +It is clear that the ELBO sum coverges as it is nondecreasing with an +upper bound, but it is not clear whether the sum converges to the +correct value, i.e. $\max_\theta p(x_{1 : m}; \theta)$. In fact it is +said that the EM does get stuck in local maximum sometimes. + +A different way of describing EM, which will be useful in hidden Markov +model is: + +- At E-step, one writes down the formula + $$\sum_i \mathbb E_{p(z_i | x_{i}; \theta_t)} \log p(x_{i}, z_i; \theta). \qquad (2.5)$$ + +- At M-setp, one finds $\theta_{t + 1}$ to be the $\theta$ that + maximises the above formula. + +*** GMM + :PROPERTIES: + :CUSTOM_ID: gmm + :END: +Gaussian mixture model (GMM) is an example of mixture models. + +The space of the data is $\mathbb R^n$. We use the hypothesis that the +data is Gaussian conditioned on the latent variable: + +$$(x_i; \eta_k) \sim N(\mu_k, \Sigma_k),$$ + +so we write $\eta_k = (\mu_k, \Sigma_k)$, + +During E-step, the $q(z_i)$ can be directly computed using Bayes' +theorem: + +$$r_{ik} = q(z_i = k) = \mathbb P(z_i = k | x_{i}; \theta_t) += {g_{\mu_{t, k}, \Sigma_{t, k}} (x_{i}) \pi_{t, k} \over \sum_{j = 1 : n_z} g_{\mu_{t, j}, \Sigma_{t, j}} (x_{i}) \pi_{t, j}},$$ + +where +$g_{\mu, \Sigma} (x) = (2 \pi)^{- n / 2} \det \Sigma^{-1 / 2} \exp(- {1 \over 2} (x - \mu)^T \Sigma^{-1} (x - \mu))$ +is the pdf of the Gaussian distribution $N(\mu, \Sigma)$. + +During M-step, we need to compute + +$$\text{argmax}_{\Sigma, \mu, \pi} \sum_{i = 1 : m} \sum_{k = 1 : n_z} r_{ik} \log (g_{\mu_k, \Sigma_k}(x_{i}) \pi_k).$$ + +This is similar to the quadratic discriminant analysis, and the solution +is + +$$\begin{aligned} +\pi_{k} &= {1 \over m} \sum_{i = 1 : m} r_{ik}, \\ +\mu_{k} &= {\sum_i r_{ik} x_{i} \over \sum_i r_{ik}}, \\ +\Sigma_{k} &= {\sum_i r_{ik} (x_{i} - \mu_{t, k}) (x_{i} - \mu_{t, k})^T \over \sum_i r_{ik}}. +\end{aligned}$$ + +*Remark*. The k-means algorithm is the $\epsilon \to 0$ limit of the GMM +with constraints $\Sigma_k = \epsilon I$. See Section 9.3.2 of Bishop +2006 for derivation. It is also briefly mentioned there that a variant +in this setting where the covariance matrix is not restricted to +$\epsilon I$ is called elliptical k-means algorithm. + +*** SMM + :PROPERTIES: + :CUSTOM_ID: smm + :END: +As a transition to the next models to study, let us consider a simpler +mixture model obtained by making one modification to GMM: change +$(x; \eta_k) \sim N(\mu_k, \Sigma_k)$ to +$\mathbb P(x = w; \eta_k) = \eta_{kw}$ where $\eta$ is a stochastic +matrix and $w$ is an arbitrary element of the space for $x$. So now the +space for both $x$ and $z$ are finite. We call this model the simple +mixture model (SMM). + +As in GMM, at E-step $r_{ik}$ can be explicitly computed using Bayes' +theorem. + +It is not hard to write down the solution to the M-step in this case: + +$$\begin{aligned} +\pi_{k} &= {1 \over m} \sum_i r_{ik}, \qquad (2.7)\\ +\eta_{k, w} &= {\sum_i r_{ik} 1_{x_i = w} \over \sum_i r_{ik}}. \qquad (2.8) +\end{aligned}$$ + +where $1_{x_i = w}$ is the +[[https://en.wikipedia.org/wiki/Indicator_function][indicator +function]], and evaluates to $1$ if $x_i = w$ and $0$ otherwise. + +Two trivial variants of the SMM are the two versions of probabilistic +latent semantic analysis (pLSA), which we call pLSA1 and pLSA2. + +The model pLSA1 is a probabilistic version of latent semantic analysis, +which is basically a simple matrix factorisation model in collaborative +filtering, whereas pLSA2 has a fully Bayesian version called latent +Dirichlet allocation (LDA), not to be confused with the other LDA +(linear discriminant analysis). + +*** pLSA + :PROPERTIES: + :CUSTOM_ID: plsa + :END: +The pLSA model (Hoffman 2000) is a mixture model, where the dataset is +now pairs $(d_i, x_i)_{i = 1 : m}$. In natural language processing, $x$ +are words and $d$ are documents, and a pair $(d, x)$ represent an +ocurrance of word $x$ in document $d$. + +For each datapoint $(d_{i}, x_{i})$, + +$$\begin{aligned} +p(d_i, x_i; \theta) &= \sum_{z_i} p(z_i; \theta) p(d_i | z_i; \theta) p(x_i | z_i; \theta) \qquad (2.91)\\ +&= p(d_i; \theta) \sum_z p(x_i | z_i; \theta) p (z_i | d_i; \theta) \qquad (2.92). +\end{aligned}$$ + +Of the two formulations, (2.91) corresponds to pLSA type 1, and (2.92) +corresponds to type 2. + +**** pLSA1 + :PROPERTIES: + :CUSTOM_ID: plsa1 + :END: +The pLSA1 model (Hoffman 2000) is basically SMM with $x_i$ substituted +with $(d_i, x_i)$, which conditioned on $z_i$ are independently +categorically distributed: + +$$p(d_i = u, x_i = w | z_i = k; \theta) = p(d_i ; \xi_k) p(x_i; \eta_k) = \xi_{ku} \eta_{kw}.$$ + +The model can be illustrated in the plate notations: + +[[/assets/resources/plsa1.png]] + +So the solution of the M-step is + +$$\begin{aligned} +\pi_{k} &= {1 \over m} \sum_i r_{ik} \\ +\xi_{k, u} &= {\sum_i r_{ik} 1_{d_{i} = u} \over \sum_i r_{ik}} \\ +\eta_{k, w} &= {\sum_i r_{ik} 1_{x_{i} = w} \over \sum_i r_{ik}}. +\end{aligned}$$ + +*Remark*. pLSA1 is the probabilistic version of LSA, also known as +matrix factorisation. + +Let $n_d$ and $n_x$ be the number of values $d_i$ and $x_i$ can take. + +*Problem* (LSA). Let $R$ be a $n_d \times n_x$ matrix, fix +$s \le \min\{n_d, n_x\}$. Find $n_d \times s$ matrix $D$ and +$n_x \times s$ matrix $X$ that minimises + +$$J(D, X) = \|R - D X^T\|_F.$$ + +where $\|\cdot\|_F$ is the Frobenius norm. + +*Claim*. Let $R = U \Sigma V^T$ be the SVD of $R$, then the solution to +the above problem is $D = U_s \Sigma_s^{{1 \over 2}}$ and +$X = V_s \Sigma_s^{{1 \over 2}}$, where $U_s$ (resp. $V_s$) is the +matrix of the first $s$ columns of $U$ (resp. $V$) and $\Sigma_s$ is the +$s \times s$ submatrix of $\Sigma$. + +One can compare pLSA1 with LSA. Both procedures produce embeddings of +$d$ and $x$: in pLSA we obtain $n_z$ dimensional embeddings +$\xi_{\cdot, u}$ and $\eta_{\cdot, w}$, whereas in LSA we obtain $s$ +dimensional embeddings $D_{u, \cdot}$ and $X_{w, \cdot}$. + +**** pLSA2 + :PROPERTIES: + :CUSTOM_ID: plsa2 + :END: +Let us turn to pLSA2 (Hoffman 2004), corresponding to (2.92). We rewrite +it as + +$$p(x_i | d_i; \theta) = \sum_{z_i} p(x_i | z_i; \theta) p(z_i | d_i; \theta).$$ + +To simplify notations, we collect all the $x_i$s with the corresponding +$d_i$ equal to 1 (suppose there are $m_1$ of them), and write them as +$(x_{1, j})_{j = 1 : m_1}$. In the same fashion we construct +$x_{2, 1 : m_2}, x_{3, 1 : m_3}, ... x_{n_d, 1 : m_{n_d}}$. Similarly, +we relabel the corresponding $d_i$ and $z_i$ accordingly. + +With almost no loss of generality, we assume all $m_\ell$s are equal and +write them as $m$. + +Now the model becomes + +$$p(x_{\ell, i} | d_{\ell, i} = \ell; \theta) = \sum_k p(x_{\ell, i} | z_{\ell, i} = k; \theta) p(z_{\ell, i} = k | d_{\ell, i} = \ell; \theta).$$ + +Since we have regrouped the $x$'s and $z$'s whose indices record the +values of the $d$'s, we can remove the $d$'s from the equation +altogether: + +$$p(x_{\ell, i}; \theta) = \sum_k p(x_{\ell, i} | z_{\ell, i} = k; \theta) p(z_{\ell, i} = k; \theta).$$ + +It is effectively a modification of SMM by making $n_d$ copies of $\pi$. +More specifically the parameters are +$\theta = (\pi_{1 : n_d, 1 : n_z}, \eta_{1 : n_z, 1 : n_x})$, where we +model $(z | d = \ell) \sim \text{Cat}(\pi_{\ell, \cdot})$ and, as in +pLSA1, $(x | z = k) \sim \text{Cat}(\eta_{k, \cdot})$. + +Illustrated in the plate notations, pLSA2 is: + +[[/assets/resources/plsa2.png]] + +The computation is basically adding an index $\ell$ to the computation +of SMM wherever applicable. + +The updates at the E-step is + +$$r_{\ell i k} = p(z_{\ell i} = k | x_{\ell i}; \theta) \propto \pi_{\ell k} \eta_{k, x_{\ell i}}.$$ + +And at the M-step + +$$\begin{aligned} +\pi_{\ell k} &= {1 \over m} \sum_i r_{\ell i k} \\ +\eta_{k w} &= {\sum_{\ell, i} r_{\ell i k} 1_{x_{\ell i} = w} \over \sum_{\ell, i} r_{\ell i k}}. +\end{aligned}$$ + +*** HMM + :PROPERTIES: + :CUSTOM_ID: hmm + :END: +The hidden markov model (HMM) is a sequential version of SMM, in the +same sense that recurrent neural networks are sequential versions of +feed-forward neural networks. + +HMM is an example where the posterior $p(z_i | x_i; \theta)$ is not easy +to compute, and one has to utilise properties of the underlying Bayesian +network to go around it. + +Now each sample is a sequence $x_i = (x_{ij})_{j = 1 : T}$, and so are +the latent variables $z_i = (z_{ij})_{j = 1 : T}$. + +The latent variables are assumed to form a Markov chain with transition +matrix $(\xi_{k \ell})_{k \ell}$, and $x_{ij}$ is completely dependent +on $z_{ij}$: + +$$\begin{aligned} +p(z_{ij} | z_{i, j - 1}) &= \xi_{z_{i, j - 1}, z_{ij}},\\ +p(x_{ij} | z_{ij}) &= \eta_{z_{ij}, x_{ij}}. +\end{aligned}$$ + +Also, the distribution of $z_{i1}$ is again categorical with parameter +$\pi$: + +$$p(z_{i1}) = \pi_{z_{i1}}$$ + +So the parameters are $\theta = (\pi, \xi, \eta)$. And HMM can be shown +in plate notations as: + +[[/assets/resources/hmm.png]] + +Now we apply EM to HMM, which is called the +[[https://en.wikipedia.org/wiki/Baum%E2%80%93Welch_algorithm][Baum-Welch +algorithm]]. Unlike the previous examples, it is too messy to compute +$p(z_i | x_{i}; \theta)$, so during the E-step we instead write down +formula (2.5) directly in hope of simplifying it: + +$$\begin{aligned} +\mathbb E_{p(z_i | x_i; \theta_t)} \log p(x_i, z_i; \theta_t) &=\mathbb E_{p(z_i | x_i; \theta_t)} \left(\log \pi_{z_{i1}} + \sum_{j = 2 : T} \log \xi_{z_{i, j - 1}, z_{ij}} + \sum_{j = 1 : T} \log \eta_{z_{ij}, x_{ij}}\right). \qquad (3) +\end{aligned}$$ + +Let us compute the summand in second term: + +$$\begin{aligned} +\mathbb E_{p(z_i | x_{i}; \theta_t)} \log \xi_{z_{i, j - 1}, z_{ij}} &= \sum_{k, \ell} (\log \xi_{k, \ell}) \mathbb E_{p(z_{i} | x_{i}; \theta_t)} 1_{z_{i, j - 1} = k, z_{i, j} = \ell} \\ +&= \sum_{k, \ell} p(z_{i, j - 1} = k, z_{ij} = \ell | x_{i}; \theta_t) \log \xi_{k, \ell}. \qquad (4) +\end{aligned}$$ + +Similarly, one can write down the first term and the summand in the +third term to obtain + +$$\begin{aligned} +\mathbb E_{p(z_i | x_{i}; \theta_t)} \log \pi_{z_{i1}} &= \sum_k p(z_{i1} = k | x_{i}; \theta_t), \qquad (5) \\ +\mathbb E_{p(z_i | x_{i}; \theta_t)} \log \eta_{z_{i, j}, x_{i, j}} &= \sum_{k, w} 1_{x_{ij} = w} p(z_{i, j} = k | x_i; \theta_t) \log \eta_{k, w}. \qquad (6) +\end{aligned}$$ + +plugging (4)(5)(6) back into (3) and summing over $j$, we obtain the +formula to maximise over $\theta$ on: + +$$\sum_k \sum_i r_{i1k} \log \pi_k + \sum_{k, \ell} \sum_{j = 2 : T, i} s_{ijk\ell} \log \xi_{k, \ell} + \sum_{k, w} \sum_{j = 1 : T, i} r_{ijk} 1_{x_{ij} = w} \log \eta_{k, w},$$ + +where + +$$\begin{aligned} +r_{ijk} &:= p(z_{ij} = k | x_{i}; \theta_t), \\ +s_{ijk\ell} &:= p(z_{i, j - 1} = k, z_{ij} = \ell | x_{i}; \theta_t). +\end{aligned}$$ + +Now we proceed to the M-step. Since each of the +$\pi_k, \xi_{k, \ell}, \eta_{k, w}$ is nicely confined in the inner sum +of each term, together with the constraint +$\sum_k \pi_k = \sum_\ell \xi_{k, \ell} = \sum_w \eta_{k, w} = 1$ it is +not hard to find the argmax at time $t + 1$ (the same way one finds the +MLE for any categorical distribution): + +$$\begin{aligned} +\pi_{k} &= {1 \over m} \sum_i r_{i1k}, \qquad (6.1) \\ +\xi_{k, \ell} &= {\sum_{j = 2 : T, i} s_{ijk\ell} \over \sum_{j = 1 : T - 1, i} r_{ijk}}, \qquad(6.2) \\ +\eta_{k, w} &= {\sum_{ij} 1_{x_{ij} = w} r_{ijk} \over \sum_{ij} r_{ijk}}. \qquad(6.3) +\end{aligned}$$ + +Note that (6.1)(6.3) are almost identical to (2.7)(2.8). This makes +sense as the only modification HMM makes over SMM is the added +dependencies between the latent variables. + +What remains is to compute $r$ and $s$. + +This is done by using the forward and backward procedures which takes +advantage of the conditioned independence / topology of the underlying +Bayesian network. It is out of scope of this post, but for the sake of +completeness I include it here. + +Let + +$$\begin{aligned} +\alpha_k(i, j) &:= p(x_{i, 1 : j}, z_{ij} = k; \theta_t), \\ +\beta_k(i, j) &:= p(x_{i, j + 1 : T} | z_{ij} = k; \theta_t). +\end{aligned}$$ + +They can be computed recursively as + +$$\begin{aligned} +\alpha_k(i, j) &= \begin{cases} +\eta_{k, x_{1j}} \pi_k, & j = 1; \\ +\eta_{k, x_{ij}} \sum_\ell \alpha_\ell(j - 1, i) \xi_{k\ell}, & j \ge 2. +\end{cases}\\ +\beta_k(i, j) &= \begin{cases} +1, & j = T;\\ +\sum_\ell \xi_{k\ell} \beta_\ell(j + 1, i) \eta_{\ell, x_{i, j + 1}}, & j < T. +\end{cases} +\end{aligned}$$ + +Then + +$$\begin{aligned} +p(z_{ij} = k, x_{i}; \theta_t) &= \alpha_k(i, j) \beta_k(i, j), \qquad (7)\\ +p(x_{i}; \theta_t) &= \sum_k \alpha_k(i, j) \beta_k(i, j),\forall j = 1 : T \qquad (8)\\ +p(z_{i, j - 1} = k, z_{i, j} = \ell, x_{i}; \theta_t) &= \alpha_k(i, j) \xi_{k\ell} \beta_\ell(i, j + 1) \eta_{\ell, x_{j + 1, i}}. \qquad (9) +\end{aligned}$$ + +And this yields $r_{ijk}$ and $s_{ijk\ell}$ since they can be computed +as ${(7) \over (8)}$ and ${(9) \over (8)}$ respectively. + +** Fully Bayesian EM / MFA + :PROPERTIES: + :CUSTOM_ID: fully-bayesian-em-mfa + :END: +Let us now venture into the realm of full Bayesian. + +In EM we aim to maximise the ELBO + +$$\int q(z) \log {p(x, z; \theta) \over q(z)} dz d\theta$$ + +alternately over $q$ and $\theta$. As mentioned before, the E-step of +maximising over $q$ is Bayesian, in that it computes the posterior of +$z$, whereas the M-step of maximising over $\theta$ is maximum +likelihood and frequentist. + +The fully Bayesian EM makes the M-step Bayesian by making $\theta$ a +random variable, so the ELBO becomes + +$$L(p(x, z, \theta), q(z, \theta)) = \int q(z, \theta) \log {p(x, z, \theta) \over q(z, \theta)} dz d\theta$$ + +We further assume $q$ can be factorised into distributions on $z$ and +$\theta$: $q(z, \theta) = q_1(z) q_2(\theta)$. So the above formula is +rewritten as + +$$L(p(x, z, \theta), q(z, \theta)) = \int q_1(z) q_2(\theta) \log {p(x, z, \theta) \over q_1(z) q_2(\theta)} dz d\theta$$ + +To find argmax over $q_1$, we rewrite + +$$\begin{aligned} +L(p(x, z, \theta), q(z, \theta)) &= \int q_1(z) \left(\int q_2(\theta) \log p(x, z, \theta) d\theta\right) dz - \int q_1(z) \log q_1(z) dz - \int q_2(\theta) \log q_2(\theta) \\&= - D(q_1(z) || p_x(z)) + C, +\end{aligned}$$ + +where $p_x$ is a density in $z$ with + +$$\log p_x(z) = \mathbb E_{q_2(\theta)} \log p(x, z, \theta) + C.$$ + +So the $q_1$ that maximises the ELBO is $q_1^* = p_x$. + +Similarly, the optimal $q_2$ is such that + +$$\log q_2^*(\theta) = \mathbb E_{q_1(z)} \log p(x, z, \theta) + C.$$ + +The fully Bayesian EM thus alternately evaluates $q_1^*$ (E-step) and +$q_2^*$ (M-step). + +It is also called mean field approximation (MFA), and can be easily +generalised to models with more than two groups of latent variables, see +e.g. Section 10.1 of Bishop 2006. + +*** Application to mixture models + :PROPERTIES: + :CUSTOM_ID: application-to-mixture-models + :END: +*Definition (Fully Bayesian mixture model)*. The relations between +$\pi$, $\eta$, $x$, $z$ are the same as in the definition of mixture +models. Furthermore, we assume the distribution of $(x | \eta_k)$ +belongs to the +[[https://en.wikipedia.org/wiki/Exponential_family][exponential family]] +(the definition of the exponential family is briefly touched at the end +of this section). But now both $\pi$ and $\eta$ are random variables. +Let the prior distribution $p(\pi)$ is Dirichlet with parameter +$(\alpha, \alpha, ..., \alpha)$. Let the prior $p(\eta_k)$ be the +conjugate prior of $(x | \eta_k)$, with parameter $\beta$, we will see +later in this section that the posterior $q(\eta_k)$ belongs to the same +family as $p(\eta_k)$. Represented in a plate notations, a fully +Bayesian mixture model looks like: + +[[/assets/resources/fully-bayesian-mm.png]] + +Given this structure we can write down the mean-field approximation: + +$$\log q(z) = \mathbb E_{q(\eta)q(\pi)} (\log(x | z, \eta) + \log(z | \pi)) + C.$$ + +Both sides can be factored into per-sample expressions, giving us + +$$\log q(z_i) = \mathbb E_{q(\eta)} \log p(x_i | z_i, \eta) + \mathbb E_{q(\pi)} \log p(z_i | \pi) + C$$ + +Therefore + +$$\log r_{ik} = \log q(z_i = k) = \mathbb E_{q(\eta_k)} \log p(x_i | \eta_k) + \mathbb E_{q(\pi)} \log \pi_k + C. \qquad (9.1)$$ + +So the posterior of each $z_i$ is categorical regardless of the $p$s and +$q$s. + +Computing the posterior of $\pi$ and $\eta$: + +$$\log q(\pi) + \log q(\eta) = \log p(\pi) + \log p(\eta) + \sum_i \mathbb E_{q(z_i)} p(x_i | z_i, \eta) + \sum_i \mathbb E_{q(z_i)} p(z_i | \pi) + C.$$ + +So we can separate the terms involving $\pi$ and those involving $\eta$. +First compute the posterior of $\pi$: + +$$\log q(\pi) = \log p(\pi) + \sum_i \mathbb E_{q(z_i)} \log p(z_i | \pi) = \log p(\pi) + \sum_i \sum_k r_{ik} \log \pi_k + C.$$ + +The right hand side is the log of a Dirichlet modulus the constant $C$, +from which we can update the posterior parameter $\phi^\pi$: + +$$\phi^\pi_k = \alpha + \sum_i r_{ik}. \qquad (9.3)$$ + +Similarly we can obtain the posterior of $\eta$: + +$$\log q(\eta) = \log p(\eta) + \sum_i \sum_k r_{ik} \log p(x_i | \eta_k) + C.$$ + +Again we can factor the terms with respect to $k$ and get: + +$$\log q(\eta_k) = \log p(\eta_k) + \sum_i r_{ik} \log p(x_i | \eta_k) + C. \qquad (9.5)$$ + +Here we can see why conjugate prior works. Mathematically, given a +probability distribution $p(x | \theta)$, the distribution $p(\theta)$ +is called conjugate prior of $p(x | \theta)$ if +$\log p(\theta) + \log p(x | \theta)$ has the same form as +$\log p(\theta)$. + +For example, the conjugate prior for the exponential family +$p(x | \theta) = h(x) \exp(\theta \cdot T(x) - A(\theta))$ where $T$, +$A$ and $h$ are some functions is +$p(\theta; \chi, \nu) \propto \exp(\chi \cdot \theta - \nu A(\theta))$. + +Here what we want is a bit different from conjugate priors because of +the coefficients $r_{ik}$. But the computation carries over to the +conjugate priors of the exponential family (try it yourself and you'll +see). That is, if $p(x_i | \eta_k)$ belongs to the exponential family + +$$p(x_i | \eta_k) = h(x) \exp(\eta_k \cdot T(x) - A(\eta_k))$$ + +and if $p(\eta_k)$ is the conjugate prior of $p(x_i | \eta_k)$ + +$$p(\eta_k) \propto \exp(\chi \cdot \eta_k - \nu A(\eta_k))$$ + +then $q(\eta_k)$ has the same form as $p(\eta_k)$, and from (9.5) we can +compute the updates of $\phi^{\eta_k}$: + +$$\begin{aligned} +\phi^{\eta_k}_1 &= \chi + \sum_i r_{ik} T(x_i), \qquad (9.7) \\ +\phi^{\eta_k}_2 &= \nu + \sum_i r_{ik}. \qquad (9.9) +\end{aligned}$$ + +So the mean field approximation for the fully Bayesian mixture model is +the alternate iteration of (9.1) (E-step) and (9.3)(9.7)(9.9) (M-step) +until convergence. + +*** Fully Bayesian GMM + :PROPERTIES: + :CUSTOM_ID: fully-bayesian-gmm + :END: +A typical example of fully Bayesian mixture models is the fully Bayesian +Gaussian mixture model (Attias 2000, also called variational GMM in the +literature). It is defined by applying the same modification to GMM as +the difference between Fully Bayesian mixture model and the vanilla +mixture model. + +More specifically: + +- $p(z_{i}) = \text{Cat}(\pi)$ as in vanilla GMM +- $p(\pi) = \text{Dir}(\alpha, \alpha, ..., \alpha)$ has Dirichlet + distribution, the conjugate prior to the parameters of the categorical + distribution. +- $p(x_i | z_i = k) = p(x_i | \eta_k) = N(\mu_{k}, \Sigma_{k})$ as in + vanilla GMM +- $p(\mu_k, \Sigma_k) = \text{NIW} (\mu_0, \lambda, \Psi, \nu)$ is the + normal-inverse-Wishart distribution, the conjugate prior to the mean + and covariance matrix of the Gaussian distribution. + +The E-step and M-step can be computed using (9.1) and (9.3)(9.7)(9.9) in +the previous section. The details of the computation can be found in +Chapter 10.2 of Bishop 2006 or Attias 2000. + +*** LDA + :PROPERTIES: + :CUSTOM_ID: lda + :END: +As the second example of fully Bayesian mixture models, Latent Dirichlet +allocation (LDA) (Blei-Ng-Jordan 2003) is the fully Bayesian version of +pLSA2, with the following plate notations: + +[[/assets/resources/lda.png]] + +It is the smoothed version in the paper. + +More specifically, on the basis of pLSA2, we add prior distributions to +$\eta$ and $\pi$: + +$$\begin{aligned} +p(\eta_k) &= \text{Dir} (\beta, ..., \beta), \qquad k = 1 : n_z \\ +p(\pi_\ell) &= \text{Dir} (\alpha, ..., \alpha), \qquad \ell = 1 : n_d \\ +\end{aligned}$$ + +And as before, the prior of $z$ is + +$$p(z_{\ell, i}) = \text{Cat} (\pi_\ell), \qquad \ell = 1 : n_d, i = 1 : m$$ + +We also denote posterior distribution + +$$\begin{aligned} +q(\eta_k) &= \text{Dir} (\phi^{\eta_k}), \qquad k = 1 : n_z \\ +q(\pi_\ell) &= \text{Dir} (\phi^{\pi_\ell}), \qquad \ell = 1 : n_d \\ +p(z_{\ell, i}) &= \text{Cat} (r_{\ell, i}), \qquad \ell = 1 : n_d, i = 1 : m +\end{aligned}$$ + +As before, in E-step we update $r$, and in M-step we update $\lambda$ +and $\gamma$. + +But in the LDA paper, one treats optimisation over $r$, $\lambda$ and +$\gamma$ as a E-step, and treats $\alpha$ and $\beta$ as parameters, +which is optmised over at M-step. This makes it more akin to the +classical EM where the E-step is Bayesian and M-step MLE. This is more +complicated, and we do not consider it this way here. + +Plugging in (9.1) we obtain the updates at E-step + +$$r_{\ell i k} \propto \exp(\psi(\phi^{\pi_\ell}_k) + \psi(\phi^{\eta_k}_{x_{\ell i}}) - \psi(\sum_w \phi^{\eta_k}_w)), \qquad (10)$$ + +where $\psi$ is the digamma function. Similarly, plugging in +(9.3)(9.7)(9.9), at M-step, we update the posterior of $\pi$ and $\eta$: + +$$\begin{aligned} +\phi^{\pi_\ell}_k &= \alpha + \sum_i r_{\ell i k}. \qquad (11)\\ +%%}}$ +%%As for $\eta$, we have +%%{{$%align% +%%\log q(\eta) &= \sum_k \log p(\eta_k) + \sum_{\ell, i} \mathbb E_{q(z_{\ell i})} \log p(x_{\ell i} | z_{\ell i}, \eta) \\ +%%&= \sum_{k, j} (\sum_{\ell, i} r_{\ell i k} 1_{x_{\ell i} = j} + \beta - 1) \log \eta_{k j} +%%}}$ +%%which gives us +%%{{$ +\phi^{\eta_k}_w &= \beta + \sum_{\ell, i} r_{\ell i k} 1_{x_{\ell i} = w}. \qquad (12) +\end{aligned}$$ + +So the algorithm iterates over (10) and (11)(12) until convergence. + +*** DPMM + :PROPERTIES: + :CUSTOM_ID: dpmm + :END: +The Dirichlet process mixture model (DPMM) is like the fully Bayesian +mixture model except $n_z = \infty$, i.e. $z$ can take any positive +integer value. + +The probability of $z_i = k$ is defined using the so called +stick-breaking process: let $v_i \sim \text{Beta} (\alpha, \beta)$ be +i.i.d. random variables with Beta distributions, then + +$$\mathbb P(z_i = k | v_{1:\infty}) = (1 - v_1) (1 - v_2) ... (1 - v_{k - 1}) v_k.$$ + +So $v$ plays a similar role to $\pi$ in the previous models. + +As before, we have that the distribution of $x$ belongs to the +exponential family: + +$$p(x | z = k, \eta) = p(x | \eta_k) = h(x) \exp(\eta_k \cdot T(x) - A(\eta_k))$$ + +so the prior of $\eta_k$ is + +$$p(\eta_k) \propto \exp(\chi \cdot \eta_k - \nu A(\eta_k)).$$ + +Because of the infinities we can't directly apply the formulas in the +general fully Bayesian mixture models. So let us carefully derive the +whole thing again. + +As before, we can write down the ELBO: + +$$L(p(x, z, \theta), q(z, \theta)) = \mathbb E_{q(\theta)} \log {p(\theta) \over q(\theta)} + \mathbb E_{q(\theta) q(z)} \log {p(x, z | \theta) \over q(z)}.$$ + +Both terms are infinite series: + +$$L(p, q) = \sum_{k = 1 : \infty} \mathbb E_{q(\theta_k)} \log {p(\theta_k) \over q(\theta_k)} + \sum_{i = 1 : m} \sum_{k = 1 : \infty} q(z_i = k) \mathbb E_{q(\theta)} \log {p(x_i, z_i = k | \theta) \over q(z_i = k)}.$$ + +There are several ways to deal with the infinities. One is to fix some +level $T > 0$ and set $v_T = 1$ almost surely (Blei-Jordan 2006). This +effectively turns the model into a finite one, and both terms become +finite sums over $k = 1 : T$. + +Another walkaround (Kurihara-Welling-Vlassis 2007) is also a kind of +truncation, but less heavy-handed: setting the posterior +$q(\theta) = q(\eta) q(v)$ to be: + +$$q(\theta) = q(\theta_{1 : T}) p(\theta_{T + 1 : \infty}) =: q(\theta_{\le T}) p(\theta_{> T}).$$ + +That is, tie the posterior after $T$ to the prior. This effectively +turns the first term in the ELBO to a finite sum over $k = 1 : T$, while +keeping the second sum an infinite series: + +$$L(p, q) = \sum_{k = 1 : T} \mathbb E_{q(\theta_k)} \log {p(\theta_k) \over q(\theta_k)} + \sum_i \sum_{k = 1 : \infty} q(z_i = k) \mathbb E_{q(\theta)} \log {p(x_i, z_i = k | \theta) \over q(z_i = k)}. \qquad (13)$$ + +The plate notation of this model is: + +[[/assets/resources/dpmm.png]] + +As it turns out, the infinities can be tamed in this case. + +As before, the optimal $q(z_i)$ is computed as + +$$r_{ik} = q(z_i = k) = s_{ik} / S_i$$ + +where + +$$\begin{aligned} +s_{ik} &= \exp(\mathbb E_{q(\theta)} \log p(x_i, z_i = k | \theta)) \\ +S_i &= \sum_{k = 1 : \infty} s_{ik}. +\end{aligned}$$ + +Plugging this back to (13) we have + +$$\begin{aligned} +\sum_{k = 1 : \infty} r_{ik} &\mathbb E_{q(\theta)} \log {p(x_i, z_i = k | \theta) \over r_{ik}} \\ +&= \sum_{k = 1 : \infty} r_{ik} \mathbb E_{q(\theta)} (\log p(x_i, z_i = k | \theta) - \mathbb E_{q(\theta)} \log p(x_i, z_i = k | \theta) + \log S_i) = \log S_i. +\end{aligned}$$ + +So it all rests upon $S_i$ being finite. + +For $k \le T + 1$, we compute the quantity $s_{ik}$ directly. For +$k > T$, it is not hard to show that + +$$s_{ik} = s_{i, T + 1} \exp((k - T - 1) \mathbb E_{p(w)} \log (1 - w)),$$ + +where $w$ is a random variable with same distribution as $p(v_k)$, i.e. +$\text{Beta}(\alpha, \beta)$. + +Hence + +$$S_i = \sum_{k = 1 : T} s_{ik} + {s_{i, T + 1} \over 1 - \exp(\psi(\beta) - \psi(\alpha + \beta))}$$ + +is indeed finite. Similarly we also obtain + +$$q(z_i > k) = S^{-1} \left(\sum_{\ell = k + 1 : T} s_\ell + {s_{i, T + 1} \over 1 - \exp(\psi(\beta) - \psi(\alpha + \beta))}\right), k \le T \qquad (14)$$ + +Now let us compute the posterior of $\theta_{\le T}$. In the following +we exchange the integrals without justifying them (c.f. Fubini's +Theorem). Equation (13) can be rewritten as + +$$L(p, q) = \mathbb E_{q(\theta_{\le T})} \left(\log p(\theta_{\le T}) + \sum_i \mathbb E_{q(z_i) p(\theta_{> T})} \log {p(x_i, z_i | \theta) \over q(z_i)} - \log q(\theta_{\le T})\right).$$ + +Note that unlike the derivation of the mean-field approximation, we keep +the $- \mathbb E_{q(z)} \log q(z)$ term even though we are only +interested in $\theta$ at this stage. This is again due to the problem +of infinities: as in the computation of $S$, we would like to cancel out +some undesirable unbounded terms using $q(z)$. We now have + +$$\log q(\theta_{\le T}) = \log p(\theta_{\le T}) + \sum_i \mathbb E_{q(z_i) p(\theta_{> T})} \log {p(x_i, z_i | \theta) \over q(z_i)} + C.$$ + +By plugging in $q(z = k)$ we obtain + +$$\log q(\theta_{\le T}) = \log p(\theta_{\le T}) + \sum_{k = 1 : \infty} q(z_i = k) \left(\mathbb E_{p(\theta_{> T})} \log {p(x_i, z_i = k | \theta) \over q(z_i = k)} - \mathbb E_{q(\theta)} \log {p(x_i, z_i = k | \theta) \over q(z_i = k)}\right) + C.$$ + +Again, we separate the $v_k$'s and the $\eta_k$'s to obtain + +$$q(v_{\le T}) = \log p(v_{\le T}) + \sum_i \sum_k q(z_i = k) \left(\mathbb E_{p(v_{> T})} \log p(z_i = k | v) - \mathbb E_{q(v)} \log p (z_i = k | v)\right).$$ + +Denote by $D_k$ the difference between the two expetations on the right +hand side. It is easy to show that + +$$D_k = \begin{cases} +\log(1 - v_1) ... (1 - v_{k - 1}) v_k - \mathbb E_{q(v)} \log (1 - v_1) ... (1 - v_{k - 1}) v_k & k \le T\\ +\log(1 - v_1) ... (1 - v_T) - \mathbb E_{q(v)} \log (1 - v_1) ... (1 - v_T) & k > T +\end{cases}$$ + +so $D_k$ is bounded. With this we can derive the update for +$\phi^{v, 1}$ and $\phi^{v, 2}$: + +$$\begin{aligned} +\phi^{v, 1}_k &= \alpha + \sum_i q(z_i = k) \\ +\phi^{v, 2}_k &= \beta + \sum_i q(z_i > k), +\end{aligned}$$ + +where $q(z_i > k)$ can be computed as in (14). + +When it comes to $\eta$, we have + +$$\log q(\eta_{\le T}) = \log p(\eta_{\le T}) + \sum_i \sum_{k = 1 : \infty} q(z_i = k) (\mathbb E_{p(\eta_k)} \log p(x_i | \eta_k) - \mathbb E_{q(\eta_k)} \log p(x_i | \eta_k)).$$ + +Since $q(\eta_k) = p(\eta_k)$ for $k > T$, the inner sum on the right +hand side is a finite sum over $k = 1 : T$. By factorising +$q(\eta_{\le T})$ and $p(\eta_{\le T})$, we have + +$$\log q(\eta_k) = \log p(\eta_k) + \sum_i q(z_i = k) \log (x_i | \eta_k) + C,$$ + +which gives us + +$$\begin{aligned} +\phi^{\eta, 1}_k &= \xi + \sum_i q(z_i = k) T(x_i) \\ +\phi^{\eta, 2}_k &= \nu + \sum_i q(z_i = k). +\end{aligned}$$ + +** SVI + :PROPERTIES: + :CUSTOM_ID: svi + :END: +In variational inference, the computation of some parameters are more +expensive than others. + +For example, the computation of M-step is often much more expensive than +that of E-step: + +- In the vanilla mixture models with the EM algorithm, the update of + $\theta$ requires the computation of $r_{ik}$ for all $i = 1 : m$, see + Eq (2.3). +- In the fully Bayesian mixture model with mean field approximation, the + updates of $\phi^\pi$ and $\phi^\eta$ require the computation of a sum + over all samples (see Eq (9.3)(9.7)(9.9)). + +Similarly, in pLSA2 (resp. LDA), the updates of $\eta_k$ (resp. +$\phi^{\eta_k}$) requires a sum over $\ell = 1 : n_d$, whereas the +updates of other parameters do not. + +In these cases, the parameter that requires more computations are called +global and the other ones local. + +Stochastic variational inference (SVI, Hoffman-Blei-Wang-Paisley 2012) +addresses this problem in the same way as stochastic gradient descent +improves efficiency of gradient descent. + +Each time SVI picks a sample, updates the corresponding local +parameters, and computes the update of the global parameters as if all +the $m$ samples are identical to the picked sample. Finally it +incorporates this global parameter value into previous computations of +the global parameters, by means of an exponential moving average. + +As an example, here's SVI applied to LDA: + +1. Set $t = 1$. + +2. Pick $\ell$ uniformly from $\{1, 2, ..., n_d\}$. + +3. Repeat until convergence: + + 1. Compute $(r_{\ell i k})_{i = 1 : m, k = 1 : n_z}$ using (10). + 2. Compute $(\phi^{\pi_\ell}_k)_{k = 1 : n_z}$ using (11). + +4. Compute $(\tilde \phi^{\eta_k}_w)_{k = 1 : n_z, w = 1 : n_x}$ using + the following formula (compare with (12)) + $$\tilde \phi^{\eta_k}_w = \beta + n_d \sum_{i} r_{\ell i k} 1_{x_{\ell i} = w}$$ + +5. Update the exponential moving average + $(\phi^{\eta_k}_w)_{k = 1 : n_z, w = 1 : n_x}$: + $$\phi^{\eta_k}_w = (1 - \rho_t) \phi^{\eta_k}_w + \rho_t \tilde \phi^{\eta_k}_w$$ + +6. Increment $t$ and go back to Step 2. + +In the original paper, $\rho_t$ needs to satisfy some conditions that +guarantees convergence of the global parameters: + +$$\begin{aligned} +\sum_t \rho_t = \infty \\ +\sum_t \rho_t^2 < \infty +\end{aligned}$$ + +and the choice made there is + +$$\rho_t = (t + \tau)^{-\kappa}$$ + +for some $\kappa \in (.5, 1]$ and $\tau \ge 0$. + +** AEVB + :PROPERTIES: + :CUSTOM_ID: aevb + :END: +SVI adds to variational inference stochastic updates similar to +stochastic gradient descent. Why not just use neural networks with +stochastic gradient descent while we are at it? Autoencoding variational +Bayes (AEVB) (Kingma-Welling 2013) is such an algorithm. + +Let's look back to the original problem of maximising the ELBO: + +$$\max_{\theta, q} \sum_{i = 1 : m} L(p(x_i | z_i; \theta) p(z_i; \theta), q(z_i))$$ + +Since for any given $\theta$, the optimal $q(z_i)$ is the posterior +$p(z_i | x_i; \theta)$, the problem reduces to + +$$\max_{\theta} \sum_i L(p(x_i | z_i; \theta) p(z_i; \theta), p(z_i | x_i; \theta))$$ + +Let us assume $p(z_i; \theta) = p(z_i)$ is independent of $\theta$ to +simplify the problem. In the old mixture models, we have +$p(x_i | z_i; \theta) = p(x_i; \eta_{z_i})$, which we can generalise to +$p(x_i; f(\theta, z_i))$ for some function $f$. Using Beyes' theorem we +can also write down $p(z_i | x_i; \theta) = q(z_i; g(\theta, x_i))$ for +some function $g$. So the problem becomes + +$$\max_{\theta} \sum_i L(p(x_i; f(\theta, z_i)) p(z_i), q(z_i; g(\theta, x_i)))$$ + +In some cases $g$ can be hard to write down or compute. AEVB addresses +this problem by replacing $g(\theta, x_i)$ with a neural network +$g_\phi(x_i)$ with input $x_i$ and some separate parameters $\phi$. It +also replaces $f(\theta, z_i)$ with a neural network $f_\theta(z_i)$ +with input $z_i$ and parameters $\theta$. And now the problem becomes + +$$\max_{\theta, \phi} \sum_i L(p(x_i; f_\theta(z_i)) p(z_i), q(z_i; g_\phi(x_i))).$$ + +The objective function can be written as + +$$\sum_i \mathbb E_{q(z_i; g_\phi(x_i))} \log p(x_i; f_\theta(z_i)) - D(q(z_i; g_\phi(x_i)) || p(z_i)).$$ + +The first term is called the negative reconstruction error, like the +$- \|decoder(encoder(x)) - x\|$ in autoencoders, which is where the +"autoencoder" in the name comes from. + +The second term is a regularisation term that penalises the posterior +$q(z_i)$ that is very different from the prior $p(z_i)$. We assume this +term can be computed analytically. + +So only the first term requires computing. + +We can approximate the sum over $i$ in a similar fashion as SVI: pick +$j$ uniformly randomly from $\{1 ... m\}$ and treat the whole dataset as +$m$ replicates of $x_j$, and approximate the expectation using +Monte-Carlo: + +$$U(x_i, \theta, \phi) := \sum_i \mathbb E_{q(z_i; g_\phi(x_i))} \log p(x_i; f_\theta(z_i)) \approx m \mathbb E_{q(z_j; g_\phi(x_j))} \log p(x_j; f_\theta(z_j)) \approx {m \over L} \sum_{\ell = 1}^L \log p(x_j; f_\theta(z_{j, \ell})),$$ + +where each $z_{j, \ell}$ is sampled from $q(z_j; g_\phi(x_j))$. + +But then it is not easy to approximate the gradient over $\phi$. One can +use the log trick as in policy gradients, but it has the problem of high +variance. In policy gradients this is overcome by using baseline +subtractions. In the AEVB paper it is tackled with the +reparameterisation trick. + +Assume there exists a transformation $T_\phi$ and a random variable +$\epsilon$ with distribution independent of $\phi$ or $\theta$, such +that $T_\phi(x_i, \epsilon)$ has distribution $q(z_i; g_\phi(x_i))$. In +this case we can rewrite $U(x, \phi, \theta)$ as + +$$\sum_i \mathbb E_{\epsilon \sim p(\epsilon)} \log p(x_i; f_\theta(T_\phi(x_i, \epsilon))),$$ + +This way one can use Monte-Carlo to approximate +$\nabla_\phi U(x, \phi, \theta)$: + +$$\nabla_\phi U(x, \phi, \theta) \approx {m \over L} \sum_{\ell = 1 : L} \nabla_\phi \log p(x_j; f_\theta(T_\phi(x_j, \epsilon_\ell))),$$ + +where each $\epsilon_{\ell}$ is sampled from $p(\epsilon)$. The +approximation of $U(x, \phi, \theta)$ itself can be done similarly. + +*** VAE + :PROPERTIES: + :CUSTOM_ID: vae + :END: +As an example of AEVB, the paper introduces variational autoencoder +(VAE), with the following instantiations: + +- The prior $p(z_i) = N(0, I)$ is standard normal, thus independent of + $\theta$. +- The distribution $p(x_i; \eta)$ is either Gaussian or categorical. +- The distribution $q(z_i; \mu, \Sigma)$ is Gaussian with diagonal + covariance matrix. So + $g_\phi(z_i) = (\mu_\phi(x_i), \text{diag}(\sigma^2_\phi(x_i)_{1 : d}))$. + Thus in the reparameterisation trick $\epsilon \sim N(0, I)$ and + $T_\phi(x_i, \epsilon) = \epsilon \odot \sigma_\phi(x_i) + \mu_\phi(x_i)$, + where $\odot$ is elementwise multiplication. +- The KL divergence can be easily computed analytically as + $- D(q(z_i; g_\phi(x_i)) || p(z_i)) = {d \over 2} + \sum_{j = 1 : d} \log\sigma_\phi(x_i)_j - {1 \over 2} \sum_{j = 1 : d} (\mu_\phi(x_i)_j^2 + \sigma_\phi(x_i)_j^2)$. + +With this, one can use backprop to maximise the ELBO. + +*** Fully Bayesian AEVB + :PROPERTIES: + :CUSTOM_ID: fully-bayesian-aevb + :END: +Let us turn to fully Bayesian version of AEVB. Again, we first recall +the ELBO of the fully Bayesian mixture models: + +$$L(p(x, z, \pi, \eta; \alpha, \beta), q(z, \pi, \eta; r, \phi)) = L(p(x | z, \eta) p(z | \pi) p(\pi; \alpha) p(\eta; \beta), q(z; r) q(\eta; \phi^\eta) q(\pi; \phi^\pi)).$$ + +We write $\theta = (\pi, \eta)$, rewrite $\alpha := (\alpha, \beta)$, +$\phi := r$, and $\gamma := (\phi^\eta, \phi^\pi)$. Furthermore, as in +the half-Bayesian version we assume $p(z | \theta) = p(z)$, i.e. $z$ +does not depend on $\theta$. Similarly we also assume +$p(\theta; \alpha) = p(\theta)$. Now we have + +$$L(p(x, z, \theta; \alpha), q(z, \theta; \phi, \gamma)) = L(p(x | z, \theta) p(z) p(\theta), q(z; \phi) q(\theta; \gamma)).$$ + +And the objective is to maximise it over $\phi$ and $\gamma$. We no +longer maximise over $\theta$, because it is now a random variable, like +$z$. Now let us transform it to a neural network model, as in the +half-Bayesian case: + +$$L\left(\left(\prod_{i = 1 : m} p(x_i; f_\theta(z_i))\right) \left(\prod_{i = 1 : m} p(z_i) \right) p(\theta), \left(\prod_{i = 1 : m} q(z_i; g_\phi(x_i))\right) q(\theta; h_\gamma(x))\right).$$ + +where $f_\theta$, $g_\phi$ and $h_\gamma$ are neural networks. Again, by +separating out KL-divergence terms, the above formula becomes + +$$\sum_i \mathbb E_{q(\theta; h_\gamma(x))q(z_i; g_\phi(x_i))} \log p(x_i; f_\theta(z_i)) - \sum_i D(q(z_i; g_\phi(x_i)) || p(z_i)) - D(q(\theta; h_\gamma(x)) || p(\theta)).$$ + +Again, we assume the latter two terms can be computed analytically. +Using reparameterisation trick, we write + +$$\begin{aligned} +\theta &= R_\gamma(\zeta, x) \\ +z_i &= T_\phi(\epsilon, x_i) +\end{aligned}$$ + +for some transformations $R_\gamma$ and $T_\phi$ and random variables +$\zeta$ and $\epsilon$ so that the output has the desired distributions. + +Then the first term can be written as + +$$\mathbb E_{\zeta, \epsilon} \log p(x_i; f_{R_\gamma(\zeta, x)} (T_\phi(\epsilon, x_i))),$$ + +so that the gradients can be computed accordingly. + +Again, one may use Monte-Carlo to approximate this expetation. + +** References + :PROPERTIES: + :CUSTOM_ID: references + :END: + +- Attias, Hagai. "A variational baysian framework for graphical models." + In Advances in neural information processing systems, pp. + 209-215. 2000. +- Bishop, Christopher M. Neural networks for pattern recognition. + Springer. 2006. +- Blei, David M., and Michael I. Jordan. "Variational Inference for + Dirichlet Process Mixtures." Bayesian Analysis 1, no. 1 (March 2006): + 121--43. [[https://doi.org/10.1214/06-BA104]]. +- Blei, David M., Andrew Y. Ng, and Michael I. Jordan. "Latent Dirichlet + Allocation." Journal of Machine Learning Research 3, no. Jan (2003): + 993--1022. +- Hofmann, Thomas. "Latent Semantic Models for Collaborative Filtering." + ACM Transactions on Information Systems 22, no. 1 (January 1, 2004): + 89--115. [[https://doi.org/10.1145/963770.963774]]. +- Hofmann, Thomas. "Learning the similarity of documents: An + information-geometric approach to document retrieval and + categorization." In Advances in neural information processing systems, + pp. 914-920. 2000. +- Hoffman, Matt, David M. Blei, Chong Wang, and John Paisley. + "Stochastic Variational Inference." ArXiv:1206.7051 [Cs, Stat], June + 29, 2012. [[http://arxiv.org/abs/1206.7051]]. +- Kingma, Diederik P., and Max Welling. "Auto-Encoding Variational + Bayes." ArXiv:1312.6114 [Cs, Stat], December 20, 2013. + [[http://arxiv.org/abs/1312.6114]]. +- Kurihara, Kenichi, Max Welling, and Nikos Vlassis. "Accelerated + variational Dirichlet process mixtures." In Advances in neural + information processing systems, pp. 761-768. 2007. +- Sudderth, Erik Blaine. "Graphical models for visual object recognition + and tracking." PhD diss., Massachusetts Institute of Technology, 2006. diff --git a/posts/2019-03-13-a-tail-of-two-densities.org b/posts/2019-03-13-a-tail-of-two-densities.org new file mode 100644 index 0000000..783e0c5 --- /dev/null +++ b/posts/2019-03-13-a-tail-of-two-densities.org @@ -0,0 +1,1304 @@ +#+title: A Tail of Two Densities + +#+date: <2019-03-13> + +This is Part 1 of a two-part post where I give an introduction to the +mathematics of differential privacy. + +Practically speaking, +[[https://en.wikipedia.org/wiki/Differential_privacy][differential +privacy]] is a technique of perturbing database queries so that query +results do not leak too much information while still being relatively +accurate. + +This post however focuses on the mathematical aspects of differential +privacy, which is a study of +[[https://en.wikipedia.org/wiki/Concentration_inequality][tail bounds]] +of the divergence between two probability measures, with the end goal of +applying it to +[[https://en.wikipedia.org/wiki/Stochastic_gradient_descent][stochastic +gradient descent]]. This post should be suitable for anyone familiar +with probability theory. + +I start with the definition of \(\epsilon\)-differential privacy +(corresponding to max divergence), followed by +\((\epsilon, \delta)\)-differential privacy (a.k.a. approximate +differential privacy, corresponding to the \(\delta\)-approximate max +divergence). I show a characterisation of the +\((\epsilon, \delta)\)-differential privacy as conditioned +\(\epsilon\)-differential privacy. Also, as examples, I illustrate the +\(\epsilon\)-dp with Laplace mechanism and, using some common tail bounds, +the approximate dp with the Gaussian mechanism. + +Then I continue to show the effect of combinatorial and sequential +compositions of randomised queries (called mechanisms) on privacy by +stating and proving the composition theorems for differential privacy, +as well as the effect of mixing mechanisms, by presenting the +subsampling theorem (a.k.a. amplification theorem). + +In [[/posts/2019-03-14-great-but-manageable-expectations.html][Part 2]], +I discuss the Rényi differential privacy, corresponding to the Rényi +divergence, a study of the +[[https://en.wikipedia.org/wiki/Moment-generating_function][moment +generating functions]] of the divergence between probability measures to +derive the tail bounds. + +Like in Part 1, I prove a composition theorem and a subsampling theorem. + +I also attempt to reproduce a seemingly better moment bound for the +Gaussian mechanism with subsampling, with one intermediate step which I +am not able to prove. + +After that I explain the Tensorflow implementation of differential +privacy in its +[[https://github.com/tensorflow/privacy/tree/master/privacy][Privacy]] +module, which focuses on the differentially private stochastic gradient +descent algorithm (DP-SGD). + +Finally I use the results from both Part 1 and Part 2 to obtain some +privacy guarantees for composed subsampling queries in general, and for +DP-SGD in particular. I also compare these privacy guarantees. + +*Acknowledgement*. I would like to thank +[[http://stockholm.ai][Stockholm AI]] for introducing me to the subject +of differential privacy. Thanks to Amir Hossein Rahnama for hosting the +discussions at Stockholm AI. Thanks to (in chronological order) Reynaldo +Boulogne, Martin Abedi, Ilya Mironov, Kurt Johansson, Mark Bun, Salil +Vadhan, Jonathan Ullman, Yuanyuan Xu and Yiting Li for communication and +discussions. Also thanks to the +[[https://www.reddit.com/r/MachineLearning/][r/MachineLearning]] +community for comments and suggestions which result in improvement of +readability of this post. The research was done while working at +[[https://www.kth.se/en/sci/institutioner/math][KTH Department of +Mathematics]]. + +/If you are confused by any notations, ask me or try +[[/notations.html][this]]. This post (including both Part 1 and Part2) +is licensed under [[https://creativecommons.org/licenses/by-sa/4.0/][CC +BY-SA]] and [[https://www.gnu.org/licenses/fdl.html][GNU FDL]]./ + +** The gist of differential privacy + :PROPERTIES: + :CUSTOM_ID: the-gist-of-differential-privacy + :END: +If you only have one minute, here is what differential privacy is about: + +Let \(p\) and \(q\) be two probability densities, we define the /divergence +variable/[fn:1] of \((p, q)\) to be + +\[L(p || q) := \log {p(\xi) \over q(\xi)}\] + +where \(\xi\) is a random variable distributed according to \(p\). + +Roughly speaking, differential privacy is the study of the tail bound of +\(L(p || q)\): for certain \(p\)s and \(q\)s, and for \(\epsilon > 0\), find +\(\delta(\epsilon)\) such that + +\[\mathbb P(L(p || q) > \epsilon) < \delta(\epsilon),\] + +where \(p\) and \(q\) are the laws of the outputs of a randomised functions +on two very similar inputs. Moreover, to make matters even simpler, only +three situations need to be considered: + +1. (General case) \(q\) is in the form of \(q(y) = p(y + \Delta)\) for some + bounded constant \(\Delta\). +2. (Compositions) \(p\) and \(q\) are combinatorial or sequential + compositions of some simpler \(p_i\)'s and \(q_i\)'s respectively +3. (Subsampling) \(p\) and \(q\) are mixtures / averages of some simpler + \(p_i\)'s and \(q_i\)'s respectively + +In applications, the inputs are databases and the randomised functions +are queries with an added noise, and the tail bounds give privacy +guarantees. When it comes to gradient descent, the input is the training +dataset, and the query updates the parameters, and privacy is achieved +by adding noise to the gradients. + +Now if you have an hour... + +** \(\epsilon\)-dp + :PROPERTIES: + :CUSTOM_ID: epsilon-dp + :END: +*Definition (Mechanisms)*. Let \(X\) be a space with a metric +\(d: X \times X \to \mathbb N\). A /mechanism/ \(M\) is a function that +takes \(x \in X\) as input and outputs a random variable on \(Y\). + +In this post, \(X = Z^m\) is the space of datasets of \(m\) rows for some +integer \(m\), where each item resides in some space \(Z\). In this case the +distance \(d(x, x') := \#\{i: x_i \neq x'_i\}\) is the number of rows that +differ between \(x\) and \(x'\). + +Normally we have a query \(f: X \to Y\), and construct the mechanism \(M\) +from \(f\) by adding a noise: + +\[M(x) := f(x) + \text{noise}.\] + +Later, we will also consider mechanisms constructed from composition or +mixture of other mechanisms. + +In this post \(Y = \mathbb R^d\) for some \(d\). + +*Definition (Sensitivity)*. Let \(f: X \to \mathbb R^d\) be a function. +The /sensitivity/ \(S_f\) of \(f\) is defined as + +\[S_f := \sup_{x, x' \in X: d(x, x') = 1} \|f(x) - f(x')\|_2,\] + +where \(\|y\|_2 = \sqrt{y_1^2 + ... + y_d^2}\) is the \(\ell^2\)-norm. + +*Definition (Differential Privacy)*. A mechanism \(M\) is called +\(\epsilon\)/-differential privacy/ (\(\epsilon\)-dp) if it satisfies the +following condition: for all \(x, x' \in X\) with \(d(x, x') = 1\), and for +all measureable set \(S \subset \mathbb R^n\), + +\[\mathbb P(M(x) \in S) \le e^\epsilon P(M(x') \in S). \qquad (1)\] + +Practically speaking, this means given the results from perturbed query +on two known databases that differs by one row, it is hard to determine +which result is from which database. + +An example of \(\epsilon\)-dp mechanism is the Laplace mechanism. + +*Definition*. The /Laplace distribution/ over \(\mathbb R\) with parameter +\(b > 0\) has probability density function + +\[f_{\text{Lap}(b)}(x) = {1 \over 2 b} e^{- {|x| \over b}}.\] + +*Definition*. Let \(d = 1\). The /Laplace mechanism/ is defined by + +\[M(x) = f(x) + \text{Lap}(b).\] + +*Claim*. The Laplace mechanism with + +\[b \ge \epsilon^{-1} S_f \qquad (1.5)\] + +is \(\epsilon\)-dp. + +*Proof*. Quite straightforward. Let \(p\) and \(q\) be the laws of \(M(x)\) +and \(M(x')\) respectively. + +\[{p (y) \over q (y)} = {f_{\text{Lap}(b)} (y - f(x)) \over f_{\text{Lap}(b)} (y - f(x'))} = \exp(b^{-1} (|y - f(x')| - |y - f(x)|))\] + +Using triangular inequality \(|A| - |B| \le |A - B|\) on the right hand +side, we have + +\[{p (y) \over q (y)} \le \exp(b^{-1} (|f(x) - f(x')|)) \le \exp(\epsilon)\] + +where in the last step we use the condition (1.5). \(\square\) + +** Approximate differential privacy + :PROPERTIES: + :CUSTOM_ID: approximate-differential-privacy + :END: +Unfortunately, \(\epsilon\)-dp does not apply to the most commonly used +noise, the Gaussian noise. To fix this, we need to relax the definition +a bit. + +*Definition*. A mechanism \(M\) is said to be +\((\epsilon, \delta)\)/-differentially private/ if for all \(x, x' \in X\) +with \(d(x, x') = 1\) and for all measureable \(S \subset \mathbb R^d\) + +\[\mathbb P(M(x) \in S) \le e^\epsilon P(M(x') \in S) + \delta. \qquad (2)\] + +Immediately we see that the \((\epsilon, \delta)\)-dp is meaningful only +if \(\delta < 1\). + +*** Indistinguishability + :PROPERTIES: + :CUSTOM_ID: indistinguishability + :END: +To understand \((\epsilon, \delta)\)-dp, it is helpful to study +\((\epsilon, \delta)\)-indistinguishability. + +*Definition*. Two probability measures \(p\) and \(q\) on the same space are +called \((\epsilon, \delta)\)/-ind(istinguishable)/ if for all measureable +sets \(S\): + +$$\begin{aligned} +p(S) \le e^\epsilon q(S) + \delta, \qquad (3) \\ +q(S) \le e^\epsilon p(S) + \delta. \qquad (4) +\end{aligned}$$ + +As before, we also call random variables \(\xi\) and \(\eta\) to be +\((\epsilon, \delta)\)-ind if their laws are \((\epsilon, \delta)\)-ind. +When \(\delta = 0\), we call it \(\epsilon\)-ind. + +Immediately we have + +*Claim 0*. \(M\) is \((\epsilon, \delta)\)-dp (resp. \(\epsilon\)-dp) iff +\(M(x)\) and \(M(x')\) are \((\epsilon, \delta)\)-ind (resp. \(\epsilon\)-ind) +for all \(x\) and \(x'\) with distance \(1\). + +*Definition (Divergence Variable)*. Let \(p\) and \(q\) be two probability +measures. Let \(\xi\) be a random variable distributed according to \(p\), +we define a random variable \(L(p || q)\) by + +\[L(p || q) := \log {p(\xi) \over q(\xi)},\] + +and call it the /divergence variable/ of \((p, q)\). + +One interesting and readily verifiable fact is + +\[\mathbb E L(p || q) = D(p || q)\] + +where \(D\) is the +[[https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence][KL-divergence]]. + +*Claim 1*. If + +$$\begin{aligned} +\mathbb P(L(p || q) \le \epsilon) &\ge 1 - \delta, \qquad(5) \\ +\mathbb P(L(q || p) \le \epsilon) &\ge 1 - \delta +\end{aligned}$$ + +then \(p\) and \(q\) are \((\epsilon, \delta)\)-ind. + +*Proof*. We verify (3), and (4) can be shown in the same way. Let +\(A := \{y \in Y: \log {p(y) \over q(y)} > \epsilon\}\), then by (5) we +have + +\[p(A) < \delta.\] + +So + +\[p(S) = p(S \cap A) + p(S \setminus A) \le \delta + e^\epsilon q(S \setminus A) \le \delta + e^\epsilon q(S).\] + +\(\square\) + +This Claim translates differential privacy to the tail bound of +divergence variables, and for the rest of this post all dp results are +obtained by estimating this tail bound. + +In the following we discuss the converse of Claim 1. The discussions are +rather technical, and readers can skip to the +[[#back-to-approximate-differential-privacy][next subsection]] on first +reading. + +The converse of Claim 1 is not true. + +*Claim 2*. There exists \(\epsilon, \delta > 0\), and \(p\) and \(q\) that are +\((\epsilon, \delta)\)-ind, such that + +$$\begin{aligned} +\mathbb P(L(p || q) \le \epsilon) &< 1 - \delta, \\ +\mathbb P(L(q || p) \le \epsilon) &< 1 - \delta +\end{aligned}$$ + +*Proof*. Here's a example. Let \(Y = \{0, 1\}\), and \(p(0) = q(1) = 2 / 5\) +and \(p(1) = q(0) = 3 / 5\). Then it is not hard to verify that \(p\) and +\(q\) are \((\log {4 \over 3}, {1 \over 3})\)-ind: just check (3) for all +four possible \(S \subset Y\) and (4) holds by symmetry. On the other +hand, + +\[\mathbb P(L(p || q) \le \log {4 \over 3}) = \mathbb P(L(q || p) \le \log {4 \over 3}) = {2 \over 5} < {2 \over 3}.\] + +\(\square\) + +A weaker version of the converse of Claim 1 is true +(Kasiviswanathan-Smith 2015), though: + +*Claim 3*. Let \(\alpha > 1\). If \(p\) and \(q\) are +\((\epsilon, \delta)\)-ind, then + +\[\mathbb P(L(p || q) > \alpha \epsilon) < {1 \over 1 - \exp((1 - \alpha) \epsilon)} \delta.\] + +*Proof*. Define + +\[S = \{y: p(y) > e^{\alpha \epsilon} q(y)\}.\] + +Then we have + +\[e^{\alpha \epsilon} q(S) < p(S) \le e^\epsilon q(S) + \delta,\] + +where the first inequality is due to the definition of \(S\), and the +second due to the \((\epsilon, \delta)\)-ind. Therefore + +\[q(S) \le {\delta \over e^{\alpha \epsilon} - e^\epsilon}.\] + +Using the \((\epsilon, \delta)\)-ind again we have + +\[p(S) \le e^\epsilon q(S) + \delta = {1 \over 1 - e^{(1 - \alpha) \epsilon}} \delta.\] + +\(\square\) + +This can be quite bad if \(\epsilon\) is small. + +To prove the composition theorems in the next section, we need a +condition better than that in Claim 1 so that we can go back and forth +between indistinguishability and such condition. In other words, we need +a /characterisation/ of indistinguishability. + +Let us take a careful look at the condition in Claim 1 and call it *C1*: + +*C1*. \(\mathbb P(L(p || q) \le \epsilon) \ge 1 - \delta\) and +\(\mathbb P(L(q || p) \le \epsilon) \ge 1 - \delta\) + +It is equivalent to + +*C2*. there exist events \(A, B \subset Y\) with probabilities \(p(A)\) and +\(q(B)\) at least \(1 - \delta\) such that +\(\log p(y) - \log q(y) \le \epsilon\) for all \(y \in A\) and +\(\log q(y) - \log p(y) \le \epsilon\) for all \(y \in B\). + +A similar-looking condition to *C2* is the following: + +*C3*. Let \(\Omega\) be the +[[https://en.wikipedia.org/wiki/Probability_space#Definition][underlying +probability space]]. There exist two events \(E, F \subset \Omega\) with +\(\mathbb P(E), \mathbb P(F) \ge 1 - \delta\), such that +\(|\log p_{|E}(y) - \log q_{|F}(y)| \le \epsilon\) for all \(y \in Y\). + +Here \(p_{|E}\) (resp. \(q_{|F}\)) is \(p\) (resp. \(q\)) conditioned on event +\(E\) (resp. \(F\)). + +*Remark*. Note that the events in *C2* and *C3* are in different spaces, +and therefore we can not write \(p_{|E}(S)\) as \(p(S | E)\) or \(q_{|F}(S)\) +as \(q(S | F)\). In fact, if we let \(E\) and \(F\) in *C3* be subsets of \(Y\) +with \(p(E), q(F) \ge 1 - \delta\) and assume \(p\) and \(q\) have the same +supports, then *C3* degenerates to a stronger condition than *C2*. +Indeed, in this case \(p_E(y) = p(y) 1_{y \in E}\) and +\(q_F(y) = q(y) 1_{y \in F}\), and so \(p_E(y) \le e^\epsilon q_F(y)\) +forces \(E \subset F\). We also obtain \(F \subset E\) in the same way. This +gives us \(E = F\), and *C3* becomes *C2* with \(A = B = E = F\). + +As it turns out, *C3* is the condition we need. + +*Claim 4*. Two probability measures \(p\) and \(q\) are +\((\epsilon, \delta)\)-ind if and only if *C3* holds. + +*Proof*(Murtagh-Vadhan 2018). The "if" direction is proved in the same +way as Claim 1. Without loss of generality we may assume +\(\mathbb P(E) = \mathbb P(F) \ge 1 - \delta\). To see this, suppose \(F\) +has higher probability than \(E\), then we can substitute \(F\) with a +subset of \(F\) that has the same probability as \(E\) (with possible +enlargement of the probability space). + +Let \(\xi \sim p\) and \(\eta \sim q\) be two independent random variables, +then + +$$\begin{aligned} +p(S) &= \mathbb P(\xi \in S | E) \mathbb P(E) + \mathbb P(\xi \in S; E^c) \\ +&\le e^\epsilon \mathbb P(\eta \in S | F) \mathbb P(E) + \delta \\ +&= e^\epsilon \mathbb P(\eta \in S | F) \mathbb P(F) + \delta\\ +&\le e^\epsilon q(S) + \delta. +\end{aligned}$$ + +The "only-if" direction is more involved. + +We construct events \(E\) and \(F\) by constructing functions +\(e, f: Y \to [0, \infty)\) satisfying the following conditions: + +1. \(0 \le e(y) \le p(y)\) and \(0 \le f(y) \le q(y)\) for all \(y \in Y\). +2. \(|\log e(y) - \log f(y)| \le \epsilon\) for all \(y \in Y\). +3. \(e(Y), f(Y) \ge 1 - \delta\). +4. \(e(Y) = f(Y)\). + +Here for a set \(S \subset Y\), \(e(S) := \int_S e(y) dy\), and the same +goes for \(f(S)\). + +Let \(\xi \sim p\) and \(\eta \sim q\). Then we define \(E\) and \(F\) by + +$$\mathbb P(E | \xi = y) = e(y) / p(y) \\ +\mathbb P(F | \eta = y) = f(y) / q(y).$$ + +*Remark inside proof*. This can seem a bit confusing. Intuitively, we +can think of it this way when \(Y\) is finite: Recall a random variable on +\(Y\) is a function from the probability space \(\Omega\) to \(Y\). Let event +\(G_y \subset \Omega\) be defined as \(G_y = \xi^{-1} (y)\). We cut \(G_y\) +into the disjoint union of \(E_y\) and \(G_y \setminus E_y\) such that +\(\mathbb P(E_y) = e(y)\). Then \(E = \bigcup_{y \in Y} E_y\). So \(e(y)\) can +be seen as the "density" of \(E\). + +Indeed, given \(E\) and \(F\) defined this way, we have + +\[p_E(y) = {e(y) \over e(Y)} \le {\exp(\epsilon) f(y) \over e(Y)} = {\exp(\epsilon) f(y) \over f(Y)} = \exp(\epsilon) q_F(y).\] + +and + +\[\mathbb P(E) = \int \mathbb P(E | \xi = y) p(y) dy = e(Y) \ge 1 - \delta,\] + +and the same goes for \(\mathbb P(F)\). + +What remains is to construct \(e(y)\) and \(f(y)\) satisfying the four +conditions. + +Like in the proof of Claim 1, let \(S, T \subset Y\) be defined as + +$$\begin{aligned} +S := \{y: p(y) > \exp(\epsilon) q(y)\},\\ +T := \{y: q(y) > \exp(\epsilon) p(y)\}. +\end{aligned}$$ + +Let + +$$\begin{aligned} +e(y) &:= \exp(\epsilon) q(y) 1_{y \in S} + p(y) 1_{y \notin S}\\ +f(y) &:= \exp(\epsilon) p(y) 1_{y \in T} + q(y) 1_{y \notin T}. \qquad (6) +\end{aligned}$$ + +By checking them on the three disjoint subsets \(S\), \(T\), \((S \cup T)^c\), +it is not hard to verify that the \(e(y)\) and \(f(y)\) constructed this way +satisfy the first two conditions. They also satisfy the third condition: + +$$\begin{aligned} +e(Y) &= 1 - (p(S) - \exp(\epsilon) q(S)) \ge 1 - \delta, \\ +f(Y) &= 1 - (q(T) - \exp(\epsilon) p(T)) \ge 1 - \delta. +\end{aligned}$$ + +If \(e(Y) = f(Y)\) then we are done. Otherwise, without loss of +generality, assume \(e(Y) < f(Y)\), then all it remains to do is to reduce +the value of \(f(y)\) while preserving Condition 1, 2 and 3, until +\(f(Y) = e(Y)\). + +As it turns out, this can be achieved by reducing \(f(y)\) on the set +\(\{y \in Y: q(y) > p(y)\}\). To see this, let us rename the \(f(y)\) +defined in (6) \(f_+(y)\), and construct \(f_-(y)\) by + +\[f_-(y) := p(y) 1_{y \in T} + (q(y) \wedge p(y)) 1_{y \notin T}.\] + +It is not hard to show that not only \(e(y)\) and \(f_-(y)\) also satisfy +conditions 1-3, but + +\[e(y) \ge f_-(y), \forall y \in Y,\] + +and thus \(e(Y) \ge f_-(Y)\). Therefore there exists an \(f\) that +interpolates between \(f_-\) and \(f_+\) with \(f(Y) = e(Y)\). \(\square\) + +To prove the adaptive composition theorem for approximate differential +privacy, we need a similar claim (We use index shorthand +\(\xi_{< i} = \xi_{1 : i - 1}\) and similarly for other notations): + +*Claim 5*. Let \(\xi_{1 : i}\) and \(\eta_{1 : i}\) be random variables. Let + +$$\begin{aligned} +p_i(S | y_{1 : i - 1}) := \mathbb P(\xi_i \in S | \xi_{1 : i - 1} = y_{1 : i - 1})\\ +q_i(S | y_{1 : i - 1}) := \mathbb P(\eta_i \in S | \eta_{1 : i - 1} = y_{1 : i - 1}) +\end{aligned}$$ + +be the conditional laws of \(\xi_i | \xi_{< i}\) and \(\eta_i | \eta_{< i}\) +respectively. Then the following are equivalent: + +1. For any \(y_{< i} \in Y^{i - 1}\), \(p_i(\cdot | y_{< i})\) and + \(q_i(\cdot | y_{< i})\) are \((\epsilon, \delta)\)-ind + +2. There exists events \(E_i, F_i \subset \Omega\) with + \(\mathbb P(E_i | \xi_{ Item 1: as in the Proof of Claim 4, + +$$\begin{aligned} +p_i(S | y_{< i}) &= p_{i | E_i} (S | y_{< i}) \mathbb P(E_i | \xi_{< i} = y_{< i}) + p_{i | E_i^c}(S | y_{< i}) \mathbb P(E_i^c | \xi_{< i} = y_{< i}) \\ +&\le p_{i | E_i} (S | y_{< i}) \mathbb P(E_i | \xi_{< i} = y_{< i}) + \delta \\ +&= p_{i | E_i} (S | y_{< i}) \mathbb P(F_i | \xi_{< i} = y_{< i}) + \delta \\ +&\le e^\epsilon q_{i | F_i} (S | y_{< i}) \mathbb P(F_i | \xi_{< i} = y_{< i}) + \delta \\ +&= e^\epsilon q_i (S | y_{< i}) + \delta. +\end{aligned}$$ + +The direction from +\(q_i(S | y_{< i}) \le e^\epsilon p_i(S | y_{< i}) + \delta\) can be shown +in the same way. + +Item 1 => Item 2: as in the Proof of Claim 4 we construct \(e(y_{1 : i})\) +and \(f(y_{1 : i})\) as "densities" of events \(E_i\) and \(F_i\). + +Let + +$$\begin{aligned} +e(y_{1 : i}) &:= e^\epsilon q_i(y_i | y_{< i}) 1_{y_i \in S_i(y_{< i})} + p_i(y_i | y_{< i}) 1_{y_i \notin S_i(y_{< i})}\\ +f(y_{1 : i}) &:= e^\epsilon p_i(y_i | y_{< i}) 1_{y_i \in T_i(y_{< i})} + q_i(y_i | y_{< i}) 1_{y_i \notin T_i(y_{< i})}\\ +\end{aligned}$$ + +where + +$$\begin{aligned} +S_i(y_{< i}) = \{y_i \in Y: p_i(y_i | y_{< i}) > e^\epsilon q_i(y_i | y_{< i})\}\\ +T_i(y_{< i}) = \{y_i \in Y: q_i(y_i | y_{< i}) > e^\epsilon p_i(y_i | y_{< i})\}. +\end{aligned}$$ + +Then \(E_i\) and \(F_i\) are defined as + +$$\begin{aligned} +\mathbb P(E_i | \xi_{\le i} = y_{\le i}) &= {e(y_{\le i}) \over p_i(y_{\le i})},\\ +\mathbb P(F_i | \xi_{\le i} = y_{\le i}) &= {f(y_{\le i}) \over q_i(y_{\le i})}. +\end{aligned}$$ + +The rest of the proof is almost the same as the proof of Claim 4. +\(\square\) + +*** Back to approximate differential privacy + :PROPERTIES: + :CUSTOM_ID: back-to-approximate-differential-privacy + :END: +By Claim 0 and 1 we have + +*Claim 6*. If for all \(x, x' \in X\) with distance \(1\) + +\[\mathbb P(L(M(x) || M(x')) \le \epsilon) \ge 1 - \delta,\] + +then \(M\) is \((\epsilon, \delta)\)-dp. + +Note that in the literature the divergence variable \(L(M(x) || M(x'))\) +is also called the /privacy loss/. + +By Claim 0 and Claim 4 we have + +*Claim 7*. \(M\) is \((\epsilon, \delta)\)-dp if and only if for every +\(x, x' \in X\) with distance \(1\), there exist events +\(E, F \subset \Omega\) with \(\mathbb P(E) = \mathbb P(F) \ge 1 - \delta\), +\(M(x) | E\) and \(M(x') | F\) are \(\epsilon\)-ind. + +We can further simplify the privacy loss \(L(M(x) || M(x'))\), by +observing the translational and scaling invariance of \(L(\cdot||\cdot)\): + +$$\begin{aligned} +L(\xi || \eta) &\overset{d}{=} L(\alpha \xi + \beta || \alpha \eta + \beta), \qquad \alpha \neq 0. \qquad (6.1) +\end{aligned}$$ + +With this and the definition + +\[M(x) = f(x) + \zeta\] + +for some random variable \(\zeta\), we have + +\[L(M(x) || M(x')) \overset{d}{=} L(\zeta || \zeta + f(x') - f(x)).\] + +Without loss of generality, we can consider \(f\) with sensitivity \(1\), +for + +\[L(f(x) + S_f \zeta || f(x') + S_f \zeta) \overset{d}{=} L(S_f^{-1} f(x) + \zeta || S_f^{-1} f(x') + \zeta)\] + +so for any noise \(\zeta\) that achieves \((\epsilon, \delta)\)-dp for a +function with sensitivity \(1\), we have the same privacy guarantee by for +an arbitrary function with sensitivity \(S_f\) by adding a noise +\(S_f \zeta\). + +With Claim 6 we can show that the Gaussian mechanism is approximately +differentially private. But first we need to define it. + +*Definition (Gaussian mechanism)*. Given a query \(f: X \to Y\), the +/Gaussian mechanism/ \(M\) adds a Gaussian noise to the query: + +\[M(x) = f(x) + N(0, \sigma^2 I).\] + +Some tail bounds for the Gaussian distribution will be useful. + +*Claim 8 (Gaussian tail bounds)*. Let \(\xi \sim N(0, 1)\) be a standard +normal distribution. Then for \(t > 0\) + +\[\mathbb P(\xi > t) < {1 \over \sqrt{2 \pi} t} e^{- {t^2 \over 2}}, \qquad (6.3)\] + +and + +\[\mathbb P(\xi > t) < e^{- {t^2 \over 2}}. \qquad (6.5)\] + +*Proof*. Both bounds are well known. The first can be proved using + +\[\int_t^\infty e^{- {y^2 \over 2}} dy < \int_t^\infty {y \over t} e^{- {y^2 \over 2}} dy.\] + +The second is shown using +[[https://en.wikipedia.org/wiki/Chernoff_bound][Chernoff bound]]. For +any random variable \(\xi\), + +\[\mathbb P(\xi > t) < {\mathbb E \exp(\lambda \xi) \over \exp(\lambda t)} = \exp(\kappa_\xi(\lambda) - \lambda t), \qquad (6.7)\] + +where \(\kappa_\xi(\lambda) = \log \mathbb E \exp(\lambda \xi)\) is the +cumulant of \(\xi\). Since (6.7) holds for any \(\lambda\), we can get the +best bound by minimising \(\kappa_\xi(\lambda) - \lambda t\) (a.k.a. the +[[https://en.wikipedia.org/wiki/Legendre_transformation][Legendre +transformation]]). When \(\xi\) is standard normal, we get (6.5). +\(\square\) + +*Remark*. We will use the Chernoff bound extensively in the second part +of this post when considering Rényi differential privacy. + +*Claim 9*. The Gaussian mechanism on a query \(f\) is +\((\epsilon, \delta)\)-dp, where + +\[\delta = \exp(- (\epsilon \sigma / S_f - (2 \sigma / S_f)^{-1})^2 / 2). \qquad (6.8)\] + +Conversely, to achieve give \((\epsilon, \delta)\)-dp, we may set + +\[\sigma > \left(\epsilon^{-1} \sqrt{2 \log \delta^{-1}} + (2 \epsilon)^{- {1 \over 2}}\right) S_f \qquad (6.81)\] + +or + +\[\sigma > (\epsilon^{-1} (1 \vee \sqrt{(\log (2 \pi)^{-1} \delta^{-2})_+}) + (2 \epsilon)^{- {1 \over 2}}) S_f \qquad (6.82)\] + +or + +\[\sigma > \epsilon^{-1} \sqrt{\log e^\epsilon \delta^{-2}} S_f \qquad (6.83)\] + +or + +\[\sigma > \epsilon^{-1} (\sqrt{1 + \epsilon} \vee \sqrt{(\log e^\epsilon (2 \pi)^{-1} \delta^{-2})_+}) S_f. \qquad (6.84)\] + +*Proof*. As discussed before we only need to consider the case where +\(S_f = 1\). Fix arbitrary \(x, x' \in X\) with \(d(x, x') = 1\). Let +\(\zeta = (\zeta_1, ..., \zeta_d) \sim N(0, I_d)\). + +By Claim 6 it suffices to bound + +\[\mathbb P(L(M(x) || M(x')) > \epsilon)\] + +We have by the linear invariance of \(L\), + +\[L(M(x) || M(x')) = L(f(x) + \sigma \zeta || f(x') + \sigma \zeta) \overset{d}{=} L(\zeta|| \zeta + \Delta / \sigma),\] + +where \(\Delta := f(x') - f(x)\). + +Plugging in the Gaussian density, we have + +\[L(M(x) || M(x')) \overset{d}{=} \sum_i {\Delta_i \over \sigma} \zeta_i + \sum_i {\Delta_i^2 \over 2 \sigma^2} \overset{d}{=} {\|\Delta\|_2 \over \sigma} \xi + {\|\Delta\|_2^2 \over 2 \sigma^2}.\] + +where \(\xi \sim N(0, 1)\). + +Hence + +\[\mathbb P(L(M(x) || M(x')) > \epsilon) = \mathbb P(\zeta > {\sigma \over \|\Delta\|_2} \epsilon - {\|\Delta\|_2 \over 2 \sigma}).\] + +Since \(\|\Delta\|_2 \le S_f = 1\), we have + +\[\mathbb P(L(M(x) || M(x')) > \epsilon) \le \mathbb P(\xi > \sigma \epsilon - (2 \sigma)^{-1}).\] + +Thus the problem is reduced to the tail bound of a standard normal +distribution, so we can use Claim 8. Note that we implicitly require +\(\sigma > (2 \epsilon)^{- 1 / 2}\) here so that +\(\sigma \epsilon - (2 \sigma)^{-1} > 0\) and we can use the tail bounds. + +Using (6.3) we have + +\[\mathbb P(L(M(x) || M(x')) > \epsilon) < \exp(- (\epsilon \sigma - (2 \sigma)^{-1})^2 / 2).\] + +This gives us (6.8). + +To bound the right hand by \(\delta\), we require + +\[\epsilon \sigma - {1 \over 2 \sigma} > \sqrt{2 \log \delta^{-1}}. \qquad (6.91)\] + +Solving this inequality we have + +\[\sigma > {\sqrt{2 \log \delta^{-1}} + \sqrt{2 \log \delta^{-1} + 2 \epsilon} \over 2 \epsilon}.\] + +Using +\(\sqrt{2 \log \delta^{-1} + 2 \epsilon} \le \sqrt{2 \log \delta^{-1}} + \sqrt{2 \epsilon}\), +we can achieve the above inequality by having + +\[\sigma > \epsilon^{-1} \sqrt{2 \log \delta^{-1}} + (2 \epsilon)^{-{1 \over 2}}.\] + +This gives us (6.81). + +Alternatively, we can use the concavity of \(\sqrt{\cdot}\): + +\[(2 \epsilon)^{-1} (\sqrt{2 \log \delta^{-1}} + \sqrt{2 \log \delta^{-1} + 2 \epsilon}) \le \epsilon^{-1} \sqrt{\log e^\epsilon \delta^{-2}},\] + +which gives us (6.83) + +Back to (6.9), if we use (6.5) instead, we need + +\[\log t + {t^2 \over 2} > \log {(2 \pi)^{- 1 / 2} \delta^{-1}}\] + +where \(t = \epsilon \sigma - (2 \sigma)^{-1}\). This can be satisfied if + +$$\begin{aligned} +t &> 1 \qquad (6.93)\\ +t &> \sqrt{\log (2 \pi)^{-1} \delta^{-2}}. \qquad (6.95) +\end{aligned}$$ + +We can solve both inequalities as before and obtain + +\[\sigma > \epsilon^{-1} (1 \vee \sqrt{(\log (2 \pi)^{-1} \delta^{-2})_+}) + (2 \epsilon)^{- {1 \over 2}},\] + +or + +\[\sigma > \epsilon^{-1}(\sqrt{1 + \epsilon} \vee \sqrt{(\log e^\epsilon (2 \pi)^{-1} \delta^{-2})_+}).\] + +This gives us (6.82)(6.84). \(\square\) + +When \(\epsilon \le \alpha\) is bounded, by (6.83) (6.84) we can require +either + +\[\sigma > \epsilon^{-1} (\sqrt{\log e^\alpha \delta^{-2}}) S_f\] + +or + +\[\sigma > \epsilon^{-1} (\sqrt{1 + \alpha} \vee \sqrt{(\log (2 \pi)^{-1} e^\alpha \delta^{-2})_+}) S_f.\] + +The second bound is similar to and slightly better than the one in +Theorem A.1 of Dwork-Roth 2013, where \(\alpha = 1\): + +\[\sigma > \epsilon^{-1} \left({3 \over 2} \vee \sqrt{(2 \log {5 \over 4} \delta^{-1})_+}\right) S_f.\] + +Note that the lower bound of \({3 \over 2}\) is implicitly required in the +proof of Theorem A.1. + +** Composition theorems + :PROPERTIES: + :CUSTOM_ID: composition-theorems + :END: +So far we have seen how a mechanism made of a single query plus a noise +can be proved to be differentially private. But we need to understand +the privacy when composing several mechanisms, combinatorially or +sequentially. Let us first define the combinatorial case: + +*Definition (Independent composition)*. Let \(M_1, ..., M_k\) be \(k\) +mechanisms with independent noises. The mechanism \(M = (M_1, ..., M_k)\) +is called the /independent composition/ of \(M_{1 : k}\). + +To define the adaptive composition, let us motivate it with an example +of gradient descent. Consider the loss function \(\ell(x; \theta)\) of a +neural network, where \(\theta\) is the parameter and \(x\) the input, +gradient descent updates its parameter \(\theta\) at each time \(t\): + +\[\theta_{t} = \theta_{t - 1} - \alpha m^{-1} \sum_{i = 1 : m} \nabla_\theta \ell(x_i; \theta) |_{\theta = \theta_{t - 1}}.\] + +We may add privacy by adding noise \(\zeta_t\) at each step: + +\[\theta_{t} = \theta_{t - 1} - \alpha m^{-1} \sum_{i = 1 : m} \nabla_\theta \ell(x_i; \theta) |_{\theta = \theta_{t - 1}} + \zeta_t. \qquad (6.97)\] + +Viewed as a sequence of mechanism, we have that at each time \(t\), the +mechanism \(M_t\) takes input \(x\), and outputs \(\theta_t\). But \(M_t\) also +depends on the output of the previous mechanism \(M_{t - 1}\). To this +end, we define the adaptive composition. + +*Definition (Adaptive composition)*. Let +\(({M_i(y_{1 : i - 1})})_{i = 1 : k}\) be \(k\) mechanisms with independent +noises, where \(M_1\) has no parameter, \(M_2\) has one parameter in \(Y\), +\(M_3\) has two parameters in \(Y\) and so on. For \(x \in X\), define \(\xi_i\) +recursively by + +$$\begin{aligned} +\xi_1 &:= M_1(x)\\ +\xi_i &:= M_i(\xi_1, \xi_2, ..., \xi_{i - 1}) (x). +\end{aligned}$$ + +The /adaptive composition/ of \(M_{1 : k}\) is defined by +\(M(x) := (\xi_1, \xi_2, ..., \xi_k)\). + +The definition of adaptive composition may look a bit complicated, but +the point is to describe \(k\) mechanisms such that for each \(i\), the +output of the first, second, ..., \(i - 1\)th mechanisms determine the +\(i\)th mechanism, like in the case of gradient descent. + +It is not hard to write down the differentially private gradient descent +as a sequential composition: + +\[M_t(\theta_{1 : t - 1})(x) = \theta_{t - 1} - \alpha m^{-1} \sum_{i = 1 : m} \nabla_\theta \ell(x_i; \theta) |_{\theta = \theta_{t - 1}} + \zeta_t.\] + +In Dwork-Rothblum-Vadhan 2010 (see also Dwork-Roth 2013) the adaptive +composition is defined in a more general way, but the definition is +based on the same principle, and proofs in this post on adaptive +compositions carry over. + +It is not hard to see that the adaptive composition degenerates to +independent composition when each \(M_i(y_{1 : i})\) evaluates to the same +mechanism regardless of \(y_{1 : i}\), in which case the \(\xi_i\)s are +independent. + +In the following when discussing adaptive compositions we sometimes omit +the parameters for convenience without risk of ambiguity, and write +\(M_i(y_{1 : i})\) as \(M_i\), but keep in mind of the dependence on the +parameters. + +It is time to state and prove the composition theorems. In this section +we consider \(2 \times 2 \times 2 = 8\) cases, i.e. situations of three +dimensions, where there are two choices in each dimension: + +1. Composition of \(\epsilon\)-dp or more generally + \((\epsilon, \delta)\)-dp mechanisms +2. Composition of independent or more generally adaptive mechanisms +3. Basic or advanced compositions + +Note that in the first two dimensions the second choice is more general +than the first. + +The proofs of these composition theorems will be laid out as follows: + +1. Claim 10 - Basic composition theorem for \((\epsilon, \delta)\)-dp with + adaptive mechanisms: by a direct proof with an induction argument +2. Claim 14 - Advanced composition theorem for \(\epsilon\)-dp with + independent mechanisms: by factorising privacy loss and using + [[https://en.wikipedia.org/wiki/Hoeffding%27s_inequality][Hoeffding's + Inequality]] +3. Claim 16 - Advanced composition theorem for \(\epsilon\)-dp with + adaptive mechanisms: by factorising privacy loss and using + [[https://en.wikipedia.org/wiki/Azuma%27s_inequality][Azuma's + Inequality]] +4. Claims 17 and 18 - Advanced composition theorem for + \((\epsilon, \delta)\)-dp with independent / adaptive mechanisms: by + using characterisations of \((\epsilon, \delta)\)-dp in Claims 4 and 5 + as an approximation of \(\epsilon\)-dp and then using Proofs in Item 2 + or 3. + +*Claim 10 (Basic composition theorem).* Let \(M_{1 : k}\) be \(k\) +mechanisms with independent noises such that for each \(i\) and +\(y_{1 : i - 1}\), \(M_i(y_{1 : i - 1})\) is \((\epsilon_i, \delta_i)\)-dp. +Then the adpative composition of \(M_{1 : k}\) is +\((\sum_i \epsilon_i, \sum_i \delta_i)\)-dp. + +*Proof (Dwork-Lei 2009, see also Dwork-Roth 2013 Appendix B.1)*. Let \(x\) +and \(x'\) be neighbouring points in \(X\). Let \(M\) be the adaptive +composition of \(M_{1 : k}\). Define + +\[\xi_{1 : k} := M(x), \qquad \eta_{1 : k} := M(x').\] + +Let \(p^i\) and \(q^i\) be the laws of \((\xi_{1 : i})\) and \((\eta_{1 : i})\) +respectively. + +Let \(S_1, ..., S_k \subset Y\) and \(T_i := \prod_{j = 1 : i} S_j\). We use +two tricks. + +1. Since \(\xi_i | \xi_{< i} = y_{< i}\) and + \(\eta_i | \eta_{< i} = y_{< i}\) are \((\epsilon_i, \delta_i)\)-ind, and + a probability is no greater than \(1\), $$\begin{aligned} + \mathbb P(\xi_i \in S_i | \xi_{< i} = y_{< i}) &\le (e^{\epsilon_i} \mathbb P(\eta_i \in S_i | \eta_{< i} = y_{< i}) + \delta_i) \wedge 1 \\ + &\le (e^{\epsilon_i} \mathbb P(\eta_i \in S_i | \eta_{< i} = y_{< i}) + \delta_i) \wedge (1 + \delta_i) \\ + &= (e^{\epsilon_i} \mathbb P(\eta_i \in S_i | \eta_{< i} = y_{< i}) \wedge 1) + \delta_i + \end{aligned}$$ + +2. Given \(p\) and \(q\) that are \((\epsilon, \delta)\)-ind, define + \[\mu(x) = (p(x) - e^\epsilon q(x))_+.\] + + We have \[\mu(S) \le \delta, \forall S\] + + In the following we define + \(\mu^{i - 1} = (p^{i - 1} - e^\epsilon q^{i - 1})_+\) for the same + purpose. + +We use an inductive argument to prove the theorem: + +$$\begin{aligned} +\mathbb P(\xi_{\le i} \in T_i) &= \int_{T_{i - 1}} \mathbb P(\xi_i \in S_i | \xi_{< i} = y_{< i}) p^{i - 1} (y_{< i}) dy_{< i} \\ +&\le \int_{T_{i - 1}} (e^{\epsilon_i} \mathbb P(\eta_i \in S_i | \eta_{< i} = y_{< i}) \wedge 1) p^{i - 1}(y_{< i}) dy_{< i} + \delta_i\\ +&\le \int_{T_{i - 1}} (e^{\epsilon_i} \mathbb P(\eta_i \in S_i | \eta_{< i} = y_{< i}) \wedge 1) (e^{\epsilon_1 + ... + \epsilon_{i - 1}} q^{i - 1}(y_{< i}) + \mu^{i - 1} (y_{< i})) dy_{< i} + \delta_i\\ +&\le \int_{T_{i - 1}} e^{\epsilon_i} \mathbb P(\eta_i \in S_i | \eta_{< i} = y_{< i}) e^{\epsilon_1 + ... + \epsilon_{i - 1}} q^{i - 1}(y_{< i}) dy_{< i} + \mu_{i - 1}(T_{i - 1}) + \delta_i\\ +&\le e^{\epsilon_1 + ... + \epsilon_i} \mathbb P(\eta_{\le i} \in T_i) + \delta_1 + ... + \delta_{i - 1} + \delta_i.\\ +\end{aligned}$$ + +In the second line we use Trick 1; in the third line we use the +induction assumption; in the fourth line we multiply the first term in +the first braket with first term in the second braket, and the second +term (i.e. \(1\)) in the first braket with the second term in the second +braket (i.e. the \(\mu\) term); in the last line we use Trick 2. + +The base case \(i = 1\) is true since \(M_1\) is +\((\epsilon_1, \delta_1)\)-dp. \(\square\) + +To prove the advanced composition theorem, we start with some lemmas. + +*Claim 11*. If \(p\) and \(q\) are \(\epsilon\)-ind, then + +\[D(p || q) + D(q || p) \le \epsilon(e^\epsilon - 1).\] + +*Proof*. Since \(p\) and \(q\) are \(\epsilon\)-ind, we have +\(|\log p(x) - \log q(x)| \le \epsilon\) for all \(x\). Let +\(S := \{x: p(x) > q(x)\}\). Then we have on + +$$\begin{aligned} +D(p || q) + D(q || p) &= \int (p(x) - q(x)) (\log p(x) - \log q(x)) dx\\ +&= \int_S (p(x) - q(x)) (\log p(x) - \log q(x)) dx + \int_{S^c} (q(x) - p(x)) (\log q(x) - \log p(x)) dx\\ +&\le \epsilon(\int_S p(x) - q(x) dx + \int_{S^c} q(x) - p(x) dx) +\end{aligned}$$ + +Since on \(S\) we have \(q(x) \le p(x) \le e^\epsilon q(x)\), and on \(S^c\) +we have \(p(x) \le q(x) \le e^\epsilon p(x)\), we obtain + +\[D(p || q) + D(q || p) \le \epsilon \int_S (1 - e^{-\epsilon}) p(x) dx + \epsilon \int_{S^c} (e^{\epsilon} - 1) p(x) dx \le \epsilon (e^{\epsilon} - 1),\] + +where in the last step we use \(e^\epsilon - 1 \ge 1 - e^{- \epsilon}\) +and \(p(S) + p(S^c) = 1\). \(\square\) + +*Claim 12*. If \(p\) and \(q\) are \(\epsilon\)-ind, then + +\[D(p || q) \le a(\epsilon) \ge D(q || p),\] + +where + +\[a(\epsilon) = \epsilon (e^\epsilon - 1) 1_{\epsilon \le \log 2} + \epsilon 1_{\epsilon > \log 2} \le (\log 2)^{-1} \epsilon^2 1_{\epsilon \le \log 2} + \epsilon 1_{\epsilon > \log 2}. \qquad (6.98)\] + +*Proof*. Since \(p\) and \(q\) are \(\epsilon\)-ind, we have + +\[D(p || q) = \mathbb E_{\xi \sim p} \log {p(\xi) \over q(\xi)} \le \max_y {\log p(y) \over \log q(y)} \le \epsilon.\] + +Comparing the quantity in Claim 11 (\(\epsilon(e^\epsilon - 1)\)) with the +quantity above (\(\epsilon\)), we arrive at the conclusion. \(\square\) + +*Claim 13 +([[https://en.wikipedia.org/wiki/Hoeffding%27s_inequality][Hoeffding's +Inequality]])*. Let \(L_i\) be independent random variables with +\(|L_i| \le b\), and let \(L = L_1 + ... + L_k\), then for \(t > 0\), + +\[\mathbb P(L - \mathbb E L \ge t) \le \exp(- {t^2 \over 2 k b^2}).\] + +*Claim 14 (Advanced Independent Composition Theorem)* (\(\delta = 0\)). +Fix \(0 < \beta < 1\). Let \(M_1, ..., M_k\) be \(\epsilon\)-dp, then the +independent composition \(M\) of \(M_{1 : k}\) is +\((k a(\epsilon) + \sqrt{2 k \log \beta^{-1}} \epsilon, \beta)\)-dp. + +*Remark*. By (6.98) we know that +\(k a(\epsilon) + \sqrt{2 k \log \beta^{-1}} \epsilon = \sqrt{2 k \log \beta^{-1}} \epsilon + k O(\epsilon^2)\) +when \(\epsilon\) is sufficiently small, in which case the leading term is +of order \(O(\sqrt k \epsilon)\) and we save a \(\sqrt k\) in the +\(\epsilon\)-part compared to the Basic Composition Theorem (Claim 10). + +*Remark*. In practice one can try different choices of \(\beta\) and +settle with the one that gives the best privacy guarantee. See the +discussions at the end of +[[/posts/2019-03-14-great-but-manageable-expectations.html][Part 2 of +this post]]. + +*Proof*. Let \(p_i\), \(q_i\), \(p\) and \(q\) be the laws of \(M_i(x)\), +\(M_i(x')\), \(M(x)\) and \(M(x')\) respectively. + +\[\mathbb E L_i = D(p_i || q_i) \le a(\epsilon),\] + +where \(L_i := L(p_i || q_i)\). Due to \(\epsilon\)-ind also have + +\[|L_i| \le \epsilon.\] + +Therefore, by Hoeffding's Inequality, + +\[\mathbb P(L - k a(\epsilon) \ge t) \le \mathbb P(L - \mathbb E L \ge t) \le \exp(- t^2 / 2 k \epsilon^2),\] + +where \(L := \sum_i L_i = L(p || q)\). + +Plugging in \(t = \sqrt{2 k \epsilon^2 \log \beta^{-1}}\), we have + +\[\mathbb P(L(p || q) \le k a(\epsilon) + \sqrt{2 k \epsilon^2 \log \beta^{-1}}) \ge 1 - \beta.\] + +Similarly we also have + +\[\mathbb P(L(q || p) \le k a(\epsilon) + \sqrt{2 k \epsilon^2 \log \beta^{-1}}) \ge 1 - \beta.\] + +By Claim 1 we arrive at the conclusion. \(\square\) + +*Claim 15 ([[https://en.wikipedia.org/wiki/Azuma%27s_inequality][Azuma's +Inequality]])*. Let \(X_{0 : k}\) be a +[[https://en.wikipedia.org/wiki/Martingale_(probability_theory)][supermartingale]]. +If \(|X_i - X_{i - 1}| \le b\), then + +\[\mathbb P(X_k - X_0 \ge t) \le \exp(- {t^2 \over 2 k b^2}).\] + +Azuma's Inequality implies a slightly weaker version of Hoeffding's +Inequality. To see this, let \(L_{1 : k}\) be independent variables with +\(|L_i| \le b\). Let \(X_i = \sum_{j = 1 : i} L_j - \mathbb E L_j\). Then +\(X_{0 : k}\) is a martingale, and + +\[| X_i - X_{i - 1} | = | L_i - \mathbb E L_i | \le 2 b,\] + +since \(\|L_i\|_1 \le \|L_i\|_\infty\). Hence by Azuma's Inequality, + +\[\mathbb P(L - \mathbb E L \ge t) \le \exp(- {t^2 \over 8 k b^2}).\] + +Of course here we have made no assumption on \(\mathbb E L_i\). If instead +we have some bound for the expectation, say \(|\mathbb E L_i| \le a\), +then by the same derivation we have + +\[\mathbb P(L - \mathbb E L \ge t) \le \exp(- {t^2 \over 2 k (a + b)^2}).\] + +It is not hard to see what Azuma is to Hoeffding is like adaptive +composition to independent composition. Indeed, we can use Azuma's +Inequality to prove the Advanced Adaptive Composition Theorem for +\(\delta = 0\). + +*Claim 16 (Advanced Adaptive Composition Theorem)* (\(\delta = 0\)). Let +\(\beta > 0\). Let \(M_{1 : k}\) be \(k\) mechanisms with independent noises +such that for each \(i\) and \(y_{1 : i}\), \(M_i(y_{1 : i})\) is +\((\epsilon, 0)\)-dp. Then the adpative composition of \(M_{1 : k}\) is +\((k a(\epsilon) + \sqrt{2 k \log \beta^{-1}} (\epsilon + a(\epsilon)), \beta)\)-dp. + +*Proof*. As before, let \(\xi_{1 : k} \overset{d}{=} M(x)\) and +\(\eta_{1 : k} \overset{d}{=} M(x')\), where \(M\) is the adaptive +composition of \(M_{1 : k}\). Let \(p_i\) (resp. \(q_i\)) be the law of +\(\xi_i | \xi_{< i}\) (resp. \(\eta_i | \eta_{< i}\)). Let \(p^i\) (resp. +\(q^i\)) be the law of \(\xi_{\le i}\) (resp. \(\eta_{\le i}\)). We want to +construct supermartingale \(X\). To this end, let + +\[X_i = \log {p^i(\xi_{\le i}) \over q^i(\xi_{\le i})} - i a(\epsilon) \] + +We show that \((X_i)\) is a supermartingale: + +$$\begin{aligned} +\mathbb E(X_i - X_{i - 1} | X_{i - 1}) &= \mathbb E \left(\log {p_i (\xi_i | \xi_{< i}) \over q_i (\xi_i | \xi_{< i})} - a(\epsilon) | \log {p^{i - 1} (\xi_{< i}) \over q^{i - 1} (\xi_{< i})}\right) \\ +&= \mathbb E \left( \mathbb E \left(\log {p_i (\xi_i | \xi_{< i}) \over q_i (\xi_i | \xi_{< i})} | \xi_{< i}\right) | \log {p^{i - 1} (\xi_{< i}) \over q^{i - 1} (\xi_{< i})}\right) - a(\epsilon) \\ +&= \mathbb E \left( D(p_i (\cdot | \xi_{< i}) || q_i (\cdot | \xi_{< i})) | \log {p^{i - 1} (\xi_{< i}) \over q^{i - 1} (\xi_{< i})}\right) - a(\epsilon) \\ +&\le 0, +\end{aligned}$$ + +since by Claim 12 +\(D(p_i(\cdot | y_{< i}) || q_i(\cdot | y_{< i})) \le a(\epsilon)\) for +all \(y_{< i}\). + +Since + +\[| X_i - X_{i - 1} | = | \log {p_i(\xi_i | \xi_{< i}) \over q_i(\xi_i | \xi_{< i})} - a(\epsilon) | \le \epsilon + a(\epsilon),\] + +by Azuma's Inequality, + +\[\mathbb P(\log {p^k(\xi_{1 : k}) \over q^k(\xi_{1 : k})} \ge k a(\epsilon) + t) \le \exp(- {t^2 \over 2 k (\epsilon + a(\epsilon))^2}). \qquad(6.99)\] + +Let \(t = \sqrt{2 k \log \beta^{-1}} (\epsilon + a(\epsilon))\) we are +done. \(\square\) + +*Claim 17 (Advanced Independent Composition Theorem)*. Fix +\(0 < \beta < 1\). Let \(M_1, ..., M_k\) be \((\epsilon, \delta)\)-dp, then +the independent composition \(M\) of \(M_{1 : k}\) is +\((k a(\epsilon) + \sqrt{2 k \log \beta^{-1}} \epsilon, k \delta + \beta)\)-dp. + +*Proof*. By Claim 4, there exist events \(E_{1 : k}\) and \(F_{1 : k}\) such +that + +1. The laws \(p_{i | E_i}\) and \(q_{i | F_i}\) are \(\epsilon\)-ind. +2. \(\mathbb P(E_i), \mathbb P(F_i) \ge 1 - \delta\). + +Let \(E := \bigcap E_i\) and \(F := \bigcap F_i\), then they both have +probability at least \(1 - k \delta\), and \(p_{i | E}\) and \(q_{i | F}\) are +\(\epsilon\)-ind. + +By Claim 14, \(p_{|E}\) and \(q_{|F}\) are +\((\epsilon' := k a(\epsilon) + \sqrt{2 k \epsilon^2 \log \beta^{-1}}, \beta)\)-ind. +Let us shrink the bigger event between \(E\) and \(F\) so that they have +equal probabilities. Then + +$$\begin{aligned} +p (S) &\le p_{|E}(S) \mathbb P(E) + \mathbb P(E^c) \\ +&\le (e^{\epsilon'} q_{|F}(S) + \beta) \mathbb P(F) + k \delta\\ +&\le e^{\epsilon'} q(S) + \beta + k \delta. +\end{aligned}$$ + +\(\square\) + +*Claim 18 (Advanced Adaptive Composition Theorem)*. Fix \(0 < \beta < 1\). +Let \(M_{1 : k}\) be \(k\) mechanisms with independent noises such that for +each \(i\) and \(y_{1 : i}\), \(M_i(y_{1 : i})\) is \((\epsilon, \delta)\)-dp. +Then the adpative composition of \(M_{1 : k}\) is +\((k a(\epsilon) + \sqrt{2 k \log \beta^{-1}} (\epsilon + a(\epsilon)), \beta + k \delta)\)-dp. + +*Remark*. This theorem appeared in Dwork-Rothblum-Vadhan 2010, but I +could not find a proof there. A proof can be found in Dwork-Roth 2013 +(See Theorem 3.20 there). Here I prove it in a similar way, except that +instead of the use of an intermediate random variable there, I use the +conditional probability results from Claim 5, the approach mentioned in +Vadhan 2017. + +*Proof*. By Claim 5, there exist events \(E_{1 : k}\) and \(F_{1 : k}\) such +that + +1. The laws \(p_{i | E_i}(\cdot | y_{< i})\) and + \(q_{i | F_i}(\cdot | y_{< i})\) are \(\epsilon\)-ind for all \(y_{< i}\). +2. \(\mathbb P(E_i | y_{< i}), \mathbb P(F_i | y_{< i}) \ge 1 - \delta\) + for all \(y_{< i}\). + +Let \(E := \bigcap E_i\) and \(F := \bigcap F_i\), then they both have +probability at least \(1 - k \delta\), and \(p_{i | E}(\cdot | y_{< i}\) and +\(q_{i | F}(\cdot | y_{< i})\) are \(\epsilon\)-ind. + +By Advanced Adaptive Composition Theorem (\(\delta = 0\)), \(p_{|E}\) and +\(q_{|F}\) are +\((\epsilon' := k a(\epsilon) + \sqrt{2 k \log \beta^{-1}} (\epsilon + a(\epsilon)), \beta)\)-ind. + +The rest is the same as in the proof of Claim 17. \(\square\) + +** Subsampling + :PROPERTIES: + :CUSTOM_ID: subsampling + :END: +Stochastic gradient descent is like gradient descent, but with random +subsampling. + +Recall we have been considering databases in the space \(Z^m\). Let +\(n < m\) be a positive integer, +\(\mathcal I := \{I \subset [m]: |I| = n\}\) be the set of subsets of +\([m]\) of size \(n\), and \(\gamma\) a random subset sampled uniformly from +\(\mathcal I\). Let \(r = {n \over m}\) which we call the subsampling rate. +Then we may add a subsampling module to the noisy gradient descent +algorithm (6.97) considered before + +\[\theta_{t} = \theta_{t - 1} - \alpha n^{-1} \sum_{i \in \gamma} \nabla_\theta h_\theta(x_i) |_{\theta = \theta_{t - 1}} + \zeta_t. \qquad (7)\] + +It turns out subsampling has an amplification effect on privacy. + +*Claim 19 (Ullman 2017)*. Fix \(r \in [0, 1]\). Let \(n \le m\) be two +nonnegative integers with \(n = r m\). Let \(N\) be an +\((\epsilon, \delta)\)-dp mechanism on \(Z^n\). Define mechanism \(M\) on +\(Z^m\) by + +\[M(x) = N(x_\gamma)\] + +Then \(M\) is \((\log (1 + r(e^\epsilon - 1)), r \delta)\)-dp. + +*Remark*. Some seem to cite +Kasiviswanathan-Lee-Nissim-Raskhodnikova-Smith 2005 for this result, but +it is not clear to me how it appears there. + +*Proof*. Let \(x, x' \in Z^n\) such that they differ by one row +\(x_i \neq x_i'\). Naturally we would like to consider the cases where the +index \(i\) is picked and the ones where it is not separately. Let +\(\mathcal I_\in\) and \(\mathcal I_\notin\) be these two cases: + +$$\begin{aligned} +\mathcal I_\in = \{J \subset \mathcal I: i \in J\}\\ +\mathcal I_\notin = \{J \subset \mathcal I: i \notin J\}\\ +\end{aligned}$$ + +We will use these notations later. Let \(A\) be the event +\(\{\gamma \ni i\}\). + +Let \(p\) and \(q\) be the laws of \(M(x)\) and \(M(x')\) respectively. We +collect some useful facts about them. First due to \(N\) being +\((\epsilon, \delta)\)-dp, + +\[p_{|A}(S) \le e^\epsilon q_{|A}(S) + \delta.\] + +Also, + +\[p_{|A}(S) \le e^\epsilon p_{|A^c}(S) + \delta.\] + +To see this, note that being conditional laws, \(p_A\) and \(p_{A^c}\) are +averages of laws over \(\mathcal I_\in\) and \(\mathcal I_\notin\) +respectively: + +$$\begin{aligned} +p_{|A}(S) = |\mathcal I_\in|^{-1} \sum_{I \in \mathcal I_\in} \mathbb P(N(x_I) \in S)\\ +p_{|A^c}(S) = |\mathcal I_\notin|^{-1} \sum_{J \in \mathcal I_\notin} \mathbb P(N(x_J) \in S). +\end{aligned}$$ + +Now we want to pair the \(I\)'s in \(\mathcal I_\in\) and \(J\)'s in +\(\mathcal I_\notin\) so that they differ by one index only, which means +\(d(x_I, x_J) = 1\). Formally, this means we want to consider the set: + +\[\mathcal D := \{(I, J) \in \mathcal I_\in \times \mathcal I_\notin: |I \cap J| = n - 1\}.\] + +We may observe by trying out some simple cases that every +\(I \in \mathcal I_\in\) is paired with \(n\) elements in +\(\mathcal I_\notin\), and every \(J \in \mathcal I_\notin\) is paired with +\(m - n\) elements in \(\mathcal I_\in\). Therefore + +\[p_{|A}(S) = |\mathcal D|^{-1} \sum_{(I, J) \in \mathcal D} \mathbb P(N(x_I \in S)) \le |\mathcal D|^{-1} \sum_{(I, J) \in \mathcal D} (e^\epsilon \mathbb P(N(x_J \in S)) + \delta) = e^\epsilon p_{|A^c} (S) + \delta.\] + +Since each of the \(m\) indices is picked independently with probability +\(r\), we have + +\[\mathbb P(A) = r.\] + +Let \(t \in [0, 1]\) to be determined. We may write + +$$\begin{aligned} +p(S) &= r p_{|A} (S) + (1 - r) p_{|A^c} (S)\\ +&\le r(t e^\epsilon q_{|A}(S) + (1 - t) e^\epsilon q_{|A^c}(S) + \delta) + (1 - r) q_{|A^c} (S)\\ +&= rte^\epsilon q_{|A}(S) + (r(1 - t) e^\epsilon + (1 - r)) q_{|A^c} (S) + r \delta\\ +&= te^\epsilon r q_{|A}(S) + \left({r \over 1 - r}(1 - t) e^\epsilon + 1\right) (1 - r) q_{|A^c} (S) + r \delta \\ +&\le \left(t e^\epsilon \wedge \left({r \over 1 - r} (1 - t) e^\epsilon + 1\right)\right) q(S) + r \delta. \qquad (7.5) +\end{aligned}$$ + +We can see from the last line that the best bound we can get is when + +\[t e^\epsilon = {r \over 1 - r} (1 - t) e^\epsilon + 1.\] + +Solving this equation we obtain + +\[t = r + e^{- \epsilon} - r e^{- \epsilon}\] + +and plugging this in (7.5) we have + +\[p(S) \le (1 + r(e^\epsilon - 1)) q(S) + r \delta.\] + +\(\square\) + +Since \(\log (1 + x) < x\) for \(x > 0\), we can rewrite the conclusion of +the Claim to \((r(e^\epsilon - 1), r \delta)\)-dp. Further more, if +\(\epsilon < \alpha\) for some \(\alpha\), we can rewrite it as +\((r \alpha^{-1} (e^\alpha - 1) \epsilon, r \delta)\)-dp or +\((O(r \epsilon), r \delta)\)-dp. + +Let \(\epsilon < 1\). We see that if the mechanism \(N\) is +\((\epsilon, \delta)\)-dp on \(Z^n\), then \(M\) is +\((2 r \epsilon, r \delta)\)-dp, and if we run it over \(k / r\) +minibatches, by Advanced Adaptive Composition theorem, we have +\((\sqrt{2 k r \log \beta^{-1}} \epsilon + 2 k r \epsilon^2, k \delta + \beta)\)-dp. + +This is better than the privacy guarantee without subsampling, where we +run over \(k\) iterations and obtain +\((\sqrt{2 k \log \beta^{-1}} \epsilon + 2 k \epsilon^2, k \delta + \beta)\)-dp. +So with subsampling we gain an extra \(\sqrt r\) in the \(\epsilon\)-part of +the privacy guarantee. But, smaller subsampling rate means smaller +minibatch size, which would result in bigger variance, so there is a +trade-off here. + +Finally we define the differentially private stochastic gradient descent +(DP-SGD) with the Gaussian mechanism +(Abadi-Chu-Goodfellow-McMahan-Mironov-Talwar-Zhang 2016), which is (7) +with the noise specialised to Gaussian and an added clipping operation +to bound to sensitivity of the query to a chosen \(C\): + +\[\theta_{t} = \theta_{t - 1} - \alpha \left(n^{-1} \sum_{i \in \gamma} \nabla_\theta \ell(x_i; \theta) |_{\theta = \theta_{t - 1}}\right)_{\text{Clipped at }C / 2} + N(0, \sigma^2 C^2 I),\] + +where + +\[y_{\text{Clipped at } \alpha} := y / (1 \vee {\|y\|_2 \over \alpha})\] + +is \(y\) clipped to have norm at most \(\alpha\). + +Note that the clipping in DP-SGD is much stronger than making the query +have sensitivity \(C\). It makes the difference between the query results +of two /arbitrary/ inputs bounded by \(C\), rather than /neighbouring/ +inputs. + +In [[/posts/2019-03-14-great-but-manageable-expectations.html][Part 2 of +this post]] we will use the tools developed above to discuss the privacy +guarantee for DP-SGD, among other things. + +** References + :PROPERTIES: + :CUSTOM_ID: references + :END: + +- Abadi, Martín, Andy Chu, Ian Goodfellow, H. Brendan McMahan, Ilya + Mironov, Kunal Talwar, and Li Zhang. "Deep Learning with Differential + Privacy." Proceedings of the 2016 ACM SIGSAC Conference on Computer + and Communications Security - CCS'16, 2016, 308--18. + [[https://doi.org/10.1145/2976749.2978318]]. +- Dwork, Cynthia, and Aaron Roth. "The Algorithmic Foundations of + Differential Privacy." Foundations and Trends® in Theoretical Computer + Science 9, no. 3--4 (2013): 211--407. + [[https://doi.org/10.1561/0400000042]]. +- Dwork, Cynthia, Guy N. Rothblum, and Salil Vadhan. "Boosting and + Differential Privacy." In 2010 IEEE 51st Annual Symposium on + Foundations of Computer Science, 51--60. Las Vegas, NV, USA: + IEEE, 2010. [[https://doi.org/10.1109/FOCS.2010.12]]. +- Shiva Prasad Kasiviswanathan, Homin K. Lee, Kobbi Nissim, Sofya + Raskhodnikova, and Adam Smith. "What Can We Learn Privately?" In 46th + Annual IEEE Symposium on Foundations of Computer Science (FOCS'05). + Pittsburgh, PA, USA: IEEE, 2005. + [[https://doi.org/10.1109/SFCS.2005.1]]. +- Murtagh, Jack, and Salil Vadhan. "The Complexity of Computing the + Optimal Composition of Differential Privacy." In Theory of + Cryptography, edited by Eyal Kushilevitz and Tal Malkin, 9562:157--75. + Berlin, Heidelberg: Springer Berlin Heidelberg, 2016. + [[https://doi.org/10.1007/978-3-662-49096-9_7]]. +- Ullman, Jonathan. "Solution to CS7880 Homework 1.", 2017. + [[http://www.ccs.neu.edu/home/jullman/cs7880s17/HW1sol.pdf]] +- Vadhan, Salil. "The Complexity of Differential Privacy." In Tutorials + on the Foundations of Cryptography, edited by Yehuda Lindell, + 347--450. Cham: Springer International Publishing, 2017. + [[https://doi.org/10.1007/978-3-319-57048-8_7]]. + +[fn:1] For those who have read about differential privacy and never + heard of the term "divergence variable", it is closely related to + the notion of "privacy loss", see the paragraph under Claim 6 in + [[#back-to-approximate-differential-privacy][Back to approximate + differential privacy]]. I defined the term this way so that we + can focus on the more general stuff: compared to the privacy loss + \(L(M(x) || M(x'))\), the term \(L(p || q)\) removes the "distracting + information" that \(p\) and \(q\) are related to databases, queries, + mechanisms etc., but merely probability laws. By removing the + distraction, we simplify the analysis. And once we are done with + the analysis of \(L(p || q)\), we can apply the results obtained in + the general setting to the special case where \(p\) is the law of + \(M(x)\) and \(q\) is the law of \(M(x')\). diff --git a/posts/2019-03-14-great-but-manageable-expectations.org b/posts/2019-03-14-great-but-manageable-expectations.org new file mode 100644 index 0000000..6438090 --- /dev/null +++ b/posts/2019-03-14-great-but-manageable-expectations.org @@ -0,0 +1,836 @@ +#+title: Great but Manageable Expectations + +#+date: <2019-03-14> + +This is Part 2 of a two-part blog post on differential privacy. +Continuing from [[file:2019-03-13-a-tail-of-two-densities.org][Part 1]], I discuss the Rényi differential privacy, corresponding to the +Rényi divergence, a study of the +[[https://en.wikipedia.org/wiki/Moment-generating_function][moment +generating functions]] of the divergence between probability measures in +order to derive the tail bounds. + +Like in Part 1, I prove a composition theorem and a subsampling theorem. + +I also attempt to reproduce a seemingly better moment bound for the +Gaussian mechanism with subsampling, with one intermediate step which I +am not able to prove. + +After that I explain the Tensorflow implementation of differential +privacy in its +[[https://github.com/tensorflow/privacy/tree/master/privacy][Privacy]] +module, which focuses on the differentially private stochastic gradient +descent algorithm (DP-SGD). + +Finally I use the results from both Part 1 and Part 2 to obtain some +privacy guarantees for composed subsampling queries in general, and for +DP-SGD in particular. I also compare these privacy guarantees. + +/If you are confused by any notations, ask me or try +[[/notations.html][this]]./ + +** Rényi divergence and differential privacy + :PROPERTIES: + :CUSTOM_ID: rényi-divergence-and-differential-privacy + :END: +Recall in the proof of Gaussian mechanism privacy guarantee (Claim 8) we +used the Chernoff bound for the Gaussian noise. Why not use the Chernoff +bound for the divergence variable / privacy loss directly, since the +latter is closer to the core subject than the noise? This leads us to +the study of Rényi divergence. + +So far we have seen several notions of divergence used in differential +privacy: the max divergence which is \(\epsilon\)-ind in disguise: + +\[D_\infty(p || q) := \max_y \log {p(y) \over q(y)},\] + +the \(\delta\)-approximate max divergence that defines the +\((\epsilon, \delta)\)-ind: + +\[D_\infty^\delta(p || q) := \max_y \log{p(y) - \delta \over q(y)},\] + +and the KL-divergence which is the expectation of the divergence +variable: + +\[D(p || q) = \mathbb E L(p || q) = \int \log {p(y) \over q(y)} p(y) dy.\] + +The Rényi divergence is an interpolation between the max divergence and +the KL-divergence, defined as the log moment generating function / +cumulants of the divergence variable: + +\[D_\lambda(p || q) = (\lambda - 1)^{-1} \log \mathbb E \exp((\lambda - 1) L(p || q)) = (\lambda - 1)^{-1} \log \int {p(y)^\lambda \over q(y)^{\lambda - 1}} dy.\] + +Indeed, when \(\lambda \to \infty\) we recover the max divergence, and +when \(\lambda \to 1\), by recognising \(D_\lambda\) as a derivative in +\(\lambda\) at \(\lambda = 1\), we recover the KL divergence. In this post +we only consider \(\lambda > 1\). + +Using the Rényi divergence we may define: + +*Definition (Rényi differential privacy)* (Mironov 2017). An mechanism +\(M\) is \((\lambda, \rho)\)/-Rényi differentially private/ +(\((\lambda, \rho)\)-rdp) if for all \(x\) and \(x'\) with distance \(1\), + +\[D_\lambda(M(x) || M(x')) \le \rho.\] + +For convenience we also define two related notions, \(G_\lambda (f || g)\) +and \(\kappa_{f, g} (t)\) for \(\lambda > 1\), \(t > 0\) and positive +functions \(f\) and \(g\): + +\[G_\lambda(f || g) = \int f(y)^{\lambda} g(y)^{1 - \lambda} dy; \qquad \kappa_{f, g} (t) = \log G_{t + 1}(f || g).\] + +For probability densities \(p\) and \(q\), \(G_{t + 1}(p || q)\) and +\(\kappa_{p, q}(t)\) are the \(t\)th moment generating function and +[[https://en.wikipedia.org/wiki/Cumulant][cumulant]] of the divergence +variable \(L(p || q)\), and + +\[D_\lambda(p || q) = (\lambda - 1)^{-1} \kappa_{p, q}(\lambda - 1).\] + +In the following, whenever you see \(t\), think of it as \(\lambda - 1\). + +*Example 1 (RDP for the Gaussian mechanism)*. Using the scaling and +translation invariance of \(L\) (6.1), we have that the divergence +variable for two Gaussians with the same variance is + +\[L(N(\mu_1, \sigma^2 I) || N(\mu_2, \sigma^2 I)) \overset{d}{=} L(N(0, I) || N((\mu_2 - \mu_1) / \sigma, I)).\] + +With this we get + +\[D_\lambda(N(\mu_1, \sigma^2 I) || N(\mu_2, \sigma^2 I)) = {\lambda \|\mu_2 - \mu_1\|_2^2 \over 2 \sigma^2} = D_\lambda(N(\mu_2, \sigma^2 I) || N(\mu_1, \sigma^2 I)).\] + +Again due to the scaling invariance of \(L\), we only need to consider \(f\) +with sensitivity \(1\), see the discussion under (6.1). The Gaussian +mechanism on query \(f\) is thus \((\lambda, \lambda / 2 \sigma^2)\)-rdp for +any \(\lambda > 1\). + +From the example of Gaussian mechanism, we see that the relation between +\(\lambda\) and \(\rho\) is like that between \(\epsilon\) and \(\delta\). Given +\(\lambda\) (resp. \(\rho\)) and parameters like variance of the noise and +the sensitivity of the query, we can write \(\rho = \rho(\lambda)\) (resp. +\(\lambda = \lambda(\rho)\)). + +Using the Chernoff bound (6.7), we can bound the divergence variable: + +\[\mathbb P(L(p || q) \ge \epsilon) \le {\mathbb E \exp(t L(p || q)) \over \exp(t \epsilon))} = \exp (\kappa_{p, q}(t) - \epsilon t). \qquad (7.7)\] + +For a function \(f: I \to \mathbb R\), denote its +[[https://en.wikipedia.org/wiki/Legendre_transformation][Legendre +transform]] by + +\[f^*(\epsilon) := \sup_{t \in I} (\epsilon t - f(t)).\] + +By taking infimum on the RHS of (7.7), we obtain + +*Claim 20*. Two probability densities \(p\) and \(q\) are +\((\epsilon, \exp(-\kappa_{p, q}^*(\epsilon)))\)-ind. + +Given a mechanism \(M\), let \(\kappa_M(t)\) denote an upper bound for the +cumulant of its privacy loss: + +\[\log \mathbb E \exp(t L(M(x) || M(x'))) \le \kappa_M(t), \qquad \forall x, x'\text{ with } d(x, x') = 1.\] + +For example, we can set \(\kappa_M(t) = t \rho(t + 1)\). Using the same +argument we have the following: + +*Claim 21*. If \(M\) is \((\lambda, \rho)\)-rdp, then + +1. it is also \((\epsilon, \exp((\lambda - 1) (\rho - \epsilon)))\)-dp for + any \(\epsilon \ge \rho\). +2. Alternatively, \(M\) is \((\epsilon, - \exp(\kappa_M^*(\epsilon)))\)-dp + for any \(\epsilon > 0\). +3. Alternatively, for any \(0 < \delta \le 1\), \(M\) is + \((\rho + (\lambda - 1)^{-1} \log \delta^{-1}, \delta)\)-dp. + +*Example 2 (Gaussian mechanism)*. We can apply the above argument to the +Gaussian mechanism on query \(f\) and get: + +\[\delta \le \inf_{\lambda > 1} \exp((\lambda - 1) ({\lambda \over 2 \sigma^2} - \epsilon))\] + +By assuming \(\sigma^2 > (2 \epsilon)^{-1}\) we have that the infimum is +achieved when \(\lambda = (1 + 2 \epsilon / \sigma^2) / 2\) and + +\[\delta \le \exp(- ((2 \sigma)^{-1} - \epsilon \sigma)^2 / 2)\] + +which is the same result as (6.8), obtained using the Chernoff bound of +the noise. + +However, as we will see later, compositions will yield different results +from those obtained from methods in +[[/posts/2019-03-13-a-tail-of-two-densities.html][Part 1]] when +considering Rényi dp. + +*** Moment Composition + :PROPERTIES: + :CUSTOM_ID: moment-composition + :END: +*Claim 22 (Moment Composition Theorem)*. Let \(M\) be the adaptive +composition of \(M_{1 : k}\). Suppose for any \(y_{< i}\), \(M_i(y_{< i})\) is +\((\lambda, \rho)\)-rdp. Then \(M\) is \((\lambda, k\rho)\)-rdp. + +*Proof*. Rather straightforward. As before let \(p_i\) and \(q_i\) be the +conditional laws of adpative composition of \(M_{1 : i}\) at \(x\) and \(x'\) +respectively, and \(p^i\) and \(q^i\) be the joint laws of \(M_{1 : i}\) at +\(x\) and \(x'\) respectively. Denote + +\[D_i = \mathbb E \exp((\lambda - 1)\log {p^i(\xi_{1 : i}) \over q^i(\xi_{1 : i})})\] + +Then + +$$\begin{aligned} +D_i &= \mathbb E\mathbb E \left(\exp((\lambda - 1)\log {p_i(\xi_i | \xi_{< i}) \over q_i(\xi_i | \xi_{< i})}) \exp((\lambda - 1)\log {p^{i - 1}(\xi_{< i}) \over q^{i - 1}(\xi_{< i})}) \big| \xi_{< i}\right) \\ +&= \mathbb E \mathbb E \left(\exp((\lambda - 1)\log {p_i(\xi_i | \xi_{< i}) \over q_i(\xi_i | \xi_{< i})}) | \xi_{< i}\right) \exp\left((\lambda - 1)\log {p^{i - 1}(\xi_{< i}) \over q^{i - 1}(\xi_{< i})}\right)\\ +&\le \mathbb E \exp((\lambda - 1) \rho) \exp\left((\lambda - 1)\log {p^{i - 1}(\xi_{< i}) \over q^{i - 1}(\xi_{< i})}\right)\\ +&= \exp((\lambda - 1) \rho) D_{i - 1}. +\end{aligned}$$ + +Applying this recursively we have + +\[D_k \le \exp(k(\lambda - 1) \rho),\] + +and so + +\[(\lambda - 1)^{-1} \log \mathbb E \exp((\lambda - 1)\log {p^k(\xi_{1 : i}) \over q^k(\xi_{1 : i})}) = (\lambda - 1)^{-1} \log D_k \le k \rho.\] + +Since this holds for all \(x\) and \(x'\), we are done. \(\square\) + +This, together with the scaling property of the legendre transformation: + +\[(k f)^*(x) = k f^*(x / k)\] + +yields + +*Claim 23*. The \(k\)-fold adaptive composition of +\((\lambda, \rho(\lambda))\)-rdp mechanisms is +\((\epsilon, \exp(- k \kappa^*(\epsilon / k)))\)-dp, where +\(\kappa(t) := t \rho(t + 1)\). + +*Example 3 (Gaussian mechanism)*. We can apply the above claim to +Gaussian mechanism. Again, without loss of generality we assume +\(S_f = 1\). But let us do it manually to get the same results. If we +apply the Moment Composition Theorem to the an adaptive composition of +Gaussian mechanisms on the same query, then since each \(M_i\) is +\((\lambda, (2 \sigma^2)^{-1} \lambda)\)-rdp, the composition \(M\) is +\((\lambda, (2 \sigma^2)^{-1} k \lambda)\)-rdp. Processing this using the +Chernoff bound as in the previous example, we have + +\[\delta = \exp(- ((2 \sigma / \sqrt k)^{-1} - \epsilon \sigma / \sqrt k)^2 / 2),\] + +Substituting \(\sigma\) with \(\sigma / \sqrt k\) in (6.81), we conclude +that if + +\[\sigma > \sqrt k \left(\epsilon^{-1} \sqrt{2 \log \delta^{-1}} + (2 \epsilon)^{- {1 \over 2}}\right)\] + +then the composition \(M\) is \((\epsilon, \delta)\)-dp. + +As we will see in the discussions at the end of this post, this result +is different from (and probably better than) the one obtained by using +the Advanced Composition Theorem (Claim 18). + +*** Subsampling + :PROPERTIES: + :CUSTOM_ID: subsampling + :END: +We also have a subsampling theorem for the Rényi dp. + +*Claim 24*. Fix \(r \in [0, 1]\). Let \(m \le n\) be two nonnegative +integers with \(m = r n\). Let \(N\) be a \((\lambda, \rho)\)-rdp machanism on +\(X^m\). Let \(\mathcal I := \{J \subset [n]: |J| = m\}\) be the set of +subsets of \([n]\) of size \(m\). Define mechanism \(M\) on \(X^n\) by + +\[M(x) = N(x_\gamma)\] + +where \(\gamma\) is sampled uniformly from \(\mathcal I\). Then \(M\) is +\((\lambda, {1 \over \lambda - 1} \log (1 + r(e^{(\lambda - 1) \rho} - 1)))\)-rdp. + +To prove Claim 24, we need a useful lemma: + +*Claim 25*. Let \(p_{1 : n}\) and \(q_{1 : n}\) be nonnegative integers, and +\(\lambda > 1\). Then + +\[{(\sum p_i)^\lambda \over (\sum q_i)^{\lambda - 1}} \le \sum_i {p_i^\lambda \over q_i^{\lambda - 1}}. \qquad (8)\] + +*Proof*. Let + +\[r(i) := p_i / P, \qquad u(i) := q_i / Q\] + +where + +\[P := \sum p_i, \qquad Q := \sum q_i\] + +then \(r\) and \(u\) are probability mass functions. Plugging in +\(p_i = r(i) P\) and \(q_i = u(i) Q\) into the objective (8), it suffices to +show + +\[1 \le \sum_i {r(i)^\lambda \over u(i)^{\lambda - 1}} = \mathbb E_{\xi \sim u} \left({r(\xi) \over u(\xi)}\right)^\lambda\] + +This is true due to Jensen's Inequality: + +\[\mathbb E_{\xi \sim u} \left({r(\xi) \over u(\xi)}\right)^\lambda \ge \left(\mathbb E_{\xi \sim u} {r(\xi) \over u(\xi)} \right)^\lambda = 1.\] + +\(\square\) + +*Proof of Claim 24*. Define \(\mathcal I\) as before. + +Let \(p\) and \(q\) be the laws of \(M(x)\) and \(M(x')\) respectively. For any +\(I \in \mathcal I\), let \(p_I\) and \(q_I\) be the laws of \(N(x_I)\) and +\(N(x_I')\) respectively. Then we have + +$$\begin{aligned} +p(y) &= n^{-1} \sum_{I \in \mathcal I} p_I(y) \\ +q(y) &= n^{-1} \sum_{I \in \mathcal I} q_I(y), +\end{aligned}$$ + +where \(n = |\mathcal I|\). + +The MGF of \(L(p || q)\) is thus + +\[\mathbb E((\lambda - 1) L(p || q)) = n^{-1} \int {(\sum_I p_I(y))^\lambda \over (\sum_I q_I(y))^{\lambda - 1}} dy \le n^{-1} \sum_I \int {p_I(y)^\lambda \over q_I(y)^{\lambda - 1}} dy \qquad (9)\] + +where in the last step we used Claim 25. As in the proof of Claim 19, we +divide \(\mathcal I\) into disjoint sets \(\mathcal I_\in\) and +\(\mathcal I_\notin\). Furthermore we denote by \(n_\in\) and \(n_\notin\) +their cardinalities. Then the right hand side of (9) becomes + +\[n^{-1} \sum_{I \in \mathcal I_\in} \int {p_I(y)^\lambda \over q_I(y)^{\lambda - 1}} dy + n^{-1} \sum_{I \in \mathcal I_\notin} \int {p_I(y)^\lambda \over q_I(y)^{\lambda - 1}} dy\] + +The summands in the first are the MGF of \(L(p_I || q_I)\), and the +summands in the second term are \(1\), so + +$$\begin{aligned} +\mathbb E((\lambda - 1) L(p || q)) &\le n^{-1} \sum_{I \in \mathcal I_\in} \mathbb E \exp((\lambda - 1) L(p_I || q_I)) + (1 - r) \\ +&\le n^{-1} \sum_{I \in \mathcal I_\in} \exp((\lambda - 1) D_\lambda(p_I || q_I)) + (1 - r) \\ +&\le r \exp((\lambda - 1) \rho) + (1 - r). +\end{aligned}$$ + +Taking log and dividing by \((\lambda - 1)\) on both sides we have + +\[D_\lambda(p || q) \le (\lambda - 1)^{-1} \log (1 + r(\exp((\lambda - 1) \rho) - 1)).\] + +\(\square\) + +As before, we can rewrite the conclusion of Lemma 6 using +\(1 + z \le e^z\) and obtain +\((\lambda, (\lambda - 1)^{-1} r (e^{(\lambda - 1) \rho} - 1))\)-rdp, +which further gives \((\lambda, \alpha^{-1} (e^\alpha - 1) r \rho)\)-rdp +(or \((\lambda, O(r \rho))\)-rdp) if \((\lambda - 1) \rho < \alpha\) for +some \(\alpha\). + +It is not hard to see that the subsampling theorem in moment method, +even though similar to the results of that in the usual method, does not +help due to lack of an analogue of advanced composition theorem of the +moments. + +*Example 4 (Gaussian mechanism)*. Applying the moment subsampling +theorem to the Gaussian mechanism, we obtain +\((\lambda, O(r \lambda / \sigma^2))\)-rdp for a subsampled Gaussian +mechanism with rate \(r\). +Abadi-Chu-Goodfellow-McMahan-Mironov-Talwar-Zhang 2016 (ACGMMTZ16 in the +following), however, gains an extra \(r\) in the bound given certain +assumptions. + +** ACGMMTZ16 + :PROPERTIES: + :CUSTOM_ID: acgmmtz16 + :END: +What follows is my understanding of this result. I call it a conjecture +because there is a gap which I am not able to reproduce their proof or +prove it myself. This does not mean the result is false. On the +contrary, I am inclined to believe it is true. + +*Claim 26*. Assuming Conjecture 1 (see below) is true. For a subsampled +Gaussian mechanism with ratio \(r\), if \(r = O(\sigma^{-1})\) and +\(\lambda = O(\sigma^2)\), then we have +\((\lambda, O(r^2 \lambda / \sigma^2))\)-rdp. + +Wait, why is there a conjecture? Well, I have tried but not been able to +prove the following, which is a hidden assumption in the original proof: + +*Conjecture 1*. Let \(M\) be the Gaussian mechanism with subsampling rate +\(r\), and \(p\) and \(q\) be the laws of \(M(x)\) and \(M(x')\) respectively, +where \(d(x, x') = 1\). Then + +\[D_\lambda (p || q) \le D_\lambda (r \mu_1 + (1 - r) \mu_0 || \mu_0)\] + +where \(\mu_i = N(i, \sigma^2)\). + +*Remark*. Conjecture 1 is heuristically reasonable. To see this, let us +use the notations \(p_I\) and \(q_I\) to be \(q\) and \(p\) conditioned on the +subsampling index \(I\), just like in the proof of the subsampling +theorems (Claim 19 and 24). Then for \(I \in \mathcal I_\in\), + +\[D_\lambda(p_I || q_I) \le D_\lambda(\mu_0 || \mu_1),\] + +and for \(I \in \mathcal I_\notin\), + +\[D_\lambda(p_I || q_I) = 0 = D_\lambda(\mu_0 || \mu_0).\] + +Since we are taking an average over \(\mathcal I\), of which +\(r |\mathcal I|\) are in \(\mathcal I_\in\) and \((1 - r) |\mathcal I|\) are +in \(\mathcal I_\notin\), (9.3) says "the inequalities carry over +averaging". + +[[https://math.stackexchange.com/a/3152296/149540][A more general +version of Conjecture 1 has been proven false]]. The counter example for +the general version does not apply here, so it is still possible +Conjecture 1 is true. + +Let \(p_\in\) (resp. \(q_\in\)) be the average of \(p_I\) (resp. \(q_I\)) over +\(I \in \mathcal I_\in\), and \(p_\notin\) (resp. \(q_\notin\)) be the average +of \(p_I\) (resp. \(q_I\)) over \(I \in \mathcal I_\notin\). + +Immediately we have \(p_\notin = q_\notin\), hence + +\[D_\lambda(p_\notin || q_\notin) = 0 = D_\lambda(\mu_0 || \mu_0). \qquad(9.7)\] + +By Claim 25, we have + +\[D_\lambda(p_\in || q_\in) \le D_\lambda (\mu_1 || \mu_0). \qquad(9.9) \] + +So one way to prove Conjecture 1 is perhaps prove a more specialised +comparison theorem than the false conjecture: + +Given (9.7) and (9.9), show that + +\[D_\lambda(r p_\in + (1 - r) p_\notin || r q_\in + (1 - r) q_\notin) \le D_\lambda(r \mu_1 + (1 - r) \mu_0 || \mu_0).\] + +[End of Remark] + +#+begin_html + +#+end_html + +Recall the definition of \(G_\lambda\) under the definition of Rényi +differential privacy. The following Claim will be useful. + +*Claim 27*. Let \(\lambda\) be a positive integer, then + +\[G_\lambda(r p + (1 - r) q || q) = \sum_{k = 1 : \lambda} {\lambda \choose k} r^k (1 - r)^{\lambda - k} G_k(p || q).\] + +*Proof*. Quite straightforward, by expanding the numerator +\((r p + (1 - r) q)^\lambda\) using binomial expansion. \(\square\) + +*Proof of Claim 26*. By Conjecture 1, it suffices to prove the +following: + +If \(r \le c_1 \sigma^{-1}\) and \(\lambda \le c_2 \sigma^2\) for some +positive constant \(c_1\) and \(c_2\), then there exists \(C = C(c_1, c_2)\) +such that \(G_\lambda (r \mu_1 + (1 - r) \mu_0 || \mu_0) \le C\) (since +\(O(r^2 \lambda^2 / \sigma^2) = O(1)\)). + +*Remark in the proof*. Note that the choice of \(c_1\), \(c_2\) and the +function \(C(c_1, c_2)\) are important to the practicality and usefulness +of Claim 26. + +#+begin_html + +#+end_html + +Using Claim 27 and Example 1, we have + +$$\begin{aligned} +G_\lambda(r \mu_1 + (1 - r) \mu_0 || \mu_0)) &= \sum_{j = 0 : \lambda} {\lambda \choose j} r^j (1 - r)^{\lambda - j} G_j(\mu_1 || \mu_0)\\ +&=\sum_{j = 0 : \lambda} {\lambda \choose j} r^j (1 - r)^{\lambda - j} \exp(j (j - 1) / 2 \sigma^2). \qquad (9.5) +\end{aligned}$$ + +Denote by \(n = \lceil \sigma^2 \rceil\). It suffices to show + +\[\sum_{j = 0 : n} {n \choose j} (c_1 n^{- 1 / 2})^j (1 - c_1 n^{- 1 / 2})^{n - j} \exp(c_2 j (j - 1) / 2 n) \le C\] + +Note that we can discard the linear term \(- c_2 j / \sigma^2\) in the +exponential term since we want to bound the sum from above. + +We examine the asymptotics of this sum when \(n\) is large, and treat the +sum as an approximation to an integration of a function +\(\phi: [0, 1] \to \mathbb R\). For \(j = x n\), where \(x \in (0, 1)\), +\(\phi\) is thus defined as (note we multiply the summand with \(n\) to +compensate the uniform measure on \(1, ..., n\): + +$$\begin{aligned} +\phi_n(x) &:= n {n \choose j} (c_1 n^{- 1 / 2})^j (1 - c_1 n^{- 1 / 2})^{n - j} \exp(c_2 j^2 / 2 n) \\ +&= n {n \choose x n} (c_1 n^{- 1 / 2})^{x n} (1 - c_1 n^{- 1 / 2})^{(1 - x) n} \exp(c_2 x^2 n / 2) +\end{aligned}$$ + +Using Stirling's approximation + +\[n! \approx \sqrt{2 \pi n} n^n e^{- n},\] + +we can approach the binomial coefficient: + +\[{n \choose x n} \approx (\sqrt{2 \pi x (1 - x)} x^{x n} (1 - x)^{(1 - x) n})^{-1}.\] + +We also approximate + +\[(1 - c_1 n^{- 1 / 2})^{(1 - x) n} \approx \exp(- c_1 \sqrt{n} (1 - x)).\] + +With these we have + +\[\phi_n(x) \approx {1 \over \sqrt{2 \pi x (1 - x)}} \exp\left(- {1 \over 2} x n \log n + (x \log c_1 - x \log x - (1 - x) \log (1 - x) + {1 \over 2} c_2 x^2) n + {1 \over 2} \log n\right).\] + +This vanishes as \(n \to \infty\), and since \(\phi_n(x)\) is bounded above +by the integrable function \({1 \over \sqrt{2 \pi x (1 - x)}}\) (c.f. the +arcsine law), and below by \(0\), we may invoke the dominant convergence +theorem and exchange the integral with the limit and get + +$$\begin{aligned} +\lim_{n \to \infty} &G_n (r \mu_1 + (1 - r) \mu_0 || \mu_0)) \\ +&\le \lim_{n \to \infty} \int \phi_n(x) dx = \int \lim_{n \to \infty} \phi_n(x) dx = 0. +\end{aligned}$$ + +Thus we have that the generating function of the divergence variable +\(L(r \mu_1 + (1 - r) \mu_0 || \mu_0)\) is bounded. + +Can this be true for better orders + +\[r \le c_1 \sigma^{- d_r},\qquad \lambda \le c_2 \sigma^{d_\lambda}\] + +for some \(d_r \in (0, 1]\) and \(d_\lambda \in [2, \infty)\)? If we follow +the same approximation using these exponents, then letting +\(n = c_2 \sigma^{d_\lambda}\), + +$$\begin{aligned} +{n \choose j} &r^j (1 - r)^{n - j} G_j(\mu_0 || \mu_1) \le \phi_n(x) \\ +&\approx {1 \over \sqrt{2 \pi x (1 - x)}} \exp\left({1 \over 2} c_2^{2 \over d_\lambda} x^2 n^{2 - {2 \over d_\lambda}} - {d_r \over 2} x n \log n + (x \log c_1 - x \log x - (1 - x) \log (1 - x)) n + {1 \over 2} \log n\right). +\end{aligned}$$ + +So we see that to keep the divergence moments bounded it is possible to +have any \(r = O(\sigma^{- d_r})\) for \(d_r \in (0, 1)\), but relaxing +\(\lambda\) may not be safe. + +If we relax \(r\), then we get + +\[G_\lambda(r \mu_1 + (1 - r) \mu_0 || \mu_0) = O(r^{2 / d_r} \lambda^2 \sigma^{-2}) = O(1).\] + +Note that now the constant \(C\) depends on \(d_r\) as well. Numerical +experiments seem to suggest that \(C\) can increase quite rapidly as \(d_r\) +decreases from \(1\). \(\square\) + +In the following for consistency we retain \(k\) as the number of epochs, +and use \(T := k / r\) to denote the number of compositions / steps / +minibatches. With Claim 26 we have: + +*Claim 28*. Assuming Conjecture 1 is true. Let \(\epsilon, c_1, c_2 > 0\), +\(r \le c_1 \sigma^{-1}\), +\(T = {c_2 \over 2 C(c_1, c_2)} \epsilon \sigma^2\). then the DP-SGD with +subsampling rate \(r\), and \(T\) steps is \((\epsilon, \delta)\)-dp for + +\[\delta = \exp(- {1 \over 2} c_2 \sigma^2 \epsilon).\] + +In other words, for + +\[\sigma \ge \sqrt{2 c_2^{-1}} \epsilon^{- {1 \over 2}} \sqrt{\log \delta^{-1}},\] + +we can achieve \((\epsilon, \delta)\)-dp. + +*Proof*. By Claim 26 and the Moment Composition Theorem (Claim 22), for +\(\lambda = c_2 \sigma^2\), substituting +\(T = {c_2 \over 2 C(c_1, c_2)} \epsilon \sigma^2\), we have + +\[\mathbb P(L(p || q) \ge \epsilon) \le \exp(k C(c_1, c_2) - \lambda \epsilon) = \exp\left(- {1 \over 2} c_2 \sigma^2 \epsilon\right).\] + +\(\square\) + +*Remark*. Claim 28 is my understanding / version of Theorem 1 in +[ACGMMTZ16], by using the same proof technique. Here I quote the +original version of theorem with notions and notations altered for +consistency with this post: + +#+begin_quote + There exists constants \(c_1', c_2' > 0\) so that for any + \(\epsilon < c_1' r^2 T\), DP-SGD is \((\epsilon, \delta)\)-differentially + private for any \(\delta > 0\) if we choose +#+end_quote + +\[\sigma \ge c_2' {r \sqrt{T \log (1 / \delta)} \over \epsilon}. \qquad (10)\] + +I am however unable to reproduce this version, assuming Conjecture 1 is +true, for the following reasons: + +1. In the proof in the paper, we have \(\epsilon = c_1' r^2 T\) instead of + "less than" in the statement of the Theorem. If we change it to + \(\epsilon < c_1' r^2 T\) then the direction of the inequality becomes + opposite to the direction we want to prove: + \[\exp(k C(c_1, c_2) - \lambda \epsilon) \ge ...\] + +2. The condition \(r = O(\sigma^{-1})\) of Claim 26 whose result is used + in the proof of this theorem is not mentioned in the statement of the + proof. The implication is that (10) becomes an ill-formed condition + as the right hand side also depends on \(\sigma\). + +** Tensorflow implementation + :PROPERTIES: + :CUSTOM_ID: tensorflow-implementation + :END: +The DP-SGD is implemented in +[[https://github.com/tensorflow/privacy][TensorFlow Privacy]]. In the +following I discuss the package in the current state (2019-03-11). It is +divided into two parts: +[[https://github.com/tensorflow/privacy/tree/master/privacy/optimizers][=optimizers=]] +which implements the actual differentially private algorithms, and +[[https://github.com/tensorflow/privacy/tree/master/privacy/analysis][=analysis=]] +which computes the privacy guarantee. + +The =analysis= part implements a privacy ledger that "keeps a record of +all queries executed over a given dataset for the purpose of computing +privacy guarantees". On the other hand, all the computation is done in +[[https://github.com/tensorflow/privacy/blob/7e2d796bdee9b60dce21a82a397eefda35b0ac10/privacy/analysis/rdp_accountant.py][=rdp_accountant.py=]]. +At this moment, =rdp_accountant.py= only implements the computation of +the privacy guarantees for DP-SGD with Gaussian mechanism. In the +following I will briefly explain the code in this file. + +Some notational correspondences: their =alpha= is our \(\lambda\), their +=q= is our \(r\), their =A_alpha= (in the comments) is our +\(\kappa_{r N(1, \sigma^2) + (1 - r) N(0, \sigma^2)} (\lambda - 1)\), at +least when \(\lambda\) is an integer. + +- The function =_compute_log_a= presumably computes the cumulants + \(\kappa_{r N(1, \sigma^2) + (1 - r) N(0, \sigma^2), N(0, \sigma^2)}(\lambda - 1)\). + It calls =_compute_log_a_int= or =_compute_log_a_frac= depending on + whether \(\lambda\) is an integer. +- The function =_compute_log_a_int= computes the cumulant using (9.5). +- When \(\lambda\) is not an integer, we can't use (9.5). I have yet to + decode how =_compute_log_a_frac= computes the cumulant (or an upper + bound of it) in this case +- The function =_compute_delta= computes \(\delta\)s for a list of + \(\lambda\)s and \(\kappa\)s using Item 1 of Claim 25 and return the + smallest one, and the function =_compute_epsilon= computes epsilon + uses Item 3 in Claim 25 in the same way. + +In =optimizers=, among other things, the DP-SGD with Gaussian mechanism +is implemented in =dp_optimizer.py= and =gaussian_query.py=. See the +definition of =DPGradientDescentGaussianOptimizer= in =dp_optimizer.py= +and trace the calls therein. + +At this moment, the privacy guarantee computation part and the optimizer +part are separated, with =rdp_accountant.py= called in +=compute_dp_sgd_privacy.py= with user-supplied parameters. I think this +is due to the lack of implementation in =rdp_accountant.py= of any +non-DPSGD-with-Gaussian privacy guarantee computation. There is already +[[https://github.com/tensorflow/privacy/issues/23][an issue on this]], +so hopefully it won't be long before the privacy guarantees can be +automatically computed given a DP-SGD instance. + +** Comparison among different methods + :PROPERTIES: + :CUSTOM_ID: comparison-among-different-methods + :END: +So far we have seen three routes to compute the privacy guarantees for +DP-SGD with the Gaussian mechanism: + +1. Claim 9 (single Gaussian mechanism privacy guarantee) -> Claim 19 + (Subsampling theorem) -> Claim 18 (Advanced Adaptive Composition + Theorem) +2. Example 1 (RDP for the Gaussian mechanism) -> Claim 22 (Moment + Composition Theorem) -> Example 3 (Moment composition applied to the + Gaussian mechanism) +3. Claim 26 (RDP for Gaussian mechanism with specific magnitudes for + subsampling rate) -> Claim 28 (Moment Composition Theorem and + translation to conventional DP) + +Which one is the best? + +To make fair comparison, we may use one parameter as the metric and set +all others to be the same. For example, we can + +1. Given the same \(\epsilon\), \(r\) (in Route 1 and 3), \(k\), \(\sigma\), + compare the \(\delta\)s +2. Given the same \(\epsilon\), \(r\) (in Route 1 and 3), \(k\), \(\delta\), + compare the \(\sigma\)s +3. Given the same \(\delta\), \(r\) (in Route 1 and 3), \(k\), \(\sigma\), + compare the \(\epsilon\)s. + +I find that the first one, where \(\delta\) is used as a metric, the best. +This is because we have the tightest bounds and the cleanest formula +when comparing the \(\delta\). For example, the Azuma and Chernoff bounds +are both expressed as a bound for \(\delta\). On the other hand, the +inversion of these bounds either requires a cost in the tightness (Claim +9, bounds on \(\sigma\)) or in the complexity of the formula (Claim 16 +Advanced Adaptive Composition Theorem, bounds on \(\epsilon\)). + +So if we use \(\sigma\) or \(\epsilon\) as a metric, either we get a less +fair comparison, or have to use a much more complicated formula as the +bounds. + +Let us first compare Route 1 and Route 2 without specialising to the +Gaussian mechanism. + +*Warning*. What follows is a bit messy. + +Suppose each mechanism \(N_i\) satisfies +\((\epsilon', \delta(\epsilon'))\)-dp. Let +\(\tilde \epsilon := \log (1 + r (e^{\epsilon'} - 1))\), then we have the +subsampled mechanism \(M_i(x) = N_i(x_\gamma)\) is +\((\tilde \epsilon, r \tilde \delta(\tilde \epsilon))\)-dp, where + +\[\tilde \delta(\tilde \epsilon) = \delta(\log (r^{-1} (\exp(\tilde \epsilon) - 1) + 1))\] + +Using the Azuma bound in the proof of Advanced Adaptive Composition +Theorem (6.99): + +\[\mathbb P(L(p^k || q^k) \ge \epsilon) \le \exp(- {(\epsilon - r^{-1} k a(\tilde\epsilon))^2 \over 2 r^{-1} k (\tilde\epsilon + a(\tilde\epsilon))^2}).\] + +So we have the final bound for Route 1: + +\[\delta_1(\epsilon) = \min_{\tilde \epsilon: \epsilon > r^{-1} k a(\tilde \epsilon)} \exp(- {(\epsilon - r^{-1} k a(\tilde\epsilon))^2 \over 2 r^{-1} k (\tilde\epsilon + a(\tilde\epsilon))^2}) + k \tilde \delta(\tilde \epsilon).\] + +As for Route 2, since we do not gain anything from subsampling in RDP, +we do not subsample at all. + +By Claim 23, we have the bound for Route 2: + +\[\delta_2(\epsilon) = \exp(- k \kappa^* (\epsilon / k)).\] + +On one hand, one can compare \(\delta_1\) and \(\delta_2\) with numerical +experiments. On the other hand, if we further specify +\(\delta(\epsilon')\) in Route 1 as the Chernoff bound for the cumulants +of divergence variable, i.e. + +\[\delta(\epsilon') = \exp(- \kappa^* (\epsilon')),\] + +we have + +\[\delta_1 (\epsilon) = \min_{\tilde \epsilon: a(\tilde \epsilon) < r k^{-1} \epsilon} \exp(- {(\epsilon - r^{-1} k a(\tilde\epsilon))^2 \over 2 r^{-1} k (\tilde\epsilon + a(\tilde\epsilon))^2}) + k \exp(- \kappa^* (b(\tilde\epsilon))),\] + +where + +\[b(\tilde \epsilon) := \log (r^{-1} (\exp(\tilde \epsilon) - 1) + 1) \le r^{-1} \tilde\epsilon.\] + +We note that since +\(a(\tilde \epsilon) = \tilde\epsilon(e^{\tilde \epsilon} - 1) 1_{\tilde\epsilon < \log 2} + \tilde\epsilon 1_{\tilde\epsilon \ge \log 2}\), +we may compare the two cases separately. + +Note that we have \(\kappa^*\) is a monotonously increasing function, +therefore + +\[\kappa^* (b(\tilde\epsilon)) \le \kappa^*(r^{-1} \tilde\epsilon).\] + +So for \(\tilde \epsilon \ge \log 2\), we have + +\[k \exp(- \kappa^*(b(\tilde\epsilon))) \ge k \exp(- \kappa^*(r^{-1} \tilde \epsilon)) \ge k \exp(- \kappa^*(k^{-1} \epsilon)) \ge \delta_2(\epsilon).\] + +For \(\tilde\epsilon < \log 2\), it is harder to compare, as now + +\[k \exp(- \kappa^*(b(\tilde\epsilon))) \ge k \exp(- \kappa^*(\epsilon / \sqrt{r k})).\] + +It is tempting to believe that this should also be greater than +\(\delta_2(\epsilon)\). But I can not say for sure. At least in the +special case of Gaussian, we have + +\[k \exp(- \kappa^*(\epsilon / \sqrt{r k})) = k \exp(- (\sigma \sqrt{\epsilon / k r} - (2 \sigma)^{-1})^2) \ge \exp(- k ({\sigma \epsilon \over k} - (2 \sigma)^{-1})^2) = \delta_2(\epsilon)\] + +when \(\epsilon\) is sufficiently small. However we still need to consider +the case where \(\epsilon\) is not too small. But overall it seems most +likely Route 2 is superior than Route 1. + +So let us compare Route 2 with Route 3: + +Given the condition to obtain the Chernoff bound + +\[{\sigma \epsilon \over k} > (2 \sigma)^{-1}\] + +we have + +\[\delta_2(\epsilon) > \exp(- k (\sigma \epsilon / k)^2) = \exp(- \sigma^2 \epsilon^2 / k).\] + +For this to achieve the same bound + +\[\delta_3(\epsilon) = \exp\left(- {1 \over 2} c_2 \sigma^2 \epsilon\right)\] + +we need \(k < {2 \epsilon \over c_2}\). This is only possible if \(c_2\) is +small or \(\epsilon\) is large, since \(k\) is a positive integer. + +So taking at face value, Route 3 seems to achieve the best results. +However, it also has some similar implicit conditions that need to be +satisfied: First \(T\) needs to be at least \(1\), meaning + +\[{c_2 \over C(c_1, c_2)} \epsilon \sigma^2 \ge 1.\] + +Second, \(k\) needs to be at least \(1\) as well, i.e. + +\[k = r T \ge {c_1 c_2 \over C(c_1, c_2)} \epsilon \sigma \ge 1.\] + +Both conditions rely on the magnitudes of \(\epsilon\), \(\sigma\), \(c_1\), +\(c_2\), and the rate of growth of \(C(c_1, c_2)\). The biggest problem in +this list is the last, because if we know how fast \(C\) grows then we'll +have a better idea what are the constraints for the parameters to +achieve the result in Route 3. + +** Further questions + :PROPERTIES: + :CUSTOM_ID: further-questions + :END: +Here is a list of what I think may be interesting topics or potential +problems to look at, with no guarantee that they are all awesome +untouched research problems: + +1. Prove Conjecture 1 +2. Find a theoretically definitive answer whether the methods in Part 1 + or Part 2 yield better privacy guarantees. +3. Study the non-Gaussian cases, general or specific. Let \(p\) be some + probability density, what is the tail bound of + \(L(p(y) || p(y + \alpha))\) for \(|\alpha| \le 1\)? Can you find + anything better than Gaussian? For a start, perhaps the nice tables + of Rényi divergence in Gil-Alajaji-Linder 2013 may be useful? +4. Find out how useful Claim 26 is. Perhaps start with computing the + constant \(C\) nemerically. +5. Help with [[https://github.com/tensorflow/privacy/issues/23][the + aforementioned issue]] in the Tensorflow privacy package. + +** References + :PROPERTIES: + :CUSTOM_ID: references + :END: + +- Abadi, Martín, Andy Chu, Ian Goodfellow, H. Brendan McMahan, Ilya + Mironov, Kunal Talwar, and Li Zhang. "Deep Learning with Differential + Privacy." Proceedings of the 2016 ACM SIGSAC Conference on Computer + and Communications Security - CCS'16, 2016, 308--18. + [[https://doi.org/10.1145/2976749.2978318]]. +- Erven, Tim van, and Peter Harremoës. "R\'enyi Divergence and + Kullback-Leibler Divergence." IEEE Transactions on Information Theory + 60, no. 7 (July 2014): 3797--3820. + [[https://doi.org/10.1109/TIT.2014.2320500]]. +- Gil, M., F. Alajaji, and T. Linder. "Rényi Divergence Measures for + Commonly Used Univariate Continuous Distributions." Information + Sciences 249 (November 2013): 124--31. + [[https://doi.org/10.1016/j.ins.2013.06.018]]. +- Mironov, Ilya. "Renyi Differential Privacy." 2017 IEEE 30th Computer + Security Foundations Symposium (CSF), August 2017, 263--75. + [[https://doi.org/10.1109/CSF.2017.11]]. diff --git a/posts/blog.html b/posts/blog.html new file mode 100644 index 0000000..80176e7 --- /dev/null +++ b/posts/blog.html @@ -0,0 +1,21 @@ +#+TITLE: All posts + +- *[[file:sitemap.org][All posts]]* - 2021-06-17 +- *[[file:2019-03-14-great-but-manageable-expectations.org][Great but Manageable Expectations]]* - 2019-03-14 +- *[[file:2019-03-13-a-tail-of-two-densities.org][A Tail of Two Densities]]* - 2019-03-13 +- *[[file:2019-02-14-raise-your-elbo.org][Raise your ELBO]]* - 2019-02-14 +- *[[file:2019-01-03-discriminant-analysis.org][Discriminant analysis]]* - 2019-01-03 +- *[[file:2018-12-02-lime-shapley.org][Shapley, LIME and SHAP]]* - 2018-12-02 +- *[[file:2018-06-03-automatic_differentiation.org][Automatic differentiation]]* - 2018-06-03 +- *[[file:2018-04-10-update-open-research.org][Updates on open research]]* - 2018-04-29 +- *[[file:2017-08-07-mathematical_bazaar.org][The Mathematical Bazaar]]* - 2017-08-07 +- *[[file:2017-04-25-open_research_toywiki.org][Open mathematical research and launching toywiki]]* - 2017-04-25 +- *[[file:2016-10-13-q-robinson-schensted-knuth-polymer.org][A \(q\)-Robinson-Schensted-Knuth algorithm and a \(q\)-polymer]]* - 2016-10-13 +- *[[file:2015-07-15-double-macdonald-polynomials-macdonald-superpolynomials.org][AMS review of 'Double Macdonald polynomials as the stable limit of Macdonald superpolynomials' by Blondeau-Fournier, Lapointe and Mathieu]]* - 2015-07-15 +- *[[file:2015-07-01-causal-quantum-product-levy-area.org][On a causal quantum double product integral related to Lévy stochastic area.]]* - 2015-07-01 +- *[[file:2015-05-30-infinite-binary-words-containing-repetitions-odd-periods.org][AMS review of 'Infinite binary words containing repetitions of odd period' by Badkobeh and Crochemore]]* - 2015-05-30 +- *[[file:2015-04-02-juggling-skill-tree.org][jst]]* - 2015-04-02 +- *[[file:2015-04-01-unitary-double-products.org][Unitary causal quantum stochastic double products as universal]]* - 2015-04-01 +- *[[file:2015-01-20-weighted-interpretation-super-catalan-numbers.org][AMS review of 'A weighted interpretation for the super Catalan]]* - 2015-01-20 +- *[[file:2014-04-01-q-robinson-schensted-symmetry-paper.org][Symmetry property of \(q\)-weighted Robinson-Schensted algorithms and branching algorithms]]* - 2014-04-01 +- *[[file:2013-06-01-q-robinson-schensted-paper.org][A \(q\)-weighted Robinson-Schensted algorithm]]* - 2013-06-01 \ No newline at end of file -- cgit v1.2.3