From 5d0f3619de0d1fd48477164838bb9c07499ec82e Mon Sep 17 00:00:00 2001
From: Yuchen Pei <me@ypei.me>
Date: Thu, 14 Mar 2019 11:17:22 +0100
Subject: added a post

---
 ...2019-03-14-great-but-manageable-expectations.md | 752 +++++++++++++++++++++
 1 file changed, 752 insertions(+)
 create mode 100644 posts/2019-03-14-great-but-manageable-expectations.md

(limited to 'posts/2019-03-14-great-but-manageable-expectations.md')

diff --git a/posts/2019-03-14-great-but-manageable-expectations.md b/posts/2019-03-14-great-but-manageable-expectations.md
new file mode 100644
index 0000000..6c4e04e
--- /dev/null
+++ b/posts/2019-03-14-great-but-manageable-expectations.md
@@ -0,0 +1,752 @@
+---
+title: 'Great, but Manageable Expectations'
+date: 2019-03-14
+template: post
+comments: true
+---
+
+Let us continue with the study of differential privacy from [Part 1 of this post](2019-03-13-a-tail-of-two-densities.html).
+
+Rényi divergence and differential privacy 
+-----------------------------------------
+
+Recall in the proof of Gaussian mechanism privacy guarantee (Claim 8) we
+used the Chernoff bound for the Gaussian noise. Why not use the Chernoff
+bound for the divergence variable / privacy loss directly, since the
+latter is closer to the core subject than the noise? This leads us to
+the study of Rényi divergence.
+
+So far we have seen several notions of divergence used in differential
+privacy: the max divergence which is $\epsilon$-ind in disguise:
+
+$$D_\infty(p || q) := \max_y \log {p(y) \over q(y)},$$
+
+the $\delta$-approximate max divergence that defines the
+$(\epsilon, \delta)$-ind:
+
+$$D_\infty^\delta(p || q) := \max_y \log{p(y) - \delta \over q(y)},$$
+
+and the KL-divergence which is the expectation of the divergence
+variable:
+
+$$D(p || q) = \mathbb E L(p || q) = \int \log {p(y) \over q(y)} p(y) dy.$$
+
+The Rényi divergence is an interpolation between the max divergence and
+the KL-divergence, defined as the log moment generating function /
+cumulants of the divergence variable:
+
+$$D_\lambda(p || q) = (\lambda - 1)^{-1} \log \mathbb E \exp((\lambda - 1) L(p || q)) = (\lambda - 1)^{-1} \log \int {p(y)^\lambda \over q(y)^{\lambda - 1}} dy.$$
+
+Indeed, when $\lambda \to \infty$ we recover the max divergence, and
+when $\lambda \to 1$, by recognising $D_\lambda$ as a derivative in
+$\lambda$ at $\lambda = 1$, we recover the KL divergence. In this post
+we only consider $\lambda > 1$.
+
+Using the Rényi divergence we may define:
+
+**Definition (Rényi
+differential privacy)** (Mironov 2017). A mechanism $M$ is
+$(\lambda, \rho)$*-Rényi differentially private* ($(\lambda, \rho)$-rdp)
+if for all $x$ and $x'$ with distance $1$,
+
+$$D_\lambda(M(x) || M(x')) \le \rho.$$
+
+For convenience we also define two related notions, $G_\lambda (f || g)$
+and $\kappa_{f, g} (t)$ for $\lambda > 1$, $t > 0$ and positive
+functions $f$ and $g$:
+
+$$G_\lambda(f || g) = \int f(y)^{\lambda} g(y)^{1 - \lambda} dy; \qquad \kappa_{f, g} (t) = \log G_{t + 1}(f || g).$$
+
+For probability densities $p$ and $q$, $G_{t + 1}(p || q)$ and
+$\kappa_{p, q}(t)$ are the moment generating function and the cumulant
+generating function of the divergence variable $L(p || q)$ evaluated at $t$, and
+
+$$D_\lambda(p || q) = (\lambda - 1)^{-1} \kappa_{p, q}(\lambda - 1).$$
+
+In the following, whenever you see $t$, think of it as $\lambda - 1$.
+
+**Example 1 (RDP for Gaussian
+mechanism)**. Using the scaling and translation invariance of $L$ (6.1),
+we have that the divergence variable for two Gaussians with the same
+variance is
+
+$$L(N(\mu_1, \sigma^2) || N(\mu_2, \sigma^2)) \overset{d}{=} L(N(0, 1) || N((\mu_2 - \mu_1) / \sigma, 1)).$$
+
+With this we get
+
+$$D_\lambda(N(\mu_1, \sigma^2) || N(\mu_2, \sigma^2)) = {\lambda (\mu_2 - \mu_1)^2 \over 2 \sigma^2} = D_\lambda(N(\mu_2, \sigma^2) || N(\mu_1, \sigma^2)).$$
+
+Also due to the scaling invariance of $L$, we only need to consider $f$
+with sensitivity $1$, see the discussion under (6.1). The Gaussian
+mechanism on query $f$ is thus $(\lambda, \lambda / 2 \sigma^2)$-rdp for
+any $\lambda > 1$.
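+
+As a sanity check on the formula above, here is a sketch of the computation for the normalised pair $N(0, 1)$ and $N(\mu, 1)$, where the shorthand $\mu = (\mu_2 - \mu_1) / \sigma$ is introduced only for this calculation. Completing the square in the exponent gives
+
+$$\begin{aligned}
+G_\lambda(N(0, 1) || N(\mu, 1)) &= \int (2 \pi)^{-1/2} \exp\left(- {\lambda y^2 \over 2} - {(1 - \lambda) (y - \mu)^2 \over 2}\right) dy \\
+&= \exp\left({\lambda (\lambda - 1) \mu^2 \over 2}\right) \int (2 \pi)^{-1/2} \exp\left(- {(y - (1 - \lambda) \mu)^2 \over 2}\right) dy \\
+&= \exp\left({\lambda (\lambda - 1) \mu^2 \over 2}\right),
+\end{aligned}$$
+
+so that $D_\lambda = (\lambda - 1)^{-1} \log G_\lambda = \lambda \mu^2 / 2$, which is the displayed expression once $\mu$ is substituted back.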
+
+From the example of Gaussian mechanism, we see that the relation between
+$\lambda$ and $\rho$ is like that between $\epsilon$ and $\delta$. Given
+$\lambda$ (resp. $\rho$) and parameters like variance of the noise and
+the sensitivity of the query, we can write $\rho = \rho(\lambda)$ (resp.
+$\lambda = \lambda(\rho)$).
+
+Using the Chernoff bound (6.7), we can bound the divergence variable:
+
+$$\mathbb P(L(p || q) \ge \epsilon) \le {\mathbb E \exp(t L(p || q)) \over \exp(t \epsilon)} = \exp (\kappa_{p, q}(t) - \epsilon t). \qquad (7.7)$$
+
+For a function $f: I \to \mathbb R$, denote its Legendre transform by
+
+$$f^*(\epsilon) := \sup_{t \in I} (\epsilon t - f(t)).$$
+
+By taking infimum on the RHS of (7.7), we obtain
+
+**Claim 20**. Two probability densities $p$ and $q$ are
+$(\epsilon, \exp(-\kappa_{p, q}^*(\epsilon)))$-ind.
+
+Given a mechanism $M$, let $\kappa_M(t)$ denote an upper bound for the
+cumulant of its privacy loss:
+
+$$\log \mathbb E \exp(t L(M(x) || M(x'))) \le \kappa_M(t), \qquad \forall x, x'\text{ with } d(x, x') = 1.$$
+
+For example, we can set $\kappa_M(t) = t \rho(t + 1)$. Using the same
+argument we have the following:
+
+**Claim 21**.
+
+1.  If $M$ is $(\lambda, \rho)$-rdp, then it is also
+    $(\epsilon, \exp((\lambda - 1) (\rho - \epsilon)))$-dp for any
+    $\epsilon \ge \rho$.
+2.  Alternatively, $M$ is $(\epsilon, \exp(- \kappa_M^*(\epsilon)))$-dp
+    for any $\epsilon > 0$.
+3.  Alternatively, for any $0 < \delta \le 1$, $M$ is
+    $(\rho + (\lambda - 1)^{-1} \log \delta^{-1}, \delta)$-dp.
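+
+To see how Item 3 relates to Item 1, one can solve for $\epsilon$ (a short check that uses nothing beyond the statement of Item 1):
+
+$$\delta = \exp((\lambda - 1)(\rho - \epsilon)) \iff \epsilon = \rho + (\lambda - 1)^{-1} \log \delta^{-1},$$
+
+and $0 < \delta \le 1$ corresponds exactly to $\epsilon \ge \rho$.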
+
+**Example 2 (Gaussian mechanism)**.
+We can apply the above argument to the Gaussian mechanism on query $f$
+and get:
+
+$$\delta \le \inf_{\lambda > 1} \exp((\lambda - 1) ({\lambda \over 2 \sigma^2} - \epsilon))$$
+
+By assuming $\sigma^2 > (2 \epsilon)^{-1}$ we have that the infimum is
+achieved when $\lambda = (1 + 2 \epsilon \sigma^2) / 2$ and
+
+$$\delta \le \exp(- ((2 \sigma)^{-1} - \epsilon \sigma)^2 / 2)$$
+
+which is the same result as (6.8), obtained using the Chernoff bound of
+the noise.
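+
+For completeness, here is a sketch of the minimisation, with the names $h$ and $\lambda^*$ introduced only for this check. The exponent is a quadratic in $\lambda$,
+
+$$h(\lambda) := (\lambda - 1) \left({\lambda \over 2 \sigma^2} - \epsilon\right), \qquad h'(\lambda) = {2 \lambda - 1 \over 2 \sigma^2} - \epsilon,$$
+
+so $h'$ vanishes at $\lambda^* = (1 + 2 \epsilon \sigma^2) / 2$, which satisfies $\lambda^* > 1$ exactly when $\sigma^2 > (2 \epsilon)^{-1}$. Plugging in,
+
+$$h(\lambda^*) = \left(\epsilon \sigma^2 - {1 \over 2}\right) \left({1 \over 4 \sigma^2} - {\epsilon \over 2}\right) = - {1 \over 2} \left(\epsilon \sigma - (2 \sigma)^{-1}\right)^2.$$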
+
+However, as we will see later, compositions will yield different results
+from those obtained from methods in [Part 1](2019-03-13-a-tail-of-two-densities.html) when considering Rényi dp.
+
+**Claim 22 (Moment Composition
+Theorem)**. Let $M$ be the adaptive composition of $M_{1 : k}$. Suppose
+for any $y_{< i}$, $M_i(y_{< i})$ is $(\lambda, \rho)$-rdp. Then $M$ is
+$(\lambda, k\rho)$-rdp.
+
+**Proof**. Rather straightforward. As before let $p_i$ and
+$q_i$ be the conditional laws of adaptive composition of $M_{1 : i}$ at
+$x$ and $x'$ respectively, and $p^i$ and $q^i$ be the joint laws of
+$M_{1 : i}$ at $x$ and $x'$ respectively. Denote
+
+$$D_i = \mathbb E \exp((\lambda - 1)\log {p^i(\xi_{1 : i}) \over q^i(\xi_{1 : i})})$$
+
+Then
+
+$$\begin{aligned}
+D_i &= \mathbb E\mathbb E \left(\exp((\lambda - 1)\log {p_i(\xi_i | \xi_{< i}) \over q_i(\xi_i | \xi_{< i})}) \exp((\lambda - 1)\log {p^{i - 1}(\xi_{< i}) \over q^{i - 1}(\xi_{< i})}) \big| \xi_{< i}\right) \\
+&= \mathbb E \mathbb E \left(\exp((\lambda - 1)\log {p_i(\xi_i | \xi_{< i}) \over q_i(\xi_i | \xi_{< i})}) | \xi_{< i}\right) \exp\left((\lambda - 1)\log {p^{i - 1}(\xi_{< i}) \over q^{i - 1}(\xi_{< i})}\right)\\
+&\le \mathbb E \exp((\lambda - 1) \rho) \exp\left((\lambda - 1)\log {p^{i - 1}(\xi_{< i}) \over q^{i - 1}(\xi_{< i})}\right)\\
+&= \exp((\lambda - 1) \rho) D_{i - 1}.
+\end{aligned}$$
+
+Applying this recursively we have
+
+$$D_k \le \exp(k(\lambda - 1) \rho),$$
+
+and so
+
+$$(\lambda - 1)^{-1} \log \mathbb E \exp((\lambda - 1)\log {p^k(\xi_{1 : k}) \over q^k(\xi_{1 : k})}) = (\lambda - 1)^{-1} \log D_k \le k \rho.$$
+
+Since this holds for all $x$ and $x'$, we are done. $\square$
+
+This, together with the scaling property of the Legendre transformation:
+
+$$(k f)^*(x) = k f^*(x / k)$$
+
+yields
+
+**Claim 23**. The $k$-fold adaptive composition of
+$(\lambda, \rho(\lambda))$-rdp mechanisms is
+$(\epsilon, \exp(- k \kappa^*(\epsilon / k)))$-dp, where
+$\kappa(t) := t \rho(t + 1)$.
+
+**Example 3 (Gaussian mechanism)**.
+We can apply the above claim to Gaussian mechanism. Again, without loss
+of generality we assume $S_f = 1$. But let us do it manually to get the
+same results. If we apply the Moment Composition Theorem to an
+adaptive composition of Gaussian mechanisms on the same query, then
+since each $M_i$ is $(\lambda, (2 \sigma^2)^{-1} \lambda)$-rdp, the
+composition $M$ is $(\lambda, (2 \sigma^2)^{-1} k \lambda)$-rdp.
+Processing this using the Chernoff bound as in the previous example, we
+have
+
+$$\delta = \exp(- ((2 \sigma / \sqrt k)^{-1} - \epsilon \sigma / \sqrt k)^2 / 2).$$
+
+Substituting $\sigma$ with $\sigma / \sqrt k$ in (6.81), we conclude
+that if
+
+$$\sigma > \sqrt k \left(\epsilon^{-1} \sqrt{2 \log \delta^{-1}} + (2 \epsilon)^{- {1 \over 2}}\right)$$
+
+then the composition $M$ is $(\epsilon, \delta)$-dp.
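+
+As a cross-check (a sketch, with $\kappa(t) = t (t + 1) / 2 \sigma^2$ read off from Example 1 and assuming the maximiser below is positive), Claim 23 gives the same $\delta$. The supremum in the Legendre transform is attained at $t = \epsilon \sigma^2 - {1 \over 2}$, so
+
+$$\kappa^*(\epsilon) = \sup_{t > 0} \left(\epsilon t - {t (t + 1) \over 2 \sigma^2}\right) = {1 \over 2} \left(\epsilon \sigma - (2 \sigma)^{-1}\right)^2,$$
+
+and therefore
+
+$$\exp(- k \kappa^*(\epsilon / k)) = \exp\left(- {k \over 2} \left({\epsilon \sigma \over k} - (2 \sigma)^{-1}\right)^2\right) = \exp(- ((2 \sigma / \sqrt k)^{-1} - \epsilon \sigma / \sqrt k)^2 / 2).$$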
+
+As we will see in the discussions at the end of this post, this result
+is different from (and probably better than) the one obtained by using
+the Advanced Composition Theorem (Claim 18).
+
+We also have a subsampling theorem for the Rényi dp.
+
+**Claim 24**. Fix $r \in [0, 1]$. Let $m \le n$ be two
+nonnegative integers with $m = r n$. Let $N$ be a $(\lambda, \rho)$-rdp
+mechanism on $X^m$. Let $\mathcal I := \{J \subset [n]: |J| = m\}$ be
+the set of subsets of $[n]$ of size $m$. Define mechanism $M$ on $X^n$
+by
+
+$$M(x) = N(x_\gamma)$$
+
+where $\gamma$ is sampled uniformly from $\mathcal I$. Then $M$ is
+$(\lambda, {1 \over \lambda - 1} \log (1 + r(e^{(\lambda - 1) \rho} - 1)))$-rdp.
+
+To prove Claim 24, we need a useful lemma:
+
+**Claim 25**. Let $p_{1 : n}$ and $q_{1 : n}$ be
+nonnegative real numbers, and $\lambda > 1$. Then
+
+$${(\sum p_i)^\lambda \over (\sum q_i)^{\lambda - 1}} \le \sum_i {p_i^\lambda \over q_i^{\lambda - 1}}. \qquad (8)$$
+
+**Proof**. Let
+
+$$r(i) := p_i / P, \qquad u(i) := q_i / Q$$
+
+where
+
+$$P := \sum p_i, \qquad Q := \sum q_i$$
+
+then $r$ and $u$ are probability mass functions. Plugging in
+$p_i = r(i) P$ and $q_i = u(i) Q$ into the objective (8), it suffices to
+show
+
+$$1 \le \sum_i {r(i)^\lambda \over u(i)^{\lambda - 1}} = \mathbb E_{\xi \sim u} \left({r(\xi) \over u(\xi)}\right)^\lambda$$
+
+This is true due to Jensen\'s Inequality:
+
+$$\mathbb E_{\xi \sim u} \left({r(\xi) \over u(\xi)}\right)^\lambda \ge \left(\mathbb E_{\xi \sim u} {r(\xi) \over u(\xi)} \right)^\lambda = 1.$$
+
+$\square$
+
+**Proof of Claim 24**. Define $\mathcal I$ as
+before.
+
+Let $p$ and $q$ be the laws of $M(x)$ and $M(x')$ respectively. For any
+$I \in \mathcal I$, let $p_I$ and $q_I$ be the laws of $N(x_I)$ and
+$N(x_I')$ respectively. Then we have
+
+$$\begin{aligned}
+p(y) &= n^{-1} \sum_{I \in \mathcal I} p_I(y) \\
+q(y) &= n^{-1} \sum_{I \in \mathcal I} q_I(y),
+\end{aligned}$$
+
+where $n = |\mathcal I|$.
+
+The MGF of $L(p || q)$ is thus
+
+$$\mathbb E \exp((\lambda - 1) L(p || q)) = n^{-1} \int {(\sum_I p_I(y))^\lambda \over (\sum_I q_I(y))^{\lambda - 1}} dy \le n^{-1} \sum_I \int {p_I(y)^\lambda \over q_I(y)^{\lambda - 1}} dy \qquad (9)$$
+
+where in the last step we used Claim 25. As in the proof of Claim 19, we
+divide $\mathcal I$ into disjoint sets $\mathcal I_\in$ and
+$\mathcal I_\notin$. Furthermore we denote by $n_\in$ and $n_\notin$
+their cardinalities. Then the right hand side of (9) becomes
+
+$$n^{-1} \sum_{I \in \mathcal I_\in} \int {p_I(y)^\lambda \over q_I(y)^{\lambda - 1}} dy + n^{-1} \sum_{I \in \mathcal I_\notin} \int {p_I(y)^\lambda \over q_I(y)^{\lambda - 1}} dy$$
+
+The summands in the first term are the MGF of $L(p_I || q_I)$, and the
+summands in the second term are $1$, so
+
+$$\begin{aligned}
+\mathbb E \exp((\lambda - 1) L(p || q)) &\le n^{-1} \sum_{I \in \mathcal I_\in} \mathbb E \exp((\lambda - 1) L(p_I || q_I)) + (1 - r) \\
+&\le n^{-1} \sum_{I \in \mathcal I_\in} \exp((\lambda - 1) D_\lambda(p_I || q_I)) + (1 - r) \\
+&\le r \exp((\lambda - 1) \rho) + (1 - r).
+\end{aligned}$$
+
+Taking log and dividing by $(\lambda - 1)$ on both sides we have
+
+$$D_\lambda(p || q) \le (\lambda - 1)^{-1} \log (1 + r(\exp((\lambda - 1) \rho) - 1)).$$
+
+$\square$
+
+As before, we can rewrite the conclusion of Claim 24 using
+$1 + z \le e^z$ and obtain
+$(\lambda, (\lambda - 1)^{-1} r (e^{(\lambda - 1) \rho} - 1))$-rdp,
+which further gives $(\lambda, \alpha^{-1} (e^\alpha - 1) r \rho)$-rdp
+(or $(\lambda, O(r \rho))$-rdp) if $(\lambda - 1) \rho < \alpha$ for
+some $\alpha$.
+
+It is not hard to see that the subsampling theorem in the moment method,
+even though similar in form to its counterpart in the usual method, does
+not help, because the moments lack an analogue of the advanced
+composition theorem.
+
+**Example 4 (Gaussian mechanism)**.
+Applying the moment subsampling theorem to the Gaussian mechanism, we
+obtain $(\lambda, O(r \lambda / \sigma^2))$-rdp for a subsampled
+Gaussian mechanism with rate $r$.
+Abadi-Chu-Goodfellow-McMahan-Mironov-Talwar-Zhang 2016 (ACGMMTZ16 in the
+following), however, gains an extra $r$ in the bound given certain
+assumptions.
+
+ACGMMTZ16 
+---------
+
+What follows is my understanding of this result. I call it a hypothesis
+because there is a gap: I am able neither to reproduce their proof nor
+to prove it myself. This does not mean the result is false. On the
+contrary, I am inclined to believe it is true.
+
+**Hypothesis 0**. For a subsampled Gaussian mechanism
+with ratio $r$, if $r = O(\sigma^{-1})$ and $\lambda = O(\sigma^2)$,
+then we have $(\lambda, O(r^2 \lambda / \sigma^2))$-rdp.
+
+Recall the definition of $G_\lambda$ under the definition of Rényi
+differential privacy. The following Claim will be useful.
+
+**Claim 26**. Let $\lambda$ be a positive integer, then
+
+$$G_\lambda(r p + (1 - r) q || q) = \sum_{k = 0 : \lambda} {\lambda \choose k} r^k (1 - r)^{\lambda - k} G_k(p || q).$$
+
+**Proof**. Quite straightforward, by expanding the numerator
+$(r p + (1 - r) q)^\lambda$ using binomial expansion. $\square$
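+
+Spelled out (with $G_0(p || q) = \int q = 1$ for the $k = 0$ term), the expansion reads
+
+$$\int (r p + (1 - r) q)^\lambda q^{1 - \lambda} dy = \sum_{k = 0 : \lambda} {\lambda \choose k} r^k (1 - r)^{\lambda - k} \int p^k q^{\lambda - k} q^{1 - \lambda} dy = \sum_{k = 0 : \lambda} {\lambda \choose k} r^k (1 - r)^{\lambda - k} \int p^k q^{1 - k} dy.$$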
+
+**Proof of Hypothesis 0
+(Incomplete)**. I will break the proof into two parts and discuss each
+one.
+
+1.  The MGF of the privacy loss is bounded by that of
+    $L(r \mu_1 + (1 - r) \mu_0 || \mu_0)$ where
+    $\mu_i = N(i, \sigma^2)$.
+2.  If $r \le c_1 \sigma^{-1}$ and $\lambda \le c_2 \sigma^2$, then
+    there exists $C = C(c_1, c_2)$ such that
+    $G_\lambda (r \mu_1 + (1 - r) \mu_0 || \mu_0) \le C$ (since
+    $O(r^2 \lambda^2 / \sigma^2) = O(1)$).
+
+**Remark in the proof**. Note that the choice of
+$c_1$, $c_2$ and the function $C(c_1, c_2)$ are important to the
+practicality and usefulness of Hypothesis 0.
+
+Item 1 can be deduced from the following hypothesis, which is the only
+gap in the Proof of Hypothesis 0. It is heuristically reasonable, which
+is why I am inclined to believe Hypothesis 0 is true.
+
+**Hypothesis 1**. Let $p_i$, $q_i$, $\mu_i$, $\nu_i$ be
+probability densities on the same space for $i = 1 : n$. If
+$D_\lambda(p_i || q_i) \le D_\lambda(\mu_i || \nu_i)$ for all $i$, then
+
+$$D_\lambda(n^{-1} \sum_i p_i || n^{-1} \sum_i q_i) \le D_\lambda(n^{-1} \sum_i \mu_i || n^{-1} \sum_i \nu_i).$$
+
+Basically, it is saying \"if for each $i$, $p_i$ and $q_i$ are closer to
+each other than $\mu_i$ and $\nu_i$, then so is their average over
+$i$\".
+
+To see how Hypothesis 1 implies Item 1, note that
+
+$$D_\lambda(q_I || p_I) = D_\lambda(p_I || q_I) 
+\begin{cases}
+\le D_\lambda(\mu_0 || \mu_1) = D_\lambda(\mu_1 || \mu_0), & I \in \mathcal I_\in\\
+= D_\lambda(\mu_0 || \mu_0) = D_\lambda(\mu_1 || \mu_1) = 0 & I \in \mathcal I_\notin
+\end{cases}$$
+
+and that $p = |\mathcal I|^{-1} \sum_{I \in \mathcal I} p_I$ and
+$q = |\mathcal I|^{-1} \sum_{I \in \mathcal I} q_I$ and
+$|\mathcal I_\in| = r |\mathcal I|$.
+
+Alternatively, one may try to prove a weaker version of Hypothesis 1, by
+specialising on mixture of Gaussians.
+
+One way one may try to prove Hypothesis 1 is to prove an equivalent
+statement:
+
+**Hypothesis 2**. Let $p_1$, $q_1$, $p_2$, $q_2$,
+$\mu_1$, $\mu_2$, $\nu_1$, $\nu_2$ be probability densities, and suppose
+
+$$\int {p_i^\lambda \over q_i^{\lambda - 1}} \le \int {\mu_i^\lambda \over \nu_i^{\lambda - 1}}, \qquad i = 1, 2$$
+
+then
+
+$$\int {(p_1 + p_2)^\lambda \over (q_1 + q_2)^{\lambda - 1}} \le \int {(\mu_1 + \mu_2)^\lambda \over (\nu_1 + \nu_2)^{\lambda - 1}}.$$
+
+Indeed, on one hand, Hypothesis 2 is a special case of Hypothesis 1 and
+on the other hand, given Hypothesis 2, Hypothesis 1 can be shown using
+induction by replacing one pair of densities at a time.
+
+Now let us prove Item 2.
+
+Using Claim 26 and Example 1, we have
+
+$$\begin{aligned}
+G_\lambda(r \mu_1 + (1 - r) \mu_0 || \mu_0) &= \sum_{j = 0 : \lambda} {\lambda \choose j} r^j (1 - r)^{\lambda - j} G_j(\mu_1 || \mu_0)\\
+&=\sum_{j = 0 : \lambda} {\lambda \choose j} r^j (1 - r)^{\lambda - j} \exp(j (j - 1) / 2 \sigma^2). \qquad (9.5)
+\end{aligned}$$
+
+Denote by $n = \lceil \sigma^2 \rceil$. It suffices to show
+
+$$\sum_{j = 0 : n} {n \choose j} (c_1 n^{- 1 / 2})^j (1 - c_1 n^{- 1 / 2})^{n - j} \exp(c_2 j (j - 1) / 2 n) \le C$$
+
+Note that we can discard the linear term $- c_2 j / \sigma^2$ in the
+exponential term since we want to bound the sum from above.
+
+We examine the asymptotics of this sum when $n$ is large, and treat the
+sum as an approximation to an integration of a function
+$\phi: [0, 1] \to \mathbb R$. For $j = x n$, where $x \in (0, 1)$,
+$\phi$ is thus defined as (note we multiply the summand with $n$ to
+compensate the uniform measure on $1, ..., n$):
+
+$$\begin{aligned}
+\phi_n(x) &:= n {n \choose j} (c_1 n^{- 1 / 2})^j (1 - c_1 n^{- 1 / 2})^{n - j} \exp(c_2 j^2 / 2 n) \\
+&= n {n \choose x n} (c_1 n^{- 1 / 2})^{x n} (1 - c_1 n^{- 1 / 2})^{(1 - x) n} \exp(c_2 x^2 n / 2)
+\end{aligned}$$
+
+Using Stirling\'s approximation
+
+$$n! \approx \sqrt{2 \pi n} n^n e^{- n},$$
+
+we can approximate the binomial coefficient:
+
+$${n \choose x n} \approx (\sqrt{2 \pi n x (1 - x)} x^{x n} (1 - x)^{(1 - x) n})^{-1}.$$
+
+We also approximate
+
+$$(1 - c_1 n^{- 1 / 2})^{(1 - x) n} \approx \exp(- c_1 \sqrt{n} (1 - x)).$$
+
+With these we have
+
+$$\phi_n(x) \approx {1 \over \sqrt{2 \pi x (1 - x)}} \exp\left(- {1 \over 2} x n \log n + (x \log c_1 - x \log x - (1 - x) \log (1 - x) + {1 \over 2} c_2 x^2) n + {1 \over 2} \log n\right).$$
+
+This vanishes as $n \to \infty$, and since $\phi_n(x)$ is bounded above
+by the integrable function ${1 \over \sqrt{2 \pi x (1 - x)}}$ (cf. the
+arcsine law), and below by $0$, we may invoke the dominated convergence
+theorem and exchange the integral with the limit and get
+
+$$\begin{aligned}
+\lim_{n \to \infty} &G_n (r \mu_1 + (1 - r) \mu_0 || \mu_0) \\
+&\le \lim_{n \to \infty} \int \phi_n(x) dx = \int \lim_{n \to \infty} \phi_n(x) dx = 0.
+\end{aligned}$$
+
+Thus we have that the generating function of the divergence variable
+$L(r \mu_1 + (1 - r) \mu_0 || \mu_0)$ is bounded.
+
+Can this be true for better orders
+
+$$r \le c_1 \sigma^{- d_r},\qquad \lambda \le c_2 \sigma^{d_\lambda}$$
+
+for some $d_r \in (0, 1]$ and $d_\lambda \in [2, \infty)$? If we follow
+the same approximation using these exponents, then letting
+$n = c_2 \sigma^{d_\lambda}$,
+
+$$\begin{aligned}
+{n \choose j} &r^j (1 - r)^{n - j} G_j(\mu_0 || \mu_1) \le \phi_n(x) \\
+&\approx {1 \over \sqrt{2 \pi x (1 - x)}} \exp\left({1 \over 2} c_2^{2 \over d_\lambda} x^2 n^{2 - {2 \over d_\lambda}} - {d_r \over 2} x n \log n + (x \log c_1 - x \log x - (1 - x) \log (1 - x)) n + {1 \over 2} \log n\right).
+\end{aligned}$$
+
+So we see that to keep the divergence moments bounded it is possible to
+have any $r = O(\sigma^{- d_r})$ for $d_r \in (0, 1)$, but relaxing
+$\lambda$ may not be safe.
+
+If we relax $r$, then we get
+
+$$G_\lambda(r \mu_1 + (1 - r) \mu_0 || \mu_0) = O(r^{2 / d_r} \lambda^2 \sigma^{-2}) = O(1).$$
+
+Note that now the constant $C$ depends on $d_r$ as well. Numerical
+experiments seem to suggest that $C$ can increase quite rapidly as $d_r$
+decreases from $1$. $\square$
+
+In the following for consistency we retain $k$ as the number of epochs,
+and use $T := k / r$ to denote the number of compositions / steps /
+minibatches. With Hypothesis 0 we have:
+
+**Hypothesis 3**. Let $\epsilon, c_1, c_2 > 0$,
+$r \le c_1 \sigma^{-1}$,
+$T = {c_2 \over 2 C(c_1, c_2)} \epsilon \sigma^2$. Then the DP-SGD with
+subsampling rate $r$, and $T$ steps is $(\epsilon, \delta)$-dp for
+
+$$\delta = \exp(- {1 \over 2} c_2 \sigma^2 \epsilon).$$
+
+In other words, for
+
+$$\sigma \ge \sqrt{2 c_2^{-1}} \epsilon^{- {1 \over 2}} \sqrt{\log \delta^{-1}},$$
+
+we can achieve $(\epsilon, \delta)$-dp.
+
+**Proof**. By Hypothesis 0 and the Moment Composition Theorem
+(Claim 22), for $\lambda = c_2 \sigma^2$, substituting
+$T = {c_2 \over 2 C(c_1, c_2)} \epsilon \sigma^2$, we have
+
+$$\mathbb P(L(p || q) \ge \epsilon) \le \exp(T C(c_1, c_2) - \lambda \epsilon) = \exp\left(- {1 \over 2} c_2 \sigma^2 \epsilon\right).$$
+
+$\square$
+
+**Remark**. Hypothesis 3 is my understanding / version of
+Theorem 1 in \[ACGMMTZ16\], by using the same proof technique. Here I
+quote the original version of theorem with notions and notations altered
+for consistency with this post:
+
+> There exist constants $c_1', c_2' > 0$ so that for any
+> $\epsilon < c_1' r^2 T$, DP-SGD is $(\epsilon, \delta)$-differentially
+> private for any $\delta > 0$ if we choose
+
+$$\sigma \ge c_2' {r \sqrt{T \log (1 / \delta)} \over \epsilon}. \qquad (10)$$
+
+I am however unable to reproduce this version, assuming Hypothesis 0 is
+true, for the following reasons:
+
+1.  In the proof in the paper, we have $\epsilon = c_1' r^2 T$ instead
+    of \"less than\" in the statement of the Theorem. If we change it to
+    $\epsilon < c_1' r^2 T$ then the direction of the inequality becomes
+    opposite to the direction we want to prove:
+    $$\exp(T C(c_1, c_2) - \lambda \epsilon) \ge ...$$
+
+2.  The implicit condition $r = O(\sigma^{-1})$ of Hypothesis 0 whose
+    result is used in the proof of this theorem is not mentioned in the
+    statement of the theorem. The implication is that (10) becomes an
+    ill-formed condition as the right hand side also depends on
+    $\sigma$.
+
+TensorFlow implementation
+-------------------------
+
+The DP-SGD is implemented in [TensorFlow
+Privacy](https://github.com/tensorflow/privacy). In the following I
+discuss the package in the current state (2019-03-11). It is divided
+into two parts: `optimizers` which implements the actual differentially
+private algorithm, and `analysis` which computes the privacy guarantee.
+
+The `analysis` part implements a privacy ledger that \"keeps a record
+of all queries executed over a given dataset for the purpose of
+computing privacy guarantees\". On the other hand, all the computation
+is done in `rdp_accountant.py`
+([link](https://github.com/tensorflow/privacy/blob/7e2d796bdee9b60dce21a82a397eefda35b0ac10/privacy/analysis/rdp_accountant.py)).
+At this moment, `rdp_accountant.py` only implements the computation of
+the privacy guarantees for DP-SGD with Gaussian mechanism. In the
+following I will briefly explain the code in this file.
+
+Some notational correspondences: their `alpha` is our $\lambda$, their
+`q` is our $r$, their `A_alpha` (in the comments) is our
+$\kappa_{r N(1, \sigma^2) + (1 - r) N(0, \sigma^2), N(0, \sigma^2)} (\lambda - 1)$, at
+least when $\lambda$ is an integer.
+
+-   The function `_compute_log_a` presumably computes the cumulants
+    $\kappa_{r N(1, \sigma^2) + (1 - r) N(0, \sigma^2), N(0, \sigma^2)}(\lambda - 1)$.
+    It calls `_compute_log_a_int` or `_compute_log_a_frac` depending on
+    whether $\lambda$ is an integer.
+-   The function `_compute_log_a_int` computes the cumulant using (9.5).
+-   When $\lambda$ is not an integer, we can\'t use (9.5). I have yet to
+    decode how `_compute_log_a_frac` computes the cumulant (or an upper
+    bound of it) in this case.
+-   The function `_compute_delta` computes $\delta$s for a list of
+    $\lambda$s and $\kappa$s using Item 1 of Claim 21 and returns the
+    smallest one, and the function `_compute_epsilon` computes
+    $\epsilon$ using Item 3 in the same way.
+
+In `optimizers`, among other things, the DP-SGD with Gaussian mechanism
+is implemented in `dp_optimizer.py` and `gaussian_query.py`. See the
+definition of `DPGradientDescentGaussianOptimizer` in `dp_optimizer.py`
+and trace the calls therein.
+
+At this moment, the privacy guarantee computation part and the optimizer
+part are separated, with `rdp_accountant.py` called in
+`compute_dp_sgd_privacy.py` with user-supplied parameters. I think this
+is due to the lack of implementation in `rdp_accountant.py` of any
+non-DPSGD-with-Gaussian privacy guarantee computation. There is already
+[an issue on this](https://github.com/tensorflow/privacy/issues/23), so
+hopefully it won\'t be long before the privacy guarantees can be
+automatically computed given a DP-SGD instance.
+
+Comparison among different methods 
+----------------------------------
+
+So far we have seen three routes to compute the privacy guarantees for
+DP-SGD with the Gaussian mechanism:
+
+1.  Claim 9 (single Gaussian mechanism privacy guarantee) -\> Claim 19
+    (Subsampling theorem) -\> Claim 18 (Advanced Adaptive Composition
+    Theorem)
+2.  Example 1 (RDP for the Gaussian mechanism) -\> Claim 22 (Moment
+    Composition Theorem) -\> Example 3 (Moment composition applied to
+    the Gaussian mechanism)
+3.  Hypothesis 0 (RDP for Gaussian mechanism with specific magnitudes
+    for subsampling rate) -\> Hypothesis 3 (Moment Composition Theorem
+    and translation to conventional DP)
+
+Which one is the best?
+
+To make fair comparison, we may use one parameter as the metric and set
+all others to be the same. For example, we can
+
+1.  Given the same $\epsilon$, $r$ (in Route 1 and 3), $k$, $\sigma$,
+    compare the $\delta$s
+2.  Given the same $\epsilon$, $r$ (in Route 1 and 3), $k$, $\delta$,
+    compare the $\sigma$s
+3.  Given the same $\delta$, $r$ (in Route 1 and 3), $k$, $\sigma$,
+    compare the $\epsilon$s.
+
+I find the first one, where $\delta$ is used as a metric, the best.
+This is because we have the tightest bounds and the cleanest formulas
+when comparing the $\delta$s. For example, the Azuma and Chernoff bounds
+are both expressed as a bound for $\delta$. On the other hand, the
+inversion of these bounds either requires a cost in the tightness (Claim
+9, bounds on $\sigma$) or in the complexity of the formula (Claim 18,
+Advanced Adaptive Composition Theorem, bounds on $\epsilon$).
+
+So if we use $\sigma$ or $\epsilon$ as a metric, either we get a less
+fair comparison, or have to use a much more complicated formula as the
+bounds.
+
+Let us first compare Route 1 and Route 2 without specialising to the
+Gaussian mechanism.
+
+**Disclaimer**. What follows is my original research and
+has not been reviewed by anyone.
+
+Suppose each mechanism $N_i$ satisfies
+$(\epsilon', \delta(\epsilon'))$-dp. Let
+$\tilde \epsilon := \log (1 + r (e^{\epsilon'} - 1))$, then we have the
+subsampled mechanism $M_i(x) = N_i(x_\gamma)$ is
+$(\tilde \epsilon, r \tilde \delta(\tilde \epsilon))$-dp, where
+
+$$\tilde \delta(\tilde \epsilon) = \delta(\log (r^{-1} (\exp(\tilde \epsilon) - 1) + 1))$$
+
+Using the Azuma bound in the proof of Advanced Adaptive Composition
+Theorem (6.99):
+
+$$\mathbb P(L(p^k || q^k) \ge \epsilon) \le \exp(- {(\epsilon - r^{-1} k a(\tilde\epsilon))^2 \over 2 r^{-1} k (\tilde\epsilon + a(\tilde\epsilon))^2}).$$
+
+So we have the final bound for Route 1:
+
+$$\delta_1(\epsilon) = \min_{\tilde \epsilon: \epsilon > r^{-1} k a(\tilde \epsilon)} \exp(- {(\epsilon - r^{-1} k a(\tilde\epsilon))^2 \over 2 r^{-1} k (\tilde\epsilon + a(\tilde\epsilon))^2}) + k \tilde \delta(\tilde \epsilon).$$
+
+As for Route 2, since we do not gain anything from subsampling in RDP,
+we do not subsample at all.
+
+By Claim 23, we have the bound for Route 2:
+
+$$\delta_2(\epsilon) = \exp(- k \kappa^* (\epsilon / k)).$$
+
+On one hand, one can compare $\delta_1$ and $\delta_2$ with numerical
+experiments. On the other hand, if we further specify
+$\delta(\epsilon')$ in Route 1 as the Chernoff bound for the cumulants
+of divergence variable, i.e.
+
+$$\delta(\epsilon') = \exp(- \kappa^* (\epsilon')),$$
+
+we have
+
+$$\delta_1 (\epsilon) = \min_{\tilde \epsilon: a(\tilde \epsilon) < r k^{-1} \epsilon} \exp(- {(\epsilon - r^{-1} k a(\tilde\epsilon))^2 \over 2 r^{-1} k (\tilde\epsilon + a(\tilde\epsilon))^2}) + k \exp(- \kappa^* (b(\tilde\epsilon))),$$
+
+where
+
+$$b(\tilde \epsilon) := \log (r^{-1} (\exp(\tilde \epsilon) - 1) + 1) \le r^{-1} \tilde\epsilon.$$
+
+We note that since
+$a(\tilde \epsilon) = \tilde\epsilon(e^{\tilde \epsilon} - 1) 1_{\tilde\epsilon < \log 2} + \tilde\epsilon 1_{\tilde\epsilon \ge \log 2}$,
+we may compare the two cases separately.
+
+Note that $\kappa^*$ is a monotonically increasing function,
+therefore
+
+$$\kappa^* (b(\tilde\epsilon)) \le \kappa^*(r^{-1} \tilde\epsilon).$$
+
+So for $\tilde \epsilon \ge \log 2$, we have
+
+$$k \exp(- \kappa^*(b(\tilde\epsilon))) \ge k \exp(- \kappa^*(r^{-1} \tilde \epsilon)) \ge k \exp(- \kappa^*(k^{-1} \epsilon)) \ge \delta_2(\epsilon).$$
+
+For $\tilde\epsilon < \log 2$, it is harder to compare, as now
+
+$$k \exp(- \kappa^*(b(\tilde\epsilon))) \ge k \exp(- \kappa^*(\epsilon / \sqrt{r k})).$$
+
+It is tempting to believe that this should also be greater than
+$\delta_2(\epsilon)$. But I cannot say for sure. At least in the
+special case of Gaussian, we have
+
+$$k \exp(- \kappa^*(\epsilon / \sqrt{r k})) = k \exp(- (\sigma \sqrt{\epsilon / k r} - (2 \sigma)^{-1})^2) \ge \exp(- k ({\sigma \epsilon \over k} - (2 \sigma)^{-1})^2) = \delta_2(\epsilon)$$
+
+when $\epsilon$ is sufficiently small. However we still need to consider
+the case where $\epsilon$ is not too small. But overall it seems most
+likely Route 2 is superior to Route 1.
+
+So let us compare Route 2 with Route 3:
+
+Given the condition to obtain the Chernoff bound
+
+$${\sigma \epsilon \over k} > (2 \sigma)^{-1}$$
+
+we have
+
+$$\delta_2(\epsilon) > \exp(- k (\sigma \epsilon / k)^2) = \exp(- \sigma^2 \epsilon^2 / k).$$
+
+For this to achieve the same bound
+
+$$\delta_3(\epsilon) = \exp\left(- {1 \over 2} c_2 \sigma^2 \epsilon\right)$$
+
+we need $k < {2 \epsilon \over c_2}$. This is only possible if $c_2$ is
+small or $\epsilon$ is large, since $k$ is a positive integer.
+
+So taken at face value, Route 3 seems to achieve the best results.
+However, it also has some similar implicit conditions that need to be
+satisfied: First $T$ needs to be at least $1$, meaning
+
+$${c_2 \over 2 C(c_1, c_2)} \epsilon \sigma^2 \ge 1.$$
+
+Second, $k$ needs to be at least $1$ as well, i.e.
+
+$$1 \le k = r T \le {c_1 c_2 \over 2 C(c_1, c_2)} \epsilon \sigma.$$
+
+Both conditions rely on the magnitudes of $\epsilon$, $\sigma$, $c_1$,
+$c_2$, and the rate of growth of $C(c_1, c_2)$. The biggest problem in
+this list is the last, because if we know how fast $C$ grows then we\'ll
+have a better idea of what the constraints on the parameters are for
+achieving the result in Route 3.
+
+Further questions 
+-----------------
+
+Here is a list of what I think may be interesting topics or potential
+problems to look at, with no guarantee that they are all awesome
+untouched research problems:
+
+1.  Prove Hypothesis 2
+2.  Find a theoretically definitive answer whether the methods in Part 1
+    or Part 2 yield better privacy guarantees.
+3.  Study the non-Gaussian cases, general or specific. Let $p$ be some
+    probability density, what is the tail bound of
+    $L(p(y) || p(y + \alpha))$ for $|\alpha| \le 1$? Can you find
+    anything better than Gaussian? For a start, perhaps the nice tables
+    of Rényi divergence in Gil-Alajaji-Linder 2013 may be useful?
+4.  Find out how useful Hypothesis 0 is. Perhaps start with computing
+    the constant $C$ numerically.
+5.  Help with [the aforementioned
+    issue](https://github.com/tensorflow/privacy/issues/23) in the
+    TensorFlow Privacy package.
+
+References 
+----------
+
+-   Abadi, Martín, Andy Chu, Ian Goodfellow, H. Brendan McMahan, Ilya
+    Mironov, Kunal Talwar, and Li Zhang. "Deep Learning with
+    Differential Privacy." Proceedings of the 2016 ACM SIGSAC Conference
+    on Computer and Communications Security - CCS'16, 2016, 308--18.
+    <https://doi.org/10.1145/2976749.2978318>.
+-   Erven, Tim van, and Peter Harremoës. "Rényi Divergence and
+    Kullback-Leibler Divergence." IEEE Transactions on Information
+    Theory 60, no. 7 (July 2014): 3797--3820.
+    <https://doi.org/10.1109/TIT.2014.2320500>.
+-   Gil, M., F. Alajaji, and T. Linder. "Rényi Divergence Measures for
+    Commonly Used Univariate Continuous Distributions." Information
+    Sciences 249 (November 2013): 124--31.
+    <https://doi.org/10.1016/j.ins.2013.06.018>.
+-   Mironov, Ilya. "Renyi Differential Privacy." 2017 IEEE 30th Computer
+    Security Foundations Symposium (CSF), August 2017, 263--75.
+    <https://doi.org/10.1109/CSF.2017.11>.
-- 
cgit v1.2.3