#+title: Great but Manageable Expectations

#+date: <2019-03-14>

This is Part 2 of a two-part blog post on differential privacy.
Continuing from [[file:2019-03-13-a-tail-of-two-densities.org][Part 1]],
I discuss Rényi differential privacy, which corresponds to the Rényi
divergence, and study the
[[https://en.wikipedia.org/wiki/Moment-generating_function][moment
generating functions]] of the divergence between probability measures
in order to derive tail bounds.

Like in Part 1, I prove a composition theorem and a subsampling theorem.

I also attempt to reproduce a seemingly better moment bound for the
Gaussian mechanism with subsampling, with one intermediate step which I
am not able to prove.

After that I explain the Tensorflow implementation of differential
privacy in its
[[https://github.com/tensorflow/privacy/tree/master/privacy][Privacy]]
module, which focuses on the differentially private stochastic gradient
descent algorithm (DP-SGD).

Finally I use the results from both Part 1 and Part 2 to obtain some
privacy guarantees for composed subsampling queries in general, and for
DP-SGD in particular. I also compare these privacy guarantees.

/If you are confused by any notations, ask me or try
[[/notations.html][this]]./

** Rényi divergence and differential privacy
   :PROPERTIES:
   :CUSTOM_ID: rényi-divergence-and-differential-privacy
   :END:
Recall that in the proof of the Gaussian mechanism privacy guarantee
(Claim 8) we used the Chernoff bound for the Gaussian noise. Why not use
the Chernoff bound for the divergence variable / privacy loss directly,
since the latter is closer to the core subject than the noise? This
leads us to the study of Rényi divergence.

So far we have seen several notions of divergence used in differential
privacy: the max divergence, which is \(\epsilon\)-ind in disguise:

\[D_\infty(p || q) := \max_y \log {p(y) \over q(y)},\]

the \(\delta\)-approximate max divergence that defines
\((\epsilon, \delta)\)-ind:

\[D_\infty^\delta(p || q) := \max_y \log{p(y) - \delta \over q(y)},\]

and the KL-divergence, which is the expectation of the divergence
variable:

\[D(p || q) = \mathbb E L(p || q) = \int \log {p(y) \over q(y)} p(y) dy.\]

The Rényi divergence is an interpolation between the max divergence and
the KL-divergence, defined as the log moment generating function /
cumulants of the divergence variable:

\[D_\lambda(p || q) = (\lambda - 1)^{-1} \log \mathbb E \exp((\lambda - 1) L(p || q)) = (\lambda - 1)^{-1} \log \int {p(y)^\lambda \over q(y)^{\lambda - 1}} dy.\]

Indeed, when \(\lambda \to \infty\) we recover the max divergence, and
when \(\lambda \to 1\), by recognising \(D_\lambda\) as a derivative in
\(\lambda\) at \(\lambda = 1\), we recover the KL-divergence. In this
post we only consider \(\lambda > 1\).
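To get a concrete feel for the definition, here is a small numerical
sketch (my own code, not from the original post; the function name and
the integration grid are arbitrary) that computes \(D_\lambda\) between
two Gaussians with the same variance by direct integration and compares
it with the closed form \(\lambda (\mu_2 - \mu_1)^2 / 2 \sigma^2\)
derived in Example 1 below; taking \(\lambda\) close to \(1\) also
recovers the KL-divergence.

#+begin_src python
import numpy as np
from scipy import stats

def renyi_divergence(p_logpdf, q_logpdf, lam, grid):
    """D_lambda(p || q) = (lam - 1)^{-1} log int p(y)^lam q(y)^{1 - lam} dy,
    computed by numerical integration on a one-dimensional grid."""
    integrand = np.exp(lam * p_logpdf(grid) + (1.0 - lam) * q_logpdf(grid))
    return np.log(np.trapz(integrand, grid)) / (lam - 1.0)

mu1, mu2, sigma = 0.0, 1.0, 2.0
p = stats.norm(mu1, sigma).logpdf
q = stats.norm(mu2, sigma).logpdf
grid = np.linspace(-60.0, 60.0, 200001)

for lam in [1.001, 2.0, 5.0, 10.0]:
    closed_form = lam * (mu2 - mu1) ** 2 / (2 * sigma ** 2)
    print(f"lambda = {lam}: numerical = {renyi_divergence(p, q, lam, grid):.4f}, "
          f"closed form = {closed_form:.4f}")
#+end_src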
Using the Rényi divergence we may define:

*Definition (Rényi differential privacy)* (Mironov 2017). A mechanism
\(M\) is \((\lambda, \rho)\)-/Rényi differentially private/
(\((\lambda, \rho)\)-rdp) if for all \(x\) and \(x'\) with distance \(1\),

\[D_\lambda(M(x) || M(x')) \le \rho.\]

For convenience we also define two related notions, \(G_\lambda (f || g)\)
and \(\kappa_{f, g} (t)\) for \(\lambda > 1\), \(t > 0\) and positive
functions \(f\) and \(g\):

\[G_\lambda(f || g) = \int f(y)^{\lambda} g(y)^{1 - \lambda} dy; \qquad \kappa_{f, g} (t) = \log G_{t + 1}(f || g).\]

For probability densities \(p\) and \(q\), \(G_{t + 1}(p || q)\) and
\(\kappa_{p, q}(t)\) are the moment generating function and the
[[https://en.wikipedia.org/wiki/Cumulant][cumulant]] generating function
of the divergence variable \(L(p || q)\), evaluated at \(t\), and

\[D_\lambda(p || q) = (\lambda - 1)^{-1} \kappa_{p, q}(\lambda - 1).\]

In the following, whenever you see \(t\), think of it as \(\lambda - 1\).

*Example 1 (RDP for the Gaussian mechanism)*. Using the scaling and
translation invariance of \(L\) (6.1), we have that the divergence
variable for two Gaussians with the same variance is

\[L(N(\mu_1, \sigma^2 I) || N(\mu_2, \sigma^2 I)) \overset{d}{=} L(N(0, I) || N((\mu_2 - \mu_1) / \sigma, I)).\]

With this we get

\[D_\lambda(N(\mu_1, \sigma^2 I) || N(\mu_2, \sigma^2 I)) = {\lambda \|\mu_2 - \mu_1\|_2^2 \over 2 \sigma^2} = D_\lambda(N(\mu_2, \sigma^2 I) || N(\mu_1, \sigma^2 I)).\]

Again due to the scaling invariance of \(L\), we only need to consider \(f\)
with sensitivity \(1\), see the discussion under (6.1). The Gaussian
mechanism on query \(f\) is thus \((\lambda, \lambda / 2 \sigma^2)\)-rdp for
any \(\lambda > 1\).

From the example of the Gaussian mechanism, we see that the relation
between \(\lambda\) and \(\rho\) is like that between \(\epsilon\) and
\(\delta\). Given \(\lambda\) (resp. \(\rho\)) and parameters like the
variance of the noise and the sensitivity of the query, we can write
\(\rho = \rho(\lambda)\) (resp. \(\lambda = \lambda(\rho)\)).

Using the Chernoff bound (6.7), we can bound the divergence variable:

\[\mathbb P(L(p || q) \ge \epsilon) \le {\mathbb E \exp(t L(p || q)) \over \exp(t \epsilon)} = \exp (\kappa_{p, q}(t) - \epsilon t). \qquad (7.7)\]

For a function \(f: I \to \mathbb R\), denote its
[[https://en.wikipedia.org/wiki/Legendre_transformation][Legendre
transform]] by

\[f^*(\epsilon) := \sup_{t \in I} (\epsilon t - f(t)).\]

By taking the infimum over \(t\) on the RHS of (7.7), we obtain

*Claim 20*. Two probability densities \(p\) and \(q\) are
\((\epsilon, \exp(-\kappa_{p, q}^*(\epsilon)))\)-ind.

Given a mechanism \(M\), let \(\kappa_M(t)\) denote an upper bound for the
cumulant of its privacy loss:

\[\log \mathbb E \exp(t L(M(x) || M(x'))) \le \kappa_M(t), \qquad \forall x, x'\text{ with } d(x, x') = 1.\]

For example, we can set \(\kappa_M(t) = t \rho(t + 1)\). Using the same
argument we have the following:

*Claim 21*. If \(M\) is \((\lambda, \rho)\)-rdp, then

1. it is also \((\epsilon, \exp((\lambda - 1) (\rho - \epsilon)))\)-dp for
   any \(\epsilon \ge \rho\).
2. Alternatively, \(M\) is \((\epsilon, \exp(- \kappa_M^*(\epsilon)))\)-dp
   for any \(\epsilon > 0\).
3. Alternatively, for any \(0 < \delta \le 1\), \(M\) is
   \((\rho + (\lambda - 1)^{-1} \log \delta^{-1}, \delta)\)-dp.

*Example 2 (Gaussian mechanism)*.
We can apply the above argument to the
Gaussian mechanism on query \(f\) and get:

\[\delta \le \inf_{\lambda > 1} \exp((\lambda - 1) ({\lambda \over 2 \sigma^2} - \epsilon))\]

By assuming \(\sigma^2 > (2 \epsilon)^{-1}\) we have that the infimum is
achieved when \(\lambda = (1 + 2 \epsilon \sigma^2) / 2\) and

\[\delta \le \exp(- ((2 \sigma)^{-1} - \epsilon \sigma)^2 / 2)\]

which is the same result as (6.8), obtained using the Chernoff bound of
the noise.

However, as we will see later, compositions will yield different results
from those obtained from the methods in
[[/posts/2019-03-13-a-tail-of-two-densities.html][Part 1]] when
considering Rényi dp.

*** Moment Composition
    :PROPERTIES:
    :CUSTOM_ID: moment-composition
    :END:
*Claim 22 (Moment Composition Theorem)*. Let \(M\) be the adaptive
composition of \(M_{1 : k}\). Suppose for any \(y_{< i}\), \(M_i(y_{< i})\) is
\((\lambda, \rho)\)-rdp. Then \(M\) is \((\lambda, k\rho)\)-rdp.

*Proof*. Rather straightforward. As before let \(p_i\) and \(q_i\) be the
conditional laws of the adaptive composition of \(M_{1 : i}\) at \(x\) and \(x'\)
respectively, and \(p^i\) and \(q^i\) be the joint laws of \(M_{1 : i}\) at
\(x\) and \(x'\) respectively. Denote

\[D_i = \mathbb E \exp((\lambda - 1)\log {p^i(\xi_{1 : i}) \over q^i(\xi_{1 : i})}).\]

Then

$$\begin{aligned}
D_i &= \mathbb E\mathbb E \left(\exp((\lambda - 1)\log {p_i(\xi_i | \xi_{< i}) \over q_i(\xi_i | \xi_{< i})}) \exp((\lambda - 1)\log {p^{i - 1}(\xi_{< i}) \over q^{i - 1}(\xi_{< i})}) \big| \xi_{< i}\right) \\
&= \mathbb E \mathbb E \left(\exp((\lambda - 1)\log {p_i(\xi_i | \xi_{< i}) \over q_i(\xi_i | \xi_{< i})}) | \xi_{< i}\right) \exp\left((\lambda - 1)\log {p^{i - 1}(\xi_{< i}) \over q^{i - 1}(\xi_{< i})}\right)\\
&\le \mathbb E \exp((\lambda - 1) \rho) \exp\left((\lambda - 1)\log {p^{i - 1}(\xi_{< i}) \over q^{i - 1}(\xi_{< i})}\right)\\
&= \exp((\lambda - 1) \rho) D_{i - 1}.
\end{aligned}$$

Applying this recursively we have

\[D_k \le \exp(k(\lambda - 1) \rho),\]

and so

\[(\lambda - 1)^{-1} \log \mathbb E \exp((\lambda - 1)\log {p^k(\xi_{1 : k}) \over q^k(\xi_{1 : k})}) = (\lambda - 1)^{-1} \log D_k \le k \rho.\]

Since this holds for all \(x\) and \(x'\), we are done. \(\square\)

This, together with the scaling property of the Legendre transformation:

\[(k f)^*(x) = k f^*(x / k)\]

yields

*Claim 23*. The \(k\)-fold adaptive composition of
\((\lambda, \rho(\lambda))\)-rdp mechanisms is
\((\epsilon, \exp(- k \kappa^*(\epsilon / k)))\)-dp, where
\(\kappa(t) := t \rho(t + 1)\).

*Example 3 (Gaussian mechanism)*. We can apply the above claim to the
Gaussian mechanism. Again, without loss of generality we assume
\(S_f = 1\). But let us do it manually to get the same results. If we
apply the Moment Composition Theorem to an adaptive composition of
Gaussian mechanisms on the same query, then since each \(M_i\) is
\((\lambda, (2 \sigma^2)^{-1} \lambda)\)-rdp, the composition \(M\) is
\((\lambda, (2 \sigma^2)^{-1} k \lambda)\)-rdp. Processing this using the
Chernoff bound as in the previous example, we have

\[\delta = \exp(- ((2 \sigma / \sqrt k)^{-1} - \epsilon \sigma / \sqrt k)^2 / 2).\]

Substituting \(\sigma\) with \(\sigma / \sqrt k\) in (6.81), we conclude
that if

\[\sigma > \sqrt k \left(\epsilon^{-1} \sqrt{2 \log \delta^{-1}} + (2 \epsilon)^{- {1 \over 2}}\right)\]

then the composition \(M\) is \((\epsilon, \delta)\)-dp.
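The bound in Example 3 is easy to evaluate numerically. Below is a
minimal sketch (my own code, not from any library; the grid of
\(\lambda\)s is arbitrary) that, given \(\sigma\), \(k\) and \(\epsilon\),
optimises Item 1 of Claim 21 over \(\lambda\) for the \(k\)-fold composed
Gaussian mechanism with sensitivity \(1\).

#+begin_src python
import numpy as np

def delta_composed_gaussian(epsilon, sigma, k, lambdas=None):
    """Moment bound on delta for the k-fold composition of the Gaussian
    mechanism with sensitivity 1: each step is (lam, lam / (2 sigma^2))-rdp,
    so the composition is (lam, k lam / (2 sigma^2))-rdp (Claim 22), and we
    optimise the conversion of Claim 21, Item 1, over lam."""
    if lambdas is None:
        lambdas = np.linspace(1.001, 200.0, 100000)
    rho = k * lambdas / (2 * sigma ** 2)         # composed rdp parameter
    log_delta = (lambdas - 1) * (rho - epsilon)  # Claim 21, Item 1
    return np.exp(log_delta.min())

# e.g. sigma = 20, k = 100 compositions, epsilon = 2
print(delta_composed_gaussian(epsilon=2.0, sigma=20.0, k=100))
#+end_src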
As we will see in the discussions at the end of this post, this result
is different from (and probably better than) the one obtained by using
the Advanced Composition Theorem (Claim 18).

*** Subsampling
    :PROPERTIES:
    :CUSTOM_ID: subsampling
    :END:
We also have a subsampling theorem for the Rényi dp.

*Claim 24*. Fix \(r \in [0, 1]\). Let \(m \le n\) be two nonnegative
integers with \(m = r n\). Let \(N\) be a \((\lambda, \rho)\)-rdp mechanism on
\(X^m\). Let \(\mathcal I := \{J \subset [n]: |J| = m\}\) be the set of
subsets of \([n]\) of size \(m\). Define the mechanism \(M\) on \(X^n\) by

\[M(x) = N(x_\gamma)\]

where \(\gamma\) is sampled uniformly from \(\mathcal I\). Then \(M\) is
\((\lambda, {1 \over \lambda - 1} \log (1 + r(e^{(\lambda - 1) \rho} - 1)))\)-rdp.

To prove Claim 24, we need a useful lemma:

*Claim 25*. Let \(p_{1 : n}\) and \(q_{1 : n}\) be nonnegative real
numbers, and \(\lambda > 1\). Then

\[{(\sum p_i)^\lambda \over (\sum q_i)^{\lambda - 1}} \le \sum_i {p_i^\lambda \over q_i^{\lambda - 1}}. \qquad (8)\]

*Proof*. Let

\[r(i) := p_i / P, \qquad u(i) := q_i / Q\]

where

\[P := \sum p_i, \qquad Q := \sum q_i,\]

then \(r\) and \(u\) are probability mass functions. Plugging
\(p_i = r(i) P\) and \(q_i = u(i) Q\) into the objective (8), it suffices to
show

\[1 \le \sum_i {r(i)^\lambda \over u(i)^{\lambda - 1}} = \mathbb E_{\xi \sim u} \left({r(\xi) \over u(\xi)}\right)^\lambda.\]

This is true due to Jensen's inequality:

\[\mathbb E_{\xi \sim u} \left({r(\xi) \over u(\xi)}\right)^\lambda \ge \left(\mathbb E_{\xi \sim u} {r(\xi) \over u(\xi)} \right)^\lambda = 1.\]

\(\square\)

*Proof of Claim 24*. Define \(\mathcal I\) as before.

Let \(p\) and \(q\) be the laws of \(M(x)\) and \(M(x')\) respectively. For any
\(I \in \mathcal I\), let \(p_I\) and \(q_I\) be the laws of \(N(x_I)\) and
\(N(x_I')\) respectively. Then we have

$$\begin{aligned}
p(y) &= n^{-1} \sum_{I \in \mathcal I} p_I(y) \\
q(y) &= n^{-1} \sum_{I \in \mathcal I} q_I(y),
\end{aligned}$$

where \(n = |\mathcal I|\).

The MGF of \(L(p || q)\) is thus

\[\mathbb E \exp((\lambda - 1) L(p || q)) = n^{-1} \int {(\sum_I p_I(y))^\lambda \over (\sum_I q_I(y))^{\lambda - 1}} dy \le n^{-1} \sum_I \int {p_I(y)^\lambda \over q_I(y)^{\lambda - 1}} dy \qquad (9)\]

where in the last step we used Claim 25. As in the proof of Claim 19, we
divide \(\mathcal I\) into disjoint sets \(\mathcal I_\in\) and
\(\mathcal I_\notin\). Furthermore we denote by \(n_\in\) and \(n_\notin\)
their cardinalities. Then the right hand side of (9) becomes

\[n^{-1} \sum_{I \in \mathcal I_\in} \int {p_I(y)^\lambda \over q_I(y)^{\lambda - 1}} dy + n^{-1} \sum_{I \in \mathcal I_\notin} \int {p_I(y)^\lambda \over q_I(y)^{\lambda - 1}} dy.\]

The summands in the first term are the MGFs of \(L(p_I || q_I)\), and the
summands in the second term are \(1\), so

$$\begin{aligned}
\mathbb E \exp((\lambda - 1) L(p || q)) &\le n^{-1} \sum_{I \in \mathcal I_\in} \mathbb E \exp((\lambda - 1) L(p_I || q_I)) + (1 - r) \\
&\le n^{-1} \sum_{I \in \mathcal I_\in} \exp((\lambda - 1) D_\lambda(p_I || q_I)) + (1 - r) \\
&\le r \exp((\lambda - 1) \rho) + (1 - r).
\end{aligned}$$
Taking log and dividing by \((\lambda - 1)\) on both sides we have

\[D_\lambda(p || q) \le (\lambda - 1)^{-1} \log (1 + r(\exp((\lambda - 1) \rho) - 1)).\]

\(\square\)

As before, we can rewrite the conclusion of Claim 24 using
\(1 + z \le e^z\) and obtain
\((\lambda, (\lambda - 1)^{-1} r (e^{(\lambda - 1) \rho} - 1))\)-rdp,
which further gives \((\lambda, \alpha^{-1} (e^\alpha - 1) r \rho)\)-rdp
(or \((\lambda, O(r \rho))\)-rdp) if \((\lambda - 1) \rho < \alpha\) for
some \(\alpha\).

It is not hard to see that the subsampling theorem for the moment
method, though similar to the one for the usual method, does not help
here, due to the lack of an analogue of the advanced composition theorem
for the moments.

*Example 4 (Gaussian mechanism)*. Applying the moment subsampling
theorem to the Gaussian mechanism, we obtain
\((\lambda, O(r \lambda / \sigma^2))\)-rdp for a subsampled Gaussian
mechanism with rate \(r\).
Abadi-Chu-Goodfellow-McMahan-Mironov-Talwar-Zhang 2016 (ACGMMTZ16 in the
following), however, gains an extra \(r\) in the bound given certain
assumptions.

** ACGMMTZ16
   :PROPERTIES:
   :CUSTOM_ID: acgmmtz16
   :END:
What follows is my understanding of this result. I call it a conjecture
because there is a gap in the proof: one step that I am neither able to
reproduce from the paper nor prove myself. This does not mean the result
is false. On the contrary, I am inclined to believe it is true.

*Claim 26*. Assume Conjecture 1 (see below) is true. For a subsampled
Gaussian mechanism with ratio \(r\), if \(r = O(\sigma^{-1})\) and
\(\lambda = O(\sigma^2)\), then we have
\((\lambda, O(r^2 \lambda / \sigma^2))\)-rdp.

Wait, why is there a conjecture? Well, I have tried but not been able to
prove the following, which is a hidden assumption in the original proof:

*Conjecture 1*. Let \(M\) be the Gaussian mechanism with subsampling rate
\(r\), and \(p\) and \(q\) be the laws of \(M(x)\) and \(M(x')\) respectively,
where \(d(x, x') = 1\). Then

\[D_\lambda (p || q) \le D_\lambda (r \mu_1 + (1 - r) \mu_0 || \mu_0) \qquad (9.3)\]

where \(\mu_i = N(i, \sigma^2)\).

*Remark*. Conjecture 1 is heuristically reasonable. To see this, let us
use the notations \(p_I\) and \(q_I\) to denote \(p\) and \(q\) conditioned on
the subsampling index \(I\), just like in the proof of the subsampling
theorems (Claims 19 and 24). Then for \(I \in \mathcal I_\in\),

\[D_\lambda(p_I || q_I) \le D_\lambda(\mu_0 || \mu_1),\]

and for \(I \in \mathcal I_\notin\),

\[D_\lambda(p_I || q_I) = 0 = D_\lambda(\mu_0 || \mu_0).\]

Since we are taking an average over \(\mathcal I\), of which
\(r |\mathcal I|\) are in \(\mathcal I_\in\) and \((1 - r) |\mathcal I|\) are
in \(\mathcal I_\notin\), (9.3) says "the inequalities carry over under
averaging".

[[https://math.stackexchange.com/a/3152296/149540][A more general
version of Conjecture 1 has been proven false]]. The counterexample for
the general version does not apply here, so it is still possible that
Conjecture 1 is true.

Let \(p_\in\) (resp. \(q_\in\)) be the average of \(p_I\) (resp. \(q_I\)) over
\(I \in \mathcal I_\in\), and \(p_\notin\) (resp. \(q_\notin\)) be the average
of \(p_I\) (resp. \(q_I\)) over \(I \in \mathcal I_\notin\).

Immediately we have \(p_\notin = q_\notin\), hence

\[D_\lambda(p_\notin || q_\notin) = 0 = D_\lambda(\mu_0 || \mu_0). \qquad(9.7)\]

By Claim 25, we have

\[D_\lambda(p_\in || q_\in) \le D_\lambda (\mu_1 || \mu_0). \qquad(9.9)\]
So one way to prove Conjecture 1 is perhaps to prove a more specialised
comparison theorem than the false conjecture:

Given (9.7) and (9.9), show that

\[D_\lambda(r p_\in + (1 - r) p_\notin || r q_\in + (1 - r) q_\notin) \le D_\lambda(r \mu_1 + (1 - r) \mu_0 || \mu_0).\]

[End of Remark]

#+begin_html

#+end_html

Recall the definition of \(G_\lambda\) under the definition of Rényi
differential privacy. The following claim will be useful.

*Claim 27*. Let \(\lambda\) be a positive integer, then

\[G_\lambda(r p + (1 - r) q || q) = \sum_{k = 0 : \lambda} {\lambda \choose k} r^k (1 - r)^{\lambda - k} G_k(p || q).\]

*Proof*. Quite straightforward, by expanding the numerator
\((r p + (1 - r) q)^\lambda\) using the binomial expansion. \(\square\)

*Proof of Claim 26*. By Conjecture 1, it suffices to prove the
following:

If \(r \le c_1 \sigma^{-1}\) and \(\lambda \le c_2 \sigma^2\) for some
positive constants \(c_1\) and \(c_2\), then there exists \(C = C(c_1, c_2)\)
such that \(G_\lambda (r \mu_1 + (1 - r) \mu_0 || \mu_0) \le C\) (since
\(O(r^2 \lambda^2 / \sigma^2) = O(1)\)).

*Remark in the proof*. Note that the choice of \(c_1\), \(c_2\) and the
function \(C(c_1, c_2)\) are important to the practicality and usefulness
of Claim 26.

#+begin_html

#+end_html

Using Claim 27 and Example 1, we have

$$\begin{aligned}
G_\lambda(r \mu_1 + (1 - r) \mu_0 || \mu_0) &= \sum_{j = 0 : \lambda} {\lambda \choose j} r^j (1 - r)^{\lambda - j} G_j(\mu_1 || \mu_0)\\
&=\sum_{j = 0 : \lambda} {\lambda \choose j} r^j (1 - r)^{\lambda - j} \exp(j (j - 1) / 2 \sigma^2). \qquad (9.5)
\end{aligned}$$

Let \(n = \lceil \sigma^2 \rceil\). It suffices to show

\[\sum_{j = 0 : n} {n \choose j} (c_1 n^{- 1 / 2})^j (1 - c_1 n^{- 1 / 2})^{n - j} \exp(c_2 j (j - 1) / 2 n) \le C.\]

Note that we can discard the linear term \(- c_2 j / (2 n)\) in the
exponential term since we want to bound the sum from above.

We examine the asymptotics of this sum when \(n\) is large, and treat the
sum as an approximation to the integral of a function
\(\phi: [0, 1] \to \mathbb R\). For \(j = x n\), where \(x \in (0, 1)\),
\(\phi\) is thus defined as (note we multiply the summand by \(n\) to
compensate for the uniform measure on \(1, ..., n\)):

$$\begin{aligned}
\phi_n(x) &:= n {n \choose j} (c_1 n^{- 1 / 2})^j (1 - c_1 n^{- 1 / 2})^{n - j} \exp(c_2 j^2 / 2 n) \\
&= n {n \choose x n} (c_1 n^{- 1 / 2})^{x n} (1 - c_1 n^{- 1 / 2})^{(1 - x) n} \exp(c_2 x^2 n / 2)
\end{aligned}$$

Using Stirling's approximation

\[n! \approx \sqrt{2 \pi n} n^n e^{- n},\]

we can approximate the binomial coefficient:

\[{n \choose x n} \approx (\sqrt{2 \pi x (1 - x)} x^{x n} (1 - x)^{(1 - x) n})^{-1}.\]

We also approximate

\[(1 - c_1 n^{- 1 / 2})^{(1 - x) n} \approx \exp(- c_1 \sqrt{n} (1 - x)).\]

With these we have

\[\phi_n(x) \approx {1 \over \sqrt{2 \pi x (1 - x)}} \exp\left(- {1 \over 2} x n \log n + (x \log c_1 - x \log x - (1 - x) \log (1 - x) + {1 \over 2} c_2 x^2) n + {1 \over 2} \log n\right).\]

This vanishes as \(n \to \infty\), and since \(\phi_n(x)\) is bounded above
by the integrable function \({1 \over \sqrt{2 \pi x (1 - x)}}\) (cf. the
arcsine law), and below by \(0\), we may invoke the dominated convergence
theorem and exchange the integral with the limit and get

$$\begin{aligned}
\lim_{n \to \infty} &G_n (r \mu_1 + (1 - r) \mu_0 || \mu_0) \\
&\le \lim_{n \to \infty} \int \phi_n(x) dx = \int \lim_{n \to \infty} \phi_n(x) dx = 0.
\end{aligned}$$
Thus we have that the moment generating function of the divergence
variable \(L(r \mu_1 + (1 - r) \mu_0 || \mu_0)\) is bounded.

Can this be true for better orders

\[r \le c_1 \sigma^{- d_r},\qquad \lambda \le c_2 \sigma^{d_\lambda}\]

for some \(d_r \in (0, 1]\) and \(d_\lambda \in [2, \infty)\)? If we follow
the same approximation using these exponents, then letting
\(n = c_2 \sigma^{d_\lambda}\),

$$\begin{aligned}
{n \choose j} &r^j (1 - r)^{n - j} G_j(\mu_0 || \mu_1) \le \phi_n(x) \\
&\approx {1 \over \sqrt{2 \pi x (1 - x)}} \exp\left({1 \over 2} c_2^{2 \over d_\lambda} x^2 n^{2 - {2 \over d_\lambda}} - {d_r \over 2} x n \log n + (x \log c_1 - x \log x - (1 - x) \log (1 - x)) n + {1 \over 2} \log n\right).
\end{aligned}$$

So we see that to keep the divergence moments bounded it is possible to
have any \(r = O(\sigma^{- d_r})\) for \(d_r \in (0, 1)\), but relaxing
\(\lambda\) may not be safe.

If we relax \(r\), then we get

\[G_\lambda(r \mu_1 + (1 - r) \mu_0 || \mu_0) = O(r^{2 / d_r} \lambda^2 \sigma^{-2}) = O(1).\]

Note that now the constant \(C\) depends on \(d_r\) as well. Numerical
experiments seem to suggest that \(C\) can increase quite rapidly as \(d_r\)
decreases from \(1\). \(\square\)

In the following for consistency we retain \(k\) as the number of epochs,
and use \(T := k / r\) to denote the number of compositions / steps /
minibatches. With Claim 26 we have:

*Claim 28*. Assume Conjecture 1 is true. Let \(\epsilon, c_1, c_2 > 0\),
\(r \le c_1 \sigma^{-1}\), and
\(T = {c_2 \over 2 C(c_1, c_2)} \epsilon \sigma^2\). Then DP-SGD with
subsampling rate \(r\) and \(T\) steps is \((\epsilon, \delta)\)-dp for

\[\delta = \exp(- {1 \over 2} c_2 \sigma^2 \epsilon).\]

In other words, for

\[\sigma \ge \sqrt{2 c_2^{-1}} \epsilon^{- {1 \over 2}} \sqrt{\log \delta^{-1}},\]

we can achieve \((\epsilon, \delta)\)-dp.

*Proof*. By Claim 26 and the Moment Composition Theorem (Claim 22), for
\(\lambda = c_2 \sigma^2\), substituting
\(T = {c_2 \over 2 C(c_1, c_2)} \epsilon \sigma^2\), we have

\[\mathbb P(L(p || q) \ge \epsilon) \le \exp(T C(c_1, c_2) - \lambda \epsilon) = \exp\left(- {1 \over 2} c_2 \sigma^2 \epsilon\right).\]

\(\square\)

*Remark*. Claim 28 is my understanding / version of Theorem 1 in
[ACGMMTZ16], obtained using the same proof technique. Here I quote the
original version of the theorem with notions and notations altered for
consistency with this post:

#+begin_quote
  There exist constants \(c_1', c_2' > 0\) so that for any
  \(\epsilon < c_1' r^2 T\), DP-SGD is \((\epsilon, \delta)\)-differentially
  private for any \(\delta > 0\) if we choose
#+end_quote

\[\sigma \ge c_2' {r \sqrt{T \log (1 / \delta)} \over \epsilon}. \qquad (10)\]

I am however unable to reproduce this version, assuming Conjecture 1 is
true, for the following reasons:

1. In the proof in the paper, we have \(\epsilon = c_1' r^2 T\) instead of
   "less than" in the statement of the theorem. If we change it to
   \(\epsilon < c_1' r^2 T\) then the direction of the inequality becomes
   opposite to the direction we want to prove:
   \[\exp(T C(c_1, c_2) - \lambda \epsilon) \ge ...\]

2. The condition \(r = O(\sigma^{-1})\) of Claim 26, whose result is used
   in the proof of this theorem, is not mentioned in the statement of the
   theorem. The implication is that (10) becomes an ill-formed condition,
   as the right hand side also depends on \(\sigma\).
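Setting the gap aside, the moment bound (9.5) itself is straightforward
to compute numerically. Below is a minimal sketch (my own code and
function names, for integer \(\lambda\) only, with an arbitrary range of
\(\lambda\)s) that computes the cumulant
\(\kappa(\lambda - 1) = \log G_\lambda(r \mu_1 + (1 - r) \mu_0 || \mu_0)\)
of one subsampled Gaussian step via (9.5), composes it over \(T\) steps
with Claim 22, and converts the result to an \((\epsilon, \delta)\)
guarantee with Item 3 of Claim 21. This is in spirit what the Tensorflow
code discussed in the next section does.

#+begin_src python
import numpy as np
from scipy.special import gammaln, logsumexp

def log_moment_subsampled_gaussian(r, sigma, lam):
    """log G_lam(r mu_1 + (1 - r) mu_0 || mu_0) for integer lam, via (9.5):
    log sum_j C(lam, j) r^j (1 - r)^(lam - j) exp(j (j - 1) / (2 sigma^2)),
    computed in log space to avoid overflow."""
    js = np.arange(lam + 1)
    log_binom = gammaln(lam + 1) - gammaln(js + 1) - gammaln(lam - js + 1)
    log_terms = (log_binom + js * np.log(r) + (lam - js) * np.log1p(-r)
                 + js * (js - 1) / (2 * sigma ** 2))
    return logsumexp(log_terms)

def epsilon_dp_sgd(r, sigma, T, delta, lambdas=range(2, 128)):
    """epsilon for T compositions at a given delta: compose the cumulants
    (Claim 22), then convert with Item 3 of Claim 21, optimising over
    integer lambda."""
    eps = [T * log_moment_subsampled_gaussian(r, sigma, lam) / (lam - 1)
           + np.log(1 / delta) / (lam - 1) for lam in lambdas]
    return min(eps)

# e.g. subsampling rate 1%, sigma = 4, 10000 steps, delta = 1e-5
print(epsilon_dp_sgd(r=0.01, sigma=4.0, T=10000, delta=1e-5))
#+end_src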
** Tensorflow implementation
   :PROPERTIES:
   :CUSTOM_ID: tensorflow-implementation
   :END:
The DP-SGD is implemented in
[[https://github.com/tensorflow/privacy][TensorFlow Privacy]]. In the
following I discuss the package in its current state (as of 2019-03-11).
It is divided into two parts:
[[https://github.com/tensorflow/privacy/tree/master/privacy/optimizers][=optimizers=]],
which implements the actual differentially private algorithms, and
[[https://github.com/tensorflow/privacy/tree/master/privacy/analysis][=analysis=]],
which computes the privacy guarantee.

The =analysis= part implements a privacy ledger that "keeps a record of
all queries executed over a given dataset for the purpose of computing
privacy guarantees". On the other hand, all the computation is done in
[[https://github.com/tensorflow/privacy/blob/7e2d796bdee9b60dce21a82a397eefda35b0ac10/privacy/analysis/rdp_accountant.py][=rdp_accountant.py=]].
At this moment, =rdp_accountant.py= only implements the computation of
the privacy guarantees for DP-SGD with the Gaussian mechanism. In the
following I will briefly explain the code in this file.

Some notational correspondences: their =alpha= is our \(\lambda\), their
=q= is our \(r\), their =A_alpha= (in the comments) is our
\(\kappa_{r N(1, \sigma^2) + (1 - r) N(0, \sigma^2), N(0, \sigma^2)} (\lambda - 1)\),
at least when \(\lambda\) is an integer.

- The function =_compute_log_a= presumably computes the cumulant
  \(\kappa_{r N(1, \sigma^2) + (1 - r) N(0, \sigma^2), N(0, \sigma^2)}(\lambda - 1)\).
  It calls =_compute_log_a_int= or =_compute_log_a_frac= depending on
  whether \(\lambda\) is an integer.
- The function =_compute_log_a_int= computes the cumulant using (9.5).
- When \(\lambda\) is not an integer, we can't use (9.5). I have yet to
  decode how =_compute_log_a_frac= computes the cumulant (or an upper
  bound of it) in this case.
- The function =_compute_delta= computes \(\delta\)s for a list of
  \(\lambda\)s and \(\kappa\)s using Item 1 of Claim 21 and returns the
  smallest one, and the function =_compute_epsilon= computes \(\epsilon\)
  using Item 3 of Claim 21 in the same way.

In =optimizers=, among other things, the DP-SGD with the Gaussian
mechanism is implemented in =dp_optimizer.py= and =gaussian_query.py=.
See the definition of =DPGradientDescentGaussianOptimizer= in
=dp_optimizer.py= and trace the calls therein.

At this moment, the privacy guarantee computation part and the optimizer
part are separated, with =rdp_accountant.py= called in
=compute_dp_sgd_privacy.py= with user-supplied parameters. I think this
is due to the lack of implementation in =rdp_accountant.py= of any
non-DPSGD-with-Gaussian privacy guarantee computation. There is already
[[https://github.com/tensorflow/privacy/issues/23][an issue on this]],
so hopefully it won't be long before the privacy guarantees can be
automatically computed given a DP-SGD instance.

** Comparison among different methods
   :PROPERTIES:
   :CUSTOM_ID: comparison-among-different-methods
   :END:
So far we have seen three routes to compute the privacy guarantees for
DP-SGD with the Gaussian mechanism:

1. Claim 9 (single Gaussian mechanism privacy guarantee) -> Claim 19
   (Subsampling theorem) -> Claim 18 (Advanced Adaptive Composition
   Theorem)
2. Example 1 (RDP for the Gaussian mechanism) -> Claim 22 (Moment
   Composition Theorem) -> Example 3 (Moment composition applied to the
   Gaussian mechanism)
3. Claim 26 (RDP for the Gaussian mechanism with specific magnitudes for
   the subsampling rate) -> Claim 28 (Moment Composition Theorem and
   translation to conventional DP)

Which one is the best?

To make a fair comparison, we may use one parameter as the metric and
set all others to be the same. For example, we can

1. Given the same \(\epsilon\), \(r\) (in Routes 1 and 3), \(k\), \(\sigma\),
   compare the \(\delta\)s
2. Given the same \(\epsilon\), \(r\) (in Routes 1 and 3), \(k\), \(\delta\),
   compare the \(\sigma\)s
3. Given the same \(\delta\), \(r\) (in Routes 1 and 3), \(k\), \(\sigma\),
   compare the \(\epsilon\)s.

I find the first one, where \(\delta\) is used as the metric, the best.
This is because we have the tightest bounds and the cleanest formulas
when comparing the \(\delta\)s. For example, the Azuma and Chernoff bounds
are both expressed as a bound for \(\delta\). On the other hand, the
inversion of these bounds either requires a cost in the tightness (Claim
9, bounds on \(\sigma\)) or in the complexity of the formula (Claim 18,
the Advanced Adaptive Composition Theorem, bounds on \(\epsilon\)).

So if we use \(\sigma\) or \(\epsilon\) as the metric, either we get a less
fair comparison, or we have to use a much more complicated formula for
the bounds.

Let us first compare Route 1 and Route 2 without specialising to the
Gaussian mechanism.

*Warning*. What follows is a bit messy.

Suppose each mechanism \(N_i\) satisfies
\((\epsilon', \delta(\epsilon'))\)-dp. Let
\(\tilde \epsilon := \log (1 + r (e^{\epsilon'} - 1))\), then the
subsampled mechanism \(M_i(x) = N_i(x_\gamma)\) is
\((\tilde \epsilon, r \tilde \delta(\tilde \epsilon))\)-dp, where

\[\tilde \delta(\tilde \epsilon) = \delta(\log (r^{-1} (\exp(\tilde \epsilon) - 1) + 1)).\]

Using the Azuma bound in the proof of the Advanced Adaptive Composition
Theorem (6.99):

\[\mathbb P(L(p^k || q^k) \ge \epsilon) \le \exp(- {(\epsilon - r^{-1} k a(\tilde\epsilon))^2 \over 2 r^{-1} k (\tilde\epsilon + a(\tilde\epsilon))^2}).\]

So we have the final bound for Route 1:

\[\delta_1(\epsilon) = \min_{\tilde \epsilon: \epsilon > r^{-1} k a(\tilde \epsilon)} \exp(- {(\epsilon - r^{-1} k a(\tilde\epsilon))^2 \over 2 r^{-1} k (\tilde\epsilon + a(\tilde\epsilon))^2}) + k \tilde \delta(\tilde \epsilon).\]

As for Route 2, since we do not gain anything from subsampling in RDP,
we do not subsample at all.

By Claim 23, we have the bound for Route 2:

\[\delta_2(\epsilon) = \exp(- k \kappa^* (\epsilon / k)).\]

On one hand, one can compare \(\delta_1\) and \(\delta_2\) with numerical
experiments. On the other hand, if we further specify
\(\delta(\epsilon')\) in Route 1 as the Chernoff bound for the cumulants
of the divergence variable, i.e.

\[\delta(\epsilon') = \exp(- \kappa^* (\epsilon')),\]

we have

\[\delta_1 (\epsilon) = \min_{\tilde \epsilon: a(\tilde \epsilon) < r k^{-1} \epsilon} \exp(- {(\epsilon - r^{-1} k a(\tilde\epsilon))^2 \over 2 r^{-1} k (\tilde\epsilon + a(\tilde\epsilon))^2}) + k \exp(- \kappa^* (b(\tilde\epsilon))),\]

where

\[b(\tilde \epsilon) := \log (r^{-1} (\exp(\tilde \epsilon) - 1) + 1) \le r^{-1} \tilde\epsilon.\]

We note that since
\(a(\tilde \epsilon) = \tilde\epsilon(e^{\tilde \epsilon} - 1) 1_{\tilde\epsilon < \log 2} + \tilde\epsilon 1_{\tilde\epsilon \ge \log 2}\),
we may compare the two cases separately.
Note that \(\kappa^*\) is a monotonically increasing function, therefore

\[\kappa^* (b(\tilde\epsilon)) \le \kappa^*(r^{-1} \tilde\epsilon).\]

So for \(\tilde \epsilon \ge \log 2\), we have

\[k \exp(- \kappa^*(b(\tilde\epsilon))) \ge k \exp(- \kappa^*(r^{-1} \tilde \epsilon)) \ge k \exp(- \kappa^*(k^{-1} \epsilon)) \ge \delta_2(\epsilon).\]

For \(\tilde\epsilon < \log 2\), it is harder to compare, as now

\[k \exp(- \kappa^*(b(\tilde\epsilon))) \ge k \exp(- \kappa^*(\sqrt{\epsilon / r k})).\]

It is tempting to believe that this should also be greater than
\(\delta_2(\epsilon)\). But I cannot say for sure. At least in the special
case of the Gaussian, we have

\[k \exp(- \kappa^*(\sqrt{\epsilon / r k})) = k \exp(- (\sigma \sqrt{\epsilon / k r} - (2 \sigma)^{-1})^2 / 2) \ge \exp(- k ({\sigma \epsilon \over k} - (2 \sigma)^{-1})^2 / 2) = \delta_2(\epsilon)\]

when \(\epsilon\) is sufficiently small. However we still need to consider
the case where \(\epsilon\) is not too small. But overall it seems most
likely that Route 2 is superior to Route 1.

So let us compare Route 2 with Route 3:

Given the condition to obtain the Chernoff bound

\[{\sigma \epsilon \over k} > (2 \sigma)^{-1}\]

we have

\[\delta_2(\epsilon) > \exp(- k (\sigma \epsilon / k)^2 / 2) = \exp(- \sigma^2 \epsilon^2 / 2 k).\]

For this to achieve the same bound

\[\delta_3(\epsilon) = \exp\left(- {1 \over 2} c_2 \sigma^2 \epsilon\right)\]

we need \(k < {\epsilon \over c_2}\). This is only possible if \(c_2\) is
small or \(\epsilon\) is large, since \(k\) is a positive integer.

So, taken at face value, Route 3 seems to achieve the best results.
However, it also has some similar implicit conditions that need to be
satisfied: First, \(T\) needs to be at least \(1\), meaning

\[{c_2 \over 2 C(c_1, c_2)} \epsilon \sigma^2 \ge 1.\]

Second, \(k\) needs to be at least \(1\) as well. Since
\(r \le c_1 \sigma^{-1}\), we have
\(k = r T \le {c_1 c_2 \over 2 C(c_1, c_2)} \epsilon \sigma\), so we also
need

\[{c_1 c_2 \over 2 C(c_1, c_2)} \epsilon \sigma \ge 1.\]

Both conditions rely on the magnitudes of \(\epsilon\), \(\sigma\), \(c_1\),
\(c_2\), and the rate of growth of \(C(c_1, c_2)\). The biggest problem in
this list is the last, because if we know how fast \(C\) grows then we'll
have a better idea of what the constraints on the parameters are for
achieving the result in Route 3.

** Further questions
   :PROPERTIES:
   :CUSTOM_ID: further-questions
   :END:
Here is a list of what I think may be interesting topics or potential
problems to look at, with no guarantee that they are all awesome
untouched research problems:

1. Prove Conjecture 1.
2. Find a theoretically definitive answer whether the methods in Part 1
   or Part 2 yield better privacy guarantees.
3. Study the non-Gaussian cases, general or specific. Let \(p\) be some
   probability density; what is the tail bound of
   \(L(p(y) || p(y + \alpha))\) for \(|\alpha| \le 1\)? Can you find
   anything better than the Gaussian? For a start, perhaps the nice
   tables of Rényi divergence in Gil-Alajaji-Linder 2013 may be useful?
4. Find out how useful Claim 26 is. Perhaps start with computing the
   constant \(C\) numerically.
5. Help with [[https://github.com/tensorflow/privacy/issues/23][the
   aforementioned issue]] in the Tensorflow Privacy package.

** References
   :PROPERTIES:
   :CUSTOM_ID: references
   :END:
- Abadi, Martín, Andy Chu, Ian Goodfellow, H. Brendan McMahan, Ilya
  Mironov, Kunal Talwar, and Li Zhang. "Deep Learning with Differential
  Privacy." Proceedings of the 2016 ACM SIGSAC Conference on Computer
  and Communications Security - CCS'16, 2016, 308--18.
  [[https://doi.org/10.1145/2976749.2978318]].
- Erven, Tim van, and Peter Harremoës. "Rényi Divergence and
  Kullback-Leibler Divergence." IEEE Transactions on Information Theory
  60, no. 7 (July 2014): 3797--3820.
  [[https://doi.org/10.1109/TIT.2014.2320500]].
- Gil, M., F. Alajaji, and T. Linder. "Rényi Divergence Measures for
  Commonly Used Univariate Continuous Distributions." Information
  Sciences 249 (November 2013): 124--31.
  [[https://doi.org/10.1016/j.ins.2013.06.018]].
- Mironov, Ilya. "Rényi Differential Privacy." 2017 IEEE 30th Computer
  Security Foundations Symposium (CSF), August 2017, 263--75.
  [[https://doi.org/10.1109/CSF.2017.11]].