---
title: 'Great, but Manageable Expectations'
date: 2019-03-14
template: post
comments: true
---

Let us continue with the study of differential privacy from [Part 1 of this post](2019-03-13-a-tail-of-two-densities.html).

Rényi divergence and differential privacy
-----------------------------------------

Recall that in the proof of the Gaussian mechanism privacy guarantee (Claim 8) we used the Chernoff bound for the Gaussian noise. Why not use the Chernoff bound for the divergence variable / privacy loss directly, since the latter is closer to the core subject than the noise? This leads us to the study of Rényi divergence.

So far we have seen several notions of divergence used in differential privacy: the max divergence, which is $\epsilon$-ind in disguise:

$$D_\infty(p || q) := \max_y \log {p(y) \over q(y)},$$

the $\delta$-approximate max divergence that defines $(\epsilon, \delta)$-ind:

$$D_\infty^\delta(p || q) := \max_y \log{p(y) - \delta \over q(y)},$$

and the KL divergence, which is the expectation of the divergence variable:

$$D(p || q) = \mathbb E L(p || q) = \int \log {p(y) \over q(y)} p(y) dy.$$

The Rényi divergence is an interpolation between the max divergence and the KL divergence, defined via the log moment generating function / cumulant of the divergence variable:

$$D_\lambda(p || q) = (\lambda - 1)^{-1} \log \mathbb E \exp((\lambda - 1) L(p || q)) = (\lambda - 1)^{-1} \log \int {p(y)^\lambda \over q(y)^{\lambda - 1}} dy.$$

Indeed, when $\lambda \to \infty$ we recover the max divergence, and when $\lambda \to 1$, by recognising $D_\lambda$ as a derivative in $\lambda$ at $\lambda = 1$, we recover the KL divergence.
In this post we only consider $\lambda > 1$.

Using the Rényi divergence we may define:

**Definition (Rényi differential privacy)** (Mironov 2017). A mechanism $M$ is $(\lambda, \rho)$*-Rényi differentially private* ($(\lambda, \rho)$-rdp) if for all $x$ and $x'$ with distance $1$,

$$D_\lambda(M(x) || M(x')) \le \rho.$$

For convenience we also define two related notions, $G_\lambda (f || g)$ and $\kappa_{f, g} (t)$ for $\lambda > 1$, $t > 0$ and positive functions $f$ and $g$:

$$G_\lambda(f || g) = \int f(y)^{\lambda} g(y)^{1 - \lambda} dy; \qquad \kappa_{f, g} (t) = \log G_{t + 1}(f || g).$$

For probability densities $p$ and $q$, $G_{t + 1}(p || q)$ and $\kappa_{p, q}(t)$ are the moment generating function and the cumulant of the divergence variable $L(p || q)$ evaluated at $t$, and

$$D_\lambda(p || q) = (\lambda - 1)^{-1} \kappa_{p, q}(\lambda - 1).$$

In the following, whenever you see $t$, think of it as $\lambda - 1$.

**Example 1 (RDP for the Gaussian mechanism)**. Using the scaling and translation invariance of $L$ (6.1), we have that the divergence variable for two Gaussians with the same variance is

$$L(N(\mu_1, \sigma^2) || N(\mu_2, \sigma^2)) \overset{d}{=} L(N(0, 1) || N((\mu_2 - \mu_1) / \sigma, 1)).$$

With this we get

$$D_\lambda(N(\mu_1, \sigma^2) || N(\mu_2, \sigma^2)) = {\lambda (\mu_2 - \mu_1)^2 \over 2 \sigma^2} = D_\lambda(N(\mu_2, \sigma^2) || N(\mu_1, \sigma^2)).$$

Also due to the scaling invariance of $L$, we only need to consider $f$ with sensitivity $1$; see the discussion under (6.1). The Gaussian mechanism on query $f$ is thus $(\lambda, \lambda / 2 \sigma^2)$-rdp for any $\lambda > 1$.

From the example of the Gaussian mechanism, we see that the relation between $\lambda$ and $\rho$ is like that between $\epsilon$ and $\delta$. Given $\lambda$ (resp. $\rho$) and parameters like the variance of the noise and the sensitivity of the query, we can write $\rho = \rho(\lambda)$ (resp. $\lambda = \lambda(\rho)$).
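As a sanity check (my own addition, not part of the original derivation), the closed form in Example 1 can be verified numerically: the sketch below approximates $G_\lambda$ with a midpoint Riemann sum and compares the resulting $D_\lambda$ against $\lambda (\mu_2 - \mu_1)^2 / 2 \sigma^2$. The helper names are hypothetical.

```python
import math

def renyi_gauss_numeric(mu1, mu2, sigma, lam, lo=-40.0, hi=40.0, n=100000):
    """D_lambda(N(mu1, sigma^2) || N(mu2, sigma^2)) via a midpoint Riemann sum
    of G_lambda, the integral of p(y)^lam * q(y)^(1 - lam)."""
    dx = (hi - lo) / n
    acc = 0.0
    for i in range(n):
        y = lo + (i + 0.5) * dx
        logp = -((y - mu1) ** 2) / (2 * sigma ** 2)
        logq = -((y - mu2) ** 2) / (2 * sigma ** 2)
        # The normalising constants of p^lam and q^(1-lam) combine into a
        # single factor 1 / (sqrt(2 pi) sigma), applied after the sum.
        acc += math.exp(lam * logp + (1 - lam) * logq) * dx
    g = acc / (math.sqrt(2 * math.pi) * sigma)
    return math.log(g) / (lam - 1)

def renyi_gauss_closed(mu1, mu2, sigma, lam):
    """The closed form from Example 1: lambda (mu2 - mu1)^2 / (2 sigma^2)."""
    return lam * (mu2 - mu1) ** 2 / (2 * sigma ** 2)
```

With e.g. $\mu_1 = 0$, $\mu_2 = 1$, $\sigma = 1.5$, $\lambda = 3$, the two agree to several decimal places.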
Using the Chernoff bound (6.7), we can bound the divergence variable:

$$\mathbb P(L(p || q) \ge \epsilon) \le {\mathbb E \exp(t L(p || q)) \over \exp(t \epsilon)} = \exp (\kappa_{p, q}(t) - \epsilon t). \qquad (7.7)$$

For a function $f: I \to \mathbb R$, denote its Legendre transform by

$$f^*(\epsilon) := \sup_{t \in I} (\epsilon t - f(t)).$$

By taking the infimum on the RHS of (7.7), we obtain

**Claim 20**. Two probability densities $p$ and $q$ are $(\epsilon, \exp(-\kappa_{p, q}^*(\epsilon)))$-ind.

Given a mechanism $M$, let $\kappa_M(t)$ denote an upper bound for the cumulant of its privacy loss:

$$\log \mathbb E \exp(t L(M(x) || M(x'))) \le \kappa_M(t), \qquad \forall x, x'\text{ with } d(x, x') = 1.$$

For example, we can set $\kappa_M(t) = t \rho(t + 1)$. Using the same argument we have the following:

**Claim 21**.

1. If $M$ is $(\lambda, \rho)$-rdp, then it is also $(\epsilon, \exp((\lambda - 1) (\rho - \epsilon)))$-dp for any $\epsilon \ge \rho$.
2. Alternatively, $M$ is $(\epsilon, \exp(- \kappa_M^*(\epsilon)))$-dp for any $\epsilon > 0$.
3. Alternatively, for any $0 < \delta \le 1$, $M$ is $(\rho + (\lambda - 1)^{-1} \log \delta^{-1}, \delta)$-dp.

**Example 2 (Gaussian mechanism)**. We can apply the above argument to the Gaussian mechanism on query $f$ and get:

$$\delta \le \inf_{\lambda > 1} \exp((\lambda - 1) ({\lambda \over 2 \sigma^2} - \epsilon)).$$

Assuming $\sigma^2 > (2 \epsilon)^{-1}$, the infimum is achieved when $\lambda = (1 + 2 \epsilon \sigma^2) / 2$, and

$$\delta \le \exp(- ((2 \sigma)^{-1} - \epsilon \sigma)^2 / 2),$$

which is the same result as (6.8), obtained using the Chernoff bound of the noise.

However, as we will see later, compositions will yield different results from those obtained with the methods in [Part 1](2019-03-13-a-tail-of-two-densities.html) when considering Rényi dp.

**Claim 22 (Moment Composition Theorem)**.
Let $M$ be the adaptive composition of $M_{1 : k}$. Suppose for any $y_{< i}$, $M_i(y_{< i})$ is $(\lambda, \rho)$-rdp. Then $M$ is $(\lambda, k\rho)$-rdp.

**Proof**. Rather straightforward. As before, let $p_i$ and $q_i$ be the conditional laws of the adaptive composition of $M_{1 : i}$ at $x$ and $x'$ respectively, and $p^i$ and $q^i$ the joint laws of $M_{1 : i}$ at $x$ and $x'$ respectively. Denote

$$D_i = \mathbb E \exp((\lambda - 1)\log {p^i(\xi_{1 : i}) \over q^i(\xi_{1 : i})}).$$

Then

$$\begin{aligned}
D_i &= \mathbb E\mathbb E \left(\exp((\lambda - 1)\log {p_i(\xi_i | \xi_{< i}) \over q_i(\xi_i | \xi_{< i})}) \exp((\lambda - 1)\log {p^{i - 1}(\xi_{< i}) \over q^{i - 1}(\xi_{< i})}) \big| \xi_{< i}\right) \\
&= \mathbb E \mathbb E \left(\exp((\lambda - 1)\log {p_i(\xi_i | \xi_{< i}) \over q_i(\xi_i | \xi_{< i})}) | \xi_{< i}\right) \exp\left((\lambda - 1)\log {p^{i - 1}(\xi_{< i}) \over q^{i - 1}(\xi_{< i})}\right)\\
&\le \mathbb E \exp((\lambda - 1) \rho) \exp\left((\lambda - 1)\log {p^{i - 1}(\xi_{< i}) \over q^{i - 1}(\xi_{< i})}\right)\\
&= \exp((\lambda - 1) \rho) D_{i - 1}.
\end{aligned}$$

Applying this recursively we have

$$D_k \le \exp(k(\lambda - 1) \rho),$$

and so

$$(\lambda - 1)^{-1} \log \mathbb E \exp((\lambda - 1)\log {p^k(\xi_{1 : k}) \over q^k(\xi_{1 : k})}) = (\lambda - 1)^{-1} \log D_k \le k \rho.$$

Since this holds for all $x$ and $x'$, we are done. $\square$

This, together with the scaling property of the Legendre transform:

$$(k f)^*(x) = k f^*(x / k),$$

yields

**Claim 23**. The $k$-fold adaptive composition of $(\lambda, \rho(\lambda))$-rdp mechanisms is $(\epsilon, \exp(- k \kappa^*(\epsilon / k)))$-dp, where $\kappa(t) := t \rho(t + 1)$.

**Example 3 (Gaussian mechanism)**. We can apply the above claim to the Gaussian mechanism. Again, without loss of generality we assume $S_f = 1$. But let us do it manually to get the same results.
If we apply the Moment Composition Theorem to an adaptive composition of Gaussian mechanisms on the same query, then since each $M_i$ is $(\lambda, (2 \sigma^2)^{-1} \lambda)$-rdp, the composition $M$ is $(\lambda, (2 \sigma^2)^{-1} k \lambda)$-rdp. Processing this using the Chernoff bound as in the previous example, we have

$$\delta = \exp(- ((2 \sigma / \sqrt k)^{-1} - \epsilon \sigma / \sqrt k)^2 / 2).$$

Substituting $\sigma$ with $\sigma / \sqrt k$ in (6.81), we conclude that if

$$\sigma > \sqrt k \left(\epsilon^{-1} \sqrt{2 \log \delta^{-1}} + (2 \epsilon)^{- {1 \over 2}}\right)$$

then the composition $M$ is $(\epsilon, \delta)$-dp.

As we will see in the discussions at the end of this post, this result is different from (and probably better than) the one obtained by using the Advanced Composition Theorem (Claim 18).

We also have a subsampling theorem for Rényi dp.

**Claim 24**. Fix $r \in [0, 1]$. Let $m \le n$ be two nonnegative integers with $m = r n$. Let $N$ be a $(\lambda, \rho)$-rdp mechanism on $X^m$. Let $\mathcal I := \{J \subset [n]: |J| = m\}$ be the set of subsets of $[n]$ of size $m$. Define mechanism $M$ on $X^n$ by

$$M(x) = N(x_\gamma)$$

where $\gamma$ is sampled uniformly from $\mathcal I$. Then $M$ is $(\lambda, {1 \over \lambda - 1} \log (1 + r(e^{(\lambda - 1) \rho} - 1)))$-rdp.

To prove Claim 24, we need a useful lemma:

**Claim 25**. Let $p_{1 : n}$ and $q_{1 : n}$ be nonnegative reals, and $\lambda > 1$. Then

$${(\sum p_i)^\lambda \over (\sum q_i)^{\lambda - 1}} \le \sum_i {p_i^\lambda \over q_i^{\lambda - 1}}. \qquad (8)$$

**Proof**. Let

$$r(i) := p_i / P, \qquad u(i) := q_i / Q$$

where

$$P := \sum p_i, \qquad Q := \sum q_i;$$

then $r$ and $u$ are probability mass functions.
Plugging $p_i = r(i) P$ and $q_i = u(i) Q$ into the objective (8), it suffices to show

$$1 \le \sum_i {r(i)^\lambda \over u(i)^{\lambda - 1}} = \mathbb E_{\xi \sim u} \left({r(\xi) \over u(\xi)}\right)^\lambda.$$

This is true due to Jensen's Inequality:

$$\mathbb E_{\xi \sim u} \left({r(\xi) \over u(\xi)}\right)^\lambda \ge \left(\mathbb E_{\xi \sim u} {r(\xi) \over u(\xi)} \right)^\lambda = 1.$$

$\square$

**Proof of Claim 24**. Define $\mathcal I$ as before.

Let $p$ and $q$ be the laws of $M(x)$ and $M(x')$ respectively. For any $I \in \mathcal I$, let $p_I$ and $q_I$ be the laws of $N(x_I)$ and $N(x_I')$ respectively. Then we have

$$\begin{aligned}
p(y) &= n^{-1} \sum_{I \in \mathcal I} p_I(y) \\
q(y) &= n^{-1} \sum_{I \in \mathcal I} q_I(y),
\end{aligned}$$

where $n = |\mathcal I|$.

The MGF of $L(p || q)$ is thus

$$\mathbb E \exp((\lambda - 1) L(p || q)) = n^{-1} \int {(\sum_I p_I(y))^\lambda \over (\sum_I q_I(y))^{\lambda - 1}} dy \le n^{-1} \sum_I \int {p_I(y)^\lambda \over q_I(y)^{\lambda - 1}} dy, \qquad (9)$$

where in the last step we used Claim 25. As in the proof of Claim 19, we divide $\mathcal I$ into disjoint sets $\mathcal I_\in$ and $\mathcal I_\notin$, and denote by $n_\in$ and $n_\notin$ their cardinalities. Then the right hand side of (9) becomes

$$n^{-1} \sum_{I \in \mathcal I_\in} \int {p_I(y)^\lambda \over q_I(y)^{\lambda - 1}} dy + n^{-1} \sum_{I \in \mathcal I_\notin} \int {p_I(y)^\lambda \over q_I(y)^{\lambda - 1}} dy.$$

The summands in the first term are the MGFs of $L(p_I || q_I)$, and the summands in the second term are $1$, so

$$\begin{aligned}
\mathbb E \exp((\lambda - 1) L(p || q)) &\le n^{-1} \sum_{I \in \mathcal I_\in} \mathbb E \exp((\lambda - 1) L(p_I || q_I)) + (1 - r) \\
&\le n^{-1} \sum_{I \in \mathcal I_\in} \exp((\lambda - 1) D_\lambda(p_I || q_I)) + (1 - r) \\
&\le r \exp((\lambda - 1) \rho) + (1 - r).
\end{aligned}$$

Taking the log and dividing by $(\lambda - 1)$ on both sides, we have

$$D_\lambda(p || q) \le (\lambda - 1)^{-1} \log (1 + r(\exp((\lambda - 1) \rho) - 1)).$$

$\square$

As before, we can rewrite the conclusion of Claim 24 using $1 + z \le e^z$ and obtain $(\lambda, (\lambda - 1)^{-1} r (e^{(\lambda - 1) \rho} - 1))$-rdp, which further gives $(\lambda, \alpha^{-1} (e^\alpha - 1) r \rho)$-rdp (or $(\lambda, O(r \rho))$-rdp) if $(\lambda - 1) \rho < \alpha$ for some $\alpha$.

It is not hard to see that the subsampling theorem for the moment method, though similar to its counterpart in the usual method, does not help here, due to the lack of an analogue of the advanced composition theorem for the moments.

**Example 4 (Gaussian mechanism)**. Applying the moment subsampling theorem to the Gaussian mechanism, we obtain $(\lambda, O(r \lambda / \sigma^2))$-rdp for a subsampled Gaussian mechanism with rate $r$. Abadi-Chu-Goodfellow-McMahan-Mironov-Talwar-Zhang 2016 (ACGMMTZ16 in the following), however, gains an extra $r$ in the bound given certain assumptions.

ACGMMTZ16
---------

What follows is my understanding of this result. I call it a hypothesis because there is a gap: I am not able to reproduce their proof or prove the result myself. This does not mean the result is false. On the contrary, I am inclined to believe it is true.

**Hypothesis 0**. For a subsampled Gaussian mechanism with ratio $r$, if $r = O(\sigma^{-1})$ and $\lambda = O(\sigma^2)$, then we have $(\lambda, O(r^2 \lambda / \sigma^2))$-rdp.

Recall the definition of $G_\lambda$ under the definition of Rényi differential privacy. The following claim will be useful.

**Claim 26**. Let $\lambda$ be a positive integer; then

$$G_\lambda(r p + (1 - r) q || q) = \sum_{k = 0 : \lambda} {\lambda \choose k} r^k (1 - r)^{\lambda - k} G_k(p || q).$$

**Proof**. Quite straightforward, by expanding the numerator $(r p + (1 - r) q)^\lambda$ using the binomial expansion.
$\square$

**Proof of Hypothesis 0 (Incomplete)**. I will break the proof into two parts and discuss each one.

1. The MGF of the privacy loss is bounded by that of $L(r \mu_1 + (1 - r) \mu_0 || \mu_0)$, where $\mu_i = N(i, \sigma^2)$.
2. If $r \le c_1 \sigma^{-1}$ and $\lambda \le c_2 \sigma^2$, then there exists $C = C(c_1, c_2)$ such that $G_\lambda (r \mu_1 + (1 - r) \mu_0 || \mu_0) \le C$ (since $O(r^2 \lambda^2 / \sigma^2) = O(1)$).

**Remark in the proof**. Note that the choice of $c_1$, $c_2$ and the function $C(c_1, c_2)$ are important for the practicality and usefulness of Hypothesis 0.

Item 1 can be deduced from the following hypothesis, which, as the only gap in the proof of Hypothesis 0, is heuristically reasonable; hence I am inclined to believe Hypothesis 0 is true.

**Hypothesis 1**. Let $p_i$, $q_i$, $\mu_i$, $\nu_i$ be probability densities on the same space for $i = 1 : n$. If $D_\lambda(p_i || q_i) \le D_\lambda(\mu_i || \nu_i)$ for all $i$, then

$$D_\lambda(n^{-1} \sum_i p_i || n^{-1} \sum_i q_i) \le D_\lambda(n^{-1} \sum_i \mu_i || n^{-1} \sum_i \nu_i).$$

Basically, it says "if for each $i$, $p_i$ and $q_i$ are closer to each other than $\mu_i$ and $\nu_i$, then so are their averages over $i$".

To see how Hypothesis 1 implies Item 1, note that

$$D_\lambda(q_I || p_I) = D_\lambda(p_I || q_I)
\begin{cases}
\le D_\lambda(\mu_0 || \mu_1) = D_\lambda(\mu_1 || \mu_0), & I \in \mathcal I_\in\\
= D_\lambda(\mu_0 || \mu_0) = D_\lambda(\mu_1 || \mu_1) = 0, & I \in \mathcal I_\notin
\end{cases}$$

and that $p = |\mathcal I|^{-1} \sum_{I \in \mathcal I} p_I$ and $q = |\mathcal I|^{-1} \sum_{I \in \mathcal I} q_I$ and $|\mathcal I_\in| = r |\mathcal I|$.

Alternatively, one may try to prove a weaker version of Hypothesis 1, by specialising to mixtures of Gaussians.

One way to prove Hypothesis 1 would be to prove an equivalent statement:

**Hypothesis 2**.
Let $p_1$, $q_1$, $p_2$, $q_2$, $\mu_1$, $\mu_2$, $\nu_1$, $\nu_2$ be probability densities, and suppose

$$\int {p_i^\lambda \over q_i^{\lambda - 1}} \le \int {\mu_i^\lambda \over \nu_i^{\lambda - 1}}, \qquad i = 1, 2;$$

then

$$\int {(p_1 + p_2)^\lambda \over (q_1 + q_2)^{\lambda - 1}} \le \int {(\mu_1 + \mu_2)^\lambda \over (\nu_1 + \nu_2)^{\lambda - 1}}.$$

Indeed, on one hand Hypothesis 2 is a special case of Hypothesis 1, and on the other hand, given Hypothesis 2, Hypothesis 1 can be shown by induction, replacing one pair of densities at a time.

Now let us prove Item 2.

Using Claim 26 and Example 1, we have

$$\begin{aligned}
G_\lambda(r \mu_1 + (1 - r) \mu_0 || \mu_0) &= \sum_{j = 0 : \lambda} {\lambda \choose j} r^j (1 - r)^{\lambda - j} G_j(\mu_1 || \mu_0)\\
&= \sum_{j = 0 : \lambda} {\lambda \choose j} r^j (1 - r)^{\lambda - j} \exp(j (j - 1) / (2 \sigma^2)). \qquad (9.5)
\end{aligned}$$

Let $n = \lceil \sigma^2 \rceil$. It suffices to show

$$\sum_{j = 0 : n} {n \choose j} (c_1 n^{- 1 / 2})^j (1 - c_1 n^{- 1 / 2})^{n - j} \exp(c_2 j (j - 1) / (2 n)) \le C.$$

Note that we can discard the negative linear term $- c_2 j / (2 n)$ in the exponent, since we want to bound the sum from above.

We examine the asymptotics of this sum when $n$ is large, and treat the sum as an approximation to the integral of a function $\phi: [0, 1] \to \mathbb R$. For $j = x n$, where $x \in (0, 1)$, $\phi$ is thus defined as (note we multiply the summand by $n$ to compensate for the uniform measure on $\{1, ..., n\}$):

$$\begin{aligned}
\phi_n(x) &:= n {n \choose j} (c_1 n^{- 1 / 2})^j (1 - c_1 n^{- 1 / 2})^{n - j} \exp(c_2 j^2 / (2 n)) \\
&= n {n \choose x n} (c_1 n^{- 1 / 2})^{x n} (1 - c_1 n^{- 1 / 2})^{(1 - x) n} \exp(c_2 x^2 n / 2).
\end{aligned}$$

Using Stirling's approximation

$$n!
\approx \sqrt{2 \pi n} n^n e^{- n},$$

we can approximate the binomial coefficient:

$${n \choose x n} \approx (\sqrt{2 \pi x (1 - x)} x^{x n} (1 - x)^{(1 - x) n})^{-1}.$$

We also approximate

$$(1 - c_1 n^{- 1 / 2})^{(1 - x) n} \approx \exp(- c_1 \sqrt{n} (1 - x)).$$

With these we have

$$\phi_n(x) \approx {1 \over \sqrt{2 \pi x (1 - x)}} \exp\left(- {1 \over 2} x n \log n + (x \log c_1 - x \log x - (1 - x) \log (1 - x) + {1 \over 2} c_2 x^2) n + {1 \over 2} \log n\right).$$

This vanishes as $n \to \infty$, and since $\phi_n(x)$ is bounded above by the integrable function ${1 \over \sqrt{2 \pi x (1 - x)}}$ (cf. the arcsine law) and below by $0$, we may invoke the dominated convergence theorem and exchange the integral with the limit to get

$$\begin{aligned}
\lim_{n \to \infty} &G_n (r \mu_1 + (1 - r) \mu_0 || \mu_0) \\
&\le \lim_{n \to \infty} \int \phi_n(x) dx = \int \lim_{n \to \infty} \phi_n(x) dx = 0.
\end{aligned}$$

Thus we have that the generating function of the divergence variable $L(r \mu_1 + (1 - r) \mu_0 || \mu_0)$ is bounded.

Can this be true for better orders

$$r \le c_1 \sigma^{- d_r},\qquad \lambda \le c_2 \sigma^{d_\lambda}$$

for some $d_r \in (0, 1]$ and $d_\lambda \in [2, \infty)$? If we follow the same approximation using these exponents, then letting $n = c_2 \sigma^{d_\lambda}$,

$$\begin{aligned}
{n \choose j} &r^j (1 - r)^{n - j} G_j(\mu_0 || \mu_1) \le \phi_n(x) \\
&\approx {1 \over \sqrt{2 \pi x (1 - x)}} \exp\left({1 \over 2} c_2^{2 \over d_\lambda} x^2 n^{2 - {2 \over d_\lambda}} - {d_r \over 2} x n \log n + (x \log c_1 - x \log x - (1 - x) \log (1 - x)) n + {1 \over 2} \log n\right).
\end{aligned}$$

So we see that to keep the divergence moments bounded, it is possible to have any $r = O(\sigma^{- d_r})$ for $d_r \in (0, 1)$, but relaxing $\lambda$ may not be safe.
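The boundedness claimed in Item 2 can also be probed numerically. Below is a small sketch (my own, with hypothetical parameter choices $c_1 = c_2 = 1$, i.e. $r = \sigma^{-1}$ and $\lambda = \sigma^2$) that evaluates the sum (9.5) in log space, using $G_j(\mu_1 || \mu_0) = \exp(j (j - 1) / (2 \sigma^2))$:

```python
import math

def log_moment_subsampled_gaussian(lam, r, sigma):
    """log G_lam(r mu_1 + (1 - r) mu_0 || mu_0) for integer lam via (9.5),
    with G_j(mu_1 || mu_0) = exp(j (j - 1) / (2 sigma^2)).
    Computed in log space (lgamma + log-sum-exp) to avoid overflow for large lam."""
    logs = []
    for j in range(lam + 1):
        log_binom = (math.lgamma(lam + 1) - math.lgamma(j + 1)
                     - math.lgamma(lam - j + 1))
        logs.append(log_binom + j * math.log(r) + (lam - j) * math.log1p(-r)
                    + j * (j - 1) / (2 * sigma ** 2))
    m = max(logs)  # log-sum-exp for numerical stability
    return m + math.log(sum(math.exp(l - m) for l in logs))
```

In my quick checks the value stays a little above $0.5$ for $\sigma$ between $10$ and $40$ with $r = \sigma^{-1}$ and $\lambda = \sigma^2$, consistent with the existence of a uniform bound $C$.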
If we relax $r$, then we get

$$G_\lambda(r \mu_1 + (1 - r) \mu_0 || \mu_0) = O(r^{2 / d_r} \lambda^2 \sigma^{-2}) = O(1).$$

Note that now the constant $C$ depends on $d_r$ as well. Numerical experiments seem to suggest that $C$ can increase quite rapidly as $d_r$ decreases from $1$. $\square$

In the following, for consistency we retain $k$ as the number of epochs, and use $T := k / r$ to denote the number of compositions / steps / minibatches. With Hypothesis 0 we have:

**Hypothesis 3**. Let $\epsilon, c_1, c_2 > 0$, $r \le c_1 \sigma^{-1}$, and $T = {c_2 \over 2 C(c_1, c_2)} \epsilon \sigma^2$. Then DP-SGD with subsampling rate $r$ and $T$ steps is $(\epsilon, \delta)$-dp for

$$\delta = \exp(- {1 \over 2} c_2 \sigma^2 \epsilon).$$

In other words, for

$$\sigma \ge \sqrt{2 c_2^{-1}} \epsilon^{- {1 \over 2}} \sqrt{\log \delta^{-1}},$$

we can achieve $(\epsilon, \delta)$-dp.

**Proof**. By Hypothesis 0 and the Moment Composition Theorem (Claim 22), for $\lambda = c_2 \sigma^2$, substituting $T = {c_2 \over 2 C(c_1, c_2)} \epsilon \sigma^2$, we have

$$\mathbb P(L(p || q) \ge \epsilon) \le \exp(T C(c_1, c_2) - \lambda \epsilon) = \exp\left(- {1 \over 2} c_2 \sigma^2 \epsilon\right).$$

$\square$

**Remark**. Hypothesis 3 is my understanding / version of Theorem 1 in \[ACGMMTZ16\], using the same proof technique. Here I quote the original version of the theorem, with notions and notations altered for consistency with this post:

> There exists constants $c_1', c_2' > 0$ so that for any $\epsilon < c_1' r^2 T$, DP-SGD is $(\epsilon, \delta)$-differentially private for any $\delta > 0$ if we choose

$$\sigma \ge c_2' {r \sqrt{T \log (1 / \delta)} \over \epsilon}. \qquad (10)$$

I am however unable to reproduce this version, assuming Hypothesis 0 is true, for the following reasons:

1. In the proof in the paper, we have $\epsilon = c_1' r^2 T$ instead of "less than" in the statement of the theorem.
    If we change it to $\epsilon < c_1' r^2 T$ then the direction of the inequality becomes opposite to the direction we want to prove:
    $$\exp(T C(c_1, c_2) - \lambda \epsilon) \ge ...$$

2. The implicit condition $r = O(\sigma^{-1})$ of Hypothesis 0, whose result is used in the proof of this theorem, is not mentioned in the statement of the theorem. The implication is that (10) becomes an ill-formed condition, as the right hand side also depends on $\sigma$.

TensorFlow implementation
-------------------------

DP-SGD is implemented in [TensorFlow Privacy](https://github.com/tensorflow/privacy). In the following I discuss the package in its current state (2019-03-11). It is divided into two parts: `optimizers`, which implements the actual differentially private algorithms, and `analysis`, which computes the privacy guarantees.

The `analysis` part implements a privacy ledger that "keeps a record of all queries executed over a given dataset for the purpose of computing privacy guarantees". On the other hand, all the computation is done in `rdp_accountant.py` ([link](https://github.com/tensorflow/privacy/blob/7e2d796bdee9b60dce21a82a397eefda35b0ac10/privacy/analysis/rdp_accountant.py)). At this moment, `rdp_accountant.py` only implements the computation of the privacy guarantees for DP-SGD with the Gaussian mechanism. In the following I will briefly explain the code in this file.

Some notational correspondences: their `alpha` is our $\lambda$, their `q` is our $r$, their `A_alpha` (in the comments) is our $\kappa_{r N(1, \sigma^2) + (1 - r) N(0, \sigma^2), N(0, \sigma^2)} (\lambda - 1)$, at least when $\lambda$ is an integer.

- The function `_compute_log_a` presumably computes the cumulants $\kappa_{r N(1, \sigma^2) + (1 - r) N(0, \sigma^2), N(0, \sigma^2)}(\lambda - 1)$. It calls `_compute_log_a_int` or `_compute_log_a_frac` depending on whether $\lambda$ is an integer.
- The function `_compute_log_a_int` computes the cumulant using (9.5).
- When $\lambda$ is not an integer, we can't use (9.5). I have yet to decode how `_compute_log_a_frac` computes the cumulant (or an upper bound of it) in this case.
- The function `_compute_delta` computes $\delta$s for a list of $\lambda$s and $\kappa$s using Item 1 of Claim 21 and returns the smallest one, and the function `_compute_epsilon` computes $\epsilon$s using Item 3 in the same way.

In `optimizers`, among other things, the DP-SGD with Gaussian mechanism is implemented in `dp_optimizer.py` and `gaussian_query.py`. See the definition of `DPGradientDescentGaussianOptimizer` in `dp_optimizer.py` and trace the calls therein.

At this moment, the privacy guarantee computation part and the optimizer part are separate, with `rdp_accountant.py` called in `compute_dp_sgd_privacy.py` with user-supplied parameters. I think this is due to the lack of implementation in `rdp_accountant.py` of any non-DP-SGD-with-Gaussian privacy guarantee computation. There is already [an issue on this](https://github.com/tensorflow/privacy/issues/23), so hopefully it won't be long before the privacy guarantees can be automatically computed given a DP-SGD instance.

Comparison among different methods
----------------------------------

So far we have seen three routes to compute the privacy guarantees for DP-SGD with the Gaussian mechanism:

1. Claim 9 (single Gaussian mechanism privacy guarantee) -> Claim 19 (Subsampling Theorem) -> Claim 18 (Advanced Adaptive Composition Theorem)
2. Example 1 (RDP for the Gaussian mechanism) -> Claim 22 (Moment Composition Theorem) -> Example 3 (moment composition applied to the Gaussian mechanism)
3. Hypothesis 0 (RDP for the Gaussian mechanism with specific magnitudes for the subsampling rate) -> Hypothesis 3 (Moment Composition Theorem and translation to conventional dp)

Which one is the best?

To make a fair comparison, we may use one parameter as the metric and set all the others to be the same.
For example, we can

1. given the same $\epsilon$, $r$ (in Routes 1 and 3), $k$ and $\sigma$, compare the $\delta$s;
2. given the same $\epsilon$, $r$ (in Routes 1 and 3), $k$ and $\delta$, compare the $\sigma$s;
3. given the same $\delta$, $r$ (in Routes 1 and 3), $k$ and $\sigma$, compare the $\epsilon$s.

I find the first option, where $\delta$ is used as the metric, the best. This is because we have the tightest bounds and the cleanest formulas when comparing the $\delta$s. For example, the Azuma and Chernoff bounds are both expressed as bounds on $\delta$. On the other hand, inverting these bounds requires a cost either in tightness (Claim 9, bounds on $\sigma$) or in the complexity of the formula (Claim 18, the Advanced Adaptive Composition Theorem, bounds on $\epsilon$).

So if we use $\sigma$ or $\epsilon$ as the metric, we either get a less fair comparison or have to use much more complicated formulas as the bounds.

Let us first compare Route 1 and Route 2 without specialising to the Gaussian mechanism.

**Disclaimer**. What follows is my original research and has not been reviewed by anyone.

Suppose each mechanism $N_i$ satisfies $(\epsilon', \delta(\epsilon'))$-dp.
Let $\tilde \epsilon := \log (1 + r (e^{\epsilon'} - 1))$; then the subsampled mechanism $M_i(x) = N_i(x_\gamma)$ is $(\tilde \epsilon, r \tilde \delta(\tilde \epsilon))$-dp, where

$$\tilde \delta(\tilde \epsilon) = \delta(\log (r^{-1} (\exp(\tilde \epsilon) - 1) + 1)).$$

Using the Azuma bound in the proof of the Advanced Adaptive Composition Theorem (6.99):

$$\mathbb P(L(p^k || q^k) \ge \epsilon) \le \exp(- {(\epsilon - r^{-1} k a(\tilde\epsilon))^2 \over 2 r^{-1} k (\tilde\epsilon + a(\tilde\epsilon))^2}).$$

So we have the final bound for Route 1:

$$\delta_1(\epsilon) = \min_{\tilde \epsilon: \epsilon > r^{-1} k a(\tilde \epsilon)} \exp(- {(\epsilon - r^{-1} k a(\tilde\epsilon))^2 \over 2 r^{-1} k (\tilde\epsilon + a(\tilde\epsilon))^2}) + k \tilde \delta(\tilde \epsilon).$$

As for Route 2, since we do not gain anything from subsampling in RDP, we do not subsample at all.

By Claim 23, we have the bound for Route 2:

$$\delta_2(\epsilon) = \exp(- k \kappa^* (\epsilon / k)).$$

On one hand, one can compare $\delta_1$ and $\delta_2$ with numerical experiments. On the other hand, if we further specify $\delta(\epsilon')$ in Route 1 as the Chernoff bound for the cumulants of the divergence variable, i.e.

$$\delta(\epsilon') = \exp(- \kappa^* (\epsilon')),$$

we have

$$\delta_1 (\epsilon) = \min_{\tilde \epsilon: a(\tilde \epsilon) < r k^{-1} \epsilon} \exp(- {(\epsilon - r^{-1} k a(\tilde\epsilon))^2 \over 2 r^{-1} k (\tilde\epsilon + a(\tilde\epsilon))^2}) + k \exp(- \kappa^* (b(\tilde\epsilon))),$$

where

$$b(\tilde \epsilon) := \log (r^{-1} (\exp(\tilde \epsilon) - 1) + 1) \le r^{-1} \tilde\epsilon.$$

We note that since $a(\tilde \epsilon) = \tilde\epsilon(e^{\tilde \epsilon} - 1) 1_{\tilde\epsilon < \log 2} + \tilde\epsilon 1_{\tilde\epsilon \ge \log 2}$, we may compare the two cases separately.
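Specialising to the Gaussian mechanism for a moment (my own sketch, hypothetical helper names): since $\kappa(t) = t (t + 1) / (2 \sigma^2)$, the Legendre transform has the closed form $\kappa^*(s) = (\sigma s - (2 \sigma)^{-1})^2 / 2$ for $s > (2 \sigma^2)^{-1}$, which makes $\delta_2$ easy to evaluate:

```python
import math

def kappa_star_gauss(s, sigma):
    """Legendre transform of kappa(t) = t (t + 1) / (2 sigma^2),
    valid for s > 1 / (2 sigma^2), where the supremum is attained at t > 0."""
    assert s > 1 / (2 * sigma ** 2)
    return (sigma * s - 1 / (2 * sigma)) ** 2 / 2

def delta_route2(eps, k, sigma):
    """delta_2(eps) = exp(-k kappa*(eps / k)) for the k-fold composed Gaussian
    mechanism with sensitivity 1, as in Claim 23 / Example 3."""
    return math.exp(-k * kappa_star_gauss(eps / k, sigma))
```

By construction this agrees with the closed form in Example 3, $\exp(- ((2 \sigma / \sqrt k)^{-1} - \epsilon \sigma / \sqrt k)^2 / 2)$.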
Note that $\kappa^*$ is a monotonically increasing function, therefore

$$\kappa^* (b(\tilde\epsilon)) \le \kappa^*(r^{-1} \tilde\epsilon).$$

So for $\tilde \epsilon \ge \log 2$, we have

$$k \exp(- \kappa^*(b(\tilde\epsilon))) \ge k \exp(- \kappa^*(r^{-1} \tilde \epsilon)) \ge k \exp(- \kappa^*(k^{-1} \epsilon)) \ge \delta_2(\epsilon).$$

For $\tilde\epsilon < \log 2$, it is harder to compare, as now

$$k \exp(- \kappa^*(b(\tilde\epsilon))) \ge k \exp(- \kappa^*(\epsilon / \sqrt{r k})).$$

It is tempting to believe that this should also be greater than $\delta_2(\epsilon)$. But I cannot say for sure. At least in the special case of the Gaussian mechanism, we have

$$k \exp(- \kappa^*(\epsilon / \sqrt{r k})) = k \exp(- (\sigma \sqrt{\epsilon / k r} - (2 \sigma)^{-1})^2) \ge \exp(- k ({\sigma \epsilon \over k} - (2 \sigma)^{-1})^2) = \delta_2(\epsilon)$$

when $\epsilon$ is sufficiently small. However, we still need to consider the case where $\epsilon$ is not too small. But overall it seems most likely that Route 2 is superior to Route 1.

So let us compare Route 2 with Route 3.

Given the condition to obtain the Chernoff bound

$${\sigma \epsilon \over k} > (2 \sigma)^{-1},$$

we have

$$\delta_2(\epsilon) > \exp(- k (\sigma \epsilon / k)^2) = \exp(- \sigma^2 \epsilon^2 / k).$$

For this to achieve the same bound

$$\delta_3(\epsilon) = \exp\left(- {1 \over 2} c_2 \sigma^2 \epsilon\right)$$

we need $k < {2 \epsilon \over c_2}$. This is only possible if $c_2$ is small or $\epsilon$ is large, since $k$ is a positive integer.

So taken at face value, Route 3 seems to achieve the best results. However, it also has some implicit conditions that need to be satisfied: first, $T$ needs to be at least $1$, meaning

$${c_2 \over C(c_1, c_2)} \epsilon \sigma^2 \ge 1.$$

Second, $k$ needs to be at least $1$ as well, i.e.
$$k = r T \ge {c_1 c_2 \over C(c_1, c_2)} \epsilon \sigma \ge 1.$$

Both conditions rely on the magnitudes of $\epsilon$, $\sigma$, $c_1$, $c_2$, and on the rate of growth of $C(c_1, c_2)$. The biggest problem on this list is the last, because if we knew how fast $C$ grows, we would have a better idea of what constraints the parameters must satisfy to achieve the result in Route 3.

Further questions
-----------------

Here is a list of what I think may be interesting topics or potential problems to look at, with no guarantee that they are all awesome untouched research problems:

1. Prove Hypothesis 2.
2. Find a theoretically definitive answer to whether the methods in Part 1 or Part 2 yield better privacy guarantees.
3. Study the non-Gaussian cases, general or specific. Let $p$ be some probability density; what is the tail bound of $L(p(y) || p(y + \alpha))$ for $|\alpha| \le 1$? Can you find anything better than the Gaussian? For a start, perhaps the nice tables of Rényi divergences in Gil-Alajaji-Linder 2013 may be useful?
4. Find out how useful Hypothesis 0 is. Perhaps start by computing the constant $C$ numerically.
5. Help with [the aforementioned issue](https://github.com/tensorflow/privacy/issues/23) in the TensorFlow Privacy package.

References
----------

- Abadi, Martín, Andy Chu, Ian Goodfellow, H. Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. "Deep Learning with Differential Privacy." Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security - CCS'16, 2016, 308--18. <https://doi.org/10.1145/2976749.2978318>.
- Erven, Tim van, and Peter Harremoës. "Rényi Divergence and Kullback-Leibler Divergence." IEEE Transactions on Information Theory 60, no. 7 (July 2014): 3797--3820. <https://doi.org/10.1109/TIT.2014.2320500>.
- Gil, M., F. Alajaji, and T. Linder. "Rényi Divergence Measures for Commonly Used Univariate Continuous Distributions."
  Information Sciences 249 (November 2013): 124--31. <https://doi.org/10.1016/j.ins.2013.06.018>.
- Mironov, Ilya. "Rényi Differential Privacy." 2017 IEEE 30th Computer Security Foundations Symposium (CSF), August 2017, 263--75. <https://doi.org/10.1109/CSF.2017.11>.