1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
|
<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>Great but Manageable Expectations</title>
<link rel="stylesheet" href="../assets/css/default.css" />
<script data-isso="/comments/"
data-isso-css="true"
data-isso-lang="en"
data-isso-reply-to-self="false"
data-isso-require-author="true"
data-isso-require-email="true"
data-isso-max-comments-top="10"
data-isso-max-comments-nested="5"
data-isso-reveal-on-click="5"
data-isso-avatar="true"
data-isso-avatar-bg="#f0f0f0"
data-isso-avatar-fg="#9abf88 #5698c4 #e279a3 #9163b6 ..."
data-isso-vote="true"
data-vote-levels=""
src="/comments/js/embed.min.js"></script>
<script src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML" type="text/javascript"></script>
<script src="../assets/js/analytics.js" type="text/javascript"></script>
</head>
<body>
<header>
<span class="logo">
<a href="../blog.html">Yuchen's Blog</a>
</span>
<nav>
<a href="../index.html">About</a><a href="../postlist.html">All posts</a><a href="../blog-feed.xml">Feed</a>
</nav>
</header>
<div class="main">
<div class="bodyitem">
<h2> Great but Manageable Expectations </h2>
<p>Posted on 2019-03-14 | <a href="/posts/2019-03-14-great-but-manageable-expectations.html#isso-thread">Comments</a> </p>
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" lang="" xml:lang="">
<head>
<meta charset="utf-8" />
<meta name="generator" content="pandoc" />
<meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=yes" />
<title>Untitled</title>
<style>
code{white-space: pre-wrap;}
span.smallcaps{font-variant: small-caps;}
span.underline{text-decoration: underline;}
div.column{display: inline-block; vertical-align: top; width: 50%;}
</style>
<script src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/MathJax.js?config=TeX-AMS_CHTML-full" type="text/javascript"></script>
<!--[if lt IE 9]>
<script src="//cdnjs.cloudflare.com/ajax/libs/html5shiv/3.7.3/html5shiv-printshiv.min.js"></script>
<![endif]-->
</head>
<body>
<nav id="TOC">
<ul>
<li><a href="#rényi-divergence-and-differential-privacy">Rényi divergence and differential privacy</a></li>
<li><a href="#acgmmtz16">ACGMMTZ16</a></li>
<li><a href="#tensorflow-implementation">Tensorflow implementation</a></li>
<li><a href="#comparison-among-different-methods">Comparison among different methods</a></li>
<li><a href="#further-questions">Further questions</a></li>
<li><a href="#references">References</a></li>
</ul>
</nav>
<p>This is Part 2 of a two-part blog post on differential privacy. Continuing from <a href="/posts/2019-03-13-a-tail-of-two-densities.html">Part 1</a>, I discuss the Rényi differential privacy, corresponding to the Rényi divergence, a study of the moment generating functions the divergence between probability measures to derive the tail bounds.</p>
<p>Like in Part 1, I prove a composition theorem and a subsampling theorem.</p>
<p>I also attempt to reproduce a seemingly better moment bound for the Gaussian mechanism with subsampling, with one intermediate step which I am not able to prove.</p>
<p>After that I explain the Tensorflow implementation of differential privacy in its <a href="https://github.com/tensorflow/privacy/tree/master/privacy">Privacy</a> module, which focuses on the differentially private stochastic gradient descent algorithm (DP-SGD).</p>
<p>Finally I use the results from both Part 1 and Part 2 to obtain some privacy guarantees for composed subsampling queries in general, and for DP-SGD in particular. I also compare these privacy guarantees.</p>
<p><em>If you are confused by any notations, ask me or try <a href="/notations.html">this</a>.</em></p>
<h2 id="rényi-divergence-and-differential-privacy">Rényi divergence and differential privacy</h2>
<p>Recall in the proof of Gaussian mechanism privacy guarantee (Claim 8) we used the Chernoff bound for the Gaussian noise. Why not use the Chernoff bound for the divergence variable / privacy loss directly, since the latter is closer to the core subject than the noise? This leads us to the study of Rényi divergence.</p>
<p>So far we have seen several notions of divergence used in differential privacy: the max divergence which is <span class="math inline">\(\epsilon\)</span>-ind in disguise:</p>
<p><span class="math display">\[D_\infty(p || q) := \max_y \log {p(y) \over q(y)},\]</span></p>
<p>the <span class="math inline">\(\delta\)</span>-approximate max divergence that defines the <span class="math inline">\((\epsilon, \delta)\)</span>-ind:</p>
<p><span class="math display">\[D_\infty^\delta(p || q) := \max_y \log{p(y) - \delta \over q(y)},\]</span></p>
<p>and the KL-divergence which is the expectation of the divergence variable:</p>
<p><span class="math display">\[D(p || q) = \mathbb E L(p || q) = \int \log {p(y) \over q(y)} p(y) dy.\]</span></p>
<p>The Rényi divergence is an interpolation between the max divergence and the KL-divergence, defined as the log moment generating function / cumulants of the divergence variable:</p>
<p><span class="math display">\[D_\lambda(p || q) = (\lambda - 1)^{-1} \log \mathbb E \exp((\lambda - 1) L(p || q)) = (\lambda - 1)^{-1} \log \int {p(y)^\lambda \over q(y)^{\lambda - 1}} dx.\]</span></p>
<p>Indeed, when <span class="math inline">\(\lambda \to \infty\)</span> we recover the max divergence, and when <span class="math inline">\(\lambda \to 1\)</span>, by recognising <span class="math inline">\(D_\lambda\)</span> as a derivative in <span class="math inline">\(\lambda\)</span> at <span class="math inline">\(\lambda = 1\)</span>, we recover the KL divergence. In this post we only consider <span class="math inline">\(\lambda > 1\)</span>.</p>
<p>Using the Rényi divergence we may define:</p>
<p><strong>Definition (Rényi differential privacy)</strong> (Mironov 2017). An mechanism <span class="math inline">\(M\)</span> is <span class="math inline">\((\lambda, \rho)\)</span><em>-Rényi differentially private</em> (<span class="math inline">\((\lambda, \rho)\)</span>-rdp) if for all <span class="math inline">\(x\)</span> and <span class="math inline">\(x'\)</span> with distance <span class="math inline">\(1\)</span>,</p>
<p><span class="math display">\[D_\lambda(M(x) || M(x')) \le \rho.\]</span></p>
<p>For convenience we also define two related notions, <span class="math inline">\(G_\lambda (f || g)\)</span> and <span class="math inline">\(\kappa_{f, g} (t)\)</span> for <span class="math inline">\(\lambda > 1\)</span>, <span class="math inline">\(t > 0\)</span> and positive functions <span class="math inline">\(f\)</span> and <span class="math inline">\(g\)</span>:</p>
<p><span class="math display">\[G_\lambda(f || g) = \int f(y)^{\lambda} g(y)^{1 - \lambda} dy; \qquad \kappa_{f, g} (t) = \log G_{t + 1}(f || g).\]</span></p>
<p>For probability densities <span class="math inline">\(p\)</span> and <span class="math inline">\(q\)</span>, <span class="math inline">\(G_{t + 1}(p || q)\)</span> and <span class="math inline">\(\kappa_{p, q}(t)\)</span> are the <span class="math inline">\(t\)</span>th moment generating function and cumulant of the divergence variable <span class="math inline">\(L(p || q)\)</span>, and</p>
<p><span class="math display">\[D_\lambda(p || q) = (\lambda - 1)^{-1} \kappa_{p, q}(\lambda - 1).\]</span></p>
<p>In the following, whenever you see <span class="math inline">\(t\)</span>, think of it as <span class="math inline">\(\lambda - 1\)</span>.</p>
<p><strong>Example 1 (RDP for the Gaussian mechanism)</strong>. Using the scaling and translation invariance of <span class="math inline">\(L\)</span> (6.1), we have that the divergence variable for two Gaussians with the same variance is</p>
<p><span class="math display">\[L(N(\mu_1, \sigma^2 I) || N(\mu_2, \sigma^2 I)) \overset{d}{=} L(N(0, I) || N((\mu_2 - \mu_1) / \sigma, I)).\]</span></p>
<p>With this we get</p>
<p><span class="math display">\[D_\lambda(N(\mu_1, \sigma^2 I) || N(\mu_2, \sigma^2 I)) = {\lambda \|\mu_2 - \mu_1\|_2^2 \over 2 \sigma^2} = D_\lambda(N(\mu_2, \sigma^2 I) || N(\mu_1, \sigma^2 I)).\]</span></p>
<p>Again due to the scaling invariance of <span class="math inline">\(L\)</span>, we only need to consider <span class="math inline">\(f\)</span> with sensitivity <span class="math inline">\(1\)</span>, see the discussion under (6.1). The Gaussian mechanism on query <span class="math inline">\(f\)</span> is thus <span class="math inline">\((\lambda, \lambda / 2 \sigma^2)\)</span>-rdp for any <span class="math inline">\(\lambda > 1\)</span>.</p>
<p>From the example of Gaussian mechanism, we see that the relation between <span class="math inline">\(\lambda\)</span> and <span class="math inline">\(\rho\)</span> is like that between <span class="math inline">\(\epsilon\)</span> and <span class="math inline">\(\delta\)</span>. Given <span class="math inline">\(\lambda\)</span> (resp. <span class="math inline">\(\rho\)</span>) and parameters like variance of the noise and the sensitivity of the query, we can write <span class="math inline">\(\rho = \rho(\lambda)\)</span> (resp. <span class="math inline">\(\lambda = \lambda(\rho)\)</span>).</p>
<p>Using the Chernoff bound (6.7), we can bound the divergence variable:</p>
<p><span class="math display">\[\mathbb P(L(p || q) \ge \epsilon) \le {\mathbb E \exp(t L(p || q)) \over \exp(t \epsilon))} = \exp (\kappa_{p, q}(t) - \epsilon t). \qquad (7.7)\]</span></p>
<p>For a function <span class="math inline">\(f: I \to \mathbb R\)</span>, denote its Legendre transform by</p>
<p><span class="math display">\[f^*(\epsilon) := \sup_{t \in I} (\epsilon t - f(t)).\]</span></p>
<p>By taking infimum on the RHS of (7.7), we obtain</p>
<p><strong>Claim 20</strong>. Two probability densities <span class="math inline">\(p\)</span> and <span class="math inline">\(q\)</span> are <span class="math inline">\((\epsilon, \exp(-\kappa_{p, q}^*(\epsilon)))\)</span>-ind.</p>
<p>Given a mechanism <span class="math inline">\(M\)</span>, let <span class="math inline">\(\kappa_M(t)\)</span> denote an upper bound for the cumulant of its privacy loss:</p>
<p><span class="math display">\[\log \mathbb E \exp(t L(M(x) || M(x'))) \le \kappa_M(t), \qquad \forall x, x'\text{ with } d(x, x') = 1.\]</span></p>
<p>For example, we can set <span class="math inline">\(\kappa_M(t) = t \rho(t + 1)\)</span>. Using the same argument we have the following:</p>
<p><strong>Claim 21</strong>. If <span class="math inline">\(M\)</span> is <span class="math inline">\((\lambda, \rho)\)</span>-rdp, then</p>
<ol type="1">
<li>it is also <span class="math inline">\((\epsilon, \exp((\lambda - 1) (\rho - \epsilon)))\)</span>-dp for any <span class="math inline">\(\epsilon \ge \rho\)</span>.</li>
<li>Alternatively, <span class="math inline">\(M\)</span> is <span class="math inline">\((\epsilon, - \exp(\kappa_M^*(\epsilon)))\)</span>-dp for any <span class="math inline">\(\epsilon > 0\)</span>.</li>
<li>Alternatively, for any <span class="math inline">\(0 < \delta \le 1\)</span>, <span class="math inline">\(M\)</span> is <span class="math inline">\((\rho + (\lambda - 1)^{-1} \log \delta^{-1}, \delta)\)</span>-dp.</li>
</ol>
<p><strong>Example 2 (Gaussian mechanism)</strong>. We can apply the above argument to the Gaussian mechanism on query <span class="math inline">\(f\)</span> and get:</p>
<p><span class="math display">\[\delta \le \inf_{\lambda > 1} \exp((\lambda - 1) ({\lambda \over 2 \sigma^2} - \epsilon))\]</span></p>
<p>By assuming <span class="math inline">\(\sigma^2 > (2 \epsilon)^{-1}\)</span> we have that the infimum is achieved when <span class="math inline">\(\lambda = (1 + 2 \epsilon / \sigma^2) / 2\)</span> and</p>
<p><span class="math display">\[\delta \le \exp(- ((2 \sigma)^{-1} - \epsilon \sigma)^2 / 2)\]</span></p>
<p>which is the same result as (6.8), obtained using the Chernoff bound of the noise.</p>
<p>However, as we will see later, compositions will yield different results from those obtained from methods in <a href="/posts/2019-03-13-a-tail-of-two-densities.html">Part 1</a> when considering Rényi dp.</p>
<p><strong>Claim 22 (Moment Composition Theorem)</strong>. Let <span class="math inline">\(M\)</span> be the adaptive composition of <span class="math inline">\(M_{1 : k}\)</span>. Suppose for any <span class="math inline">\(y_{< i}\)</span>, <span class="math inline">\(M_i(y_{< i})\)</span> is <span class="math inline">\((\lambda, \rho)\)</span>-rdp. Then <span class="math inline">\(M\)</span> is <span class="math inline">\((\lambda, k\rho)\)</span>-rdp.</p>
<p><strong>Proof</strong>. Rather straightforward. As before let <span class="math inline">\(p_i\)</span> and <span class="math inline">\(q_i\)</span> be the conditional laws of adpative composition of <span class="math inline">\(M_{1 : i}\)</span> at <span class="math inline">\(x\)</span> and <span class="math inline">\(x'\)</span> respectively, and <span class="math inline">\(p^i\)</span> and <span class="math inline">\(q^i\)</span> be the joint laws of <span class="math inline">\(M_{1 : i}\)</span> at <span class="math inline">\(x\)</span> and <span class="math inline">\(x'\)</span> respectively. Denote</p>
<p><span class="math display">\[D_i = \mathbb E \exp((\lambda - 1)\log {p^i(\xi_{1 : i}) \over q^i(\xi_{1 : i})})\]</span></p>
<p>Then</p>
<p><span class="math display">\[\begin{aligned}
D_i &= \mathbb E\mathbb E \left(\exp((\lambda - 1)\log {p_i(\xi_i | \xi_{< i}) \over q_i(\xi_i | \xi_{< i})}) \exp((\lambda - 1)\log {p^{i - 1}(\xi_{< i}) \over q^{i - 1}(\xi_{< i})}) \big| \xi_{< i}\right) \\
&= \mathbb E \mathbb E \left(\exp((\lambda - 1)\log {p_i(\xi_i | \xi_{< i}) \over q_i(\xi_i | \xi_{< i})}) | \xi_{< i}\right) \exp\left((\lambda - 1)\log {p^{i - 1}(\xi_{< i}) \over q^{i - 1}(\xi_{< i})}\right)\\
&\le \mathbb E \exp((\lambda - 1) \rho) \exp\left((\lambda - 1)\log {p^{i - 1}(\xi_{< i}) \over q^{i - 1}(\xi_{< i})}\right)\\
&= \exp((\lambda - 1) \rho) D_{i - 1}.
\end{aligned}\]</span></p>
<p>Applying this recursively we have</p>
<p><span class="math display">\[D_k \le \exp(k(\lambda - 1) \rho),\]</span></p>
<p>and so</p>
<p><span class="math display">\[(\lambda - 1)^{-1} \log \mathbb E \exp((\lambda - 1)\log {p^k(\xi_{1 : i}) \over q^k(\xi_{1 : i})}) = (\lambda - 1)^{-1} \log D_k \le k \rho.\]</span></p>
<p>Since this holds for all <span class="math inline">\(x\)</span> and <span class="math inline">\(x'\)</span>, we are done. <span class="math inline">\(\square\)</span></p>
<p>This, together with the scaling property of the legendre transformation:</p>
<p><span class="math display">\[(k f)^*(x) = k f^*(x / k)\]</span></p>
<p>yields</p>
<p><strong>Claim 23</strong>. The <span class="math inline">\(k\)</span>-fold adaptive composition of <span class="math inline">\((\lambda, \rho(\lambda))\)</span>-rdp mechanisms is <span class="math inline">\((\epsilon, \exp(- k \kappa^*(\epsilon / k)))\)</span>-dp, where <span class="math inline">\(\kappa(t) := t \rho(t + 1)\)</span>.</p>
<p><strong>Example 3 (Gaussian mechanism)</strong>. We can apply the above claim to Gaussian mechanism. Again, without loss of generality we assume <span class="math inline">\(S_f = 1\)</span>. But let us do it manually to get the same results. If we apply the Moment Composition Theorem to the an adaptive composition of Gaussian mechanisms on the same query, then since each <span class="math inline">\(M_i\)</span> is <span class="math inline">\((\lambda, (2 \sigma^2)^{-1} \lambda)\)</span>-rdp, the composition <span class="math inline">\(M\)</span> is <span class="math inline">\((\lambda, (2 \sigma^2)^{-1} k \lambda)\)</span>-rdp. Processing this using the Chernoff bound as in the previous example, we have</p>
<p><span class="math display">\[\delta = \exp(- ((2 \sigma / \sqrt k)^{-1} - \epsilon \sigma / \sqrt k)^2 / 2),\]</span></p>
<p>Substituting <span class="math inline">\(\sigma\)</span> with <span class="math inline">\(\sigma / \sqrt k\)</span> in (6.81), we conclude that if</p>
<p><span class="math display">\[\sigma > \sqrt k \left(\epsilon^{-1} \sqrt{2 \log \delta^{-1}} + (2 \epsilon)^{- {1 \over 2}}\right)\]</span></p>
<p>then the composition <span class="math inline">\(M\)</span> is <span class="math inline">\((\epsilon, \delta)\)</span>-dp.</p>
<p>As we will see in the discussions at the end of this post, this result is different from (and probably better than) the one obtained by using the Advanced Composition Theorem (Claim 18).</p>
<p>We also have a subsampling theorem for the Rényi dp.</p>
<p><strong>Claim 24</strong>. Fix <span class="math inline">\(r \in [0, 1]\)</span>. Let <span class="math inline">\(m \le n\)</span> be two nonnegative integers with <span class="math inline">\(m = r n\)</span>. Let <span class="math inline">\(N\)</span> be a <span class="math inline">\((\lambda, \rho)\)</span>-rdp machanism on <span class="math inline">\(X^m\)</span>. Let <span class="math inline">\(\mathcal I := \{J \subset [n]: |J| = m\}\)</span> be the set of subsets of <span class="math inline">\([n]\)</span> of size <span class="math inline">\(m\)</span>. Define mechanism <span class="math inline">\(M\)</span> on <span class="math inline">\(X^n\)</span> by</p>
<p><span class="math display">\[M(x) = N(x_\gamma)\]</span></p>
<p>where <span class="math inline">\(\gamma\)</span> is sampled uniformly from <span class="math inline">\(\mathcal I\)</span>. Then <span class="math inline">\(M\)</span> is <span class="math inline">\((\lambda, {1 \over \lambda - 1} \log (1 + r(e^{(\lambda - 1) \rho} - 1)))\)</span>-rdp.</p>
<p>To prove Claim 24, we need a useful lemma:</p>
<p><strong>Claim 25</strong>. Let <span class="math inline">\(p_{1 : n}\)</span> and <span class="math inline">\(q_{1 : n}\)</span> be nonnegative integers, and <span class="math inline">\(\lambda > 1\)</span>. Then</p>
<p><span class="math display">\[{(\sum p_i)^\lambda \over (\sum q_i)^{\lambda - 1}} \le \sum_i {p_i^\lambda \over q_i^{\lambda - 1}}. \qquad (8)\]</span></p>
<p><strong>Proof</strong>. Let</p>
<p><span class="math display">\[r(i) := p_i / P, \qquad u(i) := q_i / Q\]</span></p>
<p>where</p>
<p><span class="math display">\[P := \sum p_i, \qquad Q := \sum q_i\]</span></p>
<p>then <span class="math inline">\(r\)</span> and <span class="math inline">\(u\)</span> are probability mass functions. Plugging in <span class="math inline">\(p_i = r(i) P\)</span> and <span class="math inline">\(q_i = u(i) Q\)</span> into the objective (8), it suffices to show</p>
<p><span class="math display">\[1 \le \sum_i {r(i)^\lambda \over u(i)^{\lambda - 1}} = \mathbb E_{\xi \sim u} \left({r(\xi) \over u(\xi)}\right)^\lambda\]</span></p>
<p>This is true due to Jensen's Inequality:</p>
<p><span class="math display">\[\mathbb E_{\xi \sim u} \left({r(\xi) \over u(\xi)}\right)^\lambda \ge \left(\mathbb E_{\xi \sim u} {r(\xi) \over u(\xi)} \right)^\lambda = 1.\]</span></p>
<p><span class="math inline">\(\square\)</span></p>
<p><strong>Proof of Claim 24</strong>. Define <span class="math inline">\(\mathcal I\)</span> as before.</p>
<p>Let <span class="math inline">\(p\)</span> and <span class="math inline">\(q\)</span> be the laws of <span class="math inline">\(M(x)\)</span> and <span class="math inline">\(M(x')\)</span> respectively. For any <span class="math inline">\(I \in \mathcal I\)</span>, let <span class="math inline">\(p_I\)</span> and <span class="math inline">\(q_I\)</span> be the laws of <span class="math inline">\(N(x_I)\)</span> and <span class="math inline">\(N(x_I')\)</span> respectively. Then we have</p>
<p><span class="math display">\[\begin{aligned}
p(y) &= n^{-1} \sum_{I \in \mathcal I} p_I(y) \\
q(y) &= n^{-1} \sum_{I \in \mathcal I} q_I(y),
\end{aligned}\]</span></p>
<p>where <span class="math inline">\(n = |\mathcal I|\)</span>.</p>
<p>The MGF of <span class="math inline">\(L(p || q)\)</span> is thus</p>
<p><span class="math display">\[\mathbb E((\lambda - 1) L(p || q)) = n^{-1} \int {(\sum_I p_I(y))^\lambda \over (\sum_I q_I(y))^{\lambda - 1}} dy \le n^{-1} \sum_I \int {p_I(y)^\lambda \over q_I(y)^{\lambda - 1}} dy \qquad (9)\]</span></p>
<p>where in the last step we used Claim 25. As in the proof of Claim 19, we divide <span class="math inline">\(\mathcal I\)</span> into disjoint sets <span class="math inline">\(\mathcal I_\in\)</span> and <span class="math inline">\(\mathcal I_\notin\)</span>. Furthermore we denote by <span class="math inline">\(n_\in\)</span> and <span class="math inline">\(n_\notin\)</span> their cardinalities. Then the right hand side of (9) becomes</p>
<p><span class="math display">\[n^{-1} \sum_{I \in \mathcal I_\in} \int {p_I(y)^\lambda \over q_I(y)^{\lambda - 1}} dy + n^{-1} \sum_{I \in \mathcal I_\notin} \int {p_I(y)^\lambda \over q_I(y)^{\lambda - 1}} dy\]</span></p>
<p>The summands in the first are the MGF of <span class="math inline">\(L(p_I || q_I)\)</span>, and the summands in the second term are <span class="math inline">\(1\)</span>, so</p>
<p><span class="math display">\[\begin{aligned}
\mathbb E((\lambda - 1) L(p || q)) &\le n^{-1} \sum_{I \in \mathcal I_\in} \mathbb E \exp((\lambda - 1) L(p_I || q_I)) + (1 - r) \\
&\le n^{-1} \sum_{I \in \mathcal I_\in} \exp((\lambda - 1) D_\lambda(p_I || q_I)) + (1 - r) \\
&\le r \exp((\lambda - 1) \rho) + (1 - r).
\end{aligned}\]</span></p>
<p>Taking log and dividing by <span class="math inline">\((\lambda - 1)\)</span> on both sides we have</p>
<p><span class="math display">\[D_\lambda(p || q) \le (\lambda - 1)^{-1} \log (1 + r(\exp((\lambda - 1) \rho) - 1)).\]</span></p>
<p><span class="math inline">\(\square\)</span></p>
<p>As before, we can rewrite the conclusion of Lemma 6 using <span class="math inline">\(1 + z \le e^z\)</span> and obtain <span class="math inline">\((\lambda, (\lambda - 1)^{-1} r (e^{(\lambda - 1) \rho} - 1))\)</span>-rdp, which further gives <span class="math inline">\((\lambda, \alpha^{-1} (e^\alpha - 1) r \rho)\)</span>-rdp (or <span class="math inline">\((\lambda, O(r \rho))\)</span>-rdp) if <span class="math inline">\((\lambda - 1) \rho < \alpha\)</span> for some <span class="math inline">\(\alpha\)</span>.</p>
<p>It is not hard to see that the subsampling theorem in moment method, even though similar to the results of that in the usual method, does not help due to lack of an analogue of advanced composition theorem of the moments.</p>
<p><strong>Example 4 (Gaussian mechanism)</strong>. Applying the moment subsampling theorem to the Gaussian mechanism, we obtain <span class="math inline">\((\lambda, O(r \lambda / \sigma^2))\)</span>-rdp for a subsampled Gaussian mechanism with rate <span class="math inline">\(r\)</span>. Abadi-Chu-Goodfellow-McMahan-Mironov-Talwar-Zhang 2016 (ACGMMTZ16 in the following), however, gains an extra <span class="math inline">\(r\)</span> in the bound given certain assumptions.</p>
<h2 id="acgmmtz16">ACGMMTZ16</h2>
<p>What follows is my understanding of this result. I call it a conjecture because there is a gap which I am not able to reproduce their proof or prove it myself. This does not mean the result is false. On the contrary, I am inclined to believe it is true.</p>
<p><strong>Claim 26</strong>. Assuming Conjecture 1 (see below) is true. For a subsampled Gaussian mechanism with ratio <span class="math inline">\(r\)</span>, if <span class="math inline">\(r = O(\sigma^{-1})\)</span> and <span class="math inline">\(\lambda = O(\sigma^2)\)</span>, then we have <span class="math inline">\((\lambda, O(r^2 \lambda / \sigma^2))\)</span>-rdp.</p>
<p>Wait, why is there a conjecture? Well, I have tried but not been able to prove the following, which is a hidden assumption in the original proof:</p>
<p><strong>Conjecture 1</strong>. Let <span class="math inline">\(p_i\)</span>, <span class="math inline">\(q_i\)</span>, <span class="math inline">\(\mu_i\)</span>, <span class="math inline">\(\nu_i\)</span> be probability densities on the same space for <span class="math inline">\(i = 1 : n\)</span>. If <span class="math inline">\(D_\lambda(p_i || q_i) \le D_\lambda(\mu_i || \nu_i)\)</span> for all <span class="math inline">\(i\)</span>, then</p>
<p><span class="math display">\[D_\lambda(n^{-1} \sum_i p_i || n^{-1} \sum_i q_i) \le D_\lambda(n^{-1} \sum_i \mu_i || n^{-1} \sum_i \nu_i).\]</span></p>
<p>Basically, it is saying "if for each <span class="math inline">\(i\)</span>, <span class="math inline">\(p_i\)</span> and <span class="math inline">\(q_i\)</span> are closer to each other than <span class="math inline">\(\mu_i\)</span> and <span class="math inline">\(\nu_i\)</span>, then so are their averages over <span class="math inline">\(i\)</span>". So it is heuristically reasonable.</p>
<p>This conjecture is equivalent to its special case when <span class="math inline">\(n = 2\)</span> by an induction argument (replacing one pair of densities at a time).</p>
<p>Recall the definition of <span class="math inline">\(G_\lambda\)</span> under the definition of Rényi differential privacy. The following Claim will be useful.</p>
<p><strong>Claim 27</strong>. Let <span class="math inline">\(\lambda\)</span> be a positive integer, then</p>
<p><span class="math display">\[G_\lambda(r p + (1 - r) q || q) = \sum_{k = 1 : \lambda} {\lambda \choose k} r^k (1 - r)^{\lambda - k} G_k(p || q).\]</span></p>
<p><strong>Proof</strong>. Quite straightforward, by expanding the numerator <span class="math inline">\((r p + (1 - r) q)^\lambda\)</span> using binomial expansion. <span class="math inline">\(\square\)</span></p>
<p><strong>Proof of Claim 26</strong>. Let <span class="math inline">\(M\)</span> be the Gaussian mechanism with subsampling rate <span class="math inline">\(r\)</span>, and <span class="math inline">\(p\)</span> and <span class="math inline">\(q\)</span> be the laws of <span class="math inline">\(M(x)\)</span> and <span class="math inline">\(M(x')\)</span> respectively, where <span class="math inline">\(d(x, x') = 1\)</span>. I will break the proof into two parts:</p>
<ol type="1">
<li>The MGF of the privacy loss <span class="math inline">\(L(p || q)\)</span> is bounded by that of <span class="math inline">\(L(r \mu_1 + (1 - r) \mu_0 || \mu_0)\)</span> where <span class="math inline">\(\mu_i = N(i, \sigma^2)\)</span>.</li>
<li>If <span class="math inline">\(r \le c_1 \sigma^{-1}\)</span> and <span class="math inline">\(\lambda \le c_2 \sigma^2\)</span>, then there exists <span class="math inline">\(C = C(c_1, c_2)\)</span> such that <span class="math inline">\(G_\lambda (r \mu_1 + (1 - r) \mu_0 || \mu_0) \le C\)</span> (since <span class="math inline">\(O(r^2 \lambda^2 / \sigma^2) = O(1)\)</span>).</li>
</ol>
<p><strong>Remark in the proof</strong>. Note that the choice of <span class="math inline">\(c_1\)</span>, <span class="math inline">\(c_2\)</span> and the function <span class="math inline">\(C(c_1, c_2)\)</span> are important to the practicality and usefulness of Conjecture 0.</p>
<p>Part 1 can be derived using Conjecture 1. We use the notations <span class="math inline">\(p_I\)</span> and <span class="math inline">\(q_I\)</span> to be <span class="math inline">\(q\)</span> and <span class="math inline">\(p\)</span> conditioned on the subsampling index <span class="math inline">\(I\)</span>, just like in the proof of the subsampling theorems (Claim 19 and 24). Then</p>
<p><span class="math display">\[D_\lambda(q_I || p_I) = D_\lambda(p_I || q_I)
\begin{cases}
\le D_\lambda(\mu_0 || \mu_1) = D_\lambda(\mu_1 || \mu_0), & I \in \mathcal I_\in\\
= D_\lambda(\mu_0 || \mu_0) = D_\lambda(\mu_1 || \mu_1) = 0 & I \in \mathcal I_\notin
\end{cases}\]</span></p>
<p>Since <span class="math inline">\(p = |\mathcal I|^{-1} \sum_{I \in \mathcal I} p_I\)</span> and <span class="math inline">\(q = |\mathcal I|^{-1} \sum_{I \in \mathcal I} q_I\)</span> and <span class="math inline">\(|\mathcal I_\in| = r |\mathcal I|\)</span>, by Conjecture 1, we have Part 1.</p>
<p><strong>Remark in the proof</strong>. As we can see here, instead of trying to prove Conjecture 1, it suffices to prove a weaker version of it, by specialising on mixture of Gaussians, in order to have a Claim 26 without any conjectural assumptions. I have in fact posted the Conjecture on <a href="https://math.stackexchange.com/questions/3147963/an-inequality-related-to-the-renyi-divergence">Stackexchange</a>.</p>
<p>Now let us verify Part 2.</p>
<p>Using Claim 27 and Example 1, we have</p>
<p><span class="math display">\[\begin{aligned}
G_\lambda(r \mu_1 + (1 - r) \mu_0 || \mu_0)) &= \sum_{j = 0 : \lambda} {\lambda \choose j} r^j (1 - r)^{\lambda - j} G_j(\mu_1 || \mu_0)\\
&=\sum_{j = 0 : \lambda} {\lambda \choose j} r^j (1 - r)^{\lambda - j} \exp(j (j - 1) / 2 \sigma^2). \qquad (9.5)
\end{aligned}\]</span></p>
<p>Denote by <span class="math inline">\(n = \lceil \sigma^2 \rceil\)</span>. It suffices to show</p>
<p><span class="math display">\[\sum_{j = 0 : n} {n \choose j} (c_1 n^{- 1 / 2})^j (1 - c_1 n^{- 1 / 2})^{n - j} \exp(c_2 j (j - 1) / 2 n) \le C\]</span></p>
<p>Note that we can discard the linear term <span class="math inline">\(- c_2 j / \sigma^2\)</span> in the exponential term since we want to bound the sum from above.</p>
<p>We examine the asymptotics of this sum when <span class="math inline">\(n\)</span> is large, and treat the sum as an approximation to an integration of a function <span class="math inline">\(\phi: [0, 1] \to \mathbb R\)</span>. For <span class="math inline">\(j = x n\)</span>, where <span class="math inline">\(x \in (0, 1)\)</span>, <span class="math inline">\(\phi\)</span> is thus defined as (note we multiply the summand with <span class="math inline">\(n\)</span> to compensate the uniform measure on <span class="math inline">\(1, ..., n\)</span>:</p>
<p><span class="math display">\[\begin{aligned}
\phi_n(x) &:= n {n \choose j} (c_1 n^{- 1 / 2})^j (1 - c_1 n^{- 1 / 2})^{n - j} \exp(c_2 j^2 / 2 n) \\
&= n {n \choose x n} (c_1 n^{- 1 / 2})^{x n} (1 - c_1 n^{- 1 / 2})^{(1 - x) n} \exp(c_2 x^2 n / 2)
\end{aligned}\]</span></p>
<p>Using Stirling's approximation</p>
<p><span class="math display">\[n! \approx \sqrt{2 \pi n} n^n e^{- n},\]</span></p>
<p>we can approach the binomial coefficient:</p>
<p><span class="math display">\[{n \choose x n} \approx (\sqrt{2 \pi x (1 - x)} x^{x n} (1 - x)^{(1 - x) n})^{-1}.\]</span></p>
<p>We also approximate</p>
<p><span class="math display">\[(1 - c_1 n^{- 1 / 2})^{(1 - x) n} \approx \exp(- c_1 \sqrt{n} (1 - x)).\]</span></p>
<p>With these we have</p>
<p><span class="math display">\[\phi_n(x) \approx {1 \over \sqrt{2 \pi x (1 - x)}} \exp\left(- {1 \over 2} x n \log n + (x \log c_1 - x \log x - (1 - x) \log (1 - x) + {1 \over 2} c_2 x^2) n + {1 \over 2} \log n\right).\]</span></p>
<p>This vanishes as <span class="math inline">\(n \to \infty\)</span>, and since <span class="math inline">\(\phi_n(x)\)</span> is bounded above by the integrable function <span class="math inline">\({1 \over \sqrt{2 \pi x (1 - x)}}\)</span> (c.f. the arcsine law), and below by <span class="math inline">\(0\)</span>, we may invoke the dominant convergence theorem and exchange the integral with the limit and get</p>
<p><span class="math display">\[\begin{aligned}
\lim_{n \to \infty} &G_n (r \mu_1 + (1 - r) \mu_0 || \mu_0)) \\
&\le \lim_{n \to \infty} \int \phi_n(x) dx = \int \lim_{n \to \infty} \phi_n(x) dx = 0.
\end{aligned}\]</span></p>
<p>Thus we have that the generating function of the divergence variable <span class="math inline">\(L(r \mu_1 + (1 - r) \mu_0 || \mu_0)\)</span> is bounded.</p>
<p>Can this be true for better orders</p>
<p><span class="math display">\[r \le c_1 \sigma^{- d_r},\qquad \lambda \le c_2 \sigma^{d_\lambda}\]</span></p>
<p>for some <span class="math inline">\(d_r \in (0, 1]\)</span> and <span class="math inline">\(d_\lambda \in [2, \infty)\)</span>? If we follow the same approximation using these exponents, then letting <span class="math inline">\(n = c_2 \sigma^{d_\lambda}\)</span>,</p>
<p><span class="math display">\[\begin{aligned}
{n \choose j} &r^j (1 - r)^{n - j} G_j(\mu_0 || \mu_1) \le \phi_n(x) \\
&\approx {1 \over \sqrt{2 \pi x (1 - x)}} \exp\left({1 \over 2} c_2^{2 \over d_\lambda} x^2 n^{2 - {2 \over d_\lambda}} - {d_r \over 2} x n \log n + (x \log c_1 - x \log x - (1 - x) \log (1 - x)) n + {1 \over 2} \log n\right).
\end{aligned}\]</span></p>
<p>So we see that to keep the divergence moments bounded it is possible to have any <span class="math inline">\(r = O(\sigma^{- d_r})\)</span> for <span class="math inline">\(d_r \in (0, 1)\)</span>, but relaxing <span class="math inline">\(\lambda\)</span> may not be safe.</p>
<p>If we relax <span class="math inline">\(r\)</span>, then we get</p>
<p><span class="math display">\[G_\lambda(r \mu_1 + (1 - r) \mu_0 || \mu_0) = O(r^{2 / d_r} \lambda^2 \sigma^{-2}) = O(1).\]</span></p>
<p>Note that now the constant <span class="math inline">\(C\)</span> depends on <span class="math inline">\(d_r\)</span> as well. Numerical experiments seem to suggest that <span class="math inline">\(C\)</span> can increase quite rapidly as <span class="math inline">\(d_r\)</span> decreases from <span class="math inline">\(1\)</span>. <span class="math inline">\(\square\)</span></p>
<p>In the following for consistency we retain <span class="math inline">\(k\)</span> as the number of epochs, and use <span class="math inline">\(T := k / r\)</span> to denote the number of compositions / steps / minibatches. With Conjecture 0 we have:</p>
<p><strong>Claim 28</strong>. Assuming Conjecture 1 is true. Let <span class="math inline">\(\epsilon, c_1, c_2 > 0\)</span>, <span class="math inline">\(r \le c_1 \sigma^{-1}\)</span>, <span class="math inline">\(T = {c_2 \over 2 C(c_1, c_2)} \epsilon \sigma^2\)</span>. then the DP-SGD with subsampling rate <span class="math inline">\(r\)</span>, and <span class="math inline">\(T\)</span> steps is <span class="math inline">\((\epsilon, \delta)\)</span>-dp for</p>
<p><span class="math display">\[\delta = \exp(- {1 \over 2} c_2 \sigma^2 \epsilon).\]</span></p>
<p>In other words, for</p>
<p><span class="math display">\[\sigma \ge \sqrt{2 c_2^{-1}} \epsilon^{- {1 \over 2}} \sqrt{\log \delta^{-1}},\]</span></p>
<p>we can achieve <span class="math inline">\((\epsilon, \delta)\)</span>-dp.</p>
<p><strong>Proof</strong>. By Claim 26 and the Moment Composition Theorem (Claim 22), for <span class="math inline">\(\lambda = c_2 \sigma^2\)</span>, substituting <span class="math inline">\(T = {c_2 \over 2 C(c_1, c_2)} \epsilon \sigma^2\)</span>, we have</p>
<p><span class="math display">\[\mathbb P(L(p || q) \ge \epsilon) \le \exp(k C(c_1, c_2) - \lambda \epsilon) = \exp\left(- {1 \over 2} c_2 \sigma^2 \epsilon\right).\]</span></p>
<p><span class="math inline">\(\square\)</span></p>
<p><strong>Remark</strong>. Claim 28 is my understanding / version of Theorem 1 in [ACGMMTZ16], by using the same proof technique. Here I quote the original version of theorem with notions and notations altered for consistency with this post:</p>
<blockquote>
<p>There exists constants <span class="math inline">\(c_1', c_2' > 0\)</span> so that for any <span class="math inline">\(\epsilon < c_1' r^2 T\)</span>, DP-SGD is <span class="math inline">\((\epsilon, \delta)\)</span>-differentially private for any <span class="math inline">\(\delta > 0\)</span> if we choose</p>
</blockquote>
<p><span class="math display">\[\sigma \ge c_2' {r \sqrt{T \log (1 / \delta)} \over \epsilon}. \qquad (10)\]</span></p>
<p>I am however unable to reproduce this version, assuming Conjecture 0 is true, for the following reasons:</p>
<ol type="1">
<li><p>In the proof in the paper, we have <span class="math inline">\(\epsilon = c_1' r^2 T\)</span> instead of "less than" in the statement of the Theorem. If we change it to <span class="math inline">\(\epsilon < c_1' r^2 T\)</span> then the direction of the inequality becomes opposite to the direction we want to prove: <span class="math display">\[\exp(k C(c_1, c_2) - \lambda \epsilon) \ge ...\]</span></p></li>
<li><p>The implicit condition <span class="math inline">\(r = O(\sigma^{-1})\)</span> of Conjecture 0 whose result is used in the proof of this theorem is not mentioned in the statement of the proof. The implication is that (10) becomes an ill-formed condition as the right hand side also depends on <span class="math inline">\(\sigma\)</span>.</p></li>
</ol>
<h2 id="tensorflow-implementation">Tensorflow implementation</h2>
<p>The DP-SGD is implemented in <a href="https://github.com/tensorflow/privacy">TensorFlow Privacy</a>. In the following I discuss the package in the current state (2019-03-11). It is divided into two parts: <a href="https://github.com/tensorflow/privacy/tree/master/privacy/optimizers"><code>optimizers</code></a> which implements the actual differentially private algorithms, and <a href="https://github.com/tensorflow/privacy/tree/master/privacy/analysis"><code>analysis</code></a> which computes the privacy guarantee.</p>
<p>The <code>analysis</code> part implements a privacy ledger that "keeps a record of all queries executed over a given dataset for the purpose of computing privacy guarantees". On the other hand, all the computation is done in <a href="https://github.com/tensorflow/privacy/blob/7e2d796bdee9b60dce21a82a397eefda35b0ac10/privacy/analysis/rdp_accountant.py"><code>rdp_accountant.py</code></a>. At this moment, <code>rdp_accountant.py</code> only implements the computation of the privacy guarantees for DP-SGD with Gaussian mechanism. In the following I will briefly explain the code in this file.</p>
<p>Some notational correspondences: their <code>alpha</code> is our <span class="math inline">\(\lambda\)</span>, their <code>q</code> is our <span class="math inline">\(r\)</span>, their <code>A_alpha</code> (in the comments) is our <span class="math inline">\(\kappa_{r N(1, \sigma^2) + (1 - r) N(0, \sigma^2)} (\lambda - 1)\)</span>, at least when <span class="math inline">\(\lambda\)</span> is an integer.</p>
<ul>
<li>The function <code>_compute_log_a</code> presumably computes the cumulants <span class="math inline">\(\kappa_{r N(1, \sigma^2) + (1 - r) N(0, \sigma^2), N(0, \sigma^2)}(\lambda - 1)\)</span>. It calls <code>_compute_log_a_int</code> or <code>_compute_log_a_frac</code> depending on whether <span class="math inline">\(\lambda\)</span> is an integer.</li>
<li>The function <code>_compute_log_a_int</code> computes the cumulant using (9.5).</li>
<li>When <span class="math inline">\(\lambda\)</span> is not an integer, we can't use (9.5). I have yet to decode how <code>_compute_log_a_frac</code> computes the cumulant (or an upper bound of it) in this case</li>
<li>The function <code>_compute_delta</code> computes <span class="math inline">\(\delta\)</span>s for a list of <span class="math inline">\(\lambda\)</span>s and <span class="math inline">\(\kappa\)</span>s using Item 1 of Claim 25 and return the smallest one, and the function <code>_compute_epsilon</code> computes epsilon uses Item 3 in Claim 25 in the same way.</li>
</ul>
<p>In <code>optimizers</code>, among other things, the DP-SGD with Gaussian mechanism is implemented in <code>dp_optimizer.py</code> and <code>gaussian_query.py</code>. See the definition of <code>DPGradientDescentGaussianOptimizer</code> in <code>dp_optimizer.py</code> and trace the calls therein.</p>
<p>At this moment, the privacy guarantee computation part and the optimizer part are separated, with <code>rdp_accountant.py</code> called in <code>compute_dp_sgd_privacy.py</code> with user-supplied parameters. I think this is due to the lack of implementation in <code>rdp_accountant.py</code> of any non-DPSGD-with-Gaussian privacy guarantee computation. There is already <a href="https://github.com/tensorflow/privacy/issues/23">an issue on this</a>, so hopefully it won't be long before the privacy guarantees can be automatically computed given a DP-SGD instance.</p>
<h2 id="comparison-among-different-methods">Comparison among different methods</h2>
<p>So far we have seen three routes to compute the privacy guarantees for DP-SGD with the Gaussian mechanism:</p>
<ol type="1">
<li>Claim 9 (single Gaussian mechanism privacy guarantee) -> Claim 19 (Subsampling theorem) -> Claim 18 (Advanced Adaptive Composition Theorem)</li>
<li>Example 1 (RDP for the Gaussian mechanism) -> Claim 22 (Moment Composition Theorem) -> Example 3 (Moment composition applied to the Gaussian mechanism)</li>
<li>Claim 26 (RDP for Gaussian mechanism with specific magnitudes for subsampling rate) -> Claim 28 (Moment Composition Theorem and translation to conventional DP)</li>
</ol>
<p>Which one is the best?</p>
<p>To make fair comparison, we may use one parameter as the metric and set all others to be the same. For example, we can</p>
<ol type="1">
<li>Given the same <span class="math inline">\(\epsilon\)</span>, <span class="math inline">\(r\)</span> (in Route 1 and 3), <span class="math inline">\(k\)</span>, <span class="math inline">\(\sigma\)</span>, compare the <span class="math inline">\(\delta\)</span>s</li>
<li>Given the same <span class="math inline">\(\epsilon\)</span>, <span class="math inline">\(r\)</span> (in Route 1 and 3), <span class="math inline">\(k\)</span>, <span class="math inline">\(\delta\)</span>, compare the <span class="math inline">\(\sigma\)</span>s</li>
<li>Given the same <span class="math inline">\(\delta\)</span>, <span class="math inline">\(r\)</span> (in Route 1 and 3), <span class="math inline">\(k\)</span>, <span class="math inline">\(\sigma\)</span>, compare the <span class="math inline">\(\epsilon\)</span>s.</li>
</ol>
<p>I find that the first one, where <span class="math inline">\(\delta\)</span> is used as a metric, the best. This is because we have the tightest bounds and the cleanest formula when comparing the <span class="math inline">\(\delta\)</span>. For example, the Azuma and Chernoff bounds are both expressed as a bound for <span class="math inline">\(\delta\)</span>. On the other hand, the inversion of these bounds either requires a cost in the tightness (Claim 9, bounds on <span class="math inline">\(\sigma\)</span>) or in the complexity of the formula (Claim 16 Advanced Adaptive Composition Theorem, bounds on <span class="math inline">\(\epsilon\)</span>).</p>
<p>So if we use <span class="math inline">\(\sigma\)</span> or <span class="math inline">\(\epsilon\)</span> as a metric, either we get a less fair comparison, or have to use a much more complicated formula as the bounds.</p>
<p>Let us first compare Route 1 and Route 2 without specialising to the Gaussian mechanism.</p>
<p><strong>Disclaimer</strong>. What follows is a bit messy and has not been reviewed by anyone.</p>
<p>Suppose each mechanism <span class="math inline">\(N_i\)</span> satisfies <span class="math inline">\((\epsilon', \delta(\epsilon'))\)</span>-dp. Let <span class="math inline">\(\tilde \epsilon := \log (1 + r (e^{\epsilon'} - 1))\)</span>, then we have the subsampled mechanism <span class="math inline">\(M_i(x) = N_i(x_\gamma)\)</span> is <span class="math inline">\((\tilde \epsilon, r \tilde \delta(\tilde \epsilon))\)</span>-dp, where</p>
<p><span class="math display">\[\tilde \delta(\tilde \epsilon) = \delta(\log (r^{-1} (\exp(\tilde \epsilon) - 1) + 1))\]</span></p>
<p>Using the Azuma bound in the proof of Advanced Adaptive Composition Theorem (6.99):</p>
<p><span class="math display">\[\mathbb P(L(p^k || q^k) \ge \epsilon) \le \exp(- {(\epsilon - r^{-1} k a(\tilde\epsilon))^2 \over 2 r^{-1} k (\tilde\epsilon + a(\tilde\epsilon))^2}).\]</span></p>
<p>So we have the final bound for Route 1:</p>
<p><span class="math display">\[\delta_1(\epsilon) = \min_{\tilde \epsilon: \epsilon > r^{-1} k a(\tilde \epsilon)} \exp(- {(\epsilon - r^{-1} k a(\tilde\epsilon))^2 \over 2 r^{-1} k (\tilde\epsilon + a(\tilde\epsilon))^2}) + k \tilde \delta(\tilde \epsilon).\]</span></p>
<p>As for Route 2, since we do not gain anything from subsampling in RDP, we do not subsample at all.</p>
<p>By Claim 23, we have the bound for Route 2:</p>
<p><span class="math display">\[\delta_2(\epsilon) = \exp(- k \kappa^* (\epsilon / k)).\]</span></p>
<p>On one hand, one can compare <span class="math inline">\(\delta_1\)</span> and <span class="math inline">\(\delta_2\)</span> with numerical experiments. On the other hand, if we further specify <span class="math inline">\(\delta(\epsilon')\)</span> in Route 1 as the Chernoff bound for the cumulants of divergence variable, i.e.</p>
<p><span class="math display">\[\delta(\epsilon') = \exp(- \kappa^* (\epsilon')),\]</span></p>
<p>we have</p>
<p><span class="math display">\[\delta_1 (\epsilon) = \min_{\tilde \epsilon: a(\tilde \epsilon) < r k^{-1} \epsilon} \exp(- {(\epsilon - r^{-1} k a(\tilde\epsilon))^2 \over 2 r^{-1} k (\tilde\epsilon + a(\tilde\epsilon))^2}) + k \exp(- \kappa^* (b(\tilde\epsilon))),\]</span></p>
<p>where</p>
<p><span class="math display">\[b(\tilde \epsilon) := \log (r^{-1} (\exp(\tilde \epsilon) - 1) + 1) \le r^{-1} \tilde\epsilon.\]</span></p>
<p>We note that since <span class="math inline">\(a(\tilde \epsilon) = \tilde\epsilon(e^{\tilde \epsilon} - 1) 1_{\tilde\epsilon < \log 2} + \tilde\epsilon 1_{\tilde\epsilon \ge \log 2}\)</span>, we may compare the two cases separately.</p>
<p>Note that we have <span class="math inline">\(\kappa^*\)</span> is a monotonously increasing function, therefore</p>
<p><span class="math display">\[\kappa^* (b(\tilde\epsilon)) \le \kappa^*(r^{-1} \tilde\epsilon).\]</span></p>
<p>So for <span class="math inline">\(\tilde \epsilon \ge \log 2\)</span>, we have</p>
<p><span class="math display">\[k \exp(- \kappa^*(b(\tilde\epsilon))) \ge k \exp(- \kappa^*(r^{-1} \tilde \epsilon)) \ge k \exp(- \kappa^*(k^{-1} \epsilon)) \ge \delta_2(\epsilon).\]</span></p>
<p>For <span class="math inline">\(\tilde\epsilon < \log 2\)</span>, it is harder to compare, as now</p>
<p><span class="math display">\[k \exp(- \kappa^*(b(\tilde\epsilon))) \ge k \exp(- \kappa^*(\epsilon / \sqrt{r k})).\]</span></p>
<p>It is tempting to believe that this should also be greater than <span class="math inline">\(\delta_2(\epsilon)\)</span>. But I can not say for sure. At least in the special case of Gaussian, we have</p>
<p><span class="math display">\[k \exp(- \kappa^*(\epsilon / \sqrt{r k})) = k \exp(- (\sigma \sqrt{\epsilon / k r} - (2 \sigma)^{-1})^2) \ge \exp(- k ({\sigma \epsilon \over k} - (2 \sigma)^{-1})^2) = \delta_2(\epsilon)\]</span></p>
<p>when <span class="math inline">\(\epsilon\)</span> is sufficiently small. However we still need to consider the case where <span class="math inline">\(\epsilon\)</span> is not too small. But overall it seems most likely Route 2 is superior than Route 1.</p>
<p>So let us compare Route 2 with Route 3:</p>
<p>Given the condition to obtain the Chernoff bound</p>
<p><span class="math display">\[{\sigma \epsilon \over k} > (2 \sigma)^{-1}\]</span></p>
<p>we have</p>
<p><span class="math display">\[\delta_2(\epsilon) > \exp(- k (\sigma \epsilon / k)^2) = \exp(- \sigma^2 \epsilon^2 / k).\]</span></p>
<p>For this to achieve the same bound</p>
<p><span class="math display">\[\delta_3(\epsilon) = \exp\left(- {1 \over 2} c_2 \sigma^2 \epsilon\right)\]</span></p>
<p>we need <span class="math inline">\(k < {2 \epsilon \over c_2}\)</span>. This is only possible if <span class="math inline">\(c_2\)</span> is small or <span class="math inline">\(\epsilon\)</span> is large, since <span class="math inline">\(k\)</span> is a positive integer.</p>
<p>So taking at face value, Route 3 seems to achieve the best results. However, it also has some similar implicit conditions that need to be satisfied: First <span class="math inline">\(T\)</span> needs to be at least <span class="math inline">\(1\)</span>, meaning</p>
<p><span class="math display">\[{c_2 \over C(c_1, c_2)} \epsilon \sigma^2 \ge 1.\]</span></p>
<p>Second, <span class="math inline">\(k\)</span> needs to be at least <span class="math inline">\(1\)</span> as well, i.e.</p>
<p><span class="math display">\[k = r T \ge {c_1 c_2 \over C(c_1, c_2)} \epsilon \sigma \ge 1.\]</span></p>
<p>Both conditions rely on the magnitudes of <span class="math inline">\(\epsilon\)</span>, <span class="math inline">\(\sigma\)</span>, <span class="math inline">\(c_1\)</span>, <span class="math inline">\(c_2\)</span>, and the rate of growth of <span class="math inline">\(C(c_1, c_2)\)</span>. The biggest problem in this list is the last, because if we know how fast <span class="math inline">\(C\)</span> grows then we'll have a better idea what are the constraints for the parameters to achieve the result in Route 3.</p>
<h2 id="further-questions">Further questions</h2>
<p>Here is a list of what I think may be interesting topics or potential problems to look at, with no guarantee that they are all awesome untouched research problems:</p>
<ol type="1">
<li>Prove Conjecture 1</li>
<li>Find a theoretically definitive answer whether the methods in Part 1 or Part 2 yield better privacy guarantees.</li>
<li>Study the non-Gaussian cases, general or specific. Let <span class="math inline">\(p\)</span> be some probability density, what is the tail bound of <span class="math inline">\(L(p(y) || p(y + \alpha))\)</span> for <span class="math inline">\(|\alpha| \le 1\)</span>? Can you find anything better than Gaussian? For a start, perhaps the nice tables of Rényi divergence in Gil-Alajaji-Linder 2013 may be useful?</li>
<li>Find out how useful Claim 26 is. Perhaps start with computing the constant <span class="math inline">\(C\)</span> nemerically.</li>
<li>Help with <a href="https://github.com/tensorflow/privacy/issues/23">the aforementioned issue</a> in the Tensorflow privacy package.</li>
</ol>
<h2 id="references">References</h2>
<ul>
<li>Abadi, Martín, Andy Chu, Ian Goodfellow, H. Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. “Deep Learning with Differential Privacy.” Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security - CCS’16, 2016, 308–18. <a href="https://doi.org/10.1145/2976749.2978318" class="uri">https://doi.org/10.1145/2976749.2978318</a>.</li>
<li>Erven, Tim van, and Peter Harremoës. “R\’enyi Divergence and Kullback-Leibler Divergence.” IEEE Transactions on Information Theory 60, no. 7 (July 2014): 3797–3820. <a href="https://doi.org/10.1109/TIT.2014.2320500" class="uri">https://doi.org/10.1109/TIT.2014.2320500</a>.</li>
<li>Gil, M., F. Alajaji, and T. Linder. “Rényi Divergence Measures for Commonly Used Univariate Continuous Distributions.” Information Sciences 249 (November 2013): 124–31. <a href="https://doi.org/10.1016/j.ins.2013.06.018" class="uri">https://doi.org/10.1016/j.ins.2013.06.018</a>.</li>
<li>Mironov, Ilya. “Renyi Differential Privacy.” 2017 IEEE 30th Computer Security Foundations Symposium (CSF), August 2017, 263–75. <a href="https://doi.org/10.1109/CSF.2017.11" class="uri">https://doi.org/10.1109/CSF.2017.11</a>.</li>
</ul>
</body>
</html>
</div>
<section id="isso-thread"></section>
</div>
</body>
</html>
|