1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
|
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" lang="" xml:lang="">
<head>
<meta charset="utf-8" />
<meta name="generator" content="pandoc" />
<meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=yes" />
<meta name="dcterms.date" content="2019-03-13" />
<title>A Tail of Two Densities</title>
<style>
code{white-space: pre-wrap;}
span.smallcaps{font-variant: small-caps;}
span.underline{text-decoration: underline;}
div.column{display: inline-block; vertical-align: top; width: 50%;}
</style>
<script src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/MathJax.js?config=TeX-AMS_CHTML-full" type="text/javascript"></script>
<!--[if lt IE 9]>
<script src="//cdnjs.cloudflare.com/ajax/libs/html5shiv/3.7.3/html5shiv-printshiv.min.js"></script>
<![endif]-->
</head>
<body>
<header id="title-block-header">
<h1 class="title">A Tail of Two Densities</h1>
<p class="date">2019-03-13</p>
</header>
<nav id="TOC">
<ul>
<li><a href="#the-gist-of-differential-privacy">The gist of differential privacy</a></li>
<li><a href="#epsilon-dp"><span class="math inline">\(\epsilon\)</span>-dp</a></li>
<li><a href="#approximate-differential-privacy">Approximate differential privacy</a><ul>
<li><a href="#indistinguishability">Indistinguishability</a></li>
<li><a href="#back-to-approximate-differential-privacy">Back to approximate differential privacy</a></li>
</ul></li>
<li><a href="#composition-theorems">Composition theorems</a></li>
<li><a href="#subsampling">Subsampling</a></li>
<li><a href="#references">References</a></li>
</ul>
</nav>
<p>This is Part 1 of a two-part post where I give an introduction to differential privacy, which is a study of tail bounds of the divergence between probability measures, with the end goal of applying it to stochastic gradient descent.</p>
<p>I start with the definition of <span class="math inline">\(\epsilon\)</span>-differential privacy (corresponding to max divergence), followed by <span class="math inline">\((\epsilon, \delta)\)</span>-differential privacy (a.k.a. approximate differential privacy, corresponding to the <span class="math inline">\(\delta\)</span>-approximate max divergence). I show a characterisation of the <span class="math inline">\((\epsilon, \delta)\)</span>-differential privacy as conditioned <span class="math inline">\(\epsilon\)</span>-differential privacy. Also, as examples, I illustrate the <span class="math inline">\(\epsilon\)</span>-dp with Laplace mechanism and, using some common tail bounds, the approximate dp with the Gaussian mechanism.</p>
<p>Then I continue to show the effect of combinatorial and sequential compositions of randomised queries (called mechanisms) on privacy by stating and proving the composition theorems for differential privacy, as well as the effect of mixing mechanisms, by presenting the subsampling theorem (a.k.a. amplification theorem).</p>
<p>In <a href="/posts/2019-03-14-great-but-manageable-expectations.html">Part 2</a>, I discuss the Rényi differential privacy, corresponding to the Rényi divergence, a study of the moment generating functions of the divergence between probability measures to derive the tail bounds.</p>
<p>Like in Part 1, I prove a composition theorem and a subsampling theorem.</p>
<p>I also attempt to reproduce a seemingly better moment bound for the Gaussian mechanism with subsampling, with one intermediate step which I am not able to prove.</p>
<p>After that I explain the Tensorflow implementation of differential privacy in its <a href="https://github.com/tensorflow/privacy/tree/master/privacy">Privacy</a> module, which focuses on the differentially private stochastic gradient descent algorithm (DP-SGD).</p>
<p>Finally I use the results from both Part 1 and Part 2 to obtain some privacy guarantees for composed subsampling queries in general, and for DP-SGD in particular. I also compare these privacy guarantees.</p>
<p><strong>Acknowledgement</strong>. I would like to thank <a href="https://stockholm.ai">Stockholm AI</a> for introducing me to the subject of differential privacy. Thanks to (in chronological order) Reynaldo Boulogne, Martin Abedi, Ilya Mironov, Kurt Johansson, Mark Bun, Salil Vadhan, Jonathan Ullman, Yuanyuan Xu and Yiting Li for communication and discussions. The research was done while working at <a href="https://www.kth.se/en/sci/institutioner/math">KTH Department of Mathematics</a>.</p>
<p><em>If you are confused by any notations, ask me or try <a href="/notations.html">this</a>. This post (including both Part 1 and Part2) is licensed under <a href="https://creativecommons.org/licenses/by-sa/4.0/">CC BY-SA</a> and <a href="https://www.gnu.org/licenses/fdl.html">GNU FDL</a>.</em></p>
<h2 id="the-gist-of-differential-privacy">The gist of differential privacy</h2>
<p>If you only have one minute, here is what differential privacy is about:</p>
<p>Let <span class="math inline">\(p\)</span> and <span class="math inline">\(q\)</span> be two probability densities, we define the <em>divergence variable</em> of <span class="math inline">\((p, q)\)</span> to be</p>
<p><span class="math display">\[L(p || q) := \log {p(\xi) \over q(\xi)}\]</span></p>
<p>where <span class="math inline">\(\xi\)</span> is a random variable distributed according to <span class="math inline">\(p\)</span>.</p>
<p>Roughly speaking, differential privacy is the study of the tail bound of <span class="math inline">\(L(p || q)\)</span>: for certain <span class="math inline">\(p\)</span>s and <span class="math inline">\(q\)</span>s, and for <span class="math inline">\(\epsilon > 0\)</span>, find <span class="math inline">\(\delta(\epsilon)\)</span> such that</p>
<p><span class="math display">\[\mathbb P(L(p || q) > \epsilon) < \delta(\epsilon),\]</span></p>
<p>where <span class="math inline">\(p\)</span> and <span class="math inline">\(q\)</span> are the laws of the outputs of a randomised functions on two very similar inputs. Moreover, to make matters even simpler, only three situations need to be considered:</p>
<ol type="1">
<li>(General case) <span class="math inline">\(q\)</span> is in the form of <span class="math inline">\(q(y) = p(y + \Delta)\)</span> for some bounded constant <span class="math inline">\(\Delta\)</span>.</li>
<li>(Compositions) <span class="math inline">\(p\)</span> and <span class="math inline">\(q\)</span> are combinatorial or sequential compositions of some simpler <span class="math inline">\(p_i\)</span>’s and <span class="math inline">\(q_i\)</span>’s respectively</li>
<li>(Subsampling) <span class="math inline">\(p\)</span> and <span class="math inline">\(q\)</span> are mixtures / averages of some simpler <span class="math inline">\(p_i\)</span>’s and <span class="math inline">\(q_i\)</span>’s respectively</li>
</ol>
<p>In applications, the inputs are databases and the randomised functions are queries with an added noise, and the tail bounds give privacy guarantees. When it comes to gradient descent, the input is the training dataset, and the query updates the parameters, and privacy is achieved by adding noise to the gradients.</p>
<p>Now if you have an hour...</p>
<h2 id="epsilon-dp"><span class="math inline">\(\epsilon\)</span>-dp</h2>
<p><strong>Definition (Mechanisms)</strong>. Let <span class="math inline">\(X\)</span> be a space with a metric <span class="math inline">\(d: X \times X \to \mathbb N\)</span>. A <em>mechanism</em> <span class="math inline">\(M\)</span> is a function that takes <span class="math inline">\(x \in X\)</span> as input and outputs a random variable on <span class="math inline">\(Y\)</span>.</p>
<p>In this post, <span class="math inline">\(X = Z^m\)</span> is the space of datasets of <span class="math inline">\(m\)</span> rows for some integer <span class="math inline">\(m\)</span>, where each item resides in <span class="math inline">\(Z\)</span>. In this case the distance <span class="math inline">\(d(x, x') := \#\{i: x_i \neq x'_i\}\)</span> is the number of rows that differ between <span class="math inline">\(x\)</span> and <span class="math inline">\(x'\)</span>.</p>
<p>Normally we have a query <span class="math inline">\(f: X \to Y\)</span>, and construct the mechanism <span class="math inline">\(M\)</span> from <span class="math inline">\(f\)</span> by adding a noise:</p>
<p><span class="math display">\[M(x) := f(x) + \text{noise}.\]</span></p>
<p>Later, we will also consider mechanisms constructed from composition or mixture of other mechanisms.</p>
<p>In this post <span class="math inline">\(Y = \mathbb R^d\)</span> for some <span class="math inline">\(d\)</span>.</p>
<p><strong>Definition (Sensitivity)</strong>. Let <span class="math inline">\(f: X \to \mathbb R^d\)</span> be a function. The <em>sensitivity</em> <span class="math inline">\(S_f\)</span> of <span class="math inline">\(f\)</span> is defined as</p>
<p><span class="math display">\[S_f := \sup_{x, x' \in X: d(x, x') = 1} \|f(x) - f(x')\|_2,\]</span></p>
<p>where <span class="math inline">\(\|y\|_2 = \sqrt{y_1^2 + ... + y_d^2}\)</span> is the <span class="math inline">\(\ell^2\)</span>-norm.</p>
<p><strong>Definition (Differential Privacy)</strong>. A mechanism <span class="math inline">\(M\)</span> is called <span class="math inline">\(\epsilon\)</span><em>-differential privacy</em> (<span class="math inline">\(\epsilon\)</span>-dp) if it satisfies the following condition: for all <span class="math inline">\(x, x' \in X\)</span> with <span class="math inline">\(d(x, x') = 1\)</span>, and for all measureable set <span class="math inline">\(S \subset \mathbb R^n\)</span>,</p>
<p><span class="math display">\[\mathbb P(M(x) \in S) \le e^\epsilon P(M(x') \in S). \qquad (1)\]</span></p>
<p>An example of <span class="math inline">\(\epsilon\)</span>-dp mechanism is the Laplace mechanism.</p>
<p><strong>Definition</strong>. The Laplace distribution over <span class="math inline">\(\mathbb R\)</span> with parameter <span class="math inline">\(b > 0\)</span> has probability density function</p>
<p><span class="math display">\[f_{\text{Lap}(b)}(x) = {1 \over 2 b} e^{- {|x| \over b}}.\]</span></p>
<p><strong>Definition</strong>. Let <span class="math inline">\(d = 1\)</span>. The Laplace mechanism is defined by</p>
<p><span class="math display">\[M(x) = f(x) + \text{Lap}(b).\]</span></p>
<p><strong>Claim</strong>. The Laplace mechanism with</p>
<p><span class="math display">\[b \ge \epsilon^{-1} S_f \qquad (1.5)\]</span></p>
<p>is <span class="math inline">\(\epsilon\)</span>-dp.</p>
<p><strong>Proof</strong>. Quite straightforward. Let <span class="math inline">\(p\)</span> and <span class="math inline">\(q\)</span> be the laws of <span class="math inline">\(M(x)\)</span> and <span class="math inline">\(M(x')\)</span> respectively.</p>
<p><span class="math display">\[{p (y) \over q (y)} = {f_{\text{Lap}(b)} (y - f(x)) \over f_{\text{Lap}(b)} (y - f(x'))} = \exp(b^{-1} (|y - f(x')| - |y - f(x)|))\]</span></p>
<p>Using triangular inequality <span class="math inline">\(|A| - |B| \le |A - B|\)</span> on the right hand side, we have</p>
<p><span class="math display">\[{p (y) \over q (y)} \le \exp(b^{-1} (|f(x) - f(x')|)) \le \exp(\epsilon)\]</span></p>
<p>where in the last step we use the condition (1.5). <span class="math inline">\(\square\)</span></p>
<h2 id="approximate-differential-privacy">Approximate differential privacy</h2>
<p>Unfortunately, <span class="math inline">\(\epsilon\)</span>-dp does not apply to the most commonly used noise, the Gaussian noise. To fix this, we need to relax the definition a bit.</p>
<p><strong>Definition</strong>. A mechanism <span class="math inline">\(M\)</span> is said to be <span class="math inline">\((\epsilon, \delta)\)</span><em>-differentially private</em> if for all <span class="math inline">\(x, x' \in X\)</span> with <span class="math inline">\(d(x, x') = 1\)</span> and for all measureable <span class="math inline">\(S \subset \mathbb R^d\)</span></p>
<p><span class="math display">\[\mathbb P(M(x) \in S) \le e^\epsilon P(M(x') \in S) + \delta. \qquad (2)\]</span></p>
<p>Immediately we see that the <span class="math inline">\((\epsilon, \delta)\)</span>-dp is meaningful only if <span class="math inline">\(\delta < 1\)</span>.</p>
<h3 id="indistinguishability">Indistinguishability</h3>
<p>To understand <span class="math inline">\((\epsilon, \delta)\)</span>-dp, it is helpful to study <span class="math inline">\((\epsilon, \delta)\)</span>-indistinguishability.</p>
<p><strong>Definition</strong>. Two probability measures <span class="math inline">\(p\)</span> and <span class="math inline">\(q\)</span> on the same space are called <span class="math inline">\((\epsilon, \delta)\)</span><em>-ind(istinguishable)</em> if for all measureable sets <span class="math inline">\(S\)</span>:</p>
<p><span class="math display">\[\begin{aligned}
p(S) \le e^\epsilon q(S) + \delta, \qquad (3) \\
q(S) \le e^\epsilon p(S) + \delta. \qquad (4)
\end{aligned}\]</span></p>
<p>As before, we also call random variables <span class="math inline">\(\xi\)</span> and <span class="math inline">\(\eta\)</span> to be <span class="math inline">\((\epsilon, \delta)\)</span>-ind if their laws are <span class="math inline">\((\epsilon, \delta)\)</span>-ind. When <span class="math inline">\(\delta = 0\)</span>, we call it <span class="math inline">\(\epsilon\)</span>-ind.</p>
<p>Immediately we have</p>
<p><strong>Claim 0</strong>. <span class="math inline">\(M\)</span> is <span class="math inline">\((\epsilon, \delta)\)</span>-dp (resp. <span class="math inline">\(\epsilon\)</span>-dp) iff <span class="math inline">\(M(x)\)</span> and <span class="math inline">\(M(x')\)</span> are <span class="math inline">\((\epsilon, \delta)\)</span>-ind (resp. <span class="math inline">\(\epsilon\)</span>-ind) for all <span class="math inline">\(x\)</span> and <span class="math inline">\(x'\)</span> with distance <span class="math inline">\(1\)</span>.</p>
<p><strong>Definition (Divergence Variable)</strong>. Let <span class="math inline">\(p\)</span> and <span class="math inline">\(q\)</span> be two probability measures. Let <span class="math inline">\(\xi\)</span> be a random variable distributed according to <span class="math inline">\(p\)</span>, we define a random variable <span class="math inline">\(L(p || q)\)</span> by</p>
<p><span class="math display">\[L(p || q) := \log {p(\xi) \over q(\xi)},\]</span></p>
<p>and call it the <em>divergence variable</em> of <span class="math inline">\((p, q)\)</span>.</p>
<p>One interesting and readily verifiable fact is</p>
<p><span class="math display">\[\mathbb E L(p || q) = D(p || q)\]</span></p>
<p>where <span class="math inline">\(D\)</span> is the KL-divergence.</p>
<p><strong>Claim 1</strong>. If</p>
<p><span class="math display">\[\begin{aligned}
\mathbb P(L(p || q) \le \epsilon) &\ge 1 - \delta, \qquad(5) \\
\mathbb P(L(q || p) \le \epsilon) &\ge 1 - \delta
\end{aligned}\]</span></p>
<p>then <span class="math inline">\(p\)</span> and <span class="math inline">\(q\)</span> are <span class="math inline">\((\epsilon, \delta)\)</span>-ind.</p>
<p><strong>Proof</strong>. We verify (3), and (4) can be shown in the same way. Let <span class="math inline">\(A := \{y \in Y: \log {p(y) \over q(y)} > \epsilon\}\)</span>, then by (5) we have</p>
<p><span class="math display">\[p(A) < \delta.\]</span></p>
<p>So</p>
<p><span class="math display">\[p(S) = p(S \cap A) + p(S \setminus A) \le \delta + e^\epsilon q(S \setminus A) \le \delta + e^\epsilon q(S).\]</span></p>
<p><span class="math inline">\(\square\)</span></p>
<p>This Claim translates differential privacy to the tail bound of divergence variables, and for the rest of this post all dp results are obtained by estimating this tail bound.</p>
<p>In the following we discuss the converse of Claim 1. The discussions are rather technical, and readers can skip to the next subsection on first reading.</p>
<p>The converse of Claim 1 is not true.</p>
<p><strong>Claim 2</strong>. There exists <span class="math inline">\(\epsilon, \delta > 0\)</span>, and <span class="math inline">\(p\)</span> and <span class="math inline">\(q\)</span> that are <span class="math inline">\((\epsilon, \delta)\)</span>-ind, such that</p>
<p><span class="math display">\[\begin{aligned}
\mathbb P(L(p || q) \le \epsilon) &< 1 - \delta, \\
\mathbb P(L(q || p) \le \epsilon) &< 1 - \delta
\end{aligned}\]</span></p>
<p><strong>Proof</strong>. Here's a example. Let <span class="math inline">\(Y = \{0, 1\}\)</span>, and <span class="math inline">\(p(0) = q(1) = 2 / 5\)</span> and <span class="math inline">\(p(1) = q(0) = 3 / 5\)</span>. Then it is not hard to verify that <span class="math inline">\(p\)</span> and <span class="math inline">\(q\)</span> are <span class="math inline">\((\log {4 \over 3}, {1 \over 3})\)</span>-ind: just check (3) for all four possible <span class="math inline">\(S \subset Y\)</span> and (4) holds by symmetry. On the other hand,</p>
<p><span class="math display">\[\mathbb P(L(p || q) \le \log {4 \over 3}) = \mathbb P(L(q || p) \le \log {4 \over 3}) = {2 \over 5} < {2 \over 3}.\]</span></p>
<p><span class="math inline">\(\square\)</span></p>
<p>A weaker version of the converse of Claim 1 is true (Kasiviswanathan-Smith 2015), though:</p>
<p><strong>Claim 3</strong>. Let <span class="math inline">\(\alpha > 1\)</span>. If <span class="math inline">\(p\)</span> and <span class="math inline">\(q\)</span> are <span class="math inline">\((\epsilon, \delta)\)</span>-ind, then</p>
<p><span class="math display">\[\mathbb P(L(p || q) > \alpha \epsilon) < {1 \over 1 - \exp((1 - \alpha) \epsilon)} \delta.\]</span></p>
<p><strong>Proof</strong>. Define</p>
<p><span class="math display">\[S = \{y: p(y) > e^{\alpha \epsilon} q(y)\}.\]</span></p>
<p>Then we have</p>
<p><span class="math display">\[e^{\alpha \epsilon} q(S) < p(S) \le e^\epsilon q(S) + \delta,\]</span></p>
<p>where the first inequality is due to the definition of <span class="math inline">\(S\)</span>, and the second due to the <span class="math inline">\((\epsilon, \delta)\)</span>-ind. Therefore</p>
<p><span class="math display">\[q(S) \le {\delta \over e^{\alpha \epsilon} - e^\epsilon}.\]</span></p>
<p>Using the <span class="math inline">\((\epsilon, \delta)\)</span>-ind again we have</p>
<p><span class="math display">\[p(S) \le e^\epsilon q(S) + \delta = {1 \over 1 - e^{(1 - \alpha) \epsilon}} \delta.\]</span></p>
<p><span class="math inline">\(\square\)</span></p>
<p>This can be quite bad if <span class="math inline">\(\epsilon\)</span> is small.</p>
<p>To prove the composition theorems in the next section, we need a condition better than that in Claim 1 so that we can go back and forth between indistinguishability and such condition. In other words, we need a <em>characterisation</em> of indistinguishability.</p>
<p>Let us take a careful look at the condition in Claim 1 and call it <strong>C1</strong>:</p>
<p><strong>C1</strong>. <span class="math inline">\(\mathbb P(L(p || q) \le \epsilon) \ge 1 - \delta\)</span> and <span class="math inline">\(\mathbb P(L(q || p) \le \epsilon) \ge 1 - \delta\)</span></p>
<p>It is equivalent to</p>
<p><strong>C2</strong>. there exist events <span class="math inline">\(A, B \subset Y\)</span> with probabilities <span class="math inline">\(p(A)\)</span> and <span class="math inline">\(q(B)\)</span> at least <span class="math inline">\(1 - \delta\)</span> such that <span class="math inline">\(\log p(y) - \log q(y) \le \epsilon\)</span> for all <span class="math inline">\(y \in A\)</span> and <span class="math inline">\(\log q(y) - \log p(y) \le \epsilon\)</span> for all <span class="math inline">\(y \in B\)</span>.</p>
<p>A similar-looking condition to <strong>C2</strong> is the following:</p>
<p><strong>C3</strong>. Let <span class="math inline">\(\Omega\)</span> be the <a href="https://en.wikipedia.org/wiki/Probability_space#Definition">underlying probability space</a>. There exist two events <span class="math inline">\(E, F \subset \Omega\)</span> with <span class="math inline">\(\mathbb P(E), \mathbb P(F) \ge 1 - \delta\)</span>, such that <span class="math inline">\(|\log p_{|E}(y) - \log q_{|F}(y)| \le \epsilon\)</span> for all <span class="math inline">\(y \in Y\)</span>.</p>
<p>Here <span class="math inline">\(p_{|E}\)</span> (resp. <span class="math inline">\(q_{|F}\)</span>) is <span class="math inline">\(p\)</span> (resp. <span class="math inline">\(q\)</span>) conditioned on event <span class="math inline">\(E\)</span> (resp. <span class="math inline">\(F\)</span>).</p>
<p><strong>Remark</strong>. Note that the events in <strong>C2</strong> and <strong>C3</strong> are in different spaces, and therefore we can not write <span class="math inline">\(p_{|E}(S)\)</span> as <span class="math inline">\(p(S | E)\)</span> or <span class="math inline">\(q_{|F}(S)\)</span> as <span class="math inline">\(q(S | F)\)</span>. In fact, if we let <span class="math inline">\(E\)</span> and <span class="math inline">\(F\)</span> in <strong>C3</strong> be subsets of <span class="math inline">\(Y\)</span> with <span class="math inline">\(p(E), q(F) \ge 1 - \delta\)</span> and assume <span class="math inline">\(p\)</span> and <span class="math inline">\(q\)</span> have the same supports, then <strong>C3</strong> degenerates to a stronger condition than <strong>C2</strong>. Indeed, in this case <span class="math inline">\(p_E(y) = p(y) 1_{y \in E}\)</span> and <span class="math inline">\(q_F(y) = q(y) 1_{y \in F}\)</span>, and so <span class="math inline">\(p_E(y) \le e^\epsilon q_F(y)\)</span> forces <span class="math inline">\(E \subset F\)</span>. We also obtain <span class="math inline">\(F \subset E\)</span> in the same way. This gives us <span class="math inline">\(E = F\)</span>, and <strong>C3</strong> becomes <strong>C2</strong> with <span class="math inline">\(A = B = E = F\)</span>.</p>
<p>As it turns out, <strong>C3</strong> is the condition we need.</p>
<p><strong>Claim 4</strong>. Two probability measures <span class="math inline">\(p\)</span> and <span class="math inline">\(q\)</span> are <span class="math inline">\((\epsilon, \delta)\)</span>-ind if and only if <strong>C3</strong> holds.</p>
<p><strong>Proof</strong>(Murtagh-Vadhan 2018). The "if" direction is proved in the same way as Lemma 1. Without loss of generality we may assume <span class="math inline">\(\mathbb P(E) = \mathbb P(F) \ge 1 - \delta\)</span>. To see this, suppose <span class="math inline">\(F\)</span> has higher probability than <span class="math inline">\(E\)</span>, then we can substitute <span class="math inline">\(F\)</span> with a subset of <span class="math inline">\(F\)</span> that has the same probability as <span class="math inline">\(E\)</span> (with possible enlargement of the probability space).</p>
<p>Let <span class="math inline">\(\xi \sim p\)</span> and <span class="math inline">\(\eta \sim q\)</span> be two independent random variables, then</p>
<p><span class="math display">\[\begin{aligned}
p(S) &= \mathbb P(\xi \in S | E) \mathbb P(E) + \mathbb P(\xi \in S; E^c) \\
&\le e^\epsilon \mathbb P(\eta \in S | F) \mathbb P(E) + \delta \\
&= e^\epsilon \mathbb P(\eta \in S | F) \mathbb P(F) + \delta\\
&\le e^\epsilon q(S) + \delta.
\end{aligned}\]</span></p>
<p>The "only-if" direction is more involved.</p>
<p>We construct events <span class="math inline">\(E\)</span> and <span class="math inline">\(F\)</span> by constructing functions <span class="math inline">\(e, f: Y \to [0, \infty)\)</span> satisfying the following conditions:</p>
<ol type="1">
<li><span class="math inline">\(0 \le e(y) \le p(y)\)</span> and <span class="math inline">\(0 \le f(y) \le q(y)\)</span> for all <span class="math inline">\(y \in Y\)</span>.</li>
<li><span class="math inline">\(|\log e(y) - \log f(y)| \le \epsilon\)</span> for all <span class="math inline">\(y \in Y\)</span>.</li>
<li><span class="math inline">\(e(Y), f(Y) \ge 1 - \delta\)</span>.</li>
<li><span class="math inline">\(e(Y) = f(Y)\)</span>.</li>
</ol>
<p>Here for a set <span class="math inline">\(S \subset Y\)</span>, <span class="math inline">\(e(S) := \int_S e(y) dy\)</span>, and the same goes for <span class="math inline">\(f(S)\)</span>.</p>
<p>Let <span class="math inline">\(\xi \sim p\)</span> and <span class="math inline">\(\eta \sim q\)</span>. Then we define <span class="math inline">\(E\)</span> and <span class="math inline">\(F\)</span> by</p>
<p><span class="math display">\[\mathbb P(E | \xi = y) = e(y) / p(y) \\
\mathbb P(F | \eta = y) = f(y) / q(y).\]</span></p>
<p><strong>Remark inside proof</strong>. This can seem a bit confusing. Intuitively, we can think of it this way when <span class="math inline">\(Y\)</span> is finite: Recall a random variable on <span class="math inline">\(Y\)</span> is a function from the probability space <span class="math inline">\(\Omega\)</span> to <span class="math inline">\(Y\)</span>. Let event <span class="math inline">\(G_y \subset \Omega\)</span> be defined as <span class="math inline">\(G_y = \xi^{-1} (y)\)</span>. We cut <span class="math inline">\(G_y\)</span> into the disjoint union of <span class="math inline">\(E_y\)</span> and <span class="math inline">\(G_y \setminus E_y\)</span> such that <span class="math inline">\(\mathbb P(E_y) = e(y)\)</span>. Then <span class="math inline">\(E = \bigcup_{y \in Y} E_y\)</span>. So <span class="math inline">\(e(y)\)</span> can be seen as the "density" of <span class="math inline">\(E\)</span>.</p>
<p>Indeed, given <span class="math inline">\(E\)</span> and <span class="math inline">\(F\)</span> defined this way, we have</p>
<p><span class="math display">\[p_E(y) = {e(y) \over e(Y)} \le {\exp(\epsilon) f(y) \over e(Y)} = {\exp(\epsilon) f(y) \over f(Y)} = \exp(\epsilon) q_F(y).\]</span></p>
<p>and</p>
<p><span class="math display">\[\mathbb P(E) = \int \mathbb P(E | \xi = y) p(y) dy = e(Y) \ge 1 - \delta,\]</span></p>
<p>and the same goes for <span class="math inline">\(\mathbb P(F)\)</span>.</p>
<p>What remains is to construct <span class="math inline">\(e(y)\)</span> and <span class="math inline">\(f(y)\)</span> satisfying the four conditions.</p>
<p>Like in the proof of Claim 1, let <span class="math inline">\(S, T \subset Y\)</span> be defined as</p>
<p><span class="math display">\[\begin{aligned}
S := \{y: p(y) > \exp(\epsilon) q(y)\},\\
T := \{y: q(y) > \exp(\epsilon) p(y)\}.
\end{aligned}\]</span></p>
<p>Let</p>
<p><span class="math display">\[\begin{aligned}
e(y) &:= \exp(\epsilon) q(y) 1_{y \in S} + p(y) 1_{y \notin S}\\
f(y) &:= \exp(\epsilon) p(y) 1_{y \in T} + q(y) 1_{y \notin T}. \qquad (6)
\end{aligned}\]</span></p>
<p>By checking them on the three disjoint subsets <span class="math inline">\(S\)</span>, <span class="math inline">\(T\)</span>, <span class="math inline">\((S \cup T)^c\)</span>, it is not hard to verify that the <span class="math inline">\(e(y)\)</span> and <span class="math inline">\(f(y)\)</span> constructed this way satisfy the first two conditions. They also satisfy the third condition:</p>
<p><span class="math display">\[\begin{aligned}
e(Y) &= 1 - (p(S) - \exp(\epsilon) q(S)) \ge 1 - \delta, \\
f(Y) &= 1 - (q(T) - \exp(\epsilon) p(T)) \ge 1 - \delta.
\end{aligned}\]</span></p>
<p>If <span class="math inline">\(e(Y) = f(Y)\)</span> then we are done. Otherwise, without loss of generality, assume <span class="math inline">\(e(Y) < f(Y)\)</span>, then all it remains to do is to reduce the value of <span class="math inline">\(f(y)\)</span> while preserving Condition 1, 2 and 3, until <span class="math inline">\(f(Y) = e(Y)\)</span>.</p>
<p>As it turns out, this can be achieved by reducing <span class="math inline">\(f(y)\)</span> on the set <span class="math inline">\(\{y \in Y: q(y) > p(y)\}\)</span>. To see this, let us rename the <span class="math inline">\(f(y)\)</span> defined in (6) <span class="math inline">\(f_+(y)\)</span>, and construct <span class="math inline">\(f_-(y)\)</span> by</p>
<p><span class="math display">\[f_-(y) := p(y) 1_{y \in T} + (q(y) \wedge p(y)) 1_{y \notin T}.\]</span></p>
<p>It is not hard to show that not only <span class="math inline">\(e(y)\)</span> and <span class="math inline">\(f_-(y)\)</span> also satisfy conditions 1-3, but</p>
<p><span class="math display">\[e(y) \ge f_-(y), \forall y \in Y,\]</span></p>
<p>and thus <span class="math inline">\(e(Y) \ge f_-(Y)\)</span>. Therefore there exists an <span class="math inline">\(f\)</span> that interpolates between <span class="math inline">\(f_-\)</span> and <span class="math inline">\(f_+\)</span> with <span class="math inline">\(f(Y) = e(Y)\)</span>. <span class="math inline">\(\square\)</span></p>
<p>To prove the adaptive composition theorem for approximate differential privacy, we need a similar claim (We use index shorthand <span class="math inline">\(\xi_{< i} = \xi_{1 : i - 1}\)</span> and similarly for other notations):</p>
<p><strong>Claim 5</strong>. Let <span class="math inline">\(\xi_{1 : i}\)</span> and <span class="math inline">\(\eta_{1 : i}\)</span> be random variables. Let</p>
<p><span class="math display">\[\begin{aligned}
p_i(S | y_{1 : i - 1}) := \mathbb P(\xi_i \in S | \xi_{1 : i - 1} = y_{1 : i - 1})\\
q_i(S | y_{1 : i - 1}) := \mathbb P(\eta_i \in S | \eta_{1 : i - 1} = y_{1 : i - 1})
\end{aligned}\]</span></p>
<p>be the conditional laws of <span class="math inline">\(\xi_i | \xi_{< i}\)</span> and <span class="math inline">\(\eta_i | \eta_{< i}\)</span> respectively. Then the following are equivalent:</p>
<ol type="1">
<li>For any <span class="math inline">\(y_{< i} \in Y^{i - 1}\)</span>, <span class="math inline">\(p_i(\cdot | y_{< i})\)</span> and <span class="math inline">\(q_i(\cdot | y_{< i})\)</span> are <span class="math inline">\((\epsilon, \delta)\)</span>-ind</li>
<li><p>There exists events <span class="math inline">\(E_i, F_i \subset \Omega\)</span> with <span class="math inline">\(\mathbb P(E_i | \xi_{<i} = y_{<i}) = \mathbb P(F_i | \eta_{<i} = y_{< i}) \ge 1 - \delta\)</span> for any <span class="math inline">\(y_{< i}\)</span>, such that <span class="math inline">\(p_{i | E_i}(\cdot | y_{< i})\)</span> and <span class="math inline">\(q_{i | E_i} (\cdot | y_{< i})\)</span> are <span class="math inline">\(\epsilon\)</span>-ind for any <span class="math inline">\(y_{< i}\)</span>, where <span class="math display">\[\begin{aligned}
p_{i | E_i}(S | y_{1 : i - 1}) := \mathbb P(\xi_i \in S | E_i, \xi_{1 : i - 1} = y_{1 : i - 1})\\
q_{i | F_i}(S | y_{1 : i - 1}) := \mathbb P(\eta_i \in S | F_i, \eta_{1 : i - 1} = y_{1 : i - 1})
\end{aligned}\]</span></p>
<p>are <span class="math inline">\(p_i\)</span> and <span class="math inline">\(q_i\)</span> conditioned on <span class="math inline">\(E_i\)</span> and <span class="math inline">\(F_i\)</span> respectively.</p></li>
</ol>
<p><strong>Proof</strong>. Item 2 => Item 1: as in the Proof of Claim 4,</p>
<p><span class="math display">\[\begin{aligned}
p_i(S | y_{< i}) &= p_{i | E_i} (S | y_{< i}) \mathbb P(E_i | \xi_{< i} = y_{< i}) + p_{i | E_i^c}(S | y_{< i}) \mathbb P(E_i^c | \xi_{< i} = y_{< i}) \\
&\le p_{i | E_i} (S | y_{< i}) \mathbb P(E_i | \xi_{< i} = y_{< i}) + \delta \\
&= p_{i | E_i} (S | y_{< i}) \mathbb P(F_i | \xi_{< i} = y_{< i}) + \delta \\
&\le e^\epsilon q_{i | F_i} (S | y_{< i}) \mathbb P(F_i | \xi_{< i} = y_{< i}) + \delta \\
&= e^\epsilon q_i (S | y_{< i}) + \delta.
\end{aligned}\]</span></p>
<p>The direction from <span class="math inline">\(q_i(S | y_{< i}) \le e^\epsilon p_i(S | y_{< i}) + \delta\)</span> can be shown in the same way.</p>
<p>Item 1 => Item 2: as in the Proof of Claim 4 we construct <span class="math inline">\(e(y_{1 : i})\)</span> and <span class="math inline">\(f(y_{1 : i})\)</span> as "densities" of events <span class="math inline">\(E_i\)</span> and <span class="math inline">\(F_i\)</span>.</p>
<p>Let</p>
<p><span class="math display">\[\begin{aligned}
e(y_{1 : i}) &:= e^\epsilon q_i(y_i | y_{< i}) 1_{y_i \in S_i(y_{< i})} + p_i(y_i | y_{< i}) 1_{y_i \notin S_i(y_{< i})}\\
f(y_{1 : i}) &:= e^\epsilon p_i(y_i | y_{< i}) 1_{y_i \in T_i(y_{< i})} + q_i(y_i | y_{< i}) 1_{y_i \notin T_i(y_{< i})}\\
\end{aligned}\]</span></p>
<p>where</p>
<p><span class="math display">\[\begin{aligned}
S_i(y_{< i}) = \{y_i \in Y: p_i(y_i | y_{< i}) > e^\epsilon q_i(y_i | y_{< i})\}\\
T_i(y_{< i}) = \{y_i \in Y: q_i(y_i | y_{< i}) > e^\epsilon p_i(y_i | y_{< i})\}.
\end{aligned}\]</span></p>
<p>Then <span class="math inline">\(E_i\)</span> and <span class="math inline">\(F_i\)</span> are defined as</p>
<p><span class="math display">\[\begin{aligned}
\mathbb P(E_i | \xi_{\le i} = y_{\le i}) &= {e(y_{\le i}) \over p_i(y_{\le i})},\\
\mathbb P(F_i | \xi_{\le i} = y_{\le i}) &= {f(y_{\le i}) \over q_i(y_{\le i})}.
\end{aligned}\]</span></p>
<p>The rest of the proof is almost the same as the proof of Lemma 2. <span class="math inline">\(\square\)</span></p>
<h3 id="back-to-approximate-differential-privacy">Back to approximate differential privacy</h3>
<p>By Claim 0 and 1 we have</p>
<p><strong>Claim 6</strong>. If for all <span class="math inline">\(x, x' \in X\)</span> with distance <span class="math inline">\(1\)</span></p>
<p><span class="math display">\[\mathbb P(L(M(x) || M(x')) \le \epsilon) \ge 1 - \delta,\]</span></p>
<p>then <span class="math inline">\(M\)</span> is <span class="math inline">\((\epsilon, \delta)\)</span>-dp.</p>
<p>Note that in the literature the divergence variable <span class="math inline">\(L(M(x) || M(x'))\)</span> is also called the <em>privacy loss</em>.</p>
<p>By Claim 0 and Claim 4 we have</p>
<p><strong>Claim 7</strong>. <span class="math inline">\(M\)</span> is <span class="math inline">\((\epsilon, \delta)\)</span>-dp if and only if for every <span class="math inline">\(x, x' \in X\)</span> with distance <span class="math inline">\(1\)</span>, there exist events <span class="math inline">\(E, F \subset \Omega\)</span> with <span class="math inline">\(\mathbb P(E) = \mathbb P(F) \ge 1 - \delta\)</span>, <span class="math inline">\(M(x) | E\)</span> and <span class="math inline">\(M(x') | F\)</span> are <span class="math inline">\(\epsilon\)</span>-ind.</p>
<p>We can further simplify the privacy loss <span class="math inline">\(L(M(x) || M(x'))\)</span>, by observing the translational and scaling invariance of <span class="math inline">\(L(\cdot||\cdot)\)</span>:</p>
<p><span class="math display">\[\begin{aligned}
L(\xi || \eta) &\overset{d}{=} L(\alpha \xi + \beta || \alpha \eta + \beta), \qquad \alpha \neq 0. \qquad (6.1)
\end{aligned}\]</span></p>
<p>With this and the definition</p>
<p><span class="math display">\[M(x) = f(x) + \zeta\]</span></p>
<p>for some random variable <span class="math inline">\(\zeta\)</span>, we have</p>
<p><span class="math display">\[L(M(x) || M(x')) \overset{d}{=} L(\zeta || \zeta + f(x') - f(x)).\]</span></p>
<p>Without loss of generality, we can consider <span class="math inline">\(f\)</span> with sensitivity <span class="math inline">\(1\)</span>, for</p>
<p><span class="math display">\[L(f(x) + S_f \zeta || f(x') + S_f \zeta) \overset{d}{=} L(S_f^{-1} f(x) + \zeta || S_f^{-1} f(x') + \zeta)\]</span></p>
<p>so for any noise <span class="math inline">\(\zeta\)</span> that achieves <span class="math inline">\((\epsilon, \delta)\)</span>-dp for a function with sensitivity <span class="math inline">\(1\)</span>, we have the same privacy guarantee by for an arbitrary function with sensitivity <span class="math inline">\(S_f\)</span> by adding a noise <span class="math inline">\(S_f \zeta\)</span>.</p>
<p>With Claim 6 we can show that the Gaussian mechanism is approximately differentially private. But first we need to define it.</p>
<p><strong>Definition (Gaussian mechanism)</strong>. Given a query <span class="math inline">\(f: X \to Y\)</span>, the <em>Gaussian mechanism</em> <span class="math inline">\(M\)</span> adds a Gaussian noise to the query:</p>
<p><span class="math display">\[M(x) = f(x) + N(0, \sigma^2 I).\]</span></p>
<p>Some tail bounds for the Gaussian distribution will be useful.</p>
<p><strong>Claim 8 (Gaussian tail bounds)</strong>. Let <span class="math inline">\(\xi \sim N(0, 1)\)</span> be a standard normal distribution. Then for <span class="math inline">\(t > 0\)</span></p>
<p><span class="math display">\[\mathbb P(\xi > t) < {1 \over \sqrt{2 \pi} t} e^{- {t^2 \over 2}}, \qquad (6.3)\]</span></p>
<p>and</p>
<p><span class="math display">\[\mathbb P(\xi > t) < e^{- {t^2 \over 2}}. \qquad (6.5)\]</span></p>
<p><strong>Proof</strong>. Both bounds are well known. The first can be proved using</p>
<p><span class="math display">\[\int_t^\infty e^{- {y^2 \over 2}} dy < \int_t^\infty {y \over t} e^{- {y^2 \over 2}} dy.\]</span></p>
<p>The second is shown using Chernoff bound. For any random variable <span class="math inline">\(\xi\)</span>,</p>
<p><span class="math display">\[\mathbb P(\xi > t) < {\mathbb E \exp(\lambda \xi) \over \exp(\lambda t)} = \exp(\kappa_\xi(\lambda) - \lambda t), \qquad (6.7)\]</span></p>
<p>where <span class="math inline">\(\kappa_\xi(\lambda) = \log \mathbb E \exp(\lambda \xi)\)</span> is the cumulant of <span class="math inline">\(\xi\)</span>. Since (6.7) holds for any <span class="math inline">\(\lambda\)</span>, we can get the best bound by minimising <span class="math inline">\(\kappa_\xi(\lambda) - \lambda t\)</span> (a.k.a. the Legendre transformation). When <span class="math inline">\(\xi\)</span> is standard normal, we get (6.5). <span class="math inline">\(\square\)</span></p>
<p><strong>Remark</strong>. We will use the Chernoff bound extensively in the second part of this post when considering Rényi differential privacy.</p>
<p><strong>Claim 9</strong>. The Gaussian mechanism on a query <span class="math inline">\(f\)</span> is <span class="math inline">\((\epsilon, \delta)\)</span>-dp, where</p>
<p><span class="math display">\[\delta = \exp(- (\epsilon \sigma / S_f - (2 \sigma / S_f)^{-1})^2 / 2). \qquad (6.8)\]</span></p>
<p>Conversely, to achieve give <span class="math inline">\((\epsilon, \delta)\)</span>-dp, we may set</p>
<p><span class="math display">\[\sigma > \left(\epsilon^{-1} \sqrt{2 \log \delta^{-1}} + (2 \epsilon)^{- {1 \over 2}}\right) S_f \qquad (6.81)\]</span></p>
<p>or</p>
<p><span class="math display">\[\sigma > (\epsilon^{-1} (1 \vee \sqrt{(\log (2 \pi)^{-1} \delta^{-2})_+}) + (2 \epsilon)^{- {1 \over 2}}) S_f \qquad (6.82)\]</span></p>
<p>or</p>
<p><span class="math display">\[\sigma > \epsilon^{-1} \sqrt{\log e^\epsilon \delta^{-2}} S_f \qquad (6.83)\]</span></p>
<p>or</p>
<p><span class="math display">\[\sigma > \epsilon^{-1} (\sqrt{1 + \epsilon} \vee \sqrt{(\log e^\epsilon (2 \pi)^{-1} \delta^{-2})_+}) S_f. \qquad (6.84)\]</span></p>
<p><strong>Proof</strong>. As discussed before we only need to consider the case where <span class="math inline">\(S_f = 1\)</span>. Fix arbitrary <span class="math inline">\(x, x' \in X\)</span> with <span class="math inline">\(d(x, x') = 1\)</span>. Let <span class="math inline">\(\zeta = (\zeta_1, ..., \zeta_d) \sim N(0, I_d)\)</span>.</p>
<p>By Claim 6 it suffices to bound</p>
<p><span class="math display">\[\mathbb P(L(M(x) || M(x')) > \epsilon)\]</span></p>
<p>We have by the linear invariance of <span class="math inline">\(L\)</span>,</p>
<p><span class="math display">\[L(M(x) || M(x')) = L(f(x) + \sigma \zeta || f(x') + \sigma \zeta) \overset{d}{=} L(\zeta|| \zeta + \Delta / \sigma),\]</span></p>
<p>where <span class="math inline">\(\Delta := f(x') - f(x)\)</span>.</p>
<p>Plugging in the Gaussian density, we have</p>
<p><span class="math display">\[L(M(x) || M(x')) \overset{d}{=} \sum_i {\Delta_i \over \sigma} \zeta_i + \sum_i {\Delta_i^2 \over 2 \sigma^2} \overset{d}{=} {\|\Delta\|_2 \over \sigma} \xi + {\|\Delta\|_2^2 \over 2 \sigma^2}.\]</span></p>
<p>where <span class="math inline">\(\xi \sim N(0, 1)\)</span>.</p>
<p>Hence</p>
<p><span class="math display">\[\mathbb P(L(M(x) || M(x')) > \epsilon) = \mathbb P(\zeta > {\sigma \over \|\Delta\|_2} \epsilon - {\|\Delta\|_2 \over 2 \sigma}).\]</span></p>
<p>Since <span class="math inline">\(\|\Delta\|_2 \le S_f = 1\)</span>, we have</p>
<p><span class="math display">\[\mathbb P(L(M(x) || M(x')) > \epsilon) \le \mathbb P(\xi > \sigma \epsilon - (2 \sigma)^{-1}).\]</span></p>
<p>Thus the problem is reduced to the tail bound of a standard normal distribution, so we can use Claim 8. Note that we implicitly require <span class="math inline">\(\sigma > (2 \epsilon)^{- 1 / 2}\)</span> here so that <span class="math inline">\(\sigma \epsilon - (2 \sigma)^{-1} > 0\)</span> and we can use the tail bounds.</p>
<p>Using (6.3) we have</p>
<p><span class="math display">\[\mathbb P(L(M(x) || M(x')) > \epsilon) < \exp(- (\epsilon \sigma - (2 \sigma)^{-1})^2 / 2).\]</span></p>
<p>This gives us (6.8).</p>
<p>To bound the right hand by <span class="math inline">\(\delta\)</span>, we require</p>
<p><span class="math display">\[\epsilon \sigma - {1 \over 2 \sigma} > \sqrt{2 \log \delta^{-1}}. \qquad (6.91)\]</span></p>
<p>Solving this inequality we have</p>
<p><span class="math display">\[\sigma > {\sqrt{2 \log \delta^{-1}} + \sqrt{2 \log \delta^{-1} + 2 \epsilon} \over 2 \epsilon}.\]</span></p>
<p>Using <span class="math inline">\(\sqrt{2 \log \delta^{-1} + 2 \epsilon} \le \sqrt{2 \log \delta^{-1}} + \sqrt{2 \epsilon}\)</span>, we can achieve the above inequality by having</p>
<p><span class="math display">\[\sigma > \epsilon^{-1} \sqrt{2 \log \delta^{-1}} + (2 \epsilon)^{-{1 \over 2}}.\]</span></p>
<p>This gives us (6.81).</p>
<p>Alternatively, we can use the concavity of <span class="math inline">\(\sqrt{\cdot}\)</span>:</p>
<p><span class="math display">\[(2 \epsilon)^{-1} (\sqrt{2 \log \delta^{-1}} + \sqrt{2 \log \delta^{-1} + 2 \epsilon}) \le \epsilon^{-1} \sqrt{\log e^\epsilon \delta^{-2}},\]</span></p>
<p>which gives us (6.83)</p>
<p>Back to (6.9), if we use (6.5) instead, we need</p>
<p><span class="math display">\[\log t + {t^2 \over 2} > \log {(2 \pi)^{- 1 / 2} \delta^{-1}}\]</span></p>
<p>where <span class="math inline">\(t = \epsilon \sigma - (2 \sigma)^{-1}\)</span>. This can be satisfied if</p>
<p><span class="math display">\[\begin{aligned}
t &> 1 \qquad (6.93)\\
t &> \sqrt{\log (2 \pi)^{-1} \delta^{-2}}. \qquad (6.95)
\end{aligned}\]</span></p>
<p>We can solve both inequalities as before and obtain</p>
<p><span class="math display">\[\sigma > \epsilon^{-1} (1 \vee \sqrt{(\log (2 \pi)^{-1} \delta^{-2})_+}) + (2 \epsilon)^{- {1 \over 2}},\]</span></p>
<p>or</p>
<p><span class="math display">\[\sigma > \epsilon^{-1}(\sqrt{1 + \epsilon} \vee \sqrt{(\log e^\epsilon (2 \pi)^{-1} \delta^{-2})_+}).\]</span></p>
<p>This gives us (6.82)(6.84). <span class="math inline">\(\square\)</span></p>
<p>When <span class="math inline">\(\epsilon \le \alpha\)</span> is bounded, by (6.83) (6.84) we can require either</p>
<p><span class="math display">\[\sigma > \epsilon^{-1} (\sqrt{\log e^\alpha \delta^{-2}}) S_f\]</span></p>
<p>or</p>
<p><span class="math display">\[\sigma > \epsilon^{-1} (\sqrt{1 + \alpha} \vee \sqrt{(\log (2 \pi)^{-1} e^\alpha \delta^{-2})_+}).\]</span></p>
<p>The second bound is similar to and slightly better than the one in Theorem A.1 of Dwork-Roth 2013, where <span class="math inline">\(\alpha = 1\)</span>:</p>
<p><span class="math display">\[\sigma > \epsilon^{-1} \left({3 \over 2} \vee \sqrt{(2 \log {5 \over 4} \delta^{-1})_+}\right) S_f.\]</span></p>
<p>Note that the lower bound of <span class="math inline">\({3 \over 2}\)</span> is implicitly required in the proof of Theorem A.1.</p>
<h2 id="composition-theorems">Composition theorems</h2>
<p>So far we have seen how a mechanism made of a single query plus a noise can be proved to be differentially private. But we need to understand the privacy when composing several mechanisms, combinatorially or sequentially. Let us first define the combinatorial case:</p>
<p><strong>Definition (Independent composition)</strong>. Let <span class="math inline">\(M_1, ..., M_k\)</span> be <span class="math inline">\(k\)</span> mechanisms with independent noises. The mechanism <span class="math inline">\(M = (M_1, ..., M_k)\)</span> is called the <em>independent composition</em> of <span class="math inline">\(M_{1 : k}\)</span>.</p>
<p>To define the adaptive composition, let us motivate it with an example of gradient descent. Consider the loss function <span class="math inline">\(\ell(x; \theta)\)</span> of a neural network, where <span class="math inline">\(\theta\)</span> is the parameter and <span class="math inline">\(x\)</span> the input, gradient descent updates its parameter <span class="math inline">\(\theta\)</span> at each time <span class="math inline">\(t\)</span>:</p>
<p><span class="math display">\[\theta_{t} = \theta_{t - 1} - \alpha m^{-1} \sum_{i = 1 : m} \nabla_\theta \ell(x_i; \theta) |_{\theta = \theta_{t - 1}}.\]</span></p>
<p>We may add privacy by adding noise <span class="math inline">\(\zeta_t\)</span> at each step:</p>
<p><span class="math display">\[\theta_{t} = \theta_{t - 1} - \alpha m^{-1} \sum_{i = 1 : m} \nabla_\theta \ell(x_i; \theta) |_{\theta = \theta_{t - 1}} + \zeta_t. \qquad (6.97)\]</span></p>
<p>Viewed as a sequence of mechanism, we have that at each time <span class="math inline">\(t\)</span>, the mechanism <span class="math inline">\(M_t\)</span> takes input <span class="math inline">\(x\)</span>, and outputs <span class="math inline">\(\theta_t\)</span>. But <span class="math inline">\(M_t\)</span> also depends on the output of the previous mechanism <span class="math inline">\(M_{t - 1}\)</span>. To this end, we define the adaptive composition.</p>
<p><strong>Definition (Adaptive composition)</strong>. Let <span class="math inline">\(({M_i(y_{1 : i - 1})})_{i = 1 : k}\)</span> be <span class="math inline">\(k\)</span> mechanisms with independent noises, where <span class="math inline">\(M_1\)</span> has no parameter, <span class="math inline">\(M_2\)</span> has one parameter in <span class="math inline">\(Y\)</span>, <span class="math inline">\(M_3\)</span> has two parameters in <span class="math inline">\(Y\)</span> and so on. For <span class="math inline">\(x \in X\)</span>, define <span class="math inline">\(\xi_i\)</span> recursively by</p>
<p><span class="math display">\[\begin{aligned}
\xi_1 &:= M_1(x)\\
\xi_i &:= M_i(\xi_1, \xi_2, ..., \xi_{i - 1}) (x).
\end{aligned}\]</span></p>
<p>The <em>adaptive composition</em> of <span class="math inline">\(M_{1 : k}\)</span> is defined by <span class="math inline">\(M(x) := (\xi_1, \xi_2, ..., \xi_k)\)</span>.</p>
<p>The definition of adaptive composition may look a bit complicated, but the point is to describe <span class="math inline">\(k\)</span> mechanisms such that for each <span class="math inline">\(i\)</span>, the output of the first, second, ..., <span class="math inline">\(i - 1\)</span>th mechanisms determine the <span class="math inline">\(i\)</span>th mechanism, like in the case of gradient descent.</p>
<p>It is not hard to write down the differentially private gradient descent as a sequential composition:</p>
<p><span class="math display">\[M_t(\theta_{1 : t - 1})(x) = \theta_{t - 1} - \alpha m^{-1} \sum_{i = 1 : m} \nabla_\theta \ell(x_i; \theta) |_{\theta = \theta_{t - 1}} + \zeta_t.\]</span></p>
<p>In Dwork-Rothblum-Vadhan 2010 (see also Dwork-Roth 2013) the adaptive composition is defined in a more general way, but the definition is based on the same principle, and proofs in this post on adaptive compositions carry over.</p>
<p>It is not hard to see that the adaptive composition degenerates to independent composition when each <span class="math inline">\(M_i(y_{1 : i})\)</span> evaluates to the same mechanism regardless of <span class="math inline">\(y_{1 : i}\)</span>, in which case the <span class="math inline">\(\xi_i\)</span>s are independent.</p>
<p>In the following when discussing adaptive compositions we sometimes omit the parameters for convenience without risk of ambiguity, and write <span class="math inline">\(M_i(y_{1 : i})\)</span> as <span class="math inline">\(M_i\)</span>, but keep in mind of the dependence on the parameters.</p>
<p>It is time to state and prove the composition theorems. In this section we consider <span class="math inline">\(2 \times 2 \times 2 = 8\)</span> cases, i.e. situations of three dimensions, where there are two choices in each dimension:</p>
<ol type="1">
<li>Composition of <span class="math inline">\(\epsilon\)</span>-dp or more generally <span class="math inline">\((\epsilon, \delta)\)</span>-dp mechanisms</li>
<li>Composition of independent or more generally adaptive mechanisms</li>
<li>Basic or advanced compositions</li>
</ol>
<p>Note that in the first two dimensions the second choice is more general than the first.</p>
<p>The proofs of these composition theorems will be laid out as follows:</p>
<ol type="1">
<li>Claim 10 - Basic composition theorem for <span class="math inline">\((\epsilon, \delta)\)</span>-dp with adaptive mechanisms: by a direct proof with an induction argument</li>
<li>Claim 14 - Advanced composition theorem for <span class="math inline">\(\epsilon\)</span>-dp with independent mechanisms: by factorising privacy loss and using Hoeffding's Inequality</li>
<li>Claim 16 - Advanced composition theorem for <span class="math inline">\(\epsilon\)</span>-dp with adaptive mechanisms: by factorising privacy loss and using Azuma's Inequality</li>
<li>Claims 17 and 18 - Advanced composition theorem for <span class="math inline">\((\epsilon, \delta)\)</span>-dp with independent / adaptive mechanisms: by using characterisations of <span class="math inline">\((\epsilon, \delta)\)</span>-dp in Claims 4 and 5 as an approximation of <span class="math inline">\(\epsilon\)</span>-dp and then using Proofs in Item 2 or 3.</li>
</ol>
<p><strong>Claim 10 (Basic composition theorem).</strong> Let <span class="math inline">\(M_{1 : k}\)</span> be <span class="math inline">\(k\)</span> mechanisms with independent noises such that for each <span class="math inline">\(i\)</span> and <span class="math inline">\(y_{1 : i - 1}\)</span>, <span class="math inline">\(M_i(y_{1 : i - 1})\)</span> is <span class="math inline">\((\epsilon_i, \delta_i)\)</span>-dp. Then the adpative composition of <span class="math inline">\(M_{1 : k}\)</span> is <span class="math inline">\((\sum_i \epsilon_i, \sum_i \delta_i)\)</span>-dp.</p>
<p><strong>Proof (Dwork-Lei 2009, see also Dwork-Roth 2013 Appendix B.1)</strong>. Let <span class="math inline">\(x\)</span> and <span class="math inline">\(x'\)</span> be neighbouring points in <span class="math inline">\(X\)</span>. Let <span class="math inline">\(M\)</span> be the adaptive composition of <span class="math inline">\(M_{1 : k}\)</span>. Define</p>
<p><span class="math display">\[\xi_{1 : k} := M(x), \qquad \eta_{1 : k} := M(x').\]</span></p>
<p>Let <span class="math inline">\(p^i\)</span> and <span class="math inline">\(q^i\)</span> be the laws of <span class="math inline">\((\xi_{1 : i})\)</span> and <span class="math inline">\((\eta_{1 : i})\)</span> respectively.</p>
<p>Let <span class="math inline">\(S_1, ..., S_k \subset Y\)</span> and <span class="math inline">\(T_i := \prod_{j = 1 : i} S_j\)</span>. We use two tricks.</p>
<ol type="1">
<li><p>Since <span class="math inline">\(\xi_i | \xi_{< i} = y_{< i}\)</span> and <span class="math inline">\(\eta_i | \eta_{< i} = y_{< i}\)</span> are <span class="math inline">\((\epsilon_i, \delta_i)\)</span>-ind, and a probability is no greater than <span class="math inline">\(1\)</span>, <span class="math display">\[\begin{aligned}
\mathbb P(\xi_i \in S_i | \xi_{< i} = y_{< i}) &\le (e^{\epsilon_i} \mathbb P(\eta_i \in S_i | \eta_{< i} = y_{< i}) + \delta_i) \wedge 1 \\
&\le (e^{\epsilon_i} \mathbb P(\eta_i \in S_i | \eta_{< i} = y_{< i}) + \delta_i) \wedge (1 + \delta_i) \\
&= (e^{\epsilon_i} \mathbb P(\eta_i \in S_i | \eta_{< i} = y_{< i}) \wedge 1) + \delta_i
\end{aligned}\]</span></p></li>
<li><p>Given <span class="math inline">\(p\)</span> and <span class="math inline">\(q\)</span> that are <span class="math inline">\((\epsilon, \delta)\)</span>-ind, define <span class="math display">\[\mu(x) = (p(x) - e^\epsilon q(x))_+.\]</span></p>
<p>We have <span class="math display">\[\mu(S) \le \delta, \forall S\]</span></p>
<p>In the following we define <span class="math inline">\(\mu^{i - 1} = (p^{i - 1} - e^\epsilon q^{i - 1})_+\)</span> for the same purpose.</p></li>
</ol>
<p>We use an inductive argument to prove the theorem:</p>
<p><span class="math display">\[\begin{aligned}
\mathbb P(\xi_{\le i} \in T_i) &= \int_{T_{i - 1}} \mathbb P(\xi_i \in S_i | \xi_{< i} = y_{< i}) p^{i - 1} (y_{< i}) dy_{< i} \\
&\le \int_{T_{i - 1}} (e^{\epsilon_i} \mathbb P(\eta_i \in S_i | \eta_{< i} = y_{< i}) \wedge 1) p^{i - 1}(y_{< i}) dy_{< i} + \delta_i\\
&\le \int_{T_{i - 1}} (e^{\epsilon_i} \mathbb P(\eta_i \in S_i | \eta_{< i} = y_{< i}) \wedge 1) (e^{\epsilon_1 + ... + \epsilon_{i - 1}} q^{i - 1}(y_{< i}) + \mu^{i - 1} (y_{< i})) dy_{< i} + \delta_i\\
&\le \int_{T_{i - 1}} e^{\epsilon_i} \mathbb P(\eta_i \in S_i | \eta_{< i} = y_{< i}) e^{\epsilon_1 + ... + \epsilon_{i - 1}} q^{i - 1}(y_{< i}) dy_{< i} + \mu_{i - 1}(T_{i - 1}) + \delta_i\\
&\le e^{\epsilon_1 + ... + \epsilon_i} \mathbb P(\eta_{\le i} \in T_i) + \delta_1 + ... + \delta_{i - 1} + \delta_i.\\
\end{aligned}\]</span></p>
<p>In the second line we use Trick 1; in the third line we use the induction assumption; in the fourth line we multiply the first term in the first braket with first term in the second braket, and the second term (i.e. <span class="math inline">\(1\)</span>) in the first braket with the second term in the second braket (i.e. the <span class="math inline">\(\mu\)</span> term); in the last line we use Trick 2.</p>
<p>The base case <span class="math inline">\(i = 1\)</span> is true since <span class="math inline">\(M_1\)</span> is <span class="math inline">\((\epsilon_1, \delta_1)\)</span>-dp. <span class="math inline">\(\square\)</span></p>
<p>To prove the advanced composition theorem, we start with some lemmas.</p>
<p><strong>Claim 11</strong>. If <span class="math inline">\(p\)</span> and <span class="math inline">\(q\)</span> are <span class="math inline">\(\epsilon\)</span>-ind, then</p>
<p><span class="math display">\[D(p || q) + D(q || p) \le \epsilon(e^\epsilon - 1).\]</span></p>
<p><strong>Proof</strong>. Since <span class="math inline">\(p\)</span> and <span class="math inline">\(q\)</span> are <span class="math inline">\(\epsilon\)</span>-ind, we have <span class="math inline">\(|\log p(x) - \log q(x)| \le \epsilon\)</span> for all <span class="math inline">\(x\)</span>. Let <span class="math inline">\(S := \{x: p(x) > q(x)\}\)</span>. Then we have on</p>
<p><span class="math display">\[\begin{aligned}
D(p || q) + D(q || p) &= \int (p(x) - q(x)) (\log p(x) - \log q(x)) dx\\
&= \int_S (p(x) - q(x)) (\log p(x) - \log q(x)) dx + \int_{S^c} (q(x) - p(x)) (\log q(x) - \log p(x)) dx\\
&\le \epsilon(\int_S p(x) - q(x) dx + \int_{S^c} q(x) - p(x) dx)
\end{aligned}\]</span></p>
<p>Since on <span class="math inline">\(S\)</span> we have <span class="math inline">\(q(x) \le p(x) \le e^\epsilon q(x)\)</span>, and on <span class="math inline">\(S^c\)</span> we have <span class="math inline">\(p(x) \le q(x) \le e^\epsilon p(x)\)</span>, we obtain</p>
<p><span class="math display">\[D(p || q) + D(q || p) \le \epsilon \int_S (1 - e^{-\epsilon}) p(x) dx + \epsilon \int_{S^c} (e^{\epsilon} - 1) p(x) dx \le \epsilon (e^{\epsilon} - 1),\]</span></p>
<p>where in the last step we use <span class="math inline">\(e^\epsilon - 1 \ge 1 - e^{- \epsilon}\)</span> and <span class="math inline">\(p(S) + p(S^c) = 1\)</span>. <span class="math inline">\(\square\)</span></p>
<p><strong>Claim 12</strong>. If <span class="math inline">\(p\)</span> and <span class="math inline">\(q\)</span> are <span class="math inline">\(\epsilon\)</span>-ind, then</p>
<p><span class="math display">\[D(p || q) \le a(\epsilon) \ge D(q || p),\]</span></p>
<p>where</p>
<p><span class="math display">\[a(\epsilon) = \epsilon (e^\epsilon - 1) 1_{\epsilon \le \log 2} + \epsilon 1_{\epsilon > \log 2} \le (\log 2)^{-1} \epsilon^2 1_{\epsilon \le \log 2} + \epsilon 1_{\epsilon > \log 2}. \qquad (6.98)\]</span></p>
<p><strong>Proof</strong>. Since <span class="math inline">\(p\)</span> and <span class="math inline">\(q\)</span> are <span class="math inline">\(\epsilon\)</span>-ind, we have</p>
<p><span class="math display">\[D(p || q) = \mathbb E_{\xi \sim p} \log {p(\xi) \over q(\xi)} \le \max_y {\log p(y) \over \log q(y)} \le \epsilon.\]</span></p>
<p>Comparing the quantity in Claim 11 (<span class="math inline">\(\epsilon(e^\epsilon - 1)\)</span>) with the quantity above (<span class="math inline">\(\epsilon\)</span>), we arrive at the conclusion. <span class="math inline">\(\square\)</span></p>
<p><strong>Claim 13 (Hoeffding's Inequality)</strong>. Let <span class="math inline">\(L_i\)</span> be independent random variables with <span class="math inline">\(|L_i| \le b\)</span>, and let <span class="math inline">\(L = L_1 + ... + L_k\)</span>, then for <span class="math inline">\(t > 0\)</span>,</p>
<p><span class="math display">\[\mathbb P(L - \mathbb E L \ge t) \le \exp(- {t^2 \over 2 k b^2}).\]</span></p>
<p><strong>Claim 14 (Advanced Independent Composition Theorem)</strong> (<span class="math inline">\(\delta = 0\)</span>). Fix <span class="math inline">\(0 < \beta < 1\)</span>. Let <span class="math inline">\(M_1, ..., M_k\)</span> be <span class="math inline">\(\epsilon\)</span>-dp, then the independent composition <span class="math inline">\(M\)</span> of <span class="math inline">\(M_{1 : k}\)</span> is <span class="math inline">\((k a(\epsilon) + \sqrt{2 k \log \beta^{-1}} \epsilon, \beta)\)</span>-dp.</p>
<p><strong>Remark</strong>. By (6.98) we know that <span class="math inline">\(k a(\epsilon) + \sqrt{2 k \log \beta^{-1}} \epsilon = \sqrt{2 k \log \beta^{-1}} \epsilon + k O(\epsilon^2)\)</span> when <span class="math inline">\(\epsilon\)</span> is sufficiently small, in which case the leading term is of order <span class="math inline">\(O(\sqrt k \epsilon)\)</span> and we save a <span class="math inline">\(\sqrt k\)</span> in the <span class="math inline">\(\epsilon\)</span>-part compared to the Basic Composition Theorem (Claim 10).</p>
<p><strong>Remark</strong>. In practice one can try different choices of <span class="math inline">\(\beta\)</span> and settle with the one that gives the best privacy guarantee. See the discussions at the end of <a href="/posts/2019-03-14-great-but-manageable-expectations.html">Part 2 of this post</a>.</p>
<p><strong>Proof</strong>. Let <span class="math inline">\(p_i\)</span>, <span class="math inline">\(q_i\)</span>, <span class="math inline">\(p\)</span> and <span class="math inline">\(q\)</span> be the laws of <span class="math inline">\(M_i(x)\)</span>, <span class="math inline">\(M_i(x')\)</span>, <span class="math inline">\(M(x)\)</span> and <span class="math inline">\(M(x')\)</span> respectively.</p>
<p><span class="math display">\[\mathbb E L_i = D(p_i || q_i) \le a(\epsilon),\]</span></p>
<p>where <span class="math inline">\(L_i := L(p_i || q_i)\)</span>. Due to <span class="math inline">\(\epsilon\)</span>-ind also have</p>
<p><span class="math display">\[|L_i| \le \epsilon.\]</span></p>
<p>Therefore, by Hoeffding's Inequality,</p>
<p><span class="math display">\[\mathbb P(L - k a(\epsilon) \ge t) \le \mathbb P(L - \mathbb E L \ge t) \le \exp(- t^2 / 2 k \epsilon^2),\]</span></p>
<p>where <span class="math inline">\(L := \sum_i L_i = L(p || q)\)</span>.</p>
<p>Plugging in <span class="math inline">\(t = \sqrt{2 k \epsilon^2 \log \beta^{-1}}\)</span>, we have</p>
<p><span class="math display">\[\mathbb P(L(p || q) \le k a(\epsilon) + \sqrt{2 k \epsilon^2 \log \beta^{-1}}) \ge 1 - \beta.\]</span></p>
<p>Similarly we also have</p>
<p><span class="math display">\[\mathbb P(L(q || p) \le k a(\epsilon) + \sqrt{2 k \epsilon^2 \log \beta^{-1}}) \ge 1 - \beta.\]</span></p>
<p>By Claim 1 we arrive at the conclusion. <span class="math inline">\(\square\)</span></p>
<p><strong>Claim 15 (Azuma's Inequality)</strong>. Let <span class="math inline">\(X_{0 : k}\)</span> be a supermartingale. If <span class="math inline">\(|X_i - X_{i - 1}| \le b\)</span>, then</p>
<p><span class="math display">\[\mathbb P(X_k - X_0 \ge t) \le \exp(- {t^2 \over 2 k b^2}).\]</span></p>
<p>Azuma's Inequality implies a slightly weaker version of Hoeffding's Inequality. To see this, let <span class="math inline">\(L_{1 : k}\)</span> be independent variables with <span class="math inline">\(|L_i| \le b\)</span>. Let <span class="math inline">\(X_i = \sum_{j = 1 : i} L_j - \mathbb E L_j\)</span>. Then <span class="math inline">\(X_{0 : k}\)</span> is a martingale, and</p>
<p><span class="math display">\[| X_i - X_{i - 1} | = | L_i - \mathbb E L_i | \le 2 b,\]</span></p>
<p>since <span class="math inline">\(\|L_i\|_1 \le \|L_i\|_\infty\)</span>. Hence by Azuma's Inequality,</p>
<p><span class="math display">\[\mathbb P(L - \mathbb E L \ge t) \le \exp(- {t^2 \over 8 k b^2}).\]</span></p>
<p>Of course here we have made no assumption on <span class="math inline">\(\mathbb E L_i\)</span>. If instead we have some bound for the expectation, say <span class="math inline">\(|\mathbb E L_i| \le a\)</span>, then by the same derivation we have</p>
<p><span class="math display">\[\mathbb P(L - \mathbb E L \ge t) \le \exp(- {t^2 \over 2 k (a + b)^2}).\]</span></p>
<p>It is not hard to see what Azuma is to Hoeffding is like adaptive composition to independent composition. Indeed, we can use Azuma's Inequality to prove the Advanced Adaptive Composition Theorem for <span class="math inline">\(\delta = 0\)</span>.</p>
<p><strong>Claim 16 (Advanced Adaptive Composition Theorem)</strong> (<span class="math inline">\(\delta = 0\)</span>). Let <span class="math inline">\(\beta > 0\)</span>. Let <span class="math inline">\(M_{1 : k}\)</span> be <span class="math inline">\(k\)</span> mechanisms with independent noises such that for each <span class="math inline">\(i\)</span> and <span class="math inline">\(y_{1 : i}\)</span>, <span class="math inline">\(M_i(y_{1 : i})\)</span> is <span class="math inline">\((\epsilon, 0)\)</span>-dp. Then the adpative composition of <span class="math inline">\(M_{1 : k}\)</span> is <span class="math inline">\((k a(\epsilon) + \sqrt{2 k \log \beta^{-1}} (\epsilon + a(\epsilon)), \beta)\)</span>-dp.</p>
<p><strong>Proof</strong>. As before, let <span class="math inline">\(\xi_{1 : k} \overset{d}{=} M(x)\)</span> and <span class="math inline">\(\eta_{1 : k} \overset{d}{=} M(x')\)</span>, where <span class="math inline">\(M\)</span> is the adaptive composition of <span class="math inline">\(M_{1 : k}\)</span>. Let <span class="math inline">\(p_i\)</span> (resp. <span class="math inline">\(q_i\)</span>) be the law of <span class="math inline">\(\xi_i | \xi_{< i}\)</span> (resp. <span class="math inline">\(\eta_i | \eta_{< i}\)</span>). Let <span class="math inline">\(p^i\)</span> (resp. <span class="math inline">\(q^i\)</span>) be the law of <span class="math inline">\(\xi_{\le i}\)</span> (resp. <span class="math inline">\(\eta_{\le i}\)</span>). We want to construct supermartingale <span class="math inline">\(X\)</span>. To this end, let</p>
<p><span class="math display">\[X_i = \log {p^i(\xi_{\le i}) \over q^i(\xi_{\le i})} - i a(\epsilon) \]</span></p>
<p>We show that <span class="math inline">\((X_i)\)</span> is a supermartingale:</p>
<p><span class="math display">\[\begin{aligned}
\mathbb E(X_i - X_{i - 1} | X_{i - 1}) &= \mathbb E \left(\log {p_i (\xi_i | \xi_{< i}) \over q_i (\xi_i | \xi_{< i})} - a(\epsilon) | \log {p^{i - 1} (\xi_{< i}) \over q^{i - 1} (\xi_{< i})}\right) \\
&= \mathbb E \left( \mathbb E \left(\log {p_i (\xi_i | \xi_{< i}) \over q_i (\xi_i | \xi_{< i})} | \xi_{< i}\right) | \log {p^{i - 1} (\xi_{< i}) \over q^{i - 1} (\xi_{< i})}\right) - a(\epsilon) \\
&= \mathbb E \left( D(p_i (\cdot | \xi_{< i}) || q_i (\cdot | \xi_{< i})) | \log {p^{i - 1} (\xi_{< i}) \over q^{i - 1} (\xi_{< i})}\right) - a(\epsilon) \\
&\le 0,
\end{aligned}\]</span></p>
<p>since by Claim 12 <span class="math inline">\(D(p_i(\cdot | y_{< i}) || q_i(\cdot | y_{< i})) \le a(\epsilon)\)</span> for all <span class="math inline">\(y_{< i}\)</span>.</p>
<p>Since</p>
<p><span class="math display">\[| X_i - X_{i - 1} | = | \log {p_i(\xi_i | \xi_{< i}) \over q_i(\xi_i | \xi_{< i})} - a(\epsilon) | \le \epsilon + a(\epsilon),\]</span></p>
<p>by Azuma's Inequality,</p>
<p><span class="math display">\[\mathbb P(\log {p^k(\xi_{1 : k}) \over q^k(\xi_{1 : k})} \ge k a(\epsilon) + t) \le \exp(- {t^2 \over 2 k (\epsilon + a(\epsilon))^2}). \qquad(6.99)\]</span></p>
<p>Let <span class="math inline">\(t = \sqrt{2 k \log \beta^{-1}} (\epsilon + a(\epsilon))\)</span> we are done. <span class="math inline">\(\square\)</span></p>
<p><strong>Claim 17 (Advanced Independent Composition Theorem)</strong>. Fix <span class="math inline">\(0 < \beta < 1\)</span>. Let <span class="math inline">\(M_1, ..., M_k\)</span> be <span class="math inline">\((\epsilon, \delta)\)</span>-dp, then the independent composition <span class="math inline">\(M\)</span> of <span class="math inline">\(M_{1 : k}\)</span> is <span class="math inline">\((k a(\epsilon) + \sqrt{2 k \log \beta^{-1}} \epsilon, k \delta + \beta)\)</span>-dp.</p>
<p><strong>Proof</strong>. By Claim 4, there exist events <span class="math inline">\(E_{1 : k}\)</span> and <span class="math inline">\(F_{1 : k}\)</span> such that</p>
<ol type="1">
<li>The laws <span class="math inline">\(p_{i | E_i}\)</span> and <span class="math inline">\(q_{i | F_i}\)</span> are <span class="math inline">\(\epsilon\)</span>-ind.</li>
<li><span class="math inline">\(\mathbb P(E_i), \mathbb P(F_i) \ge 1 - \delta\)</span>.</li>
</ol>
<p>Let <span class="math inline">\(E := \bigcap E_i\)</span> and <span class="math inline">\(F := \bigcap F_i\)</span>, then they both have probability at least <span class="math inline">\(1 - k \delta\)</span>, and <span class="math inline">\(p_{i | E}\)</span> and <span class="math inline">\(q_{i | F}\)</span> are <span class="math inline">\(\epsilon\)</span>-ind.</p>
<p>By Claim 14, <span class="math inline">\(p_{|E}\)</span> and <span class="math inline">\(q_{|F}\)</span> are <span class="math inline">\((\epsilon' := k a(\epsilon) + \sqrt{2 k \epsilon^2 \log \beta^{-1}}, \beta)\)</span>-ind. Let us shrink the bigger event between <span class="math inline">\(E\)</span> and <span class="math inline">\(F\)</span> so that they have equal probabilities. Then</p>
<p><span class="math display">\[\begin{aligned}
p (S) &\le p_{|E}(S) \mathbb P(E) + \mathbb P(E^c) \\
&\le (e^{\epsilon'} q_{|F}(S) + \beta) \mathbb P(F) + k \delta\\
&\le e^{\epsilon'} q(S) + \beta + k \delta.
\end{aligned}\]</span></p>
<p><span class="math inline">\(\square\)</span></p>
<p><strong>Claim 18 (Advanced Adaptive Composition Theorem)</strong>. Fix <span class="math inline">\(0 < \beta < 1\)</span>. Let <span class="math inline">\(M_{1 : k}\)</span> be <span class="math inline">\(k\)</span> mechanisms with independent noises such that for each <span class="math inline">\(i\)</span> and <span class="math inline">\(y_{1 : i}\)</span>, <span class="math inline">\(M_i(y_{1 : i})\)</span> is <span class="math inline">\((\epsilon, \delta)\)</span>-dp. Then the adpative composition of <span class="math inline">\(M_{1 : k}\)</span> is <span class="math inline">\((k a(\epsilon) + \sqrt{2 k \log \beta^{-1}} (\epsilon + a(\epsilon)), \beta + k \delta)\)</span>-dp.</p>
<p><strong>Proof</strong>. By Claim 5, there exist events <span class="math inline">\(E_{1 : k}\)</span> and <span class="math inline">\(F_{1 : k}\)</span> such that</p>
<ol type="1">
<li>The laws <span class="math inline">\(p_{i | E_i}(\cdot | y_{< i})\)</span> and <span class="math inline">\(q_{i | F_i}(\cdot | y_{< i})\)</span> are <span class="math inline">\(\epsilon\)</span>-ind for all <span class="math inline">\(y_{< i}\)</span>.</li>
<li><span class="math inline">\(\mathbb P(E_i | y_{< i}), \mathbb P(F_i | y_{< i}) \ge 1 - \delta\)</span> for all <span class="math inline">\(y_{< i}\)</span>.</li>
</ol>
<p>Let <span class="math inline">\(E := \bigcap E_i\)</span> and <span class="math inline">\(F := \bigcap F_i\)</span>, then they both have probability at least <span class="math inline">\(1 - k \delta\)</span>, and <span class="math inline">\(p_{i | E}(\cdot | y_{< i}\)</span> and <span class="math inline">\(q_{i | F}(\cdot | y_{< i})\)</span> are <span class="math inline">\(\epsilon\)</span>-ind.</p>
<p>By Advanced Adaptive Composition Theorem (<span class="math inline">\(\delta = 0\)</span>), <span class="math inline">\(p_{|E}\)</span> and <span class="math inline">\(q_{|F}\)</span> are <span class="math inline">\((\epsilon' := k a(\epsilon) + \sqrt{2 k \log \beta^{-1}} (\epsilon + a(\epsilon)), \beta)\)</span>-ind.</p>
<p>The rest is the same as in the proof of Claim 17. <span class="math inline">\(\square\)</span></p>
<h2 id="subsampling">Subsampling</h2>
<p>Stochastic gradient descent is like gradient descent, but with random subsampling.</p>
<p>Recall we have been considering databases in the space <span class="math inline">\(Z^m\)</span>. Let <span class="math inline">\(n < m\)</span> be a positive integer, <span class="math inline">\(\mathcal I := \{I \subset [m]: |I| = n\}\)</span> be the set of subsets of <span class="math inline">\([m]\)</span> of size <span class="math inline">\(n\)</span>, and <span class="math inline">\(\gamma\)</span> a random subset sampled uniformly from <span class="math inline">\(\mathcal I\)</span>. Let <span class="math inline">\(r = {n \over m}\)</span> which we call the subsampling rate. Then we may add a subsampling module to the noisy gradient descent algorithm (6.97) considered before</p>
<p><span class="math display">\[\theta_{t} = \theta_{t - 1} - \alpha n^{-1} \sum_{i \in \gamma} \nabla_\theta h_\theta(x_i) |_{\theta = \theta_{t - 1}} + \zeta_t. \qquad (7)\]</span></p>
<p>It turns out subsampling has an amplification effect on privacy.</p>
<p><strong>Claim 19 (Ullman 2017)</strong>. Fix <span class="math inline">\(r \in [0, 1]\)</span>. Let <span class="math inline">\(n \le m\)</span> be two nonnegative integers with <span class="math inline">\(n = r m\)</span>. Let <span class="math inline">\(N\)</span> be an <span class="math inline">\((\epsilon, \delta)\)</span>-dp machanism on <span class="math inline">\(X^n\)</span>. Define mechanism <span class="math inline">\(M\)</span> on <span class="math inline">\(X^m\)</span> by</p>
<p><span class="math display">\[M(x) = N(x_\gamma)\]</span></p>
<p>Then <span class="math inline">\(M\)</span> is <span class="math inline">\((\log (1 + r(e^\epsilon - 1)), r \delta)\)</span>-dp.</p>
<p><strong>Remark</strong>. Some seem to cite Kasiviswanathan-Lee-Nissim-Raskhodnikova-Smith 2005 for this result, but it is not clear to me how it appears there.</p>
<p><strong>Proof</strong>. Let <span class="math inline">\(x, x' \in X^n\)</span> such that they differ by one row <span class="math inline">\(x_i \neq x_i'\)</span>. Naturally we would like to consider the cases where the index <span class="math inline">\(i\)</span> is picked and the ones where it is not separately. Let <span class="math inline">\(\mathcal I_\in\)</span> and <span class="math inline">\(\mathcal I_\notin\)</span> be these two cases:</p>
<p><span class="math display">\[\begin{aligned}
\mathcal I_\in = \{J \subset \mathcal I: i \in J\}\\
\mathcal I_\notin = \{J \subset \mathcal I: i \notin J\}\\
\end{aligned}\]</span></p>
<p>We will use these notations later. Let <span class="math inline">\(A\)</span> be the event <span class="math inline">\(\{\gamma \ni i\}\)</span>.</p>
<p>Let <span class="math inline">\(p\)</span> and <span class="math inline">\(q\)</span> be the laws of <span class="math inline">\(M(x)\)</span> and <span class="math inline">\(M(x')\)</span> respectively. We collect some useful facts about them. First due to <span class="math inline">\(N\)</span> being <span class="math inline">\((\epsilon, \delta)\)</span>-dp,</p>
<p><span class="math display">\[p_{|A}(S) \le e^\epsilon q_{|A}(S) + \delta.\]</span></p>
<p>Also,</p>
<p><span class="math display">\[p_{|A}(S) \le e^\epsilon p_{|A^c}(S) + \delta.\]</span></p>
<p>To see this, note that being conditional laws, <span class="math inline">\(p_A\)</span> and <span class="math inline">\(p_{A^c}\)</span> are averages of laws over <span class="math inline">\(\mathcal I_\in\)</span> and <span class="math inline">\(\mathcal I_\notin\)</span> respectively:</p>
<p><span class="math display">\[\begin{aligned}
p_{|A}(S) = |\mathcal I_\in|^{-1} \sum_{I \in \mathcal I_\in} \mathbb P(N(x_I) \in S)\\
p_{|A^c}(S) = |\mathcal I_\notin|^{-1} \sum_{J \in \mathcal I_\notin} \mathbb P(N(x_J) \in S).
\end{aligned}\]</span></p>
<p>Now we want to pair the <span class="math inline">\(I\)</span>'s in <span class="math inline">\(\mathcal I_\in\)</span> and <span class="math inline">\(J\)</span>'s in <span class="math inline">\(\mathcal I_\notin\)</span> so that they differ by one index only, which means <span class="math inline">\(d(x_I, x_J) = 1\)</span>. Formally, this means we want to consider the set:</p>
<p><span class="math display">\[\mathcal D := \{(I, J) \in \mathcal I_\in \times \mathcal I_\notin: |I \cap J| = n - 1\}.\]</span></p>
<p>We may observe by trying out some simple cases that every <span class="math inline">\(I \in \mathcal I_\in\)</span> is paired with <span class="math inline">\(n\)</span> elements in <span class="math inline">\(\mathcal I_\notin\)</span>, and every <span class="math inline">\(J \in \mathcal I_\notin\)</span> is paired with <span class="math inline">\(m - n\)</span> elements in <span class="math inline">\(\mathcal I_\in\)</span>. Therefore</p>
<p><span class="math display">\[p_{|A}(S) = |\mathcal D|^{-1} \sum_{(I, J) \in \mathcal D} \mathbb P(N(x_I \in S)) \le |\mathcal D|^{-1} \sum_{(I, J) \in \mathcal D} (e^\epsilon \mathbb P(N(x_J \in S)) + \delta) = e^\epsilon p_{|A^c} (S) + \delta.\]</span></p>
<p>Since each of the <span class="math inline">\(m\)</span> indices is picked independently with probability <span class="math inline">\(r\)</span>, we have</p>
<p><span class="math display">\[\mathbb P(A) = r.\]</span></p>
<p>Let <span class="math inline">\(t \in [0, 1]\)</span> to be determined. We may write</p>
<p><span class="math display">\[\begin{aligned}
p(S) &= r p_{|A} (S) + (1 - r) p_{|A^c} (S)\\
&\le r(t e^\epsilon q_{|A}(S) + (1 - t) e^\epsilon q_{|A^c}(S) + \delta) + (1 - r) q_{|A^c} (S)\\
&= rte^\epsilon q_{|A}(S) + (r(1 - t) e^\epsilon + (1 - r)) q_{|A^c} (S) + r \delta\\
&= te^\epsilon r q_{|A}(S) + \left({r \over 1 - r}(1 - t) e^\epsilon + 1\right) (1 - r) q_{|A^c} (S) + r \delta \\
&\le \left(t e^\epsilon \wedge \left({r \over 1 - r} (1 - t) e^\epsilon + 1\right)\right) q(S) + r \delta. \qquad (7.5)
\end{aligned}\]</span></p>
<p>We can see from the last line that the best bound we can get is when</p>
<p><span class="math display">\[t e^\epsilon = {r \over 1 - r} (1 - t) e^\epsilon + 1.\]</span></p>
<p>Solving this equation we obtain</p>
<p><span class="math display">\[t = r + e^{- \epsilon} - r e^{- \epsilon}\]</span></p>
<p>and plugging this in (7.5) we have</p>
<p><span class="math display">\[p(S) \le (1 + r(e^\epsilon - 1)) q(S) + r \delta.\]</span></p>
<p><span class="math inline">\(\square\)</span></p>
<p>Since <span class="math inline">\(\log (1 + x) < x\)</span> for <span class="math inline">\(x > 0\)</span>, we can rewrite the conclusion of the Claim to <span class="math inline">\((r(e^\epsilon - 1), r \delta)\)</span>-dp. Further more, if <span class="math inline">\(\epsilon < \alpha\)</span> for some <span class="math inline">\(\alpha\)</span>, we can rewrite it as <span class="math inline">\((r \alpha^{-1} (e^\alpha - 1) \epsilon, r \delta)\)</span>-dp or <span class="math inline">\((O(r \epsilon), r \delta)\)</span>-dp.</p>
<p>Let <span class="math inline">\(\epsilon < 1\)</span>. We see that if the mechanism <span class="math inline">\(N\)</span> is <span class="math inline">\((\epsilon, \delta)\)</span>-dp on <span class="math inline">\(Z^n\)</span>, then <span class="math inline">\(M\)</span> is <span class="math inline">\((2 r \epsilon, r \delta)\)</span>-dp, and if we run it over <span class="math inline">\(k / r\)</span> minibatches, by Advanced Adaptive Composition theorem, we have <span class="math inline">\((\sqrt{2 k r \log \beta^{-1}} \epsilon + 2 k r \epsilon^2, k \delta + \beta)\)</span>-dp.</p>
<p>This is better than the privacy guarantee without subsampling, where we run over <span class="math inline">\(k\)</span> iterations and obtain <span class="math inline">\((\sqrt{2 k \log \beta^{-1}} \epsilon + 2 k \epsilon^2, k \delta + \beta)\)</span>-dp. So with subsampling we gain an extra <span class="math inline">\(\sqrt r\)</span> in the <span class="math inline">\(\epsilon\)</span>-part of the privacy guarantee. But, smaller subsampling rate means smaller minibatch size, which would result in bigger variance, so there is a trade-off here.</p>
<p>Finally we define the differentially private stochastic gradient descent (DP-SGD) with the Gaussian mechanism (Abadi-Chu-Goodfellow-McMahan-Mironov-Talwar-Zhang 2016), which is (7) with the noise specialised to Gaussian and an added clipping operation to bound to sensitivity of the query to a chosen <span class="math inline">\(C\)</span>:</p>
<p><span class="math display">\[\theta_{t} = \theta_{t - 1} - \alpha \left(n^{-1} \sum_{i \in \gamma} \nabla_\theta \ell(x_i; \theta) |_{\theta = \theta_{t - 1}}\right)_{\text{Clipped at }C / 2} + N(0, \sigma^2 C^2 I),\]</span></p>
<p>where</p>
<p><span class="math display">\[y_{\text{Clipped at } \alpha} := y / (1 \vee {\|y\|_2 \over \alpha})\]</span></p>
<p>is <span class="math inline">\(y\)</span> clipped to have norm at most <span class="math inline">\(\alpha\)</span>.</p>
<p>Note that the clipping in DP-SGD is much stronger than making the query have sensitivity <span class="math inline">\(C\)</span>. It makes the difference between the query results of two <em>arbitrary</em> inputs bounded by <span class="math inline">\(C\)</span>, rather than <em>neighbouring</em> inputs.</p>
<p>In <a href="/posts/2019-03-14-great-but-manageable-expectations.html">Part 2 of this post</a> we will use the tools developed above to discuss the privacy guarantee for DP-SGD, among other things.</p>
<h2 id="references">References</h2>
<ul>
<li>Abadi, Martín, Andy Chu, Ian Goodfellow, H. Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. “Deep Learning with Differential Privacy.” Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security - CCS’16, 2016, 308–18. <a href="https://doi.org/10.1145/2976749.2978318" class="uri">https://doi.org/10.1145/2976749.2978318</a>.</li>
<li>Dwork, Cynthia, and Aaron Roth. “The Algorithmic Foundations of Differential Privacy.” Foundations and Trends® in Theoretical Computer Science 9, no. 3–4 (2013): 211–407. <a href="https://doi.org/10.1561/0400000042" class="uri">https://doi.org/10.1561/0400000042</a>.</li>
<li>Dwork, Cynthia, Guy N. Rothblum, and Salil Vadhan. “Boosting and Differential Privacy.” In 2010 IEEE 51st Annual Symposium on Foundations of Computer Science, 51–60. Las Vegas, NV, USA: IEEE, 2010. <a href="https://doi.org/10.1109/FOCS.2010.12" class="uri">https://doi.org/10.1109/FOCS.2010.12</a>.</li>
<li>Shiva Prasad Kasiviswanathan, Homin K. Lee, Kobbi Nissim, Sofya Raskhodnikova, and Adam Smith. “What Can We Learn Privately?” In 46th Annual IEEE Symposium on Foundations of Computer Science (FOCS’05). Pittsburgh, PA, USA: IEEE, 2005. <a href="https://doi.org/10.1109/SFCS.2005.1" class="uri">https://doi.org/10.1109/SFCS.2005.1</a>.</li>
<li>Murtagh, Jack, and Salil Vadhan. “The Complexity of Computing the Optimal Composition of Differential Privacy.” In Theory of Cryptography, edited by Eyal Kushilevitz and Tal Malkin, 9562:157–75. Berlin, Heidelberg: Springer Berlin Heidelberg, 2016. <a href="https://doi.org/10.1007/978-3-662-49096-9_7" class="uri">https://doi.org/10.1007/978-3-662-49096-9_7</a>.</li>
<li>Ullman, Jonathan. “Solution to CS7880 Homework 1.”, 2017. <a href="http://www.ccs.neu.edu/home/jullman/cs7880s17/HW1sol.pdf" class="uri">http://www.ccs.neu.edu/home/jullman/cs7880s17/HW1sol.pdf</a></li>
<li>Vadhan, Salil. “The Complexity of Differential Privacy.” In Tutorials on the Foundations of Cryptography, edited by Yehuda Lindell, 347–450. Cham: Springer International Publishing, 2017. <a href="https://doi.org/10.1007/978-3-319-57048-8_7" class="uri">https://doi.org/10.1007/978-3-319-57048-8_7</a>.</li>
</ul>
</body>
</html>
|