posts/2019-03-14-great-but-manageable-expectations.org


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844

#+title: Great but Manageable Expectations

#+date: <2019-03-14>

This is Part 2 of a two-part blog post on differential privacy.
Continuing from [[file:2019-03-13-a-tail-of-two-densities.org][Part 1]], I discuss the Rényi differential privacy, corresponding to the
Rényi divergence, a study of the
[[https://en.wikipedia.org/wiki/Moment-generating_function][moment
generating functions]] of the divergence between probability measures in
order to derive the tail bounds.

Like in Part 1, I prove a composition theorem and a subsampling theorem.

I also attempt to reproduce a seemingly better moment bound for the
Gaussian mechanism with subsampling, with one intermediate step which I
am not able to prove.

After that I explain the Tensorflow implementation of differential
privacy in its
[[https://github.com/tensorflow/privacy/tree/master/privacy][Privacy]]
module, which focuses on the differentially private stochastic gradient
descent algorithm (DP-SGD).

Finally I use the results from both Part 1 and Part 2 to obtain some
privacy guarantees for composed subsampling queries in general, and for
DP-SGD in particular. I also compare these privacy guarantees.

/If you are confused by any notations, ask me or try
[[file:../notations.org][this]]./

** Rényi divergence and differential privacy
   :PROPERTIES:
   :CUSTOM_ID: rényi-divergence-and-differential-privacy
   :ID:       d1763dea-5e8f-4393-8f14-1d781147dcb5
   :END:
Recall in the proof of Gaussian mechanism privacy guarantee (Claim 8) we
used the Chernoff bound for the Gaussian noise. Why not use the Chernoff
bound for the divergence variable / privacy loss directly, since the
latter is closer to the core subject than the noise? This leads us to
the study of Rényi divergence.

So far we have seen several notions of divergence used in differential
privacy: the max divergence which is \(\epsilon\)-ind in disguise:

\[D_\infty(p || q) := \max_y \log {p(y) \over q(y)},\]

the \(\delta\)-approximate max divergence that defines the
\((\epsilon, \delta)\)-ind:

\[D_\infty^\delta(p || q) := \max_y \log{p(y) - \delta \over q(y)},\]

and the KL-divergence which is the expectation of the divergence
variable:

\[D(p || q) = \mathbb E L(p || q) = \int \log {p(y) \over q(y)} p(y) dy.\]

The Rényi divergence is an interpolation between the max divergence and
the KL-divergence, defined as the log moment generating function /
cumulants of the divergence variable:

\[D_\lambda(p || q) = (\lambda - 1)^{-1} \log \mathbb E \exp((\lambda - 1) L(p || q)) = (\lambda - 1)^{-1} \log \int {p(y)^\lambda \over q(y)^{\lambda - 1}} dy.\]

Indeed, when \(\lambda \to \infty\) we recover the max divergence, and
when \(\lambda \to 1\), by recognising \(D_\lambda\) as a derivative in
\(\lambda\) at \(\lambda = 1\), we recover the KL divergence. In this post
we only consider \(\lambda > 1\).

Using the Rényi divergence we may define:

*Definition (Rényi differential privacy)* (Mironov 2017). An mechanism
\(M\) is \((\lambda, \rho)\)/-Rényi differentially private/
(\((\lambda, \rho)\)-rdp) if for all \(x\) and \(x'\) with distance \(1\),

\[D_\lambda(M(x) || M(x')) \le \rho.\]

For convenience we also define two related notions, \(G_\lambda (f || g)\)
and \(\kappa_{f, g} (t)\) for \(\lambda > 1\), \(t > 0\) and positive
functions \(f\) and \(g\):

\[G_\lambda(f || g) = \int f(y)^{\lambda} g(y)^{1 - \lambda} dy; \qquad \kappa_{f, g} (t) = \log G_{t + 1}(f || g).\]

For probability densities \(p\) and \(q\), \(G_{t + 1}(p || q)\) and
\(\kappa_{p, q}(t)\) are the \(t\)th moment generating function and
[[https://en.wikipedia.org/wiki/Cumulant][cumulant]] of the divergence
variable \(L(p || q)\), and

\[D_\lambda(p || q) = (\lambda - 1)^{-1} \kappa_{p, q}(\lambda - 1).\]

In the following, whenever you see \(t\), think of it as \(\lambda - 1\).

*Example 1 (RDP for the Gaussian mechanism)*. Using the scaling and
translation invariance of \(L\) (6.1), we have that the divergence
variable for two Gaussians with the same variance is

\[L(N(\mu_1, \sigma^2 I) || N(\mu_2, \sigma^2 I)) \overset{d}{=} L(N(0, I) || N((\mu_2 - \mu_1) / \sigma, I)).\]

With this we get

\[D_\lambda(N(\mu_1, \sigma^2 I) || N(\mu_2, \sigma^2 I)) = {\lambda \|\mu_2 - \mu_1\|_2^2 \over 2 \sigma^2} = D_\lambda(N(\mu_2, \sigma^2 I) || N(\mu_1, \sigma^2 I)).\]

Again due to the scaling invariance of \(L\), we only need to consider \(f\)
with sensitivity \(1\), see the discussion under (6.1). The Gaussian
mechanism on query \(f\) is thus \((\lambda, \lambda / 2 \sigma^2)\)-rdp for
any \(\lambda > 1\).

From the example of Gaussian mechanism, we see that the relation between
\(\lambda\) and \(\rho\) is like that between \(\epsilon\) and \(\delta\). Given
\(\lambda\) (resp. \(\rho\)) and parameters like variance of the noise and
the sensitivity of the query, we can write \(\rho = \rho(\lambda)\) (resp.
\(\lambda = \lambda(\rho)\)).

Using the Chernoff bound (6.7), we can bound the divergence variable:

\[\mathbb P(L(p || q) \ge \epsilon) \le {\mathbb E \exp(t L(p || q)) \over \exp(t \epsilon))} =  \exp (\kappa_{p, q}(t) - \epsilon t). \qquad (7.7)\]

For a function \(f: I \to \mathbb R\), denote its
[[https://en.wikipedia.org/wiki/Legendre_transformation][Legendre
transform]] by

\[f^*(\epsilon) := \sup_{t \in I} (\epsilon t - f(t)).\]

By taking infimum on the RHS of (7.7), we obtain

*Claim 20*. Two probability densities \(p\) and \(q\) are
\((\epsilon, \exp(-\kappa_{p, q}^*(\epsilon)))\)-ind.

Given a mechanism \(M\), let \(\kappa_M(t)\) denote an upper bound for the
cumulant of its privacy loss:

\[\log \mathbb E \exp(t L(M(x) || M(x'))) \le \kappa_M(t), \qquad \forall x, x'\text{ with } d(x, x') = 1.\]

For example, we can set \(\kappa_M(t) = t \rho(t + 1)\). Using the same
argument we have the following:

*Claim 21*. If \(M\) is \((\lambda, \rho)\)-rdp, then

1. it is also \((\epsilon, \exp((\lambda - 1) (\rho - \epsilon)))\)-dp for
   any \(\epsilon \ge \rho\).
2. Alternatively, \(M\) is \((\epsilon, - \exp(\kappa_M^*(\epsilon)))\)-dp
   for any \(\epsilon > 0\).
3. Alternatively, for any \(0 < \delta \le 1\), \(M\) is
   \((\rho + (\lambda - 1)^{-1} \log \delta^{-1}, \delta)\)-dp.

*Example 2 (Gaussian mechanism)*. We can apply the above argument to the
Gaussian mechanism on query \(f\) and get:

\[\delta \le \inf_{\lambda > 1} \exp((\lambda - 1) ({\lambda \over 2 \sigma^2} - \epsilon))\]

By assuming \(\sigma^2 > (2 \epsilon)^{-1}\) we have that the infimum is
achieved when \(\lambda = (1 + 2 \epsilon / \sigma^2) / 2\) and

\[\delta \le \exp(- ((2 \sigma)^{-1} - \epsilon \sigma)^2 / 2)\]

which is the same result as (6.8), obtained using the Chernoff bound of
the noise.

However, as we will see later, compositions will yield different results
from those obtained from methods in
[[/posts/2019-03-13-a-tail-of-two-densities.html][Part 1]] when
considering Rényi dp.

*** Moment Composition
    :PROPERTIES:
    :CUSTOM_ID: moment-composition
    :ID:       d5e94e5a-236d-4c41-96a4-4a93341f249a
    :END:
*Claim 22 (Moment Composition Theorem)*. Let \(M\) be the adaptive
composition of \(M_{1 : k}\). Suppose for any \(y_{< i}\), \(M_i(y_{< i})\) is
\((\lambda, \rho)\)-rdp. Then \(M\) is \((\lambda, k\rho)\)-rdp.

*Proof*. Rather straightforward. As before let \(p_i\) and \(q_i\) be the
conditional laws of adpative composition of \(M_{1 : i}\) at \(x\) and \(x'\)
respectively, and \(p^i\) and \(q^i\) be the joint laws of \(M_{1 : i}\) at
\(x\) and \(x'\) respectively. Denote

\[D_i = \mathbb E \exp((\lambda - 1)\log {p^i(\xi_{1 : i}) \over q^i(\xi_{1 : i})})\]

Then

$$\begin{aligned}
D_i &= \mathbb E\mathbb E \left(\exp((\lambda - 1)\log {p_i(\xi_i | \xi_{< i}) \over q_i(\xi_i | \xi_{< i})}) \exp((\lambda - 1)\log {p^{i - 1}(\xi_{< i}) \over q^{i - 1}(\xi_{< i})}) \big| \xi_{< i}\right) \\
&= \mathbb E \mathbb E \left(\exp((\lambda - 1)\log {p_i(\xi_i | \xi_{< i}) \over q_i(\xi_i | \xi_{< i})}) | \xi_{< i}\right) \exp\left((\lambda - 1)\log {p^{i - 1}(\xi_{< i}) \over q^{i - 1}(\xi_{< i})}\right)\\
&\le \mathbb E \exp((\lambda - 1) \rho) \exp\left((\lambda - 1)\log {p^{i - 1}(\xi_{< i}) \over q^{i - 1}(\xi_{< i})}\right)\\
&= \exp((\lambda - 1) \rho) D_{i - 1}.
\end{aligned}$$

Applying this recursively we have

\[D_k \le \exp(k(\lambda - 1) \rho),\]

and so

\[(\lambda - 1)^{-1} \log \mathbb E \exp((\lambda - 1)\log {p^k(\xi_{1 : i}) \over q^k(\xi_{1 : i})}) = (\lambda - 1)^{-1} \log D_k \le k \rho.\]

Since this holds for all \(x\) and \(x'\), we are done. \(\square\)

This, together with the scaling property of the legendre transformation:

\[(k f)^*(x) = k f^*(x / k)\]

yields

*Claim 23*. The \(k\)-fold adaptive composition of
\((\lambda, \rho(\lambda))\)-rdp mechanisms is
\((\epsilon, \exp(- k \kappa^*(\epsilon / k)))\)-dp, where
\(\kappa(t) := t \rho(t + 1)\).

*Example 3 (Gaussian mechanism)*. We can apply the above claim to
Gaussian mechanism. Again, without loss of generality we assume
\(S_f = 1\). But let us do it manually to get the same results. If we
apply the Moment Composition Theorem to the an adaptive composition of
Gaussian mechanisms on the same query, then since each \(M_i\) is
\((\lambda, (2 \sigma^2)^{-1} \lambda)\)-rdp, the composition \(M\) is
\((\lambda, (2 \sigma^2)^{-1} k \lambda)\)-rdp. Processing this using the
Chernoff bound as in the previous example, we have

\[\delta = \exp(- ((2 \sigma / \sqrt k)^{-1} - \epsilon \sigma / \sqrt k)^2 / 2),\]

Substituting \(\sigma\) with \(\sigma / \sqrt k\) in (6.81), we conclude
that if

\[\sigma > \sqrt k \left(\epsilon^{-1} \sqrt{2 \log \delta^{-1}} + (2 \epsilon)^{- {1 \over 2}}\right)\]

then the composition \(M\) is \((\epsilon, \delta)\)-dp.

As we will see in the discussions at the end of this post, this result
is different from (and probably better than) the one obtained by using
the Advanced Composition Theorem (Claim 18).

*** Subsampling
    :PROPERTIES:
    :CUSTOM_ID: subsampling
    :ID:       25cd27ac-fcb6-462f-9a3b-da861124d7b2
    :END:
We also have a subsampling theorem for the Rényi dp.

*Claim 24*. Fix \(r \in [0, 1]\). Let \(m \le n\) be two nonnegative
integers with \(m = r n\). Let \(N\) be a \((\lambda, \rho)\)-rdp machanism on
\(X^m\). Let \(\mathcal I := \{J \subset [n]: |J| = m\}\) be the set of
subsets of \([n]\) of size \(m\). Define mechanism \(M\) on \(X^n\) by

\[M(x) = N(x_\gamma)\]

where \(\gamma\) is sampled uniformly from \(\mathcal I\). Then \(M\) is
\((\lambda, {1 \over \lambda - 1} \log (1 + r(e^{(\lambda - 1) \rho} - 1)))\)-rdp.

To prove Claim 24, we need a useful lemma:

*Claim 25*. Let \(p_{1 : n}\) and \(q_{1 : n}\) be nonnegative integers, and
\(\lambda > 1\). Then

\[{(\sum p_i)^\lambda \over (\sum q_i)^{\lambda - 1}} \le \sum_i {p_i^\lambda \over q_i^{\lambda - 1}}. \qquad (8)\]

*Proof*. Let

\[r(i) := p_i / P, \qquad u(i) := q_i / Q\]

where

\[P := \sum p_i, \qquad Q := \sum q_i\]

then \(r\) and \(u\) are probability mass functions. Plugging in
\(p_i = r(i) P\) and \(q_i = u(i) Q\) into the objective (8), it suffices to
show

\[1 \le \sum_i {r(i)^\lambda \over u(i)^{\lambda - 1}} = \mathbb E_{\xi \sim u} \left({r(\xi) \over u(\xi)}\right)^\lambda\]

This is true due to Jensen's Inequality:

\[\mathbb E_{\xi \sim u} \left({r(\xi) \over u(\xi)}\right)^\lambda \ge \left(\mathbb E_{\xi \sim u} {r(\xi) \over u(\xi)} \right)^\lambda = 1.\]

\(\square\)

*Proof of Claim 24*. Define \(\mathcal I\) as before.

Let \(p\) and \(q\) be the laws of \(M(x)\) and \(M(x')\) respectively. For any
\(I \in \mathcal I\), let \(p_I\) and \(q_I\) be the laws of \(N(x_I)\) and
\(N(x_I')\) respectively. Then we have

$$\begin{aligned}
p(y) &= n^{-1} \sum_{I \in \mathcal I} p_I(y) \\
q(y) &= n^{-1} \sum_{I \in \mathcal I} q_I(y),
\end{aligned}$$

where \(n = |\mathcal I|\).

The MGF of \(L(p || q)\) is thus

\[\mathbb E((\lambda - 1) L(p || q)) = n^{-1} \int {(\sum_I p_I(y))^\lambda \over (\sum_I q_I(y))^{\lambda - 1}} dy \le n^{-1} \sum_I \int {p_I(y)^\lambda \over q_I(y)^{\lambda - 1}} dy \qquad (9)\]

where in the last step we used Claim 25. As in the proof of Claim 19, we
divide \(\mathcal I\) into disjoint sets \(\mathcal I_\in\) and
\(\mathcal I_\notin\). Furthermore we denote by \(n_\in\) and \(n_\notin\)
their cardinalities. Then the right hand side of (9) becomes

\[n^{-1} \sum_{I \in \mathcal I_\in} \int {p_I(y)^\lambda \over q_I(y)^{\lambda - 1}} dy + n^{-1} \sum_{I \in \mathcal I_\notin} \int {p_I(y)^\lambda \over q_I(y)^{\lambda - 1}} dy\]

The summands in the first are the MGF of \(L(p_I || q_I)\), and the
summands in the second term are \(1\), so

$$\begin{aligned}
\mathbb E((\lambda - 1) L(p || q)) &\le n^{-1} \sum_{I \in \mathcal I_\in} \mathbb E \exp((\lambda - 1) L(p_I || q_I)) + (1 - r) \\
&\le n^{-1} \sum_{I \in \mathcal I_\in} \exp((\lambda - 1) D_\lambda(p_I || q_I)) + (1 - r) \\
&\le r \exp((\lambda - 1) \rho) + (1 - r).
\end{aligned}$$

Taking log and dividing by \((\lambda - 1)\) on both sides we have

\[D_\lambda(p || q) \le (\lambda - 1)^{-1} \log (1 + r(\exp((\lambda - 1) \rho) - 1)).\]

\(\square\)

As before, we can rewrite the conclusion of Lemma 6 using
\(1 + z \le e^z\) and obtain
\((\lambda, (\lambda - 1)^{-1} r (e^{(\lambda - 1) \rho} - 1))\)-rdp,
which further gives \((\lambda, \alpha^{-1} (e^\alpha - 1) r \rho)\)-rdp
(or \((\lambda, O(r \rho))\)-rdp) if \((\lambda - 1) \rho < \alpha\) for
some \(\alpha\).

It is not hard to see that the subsampling theorem in moment method,
even though similar to the results of that in the usual method, does not
help due to lack of an analogue of advanced composition theorem of the
moments.

*Example 4 (Gaussian mechanism)*. Applying the moment subsampling
theorem to the Gaussian mechanism, we obtain
\((\lambda, O(r \lambda / \sigma^2))\)-rdp for a subsampled Gaussian
mechanism with rate \(r\).
Abadi-Chu-Goodfellow-McMahan-Mironov-Talwar-Zhang 2016 (ACGMMTZ16 in the
following), however, gains an extra \(r\) in the bound given certain
assumptions.

** ACGMMTZ16
   :PROPERTIES:
   :CUSTOM_ID: acgmmtz16
   :ID:       8b85cce3-01ad-4404-80c0-b73076d183a9
   :END:
What follows is my understanding of this result. I call it a conjecture
because there is a gap which I am not able to reproduce their proof or
prove it myself. This does not mean the result is false. On the
contrary, I am inclined to believe it is true.

*Claim 26*. Assuming Conjecture 1 (see below) is true. For a subsampled
Gaussian mechanism with ratio \(r\), if \(r = O(\sigma^{-1})\) and
\(\lambda = O(\sigma^2)\), then we have
\((\lambda, O(r^2 \lambda / \sigma^2))\)-rdp.

Wait, why is there a conjecture? Well, I have tried but not been able to
prove the following, which is a hidden assumption in the original proof:

*Conjecture 1*. Let \(M\) be the Gaussian mechanism with subsampling rate
\(r\), and \(p\) and \(q\) be the laws of \(M(x)\) and \(M(x')\) respectively,
where \(d(x, x') = 1\). Then

\[D_\lambda (p || q) \le D_\lambda (r \mu_1 + (1 - r) \mu_0 || \mu_0)\]

where \(\mu_i = N(i, \sigma^2)\).

*Remark*. Conjecture 1 is heuristically reasonable. To see this, let us
use the notations \(p_I\) and \(q_I\) to be \(q\) and \(p\) conditioned on the
subsampling index \(I\), just like in the proof of the subsampling
theorems (Claim 19 and 24). Then for \(I \in \mathcal I_\in\),

\[D_\lambda(p_I || q_I) \le D_\lambda(\mu_0 || \mu_1),\]

and for \(I \in \mathcal I_\notin\),

\[D_\lambda(p_I || q_I) = 0 = D_\lambda(\mu_0 || \mu_0).\]

Since we are taking an average over \(\mathcal I\), of which
\(r |\mathcal I|\) are in \(\mathcal I_\in\) and \((1 - r) |\mathcal I|\) are
in \(\mathcal I_\notin\), (9.3) says "the inequalities carry over
averaging".

[[https://math.stackexchange.com/a/3152296/149540][A more general
version of Conjecture 1 has been proven false]]. The counter example for
the general version does not apply here, so it is still possible
Conjecture 1 is true.

Let \(p_\in\) (resp. \(q_\in\)) be the average of \(p_I\) (resp. \(q_I\)) over
\(I \in \mathcal I_\in\), and \(p_\notin\) (resp. \(q_\notin\)) be the average
of \(p_I\) (resp. \(q_I\)) over \(I \in \mathcal I_\notin\).

Immediately we have \(p_\notin = q_\notin\), hence

\[D_\lambda(p_\notin || q_\notin) = 0 = D_\lambda(\mu_0 || \mu_0). \qquad(9.7)\]

By Claim 25, we have

\[D_\lambda(p_\in || q_\in) \le D_\lambda (\mu_1 || \mu_0). \qquad(9.9) \]

So one way to prove Conjecture 1 is perhaps prove a more specialised
comparison theorem than the false conjecture:

Given (9.7) and (9.9), show that

\[D_\lambda(r p_\in + (1 - r) p_\notin || r q_\in + (1 - r) q_\notin) \le D_\lambda(r \mu_1 + (1 - r) \mu_0 || \mu_0).\]

[End of Remark]

#+begin_html
  <!---
  **Conjecture 1** \[Probably [FALSE](https://math.stackexchange.com/a/3152296/149540), to be removed\]. Let \(p_i\), \(q_i\), \(\mu_i\), \(\nu_i\) be
  probability densities on the same space for \(i = 1 : n\). If
  \(D_\lambda(p_i || q_i) \le D_\lambda(\mu_i || \nu_i)\) for all \(i\), then

  \[D_\lambda(n^{-1} \sum_i p_i || n^{-1} \sum_i q_i) \le D_\lambda(n^{-1} \sum_i \mu_i || n^{-1} \sum_i \nu_i).\]

  Basically, it is saying \"if for each \(i\), \(p_i\) and \(q_i\) are closer to
  each other than \(\mu_i\) and \(\nu_i\), then so are their averages over
  \(i\)\".
  So it is heuristically reasonable.
  But it is probably [**FALSE**](https://math.stackexchange.com/a/3152296/149540).
  This does not mean Claim 26 is false, as it might still be possible that Conjecture 2 (see below) is true.

  This conjecture is equivalent to its special case when \(n = 2\) by an induction argument
  (replacing one pair of densities at a time).
  -->
#+end_html

Recall the definition of \(G_\lambda\) under the definition of Rényi
differential privacy. The following Claim will be useful.

*Claim 27*. Let \(\lambda\) be a positive integer, then

\[G_\lambda(r p + (1 - r) q || q) = \sum_{k = 1 : \lambda} {\lambda \choose k} r^k (1 - r)^{\lambda - k} G_k(p || q).\]

*Proof*. Quite straightforward, by expanding the numerator
\((r p + (1 - r) q)^\lambda\) using binomial expansion. \(\square\)

*Proof of Claim 26*. By Conjecture 1, it suffices to prove the
following:

If \(r \le c_1 \sigma^{-1}\) and \(\lambda \le c_2 \sigma^2\) for some
positive constant \(c_1\) and \(c_2\), then there exists \(C = C(c_1, c_2)\)
such that \(G_\lambda (r \mu_1 + (1 - r) \mu_0 || \mu_0) \le C\) (since
\(O(r^2 \lambda^2 / \sigma^2) = O(1)\)).

*Remark in the proof*. Note that the choice of \(c_1\), \(c_2\) and the
function \(C(c_1, c_2)\) are important to the practicality and usefulness
of Claim 26.

#+begin_html
  <!---
  Part 1 can be derived using Conjecture 1, but since Conjecture 1 is probably false,
  let us rename Part 1 itself _Conjecture 2_, which needs to be verified by other means.
  We use the notations \(p_I\) and \(q_I\) to be \(q\) and \(p\) conditioned on
  the subsampling index \(I\), just like in the proof of the subsampling theorems (Claim 19 and 24).
  Then

  $$D_\lambda(q_I || p_I) = D_\lambda(p_I || q_I) 
  \begin{cases}
  \le D_\lambda(\mu_0 || \mu_1) = D_\lambda(\mu_1 || \mu_0), & I \in \mathcal I_\in\\
  = D_\lambda(\mu_0 || \mu_0) = D_\lambda(\mu_1 || \mu_1) = 0 & I \in \mathcal I_\notin
  \end{cases}$$

  Since \(p = |\mathcal I|^{-1} \sum_{I \in \mathcal I} p_I\) and
  \(q = |\mathcal I|^{-1} \sum_{I \in \mathcal I} q_I\) and
  \(|\mathcal I_\in| = r |\mathcal I|\), by Conjecture 1, we have Part 1.

  **Remark in the proof**. As we can see here, instead of trying to prove Conjecture 1,
  it suffices to prove a weaker version of it, by specialising on mixture of Gaussians,
  in order to have a Claim 26 without any conjectural assumptions.
  I have in fact posted the Conjecture on [Stackexchange](https://math.stackexchange.com/questions/3147963/an-inequality-related-to-the-renyi-divergence).

  Now let us verify Part 2.
  -->
#+end_html

Using Claim 27 and Example 1, we have

$$\begin{aligned}
G_\lambda(r \mu_1 + (1 - r) \mu_0 || \mu_0)) &= \sum_{j = 0 : \lambda} {\lambda \choose j} r^j (1 - r)^{\lambda - j} G_j(\mu_1 || \mu_0)\\
&=\sum_{j = 0 : \lambda} {\lambda \choose j} r^j (1 - r)^{\lambda - j} \exp(j (j - 1) / 2 \sigma^2). \qquad (9.5)
\end{aligned}$$

Denote by \(n = \lceil \sigma^2 \rceil\). It suffices to show

\[\sum_{j = 0 : n} {n \choose j} (c_1 n^{- 1 / 2})^j (1 - c_1 n^{- 1 / 2})^{n - j} \exp(c_2 j (j - 1) / 2 n) \le C\]

Note that we can discard the linear term \(- c_2 j / \sigma^2\) in the
exponential term since we want to bound the sum from above.

We examine the asymptotics of this sum when \(n\) is large, and treat the
sum as an approximation to an integration of a function
\(\phi: [0, 1] \to \mathbb R\). For \(j = x n\), where \(x \in (0, 1)\),
\(\phi\) is thus defined as (note we multiply the summand with \(n\) to
compensate the uniform measure on \(1, ..., n\):

$$\begin{aligned}
\phi_n(x) &:= n {n \choose j} (c_1 n^{- 1 / 2})^j (1 - c_1 n^{- 1 / 2})^{n - j} \exp(c_2 j^2 / 2 n) \\
&= n {n \choose x n} (c_1 n^{- 1 / 2})^{x n} (1 - c_1 n^{- 1 / 2})^{(1 - x) n} \exp(c_2 x^2 n / 2)
\end{aligned}$$

Using Stirling's approximation

\[n! \approx \sqrt{2 \pi n} n^n e^{- n},\]

we can approach the binomial coefficient:

\[{n \choose x n} \approx (\sqrt{2 \pi x (1 - x)} x^{x n} (1 - x)^{(1 - x) n})^{-1}.\]

We also approximate

\[(1 - c_1 n^{- 1 / 2})^{(1 - x) n} \approx \exp(- c_1 \sqrt{n} (1 - x)).\]

With these we have

\[\phi_n(x) \approx {1 \over \sqrt{2 \pi x (1 - x)}} \exp\left(- {1 \over 2} x n \log n + (x \log c_1 - x \log x - (1 - x) \log (1 - x) + {1 \over 2} c_2 x^2) n + {1 \over 2} \log n\right).\]

This vanishes as \(n \to \infty\), and since \(\phi_n(x)\) is bounded above
by the integrable function \({1 \over \sqrt{2 \pi x (1 - x)}}\) (c.f. the
arcsine law), and below by \(0\), we may invoke the dominant convergence
theorem and exchange the integral with the limit and get

$$\begin{aligned}
\lim_{n \to \infty} &G_n (r \mu_1 + (1 - r) \mu_0 || \mu_0)) \\
&\le \lim_{n \to \infty} \int \phi_n(x) dx = \int \lim_{n \to \infty} \phi_n(x) dx = 0.
\end{aligned}$$

Thus we have that the generating function of the divergence variable
\(L(r \mu_1 + (1 - r) \mu_0 || \mu_0)\) is bounded.

Can this be true for better orders

\[r \le c_1 \sigma^{- d_r},\qquad \lambda \le c_2 \sigma^{d_\lambda}\]

for some \(d_r \in (0, 1]\) and \(d_\lambda \in [2, \infty)\)? If we follow
the same approximation using these exponents, then letting
\(n = c_2 \sigma^{d_\lambda}\),

$$\begin{aligned}
{n \choose j} &r^j (1 - r)^{n - j} G_j(\mu_0 || \mu_1) \le \phi_n(x) \\
&\approx {1 \over \sqrt{2 \pi x (1 - x)}} \exp\left({1 \over 2} c_2^{2 \over d_\lambda} x^2 n^{2 - {2 \over d_\lambda}} - {d_r \over 2} x n \log n + (x \log c_1 - x \log x - (1 - x) \log (1 - x)) n + {1 \over 2} \log n\right).
\end{aligned}$$

So we see that to keep the divergence moments bounded it is possible to
have any \(r = O(\sigma^{- d_r})\) for \(d_r \in (0, 1)\), but relaxing
\(\lambda\) may not be safe.

If we relax \(r\), then we get

\[G_\lambda(r \mu_1 + (1 - r) \mu_0 || \mu_0) = O(r^{2 / d_r} \lambda^2 \sigma^{-2}) = O(1).\]

Note that now the constant \(C\) depends on \(d_r\) as well. Numerical
experiments seem to suggest that \(C\) can increase quite rapidly as \(d_r\)
decreases from \(1\). \(\square\)

In the following for consistency we retain \(k\) as the number of epochs,
and use \(T := k / r\) to denote the number of compositions / steps /
minibatches. With Claim 26 we have:

*Claim 28*. Assuming Conjecture 1 is true. Let \(\epsilon, c_1, c_2 > 0\),
\(r \le c_1 \sigma^{-1}\),
\(T = {c_2 \over 2 C(c_1, c_2)} \epsilon \sigma^2\). then the DP-SGD with
subsampling rate \(r\), and \(T\) steps is \((\epsilon, \delta)\)-dp for

\[\delta = \exp(- {1 \over 2} c_2 \sigma^2 \epsilon).\]

In other words, for

\[\sigma \ge \sqrt{2 c_2^{-1}} \epsilon^{- {1 \over 2}} \sqrt{\log \delta^{-1}},\]

we can achieve \((\epsilon, \delta)\)-dp.

*Proof*. By Claim 26 and the Moment Composition Theorem (Claim 22), for
\(\lambda = c_2 \sigma^2\), substituting
\(T = {c_2 \over 2 C(c_1, c_2)} \epsilon \sigma^2\), we have

\[\mathbb P(L(p || q) \ge \epsilon) \le \exp(k C(c_1, c_2) - \lambda \epsilon) = \exp\left(- {1 \over 2} c_2 \sigma^2 \epsilon\right).\]

\(\square\)

*Remark*. Claim 28 is my understanding / version of Theorem 1 in
[ACGMMTZ16], by using the same proof technique. Here I quote the
original version of theorem with notions and notations altered for
consistency with this post:

#+begin_quote
  There exists constants \(c_1', c_2' > 0\) so that for any
  \(\epsilon < c_1' r^2 T\), DP-SGD is \((\epsilon, \delta)\)-differentially
  private for any \(\delta > 0\) if we choose
#+end_quote

\[\sigma \ge c_2' {r \sqrt{T \log (1 / \delta)} \over \epsilon}. \qquad (10)\]

I am however unable to reproduce this version, assuming Conjecture 1 is
true, for the following reasons:

1. In the proof in the paper, we have \(\epsilon = c_1' r^2 T\) instead of
   "less than" in the statement of the Theorem. If we change it to
   \(\epsilon < c_1' r^2 T\) then the direction of the inequality becomes
   opposite to the direction we want to prove:
   \[\exp(k C(c_1, c_2) - \lambda \epsilon) \ge ...\]

2. The condition \(r = O(\sigma^{-1})\) of Claim 26 whose result is used
   in the proof of this theorem is not mentioned in the statement of the
   proof. The implication is that (10) becomes an ill-formed condition
   as the right hand side also depends on \(\sigma\).

** Tensorflow implementation
   :PROPERTIES:
   :CUSTOM_ID: tensorflow-implementation
   :ID:       f856ad67-4f78-46b4-8c98-fda07a0dc670
   :END:
The DP-SGD is implemented in
[[https://github.com/tensorflow/privacy][TensorFlow Privacy]]. In the
following I discuss the package in the current state (2019-03-11). It is
divided into two parts:
[[https://github.com/tensorflow/privacy/tree/master/privacy/optimizers][=optimizers=]]
which implements the actual differentially private algorithms, and
[[https://github.com/tensorflow/privacy/tree/master/privacy/analysis][=analysis=]]
which computes the privacy guarantee.

The =analysis= part implements a privacy ledger that "keeps a record of
all queries executed over a given dataset for the purpose of computing
privacy guarantees". On the other hand, all the computation is done in
[[https://github.com/tensorflow/privacy/blob/7e2d796bdee9b60dce21a82a397eefda35b0ac10/privacy/analysis/rdp_accountant.py][=rdp_accountant.py=]].
At this moment, =rdp_accountant.py= only implements the computation of
the privacy guarantees for DP-SGD with Gaussian mechanism. In the
following I will briefly explain the code in this file.

Some notational correspondences: their =alpha= is our \(\lambda\), their
=q= is our \(r\), their =A_alpha= (in the comments) is our
\(\kappa_{r N(1, \sigma^2) + (1 - r) N(0, \sigma^2)} (\lambda - 1)\), at
least when \(\lambda\) is an integer.

- The function =_compute_log_a= presumably computes the cumulants
  \(\kappa_{r N(1, \sigma^2) + (1 - r) N(0, \sigma^2), N(0, \sigma^2)}(\lambda - 1)\).
  It calls =_compute_log_a_int= or =_compute_log_a_frac= depending on
  whether \(\lambda\) is an integer.
- The function =_compute_log_a_int= computes the cumulant using (9.5).
- When \(\lambda\) is not an integer, we can't use (9.5). I have yet to
  decode how =_compute_log_a_frac= computes the cumulant (or an upper
  bound of it) in this case
- The function =_compute_delta= computes \(\delta\)s for a list of
  \(\lambda\)s and \(\kappa\)s using Item 1 of Claim 25 and return the
  smallest one, and the function =_compute_epsilon= computes epsilon
  uses Item 3 in Claim 25 in the same way.

In =optimizers=, among other things, the DP-SGD with Gaussian mechanism
is implemented in =dp_optimizer.py= and =gaussian_query.py=. See the
definition of =DPGradientDescentGaussianOptimizer= in =dp_optimizer.py=
and trace the calls therein.

At this moment, the privacy guarantee computation part and the optimizer
part are separated, with =rdp_accountant.py= called in
=compute_dp_sgd_privacy.py= with user-supplied parameters. I think this
is due to the lack of implementation in =rdp_accountant.py= of any
non-DPSGD-with-Gaussian privacy guarantee computation. There is already
[[https://github.com/tensorflow/privacy/issues/23][an issue on this]],
so hopefully it won't be long before the privacy guarantees can be
automatically computed given a DP-SGD instance.

** Comparison among different methods
   :PROPERTIES:
   :CUSTOM_ID: comparison-among-different-methods
   :ID:       30502f53-d9ba-48ea-868a-dd4db995a6d4
   :END:
So far we have seen three routes to compute the privacy guarantees for
DP-SGD with the Gaussian mechanism:

1. Claim 9 (single Gaussian mechanism privacy guarantee) -> Claim 19
   (Subsampling theorem) -> Claim 18 (Advanced Adaptive Composition
   Theorem)
2. Example 1 (RDP for the Gaussian mechanism) -> Claim 22 (Moment
   Composition Theorem) -> Example 3 (Moment composition applied to the
   Gaussian mechanism)
3. Claim 26 (RDP for Gaussian mechanism with specific magnitudes for
   subsampling rate) -> Claim 28 (Moment Composition Theorem and
   translation to conventional DP)

Which one is the best?

To make fair comparison, we may use one parameter as the metric and set
all others to be the same. For example, we can

1. Given the same \(\epsilon\), \(r\) (in Route 1 and 3), \(k\), \(\sigma\),
   compare the \(\delta\)s
2. Given the same \(\epsilon\), \(r\) (in Route 1 and 3), \(k\), \(\delta\),
   compare the \(\sigma\)s
3. Given the same \(\delta\), \(r\) (in Route 1 and 3), \(k\), \(\sigma\),
   compare the \(\epsilon\)s.

I find that the first one, where \(\delta\) is used as a metric, the best.
This is because we have the tightest bounds and the cleanest formula
when comparing the \(\delta\). For example, the Azuma and Chernoff bounds
are both expressed as a bound for \(\delta\). On the other hand, the
inversion of these bounds either requires a cost in the tightness (Claim
9, bounds on \(\sigma\)) or in the complexity of the formula (Claim 16
Advanced Adaptive Composition Theorem, bounds on \(\epsilon\)).

So if we use \(\sigma\) or \(\epsilon\) as a metric, either we get a less
fair comparison, or have to use a much more complicated formula as the
bounds.

Let us first compare Route 1 and Route 2 without specialising to the
Gaussian mechanism.

*Warning*. What follows is a bit messy.

Suppose each mechanism \(N_i\) satisfies
\((\epsilon', \delta(\epsilon'))\)-dp. Let
\(\tilde \epsilon := \log (1 + r (e^{\epsilon'} - 1))\), then we have the
subsampled mechanism \(M_i(x) = N_i(x_\gamma)\) is
\((\tilde \epsilon, r \tilde \delta(\tilde \epsilon))\)-dp, where

\[\tilde \delta(\tilde \epsilon) = \delta(\log (r^{-1} (\exp(\tilde \epsilon) - 1) + 1))\]

Using the Azuma bound in the proof of Advanced Adaptive Composition
Theorem (6.99):

\[\mathbb P(L(p^k || q^k) \ge \epsilon) \le \exp(- {(\epsilon - r^{-1} k a(\tilde\epsilon))^2 \over 2 r^{-1} k (\tilde\epsilon + a(\tilde\epsilon))^2}).\]

So we have the final bound for Route 1:

\[\delta_1(\epsilon) = \min_{\tilde \epsilon: \epsilon > r^{-1} k a(\tilde \epsilon)} \exp(- {(\epsilon - r^{-1} k a(\tilde\epsilon))^2 \over 2 r^{-1} k (\tilde\epsilon + a(\tilde\epsilon))^2}) + k \tilde \delta(\tilde \epsilon).\]

As for Route 2, since we do not gain anything from subsampling in RDP,
we do not subsample at all.

By Claim 23, we have the bound for Route 2:

\[\delta_2(\epsilon) = \exp(- k \kappa^* (\epsilon / k)).\]

On one hand, one can compare \(\delta_1\) and \(\delta_2\) with numerical
experiments. On the other hand, if we further specify
\(\delta(\epsilon')\) in Route 1 as the Chernoff bound for the cumulants
of divergence variable, i.e.

\[\delta(\epsilon') = \exp(- \kappa^* (\epsilon')),\]

we have

\[\delta_1 (\epsilon) = \min_{\tilde \epsilon: a(\tilde \epsilon) < r k^{-1} \epsilon} \exp(- {(\epsilon - r^{-1} k a(\tilde\epsilon))^2 \over 2 r^{-1} k (\tilde\epsilon + a(\tilde\epsilon))^2}) + k \exp(- \kappa^* (b(\tilde\epsilon))),\]

where

\[b(\tilde \epsilon) := \log (r^{-1} (\exp(\tilde \epsilon) - 1) + 1) \le r^{-1} \tilde\epsilon.\]

We note that since
\(a(\tilde \epsilon) = \tilde\epsilon(e^{\tilde \epsilon} - 1) 1_{\tilde\epsilon < \log 2} + \tilde\epsilon 1_{\tilde\epsilon \ge \log 2}\),
we may compare the two cases separately.

Note that we have \(\kappa^*\) is a monotonously increasing function,
therefore

\[\kappa^* (b(\tilde\epsilon)) \le \kappa^*(r^{-1} \tilde\epsilon).\]

So for \(\tilde \epsilon \ge \log 2\), we have

\[k \exp(- \kappa^*(b(\tilde\epsilon))) \ge k \exp(- \kappa^*(r^{-1} \tilde \epsilon)) \ge k \exp(- \kappa^*(k^{-1} \epsilon)) \ge \delta_2(\epsilon).\]

For \(\tilde\epsilon < \log 2\), it is harder to compare, as now

\[k \exp(- \kappa^*(b(\tilde\epsilon))) \ge k \exp(- \kappa^*(\epsilon / \sqrt{r k})).\]

It is tempting to believe that this should also be greater than
\(\delta_2(\epsilon)\). But I can not say for sure. At least in the
special case of Gaussian, we have

\[k \exp(- \kappa^*(\epsilon / \sqrt{r k})) = k \exp(- (\sigma \sqrt{\epsilon / k r} - (2 \sigma)^{-1})^2) \ge \exp(- k ({\sigma \epsilon \over k} - (2 \sigma)^{-1})^2) = \delta_2(\epsilon)\]

when \(\epsilon\) is sufficiently small. However we still need to consider
the case where \(\epsilon\) is not too small. But overall it seems most
likely Route 2 is superior than Route 1.

So let us compare Route 2 with Route 3:

Given the condition to obtain the Chernoff bound

\[{\sigma \epsilon \over k} > (2 \sigma)^{-1}\]

we have

\[\delta_2(\epsilon) > \exp(- k (\sigma \epsilon / k)^2) = \exp(- \sigma^2 \epsilon^2 / k).\]

For this to achieve the same bound

\[\delta_3(\epsilon) = \exp\left(- {1 \over 2} c_2 \sigma^2 \epsilon\right)\]

we need \(k < {2 \epsilon \over c_2}\). This is only possible if \(c_2\) is
small or \(\epsilon\) is large, since \(k\) is a positive integer.

So taking at face value, Route 3 seems to achieve the best results.
However, it also has some similar implicit conditions that need to be
satisfied: First \(T\) needs to be at least \(1\), meaning

\[{c_2 \over C(c_1, c_2)} \epsilon \sigma^2 \ge 1.\]

Second, \(k\) needs to be at least \(1\) as well, i.e.

\[k = r T \ge {c_1 c_2 \over C(c_1, c_2)} \epsilon \sigma \ge 1.\]

Both conditions rely on the magnitudes of \(\epsilon\), \(\sigma\), \(c_1\),
\(c_2\), and the rate of growth of \(C(c_1, c_2)\). The biggest problem in
this list is the last, because if we know how fast \(C\) grows then we'll
have a better idea what are the constraints for the parameters to
achieve the result in Route 3.

** Further questions
   :PROPERTIES:
   :CUSTOM_ID: further-questions
   :ID:       277e8a8c-cc34-4ba9-84fb-d8950f6dc9de
   :END:
Here is a list of what I think may be interesting topics or potential
problems to look at, with no guarantee that they are all awesome
untouched research problems:

1. Prove Conjecture 1
2. Find a theoretically definitive answer whether the methods in Part 1
   or Part 2 yield better privacy guarantees.
3. Study the non-Gaussian cases, general or specific. Let \(p\) be some
   probability density, what is the tail bound of
   \(L(p(y) || p(y + \alpha))\) for \(|\alpha| \le 1\)? Can you find
   anything better than Gaussian? For a start, perhaps the nice tables
   of Rényi divergence in Gil-Alajaji-Linder 2013 may be useful?
4. Find out how useful Claim 26 is. Perhaps start with computing the
   constant \(C\) nemerically.
5. Help with [[https://github.com/tensorflow/privacy/issues/23][the
   aforementioned issue]] in the Tensorflow privacy package.

** References
   :PROPERTIES:
   :CUSTOM_ID: references
   :ID:       708aa715-dc2c-49ac-b7bb-f85ac168d8b3
   :END:

- Abadi, Martín, Andy Chu, Ian Goodfellow, H. Brendan McMahan, Ilya
  Mironov, Kunal Talwar, and Li Zhang. "Deep Learning with Differential
  Privacy." Proceedings of the 2016 ACM SIGSAC Conference on Computer
  and Communications Security - CCS'16, 2016, 308--18.
  [[https://doi.org/10.1145/2976749.2978318]].
- Erven, Tim van, and Peter Harremoës. "R\'enyi Divergence and
  Kullback-Leibler Divergence." IEEE Transactions on Information Theory
  60, no. 7 (July 2014): 3797--3820.
  [[https://doi.org/10.1109/TIT.2014.2320500]].
- Gil, M., F. Alajaji, and T. Linder. "Rényi Divergence Measures for
  Commonly Used Univariate Continuous Distributions." Information
  Sciences 249 (November 2013): 124--31.
  [[https://doi.org/10.1016/j.ins.2013.06.018]].
- Mironov, Ilya. "Renyi Differential Privacy." 2017 IEEE 30th Computer
  Security Foundations Symposium (CSF), August 2017, 263--75.
  [[https://doi.org/10.1109/CSF.2017.11]].