Diffstat (limited to 'posts')
-rw-r--r--  posts/2017-08-07-mathematical_bazaar.org               |  4 ++++
-rw-r--r--  posts/2018-04-10-update-open-research.org              |  3 +++
-rw-r--r--  posts/2018-12-02-lime-shapley.org                      |  6 ++++++
-rw-r--r--  posts/2019-01-03-discriminant-analysis.org             |  9 +++++++++
-rw-r--r--  posts/2019-02-14-raise-your-elbo.org                   | 18 ++++++++++++++++++
-rw-r--r--  posts/2019-03-13-a-tail-of-two-densities.org           |  8 ++++++++
-rw-r--r--  posts/2019-03-14-great-but-manageable-expectations.org | 10 +++++++++-
7 files changed, 57 insertions(+), 1 deletion(-)
diff --git a/posts/2017-08-07-mathematical_bazaar.org b/posts/2017-08-07-mathematical_bazaar.org
index 64bf335..11aa727 100644
--- a/posts/2017-08-07-mathematical_bazaar.org
+++ b/posts/2017-08-07-mathematical_bazaar.org
@@ -23,6 +23,7 @@ Before I start I should point out that
 ** problems of academia
 :PROPERTIES:
 :CUSTOM_ID: problems-of-academia
+:ID: 2d61ed4d-abb3-415a-b21a-e051efe18499
 :END:
 Open source projects are characterised by publicly available source
 codes as well as open invitations for public collaborations, whereas
@@ -110,6 +111,7 @@ source collaboration works on Github.
 ** open source collaborations on Github
 :PROPERTIES:
 :CUSTOM_ID: open-source-collaborations-on-github
+:ID: 7d634bb9-9e64-4e71-a84f-f09b18a6fa1d
 :END:
 On [[https://github.com][Github]], every project is publicly available
 in a repository (we do not consider private repos). The owner can update
@@ -153,6 +155,7 @@ mention them by =@someone=.
 ** open research in mathematics
 :PROPERTIES:
 :CUSTOM_ID: open-research-in-mathematics
+:ID: 70f64ecd-f261-4ce0-9e1d-39713cd789d6
 :END:
 All this points to a promising direction of open research. A maths
 project may have a wiki / collection of notes, the paper being written,
@@ -200,6 +203,7 @@ advantages:
 ** related readings
 :PROPERTIES:
 :CUSTOM_ID: related-readings
+:ID: d2e4c2b9-a325-42a8-898c-bdf2588a6b0f
 :END:
 
 - [[http://www.catb.org/esr/writings/cathedral-bazaar/][The Cathedral
diff --git a/posts/2018-04-10-update-open-research.org b/posts/2018-04-10-update-open-research.org
index 4b078d5..7cc1781 100644
--- a/posts/2018-04-10-update-open-research.org
+++ b/posts/2018-04-10-update-open-research.org
@@ -45,6 +45,7 @@ research.
 *** Freedom and community
 :PROPERTIES:
 :CUSTOM_ID: freedom-and-community
+:ID: f0ec8170-e86e-49c6-bd34-3904f31317eb
 :END:
 Ideals matter. Stallman's struggles stemmed from the frustration of
 denied request of source code (a frustration I shared in academia except
@@ -130,6 +131,7 @@ community:
 *** Tools for open research
 :PROPERTIES:
 :CUSTOM_ID: tools-for-open-research
+:ID: 07f852e1-c95d-407f-972b-8395ac7903a4
 :END:
 The open research workshop revolved around how to lead academia towards
 a more open culture. There were discussions on open research tools,
@@ -174,6 +176,7 @@ afterwards I was also made aware by more of them, like the following:
 *** An anecdote from the workshop
 :PROPERTIES:
 :CUSTOM_ID: an-anecdote-from-the-workshop
+:ID: 857899dd-f3f8-4eac-a14a-779604066da4
 :END:
 In a conversation during the workshop, one of the participants called
 open science "normal science", because reproducibility, open access,
diff --git a/posts/2018-12-02-lime-shapley.org b/posts/2018-12-02-lime-shapley.org
index 05ef4ee..0eaacf2 100644
--- a/posts/2018-12-02-lime-shapley.org
+++ b/posts/2018-12-02-lime-shapley.org
@@ -17,6 +17,7 @@ under CC BY-SA and GNU FDL./
 ** Shapley values
 :PROPERTIES:
 :CUSTOM_ID: shapley-values
+:ID: 1f4de00d-669e-421a-bd48-f4f1da8400f3
 :END:
 
 A coalitional game $(v, N)$ of $n$ players involves
@@ -69,6 +70,7 @@ together with $v(\emptyset) = 0$ defines the Shapley values.
 ** LIME
 :PROPERTIES:
 :CUSTOM_ID: lime
+:ID: b34aabbc-2f50-4969-bf7b-87678fd577e6
 :END:
 LIME (Ribeiro et. al. 2016) is a model that offers a way to explain
 feature contributions of supervised learning models locally.
@@ -139,6 +141,7 @@ of dimension $n$.
 ** Shapley values and LIME
 :PROPERTIES:
 :CUSTOM_ID: shapley-values-and-lime
+:ID: bb960a6a-be79-44ff-968f-000990673738
 :END:
 The connection between the Shapley values and LIME is noted in
 Lundberg-Lee (2017), but the underlying connection goes back to 1988
@@ -243,6 +246,7 @@ Plugging this back into (6) we get the desired result. $\square$
 ** SHAP
 :PROPERTIES:
 :CUSTOM_ID: shap
+:ID: 11af7fd2-a4f9-42da-a275-5d15cde0ca48
 :END:
 The paper that coined the term "SHAP values" (Lundberg-Lee 2017) is not
 clear in its definition of the "SHAP values" and its relation to LIME,
@@ -304,6 +308,7 @@ general, how do we evaluate models of interpretation?
 ** Evaluating SHAP
 :PROPERTIES:
 :CUSTOM_ID: evaluating-shap
+:ID: a7249300-814f-499e-b808-c66a17a4d32e
 :END:
 The quest of the SHAP paper can be decoupled into two independent
 components: showing the niceties of Shapley values and choosing the
@@ -332,6 +337,7 @@ justifications of such choices.
 ** References
 :PROPERTIES:
 :CUSTOM_ID: references
+:ID: 7ba9546e-2485-43e4-be48-16762fc447b2
 :END:
 
 - Charnes, A., B. Golany, M. Keane, and J. Rousseau. "Extremal Principle
diff --git a/posts/2019-01-03-discriminant-analysis.org b/posts/2019-01-03-discriminant-analysis.org
index 34c16bf..a0ada73 100644
--- a/posts/2019-01-03-discriminant-analysis.org
+++ b/posts/2019-01-03-discriminant-analysis.org
@@ -23,6 +23,7 @@ under CC BY-SA and GNU FDL./
 ** Theory
 :PROPERTIES:
 :CUSTOM_ID: theory
+:ID: 69be3baf-7f60-42f2-9184-ee8840eea554
 :END:
 Quadratic discriminant analysis (QDA) is a classical classification
 algorithm. It assumes that the data is generated by Gaussian
@@ -69,6 +70,7 @@ be independent.
 *** QDA
 :PROPERTIES:
 :CUSTOM_ID: qda
+:ID: f6e95892-01cf-4569-b01e-22ed238d0577
 :END:
 
 We look at QDA.
@@ -94,6 +96,7 @@ sample for each class.
 *** Vanilla LDA
 :PROPERTIES:
 :CUSTOM_ID: vanilla-lda
+:ID: 5a6ca0ca-f385-4054-9b19-9cac69b1a59a
 :END:
 
 Now let us look at LDA.
@@ -127,6 +130,7 @@ nearest neighbour classifier.
 *** Nearest neighbour classifier
 :PROPERTIES:
 :CUSTOM_ID: nearest-neighbour-classifier
+:ID: 8880764c-6fbe-4023-97dd-9711c7c50ea9
 :END:
 More specifically, we want to transform the first term of (0) to a norm
 to get a classifier based on nearest neighbour modulo $\log \pi_i$:
@@ -160,6 +164,7 @@ $A \mu_i$ (again, modulo $\log \pi_i$) and label the input with $i$.
 *** Dimensionality reduction
 :PROPERTIES:
 :CUSTOM_ID: dimensionality-reduction
+:ID: 70e1afc1-9c45-4a35-a842-48573e077b36
 :END:
 We can further simplify the prediction by dimensionality reduction.
 Assume $n_c \le n$. Then the centroid spans an affine space of dimension
@@ -195,6 +200,7 @@ words, the prediction does not change regardless of =n_components=.
 *** Fisher discriminant analysis
 :PROPERTIES:
 :CUSTOM_ID: fisher-discriminant-analysis
+:ID: 05ff25da-8c52-4f20-a0ac-4422f19e10ce
 :END:
 The Fisher discriminant analysis involves finding an $n$-dimensional
 vector $a$ that maximises between-class covariance with respect to
@@ -232,6 +238,7 @@ $a = c V_x D_x^{-1} V_m$ with $p = 1$.
 *** Linear model
 :PROPERTIES:
 :CUSTOM_ID: linear-model
+:ID: feb827b6-0064-4192-b96b-86a942c8839e
 :END:
 The model is called linear discriminant analysis because it is a linear
 model. To see this, let $B = V_m^T D_x^{-1} V_x^T$ be the matrix of
@@ -256,6 +263,7 @@ This is how scikit-learn implements LDA, by inheriting from
 ** Implementation
 :PROPERTIES:
 :CUSTOM_ID: implementation
+:ID: b567283c-20ee-41a8-8216-7392066a5ac5
 :END:
 This is where things get interesting. How do I validate my
 understanding of the theory? By implementing and testing the algorithm.
@@ -279,6 +287,7 @@ The result is
 *** Fun facts about LDA
 :PROPERTIES:
 :CUSTOM_ID: fun-facts-about-lda
+:ID: f1d47f43-27f6-49dd-bd0d-2e685c38e241
 :END:
 One property that can be used to test the LDA implementation is the
 fact that the scatter matrix $B(X - \bar x)^T (X - \bar X) B^T$ of the
diff --git a/posts/2019-02-14-raise-your-elbo.org b/posts/2019-02-14-raise-your-elbo.org
index 9e15552..f0de7d1 100644
--- a/posts/2019-02-14-raise-your-elbo.org
+++ b/posts/2019-02-14-raise-your-elbo.org
@@ -47,6 +47,7 @@ under CC BY-SA and GNU FDL./
 ** KL divergence and ELBO
 :PROPERTIES:
 :CUSTOM_ID: kl-divergence-and-elbo
+:ID: 2bb0d405-f6b4-483f-9f2d-c0e945faa3ac
 :END:
 Let $p$ and $q$ be two probability measures. The Kullback-Leibler (KL)
 divergence is defined as
@@ -120,6 +121,7 @@ Bayesian version.
 ** EM
 :PROPERTIES:
 :CUSTOM_ID: em
+:ID: 6d694b38-56c2-4e10-8a1f-1f82e309073f
 :END:
 
 To illustrate the EM algorithms, we first define the mixture model.
@@ -198,6 +200,7 @@ model is:
 *** GMM
 :PROPERTIES:
 :CUSTOM_ID: gmm
+:ID: 5d5265f6-c2b9-42f1-a4a1-0d87417f0b02
 :END:
 
 Gaussian mixture model (GMM) is an example of mixture models.
@@ -240,6 +243,7 @@ $\epsilon I$ is called elliptical k-means algorithm.
 *** SMM
 :PROPERTIES:
 :CUSTOM_ID: smm
+:ID: f4b3a462-8ae7-44f2-813c-58b007eaa047
 :END:
 As a transition to the next models to study, let us consider a simpler
 mixture model obtained by making one modification to GMM: change
@@ -275,6 +279,7 @@ Dirichlet allocation (LDA), not to be confused with the other LDA
 *** pLSA
 :PROPERTIES:
 :CUSTOM_ID: plsa
+:ID: d4f58158-dcb6-4ba1-a9e2-bf53bff6012e
 :END:
 The pLSA model (Hoffman 2000) is a mixture model, where the dataset is
 now pairs $(d_i, x_i)_{i = 1 : m}$. In natural language processing, $x$
@@ -294,6 +299,7 @@ corresponds to type 2.
 **** pLSA1
 :PROPERTIES:
 :CUSTOM_ID: plsa1
+:ID: 969f470e-5bbe-464e-a3b7-f996c8f04de3
 :END:
 The pLSA1 model (Hoffman 2000) is basically SMM with $x_i$ substituted
 with $(d_i, x_i)$, which conditioned on $z_i$ are independently
@@ -340,6 +346,7 @@ dimensional embeddings $D_{u, \cdot}$ and $X_{w, \cdot}$.
 **** pLSA2
 :PROPERTIES:
 :CUSTOM_ID: plsa2
+:ID: eef3249a-c45d-4a07-876f-68b2a2e957e5
 :END:
 Let us turn to pLSA2 (Hoffman 2004), corresponding to (2.92). We rewrite
 it as
@@ -392,6 +399,7 @@ $$\begin{aligned}
 *** HMM
 :PROPERTIES:
 :CUSTOM_ID: hmm
+:ID: 16d00eda-7136-49f5-8427-c775c7a91317
 :END:
 The hidden markov model (HMM) is a sequential version of SMM, in the
 same sense that recurrent neural networks are sequential versions of
@@ -518,6 +526,7 @@ as ${(7) \over (8)}$ and ${(9) \over (8)}$ respectively.
 ** Fully Bayesian EM / MFA
 :PROPERTIES:
 :CUSTOM_ID: fully-bayesian-em-mfa
+:ID: 77f1d7ae-3785-45d4-b88f-18478e41f3b9
 :END:
 
 Let us now venture into the realm of full Bayesian.
@@ -567,6 +576,7 @@ e.g. Section 10.1 of Bishop 2006.
 *** Application to mixture models
 :PROPERTIES:
 :CUSTOM_ID: application-to-mixture-models
+:ID: 52bf6025-1180-44dc-8272-e6af6e228bf3
 :END:
 *Definition (Fully Bayesian mixture model)*. The relations between
 $\pi$, $\eta$, $x$, $z$ are the same as in the definition of mixture
@@ -658,6 +668,7 @@ until convergence.
 *** Fully Bayesian GMM
 :PROPERTIES:
 :CUSTOM_ID: fully-bayesian-gmm
+:ID: 814289c0-2527-42a0-914b-d64ad62ecd05
 :END:
 A typical example of fully Bayesian mixture models is the fully Bayesian
 Gaussian mixture model (Attias 2000, also called variational GMM in the
@@ -684,6 +695,7 @@ Chapter 10.2 of Bishop 2006 or Attias 2000.
 *** LDA
 :PROPERTIES:
 :CUSTOM_ID: lda
+:ID: 7d752891-ef33-4b58-9dc3-d6a61325bfa6
 :END:
 As the second example of fully Bayesian mixture models, Latent Dirichlet
 allocation (LDA) (Blei-Ng-Jordan 2003) is the fully Bayesian version of
@@ -747,6 +759,7 @@ So the algorithm iterates over (10) and (11)(12) until convergence.
 *** DPMM
 :PROPERTIES:
 :CUSTOM_ID: dpmm
+:ID: 187cb168-b3f8-428e-962a-80ad5966f844
 :END:
 The Dirichlet process mixture model (DPMM) is like the fully Bayesian
 mixture model except $n_z = \infty$, i.e. $z$ can take any positive
@@ -900,6 +913,7 @@ $$\begin{aligned}
 ** SVI
 :PROPERTIES:
 :CUSTOM_ID: svi
+:ID: 47efee6c-67ac-44eb-92fb-4d576ae2ec99
 :END:
 In variational inference, the computation of some parameters are more
 expensive than others.
@@ -969,6 +983,7 @@ for some $\kappa \in (.5, 1]$ and $\tau \ge 0$.
 ** AEVB
 :PROPERTIES:
 :CUSTOM_ID: aevb
+:ID: a196df8f-1574-4390-83a4-dd22d8fcecaf
 :END:
 SVI adds to variational inference stochastic updates similar to
 stochastic gradient descent. Why not just use neural networks with
@@ -1048,6 +1063,7 @@ approximation of $U(x, \phi, \theta)$ itself can be done similarly.
 *** VAE
 :PROPERTIES:
 :CUSTOM_ID: vae
+:ID: 59e07ae5-a4d3-4b95-949f-0b4348f2b70b
 :END:
 As an example of AEVB, the paper introduces variational autoencoder
 (VAE), with the following instantiations:
@@ -1069,6 +1085,7 @@ With this, one can use backprop to maximise the ELBO.
 *** Fully Bayesian AEVB
 :PROPERTIES:
 :CUSTOM_ID: fully-bayesian-aevb
+:ID: 0fb4f75b-4b62-440f-adc7-996b2d7f718a
 :END:
 Let us turn to fully Bayesian version of AEVB. Again, we first recall
 the ELBO of the fully Bayesian mixture models:
@@ -1117,6 +1134,7 @@ Again, one may use Monte-Carlo to approximate this expetation.
 ** References
 :PROPERTIES:
 :CUSTOM_ID: references
+:ID: df1567c9-b0e1-499f-a9d1-c0c915b2b98d
 :END:
 
 - Attias, Hagai. "A variational baysian framework for graphical models."
diff --git a/posts/2019-03-13-a-tail-of-two-densities.org b/posts/2019-03-13-a-tail-of-two-densities.org
index 783e0c5..f1b6b15 100644
--- a/posts/2019-03-13-a-tail-of-two-densities.org
+++ b/posts/2019-03-13-a-tail-of-two-densities.org
@@ -80,6 +80,7 @@ BY-SA]] and [[https://www.gnu.org/licenses/fdl.html][GNU FDL]]./
 ** The gist of differential privacy
 :PROPERTIES:
 :CUSTOM_ID: the-gist-of-differential-privacy
+:ID: 91bf2eb5-8509-4180-b471-939280dc1438
 :END:
 
 If you only have one minute, here is what differential privacy is about:
@@ -118,6 +119,7 @@ Now if you have an hour...
 ** \(\epsilon\)-dp
 :PROPERTIES:
 :CUSTOM_ID: epsilon-dp
+:ID: d29da3db-8b9a-4bad-811e-4af1cd9f856d
 :END:
 *Definition (Mechanisms)*. Let \(X\) be a space with a metric
 \(d: X \times X \to \mathbb N\). A /mechanism/ \(M\) is a function that
@@ -188,6 +190,7 @@ where in the last step we use the condition (1.5). \(\square\)
 ** Approximate differential privacy
 :PROPERTIES:
 :CUSTOM_ID: approximate-differential-privacy
+:ID: c48c68f8-d749-47f8-b5de-c92cc53f8cea
 :END:
 Unfortunately, \(\epsilon\)-dp does not apply to the most commonly used
 noise, the Gaussian noise. To fix this, we need to relax the definition
@@ -205,6 +208,7 @@ if \(\delta < 1\).
 *** Indistinguishability
 :PROPERTIES:
 :CUSTOM_ID: indistinguishability
+:ID: 7875ad81-326b-4eaa-a3ae-9e09df96ea1b
 :END:
 To understand \((\epsilon, \delta)\)-dp, it is helpful to study
 \((\epsilon, \delta)\)-indistinguishability.
@@ -535,6 +539,7 @@ The rest of the proof is almost the same as the proof of Claim 4.
 *** Back to approximate differential privacy
 :PROPERTIES:
 :CUSTOM_ID: back-to-approximate-differential-privacy
+:ID: 706c037d-ea44-4ade-8007-7f1f41d394e8
 :END:
 
 By Claim 0 and 1 we have
@@ -741,6 +746,7 @@ proof of Theorem A.1.
 ** Composition theorems
 :PROPERTIES:
 :CUSTOM_ID: composition-theorems
+:ID: b672a060-d886-4f07-92d2-1d92f5f4c0c8
 :END:
 So far we have seen how a mechanism made of a single query plus a noise
 can be proved to be differentially private. But we need to understand
@@ -1108,6 +1114,7 @@ The rest is the same as in the proof of Claim 17. \(\square\)
 ** Subsampling
 :PROPERTIES:
 :CUSTOM_ID: subsampling
+:ID: eeda51d4-9370-49c6-9710-9c9ab88f91e2
 :END:
 Stochastic gradient descent is like gradient descent, but with random
 subsampling.
@@ -1257,6 +1264,7 @@ guarantee for DP-SGD, among other things.
 ** References
 :PROPERTIES:
 :CUSTOM_ID: references
+:ID: 65686625-6bd1-4e42-b00d-5f1744945884
 :END:
 
 - Abadi, Martín, Andy Chu, Ian Goodfellow, H. Brendan McMahan, Ilya
diff --git a/posts/2019-03-14-great-but-manageable-expectations.org b/posts/2019-03-14-great-but-manageable-expectations.org
index 6438090..68e757a 100644
--- a/posts/2019-03-14-great-but-manageable-expectations.org
+++ b/posts/2019-03-14-great-but-manageable-expectations.org
@@ -26,11 +26,12 @@ privacy guarantees for composed subsampling queries in general, and for
 DP-SGD in particular. I also compare these privacy guarantees.
 
 /If you are confused by any notations, ask me or try
-[[/notations.html][this]]./
+[[file:/notations.html][this]]./
 
 ** Rényi divergence and differential privacy
 :PROPERTIES:
 :CUSTOM_ID: rényi-divergence-and-differential-privacy
+:ID: d1763dea-5e8f-4393-8f14-1d781147dcb5
 :END:
 Recall in the proof of Gaussian mechanism privacy guarantee (Claim 8) we
 used the Chernoff bound for the Gaussian noise. Why not use the Chernoff
@@ -161,6 +162,7 @@ considering Rényi dp.
 *** Moment Composition
 :PROPERTIES:
 :CUSTOM_ID: moment-composition
+:ID: d5e94e5a-236d-4c41-96a4-4a93341f249a
 :END:
 *Claim 22 (Moment Composition Theorem)*. Let \(M\) be the adaptive
 composition of \(M_{1 : k}\). Suppose for any \(y_{< i}\), \(M_i(y_{< i})\) is
@@ -228,6 +230,7 @@ the Advanced Composition Theorem (Claim 18).
 *** Subsampling
 :PROPERTIES:
 :CUSTOM_ID: subsampling
+:ID: 25cd27ac-fcb6-462f-9a3b-da861124d7b2
 :END:
 
 We also have a subsampling theorem for the Rényi dp.
@@ -330,6 +333,7 @@ assumptions.
 ** ACGMMTZ16
 :PROPERTIES:
 :CUSTOM_ID: acgmmtz16
+:ID: 8b85cce3-01ad-4404-80c0-b73076d183a9
 :END:
 What follows is my understanding of this result. I call it a conjecture
 because there is a gap which I am not able to reproduce their proof or
@@ -597,6 +601,7 @@ true, for the following reasons:
 ** Tensorflow implementation
 :PROPERTIES:
 :CUSTOM_ID: tensorflow-implementation
+:ID: f856ad67-4f78-46b4-8c98-fda07a0dc670
 :END:
 The DP-SGD is implemented in
 [[https://github.com/tensorflow/privacy][TensorFlow Privacy]]. In the
@@ -650,6 +655,7 @@ automatically computed given a DP-SGD instance.
 ** Comparison among different methods
 :PROPERTIES:
 :CUSTOM_ID: comparison-among-different-methods
+:ID: 30502f53-d9ba-48ea-868a-dd4db995a6d4
 :END:
 So far we have seen three routes to compute the privacy guarantees for
 DP-SGD with the Gaussian mechanism:
@@ -795,6 +801,7 @@ achieve the result in Route 3.
 ** Further questions
 :PROPERTIES:
 :CUSTOM_ID: further-questions
+:ID: 277e8a8c-cc34-4ba9-84fb-d8950f6dc9de
 :END:
 Here is a list of what I think may be interesting topics or potential
 problems to look at, with no guarantee that they are all awesome
@@ -816,6 +823,7 @@ untouched research problems:
 ** References
 :PROPERTIES:
 :CUSTOM_ID: references
+:ID: 708aa715-dc2c-49ac-b7bb-f85ac168d8b3
 :END:
 
 - Abadi, Martín, Andy Chu, Ian Goodfellow, H. Brendan McMahan, Ilya
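
The change above is mechanical: every heading that already carries a :CUSTOM_ID: property gains an :ID: property holding a fresh UUID (the kind of stable identifier org-mode uses for id: links), plus one link rewritten with the file: prefix. The commit does not record how the IDs were generated; inside Emacs this is typically done with org-id, e.g. =M-x org-id-get-create= on each heading. As a rough standalone illustration only, here is a hypothetical Python sketch of the same transformation — the script, its name, and the choice of random uuid4 values are assumptions, not part of the commit.

#+begin_src python
# Hypothetical sketch (not the commit's actual tooling): give every
# org :PROPERTIES: drawer that has a :CUSTOM_ID: but no :ID: yet a
# fresh UUID, inserted right after the :CUSTOM_ID: line, mirroring
# what this diff does to the posts.
import sys
import uuid


def add_ids(text: str) -> str:
    lines = text.splitlines(keepends=True)
    out = []
    for i, line in enumerate(lines):
        out.append(line)
        if not line.strip().startswith(":CUSTOM_ID:"):
            continue
        # Scan the rest of the drawer; only insert if no :ID: exists.
        has_id = False
        for later in lines[i + 1:]:
            if later.strip() == ":END:":
                break
            if later.strip().startswith(":ID:"):
                has_id = True
                break
        if not has_id:
            indent = line[: len(line) - len(line.lstrip())]
            out.append(f"{indent}:ID: {uuid.uuid4()}\n")
    return "".join(out)


if __name__ == "__main__":
    for path in sys.argv[1:]:
        with open(path, encoding="utf-8") as f:
            original = f.read()
        with open(path, "w", encoding="utf-8") as f:
            f.write(add_ids(original))
#+end_src

Run as =python add_org_ids.py posts/*.org= (the file name is made up); visiting each heading in Emacs and calling =org-id-get-create= would produce the same kind of result, with org-id managing the UUIDs.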