8 files changed, 269 insertions, 11 deletions
diff --git a/site/blog-feed.xml b/site/blog-feed.xml
index fac8ce3..82d8c31 100644
--- a/site/blog-feed.xml
+++ b/site/blog-feed.xml
@@ -2,7 +2,7 @@
 <feed xmlns="http://www.w3.org/2005/Atom">
   <title type="text">Yuchen Pei's Blog</title>
   <id>https://ypei.me/blog-feed.xml</id>
-  <updated>2018-04-29T00:00:00Z</updated>
+  <updated>2018-06-03T00:00:00Z</updated>
   <link href="https://ypei.me" />
   <link href="https://ypei.me/blog-feed.xml" rel="self" />
   <author>
@@ -10,6 +10,48 @@
   </author>
   <generator>PyAtom</generator>
   <entry xml:base="https://ypei.me/blog-feed.xml">
+    <title type="text">Automatic differentiation</title>
+    <id>posts/2018-06-03-automatic_differentiation.html</id>
+    <updated>2018-06-03T00:00:00Z</updated>
+    <link href="posts/2018-06-03-automatic_differentiation.html" />
+    <author>
+      <name>Yuchen Pei</name>
+    </author>
+    <content type="html">&lt;p&gt;This post is meant as a documentation of my understanding of autodiff. I benefited a lot from &lt;a href="http://www.cs.toronto.edu/%7Ergrosse/courses/csc321_2018/slides/lec10.pdf"&gt;Toronto CSC321 slides&lt;/a&gt; and the &lt;a href="https://github.com/mattjj/autodidact/"&gt;autodidact&lt;/a&gt; project which is a pedagogical implementation of &lt;a href="https://github.com/hips/autograd"&gt;Autograd&lt;/a&gt;. That said, any mistakes in this note are mine (especially since some of the knowledge is obtained from interpreting slides!), and if you do spot any I would be grateful if you can let me know.&lt;/p&gt;
+&lt;p&gt;Automatic differentiation (AD) is a way to compute derivatives. It does so by traversing through a computational graph using the chain rule.&lt;/p&gt;
+&lt;p&gt;There are the forward mode AD and reverse mode AD, which are kind of symmetric to each other and understanding one of them results in little to no difficulty in understanding the other.&lt;/p&gt;
+&lt;p&gt;In the language of neural networks, one can say that the forward mode AD is used when one wants to compute the derivatives of functions at all layers with respect to input layer weights, whereas the reverse mode AD is used to compute the derivatives of output functions with respect to weights at all layers. Therefore reverse mode AD (rmAD) is the one to use for gradient descent, which is the one we focus in this post.&lt;/p&gt;
+&lt;p&gt;Basically rmAD requires the computation to be sufficiently decomposed, so that in the computational graph, each node as a function of its parent nodes is an elementary function that the AD engine has knowledge about.&lt;/p&gt;
+&lt;p&gt;For example, the Sigmoid activation &lt;span class="math inline"&gt;\(a&amp;#39; = \sigma(w a + b)\)&lt;/span&gt; is quite simple, but it should be decomposed to simpler computations:&lt;/p&gt;
+&lt;ul&gt;
+&lt;li&gt;&lt;span class="math inline"&gt;\(a&amp;#39; = 1 / t_1\)&lt;/span&gt;&lt;/li&gt;
+&lt;li&gt;&lt;span class="math inline"&gt;\(t_1 = 1 + t_2\)&lt;/span&gt;&lt;/li&gt;
+&lt;li&gt;&lt;span class="math inline"&gt;\(t_2 = \exp(t_3)\)&lt;/span&gt;&lt;/li&gt;
+&lt;li&gt;&lt;span class="math inline"&gt;\(t_3 = - t_4\)&lt;/span&gt;&lt;/li&gt;
+&lt;li&gt;&lt;span class="math inline"&gt;\(t_4 = t_5 + b\)&lt;/span&gt;&lt;/li&gt;
+&lt;li&gt;&lt;span class="math inline"&gt;\(t_5 = w a\)&lt;/span&gt;&lt;/li&gt;
+&lt;/ul&gt;
+&lt;p&gt;Thus the function &lt;span class="math inline"&gt;\(a&amp;#39;(a)\)&lt;/span&gt; is decomposed to elementary operations like addition, subtraction, multiplication, reciprocitation, exponentiation, logarithm etc. And the rmAD engine stores the Jacobian of these elementary operations.&lt;/p&gt;
+&lt;p&gt;Since in neural networks we want to find derivatives of a single loss function &lt;span class="math inline"&gt;\(L(x; \theta)\)&lt;/span&gt;, we can omit &lt;span class="math inline"&gt;\(L\)&lt;/span&gt; when writing derivatives and denote, say &lt;span class="math inline"&gt;\(\bar \theta_k := \partial_{\theta_k} L\)&lt;/span&gt;.&lt;/p&gt;
+&lt;p&gt;In implementations of rmAD, one can represent the Jacobian as a transformation &lt;span class="math inline"&gt;\(j: (x \to y) \to (y, \bar y, x) \to \bar x\)&lt;/span&gt;. &lt;span class="math inline"&gt;\(j\)&lt;/span&gt; is called the &lt;em&gt;Vector Jacobian Product&lt;/em&gt; (VJP). For example, &lt;span class="math inline"&gt;\(j(\exp)(y, \bar y, x) = y \bar y\)&lt;/span&gt; since given &lt;span class="math inline"&gt;\(y = \exp(x)\)&lt;/span&gt;,&lt;/p&gt;
+&lt;p&gt;&lt;span class="math inline"&gt;\(\partial_x L = \partial_x y \cdot \partial_y L = \partial_x \exp(x) \cdot \partial_y L = y \bar y\)&lt;/span&gt;&lt;/p&gt;
+&lt;p&gt;as another example, &lt;span class="math inline"&gt;\(j(+)(y, \bar y, x_1, x_2) = (\bar y, \bar y)\)&lt;/span&gt; since given &lt;span class="math inline"&gt;\(y = x_1 + x_2\)&lt;/span&gt;, &lt;span class="math inline"&gt;\(\bar{x_1} = \bar{x_2} = \bar y\)&lt;/span&gt;.&lt;/p&gt;
+&lt;p&gt;Similarly,&lt;/p&gt;
+&lt;ol type="1"&gt;
+&lt;li&gt;&lt;span class="math inline"&gt;\(j(/)(y, \bar y, x_1, x_2) = (\bar y / x_2, - \bar y x_1 / x_2^2)\)&lt;/span&gt;&lt;/li&gt;
+&lt;li&gt;&lt;span class="math inline"&gt;\(j(\log)(y, \bar y, x) = \bar y / x\)&lt;/span&gt;&lt;/li&gt;
+&lt;li&gt;&lt;span class="math inline"&gt;\(j((A, \beta) \mapsto A \beta)(y, \bar y, A, \beta) = (\bar y \otimes \beta, A^T \bar y)\)&lt;/span&gt;.&lt;/li&gt;
+&lt;li&gt;etc...&lt;/li&gt;
+&lt;/ol&gt;
+&lt;p&gt;In the third one, the function is a matrix &lt;span class="math inline"&gt;\(A\)&lt;/span&gt; multiplied on the right by a column vector &lt;span class="math inline"&gt;\(\beta\)&lt;/span&gt;, and &lt;span class="math inline"&gt;\(\bar y \otimes \beta\)&lt;/span&gt; is the tensor product which is a fancy way of writing &lt;span class="math inline"&gt;\(\bar y \beta^T\)&lt;/span&gt;. See &lt;a href="https://github.com/mattjj/autodidact/blob/master/autograd/numpy/numpy_vjps.py"&gt;numpy_vjps.py&lt;/a&gt; for the implementation in autodidact.&lt;/p&gt;
+&lt;p&gt;So, given a node say &lt;span class="math inline"&gt;\(y = y(x_1, x_2, ..., x_n)\)&lt;/span&gt;, and given the value of &lt;span class="math inline"&gt;\(y\)&lt;/span&gt;, &lt;span class="math inline"&gt;\(x_{1 : n}\)&lt;/span&gt; and &lt;span class="math inline"&gt;\(\bar y\)&lt;/span&gt;, rmAD computes the values of &lt;span class="math inline"&gt;\(\bar x_{1 : n}\)&lt;/span&gt; by using the Jacobians.&lt;/p&gt;
+&lt;p&gt;This is the gist of rmAD. It stores the values of each node in a forward pass, and computes the derivatives of each node exactly once in a backward pass.&lt;/p&gt;
+&lt;p&gt;It is a nice exercise to derive the backpropagation in the fully connected feedforward neural networks (e.g. &lt;a href="http://neuralnetworksanddeeplearning.com/chap2.html#the_four_fundamental_equations_behind_backpropagation"&gt;the one for MNIST in Neural Networks and Deep Learning&lt;/a&gt;) using rmAD.&lt;/p&gt;
+&lt;p&gt;AD is an approach lying between the extremes of numerical approximation (e.g. finite difference) and symbolic evaluation. It uses exact formulas (VJP) at each elementary operation like symbolic evaluation, while evaluates each VJP numerically rather than lumping all the VJPs into an unwieldy symbolic formula.&lt;/p&gt;
+&lt;p&gt;Things to look further into: the higher-order functional currying form &lt;span class="math inline"&gt;\(j: (x \to y) \to (y, \bar y, x) \to \bar x\)&lt;/span&gt; begs for a functional programming implementation.&lt;/p&gt;
+</content>
+  </entry>
+  <entry xml:base="https://ypei.me/blog-feed.xml">
     <title type="text">Updates on open research</title>
     <id>posts/2018-04-10-update-open-research.html</id>
     <updated>2018-04-29T00:00:00Z</updated>
@@ -181,7 +223,7 @@
       <name>Yuchen Pei</name>
     </author>
     <content type="html">&lt;p&gt;In &lt;a href="https://arxiv.org/abs/1506.04294"&gt;this paper&lt;/a&gt; with &lt;a href="http://homepages.lboro.ac.uk/~marh3/"&gt;Robin&lt;/a&gt; we study the family of causal double product integrals \[ \prod_{a &amp;lt; x &amp;lt; y &amp;lt; b}\left(1 + i{\lambda \over 2}(dP_x dQ_y - dQ_x dP_y) + i {\mu \over 2}(dP_x dP_y + dQ_x dQ_y)\right) \]&lt;/p&gt;
-&lt;p&gt;where &lt;span class="math inline"&gt;&lt;em&gt;P&lt;/em&gt;&lt;/span&gt; and &lt;span class="math inline"&gt;&lt;em&gt;Q&lt;/em&gt;&lt;/span&gt; are the mutually noncommuting momentum and position Brownian motions of quantum stochastic calculus. The evaluation is motivated heuristically by approximating the continuous double product by a discrete product in which infinitesimals are replaced by finite increments. The latter is in turn approximated by the second quantisation of a discrete double product of rotation-like operators in different planes due to a result in &lt;a href="http://www.actaphys.uj.edu.pl/findarticle?series=Reg&amp;amp;vol=46&amp;amp;page=1851"&gt;(Hudson-Pei2015)&lt;/a&gt;. The main problem solved in this paper is the explicit evaluation of the continuum limit &lt;span class="math inline"&gt;&lt;em&gt;W&lt;/em&gt;&lt;/span&gt; of the latter, and showing that &lt;span class="math inline"&gt;&lt;em&gt;W&lt;/em&gt;&lt;/span&gt; is a unitary operator. The kernel of &lt;span class="math inline"&gt;&lt;em&gt;W&lt;/em&gt;&lt;/span&gt; is written in terms of Bessel functions, and the evaluation is achieved by working on a lattice path model and enumerating linear extensions of related partial orderings, where the enumeration turns out to be heavily related to Dyck paths and generalisations of Catalan numbers.&lt;/p&gt;
+&lt;p&gt;where &lt;span class="math inline"&gt;\(P\)&lt;/span&gt; and &lt;span class="math inline"&gt;\(Q\)&lt;/span&gt; are the mutually noncommuting momentum and position Brownian motions of quantum stochastic calculus. The evaluation is motivated heuristically by approximating the continuous double product by a discrete product in which infinitesimals are replaced by finite increments. The latter is in turn approximated by the second quantisation of a discrete double product of rotation-like operators in different planes due to a result in &lt;a href="http://www.actaphys.uj.edu.pl/findarticle?series=Reg&amp;amp;vol=46&amp;amp;page=1851"&gt;(Hudson-Pei2015)&lt;/a&gt;. The main problem solved in this paper is the explicit evaluation of the continuum limit &lt;span class="math inline"&gt;\(W\)&lt;/span&gt; of the latter, and showing that &lt;span class="math inline"&gt;\(W\)&lt;/span&gt; is a unitary operator. The kernel of &lt;span class="math inline"&gt;\(W\)&lt;/span&gt; is written in terms of Bessel functions, and the evaluation is achieved by working on a lattice path model and enumerating linear extensions of related partial orderings, where the enumeration turns out to be heavily related to Dyck paths and generalisations of Catalan numbers.&lt;/p&gt;
 </content>
   </entry>
   <entry xml:base="https://ypei.me/blog-feed.xml">
diff --git a/site/blog.html b/site/blog.html
index 0d83120..3222e3a 100644
--- a/site/blog.html
+++ b/site/blog.html
@@ -19,6 +19,13 @@
 
         <div class="main">
             <div class="bodyitem">
+    <a href="posts/2018-06-03-automatic_differentiation.html"><h2> Automatic differentiation </h2></a>
+    <p>Posted on 2018-06-03</p>
+        <p>This post is meant as a documentation of my understanding of autodiff. I benefited a lot from <a href="http://www.cs.toronto.edu/%7Ergrosse/courses/csc321_2018/slides/lec10.pdf">Toronto CSC321 slides</a> and the <a href="https://github.com/mattjj/autodidact/">autodidact</a> project which is a pedagogical implementation of <a href="https://github.com/hips/autograd">Autograd</a>. That said, any mistakes in this note are mine (especially since some of the knowledge is obtained from interpreting slides!), and if you do spot any I would be grateful if you can let me know.</p>
+
+    <a href="posts/2018-06-03-automatic_differentiation.html">Continue reading</a>
+</div>
+<div class="bodyitem">
     <a href="posts/2018-04-10-update-open-research.html"><h2> Updates on open research </h2></a>
     <p>Posted on 2018-04-29</p>
         <p>It has been 9 months since I last wrote about open (maths) research. Since then two things happened which prompted me to write an update.</p>
@@ -46,13 +53,6 @@
 
     <a href="posts/2016-10-13-q-robinson-schensted-knuth-polymer.html">Continue reading</a>
 </div>
-<div class="bodyitem">
-    <a href="posts/2015-07-15-double-macdonald-polynomials-macdonald-superpolynomials.html"><h2> AMS review of 'Double Macdonald polynomials as the stable limit of Macdonald superpolynomials' by Blondeau-Fournier, Lapointe and Mathieu </h2></a>
-    <p>Posted on 2015-07-15</p>
-        <p>A Macdonald superpolynomial (introduced in [O. Blondeau-Fournier et al., Lett. Math. Phys. <span class="bf">101</span> (2012), no. 1, 27–47; <a href="http://www.ams.org/mathscinet/search/publdoc.html?pg1=MR&amp;s1=2935476&amp;loc=fromrevtext">MR2935476</a>; J. Comb. <span class="bf">3</span> (2012), no. 3, 495–561; <a href="http://www.ams.org/mathscinet/search/publdoc.html?pg1=MR&amp;s1=3029444&amp;loc=fromrevtext">MR3029444</a>]) in \(N\) Grassmannian variables indexed by a superpartition \(\Lambda\) is said to be stable if \({m (m + 1) \over 2} \ge |\Lambda|\) and \(N \ge |\Lambda| - {m (m - 3) \over 2}\) , where \(m\) is the fermionic degree. A stable Macdonald superpolynomial (corresponding to a bisymmetric polynomial) is also called a double Macdonald polynomial (dMp). The main result of this paper is the factorisation of a dMp into plethysms of two classical Macdonald polynomials (Theorem 5). Based on this result, this paper</p>
-
-    <a href="posts/2015-07-15-double-macdonald-polynomials-macdonald-superpolynomials.html">Continue reading</a>
-</div>
 
             <div class="bodyitem">
                 <p><a href="postlist.html">older posts</a></p>
diff --git a/site/links.html b/site/links.html
index 9b53d13..fdff77a 100644
--- a/site/links.html
+++ b/site/links.html
@@ -21,6 +21,7 @@
             <div class="bodyitem">
                 <p>Here are some links I find interesting or helpful, or both. Listed in no particular order.</p>
 <ul>
+<li><a href="https://competitions.codalab.org/competitions/">CodaLab</a></li>
 <li><a href="https://haskellformaths.blogspot.com/">HaskellForMaths</a></li>
 <li><a href="http://www.openproblemgarden.org/">Open Problem Garden</a></li>
 <li><a href="http://www.ams.org/open-math-notes">AMS open notes</a></li>
diff --git a/site/microblog-feed.xml b/site/microblog-feed.xml
index a6578bc..4563861 100644
--- a/site/microblog-feed.xml
+++ b/site/microblog-feed.xml
@@ -2,7 +2,7 @@
 <feed xmlns="http://www.w3.org/2005/Atom">
   <title type="text">Yuchen Pei's Microblog</title>
   <id>https://ypei.me/microblog-feed.xml</id>
-  <updated>2018-05-11T00:00:00Z</updated>
+  <updated>2018-05-30T00:00:00Z</updated>
   <link href="https://ypei.me" />
   <link href="https://ypei.me/microblog-feed.xml" rel="self" />
   <author>
@@ -10,6 +10,95 @@
   </author>
   <generator>PyAtom</generator>
   <entry xml:base="https://ypei.me/microblog-feed.xml">
+    <title type="text">2018-05-30</title>
+    <id>microblog.html</id>
+    <updated>2018-05-30T00:00:00Z</updated>
+    <link href="microblog.html" />
+    <author>
+      <name>Yuchen Pei</name>
+    </author>
+    <content type="html">&lt;p&gt;Roger Grosse’s post &lt;a href="https://metacademy.org/roadmaps/rgrosse/learn_on_your_own"&gt;How to learn on your own (2015)&lt;/a&gt; is an excellent modern guide on how to learn and research technical stuff (especially machine learning and maths) on one’s own.&lt;/p&gt;
+</content>
+  </entry>
+  <entry xml:base="https://ypei.me/microblog-feed.xml">
+    <title type="text">2018-05-25</title>
+    <id>microblog.html</id>
+    <updated>2018-05-25T00:00:00Z</updated>
+    <link href="microblog.html" />
+    <author>
+      <name>Yuchen Pei</name>
+    </author>
+    <content type="html">&lt;p&gt;&lt;a href="http://jdlm.info/articles/2018/03/18/markov-decision-process-2048.html"&gt;This post&lt;/a&gt; models 2048 as an MDP and solves it using policy iteration and backward induction.&lt;/p&gt;
+</content>
+  </entry>
+  <entry xml:base="https://ypei.me/microblog-feed.xml">
+    <title type="text">2018-05-22</title>
+    <id>microblog.html</id>
+    <updated>2018-05-22T00:00:00Z</updated>
+    <link href="microblog.html" />
+    <author>
+      <name>Yuchen Pei</name>
+    </author>
+    <content type="html">&lt;blockquote&gt;
+&lt;p&gt;ATS (Applied Type System) is a programming language designed to unify programming with formal specification. ATS has support for combining theorem proving with practical programming through the use of advanced type systems. A past version of The Computer Language Benchmarks Game has demonstrated that the performance of ATS is comparable to that of the C and C++ programming languages. By using theorem proving and strict type checking, the compiler can detect and prove that its implemented functions are not susceptible to bugs such as division by zero, memory leaks, buffer overflow, and other forms of memory corruption by verifying pointer arithmetic and reference counting before the program compiles. Additionally, by using the integrated theorem-proving system of ATS (ATS/LF), the programmer may make use of static constructs that are intertwined with the operative code to prove that a function attains its specification.&lt;/p&gt;
+&lt;/blockquote&gt;
+&lt;p&gt;&lt;a href="https://en.wikipedia.org/wiki/ATS_(programming_language)"&gt;Wikipedia entry on ATS&lt;/a&gt;&lt;/p&gt;
+</content>
+  </entry>
+  <entry xml:base="https://ypei.me/microblog-feed.xml">
+    <title type="text">2018-05-20</title>
+    <id>microblog.html</id>
+    <updated>2018-05-20T00:00:00Z</updated>
+    <link href="microblog.html" />
+    <author>
+      <name>Yuchen Pei</name>
+    </author>
+    <content type="html">&lt;p&gt;(5-second fame) I sent a picture of my kitchen sink to BBC and got mentioned in the &lt;a href="https://www.bbc.co.uk/programmes/w3cswg8c"&gt;latest Boston Calling episode&lt;/a&gt; (listen at 25:54).&lt;/p&gt;
+</content>
+  </entry>
+  <entry xml:base="https://ypei.me/microblog-feed.xml">
+    <title type="text">2018-05-18</title>
+    <id>microblog.html</id>
+    <updated>2018-05-18T00:00:00Z</updated>
+    <link href="microblog.html" />
+    <author>
+      <name>Yuchen Pei</name>
+    </author>
+    <content type="html">&lt;p&gt;&lt;a href="https://colah.github.io/"&gt;colah’s blog&lt;/a&gt; has a cool feature that allows you to comment on any paragraph of a blog post. Here’s an &lt;a href="https://colah.github.io/posts/2015-08-Understanding-LSTMs/"&gt;example&lt;/a&gt;. If it is doable on a static site hosted on Github pages, I suppose it shouldn’t be too hard to implement. This also seems to work more seamlessly than &lt;a href="https://fermatslibrary.com/"&gt;Fermat’s Library&lt;/a&gt;, because the latter has to embed pdfs in webpages. Now fantasy time: imagine that one day arXiv shows html versions of papers (through author uploading or conversion from TeX) with this feature.&lt;/p&gt;
+</content>
+  </entry>
+  <entry xml:base="https://ypei.me/microblog-feed.xml">
+    <title type="text">2018-05-15</title>
+    <id>microblog.html</id>
+    <updated>2018-05-15T00:00:00Z</updated>
+    <link href="microblog.html" />
+    <author>
+      <name>Yuchen Pei</name>
+    </author>
+    <content type="html">&lt;h3 id="notes-on-random-froests"&gt;Notes on random froests&lt;/h3&gt;
+&lt;p&gt;&lt;a href="https://lagunita.stanford.edu/courses/HumanitiesSciences/StatLearning/Winter2016/info"&gt;Stanford Lagunita’s statistical learning course&lt;/a&gt; has some excellent lectures on random forests. It starts with explanations of decision trees, followed by bagged trees and random forests, and ends with boosting. From these lectures it seems that:&lt;/p&gt;
+&lt;ol type="1"&gt;
+&lt;li&gt;The term “predictors” in statistical learning = “features” in machine learning.&lt;/li&gt;
+&lt;li&gt;The main idea of random forests of dropping predictors for individual trees and aggregate by majority or average is the same as the idea of dropout in neural networks, where a proportion of neurons in the hidden layers are dropped temporarily during different minibatches of training, effectively averaging over an emsemble of subnetworks. Both tricks are used as regularisations, i.e. to reduce the variance. The only difference is: in random forests, all but a square root number of the total number of features are dropped, whereas the dropout ratio in neural networks is usually a half.&lt;/li&gt;
+&lt;/ol&gt;
+&lt;p&gt;By the way, here’s a comparison between statistical learning and machine learning from the slides of the Statistcal Learning course:&lt;/p&gt;
+&lt;p&gt;&lt;a href="../assets/resources/sl-vs-ml.png"&gt;&lt;img src="../assets/resources/sl-vs-ml.png" alt="SL vs ML" style="width:38em" /&gt;&lt;/a&gt;&lt;/p&gt;
+</content>
+  </entry>
+  <entry xml:base="https://ypei.me/microblog-feed.xml">
+    <title type="text">2018-05-14</title>
+    <id>microblog.html</id>
+    <updated>2018-05-14T00:00:00Z</updated>
+    <link href="microblog.html" />
+    <author>
+      <name>Yuchen Pei</name>
+    </author>
+    <content type="html">&lt;h3 id="open-peer-review"&gt;Open peer review&lt;/h3&gt;
+&lt;p&gt;Open peer review means peer review process where communications e.g. comments and responses are public.&lt;/p&gt;
+&lt;p&gt;Like &lt;a href="https://scipost.org/"&gt;SciPost&lt;/a&gt; mentioned in &lt;a href="/posts/2018-04-10-update-open-research.html"&gt;my post&lt;/a&gt;, &lt;a href="https://openreview.net"&gt;OpenReview.net&lt;/a&gt; is an example of open peer review in research. It looks like their focus is machine learning. Their &lt;a href="https://openreview.net/about"&gt;about page&lt;/a&gt; states their mission, and here’s &lt;a href="https://openreview.net/group?id=ICLR.cc/2018/Conference"&gt;an example&lt;/a&gt; where you can click on each entry to see what it is like. We definitely need this in the maths research community.&lt;/p&gt;
+</content>
+  </entry>
+  <entry xml:base="https://ypei.me/microblog-feed.xml">
     <title type="text">2018-05-11</title>
     <id>microblog.html</id>
     <updated>2018-05-11T00:00:00Z</updated>
diff --git a/site/microblog.html b/site/microblog.html
index c551725..2444f82 100644
--- a/site/microblog.html
+++ b/site/microblog.html
@@ -19,6 +19,53 @@
 
         <div class="main">
             <div class="bodyitem">
+    <span id=how-to-learn-on-your-own><p><a href="#how-to-learn-on-your-own">2018-05-30</a></p></span>
+        <p>Roger Grosse’s post <a href="https://metacademy.org/roadmaps/rgrosse/learn_on_your_own">How to learn on your own (2015)</a> is an excellent modern guide on how to learn and research technical stuff (especially machine learning and maths) on one’s own.</p>
+
+</div>
+<div class="bodyitem">
+    <span id=2048-mdp><p><a href="#2048-mdp">2018-05-25</a></p></span>
+        <p><a href="http://jdlm.info/articles/2018/03/18/markov-decision-process-2048.html">This post</a> models 2048 as an MDP and solves it using policy iteration and backward induction.</p>
+
+</div>
+<div class="bodyitem">
+    <span id=ats><p><a href="#ats">2018-05-22</a></p></span>
+        <blockquote>
+<p>ATS (Applied Type System) is a programming language designed to unify programming with formal specification. ATS has support for combining theorem proving with practical programming through the use of advanced type systems. A past version of The Computer Language Benchmarks Game has demonstrated that the performance of ATS is comparable to that of the C and C++ programming languages. By using theorem proving and strict type checking, the compiler can detect and prove that its implemented functions are not susceptible to bugs such as division by zero, memory leaks, buffer overflow, and other forms of memory corruption by verifying pointer arithmetic and reference counting before the program compiles. Additionally, by using the integrated theorem-proving system of ATS (ATS/LF), the programmer may make use of static constructs that are intertwined with the operative code to prove that a function attains its specification.</p>
+</blockquote>
+<p><a href="https://en.wikipedia.org/wiki/ATS_(programming_language)">Wikipedia entry on ATS</a></p>
+
+</div>
+<div class="bodyitem">
+    <span id=bostoncalling><p><a href="#bostoncalling">2018-05-20</a></p></span>
+        <p>(5-second fame) I sent a picture of my kitchen sink to BBC and got mentioned in the <a href="https://www.bbc.co.uk/programmes/w3cswg8c">latest Boston Calling episode</a> (listen at 25:54).</p>
+
+</div>
+<div class="bodyitem">
+    <span id=colah-blog><p><a href="#colah-blog">2018-05-18</a></p></span>
+        <p><a href="https://colah.github.io/">colah’s blog</a> has a cool feature that allows you to comment on any paragraph of a blog post. Here’s an <a href="https://colah.github.io/posts/2015-08-Understanding-LSTMs/">example</a>. If it is doable on a static site hosted on Github pages, I suppose it shouldn’t be too hard to implement. This also seems to work more seamlessly than <a href="https://fermatslibrary.com/">Fermat’s Library</a>, because the latter has to embed pdfs in webpages. Now fantasy time: imagine that one day arXiv shows html versions of papers (through author uploading or conversion from TeX) with this feature.</p>
+
+</div>
+<div class="bodyitem">
+    <span id=random-forests><p><a href="#random-forests">2018-05-15</a></p></span>
+        <h3 id="notes-on-random-froests">Notes on random froests</h3>
+<p><a href="https://lagunita.stanford.edu/courses/HumanitiesSciences/StatLearning/Winter2016/info">Stanford Lagunita’s statistical learning course</a> has some excellent lectures on random forests. It starts with explanations of decision trees, followed by bagged trees and random forests, and ends with boosting. From these lectures it seems that:</p>
+<ol type="1">
+<li>The term “predictors” in statistical learning = “features” in machine learning.</li>
+<li>The main idea of random forests of dropping predictors for individual trees and aggregate by majority or average is the same as the idea of dropout in neural networks, where a proportion of neurons in the hidden layers are dropped temporarily during different minibatches of training, effectively averaging over an emsemble of subnetworks. Both tricks are used as regularisations, i.e. to reduce the variance. The only difference is: in random forests, all but a square root number of the total number of features are dropped, whereas the dropout ratio in neural networks is usually a half.</li>
+</ol>
+<p>By the way, here’s a comparison between statistical learning and machine learning from the slides of the Statistcal Learning course:</p>
+<p><a href="../assets/resources/sl-vs-ml.png"><img src="../assets/resources/sl-vs-ml.png" alt="SL vs ML" style="width:38em" /></a></p>
+
+</div>
+<div class="bodyitem">
+    <span id=open-review-net><p><a href="#open-review-net">2018-05-14</a></p></span>
+        <h3 id="open-peer-review">Open peer review</h3>
+<p>Open peer review means peer review process where communications e.g. comments and responses are public.</p>
+<p>Like <a href="https://scipost.org/">SciPost</a> mentioned in <a href="/posts/2018-04-10-update-open-research.html">my post</a>, <a href="https://openreview.net">OpenReview.net</a> is an example of open peer review in research. It looks like their focus is machine learning. Their <a href="https://openreview.net/about">about page</a> states their mission, and here’s <a href="https://openreview.net/group?id=ICLR.cc/2018/Conference">an example</a> where you can click on each entry to see what it is like. We definitely need this in the maths research community.</p>
+
+</div>
+<div class="bodyitem">
     <span id=rnn-fsm><p><a href="#rnn-fsm">2018-05-11</a></p></span>
         <h3 id="some-notes-on-rnn-fsm-fa-tm-and-utm">Some notes on RNN, FSM / FA, TM and UTM</h3>
 <p>Related to <a href="#neural-turing-machine">a previous micropost</a>.</p>
diff --git a/site/postlist.html b/site/postlist.html
index b58f39e..0ee5d77 100644
--- a/site/postlist.html
+++ b/site/postlist.html
@@ -21,6 +21,9 @@
             <div class="bodyitem">
                 <ul class="postlist">
                     <li class="postlistitem">
+    <a href="posts/2018-06-03-automatic_differentiation.html">Automatic differentiation</a> - 2018-06-03
+</li>
+<li class="postlistitem">
     <a href="posts/2018-04-10-update-open-research.html">Updates on open research</a> - 2018-04-29
 </li>
 <li class="postlistitem">
diff --git a/site/posts/2015-07-01-causal-quantum-product-levy-area.html b/site/posts/2015-07-01-causal-quantum-product-levy-area.html
index cda8121..3fdaa72 100644
--- a/site/posts/2015-07-01-causal-quantum-product-levy-area.html
+++ b/site/posts/2015-07-01-causal-quantum-product-levy-area.html
@@ -22,7 +22,7 @@
                 <h2> On a causal quantum double product integral related to Lévy stochastic area. </h2>
                 <p>Posted on 2015-07-01</p>
                     <p>In <a href="https://arxiv.org/abs/1506.04294">this paper</a> with <a href="http://homepages.lboro.ac.uk/~marh3/">Robin</a> we study the family of causal double product integrals \[ \prod_{a &lt; x &lt; y &lt; b}\left(1 + i{\lambda \over 2}(dP_x dQ_y - dQ_x dP_y) + i {\mu \over 2}(dP_x dP_y + dQ_x dQ_y)\right) \]</p>
-<p>where <span class="math inline"><em>P</em></span> and <span class="math inline"><em>Q</em></span> are the mutually noncommuting momentum and position Brownian motions of quantum stochastic calculus. The evaluation is motivated heuristically by approximating the continuous double product by a discrete product in which infinitesimals are replaced by finite increments. The latter is in turn approximated by the second quantisation of a discrete double product of rotation-like operators in different planes due to a result in <a href="http://www.actaphys.uj.edu.pl/findarticle?series=Reg&amp;vol=46&amp;page=1851">(Hudson-Pei2015)</a>. The main problem solved in this paper is the explicit evaluation of the continuum limit <span class="math inline"><em>W</em></span> of the latter, and showing that <span class="math inline"><em>W</em></span> is a unitary operator. The kernel of <span class="math inline"><em>W</em></span> is written in terms of Bessel functions, and the evaluation is achieved by working on a lattice path model and enumerating linear extensions of related partial orderings, where the enumeration turns out to be heavily related to Dyck paths and generalisations of Catalan numbers.</p>
+<p>where <span class="math inline">\(P\)</span> and <span class="math inline">\(Q\)</span> are the mutually noncommuting momentum and position Brownian motions of quantum stochastic calculus. The evaluation is motivated heuristically by approximating the continuous double product by a discrete product in which infinitesimals are replaced by finite increments. The latter is in turn approximated by the second quantisation of a discrete double product of rotation-like operators in different planes due to a result in <a href="http://www.actaphys.uj.edu.pl/findarticle?series=Reg&amp;vol=46&amp;page=1851">(Hudson-Pei2015)</a>. The main problem solved in this paper is the explicit evaluation of the continuum limit <span class="math inline">\(W\)</span> of the latter, and showing that <span class="math inline">\(W\)</span> is a unitary operator. The kernel of <span class="math inline">\(W\)</span> is written in terms of Bessel functions, and the evaluation is achieved by working on a lattice path model and enumerating linear extensions of related partial orderings, where the enumeration turns out to be heavily related to Dyck paths and generalisations of Catalan numbers.</p>
 
             </div>
         </div>
diff --git a/site/posts/2018-06-03-automatic_differentiation.html b/site/posts/2018-06-03-automatic_differentiation.html
new file mode 100644
index 0000000..8c2b97a
--- /dev/null
+++ b/site/posts/2018-06-03-automatic_differentiation.html
@@ -0,0 +1,76 @@
+<!doctype html>
+<html lang="en">
+    <head>
+        <meta charset="utf-8">
+        <title>Automatic differentiation</title>
+        <link rel="stylesheet" href="../assets/css/default.css" />
+        <script data-isso="/comments/"
+                data-isso-css="true"
+                data-isso-lang="en"
+                data-isso-reply-to-self="false"
+                data-isso-require-author="true"
+                data-isso-require-email="true"
+                data-isso-max-comments-top="10"
+                data-isso-max-comments-nested="5"
+                data-isso-reveal-on-click="5"
+                data-isso-avatar="true"
+                data-isso-avatar-bg="#f0f0f0"
+                data-isso-avatar-fg="#9abf88 #5698c4 #e279a3 #9163b6 ..."
+                data-isso-vote="true"
+                data-vote-levels=""
+                src="/comments/js/embed.min.js"></script>
+        <script src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML" type="text/javascript"></script>
+        <script src="../assets/js/analytics.js" type="text/javascript"></script>
+    </head>
+    <body>
+        <header>
+            <span class="logo">
+                <a href="../blog.html">Yuchen's Blog</a>
+            </span>
+            <nav>
+                <a href="../index.html">About</a><a href="../postlist.html">All posts</a><a href="../blog-feed.xml">Feed</a>
+            </nav>
+        </header>
+
+        <div class="main">
+            <div class="bodyitem">
+                <h2> Automatic differentiation </h2>
+                <p>Posted on 2018-06-03 | <a href="/posts/2018-06-03-automatic_differentiation.html#isso-thread">Comments</a> </p>
+                    <p>This post is meant as a documentation of my understanding of autodiff. I benefited a lot from <a href="http://www.cs.toronto.edu/%7Ergrosse/courses/csc321_2018/slides/lec10.pdf">Toronto CSC321 slides</a> and the <a href="https://github.com/mattjj/autodidact/">autodidact</a> project which is a pedagogical implementation of <a href="https://github.com/hips/autograd">Autograd</a>. That said, any mistakes in this note are mine (especially since some of the knowledge is obtained from interpreting slides!), and if you do spot any I would be grateful if you can let me know.</p>
+<p>Automatic differentiation (AD) is a way to compute derivatives. It does so by traversing through a computational graph using the chain rule.</p>
+<p>There are the forward mode AD and reverse mode AD, which are kind of symmetric to each other and understanding one of them results in little to no difficulty in understanding the other.</p>
+<p>In the language of neural networks, one can say that the forward mode AD is used when one wants to compute the derivatives of functions at all layers with respect to input layer weights, whereas the reverse mode AD is used to compute the derivatives of output functions with respect to weights at all layers. Therefore reverse mode AD (rmAD) is the one to use for gradient descent, which is the one we focus in this post.</p>
+<p>Basically rmAD requires the computation to be sufficiently decomposed, so that in the computational graph, each node as a function of its parent nodes is an elementary function that the AD engine has knowledge about.</p>
+<p>For example, the Sigmoid activation <span class="math inline">\(a&#39; = \sigma(w a + b)\)</span> is quite simple, but it should be decomposed to simpler computations:</p>
+<ul>
+<li><span class="math inline">\(a&#39; = 1 / t_1\)</span></li>
+<li><span class="math inline">\(t_1 = 1 + t_2\)</span></li>
+<li><span class="math inline">\(t_2 = \exp(t_3)\)</span></li>
+<li><span class="math inline">\(t_3 = - t_4\)</span></li>
+<li><span class="math inline">\(t_4 = t_5 + b\)</span></li>
+<li><span class="math inline">\(t_5 = w a\)</span></li>
+</ul>
+<p>Thus the function <span class="math inline">\(a&#39;(a)\)</span> is decomposed to elementary operations like addition, subtraction, multiplication, reciprocitation, exponentiation, logarithm etc. And the rmAD engine stores the Jacobian of these elementary operations.</p>
+<p>Since in neural networks we want to find derivatives of a single loss function <span class="math inline">\(L(x; \theta)\)</span>, we can omit <span class="math inline">\(L\)</span> when writing derivatives and denote, say <span class="math inline">\(\bar \theta_k := \partial_{\theta_k} L\)</span>.</p>
+<p>In implementations of rmAD, one can represent the Jacobian as a transformation <span class="math inline">\(j: (x \to y) \to (y, \bar y, x) \to \bar x\)</span>. <span class="math inline">\(j\)</span> is called the <em>Vector Jacobian Product</em> (VJP). For example, <span class="math inline">\(j(\exp)(y, \bar y, x) = y \bar y\)</span> since given <span class="math inline">\(y = \exp(x)\)</span>,</p>
+<p><span class="math inline">\(\partial_x L = \partial_x y \cdot \partial_y L = \partial_x \exp(x) \cdot \partial_y L = y \bar y\)</span></p>
+<p>as another example, <span class="math inline">\(j(+)(y, \bar y, x_1, x_2) = (\bar y, \bar y)\)</span> since given <span class="math inline">\(y = x_1 + x_2\)</span>, <span class="math inline">\(\bar{x_1} = \bar{x_2} = \bar y\)</span>.</p>
+<p>Similarly,</p>
+<ol type="1">
+<li><span class="math inline">\(j(/)(y, \bar y, x_1, x_2) = (\bar y / x_2, - \bar y x_1 / x_2^2)\)</span></li>
+<li><span class="math inline">\(j(\log)(y, \bar y, x) = \bar y / x\)</span></li>
+<li><span class="math inline">\(j((A, \beta) \mapsto A \beta)(y, \bar y, A, \beta) = (\bar y \otimes \beta, A^T \bar y)\)</span>.</li>
+<li>etc...</li>
+</ol>
+<p>In the third one, the function is a matrix <span class="math inline">\(A\)</span> multiplied on the right by a column vector <span class="math inline">\(\beta\)</span>, and <span class="math inline">\(\bar y \otimes \beta\)</span> is the tensor product which is a fancy way of writing <span class="math inline">\(\bar y \beta^T\)</span>. See <a href="https://github.com/mattjj/autodidact/blob/master/autograd/numpy/numpy_vjps.py">numpy_vjps.py</a> for the implementation in autodidact.</p>
+<p>So, given a node say <span class="math inline">\(y = y(x_1, x_2, ..., x_n)\)</span>, and given the value of <span class="math inline">\(y\)</span>, <span class="math inline">\(x_{1 : n}\)</span> and <span class="math inline">\(\bar y\)</span>, rmAD computes the values of <span class="math inline">\(\bar x_{1 : n}\)</span> by using the Jacobians.</p>
+<p>This is the gist of rmAD. It stores the values of each node in a forward pass, and computes the derivatives of each node exactly once in a backward pass.</p>
+<p>It is a nice exercise to derive the backpropagation in the fully connected feedforward neural networks (e.g. <a href="http://neuralnetworksanddeeplearning.com/chap2.html#the_four_fundamental_equations_behind_backpropagation">the one for MNIST in Neural Networks and Deep Learning</a>) using rmAD.</p>
+<p>AD is an approach lying between the extremes of numerical approximation (e.g. finite difference) and symbolic evaluation. It uses exact formulas (VJP) at each elementary operation like symbolic evaluation, while evaluates each VJP numerically rather than lumping all the VJPs into an unwieldy symbolic formula.</p>
+<p>Things to look further into: the higher-order functional currying form <span class="math inline">\(j: (x \to y) \to (y, \bar y, x) \to \bar x\)</span> begs for a functional programming implementation.</p>
+
+            </div>
+        <section id="isso-thread"></section>
+        </div>
+    </body>
+</html>