From d4d048e66b16a3713caec957e94e8d7e80e39368 Mon Sep 17 00:00:00 2001 From: Yuchen Pei Date: Sun, 3 Jun 2018 22:22:43 +0200 Subject: fixed mathjax conversion from md --- site/blog-feed.xml | 46 ++++++++++- site/blog.html | 14 ++-- site/links.html | 1 + site/microblog-feed.xml | 91 +++++++++++++++++++++- site/microblog.html | 47 +++++++++++ site/postlist.html | 3 + ...015-07-01-causal-quantum-product-levy-area.html | 2 +- .../2018-06-03-automatic_differentiation.html | 76 ++++++++++++++++++ 8 files changed, 269 insertions(+), 11 deletions(-) create mode 100644 site/posts/2018-06-03-automatic_differentiation.html (limited to 'site') diff --git a/site/blog-feed.xml b/site/blog-feed.xml index fac8ce3..82d8c31 100644 --- a/site/blog-feed.xml +++ b/site/blog-feed.xml @@ -2,13 +2,55 @@ Yuchen Pei's Blog https://ypei.me/blog-feed.xml - 2018-04-29T00:00:00Z + 2018-06-03T00:00:00Z Yuchen Pei PyAtom + + Automatic differentiation + posts/2018-06-03-automatic_differentiation.html + 2018-06-03T00:00:00Z + + + Yuchen Pei + + <p>This post is meant as a documentation of my understanding of autodiff. I benefited a lot from <a href="http://www.cs.toronto.edu/%7Ergrosse/courses/csc321_2018/slides/lec10.pdf">Toronto CSC321 slides</a> and the <a href="https://github.com/mattjj/autodidact/">autodidact</a> project which is a pedagogical implementation of <a href="https://github.com/hips/autograd">Autograd</a>. That said, any mistakes in this note are mine (especially since some of the knowledge is obtained from interpreting slides!), and if you do spot any I would be grateful if you can let me know.</p> +<p>Automatic differentiation (AD) is a way to compute derivatives. 
It does so by traversing through a computational graph using the chain rule.</p> +<p>There are two modes, forward mode AD and reverse mode AD. They are roughly symmetric to each other, so once one of them is understood the other poses little difficulty.</p> +<p>In the language of neural networks, one can say that the forward mode AD is used when one wants to compute the derivatives of functions at all layers with respect to input layer weights, whereas the reverse mode AD is used to compute the derivatives of output functions with respect to weights at all layers. Therefore reverse mode AD (rmAD) is the one to use for gradient descent, and it is the one we focus on in this post.</p> +<p>Basically rmAD requires the computation to be sufficiently decomposed, so that in the computational graph, each node as a function of its parent nodes is an elementary function that the AD engine has knowledge about.</p> +<p>For example, the Sigmoid activation <span class="math inline">\(a&#39; = \sigma(w a + b)\)</span> is quite simple, but it should be decomposed to simpler computations:</p> +<ul> +<li><span class="math inline">\(a&#39; = 1 / t_1\)</span></li> +<li><span class="math inline">\(t_1 = 1 + t_2\)</span></li> +<li><span class="math inline">\(t_2 = \exp(t_3)\)</span></li> +<li><span class="math inline">\(t_3 = - t_4\)</span></li> +<li><span class="math inline">\(t_4 = t_5 + b\)</span></li> +<li><span class="math inline">\(t_5 = w a\)</span></li> +</ul> +<p>Thus the function <span class="math inline">\(a&#39;(a)\)</span> is decomposed into elementary operations like addition, subtraction, multiplication, taking reciprocals, exponentiation, logarithm etc. 
And the rmAD engine stores the Jacobian of these elementary operations.</p> +<p>Since in neural networks we want to find derivatives of a single loss function <span class="math inline">\(L(x; \theta)\)</span>, we can omit <span class="math inline">\(L\)</span> when writing derivatives and denote, say <span class="math inline">\(\bar \theta_k := \partial_{\theta_k} L\)</span>.</p> +<p>In implementations of rmAD, one can represent the Jacobian as a transformation <span class="math inline">\(j: (x \to y) \to (y, \bar y, x) \to \bar x\)</span>. <span class="math inline">\(j\)</span> is called the <em>Vector Jacobian Product</em> (VJP). For example, <span class="math inline">\(j(\exp)(y, \bar y, x) = y \bar y\)</span> since given <span class="math inline">\(y = \exp(x)\)</span>,</p> +<p><span class="math inline">\(\partial_x L = \partial_x y \cdot \partial_y L = \partial_x \exp(x) \cdot \partial_y L = y \bar y\)</span></p> +<p>as another example, <span class="math inline">\(j(+)(y, \bar y, x_1, x_2) = (\bar y, \bar y)\)</span> since given <span class="math inline">\(y = x_1 + x_2\)</span>, <span class="math inline">\(\bar{x_1} = \bar{x_2} = \bar y\)</span>.</p> +<p>Similarly,</p> +<ol type="1"> +<li><span class="math inline">\(j(/)(y, \bar y, x_1, x_2) = (\bar y / x_2, - \bar y x_1 / x_2^2)\)</span></li> +<li><span class="math inline">\(j(\log)(y, \bar y, x) = \bar y / x\)</span></li> +<li><span class="math inline">\(j((A, \beta) \mapsto A \beta)(y, \bar y, A, \beta) = (\bar y \otimes \beta, A^T \bar y)\)</span>.</li> +<li>etc...</li> +</ol> +<p>In the third one, the function is a matrix <span class="math inline">\(A\)</span> multiplied on the right by a column vector <span class="math inline">\(\beta\)</span>, and <span class="math inline">\(\bar y \otimes \beta\)</span> is the tensor product which is a fancy way of writing <span class="math inline">\(\bar y \beta^T\)</span>. 
See <a href="https://github.com/mattjj/autodidact/blob/master/autograd/numpy/numpy_vjps.py">numpy_vjps.py</a> for the implementation in autodidact.</p> +<p>So, given a node say <span class="math inline">\(y = y(x_1, x_2, ..., x_n)\)</span>, and given the value of <span class="math inline">\(y\)</span>, <span class="math inline">\(x_{1 : n}\)</span> and <span class="math inline">\(\bar y\)</span>, rmAD computes the values of <span class="math inline">\(\bar x_{1 : n}\)</span> by using the Jacobians.</p> +<p>This is the gist of rmAD. It stores the values of each node in a forward pass, and computes the derivatives of each node exactly once in a backward pass.</p> +<p>It is a nice exercise to derive the backpropagation in the fully connected feedforward neural networks (e.g. <a href="http://neuralnetworksanddeeplearning.com/chap2.html#the_four_fundamental_equations_behind_backpropagation">the one for MNIST in Neural Networks and Deep Learning</a>) using rmAD.</p> +<p>AD is an approach lying between the extremes of numerical approximation (e.g. finite difference) and symbolic evaluation. 
Like symbolic evaluation, it uses exact formulas (VJPs) at each elementary operation, but it evaluates each VJP numerically rather than lumping all the VJPs into an unwieldy symbolic formula.</p> +<p>Things to look further into: the higher-order functional currying form <span class="math inline">\(j: (x \to y) \to (y, \bar y, x) \to \bar x\)</span> begs for a functional programming implementation.</p> + + Updates on open research posts/2018-04-10-update-open-research.html @@ -181,7 +223,7 @@ Yuchen Pei <p>In <a href="https://arxiv.org/abs/1506.04294">this paper</a> with <a href="http://homepages.lboro.ac.uk/~marh3/">Robin</a> we study the family of causal double product integrals \[ \prod_{a &lt; x &lt; y &lt; b}\left(1 + i{\lambda \over 2}(dP_x dQ_y - dQ_x dP_y) + i {\mu \over 2}(dP_x dP_y + dQ_x dQ_y)\right) \]</p> -<p>where <span class="math inline"><em>P</em></span> and <span class="math inline"><em>Q</em></span> are the mutually noncommuting momentum and position Brownian motions of quantum stochastic calculus. The evaluation is motivated heuristically by approximating the continuous double product by a discrete product in which infinitesimals are replaced by finite increments. The latter is in turn approximated by the second quantisation of a discrete double product of rotation-like operators in different planes due to a result in <a href="http://www.actaphys.uj.edu.pl/findarticle?series=Reg&amp;vol=46&amp;page=1851">(Hudson-Pei2015)</a>. The main problem solved in this paper is the explicit evaluation of the continuum limit <span class="math inline"><em>W</em></span> of the latter, and showing that <span class="math inline"><em>W</em></span> is a unitary operator. 
The kernel of <span class="math inline"><em>W</em></span> is written in terms of Bessel functions, and the evaluation is achieved by working on a lattice path model and enumerating linear extensions of related partial orderings, where the enumeration turns out to be heavily related to Dyck paths and generalisations of Catalan numbers.</p> +<p>where <span class="math inline">\(P\)</span> and <span class="math inline">\(Q\)</span> are the mutually noncommuting momentum and position Brownian motions of quantum stochastic calculus. The evaluation is motivated heuristically by approximating the continuous double product by a discrete product in which infinitesimals are replaced by finite increments. The latter is in turn approximated by the second quantisation of a discrete double product of rotation-like operators in different planes due to a result in <a href="http://www.actaphys.uj.edu.pl/findarticle?series=Reg&amp;vol=46&amp;page=1851">(Hudson-Pei2015)</a>. The main problem solved in this paper is the explicit evaluation of the continuum limit <span class="math inline">\(W\)</span> of the latter, and showing that <span class="math inline">\(W\)</span> is a unitary operator. The kernel of <span class="math inline">\(W\)</span> is written in terms of Bessel functions, and the evaluation is achieved by working on a lattice path model and enumerating linear extensions of related partial orderings, where the enumeration turns out to be heavily related to Dyck paths and generalisations of Catalan numbers.</p> diff --git a/site/blog.html b/site/blog.html index 0d83120..3222e3a 100644 --- a/site/blog.html +++ b/site/blog.html @@ -19,6 +19,13 @@
+

Automatic differentiation

+

Posted on 2018-06-03

+

This post is meant as a documentation of my understanding of autodiff. I benefited a lot from Toronto CSC321 slides and the autodidact project which is a pedagogical implementation of Autograd. That said, any mistakes in this note are mine (especially since some of the knowledge is obtained from interpreting slides!), and if you do spot any I would be grateful if you can let me know.

+ + Continue reading +
+

Updates on open research

Posted on 2018-04-29

It has been 9 months since I last wrote about open (maths) research. Since then two things happened which prompted me to write an update.

@@ -46,13 +53,6 @@ Continue reading
-
-

AMS review of 'Double Macdonald polynomials as the stable limit of Macdonald superpolynomials' by Blondeau-Fournier, Lapointe and Mathieu

-

Posted on 2015-07-15

-

A Macdonald superpolynomial (introduced in [O. Blondeau-Fournier et al., Lett. Math. Phys. 101 (2012), no. 1, 27–47; MR2935476; J. Comb. 3 (2012), no. 3, 495–561; MR3029444]) in \(N\) Grassmannian variables indexed by a superpartition \(\Lambda\) is said to be stable if \({m (m + 1) \over 2} \ge |\Lambda|\) and \(N \ge |\Lambda| - {m (m - 3) \over 2}\) , where \(m\) is the fermionic degree. A stable Macdonald superpolynomial (corresponding to a bisymmetric polynomial) is also called a double Macdonald polynomial (dMp). The main result of this paper is the factorisation of a dMp into plethysms of two classical Macdonald polynomials (Theorem 5). Based on this result, this paper

- - Continue reading -

older posts

diff --git a/site/links.html b/site/links.html index 9b53d13..fdff77a 100644 --- a/site/links.html +++ b/site/links.html @@ -21,6 +21,7 @@

Here are some links I find interesting or helpful, or both. Listed in no particular order.

    +
  • CodaLab
  • HaskellForMaths
  • Open Problem Garden
  • AMS open notes
  • diff --git a/site/microblog-feed.xml b/site/microblog-feed.xml index a6578bc..4563861 100644 --- a/site/microblog-feed.xml +++ b/site/microblog-feed.xml @@ -2,13 +2,102 @@ Yuchen Pei's Microblog https://ypei.me/microblog-feed.xml - 2018-05-11T00:00:00Z + 2018-05-30T00:00:00Z Yuchen Pei PyAtom + + 2018-05-30 + microblog.html + 2018-05-30T00:00:00Z + + + Yuchen Pei + + <p>Roger Grosse’s post <a href="https://metacademy.org/roadmaps/rgrosse/learn_on_your_own">How to learn on your own (2015)</a> is an excellent modern guide on how to learn and research technical stuff (especially machine learning and maths) on one’s own.</p> + + + + 2018-05-25 + microblog.html + 2018-05-25T00:00:00Z + + + Yuchen Pei + + <p><a href="http://jdlm.info/articles/2018/03/18/markov-decision-process-2048.html">This post</a> models 2048 as an MDP and solves it using policy iteration and backward induction.</p> + + + + 2018-05-22 + microblog.html + 2018-05-22T00:00:00Z + + + Yuchen Pei + + <blockquote> +<p>ATS (Applied Type System) is a programming language designed to unify programming with formal specification. ATS has support for combining theorem proving with practical programming through the use of advanced type systems. A past version of The Computer Language Benchmarks Game has demonstrated that the performance of ATS is comparable to that of the C and C++ programming languages. By using theorem proving and strict type checking, the compiler can detect and prove that its implemented functions are not susceptible to bugs such as division by zero, memory leaks, buffer overflow, and other forms of memory corruption by verifying pointer arithmetic and reference counting before the program compiles. 
Additionally, by using the integrated theorem-proving system of ATS (ATS/LF), the programmer may make use of static constructs that are intertwined with the operative code to prove that a function attains its specification.</p> +</blockquote> +<p><a href="https://en.wikipedia.org/wiki/ATS_(programming_language)">Wikipedia entry on ATS</a></p> + + + + 2018-05-20 + microblog.html + 2018-05-20T00:00:00Z + + + Yuchen Pei + + <p>(5-second fame) I sent a picture of my kitchen sink to BBC and got mentioned in the <a href="https://www.bbc.co.uk/programmes/w3cswg8c">latest Boston Calling episode</a> (listen at 25:54).</p> + + + + 2018-05-18 + microblog.html + 2018-05-18T00:00:00Z + + + Yuchen Pei + + <p><a href="https://colah.github.io/">colah’s blog</a> has a cool feature that allows you to comment on any paragraph of a blog post. Here’s an <a href="https://colah.github.io/posts/2015-08-Understanding-LSTMs/">example</a>. If it is doable on a static site hosted on Github pages, I suppose it shouldn’t be too hard to implement. This also seems to work more seamlessly than <a href="https://fermatslibrary.com/">Fermat’s Library</a>, because the latter has to embed pdfs in webpages. Now fantasy time: imagine that one day arXiv shows html versions of papers (through author uploading or conversion from TeX) with this feature.</p> + + + + 2018-05-15 + microblog.html + 2018-05-15T00:00:00Z + + + Yuchen Pei + + <h3 id="notes-on-random-froests">Notes on random forests</h3> +<p><a href="https://lagunita.stanford.edu/courses/HumanitiesSciences/StatLearning/Winter2016/info">Stanford Lagunita’s statistical learning course</a> has some excellent lectures on random forests. It starts with explanations of decision trees, followed by bagged trees and random forests, and ends with boosting. 
From these lectures it seems that:</p> +<ol type="1"> +<li>The term “predictors” in statistical learning = “features” in machine learning.</li> +<li>The main idea of random forests, dropping predictors for individual trees and aggregating by majority vote or average, is the same as the idea of dropout in neural networks, where a proportion of neurons in the hidden layers are dropped temporarily during different minibatches of training, effectively averaging over an ensemble of subnetworks. Both tricks are used as regularisation, i.e. to reduce the variance. The only difference is: in random forests, all but a square root number of the total number of features are dropped, whereas the dropout ratio in neural networks is usually a half.</li> +</ol> +<p>By the way, here’s a comparison between statistical learning and machine learning from the slides of the Statistical Learning course:</p> +<p><a href="../assets/resources/sl-vs-ml.png"><img src="../assets/resources/sl-vs-ml.png" alt="SL vs ML" style="width:38em" /></a></p> + + + + 2018-05-14 + microblog.html + 2018-05-14T00:00:00Z + + + Yuchen Pei + + <h3 id="open-peer-review">Open peer review</h3> +<p>Open peer review means a peer review process in which communications, e.g. comments and responses, are public.</p> +<p>Like <a href="https://scipost.org/">SciPost</a> mentioned in <a href="/posts/2018-04-10-update-open-research.html">my post</a>, <a href="https://openreview.net">OpenReview.net</a> is an example of open peer review in research. It looks like their focus is machine learning. Their <a href="https://openreview.net/about">about page</a> states their mission, and here’s <a href="https://openreview.net/group?id=ICLR.cc/2018/Conference">an example</a> where you can click on each entry to see what it is like. 
We definitely need this in the maths research community.</p> + + 2018-05-11 microblog.html diff --git a/site/microblog.html b/site/microblog.html index c551725..2444f82 100644 --- a/site/microblog.html +++ b/site/microblog.html @@ -19,6 +19,53 @@
    +

    2018-05-30

    +

    Roger Grosse’s post How to learn on your own (2015) is an excellent modern guide on how to learn and research technical stuff (especially machine learning and maths) on one’s own.

    + +
    +
    +

    2018-05-25

    +

    This post models 2048 as an MDP and solves it using policy iteration and backward induction.

    + +
    +
    +

    2018-05-22

    +
    +

    ATS (Applied Type System) is a programming language designed to unify programming with formal specification. ATS has support for combining theorem proving with practical programming through the use of advanced type systems. A past version of The Computer Language Benchmarks Game has demonstrated that the performance of ATS is comparable to that of the C and C++ programming languages. By using theorem proving and strict type checking, the compiler can detect and prove that its implemented functions are not susceptible to bugs such as division by zero, memory leaks, buffer overflow, and other forms of memory corruption by verifying pointer arithmetic and reference counting before the program compiles. Additionally, by using the integrated theorem-proving system of ATS (ATS/LF), the programmer may make use of static constructs that are intertwined with the operative code to prove that a function attains its specification.

    +
    +

    Wikipedia entry on ATS

    + +
    +
    +

    2018-05-20

    +

    (5-second fame) I sent a picture of my kitchen sink to BBC and got mentioned in the latest Boston Calling episode (listen at 25:54).

    + +
    +
    +

    2018-05-18

    +

    colah’s blog has a cool feature that allows you to comment on any paragraph of a blog post. Here’s an example. If it is doable on a static site hosted on Github pages, I suppose it shouldn’t be too hard to implement. This also seems to work more seamlessly than Fermat’s Library, because the latter has to embed pdfs in webpages. Now fantasy time: imagine that one day arXiv shows html versions of papers (through author uploading or conversion from TeX) with this feature.

    + +
    +
    +

    2018-05-15

    +

    Notes on random forests

    +

    Stanford Lagunita’s statistical learning course has some excellent lectures on random forests. It starts with explanations of decision trees, followed by bagged trees and random forests, and ends with boosting. From these lectures it seems that:

    +
      +
    1. The term “predictors” in statistical learning = “features” in machine learning.
    2. +
    3. The main idea of random forests, dropping predictors for individual trees and aggregating by majority vote or average, is the same as the idea of dropout in neural networks, where a proportion of neurons in the hidden layers are dropped temporarily during different minibatches of training, effectively averaging over an ensemble of subnetworks. Both tricks are used as regularisation, i.e. to reduce the variance. The only difference is: in random forests, all but a square root number of the total number of features are dropped, whereas the dropout ratio in neural networks is usually a half.
    4. +
    +

    By the way, here’s a comparison between statistical learning and machine learning from the slides of the Statistical Learning course:

    +

    SL vs ML

    + +
    +
    +

    2018-05-14

    +

    Open peer review

    +

    Open peer review means a peer review process in which communications, e.g. comments and responses, are public.

    +

    Like SciPost mentioned in my post, OpenReview.net is an example of open peer review in research. It looks like their focus is machine learning. Their about page states their mission, and here’s an example where you can click on each entry to see what it is like. We definitely need this in the maths research community.

    + +
    +

    2018-05-11

    Some notes on RNN, FSM / FA, TM and UTM

    Related to a previous micropost.

    diff --git a/site/postlist.html b/site/postlist.html index b58f39e..0ee5d77 100644 --- a/site/postlist.html +++ b/site/postlist.html @@ -21,6 +21,9 @@
    • + Automatic differentiation - 2018-06-03 +
    • +
    • Updates on open research - 2018-04-29
    • diff --git a/site/posts/2015-07-01-causal-quantum-product-levy-area.html b/site/posts/2015-07-01-causal-quantum-product-levy-area.html index cda8121..3fdaa72 100644 --- a/site/posts/2015-07-01-causal-quantum-product-levy-area.html +++ b/site/posts/2015-07-01-causal-quantum-product-levy-area.html @@ -22,7 +22,7 @@

      On a causal quantum double product integral related to Lévy stochastic area.

      Posted on 2015-07-01

      In this paper with Robin we study the family of causal double product integrals \[ \prod_{a < x < y < b}\left(1 + i{\lambda \over 2}(dP_x dQ_y - dQ_x dP_y) + i {\mu \over 2}(dP_x dP_y + dQ_x dQ_y)\right) \]

      -

      where P and Q are the mutually noncommuting momentum and position Brownian motions of quantum stochastic calculus. The evaluation is motivated heuristically by approximating the continuous double product by a discrete product in which infinitesimals are replaced by finite increments. The latter is in turn approximated by the second quantisation of a discrete double product of rotation-like operators in different planes due to a result in (Hudson-Pei2015). The main problem solved in this paper is the explicit evaluation of the continuum limit W of the latter, and showing that W is a unitary operator. The kernel of W is written in terms of Bessel functions, and the evaluation is achieved by working on a lattice path model and enumerating linear extensions of related partial orderings, where the enumeration turns out to be heavily related to Dyck paths and generalisations of Catalan numbers.

      +

      where \(P\) and \(Q\) are the mutually noncommuting momentum and position Brownian motions of quantum stochastic calculus. The evaluation is motivated heuristically by approximating the continuous double product by a discrete product in which infinitesimals are replaced by finite increments. The latter is in turn approximated by the second quantisation of a discrete double product of rotation-like operators in different planes due to a result in (Hudson-Pei2015). The main problem solved in this paper is the explicit evaluation of the continuum limit \(W\) of the latter, and showing that \(W\) is a unitary operator. The kernel of \(W\) is written in terms of Bessel functions, and the evaluation is achieved by working on a lattice path model and enumerating linear extensions of related partial orderings, where the enumeration turns out to be heavily related to Dyck paths and generalisations of Catalan numbers.

    diff --git a/site/posts/2018-06-03-automatic_differentiation.html b/site/posts/2018-06-03-automatic_differentiation.html new file mode 100644 index 0000000..8c2b97a --- /dev/null +++ b/site/posts/2018-06-03-automatic_differentiation.html @@ -0,0 +1,76 @@ + + + + + Automatic differentiation + + + + + + +
    + + +
    + +
    +
    +

    Automatic differentiation

    +

    Posted on 2018-06-03 | Comments

    +

    This post is meant as a documentation of my understanding of autodiff. I benefited a lot from Toronto CSC321 slides and the autodidact project which is a pedagogical implementation of Autograd. That said, any mistakes in this note are mine (especially since some of the knowledge is obtained from interpreting slides!), and if you do spot any I would be grateful if you can let me know.

    +

    Automatic differentiation (AD) is a way to compute derivatives. It does so by traversing through a computational graph using the chain rule.

    +

    There are two modes, forward mode AD and reverse mode AD. They are roughly symmetric to each other, so once one of them is understood the other poses little difficulty.

    +

    In the language of neural networks, one can say that the forward mode AD is used when one wants to compute the derivatives of functions at all layers with respect to input layer weights, whereas the reverse mode AD is used to compute the derivatives of output functions with respect to weights at all layers. Therefore reverse mode AD (rmAD) is the one to use for gradient descent, and it is the one we focus on in this post.

    +

    Basically rmAD requires the computation to be sufficiently decomposed, so that in the computational graph, each node as a function of its parent nodes is an elementary function that the AD engine has knowledge about.

    +

    For example, the Sigmoid activation \(a' = \sigma(w a + b)\) is quite simple, but it should be decomposed to simpler computations:

    +
      +
    • \(a' = 1 / t_1\)
    • +
    • \(t_1 = 1 + t_2\)
    • +
    • \(t_2 = \exp(t_3)\)
    • +
    • \(t_3 = - t_4\)
    • +
    • \(t_4 = t_5 + b\)
    • +
    • \(t_5 = w a\)
    • +
    +

    Thus the function \(a'(a)\) is decomposed into elementary operations like addition, subtraction, multiplication, taking reciprocals, exponentiation, logarithm etc. And the rmAD engine stores the Jacobian of these elementary operations.
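    The decomposition of the Sigmoid activation can be sketched in code as follows. This is only an illustration, not how Autograd or autodidact trace a computation: the function names `sigmoid_decomposed` and `sigmoid_direct` are made up for this note, and all quantities are assumed to be scalars.

```python
import math

def sigmoid_decomposed(w, a, b):
    """Evaluate a' = sigma(w * a + b) one elementary operation at a time.

    Returns the output together with the intermediate node values,
    named t1..t5 as in the list above (hypothetical helper).
    """
    t5 = w * a         # multiplication
    t4 = t5 + b        # addition
    t3 = -t4           # negation
    t2 = math.exp(t3)  # exponentiation
    t1 = 1 + t2        # addition
    a_out = 1 / t1     # reciprocal
    return a_out, (t1, t2, t3, t4, t5)

def sigmoid_direct(z):
    """The same function written as a single formula, for comparison."""
    return 1 / (1 + math.exp(-z))

a_out, nodes = sigmoid_decomposed(0.5, 2.0, -0.3)
print(a_out, sigmoid_direct(0.5 * 2.0 - 0.3))
```

    The point of keeping the intermediate values is that each of them is the output of exactly one operation whose derivative is known.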

    +

    Since in neural networks we want to find derivatives of a single loss function \(L(x; \theta)\), we can omit \(L\) when writing derivatives and denote, say \(\bar \theta_k := \partial_{\theta_k} L\).

    +

    In implementations of rmAD, one can represent the Jacobian as a transformation \(j: (x \to y) \to (y, \bar y, x) \to \bar x\). \(j\) is called the Vector Jacobian Product (VJP). For example, \(j(\exp)(y, \bar y, x) = y \bar y\) since given \(y = \exp(x)\),

    +

    \(\partial_x L = \partial_x y \cdot \partial_y L = \partial_x \exp(x) \cdot \partial_y L = y \bar y\)

    +

    as another example, \(j(+)(y, \bar y, x_1, x_2) = (\bar y, \bar y)\) since given \(y = x_1 + x_2\), \(\bar{x_1} = \bar{x_2} = \bar y\).

    +

    Similarly,

    +
      +
    1. \(j(/)(y, \bar y, x_1, x_2) = (\bar y / x_2, - \bar y x_1 / x_2^2)\)
    2. +
    3. \(j(\log)(y, \bar y, x) = \bar y / x\)
    4. +
    5. \(j((A, \beta) \mapsto A \beta)(y, \bar y, A, \beta) = (\bar y \otimes \beta, A^T \bar y)\).
    6. +
    7. etc...
    8. +
    +

    In the third one, the function is a matrix \(A\) multiplied on the right by a column vector \(\beta\), and \(\bar y \otimes \beta\) is the tensor product which is a fancy way of writing \(\bar y \beta^T\). See numpy_vjps.py for the implementation in autodidact.
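    As a rough sketch of the rules above, here are the same VJPs written as plain Python functions with the argument order \((y, \bar y, x)\). The table `VJP` and the helpers `matvec` and `matvec_vjp` are names invented for this note (they are not the autodidact API), and vectors and matrices are represented as nested lists to keep the example free of dependencies.

```python
import math

# Hypothetical table: operation name -> VJP with signature (y, y_bar, *xs).
VJP = {
    "exp": lambda y, y_bar, x: y * y_bar,
    "add": lambda y, y_bar, x1, x2: (y_bar, y_bar),
    "div": lambda y, y_bar, x1, x2: (y_bar / x2, -y_bar * x1 / x2 ** 2),
    "log": lambda y, y_bar, x: y_bar / x,
}

def matvec(A, beta):
    """y = A beta for a matrix A (list of rows) and a vector beta."""
    return [sum(a_ij * b_j for a_ij, b_j in zip(row, beta)) for row in A]

def matvec_vjp(y, y_bar, A, beta):
    """VJP of (A, beta) -> A beta: returns (y_bar beta^T, A^T y_bar)."""
    A_bar = [[yb * b_j for b_j in beta] for yb in y_bar]
    beta_bar = [sum(A[i][j] * y_bar[i] for i in range(len(A)))
                for j in range(len(beta))]
    return A_bar, beta_bar

A, beta = [[1, 2], [3, 4]], [5, 6]
y = matvec(A, beta)
print(matvec_vjp(y, [1, 10], A, beta))
print(VJP["div"](3.0 / 2.0, 1.0, 3.0, 2.0))
```

    Note that \(\bar A\) has the same shape as \(A\) and \(\bar \beta\) the same shape as \(\beta\), which is a quick sanity check for any VJP.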

    +

    So, given a node say \(y = y(x_1, x_2, ..., x_n)\), and given the value of \(y\), \(x_{1 : n}\) and \(\bar y\), rmAD computes the values of \(\bar x_{1 : n}\) by using the Jacobians.

    +

    This is the gist of rmAD. It stores the values of each node in a forward pass, and computes the derivatives of each node exactly once in a backward pass.
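    To make the two passes concrete, here is a hand-unrolled sketch for the Sigmoid example, assuming scalar inputs and taking the loss to be the output itself, so that the backward pass is seeded with \(\bar{a'} = 1\). The function name `sigmoid_forward_backward` is hypothetical; a real engine would walk a recorded graph instead of hard-coding the order.

```python
import math

def sigmoid_forward_backward(w, a, b):
    """Forward pass stores node values; backward pass applies one VJP per node."""
    # Forward pass: compute and keep every intermediate value.
    t5 = w * a
    t4 = t5 + b
    t3 = -t4
    t2 = math.exp(t3)
    t1 = 1 + t2
    out = 1 / t1

    # Backward pass: visit the nodes in reverse order, each exactly once.
    out_bar = 1.0                # seed: L is taken to be the output itself
    t1_bar = -out_bar / t1 ** 2  # out = 1 / t1
    t2_bar = t1_bar              # t1 = 1 + t2
    t3_bar = t2 * t2_bar         # t2 = exp(t3), so t3_bar = t2 * t2_bar
    t4_bar = -t3_bar             # t3 = -t4
    t5_bar = t4_bar              # t4 = t5 + b
    b_bar = t4_bar
    w_bar = a * t5_bar           # t5 = w * a
    a_bar = w * t5_bar
    return out, {"w": w_bar, "a": a_bar, "b": b_bar}

out, grads = sigmoid_forward_backward(0.5, 2.0, -0.3)
print(out, grads)
```

    The result can be checked against the closed form \(\sigma'(z) = \sigma(z)(1 - \sigma(z))\) with \(z = w a + b\), which gives \(\bar b = \sigma'(z)\), \(\bar w = a \sigma'(z)\) and \(\bar a = w \sigma'(z)\).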

    +

    It is a nice exercise to derive the backpropagation in the fully connected feedforward neural networks (e.g. the one for MNIST in Neural Networks and Deep Learning) using rmAD.

    +

    AD is an approach lying between the extremes of numerical approximation (e.g. finite difference) and symbolic evaluation. Like symbolic evaluation, it uses exact formulas (VJPs) at each elementary operation, but it evaluates each VJP numerically rather than lumping all the VJPs into an unwieldy symbolic formula.
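    A small experiment illustrates the difference from numerical approximation. The sketch below differentiates \(f(x) = \log(1 + \exp(x))\), whose exact derivative is the sigmoid, once by a forward finite difference and once by chaining VJPs by hand. The function names and the step size `h = 1e-4` are arbitrary choices made for this illustration.

```python
import math

def f(x):
    return math.log(1 + math.exp(x))

def finite_difference(x, h=1e-4):
    """Forward difference: approximate, with an error that depends on h."""
    return (f(x + h) - f(x)) / h

def reverse_mode(x):
    """Chain the exact VJPs of log, +, exp, each evaluated numerically."""
    t1 = math.exp(x)     # forward pass
    t2 = 1 + t1
    y_bar = 1.0          # backward pass, seeded with 1
    t2_bar = y_bar / t2  # y = log(t2)
    t1_bar = t2_bar      # t2 = 1 + t1
    return t1 * t1_bar   # t1 = exp(x)

x = 0.7
exact = 1 / (1 + math.exp(-x))
print("finite difference error:", abs(finite_difference(x) - exact))
print("reverse mode error:     ", abs(reverse_mode(x) - exact))
```

    The finite difference carries a truncation error of order `h`, while the chained VJPs agree with the closed form up to floating point rounding.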

    +

    Things to look further into: the higher-order functional currying form \(j: (x \to y) \to (y, \bar y, x) \to \bar x\) begs for a functional programming implementation.

    + +
    +
    +
    + + -- cgit v1.2.3