From 147a19e84a743f1379f05bf2f444143b4afd7bd6 Mon Sep 17 00:00:00 2001
From: Yuchen Pei <me@ypei.me>
Date: Fri, 18 Jun 2021 12:58:44 +1000
Subject: Updated.

---
 posts/2018-06-03-automatic_differentiation.org | 100 +++++++++++++++++++++++++
 1 file changed, 100 insertions(+)
 create mode 100644 posts/2018-06-03-automatic_differentiation.org

(limited to 'posts/2018-06-03-automatic_differentiation.org')

diff --git a/posts/2018-06-03-automatic_differentiation.org b/posts/2018-06-03-automatic_differentiation.org
new file mode 100644
index 0000000..cebcf8c
--- /dev/null
+++ b/posts/2018-06-03-automatic_differentiation.org
@@ -0,0 +1,100 @@
+#+title: Automatic differentiation
+
+#+date: <2018-06-03>
+
+This post serves as a note and explainer of autodiff. It is licensed
+under [[https://www.gnu.org/licenses/fdl.html][GNU FDL]].
+
+For my learning I benefited a lot from
+[[http://www.cs.toronto.edu/%7Ergrosse/courses/csc321_2018/slides/lec10.pdf][Toronto
+CSC321 slides]] and the
+[[https://github.com/mattjj/autodidact/][autodidact]] project which is a
+pedagogical implementation of
+[[https://github.com/hips/autograd][Autograd]]. That said, any mistakes
+in this note are mine (especially since some of the knowledge is
+obtained from interpreting slides!), and if you do spot any I would be
+grateful if you can let me know.
+
+Automatic differentiation (AD) is a way to compute derivatives. It does
+so by traversing through a computational graph using the chain rule.
+
+There are the forward mode AD and reverse mode AD, which are kind of
+symmetric to each other and understanding one of them results in little
+to no difficulty in understanding the other.
+
+In the language of neural networks, one can say that the forward mode AD
+is used when one wants to compute the derivatives of functions at all
+layers with respect to input layer weights, whereas the reverse mode AD
+is used to compute the derivatives of output functions with respect to
+weights at all layers. Therefore reverse mode AD (rmAD) is the one to
+use for gradient descent, which is the one we focus in this post.
+
+Basically rmAD requires the computation to be sufficiently decomposed,
+so that in the computational graph, each node as a function of its
+parent nodes is an elementary function that the AD engine has knowledge
+about.
+
+For example, the Sigmoid activation $a' = \sigma(w a + b)$ is quite
+simple, but it should be decomposed to simpler computations:
+
+- $a' = 1 / t_1$
+- $t_1 = 1 + t_2$
+- $t_2 = \exp(t_3)$
+- $t_3 = - t_4$
+- $t_4 = t_5 + b$
+- $t_5 = w a$
+
+Thus the function $a'(a)$ is decomposed to elementary operations like
+addition, subtraction, multiplication, reciprocitation, exponentiation,
+logarithm etc. And the rmAD engine stores the Jacobian of these
+elementary operations.
+
+Since in neural networks we want to find derivatives of a single loss
+function $L(x; \theta)$, we can omit $L$ when writing derivatives and
+denote, say $\bar \theta_k := \partial_{\theta_k} L$.
+
+In implementations of rmAD, one can represent the Jacobian as a
+transformation $j: (x \to y) \to (y, \bar y, x) \to \bar x$. $j$ is
+called the /Vector Jacobian Product/ (VJP). For example,
+$j(\exp)(y, \bar y, x) = y \bar y$ since given $y = \exp(x)$,
+
+$\partial_x L = \partial_x y \cdot \partial_y L = \partial_x \exp(x) \cdot \partial_y L = y \bar y$
+
+as another example, $j(+)(y, \bar y, x_1, x_2) = (\bar y, \bar y)$ since
+given $y = x_1 + x_2$, $\bar{x_1} = \bar{x_2} = \bar y$.
+
+Similarly,
+
+1. $j(/)(y, \bar y, x_1, x_2) = (\bar y / x_2, - \bar y x_1 / x_2^2)$
+2. $j(\log)(y, \bar y, x) = \bar y / x$
+3. $j((A, \beta) \mapsto A \beta)(y, \bar y, A, \beta) = (\bar y \otimes \beta, A^T \bar y)$.
+4. etc...
+
+In the third one, the function is a matrix $A$ multiplied on the right
+by a column vector $\beta$, and $\bar y \otimes \beta$ is the tensor
+product which is a fancy way of writing $\bar y \beta^T$. See
+[[https://github.com/mattjj/autodidact/blob/master/autograd/numpy/numpy_vjps.py][numpy_vjps.py]]
+for the implementation in autodidact.
+
+So, given a node say $y = y(x_1, x_2, ..., x_n)$, and given the value of
+$y$, $x_{1 : n}$ and $\bar y$, rmAD computes the values of
+$\bar x_{1 : n}$ by using the Jacobians.
+
+This is the gist of rmAD. It stores the values of each node in a forward
+pass, and computes the derivatives of each node exactly once in a
+backward pass.
+
+It is a nice exercise to derive the backpropagation in the fully
+connected feedforward neural networks
+(e.g. [[http://neuralnetworksanddeeplearning.com/chap2.html#the_four_fundamental_equations_behind_backpropagation][the
+one for MNIST in Neural Networks and Deep Learning]]) using rmAD.
+
+AD is an approach lying between the extremes of numerical approximation
+(e.g. finite difference) and symbolic evaluation. It uses exact formulas
+(VJP) at each elementary operation like symbolic evaluation, while
+evaluates each VJP numerically rather than lumping all the VJPs into an
+unwieldy symbolic formula.
+
+Things to look further into: the higher-order functional currying form
+$j: (x \to y) \to (y, \bar y, x) \to \bar x$ begs for a functional
+programming implementation.
-- 
cgit v1.2.3