# Copyright (C) 2013-2021 Yuchen Pei.

# Permission is granted to copy, distribute and/or modify this
# document under the terms of the GNU Free Documentation License,
# Version 1.3 or any later version published by the Free Software
# Foundation; with no Invariant Sections, no Front-Cover Texts, and
# no Back-Cover Texts. You should have received a copy of the GNU
# Free Documentation License. If not, see <https://www.gnu.org/licenses/>.

# This work is licensed under the Creative Commons Attribution-ShareAlike 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/4.0/ or send a letter to Creative Commons, PO Box 1866, Mountain View, CA 94042, USA.

#+title: Automatic differentiation

#+date: <2018-06-03>

This post serves as a note and explainer of autodiff. It is licensed
under [[https://www.gnu.org/licenses/fdl.html][GNU FDL]].

In learning this subject I benefited a lot from the
[[http://www.cs.toronto.edu/%7Ergrosse/courses/csc321_2018/slides/lec10.pdf][Toronto
CSC321 slides]] and the
[[https://github.com/mattjj/autodidact/][autodidact]] project, which is a
pedagogical implementation of
[[https://github.com/hips/autograd][Autograd]]. That said, any mistakes
in this note are mine (especially since some of the knowledge is
obtained from interpreting slides!), and if you do spot any I would be
grateful if you could let me know.

Automatic differentiation (AD) is a way to compute derivatives. It does
so by traversing a computational graph and applying the chain rule at
each node.

There are two modes, forward mode AD and reverse mode AD, which are
roughly symmetric to each other: understanding one of them makes it
easy to understand the other.

In the language of neural networks, one can say that forward mode AD is
used when one wants to compute the derivatives of functions at all
layers with respect to input layer weights, whereas reverse mode AD is
used to compute the derivatives of output functions with respect to
weights at all layers. Therefore reverse mode AD (rmAD) is the one to
use for gradient descent, and it is the one we focus on in this post.

Basically rmAD requires the computation to be sufficiently decomposed,
so that in the computational graph, each node as a function of its
parent nodes is an elementary function that the AD engine has knowledge
about.

For example, the sigmoid activation $a' = \sigma(w a + b)$ is quite
simple, but it should be decomposed into simpler computations (a code
sketch follows the list):

- $a' = 1 / t_1$
- $t_1 = 1 + t_2$
- $t_2 = \exp(t_3)$
- $t_3 = - t_4$
- $t_4 = t_5 + b$
- $t_5 = w a$
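
Below is a minimal sketch of this decomposition in Python (the function
name =sigmoid_forward= is mine, not from any library): a forward pass
that records every intermediate value, which is the kind of trace a
rmAD engine keeps for the backward pass.

#+begin_src python
import math

def sigmoid_forward(w, a, b):
    """Forward pass of a' = sigmoid(w a + b), decomposed into elementary steps.

    Returns a' together with the intermediate values, so that a backward
    pass can reuse them.
    """
    t5 = w * a           # t5 = w a
    t4 = t5 + b          # t4 = t5 + b
    t3 = -t4             # t3 = -t4
    t2 = math.exp(t3)    # t2 = exp(t3)
    t1 = 1 + t2          # t1 = 1 + t2
    a_prime = 1 / t1     # a' = 1 / t1
    return a_prime, (t1, t2, t3, t4, t5)
#+end_src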

Thus the function $a'(a)$ is decomposed into elementary operations like
addition, subtraction, multiplication, taking reciprocals,
exponentiation, logarithm, etc., and the rmAD engine stores the
Jacobians of these elementary operations.

Since in neural networks we want to find derivatives of a single loss
function $L(x; \theta)$, we can omit $L$ when writing derivatives and
denote, say $\bar \theta_k := \partial_{\theta_k} L$.

In implementations of rmAD, one can represent the Jacobian as a
transformation $j: (x \to y) \to (y, \bar y, x) \to \bar x$. $j$ is
called the /Vector Jacobian Product/ (VJP). For example,
$j(\exp)(y, \bar y, x) = y \bar y$ since given $y = \exp(x)$,

$\partial_x L = \partial_x y \cdot \partial_y L = \partial_x \exp(x) \cdot \partial_y L = y \bar y$

As another example, $j(+)(y, \bar y, x_1, x_2) = (\bar y, \bar y)$ since
given $y = x_1 + x_2$, $\bar{x_1} = \bar{x_2} = \bar y$.
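
As a sketch (the function names are mine, not autodidact's), these two
VJPs can be written as ordinary Python functions that take the output,
the output gradient and the inputs, and return the input gradients:

#+begin_src python
def vjp_exp(y, y_bar, x):
    # y = exp(x), so x_bar = y * y_bar
    return y * y_bar

def vjp_add(y, y_bar, x1, x2):
    # y = x1 + x2, so x1_bar = x2_bar = y_bar
    return y_bar, y_bar
#+end_src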

Similarly,

1. $j(/)(y, \bar y, x_1, x_2) = (\bar y / x_2, - \bar y x_1 / x_2^2)$
2. $j(\log)(y, \bar y, x) = \bar y / x$
3. $j((A, \beta) \mapsto A \beta)(y, \bar y, A, \beta) = (\bar y \otimes \beta, A^T \bar y)$.
4. etc...

In the third one, the function is a matrix $A$ multiplied on the right
by a column vector $\beta$, and $\bar y \otimes \beta$ is the tensor
product which is a fancy way of writing $\bar y \beta^T$. See
[[https://github.com/mattjj/autodidact/blob/master/autograd/numpy/numpy_vjps.py][numpy_vjps.py]]
for the implementation in autodidact.
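
In the same style as the sketch above (again with my own function names
rather than autodidact's actual code), the three VJPs in the list could
read:

#+begin_src python
import numpy as np

def vjp_div(y, y_bar, x1, x2):
    # y = x1 / x2
    return y_bar / x2, -y_bar * x1 / x2 ** 2

def vjp_log(y, y_bar, x):
    # y = log(x)
    return y_bar / x

def vjp_matvec(y, y_bar, A, beta):
    # y = A beta, a matrix times a column vector:
    # A_bar = y_bar (x) beta = y_bar beta^T, beta_bar = A^T y_bar
    return np.outer(y_bar, beta), A.T @ y_bar
#+end_src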

So, given a node, say $y = y(x_1, x_2, \ldots, x_n)$, and given the
values of $y$, $x_{1 : n}$ and $\bar y$, rmAD computes the values of
$\bar x_{1 : n}$ by using the Jacobians.

This is the gist of rmAD. It stores the values of each node in a forward
pass, and computes the derivatives of each node exactly once in a
backward pass.
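
As an illustration (a toy, hand-unrolled version, not how autodidact
organises its backward pass), the two passes for the sigmoid example
above could look like this, reusing =sigmoid_forward= from the earlier
sketch:

#+begin_src python
def sigmoid_grad(w, a, b):
    """Reverse mode AD for a' = sigmoid(w a + b), unrolled by hand."""
    # forward pass: compute and store the value of every node
    a_prime, (t1, t2, t3, t4, t5) = sigmoid_forward(w, a, b)
    # backward pass: visit each node exactly once, applying its VJP
    a_prime_bar = 1.0                        # seed: d a' / d a' = 1
    t1_bar = -a_prime_bar / t1 ** 2          # a' = 1 / t1
    t2_bar = t1_bar                          # t1 = 1 + t2
    t3_bar = t2_bar * t2                     # t2 = exp(t3)
    t4_bar = -t3_bar                         # t3 = -t4
    t5_bar, b_bar = t4_bar, t4_bar           # t4 = t5 + b
    w_bar, a_bar = t5_bar * a, t5_bar * w    # t5 = w a
    return w_bar, a_bar, b_bar
#+end_src

One can check that this gives $\bar b = a'(1 - a')$, the familiar
derivative of the sigmoid with respect to its argument.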

It is a nice exercise to derive backpropagation for fully connected
feedforward neural networks
(e.g. [[http://neuralnetworksanddeeplearning.com/chap2.html#the_four_fundamental_equations_behind_backpropagation][the
one for MNIST in Neural Networks and Deep Learning]]) using rmAD.

AD is an approach lying between the extremes of numerical approximation
(e.g. finite difference) and symbolic evaluation. Like symbolic
evaluation, it uses exact formulas (the VJPs) at each elementary
operation, but it evaluates each VJP numerically rather than lumping all
the VJPs into an unwieldy symbolic formula.
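
For instance, reusing the names from the sketches above, one can compare
the reverse mode gradient, which is exact up to rounding, with a central
finite difference, which also carries a truncation error depending on
the step size:

#+begin_src python
import math

def finite_diff_b(w, a, b, eps=1e-6):
    # central difference approximation of the derivative in b
    def f(b_):
        return 1 / (1 + math.exp(-(w * a + b_)))
    return (f(b + eps) - f(b - eps)) / (2 * eps)

# the two values agree to several decimal places
w, a, b = 0.5, 2.0, -1.0
print(sigmoid_grad(w, a, b)[2], finite_diff_b(w, a, b))
#+end_src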

Things to look further into: the higher-order, curried form
$j: (x \to y) \to (y, \bar y, x) \to \bar x$ begs for a functional
programming implementation.