# Filename: automouse.py
# Description: Pedagogical implementations of the automatic differentiation (AD) algorithm
# Author: Pushpendre Rastogi
# Created: Tue Apr 14 22:29:14 2015 (-0400)
# Last-Updated:
# By:
# Update #: 94
'''
Sections 1 and 2 of [0] present the clearest explanation of AD I have ever
read. The other explanations I have read were [1] and [2]. I found [1]
confusing and would suggest skipping it; [2] was better, but [0] was the best.
Also see the bibliography at the end of [2], which I have copied into
Appendix 1 below for fear of losing it.

To understand [0] I wrote the following code, which tries to compute the
gradient of simple scalar functions.

References:
0. "Reverse-Mode AD in a Functional Framework: Lambda the Ultimate Backpropagator", Barak Pearlmutter and Jeffrey Siskind.
1. [Justin Domke's blog](https://justindomke.wordpress.com/2009/03/24/a-simple-explanation-of-reverse-mode-automatic-differentiation)
2. [Alexey Radul's blog](http://alexey.radul.name/ideas/2013/introduction-to-automatic-differentiation)

Appendix 1:
All of these papers are by Pearlmutter and Siskind (in one author order or the
other), and Alexey was their student. The annotations were written by Alexey,
not by me; I am only copying them here.

- Putting the Automatic Back into AD: Part I, What's Wrong.
  A screed on the state of high-performance automatic differentiation tooling.
  By reversing the senses of all the assertions, this can be read as a
  manifesto for what automatic differentiation tools should offer.
- Putting the Automatic Back into AD: Part II, Dynamic, Automatic, Nestable, and Fast.
  An argument, to the scientific computing community, that higher-order
  functions are valuable in general, and that differentiation is a
  higher-order function; describes AD in those terms. Argues that therefore
  implementation technology developed for dealing with higher-order functions
  would serve the community well.
- Perturbation Confusion and Referential Transparency: Correct Functional Implementation of Forward-Mode AD.
  A concise paper on the perturbation confusion problem, in the context of
  forward-mode AD embedded into a functional language.
- Confusion of Tagged Perturbations in Forward Automatic Differentiation of Higher-Order Functions.
  A spectacular bug that your implementation of AD may still have even if you
  have zealously read everything else and followed all the advice. Don't read
  this unless you really want to implement AD, or unless you need further
  convincing that the task is subtle.
'''
def f(x):
    return x + 2
def g(x):
    return 2 * x
def h(x):
    return x**3
def fgh(x):
    return h(g(f(x)))
x_in = 2
true_deriv = 384
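# Check by hand: fgh(x) = (2*(x + 2))**3, so by the chain rule
# fgh'(x) = 3*(2*(x + 2))**2 * 2, and fgh'(2) = 6 * 8**2 = 384.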
print "True_Deriv", true_deriv | |
# Attempt 1
'''
In Attempt 1 we use the theory of [0, Sec 2] in the simplest way. We define a
tape, push all the intermediate values onto the tape, and code the forward
pass as a series of applications of appl. Then we manually create functions
that compute the derivatives of all the functions used, and pass them in the
right order to the backward_pass function, which finally computes the
derivative.
'''
tape = []
def appl(f, v):
    # Record the input of every primitive application on the global tape.
    tape.append(v)
    return f(v)
forward_pass = appl(h, appl(g, appl(f, x_in)))
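# After this forward pass the tape holds the inputs that were fed to f, g and
# h, in that order: tape == [2, 4, 8].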
def df(x):
    return 1
def dg(x):
    return 2
def dh(x):
    return 3*(x**2)
def backward_pass(tape, deriv_list):
    # Chain rule: multiply the local derivatives, walking the tape in reverse.
    val = 1
    for v, f in zip(tape, deriv_list)[::-1]:
        val *= f(v)
    return val
attempt1 = backward_pass(tape, [df, dg, dh])
print "Auto_Diff, Attempt 1", attempt1
assert attempt1 == true_deriv
# End of Attempt 1
'''
Note that I am only interested in doing reverse-mode auto-diff for functions
with a scalar range and a vector domain.

Let's review what's ugly and needs to be fixed:
1. I had to manually create df, dg, dh.
2. I had to pass them in the right order to the backward_pass function.
3. I had to manually transform `fgh` into nested appl(...) calls.
If I get rid of these 3 drawbacks then I'd have a usable autodiff system.
I can fix problem 1 by operator overloading.
I can fix problem 2 by module-based auto-diff.
I can fix problem 3 by [AST transformations in python](https://github.com/lihaoyi/macropy)
Actually I think all of 1, 2 and 3 can be fixed by operator overloading and
module-based auto-diff. A small illustration of drawbacks 1 and 2 follows.
'''
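# The snippet below is not part of the original gist; it is a small sketch
# illustrating drawbacks 1 and 2: to differentiate a different composition,
# say f(h(x)), we must rebuild the tape and hand backward_pass the derivative
# functions in the matching order ourselves.
tape = []
fh = appl(f, appl(h, x_in))                           # f(h(2)) = 2**3 + 2 = 10
assert backward_pass(tape, [dh, df]) == 3 * x_in**2   # d/dx (x**3 + 2) = 12
tape = []  # reset the global tape before moving on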
# Attempt 2: Do a module-based auto-diff.
# We register the derivative-computing functions on the original functions as
# attributes, and push (value, function) pairs onto the tape.
f.deriv = df
g.deriv = dg
h.deriv = dh
tape = []
def appl2(f, v):
    tape.append((v, f))
    return f(v)
forward_pass = appl2(h, appl2(g, appl2(f, x_in)))
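# After this forward pass the tape pairs each intermediate input with the
# function that consumed it: tape == [(2, f), (4, g), (8, h)].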
def backward_pass2(tape):
    val = 1
    for (v, f) in tape[::-1]:
        val *= f.deriv(v)
    return val
attempt2 = backward_pass2(tape)
print "Auto_Diff, Attempt 2", attempt2
assert attempt2 == true_deriv
# End of Attempt 2
'''
So now we have gotten rid of problem 2, but problem 1 still remains. Let's
implement operator overloading by defining a new class with overloaded
operators. Each overloaded operator returns a new object that remembers, via a
closure, how to differentiate the operation that produced it. Since python is
wonderfully dynamic, implementing this is easy. See the
[Operator Reference](http://rgruet.free.fr/PQR26/PQR2.6.html#SpecialMethods)
for a comprehensive list. For now we will just overload
    __add__, __radd__,
    __pow__, __rpow__,
    __mul__, __rmul__
since those are all that the functions f, g, h use. The reflected variants
(__radd__ etc.) are needed because the constant can appear on the left, as in
2 * x; see the small dispatch sketch right after this docstring.
'''
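# The next few lines are not from the original gist; they are a minimal sketch
# (with a made-up class name, Tracer) of why the reflected operators matter:
# when the left operand is a plain int, int's own __mul__ gives up and Python
# falls back to the right operand's __rmul__.
class Tracer(object):
    def __mul__(self, other):
        return ('mul', other)
    __rmul__ = __mul__

_t = Tracer()
assert _t * 2 == ('mul', 2)   # dispatches to Tracer.__mul__
assert 2 * _t == ('mul', 2)   # int.__mul__ returns NotImplemented, so Tracer.__rmul__ runs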
def diagnose(e):
    # print "here", e
    return e

class Node(object):
    def __init__(self, x, deriv=None):
        assert not isinstance(x, Node)
        self.x = x
        if deriv is None:
            self.deriv = lambda: diagnose(1)
        else:
            self.deriv = deriv
    def __pow__(self, other):
        v = self.x**other
        return type(self)(v, lambda: diagnose(other*(v/self.x)*(self.deriv())))
    def __add__(self, other):
        v = self.x + other
        return type(self)(v, lambda: diagnose(self.deriv()))
    def __mul__(self, other):
        v = self.x * other
        return type(self)(v, lambda: diagnose(other*(self.deriv())))
    def __repr__(self):
        return str(self.x)
    # Aliasing the reflected operators is good enough here because `other` is
    # always a plain constant: + and * are commutative, and a constant base
    # raised to a Node power never occurs in f, g, h.
    __rpow__ = __pow__
    __radd__ = __add__
    __rmul__ = __mul__
    pass
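# Hand trace (added for clarity, not part of the original): h(g(f(Node(2)))) builds
#   f: Node(2) + 2   -> Node(4),   deriv() = 1
#   g: 2 * Node(4)   -> Node(8),   deriv() = 2 * 1 = 2
#   h: Node(8) ** 3  -> Node(512), deriv() = 3 * (512 / 8) * 2 = 384
# so each Node's deriv closure already chains in the derivatives of everything
# that produced it.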
# Sanity checks: each pair is (expression, expected derivative w.r.t. the wrapped value).
for (x, td) in [(Node(1), 1), (Node(1)*2, 2), (Node(2)**3, 12), (Node(2)*3, 3),
                ((Node(2)*2)**3, 96), (((Node(2)+1)*2)**3, 216), (Node(6)**3, 108)]:
    assert td == x.deriv()
# Use Nodes to automatically find df, dg, dh.
tape = None   # no tape is needed for this attempt
attempt3 = h(g(f(Node(x_in)))).deriv()
print "Auto_Diff, Attempt 3", attempt3
assert attempt3 == true_deriv
# End of Attempt 3
'''
I must admit that Attempt 3 was confusing. The whole thing looks elegant, but
there is no backward pass!! And there is no tape. The problem is that while I
wanted to implement reverse-mode AD, I accidentally implemented forward-mode
AD instead. We can confirm this by uncommenting the print in the diagnose
function. What we are doing is just nesting all the closures, building up the
call graph, so that finally we can call deriv on the output object. The
problem is that with higher-dimensional input, forward-mode AD does a lot more
computation: it needs one sweep per input dimension to recover the full
gradient, whereas reverse mode gets the whole gradient from a single backward
sweep. What is nice is that we didn't need a tape or any module-based
registration, just plain operator overloading in a single class. We didn't
need to change any of our own code; everything was done by the Node class.
That's the power of a dynamic runtime.

Now let's try to implement reverse-mode AD, with a tape and operator
overloading. The key observation is that during the forward pass we don't want
to compute the actual derivative, we just want to store local sensitivities!!!
Once we have the sensitivities, we use the tape to do the multiplications in
reverse order. One question we must answer is whether we should store the
sensitivities themselves or thunks that compute them. We really should be as
lazy as possible, to save space etc.
'''
# Attempt 4: Reverse-mode AD with a tape and operator overloading.
class RNode(object):
    def __init__(self, x, sens=None):
        assert not isinstance(x, (Node, RNode))
        self.x = x
        if sens is None:
            # NOTE: We defer computations till they are actually needed.
            self.sens = lambda inp, output: 1
        else:
            self.sens = sens
    def __pow__(self, other):
        v = self.x**other
        return type(self)(v, lambda inp, output: other*(output/inp))
    def __add__(self, other):
        v = self.x + other
        return type(self)(v)
    def __mul__(self, other):
        v = self.x * other
        return type(self)(v, lambda inp, output: other)
    def __repr__(self):
        return str(self.x)
    # As in Node, aliasing the reflected operators is safe here because the
    # constant operand is never the base of a power in f, g, h.
    __rpow__ = __pow__
    __radd__ = __add__
    __rmul__ = __mul__
    pass
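# Hand trace (added for clarity, not part of the original): with inp = RNode(2),
#   f: RNode(2) + 2  -> RNode(4),   sens(inp, out) = 1
#   g: 2 * RNode(4)  -> RNode(8),   sens(inp, out) = 2
#   h: RNode(8) ** 3 -> RNode(512), sens(inp, out) = 3 * (out / inp)
# The tape (filled by appl below) records the inputs [RNode(2), RNode(4), RNode(8)],
# so the backward pass can pair each node's sens with the value that fed it.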
tape = []
inp = RNode(x_in)
forward_pass = appl(h, appl(g, appl(f, inp)))
def backward_pass3(tape, fp, inp):
    # Walk the tape from the output back towards the input. Each node's sens
    # describes the operation that produced it, so we feed it the value that
    # went into that operation and the value that came out of it.
    tape = tape[::-1]
    val = fp.sens(tape[0].x, fp.x)
    for v, w in zip(tape, tape[1:]):
        val *= v.sens(w.x, v.x)
    return val
attempt4 = backward_pass3(tape, forward_pass, inp)
print "Auto_Diff, Attempt 4", attempt4
assert attempt4 == true_deriv
# End of Attempt 4