# Filename: automouse.py
# Description: Pedagogical implementations of the automatic differentiation (AD) algorithm
# Author: Pushpendre Rastogi
# Created: Tue Apr 14 22:29:14 2015 (-0400)
# Last-Updated:
# By:
# Update #: 94
'''
Sections 1 and 2 of [0] present the clearest explanation of AD I have ever read. The other explanations I have read were [1] and [2]. I found [1] confusing and would suggest skipping it; [2] was better, but [0] was the best. Also see the bibliography at the end of [2], which I have copied in Appendix 1 below for fear of losing it.
To understand [0] I wrote the following code, which computes the gradient of simple scalar functions.
References:
0. "Reverse-Mode AD in a Functional Framework: Lambda the Ultimate Backpropagator", Barak Pearlmutter and Jeffrey Siskind.
1. [Justin Domke's blog](https://justindomke.wordpress.com/2009/03/24/a-simple-explanation-of-reverse-mode-automatic-differentiation)
2. [Alexey Radul's blog](http://alexey.radul.name/ideas/2013/introduction-to-automatic-differentiation)
Appendix 1:
All of these papers are by Pearlmutter and Siskind (in one author order or the other), and Alexey was their student. The annotations below were written by Alexey, not by me; I am only copying them here.
- Putting the Automatic Back into AD: Part I, What's Wrong.
A screed on the state of high-performance automatic differentiation tooling. By reversing the senses of all the assertions, this can be read as a manifesto for what automatic differentiation tools should offer.
- Putting the Automatic Back into AD: Part II, Dynamic, Automatic, Nestable, and Fast.
An argument, to the scientific computing community, that higher-order functions are valuable in general, and that differentiation is a higher-order function; describes AD in those terms. Argues that therefore implementation technology developed for dealing with higher-order functions would serve the community well.
- Perturbation Confusion and Referential Transparency: Correct Functional Implementation of Forward-Mode AD.
A concise paper on the perturbation confusion problem, in the context of forward mode AD embedded into a functional language.
- Confusion of Tagged Perturbations in Forward Automatic Differentiation of Higher-Order Functions.
A spectacular bug that your implementation of AD may still have even if you have zealously read everything else and followed all the advice. Don't read this unless you really want to implement AD, or unless you need further convincing that the task is subtle.
'''
def f(x):
    return x + 2
def g(x):
    return 2 * x
def h(x):
    return x**3
def fgh(x):
    return h(g(f(x)))
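# fgh(x) = (2*(x + 2))**3, so by the chain rule fgh'(x) = 6*(2*(x + 2))**2,
# which at x = 2 evaluates to 6 * 8**2 = 384.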
x_in = 2
true_deriv = 384
print "True_Deriv", true_deriv
# Attempt 1
'''
In Attempt 1 we use the theory of [0, Sec. 2] in the simplest way. We define a tape,
push all the intermediate computations onto the tape, and write the forward pass as a
series of applications of appl.
Then we manually create functions that compute the derivatives of all the functions used, and pass them in
the right order to the backward_pass function, which finally computes the derivative.
'''
tape = []
def appl(f, v):
    # Record the *input* of each primitive call on the tape, then apply f.
    tape.append(v)
    return f(v)
forward_pass = appl(h, appl(g, appl(f, x_in)))
# Hand-written derivatives of f, g and h.
def df(x):
    return 1
def dg(x):
    return 2
def dh(x):
    return 3*(x**2)
def backward_pass(tape, deriv_list):
    # Multiply the local derivatives together in reverse order (chain rule).
    val = 1
    for v, f in zip(tape, deriv_list)[::-1]:
        val *= f(v)
    return val
attempt1 = backward_pass(tape, [df, dg, dh])
print "Auto_Diff, Attempt 1", attempt1
assert attempt1 == true_deriv
# End of Attempt 1
'''
Note that I am only interested in reverse-mode auto-diff for functions with a scalar range (and, eventually, a vector domain).
Let's review what's ugly and needs to be fixed.
1. I had to manually create df, dg, dh.
2. I had to pass them in the right order to the backward_pass function.
3. I had to manually transform `fgh` into nested appl(...) calls.
If I got rid of these 3 drawbacks I'd have a usable autodiff system.
I can fix problem 1 by operator overloading.
I can fix problem 2 by module-based auto-diff.
I can fix problem 3 by [AST transformations in python](https://github.com/lihaoyi/macropy); a cheaper workaround is sketched right after this docstring.
Actually, I think all of 1, 2, 3 can be fixed by operator overloading and module-based auto-diff.
'''
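# A minimal sketch, not in the original gist, of a cheaper workaround for
# problem 3: for straight-line compositions like fgh there is no need for AST
# rewriting, because the appl chaining can be done programmatically.  The
# helper name chain_appl is an ad-hoc addition.
def chain_appl(funcs, x):
    # Apply each function in order, letting appl record its input on the tape.
    for fn in funcs:
        x = appl(fn, x)
    return x
# chain_appl([f, g, h], x_in) would build the same tape as the manual
# appl(h, appl(g, appl(f, x_in))) used above, so it is not called here to
# avoid duplicating tape entries.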
# Attempt 2: Do a module-based auto-diff.
# We register each derivative function as an attribute of the original
# function and push (input, function) pairs onto the tape.
f.deriv = df
g.deriv = dg
h.deriv = dh
tape = []
def appl2(f, v):
    tape.append((v, f))
    return f(v)
forward_pass = appl2(h, appl2(g, appl2(f, x_in)))
def backward_pass2(tape):
    val = 1
    for (v, f) in tape[::-1]:
        val *= f.deriv(v)
    return val
attempt2 = backward_pass2(tape)
print "Auto_Diff, Attempt 2", attempt2
assert attempt2 == true_deriv
# End of Attempt 2
'''
So now we have got rid of problem 2, but problem 1 still remains.
Let's implement operator overloading by defining a new class
with overloaded operators. The overloaded operators record,
inside closures, every operation that was applied to the object.
Since Python is wonderfully dynamic, implementing this is easy.
See the [Operator Reference](http://rgruet.free.fr/PQR26/PQR2.6.html#SpecialMethods)
for a comprehensive list. For now we will just overload
__add__, __radd__,
__pow__, __rpow__,
__mul__, __rmul__
since those are all we use in the functions f, g, h.
'''
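# A note on the reflected operators (not in the original comments): in the
# expression 2 * node, Python first tries int.__mul__(2, node), which returns
# NotImplemented, and then falls back to node.__rmul__(2).  Aliasing
# __rmul__ = __mul__ below is what makes g(x) = 2 * x work on a Node.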
def diagnose(e):
    # Uncomment the print to trace when each closure actually fires.
    # print "here", e
    return e
class Node(object):
    def __init__(self, x, deriv=None):
        assert not isinstance(x, Node)
        self.x = x
        if deriv is None:
            # A fresh input node: d(x)/d(x) = 1.
            self.deriv = lambda: diagnose(1)
        else:
            self.deriv = deriv
    def __pow__(self, other):
        # d(x**c)/dx = c * x**(c-1) = c * (x**c / x), written with v = x**c.
        v = self.x**other
        return type(self)(v, lambda: diagnose(other*(v/self.x)*(self.deriv())))
    def __add__(self, other):
        # Adding a constant leaves the derivative unchanged.
        v = self.x + other
        return type(self)(v, lambda: diagnose(self.deriv()))
    def __mul__(self, other):
        # d(c*x)/dx = c * d(x)/dx, i.e. scale the accumulated derivative by c.
        v = self.x * other
        return type(self)(v, lambda: diagnose(other*(self.deriv())))
    def __repr__(self):
        return str(self.x)
    # __radd__ and __rmul__ are safe aliases because + and * commute;
    # __rpow__ = __pow__ is only a placeholder (c**x has a different
    # derivative), but no reflected power is used in f, g, h.
    __rpow__ = __pow__
    __radd__ = __add__
    __rmul__ = __mul__
    pass
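# Sanity checks: each pair below is (an expression built from Node objects,
# the expected derivative w.r.t. the wrapped value); e.g. for (Node(2)*2)**3
# the derivative of (2*x)**3 at x = 2 is 3*(2*x)**2 * 2 = 96.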
for (x, td) in [(Node(1), 1), (Node(1)*2, 2), (Node(2)**3, 12), (Node(2)*3, 3),
                ((Node(2)*2)**3, 96), (((Node(2)+1)*2)**3, 216), (Node(6)**3, 108)]:
    assert td == x.deriv()
# Use Nodes to automatically find df, dg, dh
tape = None
attempt3 = h(g(f(Node(x_in)))).deriv()
print "Auto_Diff, Attempt 3", attempt3
# End of Attempt 3
'''
I must admit that Attempt 3 was confusing. The whole thing
looks elegant, but there is no backward pass!! And there is
no tape.
The problem is that while I wanted to implement reverse-mode
AD, I accidentally implemented forward-mode AD instead. We
can confirm this by uncommenting the print in the diagnose
function. All we are doing is nesting closures and building a
call graph, so that finally we can call deriv on the object.
The problem is that with higher-dimensional input, forward-mode
AD does many more computations (one sweep per input dimension).
What is nice is that we needed no tape and no module-based
machinery, just plain operator overloading in a single class.
We didn't need to change any of our own code; everything was
done by the Node class. That's the power of a dynamic runtime.
Now let's try to implement reverse-mode AD with a tape and
operator overloading. The key observation is that we don't want
to compute the actual derivative eagerly; we just want to store
sensitivities!!! Once we have the sensitivities, we use the
tape to do the multiplication.
One question we must answer is whether we should store the
sensitivities or their thunks. We should be as lazy as possible
to save space, etc.
'''
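# Here a "sensitivity" means the local partial derivative of one primitive op,
# d(output)/d(input), stored as a deferred function of the op's input and
# output values rather than as an eagerly computed number.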
class RNode(object):
    def __init__(self, x, sens=None):
        assert not isinstance(x, RNode)
        self.x = x
        if sens is None:
            # Default sensitivity of 1; computation is deferred until the
            # backward pass actually needs it.
            self.sens = lambda inp, output: 1
        else:
            self.sens = sens
    def __pow__(self, other):
        # d(x**c)/dx = c * x**(c-1) = c * (output / input).
        v = self.x**other
        return type(self)(v, lambda inp, output: other*(output/inp))
    def __add__(self, other):
        # Adding a constant has sensitivity 1 (the default).
        v = self.x + other
        return type(self)(v)
    def __mul__(self, other):
        # d(c*x)/dx = c.
        v = self.x * other
        return type(self)(v, lambda inp, output: other)
    def __repr__(self):
        return str(self.x)
    __rpow__ = __pow__
    __radd__ = __add__
    __rmul__ = __mul__
    pass
tape = []
inp = RNode(x_in)
forward_pass = appl(h, appl(g, appl(f, inp)))
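# At this point the tape holds the *inputs* of the three primitive ops,
# [RNode(2), RNode(4), RNode(8)], while each op's *output* node carries its
# sensitivity thunk: 1 for x + 2, 2 for 2 * x, and 3*(output/input) for x**3.
# The backward pass below multiplies these local sensitivities together.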
def backward_pass3(tape, fp, inp):
    # Walk the tape backwards, pairing each recorded input with the node
    # produced from it, since each `sens` thunk needs (input, output) values.
    # (`inp` is also the last node visited, since appl pushed it on the tape.)
    val = 1
    prev = fp
    for v in tape[::-1]:
        val *= prev.sens(v.x, prev.x)
        prev = v
    return val
attempt4 = backward_pass3(tape, forward_pass, inp)
print "Auto_Diff, Attempt 4", attempt4
assert attempt4 == true_deriv
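# End of Attempt 4
# A quick sanity check, not in the original gist: reuse RNode, appl and the
# fixed backward_pass3 on a different composition, k(x) = (3*x + 1)**2, whose
# derivative at x = 2 is 2*(3*2 + 1)*3 = 42.  The helper names p, q, r, k_inp
# and k_out are ad-hoc additions.
def p(x):
    return 3 * x
def q(x):
    return x + 1
def r(x):
    return x**2
tape = []  # start a fresh tape; appl appends to the module-level name
k_inp = RNode(2)
k_out = appl(r, appl(q, appl(p, k_inp)))
assert backward_pass3(tape, k_out, k_inp) == 42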