# Filename: automouse.py
# Description: Pedagogical implementations of the automatic differentiation (AD) algorithm
# Author: Pushpendre Rastogi
# Created: Tue Apr 14 22:29:14 2015 (-0400)
# Last-Updated:
# By:
# Update #: 94
'''
Sections 1 and 2 of [0] present the clearest explanation of AD I have ever
read. The other explanations I have read were [1] and [2]. I found [1]
confusing and would suggest skipping it; [2] was better, but [0] was the best.
Also see the bibliography at the end of [2], which I have copied into
Appendix 1 below for fear of losing it.

To understand [0] I wrote the following code, which tries to compute the
gradient of simple scalar functions.

References:
0. "Reverse-Mode AD in a Functional Framework: Lambda the Ultimate Backpropagator", Barak Pearlmutter and Jeffrey Siskind.
1. [Justin Domke's blog](https://justindomke.wordpress.com/2009/03/24/a-simple-explanation-of-reverse-mode-automatic-differentiation)
2. [Alexey Radul's blog](http://alexey.radul.name/ideas/2013/introduction-to-automatic-differentiation)

Appendix 1:
All of these papers are by Pearlmutter and Siskind (in one author order or the
other), and Alexey was their student. The annotations were written by Alexey,
not by me; I am only copying them here.

- Putting the Automatic Back into AD: Part I, What's Wrong.
  A screed on the state of high-performance automatic differentiation tooling.
  By reversing the senses of all the assertions, this can be read as a
  manifesto for what automatic differentiation tools should offer.
- Putting the Automatic Back into AD: Part II, Dynamic, Automatic, Nestable, and Fast.
  An argument, to the scientific computing community, that higher-order
  functions are valuable in general, and that differentiation is a
  higher-order function; describes AD in those terms. Argues that therefore
  implementation technology developed for dealing with higher-order functions
  would serve the community well.
- Perturbation Confusion and Referential Transparency: Correct Functional Implementation of Forward-Mode AD.
  A concise paper on the perturbation confusion problem, in the context of
  forward-mode AD embedded into a functional language.
- Confusion of Tagged Perturbations in Forward Automatic Differentiation of Higher-Order Functions.
  A spectacular bug that your implementation of AD may still have even if you
  have zealously read everything else and followed all the advice. Don't read
  this unless you really want to implement AD, or unless you need further
  convincing that the task is subtle.
'''
def f(x):
    return x + 2
def g(x):
    return 2 * x
def h(x):
    return x**3
def fgh(x):
    return h(g(f(x)))
x_in = 2
true_deriv = 384
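# Check by hand: fgh(x) = (2*(x + 2))**3, so by the chain rule
# fgh'(x) = 3*(2*(x + 2))**2 * 2, and fgh'(2) = 6 * 8**2 = 384.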
print "True_Deriv", true_deriv | |
# Attempt 1
'''
In Attempt 1 we use the theory of [0, Sec 2] in the simplest way. We define a
tape, push all the intermediate values onto the tape, and code the forward
pass as a series of applications of appl. Then we manually create functions
that compute the derivatives of all the functions used, and pass them in the
right order to the backward_pass function, which finally computes the
derivative.
'''
tape = []
def appl(f, v):
    # Record the input of every primitive application on the global tape.
    tape.append(v)
    return f(v)
forward_pass = appl(h, appl(g, appl(f, x_in)))
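# After this forward pass the tape holds the inputs that were fed to f, g and
# h, in that order: tape == [2, 4, 8].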
def df(x):
    return 1
def dg(x):
    return 2
def dh(x):
    return 3*(x**2)
def backward_pass(tape, deriv_list):
    # Chain rule: multiply the local derivatives, walking the tape in reverse.
    val = 1
    for v, f in zip(tape, deriv_list)[::-1]:
        val *= f(v)
    return val
attempt1 = backward_pass(tape, [df, dg, dh])
print "Auto_Diff, Attempt 1", attempt1
assert attempt1 == true_deriv
# End of Attempt 1
'''
Note that I am only interested in doing reverse-mode auto-diff for functions
with a scalar range and a vector domain.

Let's review what's ugly and needs to be fixed:
1. I had to manually create df, dg, dh.
2. I had to pass them in the right order to the backward_pass function.
3. I had to manually transform `fgh` into nested appl(...) calls.
If I get rid of these 3 drawbacks then I'd have a usable autodiff system.
I can fix problem 1 by operator overloading.
I can fix problem 2 by module-based auto-diff.
I can fix problem 3 by [AST transformations in python](https://github.com/lihaoyi/macropy)
Actually I think all of 1, 2 and 3 can be fixed by operator overloading and
module-based auto-diff. A small illustration of drawbacks 1 and 2 follows.
'''
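# The snippet below is not part of the original gist; it is a small sketch
# illustrating drawbacks 1 and 2: to differentiate a different composition,
# say f(h(x)), we must rebuild the tape and hand backward_pass the derivative
# functions in the matching order ourselves.
tape = []
fh = appl(f, appl(h, x_in))                           # f(h(2)) = 2**3 + 2 = 10
assert backward_pass(tape, [dh, df]) == 3 * x_in**2   # d/dx (x**3 + 2) = 12
tape = []  # reset the global tape before moving on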
# Attempt 2: Do a module-based auto-diff.
# We register the derivative-computing functions on the original functions as
# attributes, and push (value, function) pairs onto the tape.
f.deriv = df
g.deriv = dg
h.deriv = dh
tape = []
def appl2(f, v):
    tape.append((v, f))
    return f(v)
forward_pass = appl2(h, appl2(g, appl2(f, x_in)))
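# After this forward pass the tape pairs each intermediate input with the
# function that consumed it: tape == [(2, f), (4, g), (8, h)].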
def backward_pass2(tape):
    val = 1
    for (v, f) in tape[::-1]:
        val *= f.deriv(v)
    return val
attempt2 = backward_pass2(tape)
print "Auto_Diff, Attempt 2", attempt2
assert attempt2 == true_deriv
# End of Attempt 2
'''
So now we have gotten rid of problem 2, but problem 1 still remains. Let's
implement operator overloading by defining a new class with overloaded
operators. Each overloaded operator returns a new object that remembers, via a
closure, how to differentiate the operation that produced it. Since python is
wonderfully dynamic, implementing this is easy. See the
[Operator Reference](http://rgruet.free.fr/PQR26/PQR2.6.html#SpecialMethods)
for a comprehensive list. For now we will just overload
    __add__, __radd__,
    __pow__, __rpow__,
    __mul__, __rmul__
since those are all that the functions f, g, h use. The reflected variants
(__radd__ etc.) are needed because the constant can appear on the left, as in
2 * x; see the small dispatch sketch right after this docstring.
'''
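# The next few lines are not from the original gist; they are a minimal sketch
# (with a made-up class name, Tracer) of why the reflected operators matter:
# when the left operand is a plain int, int's own __mul__ gives up and Python
# falls back to the right operand's __rmul__.
class Tracer(object):
    def __mul__(self, other):
        return ('mul', other)
    __rmul__ = __mul__

_t = Tracer()
assert _t * 2 == ('mul', 2)   # dispatches to Tracer.__mul__
assert 2 * _t == ('mul', 2)   # int.__mul__ returns NotImplemented, so Tracer.__rmul__ runs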
def diagnose(e):
    # print "here", e
    return e

class Node(object):
    def __init__(self, x, deriv=None):
        assert not isinstance(x, Node)
        self.x = x
        if deriv is None:
            self.deriv = lambda: diagnose(1)
        else:
            self.deriv = deriv
    def __pow__(self, other):
        v = self.x**other
        return type(self)(v, lambda: diagnose(other*(v/self.x)*(self.deriv())))
    def __add__(self, other):
        v = self.x + other
        return type(self)(v, lambda: diagnose(self.deriv()))
    def __mul__(self, other):
        v = self.x * other
        return type(self)(v, lambda: diagnose(other*(self.deriv())))
    def __repr__(self):
        return str(self.x)
    # Aliasing the reflected operators is good enough here because `other` is
    # always a plain constant: + and * are commutative, and a constant base
    # raised to a Node power never occurs in f, g, h.
    __rpow__ = __pow__
    __radd__ = __add__
    __rmul__ = __mul__
    pass
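# Hand trace (added for clarity, not part of the original): h(g(f(Node(2)))) builds
#   f: Node(2) + 2   -> Node(4),   deriv() = 1
#   g: 2 * Node(4)   -> Node(8),   deriv() = 2 * 1 = 2
#   h: Node(8) ** 3  -> Node(512), deriv() = 3 * (512 / 8) * 2 = 384
# so each Node's deriv closure already chains in the derivatives of everything
# that produced it.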
# Sanity checks: each pair is (expression, expected derivative w.r.t. the wrapped value).
for (x, td) in [(Node(1), 1), (Node(1)*2, 2), (Node(2)**3, 12), (Node(2)*3, 3),
                ((Node(2)*2)**3, 96), (((Node(2)+1)*2)**3, 216), (Node(6)**3, 108)]:
    assert td == x.deriv()
# Use Nodes to automatically find df, dg, dh.
tape = None   # no tape is needed for this attempt
attempt3 = h(g(f(Node(x_in)))).deriv()
print "Auto_Diff, Attempt 3", attempt3
assert attempt3 == true_deriv
# End of Attempt 3
'''
I must admit that Attempt 3 was confusing. The whole thing looks elegant, but
there is no backward pass!! And there is no tape. The problem is that while I
wanted to implement reverse-mode AD, I accidentally implemented forward-mode
AD instead. We can confirm this by uncommenting the print in the diagnose
function. What we are doing is just nesting all the closures, building up the
call graph, so that finally we can call deriv on the output object. The
problem is that with higher-dimensional input, forward-mode AD does a lot more
computation: it needs one sweep per input dimension to recover the full
gradient, whereas reverse mode gets the whole gradient from a single backward
sweep. What is nice is that we didn't need a tape or any module-based
registration, just plain operator overloading in a single class. We didn't
need to change any of our own code; everything was done by the Node class.
That's the power of a dynamic runtime.

Now let's try to implement reverse-mode AD, with a tape and operator
overloading. The key observation is that during the forward pass we don't want
to compute the actual derivative, we just want to store local sensitivities!!!
Once we have the sensitivities, we use the tape to do the multiplications in
reverse order. One question we must answer is whether we should store the
sensitivities themselves or thunks that compute them. We really should be as
lazy as possible, to save space etc.
'''
# Attempt 4: Reverse-mode AD with a tape and operator overloading.
class RNode(object):
    def __init__(self, x, sens=None):
        assert not isinstance(x, (Node, RNode))
        self.x = x
        if sens is None:
            # NOTE: We defer computations till they are actually needed.
            self.sens = lambda inp, output: 1
        else:
            self.sens = sens
    def __pow__(self, other):
        v = self.x**other
        return type(self)(v, lambda inp, output: other*(output/inp))
    def __add__(self, other):
        v = self.x + other
        return type(self)(v)
    def __mul__(self, other):
        v = self.x * other
        return type(self)(v, lambda inp, output: other)
    def __repr__(self):
        return str(self.x)
    # As in Node, aliasing the reflected operators is safe here because the
    # constant operand is never the base of a power in f, g, h.
    __rpow__ = __pow__
    __radd__ = __add__
    __rmul__ = __mul__
    pass
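# Hand trace (added for clarity, not part of the original): with inp = RNode(2),
#   f: RNode(2) + 2  -> RNode(4),   sens(inp, out) = 1
#   g: 2 * RNode(4)  -> RNode(8),   sens(inp, out) = 2
#   h: RNode(8) ** 3 -> RNode(512), sens(inp, out) = 3 * (out / inp)
# The tape (filled by appl below) records the inputs [RNode(2), RNode(4), RNode(8)],
# so the backward pass can pair each node's sens with the value that fed it.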
tape = []
inp = RNode(x_in)
forward_pass = appl(h, appl(g, appl(f, inp)))
def backward_pass3(tape, fp, inp):
    # Walk the tape from the output back towards the input. Each node's sens
    # describes the operation that produced it, so we feed it the value that
    # went into that operation and the value that came out of it.
    tape = tape[::-1]
    val = fp.sens(tape[0].x, fp.x)
    for v, w in zip(tape, tape[1:]):
        val *= v.sens(w.x, v.x)
    return val
attempt4 = backward_pass3(tape, forward_pass, inp)
print "Auto_Diff, Attempt 4", attempt4
assert attempt4 == true_deriv
# End of Attempt 4