@bobbicodes
Last active May 28, 2023 22:36
Micrograd lecture

Intro

Hello, my name is Andrej and I've been training deep neural networks for a bit more than a decade. In this lecture I'd like to show you what neural network training looks like under the hood. In particular, we are going to start with a blank Jupyter notebook, and by the end of this lecture we will define and train a neural net, and you'll get to see everything that goes on under the hood and exactly how that works on an intuitive level.

micrograd overview

Now specifically what I would like to do is take you through the building of micrograd. micrograd is a library that I released on GitHub about two years ago, but at the time I only uploaded the source code, and you'd have to go in by yourself and really figure out how it works. So in this lecture I will take you through it step by step and comment on all the pieces of it.

So what is micrograd, and why is it interesting?

micrograd is basically an autograd engine. Autograd is short for automatic gradient, and really what it does is it implements backpropagation. Backpropagation is an algorithm that allows you to efficiently evaluate the gradient of some kind of a loss function with respect to the weights of a neural network. What that allows us to do is iteratively tune the weights of that neural network to minimize the loss function, and therefore improve the accuracy of the network.

So backpropagation would be at the mathematical core of any modern deep neural network library, like say PyTorch or JAX.

The functionality of micrograd is, I think, best illustrated by an example. So if we just scroll down here, you'll see that micrograd basically allows you to build out mathematical expressions.

(see micrograd readme)

from micrograd.engine import Value

a = Value(-4.0)
b = Value(2.0)
c = a + b
d = a * b + b**3
c += c + 1
c += 1 + c + (-a)
d += d * 2 + (b + a).relu()
d += 3 * d + (b - a).relu()
e = c - d
f = e**2
g = f / 2.0
g += 10.0 / f
print(f'{g.data:.4f}') # prints 24.7041, the outcome of this forward pass
g.backward()
print(f'{a.grad:.4f}') # prints 138.8338, i.e. the numerical value of dg/da
print(f'{b.grad:.4f}') # prints 645.5773, i.e. the numerical value of dg/db

Here what we are doing is building out an expression with two inputs, a and b. You'll see that a and b are -4 and 2, but we are wrapping those values into this Value object that we are going to build out as part of micrograd. This Value object will wrap the numbers themselves, and then we build out a mathematical expression where a and b are transformed into c, d, and eventually e, f and g. I'm showing some of the functionality of micrograd and the operations that it supports: you can add two Value objects, you can multiply them, you can raise them to a constant power, you can offset by one, negate, squash at zero (relu), square, divide by a constant, divide by another value, etc.

And so we're building out an expression graph with these two inputs a and b, and we're creating an output value g, and micrograd will in the background build out this entire mathematical expression. It will for example know that c is also a Value, that c was the result of an addition operation, and that the child nodes of c are a and b, because c will maintain pointers to the a and b Value objects. So we'll basically know exactly how all of this is laid out. And then not only can we do what we call the forward pass, where we actually look at the value of g (that's pretty straightforward, we access it using the .data attribute, and the output of the forward pass, the value of g, turns out to be 24.7), but the big deal is that we can also take this g Value object and call .backward() on it, which will initialize backpropagation at the node g. What backpropagation is going to do is start at g, go backwards through the expression graph, and recursively apply the chain rule from calculus. What that allows us to do is evaluate the derivative of g with respect to all the internal nodes, like e, d and c, but also with respect to the inputs a and b. Then we can actually query the derivative of g with respect to a, for example: that's a.grad, which in this case happens to be 138, and the derivative of g with respect to b, which happens to be 645. This derivative, we'll see soon, is very important information, because it's telling us how a and b are affecting g through this mathematical expression. In particular, a.grad is 138, so if we slightly nudge a and make it slightly larger, 138 is telling us that g will grow, and the slope of that growth is going to be 138; the slope of growth of b is going to be 645. So that tells us how g will respond if a and b get tweaked a tiny amount in a positive direction.

Okay, now you might be confused about what this expression is that we built out here, and this expression, by the way, is completely meaningless. I just made it up; I'm just flexing about the kinds of operations that are supported by micrograd. What we actually really care about are neural networks, but it turns out that neural networks are just mathematical expressions, just like this one, only actually slightly less crazy. Neural networks are just a mathematical expression: they take the input data as an input, they take the weights of the neural network as an input, and the output is your neural net's predictions, or the loss function. We'll see this in a bit. Basically, neural networks just happen to be a certain class of mathematical expressions, but backpropagation is actually significantly more general: it doesn't care about neural networks at all, it only deals with arbitrary mathematical expressions, and then we happen to use that machinery for training neural networks.

Now one more note I would like to make at this stage is that, as you see here, micrograd is a scalar-valued autograd engine. It's working on the level of individual scalars, like -4 and 2, and we're taking neural nets and breaking them down all the way to these atoms of individual scalars and all the little pluses and timeses. It's just excessive, and so obviously you would never be doing any of this in production. It's really just done for pedagogical reasons, because it allows us to not have to deal with the n-dimensional tensors that you would use in a modern deep neural network library. This is done so that you understand and refactor out backpropagation and the chain rule, and understand neural net training. Then if you actually want to train bigger networks, you have to be using tensors, but none of the math changes; this is done purely for efficiency. We are basically taking all the scalar values and packaging them up into tensors, which are just arrays of these scalars, and then, because we have these large arrays, we're doing operations on those large arrays. That allows us to take advantage of the parallelism in a computer: all those operations can be done in parallel, and then the whole thing runs faster. But really none of the math changes; it's done purely for efficiency. So I don't think that it's pedagogically useful to be dealing with tensors from scratch, and that's why I fundamentally wrote micrograd: you can understand how things work at the fundamental level, and then speed it up later.
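As a rough illustration of that efficiency point (this comparison is mine, not from the lecture), here is the same multiply done scalar-by-scalar in a Python loop versus as one vectorized tensor operation:

```python
import numpy as np

# the same multiply on 10,000 pairs of scalars
a = [2.0] * 10_000
b = [-3.0] * 10_000

# scalar style: one little multiply at a time, in a Python loop
out_scalars = [x * y for x, y in zip(a, b)]

# tensor style: package the scalars into arrays and do one parallel op
out_tensor = np.array(a) * np.array(b)

# the math is identical; only the efficiency differs
assert np.allclose(out_scalars, out_tensor)
```

The tensor version computes exactly the same numbers; it just hands the whole array to optimized parallel code at once.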

Okay, so here's the fun part. My claim is that micrograd is what you need to train neural networks, and everything else is just efficiency. You'd think that micrograd would be a very complex piece of code, and that turns out not to be the case. If we just go to micrograd, you'll see that there are only two files here. engine.py is the actual engine; it doesn't know anything about neural nets. And nn.py is the entire neural nets library on top of micrograd. So the actual backpropagation autograd engine that gives you the power of neural networks is literally 100 lines of very simple Python, which we'll understand by the end of this lecture.

engine.py

nn.py

And then nn.py, this neural network library built on top of the autograd engine, is almost a joke: we define what a neuron is, then we define what a layer of neurons is, then we define what a multi-layer perceptron is, which is just a sequence of layers of neurons. So basically there's a lot of power that comes from only 150 lines of code, and that's all you need to understand neural network training. Everything else is just efficiency, and of course there's a lot to efficiency, but fundamentally that's all that's happening.
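To give a feel for how small such a library is, here is a rough sketch of that neuron → layer → MLP structure. The names and details here are illustrative, a stand-in using plain floats and relu rather than the actual nn.py, which works on Value objects so that gradients flow through:

```python
import random

class Neuron:
    def __init__(self, nin):
        # one weight per input, plus a bias
        self.w = [random.uniform(-1, 1) for _ in range(nin)]
        self.b = random.uniform(-1, 1)
    def __call__(self, x):
        act = sum(wi * xi for wi, xi in zip(self.w, x)) + self.b
        return max(0.0, act)  # relu nonlinearity

class Layer:
    def __init__(self, nin, nout):
        self.neurons = [Neuron(nin) for _ in range(nout)]
    def __call__(self, x):
        return [n(x) for n in self.neurons]

class MLP:
    def __init__(self, nin, nouts):
        # e.g. MLP(3, [4, 4, 1]): 3 inputs, two hidden layers of 4, 1 output
        sizes = [nin] + nouts
        self.layers = [Layer(sizes[i], sizes[i+1]) for i in range(len(nouts))]
    def __call__(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

out = MLP(3, [4, 4, 1])([1.0, 2.0, 3.0])  # a list with a single output value
```

The whole forward pass is just "call each layer in turn", which is why the real nn.py is so short.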

derivative of a simple function with one input

Okay, so now let's dive right in and implement micrograd step by step.

The first thing I'd like to do is make sure that you have a very good intuitive understanding of what a derivative is and exactly what information it gives you.

So let's start with some basic imports that I copy-paste into every Jupyter notebook:

import math
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

And let's define a scalar-valued function f(x) as follows:

def f(x):
  return 3*x**2 - 4*x + 5

I just made this up randomly; I just want a scalar-valued function that takes a single scalar x and returns a single scalar y.

And we can call this function, of course, so we can pass in say 3.0 and get 20 back:

f(3.0)
# 20.0

Now we can also plot this function to get a sense of its shape. You can tell from the mathematical expression that this is probably a parabola; it's a quadratic. So if we just create a set of scalar values that we can feed in, using for example np.arange from -5 to 5 in steps of 0.25 (so xs is just from -5 to 5, not including 5, in steps of 0.25), we can actually call this function on this numpy array as well. We get a set of ys if we call f on xs, and these ys are just the function applied to every one of these elements independently. We can plot this using matplotlib, plt.plot(xs, ys), and we get a nice parabola.

xs = np.arange(-5, 5, 0.25)
ys = f(xs)
plt.plot(xs, ys)

parabola

So previously here we fed in 3.0, which is somewhere here, and we received 20 back, which is this y coordinate. Now I'd like to think through: what is the derivative of this function at any single input point x? What is the derivative at different points x of this function?

Now if you remember back to your calculus class, you've probably derived derivatives: you take this mathematical expression, 3x² - 4x + 5, write it out on a piece of paper, apply the product rule and all the other rules, and derive the mathematical expression for the derivative of the original function. Then you could plug in different values of x and see what the derivative is. We're not going to actually do that, because no one in neural networks actually writes out the expression for the neural net. It would be a massive expression, thousands or tens of thousands of terms; no one actually derives that derivative, of course. So we're not going to take this kind of symbolic approach. Instead, what I'd like to do is look at the definition of the derivative and just make sure that we really understand what the derivative is measuring, what it's telling you about the function.

and so if we just look up derivative... https://en.wikipedia.org/wiki/Derivative#Definition

We see that, okay, this is not a very good definition of the derivative (this is a definition of what it means to be differentiable), but if you remember from your calculus, the derivative is the limit as h goes to zero of (f(x + h) - f(x)) / h.

derivatives definition

So basically what it's saying is: if you slightly bump up the point x that you're interested in, if you slightly increase it by a small number h, how does the function respond? With what sensitivity? Does the function go up or down, and by how much? That's the slope of the function, the slope of that response, at that point.

And so we can evaluate the derivative here numerically by taking a very small h. Of course, the definition would ask us to take h to zero; we're just going to pick a very small h, like 0.001.

h = 0.000001
x = 2/3  # this cell's final state: evaluating at the minimum of the parabola
(f(x + h) - f(x))/h
# 2.999378523327323e-06, i.e. the slope at x = 2/3 is approximately 0

And let's say we're interested in the point 3.0. We can look at f(x), which is of course 20, and now f(x + h): if we slightly nudge x in a positive direction, how is the function going to respond?

And just looking at this, do you expect f(x + h) to be slightly greater than 20, or slightly lower than 20? Since this is at 3, and the value is 20, if we go slightly in the positive direction, the function will respond positively, so you'd expect this to be slightly greater than 20. And by how much? That's telling you the strength of the slope, the size of the slope. So f(x + h) - f(x) is how much the function responded in the positive direction, and we have to normalize by the run, so we have rise over run to get the slope. This of course is just a numerical approximation of the slope, because we would have to make h very, very small to converge to the exact amount. Now if I add too many zeros, at some point I'm going to get an incorrect answer, because we're using floating-point arithmetic, and the representations of all these numbers in computer memory are finite, so at some point we get into trouble. But with this approach we can converge towards the right answer.
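We can see that floating-point breakdown directly by shrinking h (a quick check of my own): at x = 3 the estimate first converges toward the true slope of 14, then degrades once h gets too small for float precision:

```python
def f(x):
    return 3*x**2 - 4*x + 5

x = 3.0
for h in [0.01, 0.000001, 1e-14]:
    print(h, (f(x + h) - f(x)) / h)
# moderate h: ~14.03 (off by 3h); small h: ~14.000003; tiny h: noisy,
# because f(x + h) - f(x) loses almost all its significant digits
```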

But basically, at 3 the slope is 14. And you can see that by taking 3x² - 4x + 5 and differentiating it in our head: the derivative is 6x - 4, and then we plug in x = 3, so that's 18 - 4 = 14. So this is correct; that's at 3. Now how about the slope at, say, -3? Telling the exact value is really hard, but what is the sign of that slope? At -3, if we slightly go in the positive direction in x, the function actually goes down, and so that tells you that the slope is negative. So if we take the slope, we expect something negative: -22. And at some point, of course, the slope would be zero. For this specific function, I looked it up previously, and it's at the point 2/3. So at roughly 2/3, somewhere here, the derivative is zero: at that precise point, if we nudge in a positive direction, the function doesn't respond, it stays almost the same, and so the slope is zero.
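Putting those three cases together (my own quick check, comparing the numerical slope against the analytic derivative 6x - 4 from above):

```python
def f(x):
    return 3*x**2 - 4*x + 5

h = 0.000001
for x in [3.0, -3.0, 2/3]:
    numeric = (f(x + h) - f(x)) / h
    analytic = 6*x - 4
    print(x, numeric, analytic)
# x = 3.0  -> slope ≈ 14
# x = -3.0 -> slope ≈ -22
# x = 2/3  -> slope ≈ 0
```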

derivative of a function with multiple inputs

Okay, now let's look at a bit more complex case.

# let's get more complex
a = 2.0
b = -3.0
c = 10.0
d = a*b + c
print(d)

# 4.0

So we're going to start complexifying a bit. Now we have a function with an output variable d that is a function of three scalar inputs: a, b and c. So a, b and c are some specific values, three inputs into our expression graph, with a single output d. And if we just print d, we get 4.

And now what I'd like to do is again look at the derivatives of d with respect to a, b and c, and think through, again, the intuition of what this derivative is telling us.

So in order to evaluate this derivative, we're going to get a bit hacky here. We're going to again have a very small value of h, and then we're going to fix the inputs at some values that we're interested in. This is the point (a, b, c) at which we're going to be evaluating the derivative of d with respect to a, b and c. So those are the inputs, and d1 is that expression. Then we're going to, for example, look at the derivative of d with respect to a: we'll take a and bump it by h, and then d2 will be the exact same expression. Now we print d1 and d2, and the slope. The derivative, or slope, here will of course be (d2 - d1) / h: d2 - d1 is how much the function increased when we bumped the specific input that we're interested in by a tiny amount, and this is then normalized by h to get the slope.

h = 0.0001

# inputs
a = 2.0
b = -3.0
c = 10.0

d1 = a*b + c
a += h
d2 = a*b + c

print('d1', d1)
print('d2', d2)
print('slope', (d2 - d1)/h)

# d1 4.0
# d2 ≈ 3.9997
# slope ≈ -3.0

So if I just run this, we're going to print d1, which we know is 4. Now a will be bumped by h, so let's think through a little bit what d2 will be. Will d2 be a number slightly greater than 4, or slightly lower than 4? That's going to tell us the sign of the derivative. We're bumping a by h; b is -3, c is 10. You can just intuitively think through this derivative and what it's doing: a will be slightly more positive, but b is a negative number, so if a is slightly more positive, because b is -3, we're actually going to be adding less to d. So you'd actually expect that the value of the function will go down. And indeed, we went from 4 to about 3.9997, which tells you that the slope will be a negative number, because we went down; the exact amount of slope is -3.

And you can also convince yourself that -3 is the right answer mathematically and analytically, because if you have a*b + c, then differentiating a*b + c with respect to a gives you just b, and indeed the value of b is -3, which is the derivative that we got. So you can tell that's correct.

Now if we do this with b, if we bump b by a little bit in a positive direction, we get a different slope. What is the influence of b on the output d? If we bump b by a tiny amount in a positive direction, then because a is positive, we'll be adding more to d. And what is the sensitivity, what is the slope of that addition? It might not surprise you that this should be 2. And why 2? Because dd/db, differentiating with respect to b, gives us a, and the value of a is 2. So that also works out.

And then if c gets bumped by a tiny amount h, then of course a*b is unaffected, and c becomes slightly higher. What does that do to the function? It makes it slightly higher, because we're simply adding c, and it makes it higher by the exact same amount that we added to c. And so that tells you that the slope is 1; that is the rate at which d will increase as we scale c.
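The same numerical check can be run for all three inputs at once (a small sketch of my own; the analytic answers are dd/da = b = -3, dd/db = a = 2, dd/dc = 1):

```python
h = 0.0001

def d(a, b, c):
    return a*b + c

base = d(2.0, -3.0, 10.0)  # 4.0

# nudge one input at a time and measure the slope
slope_a = (d(2.0 + h, -3.0, 10.0) - base) / h  # ≈ -3.0
slope_b = (d(2.0, -3.0 + h, 10.0) - base) / h  # ≈  2.0
slope_c = (d(2.0, -3.0, 10.0 + h) - base) / h  # ≈  1.0
print(slope_a, slope_b, slope_c)
```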

starting the core Value object of micrograd and its visualization

Okay, so we now have some intuitive sense of what the derivative is telling you about the function, and we'd like to move towards neural networks. As I mentioned, neural networks will be pretty massive mathematical expressions, so we need some data structures that maintain these expressions, and that's what we're going to start building now. We're going to build out the Value object that I showed you in the readme page of micrograd. Let me copy-paste a skeleton of the first, very simple Value object (the code below shows the class in its eventual full form, which the rest of the lecture builds up piece by piece):

class Value:
  
  def __init__(self, data, _children=(), _op='', label=''):
    self.data = data
    self.grad = 0.0
    self._backward = lambda: None
    self._prev = set(_children)
    self._op = _op
    self.label = label

  def __repr__(self):
    return f"Value(data={self.data})"
  
  def __add__(self, other):
    out = Value(self.data + other.data, (self, other), '+')
    
    def _backward():
      self.grad += 1.0 * out.grad
      other.grad += 1.0 * out.grad
    out._backward = _backward
    
    return out

  def __mul__(self, other):
    out = Value(self.data * other.data, (self, other), '*')
    
    def _backward():
      self.grad += other.data * out.grad
      other.grad += self.data * out.grad
    out._backward = _backward
      
    return out
  
  def tanh(self):
    x = self.data
    t = (math.exp(2*x) - 1)/(math.exp(2*x) + 1)
    out = Value(t, (self, ), 'tanh')
    
    def _backward():
      self.grad += (1 - t**2) * out.grad
    out._backward = _backward
    
    return out
  
  def backward(self):
    
    topo = []
    visited = set()
    def build_topo(v):
      if v not in visited:
        visited.add(v)
        for child in v._prev:
          build_topo(child)
        topo.append(v)
    build_topo(self)
    
    self.grad = 1.0
    for node in reversed(topo):
      node._backward()


a = Value(2.0, label='a')
b = Value(-3.0, label='b')
c = Value(10.0, label='c')
e = a*b; e.label = 'e'
d = e + c; d.label = 'd'
f = Value(-2.0, label='f')
L = d * f; L.label = 'L'
L

# Value(data=-8.0)

So class Value takes a single scalar value that it wraps and keeps track of, and that's it. We can for example do Value(2.0), and then we can look at its content, and Python will internally use the __repr__ function to return this string: a Value object with data equal to 2. Now what we'd like is to not just have two Values, but to be able to add them. Currently you would get an error, because Python doesn't know how to add two Value objects, so we have to tell it. Here's addition: you have to use these special double-underscore methods in Python to define these operators for these objects. If we use the plus operator, Python will internally call a.__add__(b): b will be other and self will be a. What we're going to return is a new Value object that is just wrapping the plus of their data. But remember, data is an actual Python number, so this plus here is just the typical floating-point addition; it's not an addition of Value objects. We return a new Value, and so now a + b should work, and it should print Value(data=-1.0), because that's 2 plus -3. There we go.

Okay, let's now implement multiply, just so we can recreate this expression here. Multiply, I think it won't surprise you, will be fairly similar: instead of __add__ we're going to use __mul__, and here of course we want times. So now we can create a c Value object, which will be 10.0, and we should be able to do a times b. Let's just do a * b first: that's Value(data=-6.0) now. And by the way, I skipped over this a little bit: suppose that I didn't have the __repr__ function here; then you'd just get some kind of an ugly expression. What __repr__ is doing is providing us a way to print out a nicer-looking representation in Python, so we don't just have something cryptic; we actually see Value(data=-6.0).

So this gives us a times, and then we should be able to add c to it, because we've defined and told Python how to do mul and add. This will basically be equivalent to (a.__mul__(b)).__add__(c). Let's see if that worked: yep, that gave us 4, which is what we expect from before. And I believe we can just call them manually as well. There we go.

Okay, so now what we are missing is the connective tissue of this expression. As I mentioned, we want to keep these expression graphs, so we need to know and keep pointers about what values produce what other values. Here, for example, we are going to introduce a new variable, which we'll call _children, and by default it will be an empty tuple. Then we're actually going to keep a slightly different variable in the class, which we'll call _prev, which will be the set of children. This is how I did it in the original micrograd; looking at my code here, I can't remember exactly the reason, but I believe it was efficiency. So this _children will be a tuple for convenience, but when we actually maintain it in the class, it will be a set, I believe for efficiency. Now when we are creating a Value like this, with the constructor, _children will be empty and _prev will be the empty set, but when we're creating a Value through addition or multiplication, we're going to feed in the children of this value, which in this case is (self, other). So those are the children. Now we can do d._prev, and we'll see that the children of d are Value(data=-6.0) and Value(data=10.0): the value resulting from a times b, and the c value, which is 10.

Now there's one last piece of information we don't have: we know the children of every single value, but we don't know what operation created a value. So we need one more element here; let's call it _op, and by default this is the empty string, for leaves. Then we'll just maintain it here, and the operation will be a simple string: in the case of addition it's '+', in the case of multiplication it's '*'.

So now we not only have d._prev, we also have d._op, and we know that d was produced by an addition of those two values. So now we have the full mathematical expression, and we're building out this data structure, and we know exactly how each value came to be: by what expression and from what other values.
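Just the bookkeeping described in this section can be isolated into a minimal sketch (MiniValue is a name of mine, a stripped-down stand-in for the full class above, with no gradients yet):

```python
class MiniValue:
    def __init__(self, data, _children=(), _op=''):
        self.data = data
        self._prev = set(_children)  # pointers to the values that produced this one
        self._op = _op               # the operation that produced it ('' for leaves)

    def __add__(self, other):
        return MiniValue(self.data + other.data, (self, other), '+')

    def __mul__(self, other):
        return MiniValue(self.data * other.data, (self, other), '*')

a = MiniValue(2.0)
b = MiniValue(-3.0)
c = MiniValue(10.0)
d = a*b + c
print(d._op)                            # '+'
print(sorted(v.data for v in d._prev))  # [-6.0, 10.0]
```

So every operation returns a new node that remembers its inputs and its operation, which is exactly the expression graph.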

Now, because these expressions are about to get quite a bit larger, we'd like a way to nicely visualize the expressions that we're building out. For that I'm going to copy-paste a bunch of slightly scary code that's going to visualize these expression graphs for us. Here's the code:

from graphviz import Digraph

def trace(root):
  # builds a set of all nodes and edges in a graph
  nodes, edges = set(), set()
  def build(v):
    if v not in nodes:
      nodes.add(v)
      for child in v._prev:
        edges.add((child, v))
        build(child)
  build(root)
  return nodes, edges

def draw_dot(root):
  dot = Digraph(format='svg', graph_attr={'rankdir': 'LR'}) # LR = left to right
  
  nodes, edges = trace(root)
  for n in nodes:
    uid = str(id(n))
    # for any value in the graph, create a rectangular ('record') node for it
    dot.node(name = uid, label = "{ %s | data %.4f | grad %.4f }" % (n.label, n.data, n.grad), shape='record')
    if n._op:
      # if this value is a result of some operation, create an op node for it
      dot.node(name = uid + n._op, label = n._op)
      # and connect this node to it
      dot.edge(uid + n._op, uid)

  for n1, n2 in edges:
    # connect n1 to the op node of n2
    dot.edge(str(id(n1)), str(id(n2)) + n2._op)

  return dot
draw_dot(L)

I'll explain it in a bit, but first let me just show you what this code does. Basically, it creates a new function, draw_dot, that we can call on some root node, and it's going to visualize it. So if we call draw_dot on d, which is this final value here, a times b plus c, it creates something like this:

(image: expression graph of d = a*b + c)

So this is d, and you see that a times b creates an intermediate value, and adding c gives us this output node d. That's draw_dot of d. I'm not going to go through this in complete detail; you can take a look at Graphviz and its API. Graphviz is open-source graph visualization software, and what we're doing here is building out this graph using the Graphviz API. You can see that trace is a helper function that enumerates all of the nodes and edges in the graph, so it just builds a set of all the nodes and edges. Then we iterate over all the nodes, and we create special node objects for them using dot.node, and we also create edges using dot.edge. The only thing that's slightly tricky here is that I add these fake nodes, which are the operation nodes. For example, this node here is just a plus node; I create these special op nodes and connect them accordingly. These op nodes, of course, are not actual nodes in the original graph; they're not Value objects. The only Value objects here are the things in rectangles; those are the actual Value objects, or representations thereof. The op nodes are just created in this draw_dot routine so that it looks nice.

Let's also add labels to these graphs, just so we know which variables are where. Let's create a special label attribute, label='' by default, and save it in each node. Then here we set the labels a, b, c. And let's create e = a * b with e.label = 'e' for the intermediate value, and d will be e + c, with d.label = 'd'. Nothing really changes; I just added this new e variable, and then, when we are printing, I print the label with %s and n.label. So now we have the label on the left here: a times b creates e, and then e plus c creates d, just like we have it here.

And finally, let's make this expression just one layer deeper. d will not be the final output node; instead, after d we are going to create a new Value object called f (we're going to start running out of variables soon). f will be -2.0, and its label will of course just be 'f'. Then L, capital L, will be the output of our graph: L = d * f. So L will be -8; that's the output. Now we don't just draw d, we draw L. And somehow the label of L was undefined, oops; the label has to be explicitly given to it. There we go, L is the output.

So let's quickly recap what we've done so far. We are able to build out mathematical expressions using only plus and times so far; they are scalar-valued along the way; and we can do the forward pass and build out a mathematical expression. We have multiple inputs here, a, b, c and f, going into a mathematical expression that produces a single output L, and this here is visualizing the forward pass. The output of the forward pass is -8; that's the value. Now what we'd like to do next is run backpropagation. In backpropagation, we are going to start here at the end, go backwards, and calculate the gradient along all these intermediate values. Really, what we're computing for every single value here is the derivative of the output node L with respect to that value. So the derivative of L with respect to L is just 1, and then we're going to derive the derivative of L with respect to f, with respect to d, with respect to c, with respect to e, with respect to b, and with respect to a. In the neural network setting, you'd be very interested in the derivative of, basically, this loss function L with respect to the weights of the neural network. Here, of course, we just have the variables a, b, c and f, but some of these will eventually represent the weights of a neural net, and so we'll need to know how those weights are impacting the loss function. We'll be interested in the derivative of the output with respect to some of its leaf nodes; those leaf nodes will be the weights of the neural net, and the other leaf nodes, of course, will be the data itself. But usually we will not want or use the derivative of the loss function with respect to the data, because the data is fixed; the weights will be iterated on using the gradient information.

so next we are going to create a variable inside the value class that maintains the derivative of l with respect to that value and we will call this variable grad so there's a data and there's a self.grad and initially it will be zero and remember that zero basically means no effect so at initialization we're assuming that every value does not affect the output right because if the gradient is zero that means that changing this variable is not changing the loss function so by default we assume that the gradient is zero and then now that we have grad and it's 0.0 we are going to be able to visualize it here after data so here grad is formatted with %.4f and this will be in that graph and now we are going to be showing both the data and the grad initialized at zero
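as a sketch, the only change described here is one extra attribute in the constructor:

```python
class Value:
    # sketch: same constructor as before, now maintaining a grad attribute
    def __init__(self, data, _children=(), _op='', label=''):
        self.data = data
        self.grad = 0.0              # derivative of L w.r.t. this value;
                                     # zero by default, meaning "no effect"
        self._prev = set(_children)
        self._op = _op
        self.label = label
```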

and we are just about getting ready to calculate the back propagation and of course this grad again as i mentioned is representing the derivative of the output in this case l with respect to this value so this is the derivative of l with respect to f, with respect to d, and so on

so let's now fill in those gradients and actually do back propagation manually

manual backpropagation example #1: simple expression

so let's start filling in these gradients and start all the way at the end as i mentioned here first we are interested to fill in this gradient here so what is the derivative of l with respect to l in other words if i change l by a tiny amount h how much does l change? it changes by h. so it's proportional and therefore the derivative will be one we can of course measure or estimate these gradients numerically just like we've seen before

# nudge each leaf node a small step in the direction of its gradient
a.data += 0.01 * a.grad
b.data += 0.01 * b.grad
c.data += 0.01 * c.grad
f.data += 0.01 * f.grad

# re-run the forward pass
e = a * b
d = e + c
L = d * f

print(L.data)

# -7.286496

so if i take this expression and i create a def lol function here

def lol():
  
  h = 0.001
  
  a = Value(2.0, label='a')
  b = Value(-3.0, label='b')
  c = Value(10.0, label='c')
  e = a*b; e.label = 'e'
  d = e + c; d.label = 'd'
  f = Value(-2.0, label='f')
  L = d * f; L.label = 'L'
  L1 = L.data
  
  a = Value(2.0, label='a')
  b = Value(-3.0, label='b')
  b.data += h  # nudge b by h to measure dL/db
  c = Value(10.0, label='c')
  e = a*b; e.label = 'e'
  d = e + c; d.label = 'd'
  f = Value(-2.0, label='f')
  L = d * f; L.label = 'L'
  L2 = L.data
  
  print((L2 - L1)/h)
  
lol()

# -3.9999999999995595

and put this here now the reason i'm creating a wrapper function lol here is because i don't want to pollute or mess up the global scope here this is just kind of like a little staging area and as you know in python all of these will be local variables to this function so i'm not changing any of the global scope here so here l1 will be l and then copy pasting this expression we're going to add a small amount h in for example a right and this would be measuring the derivative of l with respect to a so here this will be l2 and then we want to print this derivative so print l2 minus l1 which is how much l changed and then normalize it by h so this is the rise over run

and we have to be careful because l is a value node so we actually want its data so that these are floats dividing by h and this should print the derivative of l with respect to a because a is the one that we bumped a little bit by h so what is the derivative of l with respect to a it's six okay and obviously if we change l itself by h then effectively this looks really awkward but changing l by h you see the derivative here is 1. that's kind of like the base case of what we are doing here so basically we can now come up here and we can manually set l.grad to one this is our manual back propagation l.grad is one and let's redraw and we'll see that we filled in grad as 1 for l

we're now going to continue the back propagation so let's here look at the derivatives of l with respect to d and f let's do d first so what we are interested in if i create a markdown node here is we'd like to know basically we have that l is d times f and we'd like to know what is dl by dd and if you know your calculus l is d times f so what is dl by dd it would be f and if you don't believe me we can also just derive it because the proof would be fairly straightforward we go to the definition of the derivative which is f of x plus h minus f of x divide h as the limit as h goes to zero of this kind of expression so when we have l is d times f then increasing d by h would give us the output of d plus h times f that's basically f of x plus h right minus d times f and then divide h and symbolically expanding out here we would have basically d times f plus h times f minus d times f divide h and then you see how the df minus df cancels so you're left with h times f divide h which is f

so in the limit as h goes to zero of you know derivative definition we just get f in the case of d times f so symmetrically dl by d f will just be d so what we have is that f dot grad we see now is just the value of d which is 4. and we see that d dot grad is just uh the value of f and so the value of f is negative two so we'll set those manually let me erase this markdown node and then let's redraw what we have okay

and let's just make sure that these were correct so we seem to think that dl by dd is negative two so let's double check um let me erase this plus h from before and now we want the derivative with respect to f so let's just come here when i create f and let's do a plus h here and this should print the derivative of l with respect to f so we expect to see four yeah and this is four up to floating point funkiness and then dl by dd should be f which is negative two grad is negative two so if we again come here and we change d.data plus equals h right here so we've added a little h and then we see how l changed and we expect to print negative two there we go
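the same numerical check can be sketched with plain floats, without the value class, by bumping f and measuring rise over run:

```python
# estimate dL/df numerically: bump f by a small h and re-run the forward pass
h = 0.001
a, b, c, f = 2.0, -3.0, 10.0, -2.0
L1 = (a*b + c) * f
L2 = (a*b + c) * (f + h)
print((L2 - L1) / h)  # ~4.0, which is the value of d, as claimed
```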

so we've numerically verified that what we're doing here is kind of like an inline gradient check gradient check is when we are deriving this back propagation and getting the derivative with respect to all the intermediate results and then the numerical gradient is just you know estimating it using a small step size

now we're getting to the crux of backpropagation so this will be the most important node to understand because if you understand the gradient for this node you understand all of back propagation and all of training of neural nets basically

so we need to derive dl by dc in other words the derivative of l with respect to c because we've computed all these other gradients already now we're coming here and we're continuing the back propagation manually so we want dl by dc and then we'll also derive dl by de now here's the problem how do we derive dl by dc we actually know the derivative of l with respect to d so we know how l is sensitive to d but how is l sensitive to c so if we wiggle c how does that impact l through d so we know dl by dd and we also here know how c impacts d and so just very intuitively if you know the impact that c is having on d and the impact that d is having on l then you should be able to somehow put that information together to figure out how c impacts l and indeed this is what we can actually do so in particular we know just concentrating on d first let's look at what is the derivative basically of d with respect to c so in other words what is dd by dc so here we know that d is c plus e that's what we know and now we're interested in dd by dc if you just know your calculus again and you remember that differentiating c plus e with respect to c you know that that gives you 1.0 and we can also go back to the basics and derive this because again we can go to our f of x plus h minus f of x divide by h that's the definition of a derivative as h goes to zero and so here focusing on c and its effect on d we can basically do the f of x plus h will be c is incremented by h plus e that's the first evaluation of our function minus c plus e and then divide h and so what is this just expanding this out this will be c plus h plus e minus c minus e divide h and then you see here how c minus c cancels e minus e cancels we're left with h over h which is 1.0

and so by symmetry also d d by d e will be 1.0 as well

so basically the derivative of a sum expression is very simple and this is the local derivative so i call this the local derivative because we have the final output value all the way at the end of this graph and we're now like a small node here and this is a little plus node and the little plus node doesn't know anything about the rest of the graph that it's embedded in all it knows is that it did a plus it took a c and an e added them and created d and this plus node also knows the local influence of c on d or rather the derivative of d with respect to c and it also knows the derivative of d with respect to e but that's not what we want that's just a local derivative what we actually want is dl by dc and l is here just one step away but in a general case this little plus node could be embedded in like a massive graph so again we know how l impacts d and now we know how c and e impact d how do we put that information together to derive dl by dc?

and the answer of course is the chain rule in calculus

and so um i pulled up the chain rule here from wikipedia and i'm going to go through this very briefly so the chain rule on wikipedia sometimes can be very confusing and calculus can be very confusing like this is the way i learned the chain rule and it was very confusing like what is happening it's just complicated so i like this expression much better:

if a variable z depends on a variable y which itself depends on the variable x then z depends on x as well obviously through the intermediate variable y in this case the chain rule is expressed as if you want dz by dx then you take the dz by dy and you multiply it by d y by dx so the chain rule fundamentally is telling you how we chain these uh derivatives together correctly so to differentiate through a function composition we have to apply a multiplication of those derivatives

so that's really what chain rule is telling us and there's a nice little intuitive explanation here which i also think is kind of cute the chain rule says that knowing the instantaneous rate of change of z with respect to y and y relative to x allows one to calculate the instantaneous rate of change of z relative to x as a product of those two rates of change simply the product of those two

so here's a good one if a car travels twice as fast as a bicycle and the bicycle is four times as fast as a walking man then the car travels two times four eight times as fast as the man and so this makes it very clear that the correct thing to do sort of is to multiply

so cars twice as fast as bicycle and bicycle is four times as fast as man so the car will be eight times as fast as the man and so we can take these intermediate rates of change if you will and multiply them together and that justifies the chain rule intuitively
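in numbers, the car/bicycle/man example is just a multiplication of the two rates of change:

```python
dz_dy = 2.0            # car is twice as fast as the bicycle
dy_dx = 4.0            # bicycle is four times as fast as the walking man
dz_dx = dz_dy * dy_dx  # chain rule: the rates of change multiply
print(dz_dx)  # 8.0: car is eight times as fast as the man
```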

so looking at the chain rule here really what it means for us is there's a very simple recipe for deriving what we want which is dl by dc and what we have so far is we know what is the impact of d on l so we know dl by dd the derivative of l with respect to d we know that that's negative two and now because of this local reasoning that we've done here we know dd by dc so how does c impact d and in particular this is a plus node so the local derivative is simply 1.0 it's very simple and so the chain rule tells us that dl by dc going through this intermediate variable will just be simply dl by dd times dd by dc that's the chain rule so this is identical to what's happening here except z is our l y is our d and x is our c

so we literally just have to multiply these and because these local derivatives like dd by dc are just one we basically just copy over dl by dd because this is just times one so because dl by dd is negative two what is dl by dc?

well it's the local gradient 1.0 times dl by dd which is negative two so literally what a plus node does you can look at it that way is it literally just routes the gradient because the plus node's local derivatives are just one and so in the chain rule one times dl by dd is just dl by dd and so that derivative just gets routed to both c and to e in this case so basically we have that c.grad since that's the one we looked at is negative two times one which is negative two and in the same way by symmetry e.grad will be negative two that's the claim so we can set those we can redraw and you see how we just assign negative two and negative two so this backpropagating signal which is carrying the information of what is the derivative of l with respect to all the intermediate nodes we can imagine it almost like flowing backwards through the graph and a plus node will simply distribute the derivative to all the leaf nodes sorry to all the children nodes of it

so this is the claim and now let's verify it so let me remove the plus h here from before and now instead what we're going to do is we're going to increment c so c.data will be incremented by h and when i run this we expect to see negative 2, negative 2, and then of course for e so e.data plus equals h and we expect to see negative 2. simple

so those are the derivatives of these internal nodes and now we're going to recurse our way backwards again and we're again going to apply the chain rule so here we go our second application of chain rule and we will apply it all the way through the graph we just happen to only have one more node remaining we have that dl by de as we have just calculated is negative two so we know that so we know the derivative of l with respect to e and now we want dl by da right and the chain rule is telling us that that's just dl by de negative 2 times the local gradient so what is the local gradient basically de by da we have to look at that so i'm a little times node inside a massive graph and i only know that i did a times b and i produced an e so now what is de by da and de by db that's the only thing that i sort of know about that's my local gradient so because we have that e is a times b we're asking what is de by da and of course we just did that here we had a times so i'm not going to rederive it but if you want to differentiate this with respect to a you'll just get b right the value of b which in this case is negative 3.0 so basically we have that dl by da well let me just do it right here we have that a.grad and we are applying the chain rule here is dl by de which we see here is negative two times what is de by da it's the value of b which is negative 3. that's it

and then we have b.grad which is again dl by de which is negative 2 just the same way times what is de by db it's the value of a which is 2.0 so these are our claimed derivatives let's redraw and we see here that a.grad turns out to be 6 because that is negative 2 times negative 3 and b.grad is negative 2 times 2 which is negative 4.

so those are our claims let's delete this and let's verify them

we have a here a.data plus equals h so the claim is that a.grad is six let's verify six and we have b.data plus equals h so nudging b by h and looking at what happens we claim it's negative four and indeed it's negative four plus or minus again floating point oddness and that was the manual back propagation all the way from here to all the leaf nodes and we've done it piece by piece and really all we've done is as you saw we iterated through all the nodes one by one and locally applied the chain rule we always know what is the derivative of l with respect to this little output and then we look at how this output was produced this output was produced through some operation and we have the pointers to the children nodes of this operation and so in this little operation we know what the local derivatives are and we just multiply them onto the derivative always so we just go through and recursively multiply on the local derivatives and that's what back propagation is it's just a recursive application of the chain rule backwards through the computation graph
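the whole manual backward pass can be summarized with plain floats (the hypothetical `_grad` names stand in for the .grad attributes on the value objects):

```python
# forward pass
a, b, c, f = 2.0, -3.0, 10.0, -2.0
e = a * b   # -6.0
d = e + c   #  4.0
L = d * f   # -8.0

# backward pass: one chain-rule step per node, end to start
L_grad = 1.0           # dL/dL, the base case
f_grad = d * L_grad    # dL/df = d          ->  4.0
d_grad = f * L_grad    # dL/dd = f          -> -2.0
c_grad = 1.0 * d_grad  # + routes the grad  -> -2.0
e_grad = 1.0 * d_grad  # + routes the grad  -> -2.0
a_grad = b * e_grad    # dL/da = b * dL/de  ->  6.0
b_grad = a * e_grad    # dL/db = a * dL/de  -> -4.0
```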

let's see this power in action:

preview of a single optimization step

just very briefly what we're going to do is we're going to nudge our inputs to try to make l go up so in particular what we're doing is we want a.data we're going to change it and if we want l to go up that means we just have to go in the direction of the gradient so a should increase in the direction of the gradient by some small step amount this is the step size and we don't just want this for a but also for b also for c also for f those are leaf nodes which we usually have control over and if we nudge in the direction of the gradient we expect a positive influence on l so we expect l to go up positively so it should become less negative it should go up to say negative six or something like that it's hard to tell exactly

and we'd have to rewrite the forward pass so let me just do that here this would be the forward pass f would be unchanged this is effectively the forward pass and now if we print l.data we expect because we nudged all the inputs in the direction of the gradient a less negative l we expect it to go up so maybe it's negative six or so let's see what happens okay negative seven and this is basically one step of an optimization that we'll end up running and really this gradient just gives us power because we know how to influence the final outcome and this will be extremely useful for training neural nets as well as you'll see

so now i would like to do one more uh example of manual backpropagation using a bit more complex and uh useful example we are going to back propagate through a neuron.

manual backpropagation example #2: a neuron

so we want to eventually build up neural networks and in the simplest case these are multi-layer perceptrons as they're called so this is a two layer neural net and it's got these hidden layers made up of neurons and these neurons are fully connected to each other now biologically neurons are very complicated devices but we have very simple mathematical models of them and so this is a very simple mathematical model of a neuron:

cs231n neuron

you have some inputs, the x's, and then you have these synapses that have weights on them so the w's are weights and then the synapse interacts with the input to this neuron multiplicatively so what flows to the cell body of this neuron is w times x but there's multiple inputs so there's many w times x's flowing into the cell body the cell body then also has some bias so this is kind of like the innate trigger happiness of this neuron so this bias can make it a bit more trigger happy or a bit less trigger happy regardless of the input but basically we're taking all the w times x of all the inputs adding the bias and then we take it through an activation function and this activation function is usually some kind of a squashing function like a sigmoid or tanh or something like that so as an example we're going to use the tanh

import numpy as np
import matplotlib.pyplot as plt

plt.plot(np.arange(-5,5,0.2), np.tanh(np.arange(-5,5,0.2))); plt.grid();

plot

image

in this example numpy has a np.tanh so we can call it on a range and we can plot it this is the tanh function and you see that the inputs as they come in get squashed on the y coordinate here so right at zero we're going to get exactly zero and then as you go more positive in the input then you'll see that the function will only go up to one and then plateau out and so if you pass in very positive inputs we're gonna cap it smoothly at one and on the negative side we're gonna cap it smoothly to negative one so that's tanh and that's the squashing function or an activation function and what comes out of this neuron is just the activation function applied to the dot product of the weights and the inputs

so let's write one out um i'm going to copy paste because i don't want to type too much

# inputs x1,x2
x1 = Value(2.0, label='x1')
x2 = Value(0.0, label='x2')
# weights w1,w2
w1 = Value(-3.0, label='w1')
w2 = Value(1.0, label='w2')
# bias of the neuron
b = Value(6.8813735870195432, label='b')
# x1*w1 + x2*w2 + b
x1w1 = x1*w1; x1w1.label = 'x1*w1'
x2w2 = x2*w2; x2w2.label = 'x2*w2'
x1w1x2w2 = x1w1 + x2w2; x1w1x2w2.label = 'x1*w1 + x2*w2'
n = x1w1x2w2 + b; n.label = 'n'
o = n.tanh(); o.label = 'o'

draw_dot(o)

but okay so here we have the inputs x1 x2 so this is a two-dimensional neuron so two inputs are going to come in then these are thought of as the weights of this neuron weights w1 w2 and these weights again are the synaptic strengths for each input and this is the bias of the neuron b and now what we want to do is according to this model we need to multiply x1 times w1 and x2 times w2 and then we need to add the bias on top of it and it gets a little messy here but all we are trying to do is x1w1 plus x2w2 plus b and i'm doing it in small steps so that we actually have pointers to all these intermediate nodes so we have the x1w1 variable and the x2w2 variable and i'm also labeling them

so n is now the raw cell body activation without the activation function for now and this should be enough to basically plot it

so draw dot of n gives us x1 times w1 x2 times w2 being added then the bias gets added on top of this and this n is this sum

so we're now going to take it through an activation function and let's say we use the tanh so that we produce the output so what we'd like to do here is we'd like to do the output and i'll call it o which is n.tanh() okay but we haven't yet written the tanh now the reason that we need to implement another function here is that tanh is a hyperbolic function and we've only so far implemented a plus and a times and you can't make a tanh out of just pluses and times you also need exponentiation so tanh is this kind of a formula here you can use either one of these and you see that there's exponentiation involved which we have not implemented yet for our little value node here so we're not going to be able to produce tanh yet and we have to go back up and implement something like it

now one option here is we could actually implement exponentiation right and we could return the exp of a value instead of a tanh of a value because if we had exp then we have everything else that we need since we know how to add and we know how to multiply so we'd be able to create tanh if we knew how to exp but for the purposes of this example i specifically wanted to show you that we don't necessarily need to have the most atomic pieces in this value object we can actually create functions at arbitrary points of abstraction they can be complicated functions but they can be also very very simple functions like a plus and it's totally up to us the only thing that matters is that we know how to differentiate through any one function so we take some inputs and we make an output it can be an arbitrarily complex function as long as you know how to create the local derivative if you know the local derivative of how the inputs impact the output then that's all you need so we're going to cluster up all of this expression and we're not going to break it down to its atomic pieces we're just going to directly implement tanh so let's do that

def tanh, and then out will be a value of, and we need this expression here so let me actually copy paste, let's grab n which is self.data and then this i believe is the tanh: math.exp of 2n minus one over math.exp of 2n plus one, maybe i can call this x just so that it matches exactly okay and now this will be t and the children of this node, there's just one child and i'm wrapping it in a tuple so this is a tuple of one object just self and here the name of this operation will be tanh and we're going to return that okay
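put together, the tanh method described above looks roughly like this, sketched inside a minimal version of the value class:

```python
import math

class Value:
    # minimal sketch: just enough of the value class to show the tanh method
    def __init__(self, data, _children=(), _op='', label=''):
        self.data = data
        self.grad = 0.0
        self._prev = set(_children)
        self._op = _op
        self.label = label

    def tanh(self):
        x = self.data
        t = (math.exp(2*x) - 1) / (math.exp(2*x) + 1)
        # one child (self), wrapped in a tuple; the op name is 'tanh'
        out = Value(t, (self,), 'tanh')
        return out
```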

so now value should be implementing tanh and now we can scroll all the way down here and we can actually do n.tanh() and that's going to return the tanh'd output of n and now we should be able to draw_dot of o not of n so let's see how that worked there we go n went through tanh to produce this output

image

so now tanh is sort of our little micrograd-supported node here as an operation and as long as we know the derivative of tanh then we'll be able to back propagate through it

now let's see this tanh in action currently it's not squashing too much because the input to it is pretty low so if the bias was increased to say eight then we'll see that what's flowing into the tanh now is two and tanh is squashing it to 0.96 so we're already hitting the tail of this tanh and it will sort of smoothly go up to 1 and then plateau out over there

okay so now i'm going to do something slightly strange i'm going to change this bias from 8 to this number 6.88 etc and i'm going to do this for specific reasons because we're about to start back propagation and i want to make sure that our numbers come out nice they're not like very crazy numbers they're nice numbers that we can sort of understand in our head let me also add an o label, o is short for output here, so that's o

okay so 0.88 flows into tanh comes out 0.7 and so on so now we're going to do back propagation and we're going to fill in all the gradients so what is the derivative of o with respect to all the inputs here and of course in the typical neural network setting what we really care about the most is the derivative with respect to the weights specifically w2 and w1 because those are the weights that we're going to be changing as part of the optimization and the other thing that we have to remember is here we have only a single neuron but in a neural net you typically have many neurons and they're connected so this is only one small neuron, a piece of a much bigger puzzle and eventually there's a loss function that sort of measures the accuracy of the neural net and we're back propagating with respect to that accuracy and trying to increase it

so let's start back propagation here at the end

what is the derivative of o with respect to o the base case we always know is that the gradient is just 1.0 so let me fill it in and redraw, clearing this output here, okay so now when we draw o we'll see that o.grad is one so now we're going to back propagate through the tanh so to back propagate through tanh we need to know the local derivative of tanh so if we have that o is tanh of n then what is do by dn?

now what you could do is you could come here and you could take this expression and you could do your calculus derivative taking and that would work but we can also just scroll down wikipedia here into a section that hopefully tells us that the derivative d by dx of tanh of x is any of these i like this one: 1 minus tanh of x squared so basically what this is saying is that do by dn is 1 minus tanh of n squared and we already have tanh of n that's just o so it's one minus o squared so o is the output here so the output is this number, o.data, and then what this is saying is that do by dn is 1 minus this squared so one minus o.data squared is 0.5 conveniently so the local derivative of this tanh operation here is 0.5 and so that would be do by dn so we can fill in that n.grad is 0.5 we'll just fill in so this is exactly 0.5 one half so now we're going to continue the back propagation this is 0.5 and this is a plus node so what is backprop going to do here and if you remember our previous example a plus is just a distributor of gradient so this gradient will simply flow to both of these equally and that's because the local derivative of this operation is one for every one of its nodes so 1 times 0.5 is 0.5 so therefore we know that this node here, x1w1 + x2w2, its grad is just 0.5 and we know that b.grad is also 0.5 so let's set those and let's draw

so 0.5 continuing we have another plus 0.5 again we'll just distribute it so 0.5 will flow to both of these so we can set x1w1.grad and x2w2.grad to 0.5 as well and let's redraw pluses are my favorite operations to back propagate through because it's very simple so now what's flowing into these expressions is 0.5 and so really again keep in mind what the derivative is telling us at every point in time along here this is saying that if we want the output of this neuron to increase then the influence of these expressions on the output is positive, both of them are positive contributions to the output

so now back propagating to x2 and w2 first this is a times node so we know that the local derivative is you know the other term so if we want to calculate x2.grad then can you think through what it's going to be so x2.grad will be w2.data times x2w2.grad right and w2.grad will be x2.data times x2w2.grad

right so that's the local piece of chain rule let's set them and let's redraw so here we see that the gradient on our weight 2 is 0 because x2.data was 0 right but x2 will have the gradient 0.5 because w2.data here was 1. and so what's interesting here right is because the input x2 was 0 then because of the way the times works of course this gradient will be zero and think about intuitively why that is the derivative always tells us the influence of this on the final output if i wiggle w2 how is the output changing it's not changing because we're multiplying by zero so because it's not changing there's no derivative and zero is the correct answer

and let's do it here, 0.5 should come here and flow through this times and so we'll have that x1.grad is, can you think through a little bit what this should be, the local derivative of times with respect to x1 is going to be w1 so w1.data times x1w1.grad and w1.grad will be x1.data times x1w1.grad

let's see what those came out to be

so this is 0.5 so x1.grad would be negative 1.5 and w1.grad would be 1. and we've back propagated through this expression these are the actual final derivatives so if we want this neuron's output to increase we know that w2 has no gradient, w2 doesn't actually matter to this neuron right now but this weight w1 should go up so if this weight goes up then this neuron's output would have gone up and proportionally because the gradient is one
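the neuron's backward pass we just did by hand can be checked with plain floats (the `_grad` names are hypothetical stand-ins for the .grad attributes):

```python
import math

# forward pass of the neuron
x1, x2 = 2.0, 0.0
w1, w2 = -3.0, 1.0
b = 6.8813735870195432
n = x1*w1 + x2*w2 + b   # ~0.8814
o = math.tanh(n)        # ~0.7071

# backward pass
o_grad = 1.0
n_grad = (1 - o**2) * o_grad    # tanh local derivative -> ~0.5
b_grad = n_grad                 # + node routes the gradient
x1w1_grad = x2w2_grad = n_grad  # another + node
x2_grad = w2 * x2w2_grad        #  0.5
w2_grad = x2 * x2w2_grad        #  0.0: the input is zero
x1_grad = w1 * x1w1_grad        # -1.5
w1_grad = x1 * x1w1_grad        #  1.0
```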

implementing the backward function for each operation

okay so doing the back propagation manually is obviously ridiculous so we are now going to put an end to this suffering and we're going to see how we can implement uh the backward pass a bit more automatically we're not going to be doing all of it manually out here

it's now pretty obvious to us by example how these pluses and times backpropagate gradients so let's go up to the value object and we're going to start codifying what we've seen in the examples below

so we're going to do this by storing a special self._backward, underscore backward, and this will be a function which is going to do that little piece of chain rule at each little node that took inputs and produced an output we're going to store how we chain the output's gradient into the inputs' gradients so by default this will be a function that doesn't do anything and you can also see that here in the value class in micrograd this backward function by default doesn't do anything it's an empty function and that would be the case for example for a leaf node for a leaf node there's nothing to do but now when we're creating these out values these out values are an addition of self and other and so we will want to set out's _backward to be the function that propagates the gradient

so let's define what should happen and we're going to store it in a closure

let's define what should happen when we call out's _backward. for an addition, our job is to take out.grad and propagate it into self's grad and other's grad, so basically we want to set self.grad to something, and we want to set other.grad to something. and the way we saw below how chain rule works, we want to take the local derivative times the sort of global derivative, i should call it, which is the derivative of the final output of the expression with respect to out's data (with respect to out)

so the local derivative of self in an addition is 1.0, so it's just 1.0 times out.grad, that's the chain rule, and other.grad will be 1.0 times out.grad, and what you're basically seeing here is that out.grad will simply be copied onto self's grad and other's grad, as we saw happens for an addition operation. so we're going to later call this function to propagate the gradient, having done an addition
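a minimal sketch of what this addition case might look like (a stripped-down version of micrograd's Value class, keeping only what's needed here; note it already uses += for accumulation, a fix the lecture arrives at further below when a node is reused):

```python
class Value:
    def __init__(self, data, _children=()):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None  # leaf nodes: nothing to do by default
        self._prev = set(_children)

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def _backward():
            # local derivative of addition w.r.t. either input is 1.0,
            # so out.grad is simply routed into both inputs (chain rule)
            self.grad += 1.0 * out.grad
            other.grad += 1.0 * out.grad
        out._backward = _backward
        return out

a = Value(2.0)
b = Value(-3.0)
c = a + b
c.grad = 1.0   # base case
c._backward()
print(a.grad, b.grad)  # 1.0 1.0
```

note that _backward is stored as a closure over self, other and out, and only called later.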

let's now do multiplication

we're going to also define a _backward, and we're going to set out._backward to be this backward, and we want to chain out.grad into self.grad and other.grad, and this will be a little piece of chain rule for multiplication. so what should this be, can you think it through?

so what is the local derivative here? the local derivative was other.data, so it's other.data times out.grad, that's chain rule, and here we have self.data times out.grad, that's what we've been doing

and finally here for tanh, self._backward, and then we want to set out._backward to be just backward, and here we need to backpropagate: we have out.grad and we want to chain it into self.grad, and self.grad will be the local derivative of this operation that we've done here, which is tanh. and so we saw that the local gradient is 1 minus the tanh of x squared, which here is t, that's the local derivative, because t is the output of this tanh, so 1 minus t squared is the local derivative, and then the gradient has to be multiplied on, because of the chain rule
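the tanh case can be sketched the same way (again a minimal Value, with only the tanh method shown):

```python
import math

class Value:
    def __init__(self, data, _children=()):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None
        self._prev = set(_children)

    def tanh(self):
        t = math.tanh(self.data)
        out = Value(t, (self,))
        def _backward():
            # local derivative of tanh is 1 - tanh(x)**2, and t already holds tanh(x)
            self.grad += (1 - t**2) * out.grad
        out._backward = _backward
        return out

x = Value(0.0)
o = x.tanh()
o.grad = 1.0
o._backward()
print(x.grad)  # 1.0, since tanh'(0) = 1
```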

so out.grad is chained through the local gradient into self.grad

and that should be basically it. so we're going to redefine our value node, we're going to swing all the way down here, and we're going to redefine our expression, make sure that all the grads are zero, okay, but now we don't have to do this manually anymore, we are going to basically be calling the _backward functions in the right order

so first we want to call o._backward. so o was the outcome of tanh, right, so calling o's _backward will be this function, this is what it will do. now we have to be careful, because there's a times out.grad, and out.grad, remember, is initialized to zero, so here we see grad zero. so as a base case we need to set o.grad to 1.0, to initialize this with 1, and then once this is 1 we can call o._backward, and what that should do is it should propagate this grad through tanh, so the local derivative times the global derivative, which is initialized at one, so this should... uh oh...

so i thought about redoing it, but i figured i should just leave the error in here because it's pretty funny. why is 'NoneType' object not callable? it's because i screwed up: we're trying to save these functions, so this part is correct, but here we don't want to call the function, because calling it returns none, these functions return none, we just want to store the function. so let me redefine the value object, and then we're going to come back in, redefine the expression, draw a dot, everything is great, o.grad is one, and now this should work, of course. okay, so after o._backward, this grad should now be 0.5 if we redraw, and if everything went correctly, 0.5, yay

okay, so now we need to call n's grad... sorry, n's _backward. so that seems to have worked, so n._backward routed the gradient to both of these, so this is looking great

now we could of course call, uh, b._backward. what's gonna happen?

well, b doesn't have a backward to speak of: b's _backward, because b is a leaf node, is by initialization the empty function, so nothing would happen, but we can still call it on it. but when we call this one's _backward, then we expect this 0.5 to get further routed, right? so there we go, 0.5, 0.5, and then finally we want to call it here on x2w2 and on x1w1

do both of those, and there we go, so we get 0, 0.5, negative 1.5 and 1, exactly as we did before, but now we've done it through calling _backward, um, sort of manually

implementing the backward function for a whole expression graph

so we have one last piece to get rid of which is us calling underscore backward manually so let's think through what we are actually doing

we've laid out a mathematical expression and now we're trying to go backwards through that expression um so going backwards through the expression just means that we never want to call a dot backward for any node before we've done a sort of um everything after it

so we have to do everything after it before we're ever going to call that backward on any one node we have to get all of its full dependencies everything that it depends on has to propagate to it before we can continue back propagation

so this ordering of graphs can be achieved using something called topological sort

so topological sort is basically a laying out of a graph such that all the edges go only from left to right basically

Topological Sort

so here we have a graph, it's a directed acyclic graph, a dag, and these are two different topological orders of it, i believe, where basically you'll see that it's a laying out of the nodes such that all the edges go only one way, from left to right. and implementing topological sort, you can look in wikipedia and so on, i'm not going to go through it in detail, but basically this is what builds a topological order: we maintain a set of visited nodes, and then we are going through, starting at some root node, which for us is o, that's where we want to start the topological sort, and starting at o we go through all of its children and we need to lay them out from left to right

and basically this starts at o: if it's not visited, then it marks it as visited, and then it iterates through all of its children and calls build_topo on them, and then, after it's gone through all the children, it adds itself. so basically, the node that we call it on, say o, is only going to add itself to the topo list after all of the children have been processed, and that's how this function is guaranteeing that you're only going to be in the list once all your children are in the list, and that's the invariant that is being maintained. so if we call build_topo on o and then inspect this list, we're going to see that it ordered our value objects, and the last one is the value of 0.707, which is the output
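the build_topo routine described here can be sketched on a toy node class (the names are illustrative, but the structure mirrors micrograd's _prev sets):

```python
class Node:
    def __init__(self, name, children=()):
        self.name = name
        self._prev = set(children)

def build_topo(root):
    topo, visited = [], set()
    def visit(v):
        if v not in visited:
            visited.add(v)
            for child in v._prev:
                visit(child)
            topo.append(v)  # a node is appended only after all of its children
    visit(root)
    return topo

a = Node('a')
b = Node('b', (a,))
o = Node('o', (a, b))
print([n.name for n in build_topo(o)])  # ['a', 'b', 'o']
```

the root always ends up last, and every node appears after everything it depends on.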

so this is o and then this is n and then all the other nodes get laid out before it

so that builds the topological graph and really what we're doing now is we're just calling dot underscore backward on all of the nodes in a topological order so if we just reset the gradients they're all zero what did we do?

we started by setting o.grad to be 1, that's the base case, then we built the topological order, and then we went for node in reversed(topo)

now, in the reverse order, because this list goes from left to right, we need to go through it in reversed order, so starting at o, call node._backward, and this should be it. there we go

those are the correct derivatives. finally, we are going to hide this functionality, so i'm going to copy this and we're going to hide it inside the value class, because we don't want to have all that code lying around

so instead of an underscore backward, we're now going to define an actual backward, so that's backward without the underscore, and that's going to do all the stuff that we just derived. so let me just clean this up a little bit

so we're first going to build the topological order, starting at self, so build_topo(self) will populate the topological order into the topo list, which is a local variable, then we set self.grad to be one, and then, for each node in the reversed list, so starting at us and going to all the children, call _backward, and that should be it
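putting it all together, the backward method just described might look like this condensed sketch of Value, with just add, mul and tanh, run on the neuron from the lecture:

```python
import math

class Value:
    def __init__(self, data, _children=()):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None
        self._prev = set(_children)

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def tanh(self):
        t = math.tanh(self.data)
        out = Value(t, (self,))
        def _backward():
            self.grad += (1 - t**2) * out.grad
        out._backward = _backward
        return out

    def backward(self):
        # build the topological order starting at self, then run the chain
        # rule node by node, from the output back to the leaves
        topo, visited = [], set()
        def build_topo(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build_topo(child)
                topo.append(v)
        build_topo(self)
        self.grad = 1.0  # base case
        for node in reversed(topo):
            node._backward()

# the two-input neuron from the lecture
x1, x2 = Value(2.0), Value(0.0)
w1, w2 = Value(-3.0), Value(1.0)
b = Value(6.8813735870195432)
o = (x1*w1 + x2*w2 + b).tanh()
o.backward()
print(x1.grad, w1.grad, x2.grad, w2.grad)  # approximately -1.5 1.0 0.5 0.0
```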

so save, come down here, redefine, okay, all the grads are zero, and now what we can do is o.backward, without the underscore, and there we go, and that's, uh, that's backpropagation, at least for one neuron

fixing a backprop bug when one node is used multiple times

now we shouldn't be too happy with ourselves actually because we have a bad bug um and we have not surfaced the bug because of some specific conditions that we are we have to think about right now

so here's the simplest case that shows the bug

say i create a single node a and then i create a b that is a plus a and then i called backward

so what's going to happen is a is 3 and then a b is a plus a so there's two arrows on top of each other here

then we can see that the forward pass works, b is just a plus a, which is six, but the gradient here that we calculated automatically is not actually correct, and that's because, of course, just doing calculus in your head, the derivative of b with respect to a should be, uh, two: one plus one. it's not one

intuitively, what's happening here? so b is the result of a plus a, and then we call backward on it, so let's go up and see what that does. b is a result of addition, so out is b, and then when we called backward, what happened is self.grad was set to one, and then other.grad was set to one, but because we're doing a plus a, self and other are actually the exact same object, so we are overriding the gradient: we are setting it to one, and then we are setting it again to one, and that's why it stays at one. so that's a problem

there's another way to see this in a little bit more complicated expression

so here we have a and b, and then, uh, d will be the multiplication of the two, and e will be the addition of the two, and then we multiply e times d to get f, and then we called f.backward, and these gradients, if you check, will be incorrect

so fundamentally what's happening here again is basically we're going to see an issue anytime we use a variable more than once until now in these expressions above every variable is used exactly once so we didn't see the issue but here if a variable is used more than once what's going to happen during backward pass?

we're backpropagating from f to e to d, so far so good, but now e calls its _backward and it deposits its gradients to a and b, but then we come back to d and call its _backward, and it overwrites those gradients at a and b

so that's obviously a problem, and the solution here, if you look at the multivariate case of the chain rule and its generalization there, the solution there is basically that we have to accumulate these gradients, these gradients add, and so instead of setting those gradients we can simply do plus equals, we need to accumulate those gradients: plus equals, plus equals, plus equals, plus equals.

and this will be okay, remember, because we are initializing them at zero, so they start at zero, and then any contribution that flows backwards will simply add. so now, if we redefine this one, because of the plus equals, this now works, because a.grad started at zero and we called b.backward, we deposit one and then we deposit one again, and now this is two, which is correct. and here this will also work and we'll get correct gradients, because when we call e's _backward we will deposit the gradients from this branch, and then we get back into d's _backward, it will deposit its own gradients, and those gradients simply add on top of each other, and so we just accumulate those gradients, and that fixes the issue
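the bug and its fix can be seen in a few lines (a minimal Value with just __add__; the comment marks the line where = would silently overwrite):

```python
class Value:
    def __init__(self, data, _children=()):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None
        self._prev = _children

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad   # += accumulates; = would overwrite
            other.grad += out.grad  # crucial when self and other are the same object
        out._backward = _backward
        return out

a = Value(3.0)
b = a + a        # a is used twice: both gradient contributions must add
b.grad = 1.0
b._backward()
print(a.grad)    # 2.0, not 1.0
```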

okay, now before we move on, let me actually do a bit of cleanup here and delete some of this intermediate work, so we're not gonna need any of this now that we've derived all of it. um, we are going to keep this because i want to come back to it, delete the tanh, delete our morning example, delete the step, delete this, keep the code that draws, and then delete this example, and leave behind only the definition of value

breaking up a tanh, exercising with more operations

and now let's come back to this non-linearity here that we implemented the tanh

now i told you that we could have broken down tanh into its explicit atoms, in terms of other expressions, if we had the exp function. so if you remember, tanh is defined like this, and we chose to develop tanh as a single function, and we can do that because we know its derivative and we can backpropagate through it, but we can also break tanh down and express it as a function of exp

and i would like to do that now, because i want to prove to you that you get all the same results and all the same gradients, but also because it forces us to implement a few more expressions, it forces us to do exponentiation, addition, subtraction, division and things like that, and i think it's a good exercise to go through a few more of these

okay, so let's scroll up to the definition of value, and here one thing that we currently can't do is, we can do like a value of, say, 2.0, but we can't do, you know, here for example we want to add the constant one, and we can't do something like this, and we can't do it because it says object has no attribute data. that's because a plus one comes right here to __add__, and then other is the integer one, and then here python is trying to access one.data, and that's not a thing, and that's because basically one is not a value object, and we only have addition for value objects

so, as a matter of convenience, so that we can create expressions like this and make them make sense, we can simply do something like this: basically, we leave other alone if other is an instance of value, but if it's not an instance of value, we're going to assume that it's a number, like an integer or float, and we're going to simply wrap it in value, and then other will just become value of other, and then other will have a data attribute, and this should work

so if i just redefine value, then this should work. there we go. okay, now let's do the exact same thing for multiply, because we can't do something like this, again for the exact same reason, so we just have to go to __mul__, and if other is not a value, then let's wrap it in value. let's redefine value, and now this works

now here's a kind of unfortunate and not obvious part: a times two works, we saw that, but is two times a gonna work? you'd expect it to, right, but actually it will not, and the reason it won't is because python doesn't know... when you do a times two, basically, python will go and it will basically do something like a.__mul__(2), that's basically what it will call, but 2 times a is the same as 2.__mul__(a), and the int 2 can't multiply a value, and so it's really confused about that

so instead, the way this works in python is that you are free to define something called __rmul__, and __rmul__ is kind of like a fallback: if python can't do 2 times a, it will check if, by any chance, a knows how to be multiplied by two, and that will be called into __rmul__. so because python can't do two times a, it will check, is there an __rmul__ in value, and because there is, it will now call that, and what we'll do here is we will swap the order of the operands: basically two times a will redirect to __rmul__, and __rmul__ will basically call a times two, and that's how that will work
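the fallback just described can be shown in isolation (a tiny Value with only the forward-pass arithmetic, no gradients, to focus on the operator dispatch):

```python
class Value:
    def __init__(self, data):
        self.data = data

    def __mul__(self, other):
        # wrap plain numbers so `a * 2` works
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data * other.data)

    def __rmul__(self, other):
        # fallback: python calls this for `2 * a` after (2).__mul__(a) fails,
        # and we just swap the operands
        return self * other

a = Value(2.0)
print((a * 2).data)  # 4.0, via a.__mul__(2)
print((2 * a).data)  # 4.0, via the a.__rmul__(2) fallback
```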

so, redefining now with __rmul__, two times a becomes four, okay

now, looking at the other elements that we still need, we need to know how to exponentiate and how to divide. so let's first do the exponentiation part: we're going to introduce a single function exp here, and exp is going to mirror tanh, in the sense that it's a simple single function that takes a single scalar value and outputs a single scalar value

so we pop out the python number, we use math.exp to exponentiate it, create a new value object, everything that we've seen before

the tricky part of course is how do you propagate through e to the x and so here you can potentially pause the video and think about what should go here

okay, so basically we need to know what is the local derivative of e to the x. so d by dx of e to the x is famously just e to the x, and we've already just calculated e to the x, and it's inside out.data, so we can do out.data times out.grad, that's the chain rule, so we're just chaining onto the current running grad, and this is what the expression looks like, it looks a little confusing, but this is what it is, and that's the exponentiation
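a sketch of that exp method, with the derivative reusing out.data:

```python
import math

class Value:
    def __init__(self, data, _children=()):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None
        self._prev = set(_children)

    def exp(self):
        out = Value(math.exp(self.data), (self,))
        def _backward():
            # d/dx e^x = e^x, and that value is already sitting in out.data
            self.grad += out.data * out.grad
        out._backward = _backward
        return out

x = Value(1.0)
o = x.exp()
o.grad = 1.0
o._backward()
print(x.grad)  # e, about 2.718
```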

so, redefining, we should now be able to call a.exp(), and hopefully the backward pass works as well. okay, and the last thing we'd like to do, of course, is we'd like to be able to divide

now i actually will implement something slightly more powerful than division because division is just a special case of something a bit more powerful

so in particular just by rearranging if we have some kind of a b equals value of 4.0 here we'd like to basically be able to do a divide b and we'd like this to be able to give us 0.5

now division actually can be reshuffled as follows if we have a divide b that's actually the same as a multiplying one over b and that's the same as a multiplying b to the power of negative one and so what i'd like to do instead is i basically like to implement the operation of x to the k for some constant uh k so it's an integer or a float

um and we would like to be able to differentiate this and then as a special case uh negative one will be division and so i'm doing that just because uh it's more general and um yeah you might as well do it that way

so basically what i'm saying is, we can redefine division, which we will put here somewhere, yeah, we can put it here somewhere: self divided by other can actually be rewritten as self times other to the power of negative one

and now a value raised to the power of negative one we have now defined that

so we need to implement the pow function. where am i going to put the power function? maybe here somewhere. this is the skeleton for it: this function will be called when we try to raise a value to some power, and other will be that power

now i'd like to make sure that other is only an int or a float. usually other is some kind of a different value object, but here other will be forced to be an int or a float, otherwise the math won't work for what we're trying to achieve in this specific case; it would be a different derivative expression if we wanted other to be a value

so here we create the output value which is just uh you know this data raised to the power of other and other here could be for example negative one that's what we are hoping to achieve

and then, uh, this is the backward stub, and this is the fun part, which is: what is the chain rule expression here for backpropagating through the power function, where the power is some kind of a constant

so this is the exercise and maybe pause the video here and see if you can figure it out yourself as to what we should put here

okay so you can actually go here and look at derivative rules as an example and we see lots of derivatives that you can hopefully know from calculus in particular what we're looking for is the power rule because that's telling us that if we're trying to take d by dx of x to the n which is what we're doing here then that is just n times x to the n minus 1 right

okay, so that's telling us about the local derivative of this power operation. so all we want here, basically, n is now other, and self.data is x, and so this now becomes other, which is n, times self.data, which is now a python int or a float, it's not a value object, we're accessing the data attribute, raised to the power of other minus one, or n minus one

i could put brackets around this, but it doesn't matter, because power takes precedence over multiply in python, so that would have been okay. and that's the local derivative only, but now we have to chain it, and we chain it simply by multiplying by out.grad, that's chain rule

and this should technically work, and we're going to find out soon. but now, if we do this, this should now work, and we get 0.5, so the forward pass works. but does the backward pass work? and i realized that we actually also have to know how to subtract, so right now a minus b will not work. to make it work we need one more piece of code here, and basically this is the subtraction, and the way we're going to implement subtraction is as addition of a negation, and then to implement negation we're gonna multiply by negative one, so just again using the stuff we've already built, and just expressing it in terms of what we have, and a minus b is now working
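the handful of operations added in this stretch (pow, division, negation, subtraction) might be sketched together like this, with division and subtraction expressed in terms of the others exactly as described:

```python
class Value:
    def __init__(self, data, _children=()):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None
        self._prev = set(_children)

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def __pow__(self, other):
        assert isinstance(other, (int, float)), "only int/float powers supported"
        out = Value(self.data ** other, (self,))
        def _backward():
            # power rule: d/dx x^n = n * x^(n-1)
            self.grad += other * self.data ** (other - 1) * out.grad
        out._backward = _backward
        return out

    def __truediv__(self, other):  # a / b  ==  a * b**-1
        return self * other**-1

    def __neg__(self):             # -a  ==  a * -1
        return self * -1

    def __sub__(self, other):      # a - b  ==  a + (-b)
        return self + (-other)

a, b = Value(2.0), Value(4.0)
print((a / b).data)  # 0.5
print((a - b).data)  # -2.0
x = Value(3.0)
y = x ** 2
y.grad = 1.0
y._backward()
print(x.grad)        # 6.0, from the power rule
```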

okay, so now let's scroll again to this expression here for this neuron, and let's just compute the backward pass here once we've defined o, and let's draw it. so here are the gradients for all these leaf nodes, for this two-dimensional neuron with a tanh that we've seen before

so now what i'd like to do is, i'd like to break up this tanh into this expression here, so let me copy paste this here, and now we'll preserve the label, and we will change how we define o

so in particular we're going to implement this formula here: we need e to the 2x minus 1, over e to the 2x plus 1. so for e to the 2x, we need to take 2 times n and we need to exponentiate it, that's e to the 2x, and then, because we're using it twice, let's create an intermediate variable e, and then define o as e minus one over e plus one

e minus one over e plus one, and that should be it, and then we should be able to draw dot of o

so now before i run this what do we expect to see?

number one, we're expecting to see a much longer graph here, because we've broken up tanh into a bunch of other operations, but those operations are mathematically equivalent, and so what we're expecting to see is, number one, the same result here, so the forward pass works, and number two, because of that mathematical equivalence, we expect to see the same backward pass and the same gradients on these leaf nodes, so these gradients should be identical. so let's run this. so number one, let's verify that instead of a single tanh node we now have exp, and we have plus, we have times negative one, this is the division, and we end up with the same forward pass here. and then the gradients: we have to be careful, because they're potentially in a slightly different order. the gradients for w2, x2 should be 0 and 0.5: w2 and x2 are 0 and 0.5. and w1, x1 should be 1 and negative 1.5: 1 and negative 1.5

so that means that both our forward pass and backward pass were correct, because this turned out to be equivalent to the tanh from before

and so the reason i wanted to go through this exercise is, number one, we got to practice a few more operations and writing more backward passes, and number two, i wanted to illustrate the point that the level at which you implement your operations is totally up to you. you can implement backward passes for tiny expressions, like a single individual plus or a single times, or you can implement them for, say, tanh, which you can see as a composite operation, because it's made up of all these more atomic operations

but really all of this is kind of like a fake concept all that matters is we have some kind of inputs and some kind of an output and this output is a function of the inputs in some way and as long as you can do forward pass and the backward pass of that little operation it doesn't matter what that operation is and how composite it is if you can write the local gradients you can chain the gradient and you can continue back propagation so the design of what those functions are is completely up to you

doing the same thing but in PyTorch: comparison

so now i would like to show you how you can do the exact same thing by using a modern deep neural network library, like, for example, pytorch, after which i've roughly modeled micrograd. and so pytorch is something you would use in production, and i'll show you how you can do the exact same thing, but in the pytorch api. so i'm just going to copy paste it in and walk you through it a little bit. this is what it looks like: we're going to import pytorch, and then we need to define these value objects like we have here

now, micrograd is a scalar valued engine, so we only have scalar values like 2.0, but in pytorch everything is based around tensors, and like i mentioned, tensors are just n-dimensional arrays of scalars, so that's why things get a little bit more complicated here: i just need a scalar valued tensor, a tensor with just a single element. but by default, when you work with pytorch, you would use more complicated tensors, like this: if i import pytorch, then i can create tensors like this, and this tensor, for example, is a two by three array of scalars in a single compact representation, so we can check its shape, we see that it's a two by three array, and so on. so this is usually what you would work with in the actual libraries

so here i'm creating a tensor that has only a single element, 2.0, and then i'm casting it to be double, because python is by default using double precision for its floating point numbers, and i'd like everything to be identical. by default the data type of these tensors will be float32, so it's only using a single precision float, so i'm casting it to double, so that we have float64, just like in python

so i'm casting to double, and then we get something similar to value of two. the next thing i have to do is, because these are leaf nodes, by default pytorch assumes that they do not require gradients, so i need to explicitly say that all of these nodes require gradients. okay, so this is going to construct scalar valued, one-element tensors, and make sure that pytorch knows that they require gradients. now, by default these are set to false, by the way, because of efficiency reasons: usually you would not want gradients for leaf nodes, like the inputs to the network, and this is just trying to be efficient in the most common cases

so once we've defined all of our values in pytorch, we can perform arithmetic, just like we can here in micrograd land, so this will just work, and then there's a torch.tanh also, and what we get back is a tensor again, and, just like in micrograd, it's got a data attribute and it's got a grad attribute. so these tensor objects, just like in micrograd, have a .data and a .grad, and the only difference here is that we need to call .item(), because .item() basically takes a single tensor of one element and just returns that element, stripping out the tensor

so let me just run this, and hopefully this is going to print the forward pass, which is 0.707, and these will be the gradients, which hopefully are 0.5, 0, negative 1.5 and 1. so if we just run this, there we go, 0.7, so the forward pass agrees, and then 0.5, 0, negative 1.5 and 1
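roughly what that pasted cell does (written here from the description; this sketch uses torch.tensor with dtype=torch.double rather than the .double() cast, which gives the same float64 tensors):

```python
import torch

x1 = torch.tensor(2.0, dtype=torch.double, requires_grad=True)
x2 = torch.tensor(0.0, dtype=torch.double, requires_grad=True)
w1 = torch.tensor(-3.0, dtype=torch.double, requires_grad=True)
w2 = torch.tensor(1.0, dtype=torch.double, requires_grad=True)
b = torch.tensor(6.8813735870195432, dtype=torch.double, requires_grad=True)

n = x1*w1 + x2*w2 + b
o = torch.tanh(n)

print(o.data.item())  # forward pass, about 0.7071
o.backward()
print(x1.grad.item(), w1.grad.item(), x2.grad.item(), w2.grad.item())
```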

so pytorch agrees with us

and just to show you here, basically, o here is a tensor with a single element and it's a double, and we can call .item() on it to just get the single number out, so that's what item does, and o is a tensor object, like i mentioned, and it's got a backward function, just like we've implemented, and then all of these also have a .grad, so like x2, for example, has a grad, and it's a tensor, and we can pop out the individual number with .item()

so basically, torch can do what we did in micrograd, as a special case, when your tensors are all single-element tensors, but the big deal with pytorch is that everything is significantly more efficient, because we are working with these tensor objects and we can do lots of operations in parallel on all of these tensors. but otherwise, what we've built very much agrees with the api of pytorch

building out a neural net library (multi-layer perceptron) in micrograd

okay so now that we have some machinery to build out pretty complicated mathematical expressions we can also start building out neural nets and as i mentioned neural nets are just a specific class of mathematical expressions

so we're going to start building out a neural net piece by piece, and eventually we'll build out a two-layer multi-layer perceptron, as it's called, and i'll show you exactly what that means

let's start with a single individual neuron we've implemented one here but here i'm going to implement one that also subscribes to the pytorch api in how it designs its neural network modules

so just like we saw that we can like match the api of pytorch on the auto grad side we're going to try to do that on the neural network modules

so here's class neuron and just for the sake of efficiency i'm going to copy paste some sections that are relatively straightforward

so the constructor will take the number of inputs to this neuron, which is how many inputs come to a neuron, so this one for example has three inputs, and then it's going to create a weight that is some random number between negative one and one for every one of those inputs, and a bias that controls the overall trigger happiness of this neuron

and then we're going to implement a def __call__(self, x), for some input x, and really what we do here is w times x plus b, where w times x here is a dot product, specifically. now, if you haven't seen __call__: let me just return 0.0 here for now. the way this works is, we can have an x which is, say, [2.0, 3.0], then we can initialize a neuron that is two-dimensional, because these are two numbers, and then we can feed those two numbers into that neuron to get an output

and so when you use this notation, n of x, python will use __call__, so currently __call__ just returns 0.0

now we'd like to actually do the forward pass of this neuron instead so we're going to do here first is we need to basically multiply all of the elements of w with all of the elements of x pairwise we need to multiply them

so the first thing we're going to do is zip up self.w and x, and in python, zip takes two iterators and creates a new iterator that iterates over tuples of the corresponding entries

so for example just to show you we can print this list and still return 0.0 here

so we see that these w's are paired up with the x's w with x

and now what we want to do is, for wi, xi in this zip, we want to multiply wi times xi, and then we want to sum all of that together to come up with an activation, and also add self.b on top

so that's the raw activation, and then of course we need to pass that through a non-linearity, so what we're going to be returning is act.tanh(), and here's out

so now we see that we are getting some outputs, and we get a different output from the neuron each time, because we are initializing it with different weights and biases

and then, to be a bit more efficient here: actually, sum, by the way, takes a second optional parameter, which is the start, and by default the start is zero, so these elements of this sum will be added on top of zero to begin with, but actually we can just start with self.b, and then we just have an expression like this

and then the generator expression here must be parenthesized in python. there we go. yep, so now we can forward a single neuron
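a float-only sketch of this Neuron (using math.tanh on plain floats to stay self-contained; the lecture's version operates on Value objects instead, so that gradients can flow):

```python
import math
import random

class Neuron:
    def __init__(self, nin):
        # one random weight per input, plus a bias
        self.w = [random.uniform(-1, 1) for _ in range(nin)]
        self.b = random.uniform(-1, 1)

    def __call__(self, x):
        # w . x + b, starting the sum at the bias, then squash with tanh
        act = sum((wi * xi for wi, xi in zip(self.w, x)), self.b)
        return math.tanh(act)

x = [2.0, 3.0]
n = Neuron(2)
print(n(x))  # some number in (-1, 1), different on each initialization
```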

next up we're going to define a layer of neurons. so here we have a schematic for an mlp, and we see that in these mlps, each layer, this is one layer, has a number of neurons, and they're not connected to each other, but all of them are fully connected to the input. so what is a layer of neurons? it's just a set of neurons evaluated independently. so, in the interest of time, i'm going to do something fairly straightforward here

it's, um, literally, a layer is just a list of neurons, and then, how many neurons do we have? we take that as an input argument here: how many neurons do you want in your layer, the number of outputs in this layer. and so we just initialize completely independent neurons with this given dimensionality, and when we call on it, we just independently evaluate them

so now instead of a neuron we can make a layer of neurons they are two-dimensional neurons and let's have three of them and now we see that we have three independent evaluations of three different neurons right

okay finally let's complete this picture and define an entire multi-layer perceptron or mlp

and as we can see here in an mlp these layers just feed into each other sequentially so let's come here and i'm just going to copy the code here in interest of time

so an mlp is very similar we're taking the number of inputs as before but now instead of taking a single n out which is number of neurons in a single layer we're going to take a list of an outs and this list defines the sizes of all the layers that we want in our mlp

so here we just put them all together and then iterate over consecutive pairs of these sizes and create layer objects for them and then in the call function we are just calling them sequentially so that's an mlp really

and let's actually re-implement this picture so we want three input neurons and then two layers of four and an output unit

so we want a three-dimensional input say this is an example input we want three inputs into two layers of four and one output and this of course is an mlp

and there we go, that's a forward pass of an mlp. to make this a little bit nicer, you see how we have just a single element but it's wrapped in a list, because layer always returns lists. so for convenience, return outs[0] if len(outs) is exactly one, else return outs, and this will allow us to just get a single value out at the last layer that only has a single neuron
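putting the pieces together, here is a plain-float sketch of the Layer and MLP classes just described, including a plain-float Neuron for self-containment (the real micrograd versions hold Value objects so that backprop works, but the wiring of the forward pass is the same; the seed and input are made up for illustration):

```python
import math
import random

class Neuron:
    def __init__(self, nin):
        self.w = [random.uniform(-1, 1) for _ in range(nin)]
        self.b = random.uniform(-1, 1)

    def __call__(self, x):
        act = sum((wi * xi for wi, xi in zip(self.w, x)), self.b)
        return math.tanh(act)

class Layer:
    def __init__(self, nin, nout):
        # nout independent neurons, each fully connected to the nin inputs
        self.neurons = [Neuron(nin) for _ in range(nout)]

    def __call__(self, x):
        outs = [n(x) for n in self.neurons]
        # squeeze: a layer with a single neuron returns a scalar, not a list
        return outs[0] if len(outs) == 1 else outs

class MLP:
    def __init__(self, nin, nouts):
        # consecutive pairs of sizes define the layers
        sz = [nin] + nouts
        self.layers = [Layer(sz[i], sz[i + 1]) for i in range(len(nouts))]

    def __call__(self, x):
        # layers feed into each other sequentially
        for layer in self.layers:
            x = layer(x)
        return x

random.seed(1337)
n = MLP(3, [4, 4, 1])      # 3 inputs, two hidden layers of 4, one output
out = n([2.0, 3.0, -1.0])
print(out)                 # a single number in (-1, 1)
```

note that this architecture has (3+1)*4 + (4+1)*4 + (4+1)*1 = 41 weights and biases, matching the parameter count mentioned later in the lecture.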

and finally we should be able to draw_dot of n(x), and as you might imagine these expressions are now getting relatively involved. so this is an entire mlp that we're defining now, all the way until a single output. and so obviously you would never differentiate these expressions on pen and paper, but with micrograd we will be able to backpropagate all the way through this, and backpropagate into the weights of all these neurons. so let's see how that works

creating a tiny dataset, writing the loss function

okay so let's create ourselves a very simple example data set here so this data set has four examples and so we have four possible inputs into the neural net and we have four desired targets so we'd like the neural net to assign or output 1.0 when it's fed this example negative one when it's fed these examples and one when it's fed this example so it's a very simple binary classifier neural net basically that we would like here

now let's think what the neural net currently thinks about these four examples. we can just get their predictions: basically we can just call n(x) for x in xs, and then we can print

so these are the outputs of the neural net on those four examples so the first one is 0.91 but we'd like it to be one so we should push this one higher this one we want to be higher this one says 0.88 and we want this to be negative one this is 0.8 we want it to be negative one and this one is 0.8 we want it to be one

so how do we make the neural net and how do we tune the weights to better predict the desired targets

and the trick used in deep learning to achieve this is to calculate a single number that somehow measures the total performance of your neural net and we call this single number the loss

so the loss is a single number that we're going to define that basically measures how well the neural net is performing. right now we have the intuitive sense that it's not performing very well, because the predictions are not very close to the targets, so the loss will be high, and we'll want to minimize the loss

so in particular, in this case, what we're going to do is implement the mean squared error loss. what this is doing is we're going to basically iterate: for y ground truth and y output in zip of ys and ypred, so we're going to pair up the ground truths with the predictions, and this zip iterates over tuples of them, and for each y ground truth and y output we're going to subtract them and square them

so let's first see what these losses are these are individual loss components and so basically for each one of the four we are taking the prediction and the ground truth we are subtracting them and squaring them so because this one is so close to its target 0.91 is almost one subtracting them gives a very small number so here we would get like a negative point one and then squaring it just makes sure that regardless of whether we are more negative or more positive we always get a positive number
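the loss computation can be sketched in plain python (the targets below are the ones from the lecture; the predictions are hypothetical numbers like those read off the screen):

```python
ys = [1.0, -1.0, -1.0, 1.0]     # desired targets
ypred = [0.91, 0.88, 0.8, 0.8]  # example outputs, similar to those above

# one squared-error component per example; squaring discards the sign
losses = [(yout - ygt) ** 2 for ygt, yout in zip(ys, ypred)]
loss = sum(losses)
print(losses)  # the first is tiny (0.91 is close to 1), the middle two large
print(loss)    # roughly seven, as in the lecture
```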

instead of squaring we could also take, for example, the absolute value; we need to discard the sign. and so you see that the expression is arranged so that you only get zero exactly when y out is equal to y ground truth, when those two are equal, so your prediction is exactly the target. you are going to get zero then, and if your prediction is not the target you are going to get some other number. so here for example we are way off, and so that's why the loss is quite high

and the more off we are the greater the loss will be, so we don't want high loss we want low loss

and so the final loss here will be just the sum of all of these numbers so you see that this should be zero roughly plus zero roughly but plus seven so loss should be about seven here

and now we want to minimize the loss we want the loss to be low because if loss is low then every one of the predictions is equal to its target so the loss the lowest it can be is zero and the greater it is the worse off the neural net is predicting

so now of course if we do loss.backward, something magical happened when i hit enter. and the magical thing of course that happened is that we can look at n.layers[0].neurons[0], because remember that the mlp has layers, which is a list, and each layer has neurons, which is a list, and that gives us an individual neuron, and then it's got some weights. so we can for example look at the weights at zero, oops, it's not called weights, it's called w, and that's a value. but now this value also has a grad, because of the backward pass, and so we see that because the gradient here on this particular weight of this particular neuron of this particular layer is negative, its influence on the loss is also negative

so slightly increasing this particular weight of this neuron of this layer would make the loss go down and we actually have this information for every single one of our neurons and all their parameters

actually it's worth looking at draw_dot of the loss as well, by the way. so previously we looked at the draw_dot of a single neuron's forward pass, and that was already a large expression, but what is this expression? we actually forwarded every one of those four examples, and then we have the loss on top of them with the mean squared error, and so this is a really massive graph. this graph that we've built up now is kind of excessive, because it has four forward passes of the neural net, one for every example, and then it has the loss on top, and it ends with the value of the loss, which was 7.12. and this loss will now backpropagate through all the four forward passes, through every single intermediate value of the neural net, all the way back to the parameters, the weights, which are inputs. so these weight parameters here are inputs to this neural net, and these numbers here, these scalars, are also inputs to the neural net. so if we went around here we'd probably find some of these examples, this 1.0 potentially, maybe this 1.0, or some of the others, and you'll see that they all have gradients as well. the thing is, these gradients on the input data are not that useful to us, and that's because the input data is not changeable, it's a given to the problem, so it's a fixed input. we're not going to be changing it or messing with it, even though we do have gradients for it. but some of these gradients will be for the neural network parameters, the ws and the bs, and those of course we do want to change, okay

collecting all of the parameters of the neural net

so now we're going to want some convenience code to gather up all of the parameters of the neural net, so that we can operate on all of them simultaneously, and nudge every one of them a tiny amount based on the gradient information. so let's collect the parameters of the neural net all in one list. let's create a parameters(self) that just returns self.w, which is a list, concatenated with a list of self.b, so this will just return a list; list plus list just gives you a list. so that's parameters of neuron, and i'm calling it this way because pytorch also has a parameters on every single nn.Module, and it does exactly what we're doing here: it returns the parameter tensors, for us the parameter scalars. now layer is also a module, so it will have parameters itself, and basically what we want to do here is something like this: params is here, and then for neuron in self.neurons we want to get neuron.parameters() and we want to params.extend, right, so these are the parameters of this neuron, and then we want to put them on top of params, so params.extend(ps), and then we want to return params

so this is way too much code, so actually there's a way to simplify this, which is: return p for neuron in self.neurons for p in neuron.parameters(). so it's a single list comprehension; in python you can sort of nest them like this, and you can then create the desired list. so these are identical, and we can take this out

and then let's do the same here: def parameters(self), and return p for layer in self.layers for p in layer.parameters()
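the nested comprehension pattern can be demonstrated on plain lists standing in for the actual neuron and layer objects (the numbers are arbitrary placeholders):

```python
# each layer is a list of neurons; each neuron is a list of parameter scalars
layers = [
    [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],  # layer 0: two neurons, 3 params each
    [[0.7, 0.8, 0.9]],                   # layer 1: one neuron
]

# a nested list comprehension flattens everything into one list of scalars
params = [p for layer in layers for neuron in layer for p in neuron]
print(len(params))  # 9
```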

and that should be good

now let me pop this out so we don't re-initialize our network. okay, so unfortunately we will probably have to re-initialize the network, because we just added functionality to this class: i want to get all of n's parameters, but that's not going to work, because n is still an instance of the old class, okay

so unfortunately we do have to re-initialize the network, which will change some of the numbers, but let me do that so that we pick up the new api. we can now do n.parameters(), and these are all the weights and biases inside the entire neural net

so in total this mlp has 41 parameters and now we'll be able to train them

doing gradient descent optimization manually, training the network

if we recalculate the loss here, we see that unfortunately we have slightly different predictions and a slightly different loss, but that's okay

okay so we see that this neuron's gradient is slightly negative, and we can also look at its data right now, which is 0.85. so this is the current value of this neuron, and this is its gradient on the loss. so what we want to do now is iterate: for every p in n.parameters(), so for all the 41 parameters in this neural net, we actually want to change p.data slightly according to the gradient information

okay, so there's a to-do here for now, but this will basically be a tiny update in this gradient descent scheme. in gradient descent we are thinking of the gradient as a vector pointing in the direction of increased loss

and so in gradient descent we are modifying p.data by a small step size in the direction of the gradient. so the step size, as an example, could be a very small number like 0.01, times p.grad, right

but we have to think through some of the signs here. in particular, working with this specific example, we see that if we just left it like this, then this neuron's value would currently be increased by a tiny amount of the gradient. the grad is negative, so the value of this neuron would go slightly down; it would become like 0.84 or something like that

but if this neuron's value goes lower, that would actually increase the loss. that's because the derivative of this neuron is negative, so increasing this makes the loss go down, so increasing it is what we want to do, instead of decreasing it. so basically what we're missing here is a negative sign, and that's because we want to minimize the loss, not maximize it, we want to decrease it. and the other interpretation, as i mentioned, is that you can think of the gradient vector, basically just the vector of all the gradients, as pointing in the direction of increasing the loss, but then we want to decrease it, so we actually want to go in the opposite direction. and so you can convince yourself that this does the right thing here with the negative, because we want to minimize the loss
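the sign logic can be checked on plain numbers (the parameter value and gradient here are hypothetical, roughly like the ones read off the screen):

```python
step_size = 0.01
data, grad = 0.85, -0.27  # hypothetical parameter value and its gradient

# the minus sign steps against the gradient: a negative gradient means
# increasing the parameter decreases the loss, so the value should go up
data += -step_size * grad
print(data)  # slightly above 0.85, as expected
```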

so if we nudge all the parameters by a tiny amount, then we'll see that the data will have changed a little bit. so now this neuron is a tiny amount greater in value, 0.854 went to 0.857, and that's a good thing, because slightly increasing this neuron's data makes the loss go down according to the gradient, so the correct thing has happened sign-wise. and so now what we would expect, of course, is that because we've changed all these parameters, the loss should have gone down a bit, so we want to re-evaluate the loss. the data definition hasn't changed, but the forward pass of the network we can recalculate, and actually let me do it outside here so that we can compare the two loss values. so here, if i recalculate the loss, we'd expect the new loss to be slightly lower than this number. so hopefully what we're getting now is a tiny bit lower than 4.84... 4.36, okay

and remember, the way we've arranged this is that low loss means that our predictions are matching the targets, so our predictions now are probably slightly closer to the targets. and now all we have to do is iterate this process. so again, we've done the forward pass and this is the loss, now we can do loss.backward, let me take these out, and we can do a step, and now we should have a slightly lower loss: 4.36 goes to 3.9. okay, so we've done the forward pass, here's the backward pass, nudge, and now the loss is 3.66, 3.47, and you get the idea. we just continue doing this, and this is gradient descent: we're just iteratively doing forward pass, backward pass, update, forward pass, backward pass, update, and the neural net is improving its predictions

so here if we look at ypred now, we see that this value should be getting closer to one, so it should be getting more positive, these should be getting more negative, and this one should also be getting more positive. so let's just iterate this a few more times

actually, we may be able to afford to go a bit faster. let's try a slightly higher learning rate. oops, okay, there we go, so now we're at 0.31. if you go too fast, by the way, if you try to make too big of a step, you may actually overstep; it's overconfidence, because again, remember we don't actually know the loss function exactly. the loss function has all kinds of structure, and we only know about the very local dependence of all these parameters on the loss, but if we step too far we may step into a part of the loss landscape that is completely different, and that can destabilize training and make your loss actually blow up, even. so the loss is now 0.04, so actually the predictions should be really quite close, let's take a look

so you see how this is almost one almost negative one almost one we can continue going uh so yep backward update oops there we go

so we went way too fast and we actually overstepped, we got too eager. where are we now? oops, okay, 7e-9, so this is very very low loss

and the predictions are basically perfect...

so somehow we were doing way too big updates and we briefly exploded, but then somehow we ended up getting into a really good spot. so usually this learning rate and the tuning of it is a subtle art. if it's too low, you're going to take way too long to converge, but if it's too high, the whole thing gets unstable and you might actually even explode the loss, depending on your loss function. so finding the step size to be just right is a pretty subtle art sometimes, when you're using sort of vanilla gradient descent. but we happened to get into a good spot. we can look at n.parameters(), so this is the setting of weights and biases that makes our network predict the desired targets very very closely

and basically we've successfully trained a neural net

okay let's make this a tiny bit more respectable and implement an actual training loop and what that looks like so this is the data definition that stays this is the forward pass um so for uh k in range you know we're going to take a bunch of steps

first we do the forward pass and we evaluate the loss. let's re-initialize the neural net from scratch, and here's the data, and we first do the forward pass, then we do the backward pass

and then we do an update that's gradient descent

and then we should be able to iterate this, and we should be able to print the current step and the current loss. let's just print the number of the loss, and that should be it
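the loop above can be sketched on a self-contained toy problem, fitting a single weight w so that w * x matches y, with the gradient of the squared error derived by hand in place of micrograd's backward (the data and learning rate here are made up for illustration):

```python
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]   # the underlying relationship is y = 2x
w = 0.0

for k in range(20):
    # forward pass: predictions and the summed squared-error loss
    ypred = [w * x for x in xs]
    loss = sum((yp - y) ** 2 for yp, y in zip(ypred, ys))
    # backward pass: d(loss)/dw, derived by hand for this tiny model
    grad = sum(2 * (yp - y) * x for yp, x, y in zip(ypred, xs, ys))
    # update: step against the gradient
    w += -0.01 * grad
    print(k, loss)

print(w)  # close to 2.0 after 20 steps
```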

and then the learning rate 0.01 is a little too small 0.1 we saw is like a little bit dangerously too high let's go somewhere in between and we'll optimize this for not 10 steps but let's go for say 20 steps

let me erase all of this junk and uh let's run the optimization

and you see how we've actually converged slower, in a more controlled manner, and got to a loss that is very low. so i expect ypred to be quite good, there we go

um and that's it

okay, so this is kind of embarrassing, but we actually have a really terrible bug in here, and it's a subtle bug, and it's a very common bug, and i can't believe i've done it for the 20th time in my life, especially on camera. i could have reshot the whole thing, but i think it's pretty funny, and you get to appreciate a bit what working with neural nets is maybe like sometimes. we are guilty of a common bug; i've actually tweeted the most common neural net mistakes a long time ago:

most common neural net mistakes

and i'm not really gonna explain any of these except for: we are guilty of number three, you forgot to zero_grad before .backward. what is that? basically what's happening, and it's a subtle bug and i'm not sure if you saw it, is that all of these weights here have a .data and a .grad, and the grad starts at zero. then we do backward and we fill in the gradients, and then we do an update on the data, but we don't flush the grad; it stays there. so when we do the second forward pass and we do backward again, remember that all the backward operations do a plus-equals on the grad, and so these gradients just add up and they never get reset to zero

so basically we didn't zero grad

so here's how we zero grad before backward:

we need to iterate over all the parameters and make sure that p.grad is set to zero; we need to reset it to zero just like it is in the constructor. so remember, all the way back here, for all these value nodes the grad starts at zero, and then all these backward passes do a plus-equals onto that grad

but we need to make sure that we reset these grads to zero, so that when we do backward, all of them start at zero and the actual backward pass accumulates the loss derivatives into the grads
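a tiny sketch of why the reset matters: the backward pass does `p.grad += ...`, so without zeroing, the gradients from successive passes pile up (the Param class and the 1.5 gradient here are placeholders, not micrograd code):

```python
class Param:
    def __init__(self):
        self.grad = 0.0  # grad starts at zero, as in the Value constructor

params = [Param(), Param()]
for step in range(3):
    # zero_grad: reset before each backward pass
    for p in params:
        p.grad = 0.0
    # stand-in for the `+=` accumulation that backward performs
    for p in params:
        p.grad += 1.5

print(params[0].grad)  # 1.5 each step; without the reset it would be 4.5
```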

so this is zero_grad in pytorch, and we'll get a slightly different optimization. let's reset the neural net; the data is the same, and this is now, i think, correct. and we get a much slower descent, but we still end up with pretty good results, and we can continue this a bit more to get down lower and lower and lower

yeah, so the previous thing was extremely buggy, and the only reason it worked is that this is a very very simple problem, and it's very easy for this neural net to fit this data. so the grads ended up accumulating, and that effectively gave us a massive step size, and it made us converge extremely fast

but basically now we have to do more steps to get to very low values of loss and get ypred to be really good

we can try to take a slightly bigger step; yeah, we're going to get closer and closer to one, minus one, and one

so working with neural nets is sometimes tricky because uh you may have lots of bugs in the code and uh your network might actually work just like ours worked but chances are is that if we had a more complex problem then actually this bug would have made us not optimize the loss very well and we were only able to get away with it because the problem is very simple

summary of what we learned, how to go towards modern neural nets

so let's now bring everything together and summarize what we learned

what are neural nets?

neural nets are these mathematical expressions, fairly simple mathematical expressions in the case of a multi-layer perceptron, that take the data as input and also take the weights, the parameters of the neural net, as input. that's the mathematical expression for the forward pass, followed by a loss function, and the loss function tries to measure the accuracy of the predictions. usually the loss will be low when your predictions are matching your targets, or when the network is basically behaving well, so we manipulate the loss function so that when the loss is low, the network is doing what you want it to do on your problem

and then we backward the loss, using backpropagation to get the gradient, and then we know how to tune all the parameters to decrease the loss locally

but then we have to iterate that process many times in what's called the gradient descent so we simply follow the gradient information and that minimizes the loss and the loss is arranged so that when the loss is minimized the network is doing what you want it to do

and yeah, so we just have a blob of neural stuff, and we can make it do arbitrary things, and that's what gives neural nets their power. this is a very tiny network with 41 parameters, but you can build significantly more complicated neural nets with billions, at this point almost trillions, of parameters. it's a massive blob of simulated neural tissue, roughly speaking, and you can make it do extremely complex problems, and these neurons then have all kinds of very fascinating emergent properties when you try to make them do significantly hard problems

as in the case of gpt, for example: we have massive amounts of text from the internet, and we're trying to get a neural net to take a few words and predict the next word in a sequence; that's the learning problem. and it turns out that when you train this on all of the internet, the neural net actually has really remarkable emergent properties, but that neural net would have hundreds of billions of parameters. it works on fundamentally the exact same principles; the neural net of course will be a bit more complex, but otherwise the values and the gradients would be there and would be identical, and the gradient descent would be there and would be basically identical. but people usually use slightly different updates; this is a very simple stochastic gradient descent update. and the loss function would not be mean squared error; they would be using something called the cross-entropy loss for predicting the next token. so there are a few more details, but fundamentally the neural network setup and neural network training are identical and pervasive, and now you understand intuitively how that works under the hood

walkthrough of the full code of micrograd on github

in the beginning of this video i told you that by the end of it you would understand everything in micrograd and then we'd slowly build it up let me briefly prove that to you so i'm going to step through all the code that is in micrograd as of today actually potentially some of the code will change by the time you watch this video because i intend to continue developing micrograd but let's look at what we have so far at least

__init__.py is empty. when you go to engine.py, that has the value class, and everything here you should mostly recognize: we have the .data and .grad attributes, we have the backward function, we have the previous set of children and the operation that produced this value, and we have addition, multiplication, and raising to a scalar power

we have the relu non-linearity, which is a slightly different type of non-linearity than the tanh that we used in this video. both of them are non-linearities, and notably tanh is not actually present in micrograd as of right now, but i intend to add it later

with the backward, which is identical, and then all of these other operations, which are built up on top of the operations here. so value should be very recognizable, except for the non-linearity used in this video. there's no massive difference between relu and tanh and sigmoid and these other non-linearities; they're all roughly equivalent and can be used in mlps. i used tanh because it's a bit smoother and because it's a little bit more complicated than relu, and therefore it stressed the local gradients and working with those derivatives a little bit more, which i thought would be useful

and then nn.py is the neural networks library, as i mentioned, so you should recognize the identical implementation of neuron, layer, and mlp. notably, or not so much

we have a class module here, which is a parent class of all these modules. i did that because there's an nn.Module class in pytorch, and so this exactly matches that api. nn.Module in pytorch also has a zero_grad, which i've refactored out here

so that's the end of micrograd, really. then there's a test, which you'll see basically creates two chunks of code, one in micrograd and one in pytorch, and makes sure that the forward and the backward pass agree identically, for a slightly less complicated expression and a slightly more complicated expression. everything agrees, so we agree with pytorch on all of these operations

and finally there's a demo.ipynb here, and it's a bit more complicated binary classification demo than the one i covered in this lecture. so we only had a tiny data set of four examples; here we have a bit more complicated example, with lots of blue points and lots of red points, and we're trying again to build a binary classifier to distinguish two-dimensional points as red or blue. it's a bigger mlp here, and the loss is a bit more complicated because it supports batches. because our dataset was so tiny, we always did a forward pass on the entire data set of four examples, but when your data set is like a million examples, what we usually do in practice is basically pick out some random subset, we call that a batch, and then we only process the batch, forward, backward, and update, so we don't have to forward the entire training set
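mini-batch selection can be sketched in a few lines (the dataset size and batch size here are made-up numbers for illustration, not taken from the demo notebook):

```python
import random

random.seed(0)
n_examples = 1_000_000  # a hypothetically large dataset
batch_size = 32

# pick a random subset of example indices; the forward/backward/update
# then runs only on these examples instead of the whole training set
batch_ix = random.sample(range(n_examples), batch_size)
print(len(batch_ix))  # 32
```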

so this supports batching, because there are a lot more examples here. we do a forward pass, and the loss is slightly different: this is a max margin loss that i implement here. the one that we used was the mean squared error loss, because it's the simplest one; there's also the binary cross entropy loss. all of them can be used for binary classification and don't make too much of a difference in the simple examples that we looked at so far

there's something called l2 regularization used here. this has to do with generalization of the neural net and controls overfitting in the machine learning setting, but i did not cover these concepts in this video; potentially later

and the training loop you should recognize so forward backward with zero grad and update and so on you'll notice that in the update here the learning rate is scaled as a function of number of iterations and it shrinks and this is something called learning rate decay so in the beginning you have a high learning rate and as the network sort of stabilizes near the end you bring down the learning rate to get some of the fine details in the end

and in the end we see the decision surface of the neural net and we see that it learns to separate out the red and the blue area based on the data points

so that's the slightly more complicated example in demo.ipynb, which you're free to go over, but yeah, as of today, that is micrograd

real stuff: diving into PyTorch, finding their backward pass for tanh

i also wanted to show you a little bit of real stuff, so that you get to see how this is actually implemented in a production-grade library like pytorch. so in particular, i wanted to find and show you the backward pass for tanh in pytorch

so here in micrograd we see that the backward pass for tanh is one minus t squared, where t is the output of the tanh of x, times out.grad, which is the chain rule. so we're looking for something that looks like this. now, i went to pytorch, which has an open source github codebase, and i looked through a lot of its code, and honestly i spent about 15 minutes and i couldn't find tanh. and that's because these libraries unfortunately grow in size and entropy, and if you just search for tanh you get apparently 2,800 results in 406 files. i don't know what these files are doing, honestly, and why there are so many mentions of tanh, but unfortunately these libraries are quite complex; they're meant to be used, not really inspected. eventually i did stumble on someone who tries to change the tanh backward code for some reason, and someone here pointed to the cpu kernel and the cuda kernel for tanh backward. so basically it depends on whether you're running pytorch on a cpu device or on a gpu, which are different devices, and i haven't covered this, but this is the tanh backward kernel for cpu
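as a small sanity check of the rule itself (this is not pytorch's kernel code, just the local derivative verified numerically; the input 0.7 is an arbitrary test point):

```python
import math

# micrograd's local derivative for tanh: d/dx tanh(x) = 1 - t**2, t = tanh(x)
x = 0.7
t = math.tanh(x)
analytic = 1 - t ** 2

# central finite difference as an independent check
h = 1e-6
numeric = (math.tanh(x + h) - math.tanh(x - h)) / (2 * h)
print(analytic, numeric)  # the two agree to many decimal places
```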

and the reason it's so large is that, number one, this is if you're using a complex type, which we haven't even talked about; this is if you're using a specific data type of bfloat16, which we haven't talked about; and then if you're not, then this is the kernel, and deep here we see something that resembles our backward pass. so they have a times one minus b squared, so this b here must be the output of the tanh, and this is the out.grad. so here we found it, deep inside pytorch, at this location, for some reason inside a binary ops kernel, when tanh is not actually a binary op

and then this is the gpu kernel: we're not complex, we're here, and here we go, one line of code. so we did find it, but basically, unfortunately, these codebases are very large and micrograd is very very simple, so if you actually want to use the real stuff, finding the code for it can be difficult

i also wanted to show you a little example here where pytorch shows how you can register a new type of function that you want to add to pytorch as a lego building block. so here, if you want to for example add a legendre polynomial 3, here's how you could do it

you register it as a class that subclasses torch.autograd.Function, and then you have to tell pytorch how to forward your new function and how to backward through it. so as long as you can do the forward pass of this little function piece that you want to add, and as long as you know the local derivative, the local gradients, which are implemented in the backward, pytorch will be able to backpropagate through your function, and then you can use this as a lego block in a larger lego castle of all the different lego blocks that pytorch already has. so that's the only thing you have to tell pytorch, and everything will just work; you can register new types of functions in this way, following this example

and that is everything that i wanted to cover in this lecture so i hope you enjoyed building out micrograd with me i hope you find it interesting insightful and yeah i will post a lot of the links that are related to this video in the video description below i will also probably post a link to a discussion forum or discussion group where you can ask questions related to this video and then i can answer or someone else can answer your questions and i may also do a follow-up video that answers some of the most common questions but for now that's it i hope you enjoyed it
