To train a network, we first call `train_generic`:
```ocaml
let train_generic ?state ?params ?(init_model=true) nn x y =
  if init_model = true then init nn;
  let f = forward nn in
  let b = backward nn in
  let u = update nn in
  let s = save nn in
  let p = match params with
    | Some p -> p
    | None   -> Optimise.Params.default ()
  in
  Optimise.minimise_network ?state p f b u s x y
```
We can see that it is actually a thin wrapper around `minimise_network`:
```ocaml
let minimise_network ?state params forward backward update save x y =
  ...
  let iterate i =
    let xt, yt = batch_fun x y i in
    let yt', ws = forward xt in
    let loss = loss_fun yt yt' in
    let loss = Maths.(loss / (_f (Mat.row_num yt |> float_of_int))) in
    let reg = ...
    let loss = Maths.(loss + reg) in
    let ws, gs' = backward loss in
    loss |> primal', ws, gs'
  in
  ...
  while Checkpoint.(state.stop = false) do
    let loss', ws, gs' = iterate Checkpoint.(state.current_batch) in
    ...
  done
```
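Stripped of the network and Checkpoint machinery, the loop has a very simple shape. Here is a toy sketch (not Owl code) with the same `iterate` / `while`-not-stop structure, minimising f(w) = (w - 3)² by gradient descent; a plain iteration cap stands in for `Checkpoint`:

```ocaml
(* Toy of the minimise_network loop shape: fixed-step gradient descent
   on f(w) = (w - 3.)^2, with a simple iteration cap as the stop test. *)
let w = ref 0.0
let stop = ref false
let i = ref 0

(* like iterate: produce the loss and the gradient for the current state *)
let iterate () =
  let loss = (!w -. 3.0) ** 2.0 in
  let grad = 2.0 *. (!w -. 3.0) in
  loss, grad

let () =
  while not !stop do
    let _loss, g = iterate () in
    w := !w -. 0.1 *. g;               (* like update *)
    incr i;
    if !i >= 100 then stop := true     (* like Checkpoint's stop flag *)
  done;
  Printf.printf "w = %g\n" !w          (* converges towards 3 *)
```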
What are the `forward` and `backward` functions here?
```ocaml
let forward nn x = mktag (tag ()) nn; run x nn, mkpar nn
let backward nn y = reverse_prop (_f 1.) y; mkpri nn, mkadj nn
```
These two operations are called at each iteration, so they let us see the process from a high level.

First, distinguish two kinds of neurons: those that contain weights to update, such as `Conv2D`, which I'll call Type A neurons; and those that only perform computation, such as `MaxPool2D`, which I'll call Type B neurons. Hereafter I'll use "layer" and "neuron" interchangeably.
Each neuron normally contains several nodes, each of which has type `t`:
```ocaml
type t =
  | F of A.elt                                 (* constructor of float numbers *)
  | Arr of A.arr                               (* constructor of ndarrays *)
  | DF of t * t * int                          (* primal, tangent, tag *)
  | DR of t * t ref * trace_op * int ref * int (* primal, adjoint, op, fanout, tag *)
```
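To get a feel for these constructors, here is a toy, self-contained mirror of the node type over plain floats (the real `t` wraps `A.elt`/`A.arr` and also has the forward-mode `DF` case). It builds the nodes that an expression like `relu (x - _f 2.)` would create, one per operation:

```ocaml
(* Toy mirror of Algodiff's node type over plain floats; NOT Owl code. *)
type t =
  | F of float                                   (* constant scalar *)
  | DR of float * float ref * op * int ref * int (* primal, adjoint, op, fanout, tag *)
and op = Noop | Sub_D_C of t * float | Relu_D of t

let primal = function F a -> a | DR (p, _, _, _, _) -> p

(* relu (x - 2.) from a DR input x: one node per operation *)
let x = DR (3.0, ref 0.0, Noop, ref 0, 0)                         (* input node *)
let s = DR (primal x -. 2.0, ref 0.0, Sub_D_C (x, 2.0), ref 0, 0) (* sub node *)
let y = DR (max 0.0 (primal s), ref 0.0, Relu_D s, ref 0, 0)      (* relu node *)

let () = Printf.printf "primal of y = %g\n" (primal y)
```

Each node records its inputs inside its `op`, which is exactly how the graph ends up "inside" the output value.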
For example, if the `run` function of one node is `Maths.(relu (x - _f a))`, it generates three nodes. I don't think we need the `DF` values here; hereafter I'll understand the primal value of a `DR` as the value of the operation itself, and its adjoint value as the gradient. (Wait, what is the primal anyway? The weight? The output value? The graph is definitely held in `op`, not here...)
`mktag (tag ()) nn`: for each layer in a neural network, if it is Type A, change each of its parameters to a `DR` value by calling `make_reverse`; if it is a Type B neuron, do nothing.
```ocaml
let make_reverse p i = DR (p, ref (zero p), Noop, ref 0, i)
```
Note that the `DR` values created here are all `Noop` operations, which means they are nothing more than placeholders at this stage.
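The placeholder idea can be sketched over plain floats (in Owl the primal is a generic `t` and `zero` also handles ndarrays; here a float and `0.` stand in):

```ocaml
(* Sketch of make_reverse: wrap a weight so reverse mode can track it.
   The value is kept as the primal, the adjoint starts at zero, and the
   op is Noop, i.e. a pure placeholder. *)
type op = Noop
type t = F of float | DR of float * float ref * op * int ref * int

let make_reverse p i = DR (p, ref 0.0, Noop, ref 0, i)

let w = make_reverse 0.5 7   (* a weight tagged with 7 *)

let () =
  match w with
  | DR (p, adj, Noop, _fanout, tag) ->
      Printf.printf "primal=%g adjoint=%g tag=%d\n" p !adj tag
  | F _ -> ()
```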
`run x nn`: connect all the existing operations into a graph by running the network layer by layer, regardless of whether a layer is Type A or B. The whole graph accumulates onto the output node. Note that the `run` function of each neuron uses operations from `Algodiff.Maths` rather than plain math operations. Let's look at an example:
```ocaml
(* module Conv2D *)
let run x l = Maths.((conv2d ~padding:l.padding x l.w l.stride) + l.b)
```
Here both `l.w` and `l.b` have already been set to `DR` placeholders, and `x` is a `t` value output by the previous neuron. How is the `conv2d` operation implemented in Algodiff, then?
```ocaml
and conv2d ?padding a b s =
  let ff a b =
    match a, b with
    | Arr a, Arr b -> Arr A.(conv2d ?padding a b s)
    | _            -> error_binop "conv2d" a b
  in
  let fd a b = conv2d ?padding a b s in
  ...
  let r_d_c a b = Conv2D_D_C (a, b, s) in
  op_d_d_d a b ff fd _ _ _ _ r_d_c _
```
Here `a` and `b` are the input and the kernel respectively, both of type `t`. For simplicity, we only look at the case where the input is a `DR` and the kernel is a constant `Arr`. (One question: wait, you just said that `l.w` was set to a `DR` in the previous step; how can it be an `Arr` now? What has changed it?)
```ocaml
and op_d_d_d a b ff fd df_da df_db df_dab r_d_d r_d_c r_c_d =
  match a, b with
  | DR (ap, _, _, _, ai), Arr _bp ->
    let cp = fd ap b in
    DR (cp, ref (zero cp), r_d_c a b, ref 0, ai)
  | ...
```
So what `Maths.conv2d` does is this: compute the result value from the primal of the input `DR` (`let cp = fd ap b`, i.e. `let cp = conv2d ?padding ap b s`), initialise the adjoint to zero (`ref (zero cp)`), replace the `Noop` operation with `Conv2D_D_C` (`r_d_c a b`), and keep the tag as is (`ai`).
What I don't quite understand is the function `fd`; it calls itself, but why? I temporarily interpret it as "calculating the result value", but that is quite likely wrong.
The translation of a Type B neuron is similar. For example, for the `Maxpool2D` neuron:
```ocaml
let run x l = Maths.(max_pool2d l.padding x l.kernel l.stride)
```
```ocaml
and max_pool2d padding a b s =
  let ff = function
    | Arr a -> Arr A.(max_pool2d ~padding a b s)
    | _     -> error_uniop "max_pool2d" a
  in
  let fd a = max_pool2d padding a b s in
  let df _cp _ap _at = failwith "max_pool2d:df" in
  let r a = Maxpool2D_D (a, padding, b, s) in
  op_d_d a ff fd df r

and op_d_d a ff fd df r =
  match a with
  | DF (ap, at, ai)      -> ...
  | DR (ap, _, _, _, ai) -> let cp = fd ap in DR (cp, ref (zero cp), r a, ref 0, ai)
  | ap                   -> ff ap
```
If the input is a `DR`, this operation similarly adds a `DR` node to the graph; otherwise `ff` is called, and an `Arr` node is added.
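The dispatch can be sketched over floats, with `relu` standing in for `max_pool2d` (a toy, not Owl code): a `DR` input extends the graph, anything else is computed eagerly.

```ocaml
(* Toy of the op_d_d dispatch: graph node for DR, eager value otherwise. *)
type t =
  | F of float
  | DR of float * float ref * op * int ref * int
and op = Noop | Relu_D of t

let relu a =
  match a with
  | DR (ap, _, _, _, ai) ->
      let cp = max 0.0 ap in
      DR (cp, ref 0.0, Relu_D a, ref 0, ai)  (* graph node, adjoint zeroed *)
  | F ap -> F (max 0.0 ap)                   (* ff branch: plain value *)

let x = DR (-1.5, ref 0.0, Noop, ref 0, 0)
let y = relu x         (* a DR node recording Relu_D *)
let z = relu (F 2.5)   (* a plain value, no graph node *)

let () =
  (match y with
   | DR (cp, _, Relu_D _, _, _) -> Printf.printf "graph node, primal %g\n" cp
   | _ -> ());
  (match z with
   | F v -> Printf.printf "plain value %g\n" v
   | _ -> ())
```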
After a forward pass is finished, we get one output `DR` value. But it is much more than an output value; it actually carries a whole computation graph in its `op`:
```ocaml
and trace_op =
  | Conv1D_D_C  of t * t * int array
  | Maxpool2D_D of t * padding * int array * int array
  ...
```
The output of each `run` function accumulates in this way into a graph of `t`s.
The final step, `mkpar nn`, is simple: return the parameters of each layer in an array, which is a `t array array`.
We already have the graph in the output `DR` value `y` from the forward pass; now let's apply the backward step to it. Note that the backward step is actually applied to a `loss` value, which appends some extra nodes to the end of the forward-pass output graph, but let's ignore that for now.
We start from `reverse_prop (_f 1.) y`, which simply comprises two steps:
```ocaml
let reverse_prop v x =
  reverse_reset x;
  reverse_push v x
```
```ocaml
let reverse_reset x =
  let rec reset xs =
    match xs with
    | [] -> ()
    | x :: t ->
      (match x with
       | DR (_ap, aa, ao, _af, _ai) -> (
           aa := reset_zero !aa;
           match ao with
           | Noop -> reset t
           | ....)
       | _ -> reset t)
  in
  reset [x]
```
```ocaml
let reverse_push v x =
  let open Maths in
  let rec push xs =
    match xs with
    | [] -> ()
    | (v, x) :: t ->
      (match x with
       | DR (ap, aa, ao, af, _ai) -> (
           aa := Maths.(!aa + v);
           match ao with
           | Noop -> push t
           | Conv2D_D_C (a, b, s) ->
             push ((conv2d_backward_input a b s !aa, a) :: t)
           | ....)
       | _ -> push t)
  in
  push [(v, x)]
```
- No magic in the `reverse_reset` function. Starting from the root node `y`, for each node: 1) set its own gradient to 0; 2) add its parents to the stack; 3) process the first element of the stack until the stack is empty.
- `reverse_push` is a little more complex, but similar. Starting from `(v, y)`, where the `v` for `y` is 1, for each node: 1) update its gradient by adding `v` to its current gradient; 2) calculate the `v` for each of its parents; 3) add `(v, parent)` to the stack; 4) process the first element of the stack until the stack is empty. In both steps, if a node is not a `DR`, just ignore it.
- After one backward step, the gradient of each node has been updated.
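The two passes fit in a short, self-contained miniature over floats, computing dy/dx for y = relu (x - 2.) at x = 3. with the same stack discipline as the real code (scalar ops stand in for the ndarray ones):

```ocaml
(* Miniature reverse_reset / reverse_push; NOT Owl code. *)
type t =
  | F of float
  | DR of float * float ref * op * int ref * int
and op = Noop | Sub_D_C of t | Relu_D of t

let primal = function F a -> a | DR (p, _, _, _, _) -> p

let sub_c a c =
  match a with
  | DR (ap, _, _, _, ai) -> DR (ap -. c, ref 0.0, Sub_D_C a, ref 0, ai)
  | F ap -> F (ap -. c)

let relu a =
  match a with
  | DR (ap, _, _, _, ai) -> DR (max 0.0 ap, ref 0.0, Relu_D a, ref 0, ai)
  | F ap -> F (max 0.0 ap)

(* pass 1: zero every adjoint reachable from the root *)
let rec reset xs =
  match xs with
  | [] -> ()
  | x :: t ->
    (match x with
     | DR (_, aa, ao, _, _) ->
         aa := 0.0;
         (match ao with
          | Sub_D_C a | Relu_D a -> reset (a :: t)
          | Noop -> reset t)
     | F _ -> reset t)

(* pass 2: push (gradient, node) pairs down the graph *)
let rec push xs =
  match xs with
  | [] -> ()
  | (v, x) :: t ->
    (match x with
     | DR (_, aa, ao, _, _) ->
         aa := !aa +. v;
         (match ao with
          | Sub_D_C a -> push ((v, a) :: t)          (* d(a - c)/da = 1 *)
          | Relu_D a ->
              let g = if primal a > 0.0 then v else 0.0 in
              push ((g, a) :: t)
          | Noop -> push t)
     | F _ -> push t)

let x = DR (3.0, ref 0.0, Noop, ref 0, 0)
let y = relu (sub_c x 2.0)

let () =
  reset [y];
  push [(1.0, y)];
  match x with
  | DR (_, aa, _, _, _) -> Printf.printf "dy/dx = %g\n" !aa
  | F _ -> ()
```

Note how `push` never differentiates anything symbolically: each `op` case knows how to turn the incoming gradient `v` into the gradients of its parents, exactly as `Conv2D_D_C` turns `!aa` into `conv2d_backward_input a b s !aa` in the real code.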
The rest is easy. `mkpri nn, mkadj nn`: collect the primal (weight) value and the adjoint (gradient) of each node into arrays, for the nodes that have them.