Myia (Theano 2.0)

Automatic differentiation vs. symbolic

In the literature, the term automatic differentiation (AD) is reserved for a specific technique in which the gradient of a program is calculated by creating an adjoint program, which performs the gradient calculation of the given program. Note that this adjoint program includes all control flow statements. There are two approaches to implementing AD: source code transformation (SCT) and operator overloading (OO). With source code transformation we generate the adjoint program in the host language, e.g. given a Python function we manipulate the abstract syntax tree (AST) directly in order to create a new Python function which performs the gradient computation. Operator overloading, on the other hand, overloads each operator so that it adds an entry to a tape (a Wengert list) as it executes. Once the function exits, the gradient is calculated by going through the tape in reverse order, applying the gradient operators.
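
To make the operator-overloading variant concrete, here is a minimal scalar sketch (all names are invented for illustration, not any particular framework's API): every overloaded operation appends a closure to a shared tape, and the backward pass replays the tape in reverse.

```python
# Minimal sketch of tape-based (operator-overloading) reverse-mode AD
# for scalars. All names here are illustrative, not any framework's API.

class Var:
    def __init__(self, value, tape=None):
        self.value = value
        self.grad = 0.0
        self.tape = tape if tape is not None else []

    def _new(self, value):
        # Results share the same tape as their inputs.
        return Var(value, self.tape)

    def __add__(self, other):
        out = self._new(self.value + other.value)
        # Record how to push the output gradient back to the inputs.
        def backward():
            self.grad += out.grad
            other.grad += out.grad
        self.tape.append(backward)
        return out

    def __mul__(self, other):
        out = self._new(self.value * other.value)
        def backward():
            self.grad += other.value * out.grad
            other.grad += self.value * out.grad
        self.tape.append(backward)
        return out

def grad(f, *inputs):
    tape = []
    vars_ = [Var(x, tape) for x in inputs]
    out = f(*vars_)
    out.grad = 1.0
    # Replay the Wengert list in reverse order.
    for backward in reversed(tape):
        backward()
    return [v.grad for v in vars_]

# d/dx and d/dy of x*y + x at (2, 3) -> [4.0, 2.0]
print(grad(lambda x, y: x * y + x, 2.0, 3.0))
```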

Theano does not employ AD but "a highly optimized form of symbolic differentiation". Theano requires you to define the computation graph of your model in advance. Frameworks that use AD (with operator overloading) include torch-autograd and Chainer (whose developers refer to the approach as "define-by-run"). AD is more powerful than symbolic differentiation since it allows you to backpropagate through algorithms (cf. the meta-gradient paper) and control flow, without requiring any specific DSL. Some models or algorithms (for example, ones whose computation graph depends on the input) are essentially impossible with symbolic differentiation; others (e.g. RNNs, L-BFGS) are just much easier with AD.
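
For example, with an operator-overloading AD library such as the Python autograd package (used here only as a stand-in; torch-autograd and Chainer support the same style), the number of loop iterations can depend on the input value, something a fixed symbolic graph cannot express directly:

```python
# Differentiating through input-dependent control flow ("define-by-run")
# using the HIPS autograd package; torch-autograd and Chainer behave
# analogously.
import autograd.numpy as np
from autograd import grad

def halve_until_small(x):
    # The number of iterations depends on the *value* of x, so the
    # computation graph is only known at runtime.
    while np.abs(x) > 1.0:
        x = x / 2.0
    return x

dfdx = grad(halve_until_small)
print(dfdx(3.0))   # 0.25: x is halved twice before the loop exits
```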

AD framework development

A framework using AD does not need to implement equivalents of Theano's scan or ifelse. It integrates directly with the host language, i.e. for loops (including break and continue) and if-else conditionals can be written directly in Python instead of requiring custom operators. This makes the code easier to write for the user and, equally importantly, makes the framework itself much easier to develop.
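
As an illustration, here is the kind of function that would require scan plus ifelse in Theano but is plain Python under operator-overloading AD (again using the Python autograd package as a stand-in; the function and its arguments are made up):

```python
# Plain Python control flow (if / continue / break) inside a function
# being differentiated; a symbolic framework would need dedicated
# scan/ifelse operators to express the same logic.
import autograd.numpy as np
from autograd import grad

def clipped_sum(xs, scale):
    total = 0.0
    for x in xs:
        if x < 0:
            continue              # skip negative entries
        total = total + scale * x
        if total > 10.0:
            break                 # early exit once the sum is large enough
    return total

# Gradient with respect to the second argument, `scale`.
dscale = grad(clipped_sum, 1)
print(dscale(np.array([1.0, -2.0, 3.0, 4.0, 5.0]), 2.0))   # 8.0 at these inputs
```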

Optimization

An analogy can be drawn between Theano/TensorFlow and torch-autograd/Chainer on one side, and compiled and interpreted languages on the other. Automatic differentiation with operator overloading is similar to an interpreted language, and comes with similar benefits of being able to debug at runtime (e.g. inspecting array shapes). On the other hand, frameworks that require you to predefine your model will in theory always outperform AD frameworks, because they can perform static analysis on the computation graph and apply all sorts of optimizations, just like compiled languages.

Just like compiled and interpreted languages are both valuable, I think there is room for both types of gradient-enabled numerical frameworks. Depending on the research project, one might require a highly flexible framework that is easy to debug, while for other projects the performance of off-the-shelf models might be more important.

Problems with current AD frameworks

The current AD frameworks can incur significant overhead from tape management and operator overloading when the model involves many small operators, e.g. when working with scalars or small vectors. This is exacerbated by the AD logic being written in Python and/or Lua, which are relatively slow languages compared to e.g. C. (I have had models in torch-autograd where 35% of the time was spent retrieving upvalues in the wrapped operators.)
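
A rough way to observe this overhead (a hypothetical micro-benchmark; absolute numbers will vary by machine and library) is to time a scalar-heavy function with and without the AD machinery attached:

```python
# Rough micro-benchmark of per-operator overhead on scalar-heavy code:
# the same loop is timed as a plain forward pass and wrapped by an
# operator-overloading AD library (HIPS autograd used as a stand-in).
import timeit
import autograd.numpy as np
from autograd import grad

def scalar_chain(x):
    for _ in range(1000):
        x = np.tanh(x) + 0.1    # many tiny operators, each one taped under AD
    return x

dfdx = grad(scalar_chain)

print("forward only:", timeit.timeit(lambda: scalar_chain(0.5), number=100))
print("with AD tape:", timeit.timeit(lambda: dfdx(0.5), number=100))
```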

Neither torch-autograd nor Chainer seems to be particularly intelligent about when to allow in-place operations, or about which values to store for the backward pass. Many operators (e.g. addition, tanh) don't require their input during backpropagation, so tanh(x) can be done in place (assuming that no operator which does need its inputs for the backprop has x as an input). In practice, this means that each operator needs to tag an array as read-only when it needs it for the backprop. Subsequent operators will not be allowed to operate in-place on these arrays.
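
One possible scheme (hypothetical, not the API of any existing framework): operators whose backward pass needs an input tag that input as read-only, and in-place operators fall back to allocating a fresh buffer when they encounter such a tag.

```python
# Hypothetical read-only tagging scheme: an operator that needs an input
# for its backward pass marks it read-only, so later operators cannot
# overwrite that buffer in place.
import numpy as np

class TapedArray:
    def __init__(self, data):
        self.data = np.asarray(data, dtype=float)
        self.readonly = False   # set once some backward pass depends on this buffer

def tanh(x):
    # tanh's backward pass only needs the *output* (1 - y**2),
    # so the input does not have to be preserved.
    return TapedArray(np.tanh(x.data))

def tanh_inplace(x):
    if x.readonly:
        # Someone needs x for their backward pass: fall back to a copy.
        return tanh(x)
    np.tanh(x.data, out=x.data)
    return x

def mul(x, y):
    # d(x*y)/dx = y and d(x*y)/dy = x, so both inputs must survive until
    # the backward pass: tag them read-only.
    x.readonly = True
    y.readonly = True
    return TapedArray(x.data * y.data)

a = TapedArray([0.5, 1.0])
b = TapedArray([2.0, 3.0])
c = mul(a, b)         # a and b are now read-only
d = tanh_inplace(a)   # a is needed for mul's backward pass, so this copies
e = tanh_inplace(c)   # nothing needs c for a backward pass: safe to overwrite
```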

torch-autograd (not sure about Chainer) does not accumulate gradients in-place, i.e. if two operators contribute to dX it will allocate two tensors and then sum them, which is inefficient.
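
A sketch of the in-place alternative (hypothetical helper, not torch-autograd's API): the gradient buffer for each variable is allocated on the first contribution and every later contribution is added into it.

```python
# In-place gradient accumulation: one buffer per variable, every
# contributing operator adds into it instead of allocating its own tensor.
import numpy as np

grads = {}   # variable name -> gradient buffer

def accumulate(var, contribution):
    buf = grads.get(var)
    if buf is None:
        grads[var] = contribution.copy()   # first contribution: allocate once
    else:
        buf += contribution                # later contributions added in place
    return grads[var]

# Two operators contribute to dX without a second full-size temporary.
accumulate("X", np.array([1.0, 2.0]))
accumulate("X", np.array([0.5, 0.5]))
print(grads["X"])   # [1.5 2.5]
```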

torch-autograd tried to sidestep some of these issues through an awkward distinction between direct mode (which can be relatively slow) and optimized mode (which is faster, but comes with significant restrictions and is internally muddled).

Outline of Myia

  • The AD tape overhead and the Python interface can run entirely in parallel with the actual kernel execution
  • Ability to execute kernels in parallel (i.e. on different threads or CUDA streams)
    • The previous two points can be achieved by having one thread for the Python interpreter which sends jobs to the scheduler, which runs in a separate thread. The scheduler immediately returns a future which the main thread continues with. The scheduler tracks the dependencies of each operator, and once all the dependencies have been met (i.e. the actual inputs are available) it schedules a kernel on a worker thread or CUDA stream. (A minimal sketch of this design follows this list.)
  • For each operator we define:
    • A method which performs the forward prop and, depending on which input variables are differentiable, marks those inputs as read-only (guaranteeing that they can still be used in the backprop)
    • A backprop method that either allocates the gradient with respect to each input (if it does not exist yet) or adds its gradient contribution to it in-place
  • The back-end can be done using TH, THC, THNN and THCUNN
  • The entire framework should keep open the possibility of implementing checkpointing. Checkpointing means not saving some of the variables required for the backward pass and instead recomputing them during the backward pass. This trades computation for memory usage, and has the potential to train much, much larger networks out of the box; a killer feature.
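
A minimal sketch of the scheduler design from the outline above (all class and function names are hypothetical): the main Python thread submits kernels and immediately receives futures, while a scheduler thread tracks dependencies and releases each kernel to a worker pool (or, on a GPU, a CUDA stream) once all of its input futures have completed.

```python
# Hypothetical scheduler sketch: the main thread submits kernels and gets
# futures back immediately; a scheduler thread tracks dependencies and
# launches each kernel on a worker thread once its inputs are available.
import queue
import threading
from concurrent.futures import Future, ThreadPoolExecutor

class Scheduler:
    def __init__(self, num_workers=4):
        self.jobs = queue.Ueue() if False else queue.Queue()
        self.workers = ThreadPoolExecutor(max_workers=num_workers)
        threading.Thread(target=self._loop, daemon=True).start()

    def submit(self, kernel, *deps):
        """Called from the main Python thread; returns a future immediately.
        `deps` are futures from earlier submit() calls (or plain values)."""
        result = Future()
        self.jobs.put((kernel, deps, result))
        return result

    def _launch(self, kernel, deps, result):
        # All inputs are ready: run the kernel on a worker thread
        # (a CUDA stream would play the same role for GPU kernels).
        args = [d.result() if isinstance(d, Future) else d for d in deps]
        self.workers.submit(lambda: result.set_result(kernel(*args)))

    def _loop(self):
        while True:
            kernel, deps, result = self.jobs.get()
            pending = [d for d in deps if isinstance(d, Future) and not d.done()]
            if not pending:
                self._launch(kernel, deps, result)
                continue
            counter = {"n": len(pending)}
            lock = threading.Lock()

            def on_done(_f, kernel=kernel, deps=deps, result=result,
                        counter=counter, lock=lock):
                with lock:
                    counter["n"] -= 1
                    ready = counter["n"] == 0
                if ready:
                    self._launch(kernel, deps, result)

            for d in pending:
                d.add_done_callback(on_done)

sched = Scheduler()
a = sched.submit(lambda: 2.0)               # no dependencies
b = sched.submit(lambda: 3.0)
c = sched.submit(lambda x, y: x * y, a, b)  # waits until a and b are done
print(c.result())                           # blocks only when the value is needed
```

The main thread only blocks when it explicitly asks a future for its value, which is what allows the tape and Python overhead to overlap with the actual kernel execution.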