bartvm/dl-frameworks.rst

## dl-frameworks.rst

      
    Raw
  

              dl-frameworks.rst
            
          
    A comparison of Theano with other deep learning frameworks, highlighting a series of low-level design choices in no particular order.

Overview

Differentiation
Computation graph/IR
Optimizations
Scheduling
GPU/CPU transparency
Foreign function interface
Dynamic compilation / metaprogramming
Theano wishlist


Differentiation

Symbolic: Theano, CGT; Automatic: Torch, MXNet
Symbolic and automatic differentiation are often confused or used interchangeably, although their implementations are significantly different.
Reverse automatic differentiation is a generalization of backpropagation [1]. When a gradient needs to be calculated, the input is forward propagated through the computation graph with each node remembering its input. The graph is then traversed in reverse, calling the "backward" method of each node with the gradient (with respect to the node's outputs) as its argument.
On the other hand, symbolic differentiation relies on each operator having its gradient defined in terms of the same symbolic operators that constructed the graph. When the gradient of a parameter is requested, the nodes are traversed in reverse, each returning a series of symbolic operators that extend the computation graph. The result is an extended computation graph that defines a path from input to gradient.

Graph size
With automatic differentiation the same graph is traversed twice, halving the size of the graph that needs to be optimized and compiled.
Debugging
Automatic differentiation is easier to debug, since each operator that could raise an error was at one point called by the user. With symbolic differentiation, new operators might appear as part of the gradient calculation which are hard to place for the user.
Optimization
Symbolic differentiation requires thorough optimization of the graph in order to e.g. remove common subexpressions whereas automatic differentiation requires no optimization whatsoever. However, symbolic differentitation might allow for more optimization e.g. operators that cannot be fused in the forward pass, could be fused in the backward pass.
Memory
Automatic differentiation requires each operator to remember its input (at least, a naive implementation does [2]). Symbolic differentiation with proper memory optimization routines could perhaps be more memory efficient.
Higher order derivatives
Symbolic differentiation comes with the advantage that only the forward R-op needs to be implemented. With automatic differentiation, we need to implement both the forward and the backward R-op. [3]
Flexibility
Symbolic differentiation seems to be more flexible when considering cases in which we have multiple cost functions, want to have a symbolic representation of the gradient to manipulate further, etc. Automatic differentiation can support all these cases, but requires the framework to be extended in order to provide symbolc expressions of the gradients to be manipulated further.


[1] Automatic differentiation in machine learning: a survey


[2] The Data-Flow Equations of Checkpointing in Reverse Automatic Differentiation


[3] torch/nn#399


Computation graph/IR

Many deep learning libraries rely on the ability to construct a computation graph, which can be considered the intermediate representation (IR) of our program. The lazy construction of a graph allows for optimization (Theano, CGT), scheduling (MXNet), and/or automatic differentiation (Torch, MXNet). The question is what information should be contained in this graph.

Shapes
The shape of a tensor can be defined when constructing the graph. Requiring this reduces flexibility significantly (for example, minibatches can have different shapes). Alternatively, shape information can be left out of the graph entirely, which is currently the case in Theano (except for some specialized operators like convolutions). [*] The middle ground would be to make the specification of shape information optional. This can useful for optimization purposes: If some tensors are determined to be of fixed size, they could be allocated at compilation time already (trades memory efficiency for performance) [4].
Memory allocation
The allocation of memory can be made part of the IR. That is, we can represent y = x + 1 as Var(x), Constant(1) -> [Add] -> y, or as const_storage = Allocate((1,)); y_storage = Allocate(x.shape); const_storage, x_storage y_storage -> [Add] -> y_storage. Unclear as to what the advantages of either approach are.
Loops and branches
Loops and branches can be supported in different ways. Theano's approach seems to be to enforce acyclicity of the computation graph by treating loops as single nodes (the scan op), which itself contains another computation graph. Branches are handled by the task scheduler (confusingly called virtual machines in Theano). Most other frameworks don't have any support for loops and branches. A more principled way of dealing with loops would be to allow loops in the computation graph. In order to maintain strict static single assignment (SSA, [5]) form, this requires the implementation of Phi functions (or something similar [6]). Keeping the graph flat could make loops more scalable e.g. allow for nested loops.
Device independence
Theano tries to support a variety of backends. In the case of Theano, they share the same IR, which is then transformed into a device-specific representation during the optimization phase. The device-independent optimization is separated from the device-specific stuff by setting optimization "levels". Programmatically, this could be separated more fully. In the spirit of compilers, we would have an IR optimizer, and then an entirely separate device-specific optimizer and/or task scheduler.


[*] Theano's documentation discusses "shape promotion", but I can find no code for the mentioned optimizations, local_shape_lift_*.


[4] MXNet documentation


[5] Wikipedia: Static single assignment


[6] WebKit code


Optimizations

One of Theano's fortes is its optimization engine, which guarantees both computational efficiency as well as numerical stability [7]. Only CGT emulates this behaviour; the other frameworks contain no optimization framework. A clear parallel can be drawn with the optimization literature:

Common subexpression eliminiation (CSE, called merge in Theano)
Constant folding
Constant propagation
Algebraic simplification (add, mul, pow specializations, neg_neg)
Algebaric reassociation (called "canonicalization")
Dead store elimination (Theano will discard the part of the computation graph that is not needed to calculate the requested outputs)
Operator strength reduction/local optimizations (e.g. merging multiple operators into a single GEMM call)

More unique to Theano's application are optimizations for numerical stability (which violates the concept of a compiler producing semantically equivalent output).

Non-elementwise fusing
An interesting idea would be to evaluate the possibility of non-elementwise fusing of operations e.g. multiple matrix-matrix multiplications that have one side in common can be combined e.g. y = Wx and y' = W'x can be calculated in a single call to [y | y'] = [W | W']x (cf. the LSTM trick as implemented in Blocks). Doing this involves optimizing memory allocations.
Debugging
An optimization framework complicates debugging, because the resulting computation graph is different from the user's constructed graph, making it difficult to pinpoint which exact instruction is the cause of the error.
Performance
The main goal of optimization is to increase performance. For Theano, the differences with and without optimization are significant. However, other frameworks can achieve speeds similar to Theano in the absence of an optimization framework. This could be due to Theano's finer-grained set of operators (compared to more monolithic "layers" in other frameworks), or perhaps their is a certain inefficiency inherent to Theano.
Usability
An optimization framework requires less expert knowledge of the user; inefficient operators are automatically optimized, allowing the user to focus on constructing models as opposed to optimize the efficiency of their code.


[7] Theano optimizations


Scheduling

Task scheduling (or instruction scheduling, to draw a parallel to the optimization literature) is the scheduling of the execution of nodes (operators) in the computation graph after it has been optimized (and, optionally, compiled). Significant gains can be made here by making use of parallelism (streams in CUDA [8], threads on CPU). This is an approach many new frameworks have focused on e.g. MXNet [9], Purine [10], Minerva [11], etc.


[8] CUDA streams


[9] MXNet's "dependency engine"


[10] Purine: A bi-graph based deep learning framework


[11] Minvera's "dataflow engine"


Multi-GPU
Schedulers can also be written to support the multi-GPU case. Automatic model parallelism is far from a solved case, whereas data parallelism can be an easy and straightforward way to schedule.
Asynchronous calls
Ideally, a scheduler should be called asynchronously. For example, data transferring and computation can be performed in parallel, so we could potentially load the training data onto the GPU while the previous computation is still ongoing. There are significant gains to be made here in some cases. Coding-wise and interface-wise, this is more complex than the current approach though.
CUDA streams and events
CUDA kernels that can run in parallel can be issued on separate CUDA streams so that they can execute in parallel. They can further be made to synchronize by making them wait for events. It would be interesting to know how much can be gained by employing these methods (as compared to using the 0 stream and/or synchronizing streams by calling e.g. cudaEventQuery).


GPU/CPU transparency

The interaction between CPU and GPU computation can be done in a variety of ways.

Fully transparent
Theano tries to make the interaction fully transparent by offloading to GPU when possible, but also transparently moving data to the host and back again if needed i.e. when a certain operator is only implemented on CPU. The advantage is that the user can still make use of GPU acceleration, even when a particular op isn't implemented on GPU. The downside is that a lot of subpar performance is due to the Theano transferring data to the host without the user being aware of it (e.g. because they used advanced indexing). This means users still need to write "GPU-runnable" Theano code.
Explicit transfers
Torch takes a more explicit approach. GPU-enabled operators will only accept GPU arrays, and vice versa. Moving tensors to and from the GPU simply requires calling cuda() or double() on the tensor [†]. The advantage is that the user is fully aware of the device the computation is being performed on. The downside is that scripts need to enable GPU usage by explicitly transferring tensors to the GPU.
Exclusive mode
A final option is to not allow transfer between the CPU and GPU within a single computation graph. The disadvantage is that if a user really needs to perform part of the computation on CPU, they will need to construct and compile separate computation graphs. The advantage is the user is fully aware of the transfer. Internally, this restriction could simplify the code. This is what Caffe does I believe.


[†] Note that Torch does not have a graph compiler or task scheduler, which means that the user can interrupt the computations with calls to cuda() or double() in an imperative style whenever. It is more difficult to imagine what this kind of explicit CPU-GPU transfer would look like in the case of a full computation graph approach.


Foreign function interface

Theano: C extensions,
CGT: Cython,
scikit-cuda: ctypes,
cuda4py: CFFI
There are a variety of methods to interface Python with C, each with pros and cons.

C extension (Python C API)
Theano's current approach. Small overhead and well-supported, but introduces quite some complexity e.g. the user is responsible for reference counting of Python objects.
ctypes
Part of the standard library, and the approach used by e.g. scikit-cuda. Very simple, well-supported, and requires only Python code. However, can incur some significant overhead (anywhere from 50% [12] to 250 compared to C-API [13]).
CFFI
PyPy's preferred way of interfacing with C, inspired by LuaJIT's FFI (which is one of the main reasons Torch was written in Lua). It is quite verbose, but relatively fast and pretty simple. Used by the cuda4py library. [14]
Cython
Popular framework that compiles an extended (annotated) version of Python to C extensions which can be imported as modules. Used by CGT. [15]
Numba
Although Numba does not provide a C-interface directly, its JIT compiler supports compiling Python functions with both ctypes [16] and CFFI calls [17], potentially speeding them up.


[12] http://www.dalkescientific.com/writings/NBN/c_extensions.html


[13] http://stackoverflow.com/a/8069179


[14] cuda4py


[15] cycgt.pyx


[16] Numba ctypes


[17] Numba cffi


Dynamic compilation / metaprogramming

✓ Theano
✓ CGT
✗ Torch
✗ MXNet
Libraries can either dynamically compile their operators (e.g. Theano), or they can simply call pre-compiled operators (e.g. Torch).

Performance
Dynamic compilation of kernels and C functions allows them to be completely specialized, skipping unnecessary input checks, using less general but more efficient implementations. Frameworks with hand-tuned kernels like cuDNN and cuBLAS reduce the need for dynamic compilation, since they are often more efficient than custom-written ones.
Compilation time
Having dynamically compiled operations introduces the need to wait for functions to compile before they can be evaluated. This can be mitigated by having non-compiled version of the operators (e.g. some Theano operators have NumPy implementations), but note that this introduces the burden of maintaining multiple implementations of each operator, and requires the operators to behave identically.
Kernel fusing
Having dynamically compiled kernels allows e.g. element-wise operations to be fused into a single kernel. For GPU programming, this can be beneficial since launching the kernel and transferring data from global to shared memory can be a significant overhead. In Theano, this is referred to as "elemwise fusion".
Foreign language interface
If multiple nodes in the computation graph can be compiled together, it can limit the number of functions calls to be made from the host language (e.g. Python for Theano, Lua for Torch). For languages like Python, with a relatively slow FFI and significant function call overhead, this could make a measurable difference.
Robustness
Dynamically compiling operators introduces a compilation step in the pipeline, which can be a source of errors e.g. (library) paths must be set correctly, memory issues can occur during compilation, etc. Dynamically compiled operators are harder to debug, because reproducing the error might require reproducing the exact conditions under which the operator was compiled.
Shared libraries
CGT compiles shared libraries which are dynamically linked during runtime, whereas Theano compiles a Python C-API extension (requiring Python and NumPy headers). The later can be significantly slower.


Theano wishlist


Stick with symbolic differentiation
No more dynamic compilation (except for element-wise kernel fusing perhaps); hand-tuned libraries like cuDNN have diminished the advantages of this approach, it simplifies code significantly, and no more waiting for things to compile.
Support for loops in the computation graph, full SSA form with Phi-functions
GPU task scheduling with support for CUDA streams and asynchronous calls.
CFFI or Cython interface; The C-API is more complex and brings few advantages if some of the more user-friendly alternatives can be made to perform at a similar speed.
Exclusive CPU/GPU modes; simplifies code and removes source of unexpected slowdowns.
Strict separation of IR from backend optimization.
Blue sky thinking: Non-elementwise fusing of kernels, requires intelligent memory allocation.