@pjstein
Forked from rxwei/ad-manifesto.md
Created October 30, 2018 17:25
First-Class Automatic Differentiation in Swift: A Manifesto

This document is written for both the machine learning community and the Swift programming language design community, with a strong focus on language design.

Table of Contents

Introduction

Automatic Differentiation (AD), also known as algorithmic differentiation, is a family of techniques used to obtain the derivative of a function. Functions can be represented as a composition of elementary operators whose derivatives are well-known. While partial derivatives can be computed through different techniques, the most common is a recursive application of the chain rule in the reverse direction, called reverse-mode AD. Reverse-mode AD computes vector-Jacobian products, i.e. partial derivatives with respect to each input parameter, and it has become a prerequisite for implementing gradient-based learning methods.

We aim to provide best-in-class AD, including the best optimizations, best error messages in failure cases, and the most flexibility and expressivity. To achieve this, we built support for AD right into the Swift compiler. This manifesto explains the design and vision of AD, and introduces to you the language extensions that will make Swift the world's first general-purpose differentiable programming language.

What is AD?

Basic Calculus

In basic calculus, differentiating a function of type ℝ → ℝ produces a function of type ℝ → ℝ that maps points onto their corresponding slopes.

In the context of Swift, differentiating a function (Float) -> Float produces (Float) -> Float. Functions with multiple arguments, such as (Float, Float) -> Float, can be thought of as functions whose input domain is the product of those argument types, i.e. ℝ × ℝ, so the derivative of such a function has type (Float, Float) -> (Float, Float). According to this typing rule, the differential operator 𝒟 can be declared as a higher-order function, overloaded for each number of arguments because a Swift function's argument list is not formally modeled as a tuple.

func 𝒟<T: FloatingPoint>(_ f: (T) -> T) -> (T) -> T
func 𝒟<T: FloatingPoint>(_ f: (T, T) -> T) -> (T, T) -> (T, T)
func 𝒟<T: FloatingPoint>(_ f: (T, T, T) -> T) -> (T, T, T) -> (T, T, T)
...
func f(_ x: Double, _ y: Double) -> Double {
    return tanh(x + y)
}
𝒟(f) // (Double, Double) -> (Double, Double)
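Since d/du tanh(u) = 1 - tanh²(u), both partial derivatives of f equal 1 - tanh²(x + y). A hypothetical use of the operator above:

// Hypothetical usage, assuming the overloads of 𝒟 declared above.
let df = 𝒟(f)                    // (Double, Double) -> (Double, Double)
let (dfdx, dfdy) = df(0.3, 0.2)  // both equal 1 - tanh(0.5) * tanh(0.5), roughly 0.786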

Vectors and Jacobians

In numerical computing, users often write code that operates on high-dimensional mathematical objects. The basic typing rules that we defined on real scalars (ℝ → ℝ) can be generalized to module-like types such as vectors, with extra consideration for shape. In vector calculus, the differentiation of a function f: ℝⁿ → ℝᵐ is defined per scalar because there are multiple inputs and multiple outputs. Full differentiation of a vector-valued function thus results in a matrix, each of whose entries is a function that computes the partial derivative of an output scalar with respect to an input scalar. This matrix is called a Jacobian. In this definition, the Jacobian evaluated at a point is an m×n matrix, so for simplicity we will model it as a function that maps vectors to real-valued m×n matrices.

[Figure: Automatic differentiation approaches]

While it is challenging to define this function with full type safety in Swift because shapes cannot be generic parameters yet, we can define a differential operator as the following, specialized on shapes.

func 𝒟<T>(_ f: (Vector2<T>) -> Vector3<T>) -> (Vector2<T>) -> Matrix3x2<T>
    where T: FloatingPoint

Computing the full Jacobian of a function is often unnecessary in gradient-based optimization methods: it would require repeated evaluation of two primitives, vector-Jacobian products (VJPs) and Jacobian-vector products (JVPs), and these products are often exactly what we need in practice. In these terms, "vector" refers to a vector of partial derivatives that is chained with the Jacobian by left-multiplication or right-multiplication. As we explain this chaining next, we discuss how Automatic Differentiation comes into the picture.

Gradient and Reverse-Mode AD

When we let a one-hot row vector left-multiply the Jacobian matrix of f evaluated at x, we are selecting one row in the matrix, which is exactly the gradient of the corresponding output fᵢ evaluated at x, i.e. ∇fᵢ(x).

When the vector v represents the gradient of another function g at y = f(x), namely v = ∇g(f(x)), then the vector-Jacobian product represents the gradient of g ∘ f at x, i.e. ∇(g ∘ f)(x). The linear function that takes a vector and left-multiplies it with the Jacobian is also called a pullback. We can define this function in Swift as a higher-order function shown below. The body of this function can be defined in terms of 𝒟, the differential operator that returns a Jacobian.

func pullback<T: FloatingPoint>(
    of f: (Vector2<T>) -> Vector3<T>,
    at x: Vector2<T>
) -> (Vector3<T>) -> Vector2<T> {
    return { adjoint in matmul(adjoint, 𝒟(f)(x)) }
}

However, when computing gradients or general vector-Jacobian products, we do not need to compute the Jacobian at all: Automatic Differentiation is here to help.

The chain rule of differentiation can be interpreted in left-associative order, i.e. accumulating each function's partial derivatives from the final output, eventually reaching each input.

Directional Derivatives and Forward-Mode AD

Similarly, when we let a column vector v right-multiply the Jacobian matrix of f evaluated at x, the result is a vector whose elements are exactly the directional derivatives of each output fᵢ evaluated at x in direction v.

The linear function that takes a vector and right-multiplies the Jacobian value matrix is called a differential, and it can also be defined in Swift as a higher-order function in terms of 𝒟.

func differential<T: FloatingPoint>(
    of f: (Vector2<T>) -> Vector3<T>,
    at x: Vector2<T>
) -> (Vector2<T>) -> Vector3<T> {
    return { tangent in matmul(𝒟(f)(x), tangent) }
}

Just like vector-Jacobian products, Jacobian-vector products are easy to compute using Automatic Differentiation. By simply applying the chain rule of differentiation from an input, we will accumulate each function's partial derivatives and reach each output.

AD has a rich background, and there is a large body of great documentation that provides in-depth introductions to the topic.

Why does Swift need AD?

Swift is a new programming language in the machine learning space. Recently, the Swift for TensorFlow project brought the full power of a machine learning framework into the Swift programming language. Numerical computing has a very different set of requirements than application development and systems development, and we believe that Swift needs to better address those requirements and improve the usability of numerical software. One of the most important building blocks in machine learning and numerical computing is the ability to differentiate math code. Automatic Differentiation has been implemented in many languages, but because of language constraints and design trade-offs, many existing AD systems have limitations. We would like to take this opportunity to improve Swift, and demonstrate what Swift can offer in all areas of numerical computing in the presence of a compiler and a static type system.

Why make AD first-class?

Automatic Differentiation has been a research topic in scientific computing and high-performance computing for nearly half a century. Traditional tools such as OpenAD, TAPENADE and ADIFOR are tools that transform existing source code. There are many advanced techniques that improved the performance of derivatives written in FORTRAN, but these tools have not gained wide adoption in the machine learning community. More recent AD systems like Stalin∇ (pronounced Stalingrad, available as a dialect of Scheme) achieved good usability by integrating the differential operator into the language, and are equipped with a complete set of AD features (such as forward/reverse, nested AD, Hessians, Jacobians, directional derivatives and checkpointing). Along with libraries such as DiffSharp (available in F#), and ad (available in Haskell), they combine AD closely with functional programming languages.

Researchers in the machine learning community have built many library implementations of AD in Python and C++, including Autograd, TensorFlow, PyTorch, etc.

Although Automatic Differentiation is an integral part of any machine learning framework, traditional designs and implementations of AD have some limitations. Some of these libraries are implemented as a transformation on a standalone DSL (a graph) with a closed set of operators. Others are implemented using operator overloading directly on a subset of the source language. Although these libraries have gained wide adoption, the ones that leverage ahead-of-time AD do not expose an easy-to-use programming model, and the ones that have a friendlier programming model lack static analysis to perform more optimized AD.

Recent projects such as Tangent, Myia, and Zygote.jl based their AD upon source code transformation (SCT), a technique that was common in advanced AD systems, such as Stalin∇, before the deep learning era. The first two libraries parse a Python subset into ASTs and transform a function to its derivatives either in the AST or in a functional IR, while Zygote hooks into the Julia compiler and transforms Julia's IR directly. These tools are pushing the boundaries of dynamic languages.

We would like our AD system to feel native and expressive. AD in Swift aims to solve real-world usability problems by providing the best generalizations, best error messages in failure cases, composable differential operators, and fully customizable types and derivatives. To achieve this, we built support for AD right into the Swift language. Even though AD has been incubated as part of the Swift for TensorFlow project, we believe its importance and impact is beyond machine learning, so we decided to propose it eventually through Swift Evolution into the core language.

Vision

Swift will be the world's first general-purpose differentiable programming language.

Ease of Use

We expect Swift's language-integrated AD to be super easy to use in the context of machine learning, control in robotics, and scientific computing. AD is a general language feature that works seamlessly with third-party libraries such as TensorFlow.

struct Parameters: Differentiable, ParameterGroup {
    var w1 = Tensor<Float>(randomNormal: [784, 30])
    var b1 = Tensor<Float>(zeros: [30])
    var w2 = Tensor<Float>(randomNormal: [30, 10])
    var b2 = Tensor<Float>(zeros: [10])
}

var params = Parameters()
let minibatches = Dataset(...)
var optimizer = StochasticGradientDescent()
for (x, y) in minibatches {
    let grads = gradient(at: params) { params in
        let h1 = tanh(matmul(x, params.w1) + params.b1)
        let ŷ = sigmoid(matmul(h1, params.w2) + params.b2)
        let loss = (y - ŷ).squared().mean()
        print("Loss is \(loss)")
        return loss
    }
    optimizer.fit(&params, gradients: grads)
}

Full Extensibility: Custom Types and Derivatives

We want our AD system to be fully extensible to the point where users can request derivatives of a function taking their own user-defined numeric types, and even use this feature to implement structure-dependent algorithms such as tree-recursive neural networks. Therefore, when performing AD, Swift makes no special assumptions about individual math functions or the types it should support. We enable library designers and developers to easily define any type or differentiable functions, all in pure Swift code.

Swift supports protocol-oriented programming and first-class value semantics. AD is deeply integrated with value types and has full extensibility via protocol conformances. The user can make their custom data structures differentiable simply by declaring a conformance to the Differentiable protocol:

extension MyType: Differentiable {
    ...
}

Or make an obviously non-differentiable function differentiable by using the @differentiable attribute, specifying a "tangent" function for computing its Jacobian-vector products, or an "adjoint" function for computing its vector-Jacobian products.

@differentiable(tangent: tangentFoo, adjoint: adjointFoo)
func foo(_ x: Float) -> Float {
    return Float(Int(x)) // obviously non-differentiable
}

func tangentFoo(_ x: (Float, Float), originalResult: Float) -> Float {
    // Insert custom code to compute the directional derivative
}

func adjointFoo(_ x: Float, originalResult: Float, adjoint: Float) -> Float {
    // Insert custom code to compute the gradient
}

Composable Differential Operators

With fully customizable data structures and derivatives, everything should feel native in the language. In addition, differential operators are functional and composable, and differentiability is naturally integrated in the type system. All differential operators are defined in Swift, and developers can create their own differential operators by composing existing ones. For example, the user can use the "forward-on-reverse" approach to compute Hessian-vector products, where the hvp(at:in:) operator is defined as a native Swift function. The @autodiff(order: 2) attribute in the closure type signature marks the closure argument as being differentiable up to at least the 2nd order, so that the caller of hvp(at:in:) will implicitly differentiate the actual closure argument as needed.

func hvp<T: Differentiable, R: FloatingPoint>(
    at x: T, in f: @autodiff(order: 2) (T) -> R
) -> @autodiff(linear) (T) -> T {
    return differential(at: x, in: gradient(of: f))
}

Static Analysis and Diagnostics

By building first-class AD into the programming language, we can provide better diagnostics about differentiability and numeric stability than any dynamic language, all at compile time.

test.swift:58:10: error: function is not differentiable
  return #gradient(funcToDiff)(x)
         ^         ~~~~~~~~~~

test.swift:54:10: note: expression is not differentiable
  return middle2(x)
         ^

test.swift:50:10: note: when differentiating this function call
  return middle(x)
         ^

test.swift:46:10: note: when differentiating this function call
  return nested(y)
         ^

Flexible Functional-Style Differentiation

In common AD libraries, there are two differentiation styles: functional and imperative.

| | Syntax | Meaning |
| --- | --- | --- |
| Functional | `let 𝝯f = gradient(of: f)` then `𝝯f(x)` | Differentiating a function |
| Imperative | `let y = f(x)` then `gradient(of: y, wrt: x)` | Differentiating code traced through data flow |

Functional-style AD is transforming one function to another, producing a function that takes original arguments and returns the partial derivatives evaluated at each argument. Imperative-style AD, on the other hand, is a value-value dependency analysis. Although we use both notations in mathematics, imperative AD comes at the cost of semantic inconsistency with the host language, for example:

let y = f(x)
x = 3
gradient(of: y, wrt: x) // undefined

Semantically, y is a value, but x is both a value and a reference to a memory location -- it is unclear what exactly we are differentiating with respect to. Though making y and x have reference types could make this particular example work out semantically, it would be fundamentally inconsistent with Swift's core design where mathematical objects have value types, and would also make scalar types like Float incompatible with automatic differentiation.

We believe Swift's AD can achieve the same level of expressivity as imperative AD while preserving functional properties, and use language integration to push developers' productivity to the next level.

Part 1: Differentiable Types

Swift is a general-purpose programming language. Therefore, not every function is mathematically differentiable, and not every type represents a real vector space to begin with. To make our system mathematically sound, we refine the Swift standard library to form a basis for automatic differentiation.

The starting point of this refinement is the fundamental numeric protocols. In this section, we talk about how we improve the Numeric protocol to support the addition of vector types and protocols. Then, we introduce a protocol to represent vector spaces, since that is a prerequisite for doing calculus. Finally, we design a protocol specific to differentiation.

Revising the Numeric protocol

The Numeric protocol today refines ExpressibleByIntegerLiteral. This makes sense for scalars, but is not compatible with vector data structures because type-checking would fail on the scalar multiplication operator.

On the Swift forum, we have discussed the fundamental blocker for vector types to conform to the existing Numeric protocol. The consensus was to introduce a weakening of the Numeric protocol that represents the abstraction shared between scalars and vectors: a rng (ring without unity); we treat vector spaces as rngs by endowing them with * as element-wise multiplication. The protocol will be called Arithmetic.

public protocol Arithmetic: Equatable {
    static var zero: Self { get }
    prefix static func + (x: Self) -> Self
    static func + (lhs: Self, rhs: Self) -> Self
    static func += (lhs: inout Self, rhs: Self)
    static func - (lhs: Self, rhs: Self) -> Self
    static func -= (lhs: inout Self, rhs: Self)
    static func * (lhs: Self, rhs: Self) -> Self
    static func *= (lhs: inout Self, rhs: Self)
}

The existing Numeric will be changed to refine (inherit from) Arithmetic, keeping all of its existing behavior.

public protocol Numeric: Arithmetic, ExpressibleByIntegerLiteral {
    associatedtype Magnitude: Comparable, Numeric
    init?<T>(exactly source: T) where T: BinaryInteger
    var magnitude: Magnitude { get }
}

The VectorNumeric protocol

After we introduce the Arithmetic protocol, which makes the standard library suitable for vector APIs and beyond, we can define a protocol that generalizes vectors. Mathematically, a vector space is a ring without unity if we endow it with * as element-wise multiplication. We represent vector spaces through the VectorNumeric protocol as follows. Scalar is the type of the elements of this vector space -- the field which the vector space is over. Shape is the shape of this vector space, which is customizable. The initializer takes a value of the Scalar type and a Shape and returns a vector of the specified shape.

/// A type that represents an unranked vector space. Values of this type are
/// elements in this vector space and with a specific shape.
public protocol VectorNumeric: Arithmetic {
    /// The type of scalars in the vector space.
    associatedtype Scalar: Numeric

    /// The type whose values specify the shape of an object in the vector
    /// space.
    associatedtype Shape

    /// Create an object in the vector space with the specified shape by
    /// repeatedly filling the object with the specified value.
    ///
    /// - Parameters:
    ///   - repeatedValue: the value to repeat for the specified shape
    ///   - shape: the shape
    init(repeating repeatedValue: Scalar, shape: Shape)

    /// The shape of this vector.
    var shape: Shape { get }

    /// Returns the scalar product of the vector.
    static func * (scale: Scalar, value: Self) -> Self
}
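As an illustrative sketch (not part of the proposal), here is how a simple dense vector type could conform to these protocols. DenseVector and its members are hypothetical names, and Shape is modeled as a plain element count.

struct DenseVector<Element: FloatingPoint>: VectorNumeric {
    typealias Scalar = Element
    /// In this unranked sketch, the shape is just the element count.
    typealias Shape = Int

    var scalars: [Element]
    var shape: Int { return scalars.count }

    init(repeating repeatedValue: Element, shape: Int) {
        scalars = Array(repeating: repeatedValue, count: shape)
    }

    private init(_ scalars: [Element]) { self.scalars = scalars }

    // `Equatable` and `Arithmetic` requirements, implemented element-wise.
    static func == (lhs: DenseVector, rhs: DenseVector) -> Bool {
        return lhs.scalars == rhs.scalars
    }
    static var zero: DenseVector { return DenseVector([]) }
    static prefix func + (x: DenseVector) -> DenseVector { return x }
    static func + (lhs: DenseVector, rhs: DenseVector) -> DenseVector {
        return DenseVector(zip(lhs.scalars, rhs.scalars).map { $0.0 + $0.1 })
    }
    static func += (lhs: inout DenseVector, rhs: DenseVector) { lhs = lhs + rhs }
    static func - (lhs: DenseVector, rhs: DenseVector) -> DenseVector {
        return DenseVector(zip(lhs.scalars, rhs.scalars).map { $0.0 - $0.1 })
    }
    static func -= (lhs: inout DenseVector, rhs: DenseVector) { lhs = lhs - rhs }
    static func * (lhs: DenseVector, rhs: DenseVector) -> DenseVector {
        return DenseVector(zip(lhs.scalars, rhs.scalars).map { $0.0 * $0.1 })
    }
    static func *= (lhs: inout DenseVector, rhs: DenseVector) { lhs = lhs * rhs }

    // `VectorNumeric` requirement: scalar multiplication.
    static func * (scale: Element, value: DenseVector) -> DenseVector {
        return DenseVector(value.scalars.map { scale * $0 })
    }
}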

The Differentiable protocol

Now we define a protocol that "activates" a type's differentiability. At first glance, the conforming type must also be a VectorNumeric type, so we make this protocol refine VectorNumeric. Since differentiation only makes sense on real vectors, we add a constraint on the associated type Scalar such that it conforms to FloatingPoint.

public protocol Differentiable: VectorNumeric where Scalar: FloatingPoint {
}

You may notice that Differentiable looks like a dummy protocol because it doesn't have any requirements other than the ones inherited from VectorNumeric. Although under the current assumptions we could completely omit the Differentiable protocol and just have the AD system recognize VectorNumeric-conforming types whose scalar elements conform to FloatingPoint, we actually have theoretical and practical reasons to revise the Differentiable protocol later on. So we keep Differentiable as a separate protocol for now and build towards the final design at the end of this document.
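Continuing the hypothetical DenseVector sketch above, opting the type into differentiation is then a one-line conformance:

// `DenseVector` already satisfies `VectorNumeric` with a `FloatingPoint`
// scalar, so the conformance body is empty.
extension DenseVector: Differentiable {}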

Part 2: Primitive Registration

We are aiming for an open and extensible system, so we made the compiler agnostic of the actual operations - it does not have special knowledge of numeric standard library functions or distinguish between primitive operators and other functions. We recursively determine a function's differentiability based on:

  • whether a function has a primitive differentiability as specified in the standard or user-defined library, and

  • whether a function's definition (type signature and body) is differentiable by applying the chain rule of differentiation.

As such, we provide a syntactic way of specifying the differentiability of a function, using either the function's linearity properties, or separate functions providing the "tangent code", which specifies how to differentiate the function in forward mode, and/or the "adjoint code", which specifies how to differentiate the function in reverse mode.

The @differentiable attribute

We introduce a declaration attribute @differentiable to Swift's syntax. The full grammar of @differentiable is defined as follows:

differentiation-mode = 'forward' | 'reverse' | 'bidirectional'
differentiability = differentiation-mode  | 'linear' | 'constant'
differentiability-wrt-self = 'wrt' ':' 'self'
differentiation-order = 'once'
differentiation-tangent-specifier = 'tangent' ':' declaration-name
differentiation-adjoint-specifier = 'adjoint' ':' declaration-name
differentiable-attribute = '@differentiable'
    '(' differentiability
    [ ',' differentiability-wrt-self ]
    [ ',' differentiation-order ]
    [ ',' differentiation-tangent-specifier ]
    [ ',' differentiation-adjoint-specifier ]
    ')'
declaration-attribute = differentiable-attribute

First Glance

The multiplication operator * is differentiable with respect to its two arguments. Here's how we make it differentiable in the standard library.

extension FloatingPoint {
    @differentiable(bidirectional, tangent: tangentMul, adjoint: adjointMul)
    static func * (x: Self, y: Self) -> Self { ... }
    
    internal static func tangentMul(
        x: (Self, Self), y: (Self, Self), originalResult: Self
    ) -> Self {
        return x.1 * y.0 + y.1 * x.0
    }
    
    internal static func adjointMul(
        x: Self, y: Self, originalResult: Self, seed: Self
    ) -> (Self, Self) {
        return (seed * y, seed * x)
    }
}
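For example (a hypothetical check, assuming the conformance above), the adjoint recovers the familiar product rule: for z = x * y, the partials are ∂z/∂x = y and ∂z/∂y = x, so the incoming seed is scaled by y and x respectively.

// Hypothetical: directly invoking the registered adjoint of `*`.
let (dx, dy) = Float.adjointMul(x: 3, y: 4, originalResult: 12, seed: 1)
// dx == 4 (∂z/∂x = y), dy == 3 (∂z/∂y = x)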

In TensorFlow, the convolution operator is only differentiable with respect to a subset of arguments. Here's how we make it differentiable so that it can be used for back-propagation.

@differentiable(reverse, adjoint: adjointConv2D)
public func conv2d(_ input: Tensor<Float>, filter: Tensor<Float>,
                   strides: @nondiff (Int32, Int32, Int32, Int32),
                   padding: @nondiff Padding) -> Tensor<Float> {
    ...
}

func adjointConv2D(_ input: Tensor<Float>, filter: Tensor<Float>,
                   strides: (Int32, Int32, Int32, Int32),
                   padding: Padding,
                   originalResult: Tensor<Float>,
                   seed: Tensor<Float>) -> (Tensor<Float>, Tensor<Float>) {
    ...
}

Differentiation Parameters

Differentiation parameters are marked inline at each argument position in the function declaration. By default, every argument of the function is to be differentiated with respect to, unless marked as @nondiff.

When a differentiable attribute is applied to a method, or to the getter of a computed property in a type, the implicit self argument often needs to be differentiated with respect to. In order to make a function primitively differentiable with respect to self, one can add wrt: self to the @differentiable attribute.

Differentiability

There are five options for differentiability:

  1. Forward: @differentiable(forward, tangent: ...)

    This option says that the function is forward-mode differentiable. Forward-mode differentiation requires the "tangent code" (or tangent function) of this function, so that Swift knows how to compute the function's directional derivatives in the direction specified by the tangent vector that has been forward-propagated to the tangent function.

    The compiler will expect the name of the tangent function, with an expected type signature, to be specified later in the tangent: parameter in the attribute.

  2. Reverse: @differentiable(reverse, adjoint: ...)

    This option says that the function is reverse-mode differentiable. Reverse-mode differentiation requires the "adjoint code" (or adjoint function) of this function, so that Swift knows how to compute the function's vector-Jacobian products, where the vector, also called "adjoint vector", has been back-propagated to the adjoint function.

    The compiler will expect the identifier of the adjoint function, with an expected type signature, to be specified later in the adjoint: parameter in the attribute.

  3. Bidirectional: @differentiable(bidirectional, tangent: ..., adjoint: ...)

    This option says that the function is both forward-mode differentiable and reverse-mode differentiable. The compiler will expect both the tangent function and the adjoint function to be specified later in this attribute.

  4. Constant: @differentiable(constant)

    By definition, constant functions always have zero derivatives and are differentiable at any arbitrary order. So differentiating this function will result in a zero vector (or vectors, when the function has multiple differentiation arguments) with the same shape as each differentiation argument.

  5. Linear: @differentiable(linear)

    By definition, a linear map is always a unary function and its Jacobian is the matrix associated with this linear transformation itself. In other words, both its differential and its pullback are the function itself.

Associated Functions

As explained, differentiabilities have different functional requirements.

  1. forward differentiability

    When the differentiability is forward, the compiler expects a tangent: label in the attribute followed by the name (qualified or unqualified) of a tangent function that is to be associated with the original function. If the original function declaration has type (T0, ..., Tn) -> U, then the expected type of the tangent function is ((T0, T0), ..., (Tn, Tn), U) -> U. As we can see, every argument of the original function has become a "dual number" in the tangent function, represented as a tuple: the first element of such a tuple is the original argument, and the second element is the forward-propagated directional derivative, namely the "vector" in "Jacobian-vector product". The last argument to the tangent function is the original function's result. The result of the tangent function is the directional derivative. If any of the original arguments is marked as @nondiff, it will not become a dual number in the tangent function's argument list but will remain as the original argument itself. (A concrete sketch of these typing rules appears after this list.)

  2. reverse differentiability

    When the differentiability is reverse, the compiler expects an adjoint: label in the attribute followed by the name (qualified or unqualified) of an adjoint function that is to be associated with the original function. If the original function declaration has type (T0, ..., Tn) -> U, then the expected type of the adjoint function is (T0, ..., Tn, U, U) -> (T0, ..., Tn). As we can see, the first n arguments to the adjoint function, T0, ..., Tn, are the original arguments. The next argument is the original function's result. The last argument is the back-propagated partial derivatives at the original function's result, namely the "vector" in "vector-Jacobian product". The result of the adjoint function contains partial derivatives at each argument, if the argument has not been marked as @nondiff.

  3. bidirectional differentiability

    When the differentiability is bidirectional, the compiler expects both tangent: and adjoint: arguments to be specified.

  4. Other differentiabilities

    Other differentiabilities such as constant and linear do not require any associated functions. However, users can choose to specify tangent/adjoint function(s) for their own purposes such as custom optimizations.
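As a concrete sketch of these typing rules (hypothetical names, not from the standard library), consider a primitive of type (Float, Float) -> Float: its tangent has type ((Float, Float), (Float, Float), Float) -> Float and its adjoint has type (Float, Float, Float, Float) -> (Float, Float).

@differentiable(bidirectional, tangent: tangentHypot, adjoint: adjointHypot)
func hypotenuse(_ a: Float, _ b: Float) -> Float {
    return (a * a + b * b).squareRoot()
}

// Forward rule: each argument becomes a dual-number pair and the original
// result is appended; the return value is the directional derivative.
func tangentHypot(_ a: (Float, Float), _ b: (Float, Float),
                  originalResult: Float) -> Float {
    return (a.0 * a.1 + b.0 * b.1) / originalResult
}

// Reverse rule: the original arguments, then the original result, then the
// back-propagated seed; the return value has one partial derivative per argument.
func adjointHypot(_ a: Float, _ b: Float,
                  originalResult: Float, seed: Float) -> (Float, Float) {
    return (seed * a / originalResult, seed * b / originalResult)
}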

Differentiation Order

When a function is marked as @differentiable, Swift assumes it to be higher-order differentiable, i.e. differentiable at all orders, unless once is specified in the attribute, in which case Swift will not guarantee any higher-order differentiability. If their associated functions (tangent or adjoint) are serialized, then their derivatives may be differentiable via a separate code transformation.

Differentiabilities linear and constant guarantee smoothness, and they do not have to be serialized whatsoever because their derivatives do not depend on any code transformation.

forward and reverse transitively require the tangent function and the adjoint function, respectively, to be differentiable with respect to the original arguments. When compiling such declarations, Swift will verify the tangent/adjoint function is also differentiable by static analysis. If they are not differentiable, the compiler will error out, prompting the user to insert once in the @differentiable attribute.

Example 1. Linear functions are differentiable at any order.

public extension Tensor {
    @differentiable(linear, wrt: self)
    func transposed() -> Self {
        ...
    }
}

Example 2. A forward-mode primitive-differentiable function whose tangent is written in closed form using differentiable operations can itself be differentiated at higher orders.

// Okay, the tangent function is differentiable.
@differentiable(forward, tangent: tangentFoo)
func foo(_ x: Float) -> Vector<Float> {
    return Vector(repeating: sin(x), shape: [2, 3])
}

func tangentFoo(_ dualX: (Float, Float), 
                originalResult: Vector<Float>) -> Vector<Float> {
    let (x, dx) = dualX
    // Differentiable because `Vector.init(repeating:shape:)`, `*`, `sin` and 
    // `cos` are all declared `@differentiable` and are differentiable.
    return Vector(repeating: cos(x) * dx, shape: [2, 3])
}

Example 3. A reverse-mode primitive-differentiable function is not differentiable at a higher order because its adjoint is not differentiable.

@differentiable(reverse, adjoint: adjointBar)
func bar(_ x: Vector<Float>) -> Float {
    return sin(x)[0]
}

var someGlobalVariable: Vector<Float> = [1, 1, 1]

func adjointBar(_ x: Vector<Float>, y: Float, adjoint: Float) -> Vector<Float> {
    var ∂y∂x = Vector<Float>(repeating: 0, shape: x.shape)
    someGlobalVariable[0] = cos(x[0]) * adjoint
    ∂y∂x[0] = someGlobalVariable[0]
    return ∂y∂x
}
test.swift:3:35: error: function `bar` does not support higher-order differentiation 
because its adjoint is not differentiable; would you like to add `once`?
  @differentiable(reverse, adjoint: adjointBar)
                                    ^~~~~~~~~~
test.swift:8:6: note: `adjointBar` is defined here
  func adjointBar(_ x: Vector<Float>, y: Float, adjoint: Float) -> Vector<Float> {
       ^~~~~~~~~~
test.swift:10:9: note: operation is not differentiable
      ∂y∂x[0] = cos(x[0]) * adjoint
          ^~~~~~~~~~~~~~~~~~~~~~~~~

Part 3: Basic Differentiation

Applying the chain rule of differentiation gives us vector-Jacobian products or Jacobian-vector products, each computed by a function. Now that we have defined primitive differentiable functions, Swift can recursively differentiate any function whose body is available to the compiler.

Start Simple: Gradient and Derivatives

We start by introducing the syntax of two raw differential operators:

  • #gradient(f): Produces the gradient of f, where f: ℝⁿ → ℝ.
  • #derivatives(f): Produces derivatives of f, where f: ℝ → ℝᵐ.

The syntax of these operators looks like macros, but we will generalize them and make them look much nicer in the second half of this document.

Example:

func f(_ x: Vector<Float>, _ w: Vector<Float>) -> Float {
   return x • w
}

#gradient(f) // (Vector<Float>, Vector<Float>) -> (Vector<Float>, Vector<Float>)

func g(_ x: Float) -> (Vector<Float>, Vector<Float>) {
   ...
}

#derivatives(g) // (Float) -> (Vector<Float>, Vector<Float>)

The grammar of these raw differential operators is defined as follows:

derivatives-operator = '#derivatives'
gradient-operator = '#gradient'
raw-differential-operator = derivatives-operator | gradient-operator
autodiff-argument-index-specifier = '.' integer-literal
autodiff-expression =
    raw-differential-operator '(' expression [ ',' 'wrt' ':' autodiff-argument-index-specifier ] ')'
expression = autodiff-expression
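For instance, reusing the dot-product function f from the example above, the wrt: argument-index specifier from this grammar restricts differentiation to a chosen parameter (a hypothetical sketch):

// Hypothetical: differentiate `f` with respect to its first argument only.
#gradient(f, wrt: .0) // (Vector<Float>, Vector<Float>) -> Vector<Float>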

Embrace Generality: Vector-Jacobian Products and Jacobian-Vector Products

Gradient and derivatives are two special cases of differentiation where the output or the input is a scalar, respectively. When neither is a scalar, vector-Jacobian products and Jacobian-vector products are computed with a vector. These cases are not obvious, but they are required for modular machine learning APIs where each neural network layer defines a back-propagation method that takes a partial derivative vector back-propagated from the previous layer. As such, we add two extra differential operators which will be useful for computing these products.

  • #differential(f): Produces a function that takes the original arguments and returns the differential of f.
  • #pullback(f): Produces a function that takes the original arguments and returns the pullback of f.
jvp-operator = '#differential'
vjp-operator = '#pullback'
raw-differential-operator = jvp-operator | vjp-operator

Example:

// A random generic function that is differentiable.
func f<T0, T1, U>(_ x: T0, _ y: T1) -> U
    where T0: Differentiable, T1: Differentiable, U: Differentiable {
    return someDifferentiableFunction(20, x + y)
}

#differential(f) // (T0, T1) -> (T0, T1) -> (U, U)
// Description:
//   (T0, T1)       ->  (T0, T1)   ->  (U,        U)
//    ^~~~~~              ^~~~~~        ^         ^
//  original args         vector      result   Jacobian-vector product

#pullback(f) // (T0, T1) -> (U, (U) -> (T0, T1))
// Description:
//   (T0, T1)       ->  (U,     (U)      ->  (T0, T1))
//    ^~~~~~             ^       ^           ^~~~~~~~
//  original args     result   vector   vector-Jacobian products

How It Works

The compiler type-checks a #gradient(f), as well as other differential operators, by searching for the closest match given the contextual type. f is expected to have a definition in order to be differentiable, and thus cannot be a closure whose body is opaque to the compiler. If it is, Swift reports an error.

Later in the compilation pipeline, the compiler recursively transforms the code of f to its gradient function ∇f (or other functions in other modes of differentiation), and replaces #gradient(f) with ∇f. Everything composes together naturally. Now, differentiation works.
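To make the transformation concrete, here is a sketch (not actual compiler output) of the kind of code reverse-mode differentiation conceptually produces for a simple function:

import Foundation

func h(_ x: Double) -> Double {
    return sin(x) * x
}

// Conceptually, #gradient(h) is replaced by a function equivalent to this:
func gradientOfH(_ x: Double) -> Double {
    // Forward pass: compute intermediate values needed by the backward pass.
    let s = sin(x)
    // Backward pass: apply the chain rule from the output back to the input.
    let seed = 1.0
    let ds = seed * x             // partial of (s * x) with respect to s
    let dxThroughMul = seed * s   // partial of (s * x) with respect to x
    let dxThroughSin = ds * cos(x)
    return dxThroughMul + dxThroughSin // equals sin(x) + x * cos(x)
}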

AD in Action

Automatic Differentiation based on raw differential operators is already available and being incubated temporarily on the "tensorflow" branch of Swift. Swift for TensorFlow development toolchains and tutorials are available for trying out this feature.

Part 4: Generalized Differentiability

Automatic differentiation relies on the definition (body) of a function to be able to differentiate it. Differential operators like #gradient trigger the differentiation of a function, and the differentiability of the function is determined as differentiation goes. This works perfectly so far, but has a number of problems.

Issues with Definition-Based Differentiability

Syntactic Weirdness

Raw differential operators adopt the pound-keyword syntax, which has been previously used for accessing compiler builtins, e.g. #file and #dsohandle, referring to IDE-specific objects, e.g. #colorLiteral and #imageLiteral, and interoperating with "stringly-typed" Objective-C key paths, e.g. #keyPath(...). The pound-keyword syntax does not have native parsing support for syntactic features like trailing closures, so it is hard to make the closure code short under differential operators like #gradient.

Example:

// Ideal
let dydx = gradient { x in
    sin(x) + cos(x)
}

// Reality
let dydx = #gradient({ x in
    sin(x) + cos(x)
})

A Higher-Order Function, But Not Quite

When we introduced AD in Swift earlier in this document, we defined the differential operator as a higher-order function. Type checking and type inference were just expected to work like any other functions.

However, since the compiler needs to reject functions that are not differentiable and differentiability is not part of the type system, even if we were to redefine #gradient as a higher-order function named gradient(of:), the compiler would still have to maintain dedicated knowledge about this function in order to reject invalid arguments.

Cross-Module Differentiability, Without Serialization

As of now, the differentiability of a function is determined solely through two tests:

  • Is the function a primitive-differentiable function (@differentiable)?
  • Can the function's body be differentiated in the differentiation mode associated with the differential operator applied?

This simple system works perfectly when differentiating concrete functions defined in a local module, but does not allow differentiation of opaque function values or methods required by protocols. While being free of serialization is not a strict requirement for numerical computing libraries, not supporting differentiation on protocol requirements fundamentally obstructs composable high-level APIs that rely on AD, such as machine learning model APIs.

Opaque Closures are Non-Differentiable

There is no way to define a higher-order function that differentiates its argument using #gradient. Here's an example:

func foo(_ f: (Float) -> Float) -> Float {
    return #gradient(f)(0)
}
test.swift:2:22: error: cannot differentiate an opaque closure
    return #gradient(f)(0)
           ~~~~~~~~~~^~
test.swift:1:12: note: value defined here
func foo(_ f: (Float) -> Float) -> Float {
           ^~~~~~~~~~~~~~~~~~~

Closure arguments and dynamic dispatch are non-differentiable through direct source code transformation. The compiler does not statically know where f is coming from, nor can it delegate the task of differentiation of argument f to each callsite of foo because it cannot be expressed in the type system.

Solution: Differentiability in Function Types

As we can see, the core of the problem with definition-based differentiability is the opacity of function values. The restriction that the full definition of a function must be visible to the differential operator makes it impossible to define protocol-oriented differentiable code, and is the primary hindrance to modular, composable differentiation APIs.

It turns out that this is not a new problem - we can learn from how we deal with calling conventions in Swift. Functions with different calling conventions have different type signatures, e.g. @convention(thick) and @convention(thin), and functions convert back and forth through conversion thunks implicitly.

// A "thin" function that captures no variables.
// Its representation is `@convention(thin)` by default.
func f(x: Int) -> Int {
    return x
}

var globalVar = 30

// A "thick" function that captures the value of `globalVar`.
// Its representation is `@convention(thick)` by default.
let g = { (x: Int) in globalVar + x }

// A higher-order function.
// The closure argument `h`'s representation is `@convention(thick)`, because it should
// be able to take closures that capture variables.
func takeFunc(_ h: (Int) -> Int) { ... }

takeFunc(f) // Implicitly converted function `f` to a `convention(thick)` closure by
            // creating a conversion thunk.
takeFunc(g) // `g` is thick already. No conversion needed.

Sometimes, different conventions have different binary representations for storing captured variables and such, just like the example with f and g above. In AD, the only difference between a non-differentiable function and a differentiated function (say, in reverse mode) is whether the function carries a few other function pointers that represent the function's adjoint code, so we can model differentiable functions using a "thicker" function type, which bundles the original function representation along with pointers to the original function's Jacobian-vector product functions and/or vector-Jacobian product functions. When a normal function with a visible body gets passed as an @autodiff function, the function will be differentiated.
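Conceptually (a sketch of the layout idea, not the actual ABI), a reverse-mode @autodiff function value can be thought of as the original entry point bundled with a function that produces its pullback:

// Hypothetical illustration of the "thicker" representation described above.
struct ConceptualReverseAutodiffFunction<T, U> {
    // The original function.
    let original: (T) -> U
    // Given an input, returns the original result together with a pullback
    // that maps a vector in the output space back to the input space.
    let primalAndPullback: (T) -> (U, (U) -> T)
}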

// `f` is a normal function that has type `(Float) -> Float`.
func f(x: Float) -> Float {
   return sin(x)
}

// `f` gets implicitly converted (or more accurately, differentiated).
let g = f as @autodiff (Float) -> Float

func takesFunc(_ someFunc: @autodiff (Float) -> Float) {
    #derivatives(someFunc)
    ...
}

// At the callsite of `takesFunc(_:)`, `f` gets implicitly differentiated to become
// `@autodiff (Float) -> Float`.
takesFunc(f)

If a normal function does not have a visible body, then it cannot be passed as an @autodiff function. Swift will show an error at compile-time.

var normalFuncWithOpaqueBody: (Float) -> Float = ...

takesFunc(normalFuncWithOpaqueBody)
test.swift:19:11: error: function is not differentiable, but the contextual type is 
'@autodiff (Float) -> Float'
  takesFunc(normalFuncWithOpaqueBody)
            ^~~~~~~~~~~~~~~~~~~~~~~~

test.swift:17:4: note: value defined here
  var normalFuncWithOpaqueBody: (Float) -> Float = ...
      ^~~~~~~~~~~~~~~~~~~~~~~~

At first glance, this could even be an addition to the existing @convention attribute, as something like @convention(autodiff); however, differentiability does not align semantically with @convention. First, when a function becomes its differentiable (or differentiated) form, its original calling convention is not changed. Second, functions with any convention are technically differentiable, including thin, thick, method, etc. Third, differentiability is not the only information that needs to be encoded -- there's also the order of differentiation. Therefore, we need a separate dimension of "thickness" in the function type: differentiability.

We define a new formalization of differentiability in Swift's type system, including an @autodiff function type attribute, an extension to functions' layout, and new syntax for selecting differentiable arguments.

The @autodiff Function Type Attribute

The @autodiff attribute on a function type specifies the function's differentiability and differentiation order, just like @differentiable on function declarations. The biggest differences are

  • @differentiable contains associated functions (tangent/adjoint) statically, but @autodiff functions carry those extra function pointers in their binary representation as a runtime property. Any user of this function will be able to differentiate it, with differentiability guaranteed formally by the type system. With this addition to the type system, serialization/inlinability is no longer necessary because functions can be passed around without losing differentiability.

  • Differentiation order is no longer once vs. infinite. Instead, @autodiff functions can specify a maximum order at which this function can be differentiated, unless the function is linear or constant. This is because function-representation-based differentiability requires functions to be differentiated ahead of becoming a value and being passed around.

The grammar for @autodiff is defined as follows:

differentiation-order = 'order' ':' integer-literal
differentiability = 'forward' | 'reverse' | 'linear' | 'constant' | 'bidirectional'
autodiff-attribute = '@autodiff' [ '(' differentiability [ ',' differentiation-order ] ')' ]
                   | '@autodiff' '(' differentiation-order ')'

When a differentiability is specified on a function type, the function's differentiation behavior follows what is defined for the @differentiable declaration attribute. If no differentiability is specified, the function is both forward-mode and reverse-mode differentiable (same as bidirectional).
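For instance, all of the following function types, drawn from examples later in this document, are well-formed under this grammar:

@autodiff (Float) -> Float                      // bidirectional, any order
@autodiff(reverse) (Float) -> Float             // reverse-mode differentiable
@autodiff(linear) (Float) -> Float              // linear: its own tangent and adjoint
@autodiff(reverse, order: 3) (Float) -> Float   // reverse-mode, up to the 3rd order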

Creating @autodiff Functions

It becomes increasingly clear that first-order differentiation will not, and should not, require serialization, and only higher-order differentiation should due to code size. In order to make the system consistent, we make each @differentiable function declaration result in an @autodiff function.

Since we want to support differentiating opaque functions, we must support creating one. The fact is, the user does not even need to know about @autodiff or intentionally create differentiable functions if they are working with functions in the current module. Whenever a local function declaration gets used where the contextual type has an @autodiff attribute on it, Swift differentiates it. If differentiation fails, Swift reports an error at compile-time.

For public APIs, we relax the constraint on @differentiable so that it can be applied to any function declaration without specifying a tangent or adjoint, even when the differentiability is forward/reverse. In this case, Swift tries to differentiate the function and export the derivatives as part of the public API: if the function gets differentiated, its default type signature has the @autodiff attribute on it; otherwise, Swift reports an error to the user showing what is non-differentiable.

Higher-Order Differentiation of Opaque Closures

In order for modular libraries to support opaque higher-order differentiation, the differentiation order must be specified in the closure type signature, so that the closure ABI is guaranteed to contain the higher-order derivative.

@autodiff(reverse, order: 2) (T) -> U

For example, function g takes a differentiable function that is differentiable up to at least the 3rd order, then differentiates it 3 times in the body.

// In a separate module:
func g(_ h: @autodiff(reverse, order: 3) (Float) -> Float) -> Float {
    return #gradient(h)(1) +
           #gradient(#gradient(h))(1) +
           #gradient(#gradient(#gradient(h)))(1)
}

We also extend the @differentiable attribute so that it can specify that a primitive-differentiable function should be differentiated to a specific order ahead of time. For example, when Swift compiles function f below, this function will have been differentiated 6 times, and gradient functions will be preserved in f's ABI so that its derivatives can be called from anywhere (any other Swift module, or even C). f's default type signature is @autodiff(reverse, order: 6) (Float) -> Float.

@differentiable(reverse, order: 6)
public func f(_ x: Float) -> Float {
    return pow(x, 6)
}

Differentiable functions with a maximum differentiation order can be implicitly "down-ordered", that is, differentiable functions with a higher maximum differentiation order can be implicitly converted to a function with a lower maximum differentiation order. For example, we can directly pass f as an argument to g. Since f(x) = x⁶, g(f) computes f'(1) + f''(1) + f'''(1) = 6 + 30 + 120 = 156.

g(f) // 156

Conversion Between Differentiabilities

Because of their mathematical properties, differentiabilities can be converted to one another statically without runtime overhead. For example, a constant function is also a linear function when it's unary; a linear function is a bidirectional-differentiable function whose tangent and adjoint are both themselves; any differentiability can be completely dropped from a function type, forming a "normal" function. This allows us to define generic algorithms using differentiation, without specializing them on function types of each differentiability.

The following table shows whether each differentiability (as a row label) can be implicitly converted to another (as a column label).

| Convertible to: | None | Linear | Constant | Forward | Reverse | Bidirectional |
| --- | --- | --- | --- | --- | --- | --- |
| None | | | | | | |
| Linear | ✔ | | | ✔ | ✔ | ✔ |
| Constant | ✔ | ✔ (unary) | | ✔ | ✔ | ✔ |
| Forward | ✔ | | | | | |
| Reverse | ✔ | | | | | |
| Bidirectional | ✔ | | | ✔ | ✔ | |

What does differentiability conversion look like in real code? Just like @convention conversion, differentiability conversion is implicit and has little mental overhead to the user.

let linear: @autodiff(linear) (Float) -> Float = ...
let bidir: @autodiff (Float) -> Float = ...
let const: @autodiff(constant) (Float) -> Float = ...

func foo(_: @autodiff(reverse) (Float) -> Float) { ... }

foo(linear) // Okay! Implicitly converted to `@autodiff(reverse)`.
foo(bidir) // Okay! Implicitly converted to `@autodiff(reverse)`.
foo(const) // Okay! Implicitly converted to `@autodiff(reverse)`.
...

Part 5: True Differential Operators

Generalized Differentiability enabled us to define custom differential operators in a functional way. Now it's time to define the true differential operators.

Derivatives and Gradient

We start with functions that take a function and produce a function that computes derivatives or a gradient. Recall that we already have the built-in syntax #gradient and #derivatives for computing gradients and derivatives, but we are exploring more expressive APIs enabled by Generalized Differentiability, which lets us differentiate opaque function arguments.

Forward Differential Operators

We define two forward-mode differential operators for computing basic derivatives:

  • derivatives(of:) computes a derivatives function that takes a value and returns derivatives evaluated at the given value.
  • derivatives(at:in:) computes derivatives of a closure at a given value.
/// Computes derivatives of `body`.
func derivatives<T: FloatingPoint, R: Differentiable>(
    of body: @autodiff(forward) (T) throws -> R
) rethrows -> (T) -> R {
    return { x in #differential(body)(x)(1).1 } // seed = dx/dx = 1
}

/// Computes derivatives of `body` at scalar `x`.
func derivatives<T: FloatingPoint, R: Differentiable>(
    at x: T, in body: @autodiff(forward) (T) throws -> R
) rethrows -> R {
    return derivatives(of: body)(x)
}

Reverse Differential Operators

We also define two reverse-mode differential operators for computing basic gradients:

  • gradient(of:) computes a gradient function that takes a value and returns the gradient evaluated at the given value.
  • gradient(at:in:) computes the gradient of a closure evaluated at a given value.
/// Computes the gradient of `body`.
func gradient<T: Differentiable, R: FloatingPoint>(
    of body: @autodiff(reverse) (T) throws -> R
) rethrows -> (T) -> T {
    return { x in #pullback(body)(x).1(1) } // seed = dy/dy = 1
}

/// Computes the gradient of `body` at `x`.
func gradient<T: Differentiable, R: FloatingPoint>(
    at x: T, in body: @autodiff(reverse) (T) throws -> R
) rethrows -> T {
    return gradient(of: body)(x)
}

As we can see, since we are to differentiate a higher-order function's argument (thanks to Generalized Differentiability), we can define derivatives(of:) and gradient(of:) as Swift functions in terms of more general raw differential operators, #differential and #pullback, to replace #derivatives and #gradient!

These differential operators work seamlessly with closure captures, error-throwing functions, and arbitrary side-effecting code that does not contribute to the closure result. This looks quite like value-based automatic differentiation, while the math is actually fully functional. It achieves a similar level of expressivity as imperative-style automatic differentiation libraries: instead of writing gradient(...) at the bottom of a forward pass, one writes it on top and has a trailing closure close over the forward pass.

Example: Train a simple 2-layer perceptron. The snippet computes the gradient w.r.t. each parameter at each training step, prints a loss, and optimizes parameters.

struct Parameters: Differentiable, ParameterGroup {
    var w1 = Tensor<Float>(randomNormal: [784, 30])
    var b1 = Tensor<Float>(zeros: [30])
    var w2 = Tensor<Float>(randomNormal: [30, 10])
    var b2 = Tensor<Float>(zeros: [10])
}

var params = Parameters()
let minibatches = Dataset(...)
var optimizer = StochasticGradientDescent(learningRate: 0.1)
for (x, y) in minibatches {
    let grads = gradient(at: params) { params in
        let h1 = tanh(matmul(x, params.w1) + params.b1)
        let ŷ = sigmoid(matmul(h1, params.w2) + params.b2)
        let loss = (y - ŷ).squared().mean()
        print("Loss is \(loss)")
        return loss
    }
    optimizer.fit(&params, gradients: grads)
}

Preserving Original Result

Since the forward pass is written as a trailing closure passed to gradient(at:in:), the forward computation is just as customizable as in operator-overloading AD systems. Users can do whatever they want with intermediate values or the result in the primal computation.

That said, we would like to provide a way to have the differentiation API return the original result directly. Because of Generalized Differentiability, these APIs can be defined entirely as library functions using primitive differential operators.

/// Computes `body(x)` and derivatives of each scalar output of `body` at `x`.
func valueWithDerivatives<T: FloatingPoint, R: Differentiable>(
    at x: T, in body: @autodiff(forward) (T) throws -> R
) rethrows -> (value: R, derivatives: R) {
    return #differential(body)(x)(1)
}

/// Computes `body(x)` and the gradient of `body` at `x`.
func valueWithGradient<T: Differentiable, R: FloatingPoint>(
    at x: T, in body: @autodiff(reverse) (T) throws -> R
) rethrows -> (value: R, gradient: T) {
    let (y, pullback) = #pullback(body)(x)
    return (y, pullback(1))
}

Jacobian-Vector Products and Vector-Jacobian Products

Jacobian-vector products (forward-mode) and vector-Jacobian products (reverse-mode) are extremely useful differential operators for lots of tasks in numerical computing.

/// Computes Jacobian-vector products of `body` at `x`.
func jacobianVectorProducts<T: Differentiable, R: Differentiable>(
    at x: T, vector: T,
    in body: @autodiff(forward) (T) throws -> R
) rethrows -> R {
    return #differential(body)(x)(vector).1
}

/// Computes the vector-Jacobian products of `body` at `x`.
func vectorJacobianProducts<T: Differentiable, R: Differentiable>(
    at x: T, vector: R,
    in body: @autodiff(reverse) (T) throws -> R
) rethrows -> T {
    return #pullback(body)(x).1(vector)
}

Differentials and Pullbacks

In some cases, computational tasks rely on fully extensible differential operators as well as maximum efficiency, e.g. computing vector-Jacobian products as well as the original function's result. Luckily, the two operators we mentioned at the very beginning when we introduced Jacobians are the ones we need: differential and pullback. We already have their raw operators supported in the syntax, #differential and #pullback, but we can make them nicer by redefining them as Swift functions.

Function differential(at:in:) computes the differential of a closure at a certain point, and returns a linear map that takes a vector and returns Jacobian-vector products.

/// Computes the differential of `body` at `x`.
func differential<T: Differentiable, R: Differentiable>(
    at x: T, in body: @autodiff(forward) (T) throws -> R
) rethrows -> @autodiff(linear) (T) -> R {
    return { v in #differential(body)(x)(v).1 }
}

Function differentialWithResult(at:in:) computes the differential of a closure at a certain point, and returns a linear map that takes a vector and returns both the original function's result and Jacobian-vector products.

/// Computes the differential of `body` at `x` that also computes the value of
/// `body(x)`.
func differentialWithResult<T: Differentiable, R: Differentiable>(
    at x: T, in body: @autodiff(forward) (T) throws -> R
) rethrows -> @autodiff(linear) (T) -> (originalResult: R, derivatives: R) {
    return #differential(body)(x)
}

Function pullback(at:in:) computes the pullback of a closure at a certain point, and returns a linear map that takes a vector and returns vector-Jacobian products.

/// Computes the pullback of `body` at `x`.
func pullback<T: Differentiable, R: Differentiable>(
    at x: T, in body: @autodiff(reverse) (T) throws -> R
) rethrows -> @autodiff(linear) (R) -> T {
    return #pullback(body)(x).1
}

Function resultWithPullback(at:in:) computes the pullback of a closure at a certain point, and returns the original function's result and a linear map that takes a vector and returns vector-Jacobian products.

/// Computes the original value of `body(x)` and the pullback at `x`.
func resultWithPullback<T: Differentiable, R: Differentiable>(
    at x: T, in body: @autodiff(reverse) (T) throws -> R
) rethrows -> (originalResult: R, pullback: @autodiff(linear) (R) -> T) {
    return #pullback(body)(x)
}

It is remarkable that every differential operator can be defined in terms of other differential operators. #differential and #pullback become unnecessary because the functional forms are so much nicer, so we can teach the compiler to recognize the Swift functions differential(at:in:) and pullback(at:in:) as the builtin "canonical" differential operators, and remove all raw differential operators that start with a # from the language.
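
For example, a scalar gradient operator in the spirit of the gradient(at:in:) used earlier in this document reduces to pullback(at:in:) alone. The following is a sketch of that reduction under the simplified (pre-generalization) protocol, not necessarily the exact definition given elsewhere:

/// Sketch: the gradient of a scalar-valued `body` at `x` is the pullback of
/// `body` at `x` applied to the seed value 1.
func gradient<T: Differentiable, R: FloatingPoint>(
    at x: T, in body: @autodiff(reverse) (T) -> R
) -> T {
    return pullback(at: x, in: body)(1)
}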

Examples:

  1. Chain directional derivatives freely using differentials.

    let x = 0.5
    let df = differential(at: x) { x in
        sin(cos(x))
    }
    df(1) // df/dx
    df(derivatives(of: log)(t)) // df/dt, where x = log(t) upstream
    df(derivatives(at: t, in: log)) // df/dt, same as above
  2. Chain gradients freely using pullbacks.

    let x = 0.5
    let (y, df) = resultWithPullback(at: x) { x in
        cos(sin(x))
    }
    
    df(1) // dy/dx
    df(gradient(of: log)(t)) // dy/dt
    df(gradient(at: t, in: log)) // dy/dt

Hessian-Vector Products

Second-order optimization methods in machine learning make use of Hessians and Hessian-vector products, which can be hard to compute. Many AD libraries such as Autograd already support Hessians by supporting arbitrarily nested forward-mode/reverse-mode differentiation. Hessian-vector products can be computed efficiently by "forward-on-reverse", i.e. applying the forward-mode differential operator to the result of the reverse-mode differential operator.

Just like other differential operators, we can define the Hessian-vector products operator in a simple, functional way.

func hvp<T: Differentiable, R: FloatingPoint>(
    at x: T, in f: @autodiff(order: 2) (T) -> R
) -> @autodiff(linear) (T) -> T {
    return differential(at: x, in: gradient(of: f))
}
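
As a hypothetical usage example on a scalar function f(x) = x³, whose second derivative is 6x, the returned linear map multiplies its argument by 6x (assuming Double conforms to Differentiable):

let hv = hvp(at: 2.0) { x in x * x * x }
hv(1) // 12.0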

Nested differentiation without a careful implementation is prone to a bug known as perturbation confusion [1] [2]. Language-integrated AD in Swift will enforce tagging in compiler-generated code to guarantee the correctness of higher-order derivatives.
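
To make the pitfall concrete, here is the classic perturbation confusion example, written with the hypothetical derivatives(at:in:) operator used above. The inner derivative d/dy (x + y) is 1, so the outer derivative must be 1; an implementation that confuses the two nested perturbations returns 2 instead.

// d/dx [ x * d/dy (x + y) ] evaluated at x = 1, y = 1.
let result = derivatives(at: 1) { x in
    x * derivatives(at: 1) { y in x + y }
}
// Correct answer: 1. A perturbation-confused implementation returns 2.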

Standard Library or an AutomaticDifferentiation Module?

Earlier in this document, we discussed enhancements to standard library protocols and extensions to the standard library to model differentiable types. These protocols are general enough for standard library types such as floating point scalars (Float, Double, and Float80) and potentially SIMD vectors. However, in any general-purpose programming language, there is always a question of how much math the standard library should have.

We think basic differential operators like gradient(of:) and derivatives(of:) should be included in the standard library, because they are common operators that one would find in college calculus, and they will make AD feel more language-integrated along with standard library protocols VectorNumeric and Differentiable.

We do believe, however, that operators whose names contain terms like "Jacobian" and "differential" should live in a separate module, possibly called "AutomaticDifferentiation", that ships with the Swift language.

Part 6: Generalized Types for Differentiation

We introduced the Differentiable protocol, which requires a conforming type to represent a vector space in order to be differentiable. However, there are a few scenarios where such a protocol does not work well.

  1. Customizable weight type

    Orthogonal weight matrices have shown advantages in neural network training [1] [2]. When differentiating through these networks, gradients with respect to weights no longer stay orthogonal - instead, they are skew-symmetric matrices. While we can represent both orthogonal matrices and skew-symmetric matrices as values of a Matrix or Tensor type and programmatically ensure their orthogonality, some researchers have been seeking a way to represent this natively in the type system of a programming language and still have AD produce the correct derivative.

  2. Quantized training

    Quantization techniques store and calculate numbers in more compact formats, e.g. a fixed-point data type. Conceptually, a quantized tensor for a real-valued Tensor can be defined as the following struct:

    public struct Quantized<Dequantized: Quantizable, QuantizedScalar: FixedWidthInteger> {
        var data: Dequantized
        var range: Range<Dequantized.Scalar>
        var scale: QuantizedScalar
        var zeroPoint: Int
    }

    We can think of a scenario where the developer defines a neural network as a function whose parameters are of type Quantized<Tensor<Float>, Int8>. When training the parameters of this neural network, gradients need to flow at a significantly higher precision, but the system described so far cannot achieve that, because it assumes gradients have the same type as the original arguments.

  3. Generic optimizers

    Optimization problems in machine learning can be generalized as optimization on manifolds. Optimizers in most libraries assume that the parameter space and the gradient space are both vector spaces, and perform an implicit conversion from cotangent vectors to tangent vectors and another from tangent vectors to the original weight type when performing θ -= η * ∂L/∂θ (a naive version of this update is sketched after this list). While this works for most cases, it does not generalize to typed orthogonal matrices, because orthogonal matrices do not form a vector space, and applying a skew-symmetric gradient to an orthogonal matrix cannot be an implicit conversion.
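
For contrast, the sketch below spells out the naive vector-space update that such optimizers effectively perform. The generic function is hypothetical and assumes VectorNumeric provides addition and scalar multiplication as described earlier; note that it forces the gradient to have the same type as the parameters, which rules out constrained parameter types such as orthogonal matrices.

// Naive update: parameters and gradients must share one vector-space type.
func naiveUpdate<P: VectorNumeric>(
    _ θ: inout P, gradient: P, learningRate: P.Scalar
) where P.Scalar: FloatingPoint {
    θ = θ + -learningRate * gradient
}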

Revising the Differentiable Protocol

To address the concerns raised above, we generalize how differentiable types are modeled. Instead of requiring them to be vector spaces (VectorNumeric), we model them as differentiable manifolds. Reverse-mode differentiation of a function over a manifold produces gradient vectors in the manifold's cotangent spaces; forward-mode differentiation produces derivative vectors in its tangent spaces. Note that we cannot represent tangent/cotangent bundles separately from the tangent/cotangent spaces inside each bundle, because Swift does not have dependent types. By removing the restriction to VectorNumeric, Differentiable becomes fully extensible.

/// A type that mathematically represents a differentiable manifold whose
/// tangent spaces are finite-dimensional.
///
/// In automatic differentiation, differentiation will produce derivatives
/// whose elements are of the `TangentVector` or `CotangentVector` type.
public protocol Differentiable {
    /// The tangent vector space of this differentiable manifold.
    associatedtype TangentVector: VectorNumeric
        where TangentVector.Scalar: FloatingPoint

    /// The cotangent vector space of this differentiable manifold.
    associatedtype CotangentVector: VectorNumeric
        where CotangentVector.Scalar: FloatingPoint

    /// Returns `self` moved along the manifold in the direction of the given
    /// tangent vector. In Riemannian geometry, this usually corresponds to a
    /// retraction or the exponential map.
    func moved(toward direction: TangentVector) -> Self

    /// Converts a cotangent vector to its corresponding tangent vector.
    func tangentVector(from cotangent: CotangentVector) -> TangentVector
}

When the tangent vector of a differentiable manifold is equal to its cotangent vector, we can simply provide a default implementation of tangentVector(from:), which is just the identity function.

public extension Differentiable where TangentVector == CotangentVector { 
    func tangentVector(from cotangent: CotangentVector) -> TangentVector { 
        return cotangent 
    } 
} 

When a differentiable manifold is a vector space, its tangent space is usually itself. In these cases, we simply define moved(toward:) as vector addition.

public extension Differentiable 
    where Self: VectorNumeric, TangentVector == Self { 
    func moved(toward direction: TangentVector) -> Self { 
        return self + direction 
    } 
} 
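
As a concrete illustration, a standard floating-point type can conform with both associated types equal to itself and pick up both default implementations for free. This is a minimal sketch, assuming Float's VectorNumeric conformance described earlier:

// A scalar is its own tangent and cotangent vector, so the default
// `moved(toward:)` and `tangentVector(from:)` above apply.
extension Float: Differentiable {
    public typealias TangentVector = Float
    public typealias CotangentVector = Float
}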

Deriving Conformances to VectorNumeric and Differentiable

Numerical computing commonly deals with many parameters, each of which may be a vector or a matrix. In these cases, instead of manually specifying each input in a differential operator's argument list, users often want to differentiate through structures and obtain a structure of partial derivatives. It is therefore important for Swift to provide derived conformances for the core numerical-computing protocols: Differentiable and VectorNumeric.

Mathematically, these types are straightforward to represent: a struct or tuple in Swift corresponds to a product of sets; an enum in Swift corresponds to a sum (disjoint union) of sets.

struct Parameters: VectorNumeric, Differentiable {
    var a: Vector<Float>
    var b: Float
}

Struct Parameters is equivalent to a product of the sets Vector<Float> and Float, or a product of a real vector space ℝⁿ and a scalar field ℝ, namely ℝⁿ ⨯ ℝ, which is also a vector space. To make Parameters obtain the traits of a vector space, we extend the compiler to derive a conformance to VectorNumeric, similar to how Codable and Hashable conformances are derived. When a conformance clause is given in the current file and all stored properties conform to VectorNumeric with the same Scalar, the compiler synthesizes the AST to make the type conform, with all protocol requirements implemented property-wise.

After deriving conformances to VectorNumeric:

struct Parameters: VectorNumeric {
    var a: Vector<Float>
    var b: Float

    // derived:
    typealias Scalar = Float

    // derived:
    struct Shape {
        var a: Vector<Float>.Shape
        var b: Float.Shape
    }

    // derived:
    static func + (lhs: Parameters, rhs: Parameters) -> Parameters {
        return Parameters(a: lhs.a + rhs.a, b: lhs.b + rhs.b)
    }
    // ...
}

In order for Parameters to be differentiable, it must also conform to Differentiable. Deriving conformances to Differentiable follows the same rules.

struct MyShapes: Differentiable {
    var a: Circle // conforms to Differentiable
    var b: Square // conforms to Differentiable
}

After deriving conformances to Differentiable:

struct MyShapes: Differentiable {
    var a: Circle
    var b: Square

    // derived:
    struct TangentVector: VectorNumeric {
        var a: Circle.TangentVector
        var b: Square.TangentVector
    }
    // derived:
    struct CotangentVector: VectorNumeric {
        var a: Circle.CotangentVector
        var b: Square.CotangentVector
    }

    // derived:
    func moved(toward direction: TangentVector) -> MyShapes {
        return MyShapes(a: a.moved(toward: direction.a),
                        b: b.moved(toward: direction.b))
    }

    // derived:
    func tangentVector(from cotangent: CotangentVector) -> TangentVector {
        return TangentVector(a: a.tangentVector(from: cotangent.a),
                             b: b.tangentVector(from: cotangent.b))
    }
}

With derived conformances to these protocols, the user can now write arbitrarily nested structs of differentiable manifolds and make them differentiable with trivial effort, greatly simplifying development.
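
For example, a model made of nested parameter structs obtains all of its conformances automatically. In this sketch, Matrix<Float> and Vector<Float> are assumed to conform to both protocols:

// All conformances below are derived property-wise by the compiler.
struct DenseLayer: VectorNumeric, Differentiable {
    var weight: Matrix<Float>
    var bias: Vector<Float>
}

struct Classifier: VectorNumeric, Differentiable {
    var hidden: DenseLayer
    var output: DenseLayer
}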

Generalized Differential Operators

In the new Differentiable protocol, we added the TangentVector and CotangentVector associated types to represent the types of Jacobian-vector products and vector-Jacobian products, respectively. We make the following changes to the existing differential operators we introduced.

  • Differential operators that return T as a forward-differentiated derivative will return T.TangentVector instead.
  • Differential operators that return T as a reverse-differentiated derivative will return T.CotangentVector instead.
  • Vectors of type T used to compute Jacobian-vector products become T.TangentVector.
  • Vectors of type T used to compute vector-Jacobian products become T.CotangentVector.

Here we list a few updated differential operators.

Jacobian-Vector Products and Vector-Jacobian Products

With the generalized protocol, Jacobian-vector products (forward mode) now take and return tangent vectors, while vector-Jacobian products (reverse mode) take and return cotangent vectors.

/// Computes Jacobian-vector products of `body` at `x`.
func jacobianVectorProducts<T: Differentiable, R: Differentiable>(
    at x: T, vector: T.TangentVector,
    in body: @autodiff(forward) (T) throws -> R
) rethrows -> R.TangentVector {
    return #differential(body)(x).1(vector)
}

/// Computes the vector-Jacobian products of `body` at `x`.
func vectorJacobianProducts<T: Differentiable, R: Differentiable>(
    at x: T, vector: R.CotangentVector,
    in body: @autodiff(reverse) (T) throws -> R
) rethrows -> T.CotangentVector {
    return #pullback(body)(x).1(vector)
}

Differentials and Pullbacks

/// Computes the differential of `body` at `x`.
func differential<T: Differentiable, R: Differentiable>(
    at x: T, in body: @autodiff(forward) (T) throws -> R
) rethrows -> @autodiff(linear) (T.TangentVector) -> R.TangentVector {
    return #differential(body)(x).1
}

/// Computes the value of `body(x)` along with the differential of `body` at
/// `x`.
func differentialWithResult<T: Differentiable, R: Differentiable>(
    at x: T, in body: @autodiff(forward) (T) throws -> R
) rethrows -> (originalResult: R, differential: @autodiff(linear) (T.TangentVector) -> R.TangentVector) {
    return #differential(body)(x)
}

/// Computes the pullback of `body` at `x`.
func pullback<T: Differentiable, R: Differentiable>(
    at x: T, in body: @autodiff(reverse) (T) throws -> R
) rethrows -> @autodiff(linear) (R.CotangentVector) -> T.CotangentVector {
    return #pullback(body)(x).1
}

/// Computes the value of `body(x)` and the pullback at `x`.
func resultWithPullback<T: Differentiable, R: Differentiable>(
    at x: T, in body: @autodiff(reverse) (T) throws -> R
) rethrows -> (originalResult: R, pullback: @autodiff(linear) (R.CotangentVector) -> T.CotangentVector) {
    return #pullback(body)(x)
}
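
To see what changes in practice, the following hypothetical usage differentiates a scalar-valued loss with respect to a Parameters value; the model, loss, and data names are placeholders. Seeding the pullback with 1 yields a gradient of type Parameters.CotangentVector rather than Parameters:

let pb = pullback(at: parameters) { θ in
    loss(of: model(with: θ), on: trainingBatch) // placeholder model, loss, and data
}
let grads: Parameters.CotangentVector = pb(1)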

Back to the Problems

Recall that the motivation for introducing a general, future-proof Differentiable protocol is to be able to model the following use cases.

  1. Neural networks with orthogonal weights can now be differentiated. We can define a type called OrthogonalMatrix that conforms to Differentiable, and another type SkewSymmetricMatrix that conforms to both Differentiable and VectorNumeric.

    struct SkewSymmetricMatrix: Differentiable, VectorNumeric {
        typealias Scalar = Float
        ...
    }
    struct OrthogonalMatrix: Differentiable {
        ...
        typealias TangentVector = SkewSymmetricMatrix
        typealias CotangentVector = SkewSymmetricMatrix
    }

    When we differentiate a function (OrthogonalMatrix) -> Float using the reverse-mode differential operator, we'll get a function (OrthogonalMatrix) -> SkewSymmetricMatrix. Everything falls out naturally, without compromising type safety.

  2. Differentiating a quantized network is now possible with AD.

    // `Quantized` is a vector space when the dequantized type is one.
    extension Quantized: VectorNumeric where Dequantized: VectorNumeric {
        typealias Scalar = Dequantized.Scalar
        static func + (lhs: Quantized, rhs: Quantized) -> Quantized {
            // Custom code: Dequantize, add, and requantize!
        }
        static func * (lhs: Scalar, rhs: Quantized) -> Quantized {
            // Custom code: Dequantize, scale, and requantize!
        }
    }
    
    // `Quantized` is a differentiable manifold when the dequantized type is one.
    extension Quantized: Differentiable where Dequantized: Differentiable {
        typealias TangentVector = Dequantized.TangentVector
        typealias CotangentVector = Dequantized.CotangentVector
    
        func moved(toward tangent: Dequantized.TangentVector) -> Quantized {
            // Custom code: Dequantize, move, and requantize!
        }
    }

    With Quantized conforming to the new Differentiable protocol, when we differentiate a function of type (Quantized<Tensor<Float>, Int8>) -> Float with the reverse-mode differential operator, AD produces a function of type (Quantized<Tensor<Float>, Int8>) -> Tensor<Float>, which is exactly what we need for quantized training of neural networks.

  3. Generic optimizers can be defined in terms of manifold optimization functions, without implicit casting.

    extension SGD {
        func fit(_ parameters: inout Parameters, gradients: Parameters.CotangentVector) {
            parameters.update(withGradients: gradients) { θ, g in
                θ = θ.moved(toward: -learningRate * θ.tangentVector(from: g))
            }
        }
    }

Part 7: Customizable Differentiation

Some machine learning models require manipulating the gradient with respect to certain values, e.g. for gradient clipping; Tangent provides such a feature as a syntax extension in Python. Recurrent neural networks often suffer from the "exploding gradient" problem, and a typical solution is to clip the gradient of an RNN so that it does not exceed a certain magnitude.

func prediction(for input: Tensor<Float>) -> Tensor<Float> {
    var prediction = input
    for _ in 0...5 {
        // Clip gradient.
        prediction = prediction.withCustomizedGradient { grad in
            max(min(grad, 1), -1)
        }
        prediction = lstm.prediction(for: prediction)
    }
    return prediction
}

The APIs withCustomizedGradient(_:) and withCustomizedDerivatives(_:) look like compiler-known functions that make Swift run customized code inside differentiated code. However, because of the generality of the @differentiable registration mechanism, these functions can be defined entirely as ordinary Swift functions with no special support from the compiler. Here is an implementation of these APIs.

public extension Differentiable {
    @differentiable(forward, wrt: self, tangent: tangentCustomizingDerivatives)
    func withCustomizedDerivatives(
        _ body: @nondiff (TangentVector) -> TangentVector
    ) -> Self {
        return self
    }

    internal func tangentCustomizingDerivatives(
        body: (TangentVector) -> TangentVector,
        originalResult: Self,
        tangent: TangentVector
    ) -> TangentVector {
        return body(tangent)
    }

    @differentiable(reverse, wrt: self, adjoint: adjointCustomizingGradient)
    func withCustomizedGradient(
        _ body: @nondiff (CotangentVector) -> CotangentVector
    ) -> Self {
        return self
    }

    internal func adjointCustomizingGradient(
        body: (CotangentVector) -> CotangentVector,
        originalResult: Self,
        adjoint: CotangentVector
    ) -> CotangentVector {
        return body(adjoint)
    }
}

This API supports many gradient manipulation tasks in machine learning optimization. For example, the user can make gradient computation trigger a break from the loop.

var prediction = input
for _ in 0...5 {
    // Stop loop when necessary.
    var shouldStop = false
    prediction = prediction.withCustomizedGradient { grad in
        if grad < lowerBound {
            shouldStop = true
        }
        return grad
    }
    if shouldStop {
        break
    }
    prediction = lstm.prediction(for: prediction)
}

Setting a mutable flag is not the most user-friendly approach. We could create APIs that wrap withCustomizedDerivatives(_:) and withCustomizedGradient(_:) and also return a Bool, so that later code can decide whether to break out of the loop based on that return value. Better yet, if Swift supported non-local control flow, i.e. breaking out of nested closures, the code could be written with just a break.

var prediction = input
for _ in 0...5 {
    // Stop loop when necessary.
    prediction = prediction.withCustomizedGradient { grad in
        if grad < lowerBound {
            break
        }
        return grad
    }
    prediction = lstm.prediction(for: prediction)
}

Acknowledgements

The author would like to thank Dan Zheng, Chris Lattner, Alex Wiltschko, Bart van Merriënboer, Gordon Plotkin, Dougal Maclaurin, Matthew Johnson, Casey Chu, Tim Harley, Marc Rasi, and Dmitri Gribenko for their input to the initial design of this powerful language feature.
