First-Class Automatic Differentiation in Swift: A Manifesto


This document is written for both the machine learning community and the Swift programming language design community, with a strong focus on language design.

Status: Currently undergoing major revision.


Introduction

Automatic Differentiation (AD), also known as algorithmic differentiation, is a family of techniques used to obtain the derivative of a function. Functions can be represented as a composition of elementary operators whose derivatives are well-known. While partial derivatives can be computed through different techniques, the most common is a recursive application of the chain rule in the reverse direction, called reverse-mode AD. Reverse-mode AD computes vector-Jacobian products, i.e. partial derivatives with respect to each input parameter, and it has become a prerequisite for implementing gradient-based learning methods.

We aim to provide best-in-class AD, including the best optimizations, best error messages in failure cases, and the most flexibility and expressivity. To achieve this, we built support for AD right into the Swift compiler. This manifesto explains the design and vision of AD, and introduces to you the language extensions that will make Swift the world's first general-purpose differentiable programming language.

What is AD?

Basic Calculus

In basic calculus, differentiating a function of type ℝ → ℝ produces a function ℝ → ℝ that maps points onto their corresponding slopes.

In the context of Swift, differentiating a function (Float) -> Float produces (Float) -> Float. Functions with multiple arguments, such as (Float, Float) -> Float, can be thought of as a function whose input domain is a product of those argument types, i.e. (ℝ × ℝ) → ℝ, so the derivative of such a function has type (Float, Float) -> (Float, Float). According to this typing rule, the differential operator can be declared as a higher-order function, overloaded for each number of arguments because a Swift function's argument list is not formally modeled as a tuple.

func 𝒟<T: FloatingPoint>(_ f: (T) -> T) -> (T) -> T
func 𝒟<T: FloatingPoint>(_ f: (T, T) -> T) -> (T, T) -> (T, T)
func 𝒟<T: FloatingPoint>(_ f: (T, T, T) -> T) -> (T, T, T) -> (T, T, T)
...

func f(_ x: Double, _ y: Double) -> Double {
    return tanh(x + y)
}
𝒟(f) // (Double, Double) -> (Double, Double)
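To make the typing rule concrete, here is a minimal sketch that approximates the operator numerically with central differences. It is not automatic differentiation and not part of the proposed design; the name `derivative(of:h:)` and the step size `h` are assumptions made purely for illustration.

```swift
import Foundation

// Hypothetical numerical stand-ins for 𝒟, overloaded per arity like the
// declarations above. Central differences only approximate the derivative.
func derivative(of f: @escaping (Double) -> Double,
                h: Double = 1e-6) -> (Double) -> Double {
    return { x in (f(x + h) - f(x - h)) / (2 * h) }
}

func derivative(of f: @escaping (Double, Double) -> Double,
                h: Double = 1e-6) -> (Double, Double) -> (Double, Double) {
    return { x, y in
        let dx = (f(x + h, y) - f(x - h, y)) / (2 * h)
        let dy = (f(x, y + h) - f(x, y - h)) / (2 * h)
        return (dx, dy)
    }
}

func f(_ x: Double, _ y: Double) -> Double {
    return tanh(x + y)
}

let df = derivative(of: f) // (Double, Double) -> (Double, Double)
let (dfdx, dfdy) = df(0, 0)
print(dfdx, dfdy) // both close to 1, since d/dx tanh(x + y) = 1 - tanh²(x + y)
```

The two-argument overload returns a tuple with one partial derivative per argument, mirroring the (T, T) -> (T, T) result type in the declarations above.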

Vectors and Jacobians

In numerical computing, users often write code that operates on high-dimensional mathematical objects. The basic typing rules that we defined on real scalars (ℝ) can be generalized for module-like types such as vectors with extra consideration for shape. In vector calculus, the differentiation of a function f: ℝⁿ → ℝᵐ is defined per scalar because there are multiple inputs and multiple outputs. Full differentiation of a vector-valued function will thus result in a matrix, each of whose entries is a function that computes the partial derivative of an output scalar with respect to an input scalar. This matrix is called a Jacobian. In this definition, the Jacobian matrix has type J: (ℝⁿ → ℝ)^(m×n). For simplicity, we will model it as a function that maps vectors to real-valued matrices, J: ℝⁿ → ℝ^(m×n).

[Figure: Automatic differentiation approaches.]

While it is challenging to define this function with full type safety in Swift because shapes cannot be generic parameters yet, we can define a differential operator as follows, specialized on shapes.

func 𝒟<T>(_ f: (Vector2<T>) -> Vector3<T>) -> (Vector2<T>) -> Matrix3x2<T>
    where T: FloatingPoint

Computing the Jacobian of a function is often unnecessary in gradient-based optimization methods. Computing a full Jacobian requires repeated evaluations of more basic primitives: vector-Jacobian products (VJPs) or Jacobian-vector products (JVPs). These primitives are often exactly what we need in practice. In these terms, "vector" refers to a vector of partial derivatives that is chained with the Jacobian by left-multiplication or right-multiplication. The next sections explain this chaining and where Automatic Differentiation comes into the picture.
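As a rough illustration of why a full Jacobian costs repeated primitive evaluations, the sketch below recovers a 3×2 Jacobian from one JVP per input dimension. The example function `f`, the hand-written `jvp(at:direction:)`, and the use of plain arrays instead of vector types are all assumptions for this sketch.

```swift
// Hypothetical example function f: ℝ² → ℝ³, f(x) = [x0 * x1, x0 + x1, x0 * x0].
func f(_ x: [Double]) -> [Double] {
    return [x[0] * x[1], x[0] + x[1], x[0] * x[0]]
}

// Hand-written Jacobian-vector product of f at x in direction v.
func jvp(at x: [Double], direction v: [Double]) -> [Double] {
    return [x[1] * v[0] + x[0] * v[1], v[0] + v[1], 2 * x[0] * v[0]]
}

// Recover the full 3×2 Jacobian with one JVP per input dimension: the JVP
// with the j-th basis vector is the j-th column of the Jacobian.
func jacobian(at x: [Double]) -> [[Double]] {
    let n = x.count
    var rows = [[Double]](repeating: [Double](repeating: 0, count: n), count: 3)
    for j in 0..<n {
        var basis = [Double](repeating: 0, count: n)
        basis[j] = 1
        let column = jvp(at: x, direction: basis)
        for i in 0..<column.count {
            rows[i][j] = column[i]
        }
    }
    return rows
}

let J = jacobian(at: [2, 3])
print(J) // [[3.0, 2.0], [1.0, 1.0], [4.0, 0.0]]
```

A gradient-based optimizer that only needs one row or one column of this matrix can call the primitive once instead of building the whole thing.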

Gradient and Reverse-Mode AD

When we let a one-hot row vector vⁱ ∈ ℝᵐ left-multiply a Jacobian matrix of type ℝ^(m×n), we are selecting one row in the matrix, which is exactly the gradient of fᵢ evaluated at x, i.e. ∇fᵢ(x) = vⁱ J(x).

When a vector v in ℝᵐ represents the gradient of another function g: ℝᵐ → ℝ at y = f(x), namely ∂g/∂y, then the vector-Jacobian product v J(x) represents ∂g/∂x. The linear function that takes a vector and left-multiplies it with the Jacobian is also called a pullback. We can define this function in Swift as a higher-order function shown below. The body of this function can be defined in terms of 𝒟, the differential operator that returns a Jacobian.

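func pullback<T: FloatingPoint>(
    of f: @escaping (Vector2<T>) -> Vector3<T>,
    at x: Vector2<T>
) -> (Vector3<T>) -> Vector2<T> {
    // The adjoint lives in the output space (ℝ³); left-multiplying the
    // 3×2 Jacobian by it yields a vector in the input space (ℝ²).
    return { adjoint in matmul(adjoint, 𝒟(f)(x)) }
}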

However, when computing gradients or general vector-Jacobian products, we do not need to compute the Jacobian at all: Automatic Differentiation is here to help.

The chain rule of differentiation can be interpreted in left-associative order, i.e. accumulating each function's partial derivatives from the final output, eventually reaching each input.
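The left-associative accumulation described here can be sketched with plain closures: each function returns its value together with a pullback, and composition chains the pullbacks from the output back to the input. The `WithPullback` alias and `compose(_:after:)` are hypothetical names for this sketch, not part of the proposed design.

```swift
import Foundation

// A function bundled with its pullback: calling it returns the value and a
// closure that maps an output adjoint to an input adjoint (a VJP).
typealias WithPullback = (Double) -> (value: Double, pullback: (Double) -> Double)

let sinWithPullback: WithPullback = { x in
    (value: sin(x), pullback: { v in v * cos(x) })
}

let squareWithPullback: WithPullback = { x in
    (value: x * x, pullback: { v in v * 2 * x })
}

// Chain rule in reverse: run f then g forward, then pull the adjoint back
// through g first and f second. No Jacobian is ever materialized.
func compose(_ g: @escaping WithPullback,
             after f: @escaping WithPullback) -> WithPullback {
    return { x in
        let (y, pullbackF) = f(x)
        let (z, pullbackG) = g(y)
        return (value: z, pullback: { v in pullbackF(pullbackG(v)) })
    }
}

// h(x) = sin(x)², whose derivative is 2·sin(x)·cos(x) = sin(2x).
let h = compose(squareWithPullback, after: sinWithPullback)
let (value, pullback) = h(0.5)
let gradient = pullback(1) // seed the output adjoint with 1
print(value, gradient, sin(1.0))
```

Seeding the pullback with 1 plays the role of the one-hot row vector from the previous section for a function with a single scalar output.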

Directional Derivatives and Forward-Mode AD

Similarly, when we let a column vector v ∈ ℝⁿ right-multiply a Jacobian value matrix of type ℝ^(m×n), the result is a vector whose elements are exactly the directional derivatives of each fᵢ evaluated at x in direction v.

The linear function that takes a vector and right-multiplies the Jacobian value matrix is called a differential, and it can also be defined in Swift as a higher-order function in terms of 𝒟.

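func differential<T: FloatingPoint>(
    of f: @escaping (Vector2<T>) -> Vector3<T>,
    at x: Vector2<T>
) -> (Vector2<T>) -> Vector3<T> {
    // The tangent lives in the input space (ℝ²); right-multiplying the
    // 3×2 Jacobian by it yields a vector in the output space (ℝ³).
    return { tangent in matmul(𝒟(f)(x), tangent) }
}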

Just like vector-Jacobian products, Jacobian-vector products are easy to compute using Automatic Differentiation. By simply applying the chain rule of differentiation from an input, we will accumulate each function's partial derivatives and reach each output.
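One common way to realize this forward accumulation is with dual numbers, where every value carries its tangent along with it. The sketch below is an illustration under that assumption; the `Dual` type and the overloads on it are hypothetical and are not the mechanism the compiler uses.

```swift
import Foundation

// A dual number carries a value together with a tangent (directional
// derivative) that is propagated forward alongside it.
struct Dual {
    var value: Double
    var tangent: Double
}

func + (a: Dual, b: Dual) -> Dual {
    return Dual(value: a.value + b.value, tangent: a.tangent + b.tangent)
}

func * (a: Dual, b: Dual) -> Dual {
    // Product rule.
    return Dual(value: a.value * b.value,
                tangent: a.tangent * b.value + a.value * b.tangent)
}

func sin(_ a: Dual) -> Dual {
    // Chain rule for sin.
    return Dual(value: sin(a.value), tangent: cos(a.value) * a.tangent)
}

// f(x) = x · sin(x) + x, written once and evaluated on dual numbers.
func f(_ x: Dual) -> Dual {
    return x * sin(x) + x
}

// Seeding the tangent with 1 yields df/dx in a single forward pass.
let result = f(Dual(value: 2, tangent: 1))
print(result.value, result.tangent)
```

The tangent that comes out equals sin(2) + 2·cos(2) + 1, the closed-form derivative of f at 2.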

AD has a rich background. For an in-depth introduction, here's some great documentation:

Why does Swift need AD?

Swift is a new programming language in the machine learning space. Recently, the Swift for TensorFlow project brought the full power of a machine learning framework into the Swift programming language. Numerical computing has a very different set of requirements than application development and systems development, and we believe that Swift needs to better address those requirements and improve the usability of numerical software. One of the most important building blocks in machine learning and numerical computing is the ability to differentiate math code. Automatic Differentiation has been implemented in many languages, but because of language constraints and design trade-offs, many existing AD systems have limitations. We would like to take this opportunity to improve Swift, and demonstrate what Swift can offer in all areas of numerical computing in the presence of a compiler and a static type system.

Why make AD first-class?

Automatic Differentiation has been a research topic in scientific computing and high-performance computing for nearly half a century. Traditional tools such as OpenAD, TAPENADE and ADIFOR transform existing source code. Many advanced techniques have improved the performance of derivatives written in FORTRAN, but these tools have not gained wide adoption in the machine learning community. More recent AD systems like Stalin∇ (pronounced Stalingrad, available as a dialect of Scheme) achieved good usability by integrating the differential operator into the language, and are equipped with a complete set of AD features (such as forward/reverse, nested AD, Hessians, Jacobians, directional derivatives and checkpointing). Along with libraries such as DiffSharp (available in F#), and ad (available in Haskell), they combine AD closely with functional programming languages.

Researchers in the machine learning community have built many library implementations of AD in Python and C++, including Autograd, TensorFlow, PyTorch, etc.

As Automatic Differentiation is an integral part of any machine learning framework, traditional designs and implementations of AD have some limitations. Some of these libraries are implemented as a transformation on a standalone DSL (a graph) with a closed set of operators. Others are implemented using operator overloading directly on a subset of the source language. Although these libraries have gained wide adoption, the ones that leverage ahead-of-time AD do not expose an easy-to-use programming model, and the ones that have a friendlier programming model lack static analysis to perform more optimized AD.

Recent projects such as Tangent, Myia, and Zygote.jl based their AD upon source code transformation (SCT), a technique that was common in advanced AD systems before the deep learning era such as Stalin∇. The first two libraries parse a Python subset into ASTs and transform a function to its derivatives either in AST or in a functional IR, and Zygote hooks into the Julia compiler and transforms Julia's IR directly. These tools are pushing the boundaries of dynamic languages.

We would like our AD system to feel native and expressive. AD in Swift aims to solve real-world usability problems by providing the best generalizations, best error messages in failure cases, composable differential operators, and fully customizable types and derivatives. To achieve this, we built support for AD right into the Swift language. Even though AD has been incubated as part of the Swift for TensorFlow project, we believe its importance and impact reach beyond machine learning, so we decided to propose it eventually through Swift Evolution into the core language.

Vision

Swift will be the world's first general-purpose differentiable programming language.

Ease of Use

We expect Swift's language-integrated AD to be super easy to use in the context of machine learning, control in robotics, and scientific computing. AD is a general language feature that works seamlessly with third-party libraries such as TensorFlow.

struct Parameters: Differentiable, ParameterGroup {
    var w1 = Tensor<Float>(randomNormal: [784, 30])
    var b1 = Tensor<Float>(zeros: [30])
    var w2 = Tensor<Float>(randomNormal: [30, 10])
    var b2 = Tensor<Float>(zeros: [10])
}

var params = Parameters()
let minibatches = Dataset(...)
var optimizer = StochasticGradientDescent()
for (x, y) in minibatches {
    let grads = gradient(at: params) { params in
        let h1 = tanh(matmul(x, params.w1) + params.b1)
        let ŷ = sigmoid(matmul(h1, params.w2) + params.b2)
        let loss = (y - ŷ).squared().mean()
        print("Loss is \(loss)")
        return loss
    }
    optimizer.fit(&params, gradients: grads)
}

Full Extensibility: Custom Types and Derivatives

We want our AD system to be fully extensible to the point where users can request derivatives of a function taking their own user-defined numeric types, and even use this feature to implement structure-dependent algorithms such as tree-recursive neural networks. Therefore, when performing AD, Swift makes no special assumptions about individual math functions or the types it should support. We enable library designers and developers to easily define any type or differentiable functions, all in pure Swift code.

Swift supports protocol-oriented programming and first-class value semantics. AD is deeply integrated with value types and has full extensibility via protocol conformances. The user can make their custom data structures differentiable simply by declaring a conformance to the Differentiable protocol:

extension MyType: Differentiable {
    ...
}

Or make an obviously non-differentiable function differentiable by using the @differentiable attribute, specifying a "tangent" function for computing its Jacobian-vector products, or an "adjoint" function for computing its vector-Jacobian products.

@differentiable(tangent: tangentFoo, adjoint: adjointFoo)
func foo(_ x: Float) -> Float {
    return Float(Int(x)) // obviously non-differentiable
}

func tangentFoo(_ x: (Float, Float), originalResult: Float) -> Float {
    // Insert custom code to compute the directional derivative
}

func adjointFoo(_ x: Float, originalResult: Float, adjoint: Float) -> Float {
    // Insert custom code to compute the gradient
}

Composable Differential Operators

With fully customizable data structures and derivatives, everything should feel native in the language. In addition, differential operators are functional and composable, and differentiability is naturally integrated in the type system. All differential operators are defined in Swift, and developers can create their own differential operators by composing existing ones. For example, the user can use the "forward-on-reverse" approach to compute Hessian-vector products, where the hvp(at:in:) operator is defined as a native Swift function. The @autodiff(order: 2) attribute in the closure type signature marks the closure argument as being differentiable up to at least the 2nd order, so that the caller of hvp(at:in:) will differentiate the actual closure argument as needed.

func hvp<T: Differentiable, R: FloatingPoint>(
    at x: T, in f: @autodiff(order: 2) (T) -> R
) -> @autodiff(linear) (T) -> T {
    return differential(at: x, in: gradient(of: f))
}

Static Analysis and Diagnostics

By building first-class AD into the programming language, we can provide better diagnostics about differentiability and numeric stability than dynamic languages can, all at compile time.

test.swift:58:10: error: function is not differentiable
  return #gradient(funcToDiff)(x)
         ^         ~~~~~~~~~~

test.swift:54:10: note: expression is not differentiable
  return middle2(x)
         ^

test.swift:50:10: note: when differentiating this function call
  return middle(x)
         ^

test.swift:46:10: note: when differentiating this function call
  return nested(y)
         ^

Flexible Functional-Style Differentiation

In common AD libraries, there are two differentiation styles: functional and imperative.

Style        Syntax                        Meaning
Functional   let 𝝯f = gradient(of: f)      Differentiating a function
             𝝯f(x)
Imperative   let y = f(x)                  Differentiating code traced through data flow
             gradient(of: y, wrt: x)

Functional-style AD transforms one function into another, producing a function that takes the original arguments and returns the partial derivatives evaluated at each argument. Imperative-style AD, on the other hand, is a value-value dependency analysis. Although we use both notations in mathematics, imperative AD comes at the cost of semantic inconsistency with the host language, for example:

let y = f(x)
x = 3
gradient(of: y, wrt: x) // undefined

Semantically, y is a value, but x is both a value and a reference to a memory location -- it is unclear what exactly we are differentiating with respect to. Though making y and x have reference types could make this particular example work out semantically, it would be fundamentally inconsistent with Swift's core design where mathematical objects have value types, and would also make scalar types like Float incompatible with automatic differentiation.

We believe Swift's AD can achieve the same level of expressivity as imperative AD while preserving functional properties, and use language integration to push developers' productivity to the next level.

Part 1: Differentiable Types

Swift is a general-purpose programming language. Therefore, not every function is mathematically differentiable, and not every type represents a real vector space to begin with. To make our system mathematically sound, we refine the Swift standard library to form a basis for automatic differentiation.

The starting point of this refinement is the fundamental numeric protocols. In this section, we talk about how we improve the Numeric protocol to support the addition of vector types and protocols. Then, we introduce a protocol to represent vector spaces as that would be a requirement for doing calculus. Finally, we design a protocol specific to differentiation.

Revising the Numeric protocol

The Numeric protocol today refines ExpressibleByIntegerLiteral. This makes sense for scalars, but is not compatible with vector data structures because type-checking would fail on the scalar multiplication operator.

On the Swift forum, we have discussed the fundamental blocker for vector types to conform to the existing Numeric protocol. The consensus was to introduce a weakening of the Numeric protocol to represent the abstractions shared between scalars and vectors: a rng, i.e. a ring without unity (we assume that vector spaces are rngs by endowing them with * as element-wise multiplication). The protocol will be called Arithmetic.

public protocol Arithmetic: Equatable {
    static var zero: Self { get }
    prefix static func + (x: Self) -> Self
    static func + (lhs: Self, rhs: Self) -> Self
    // Compound assignments mutate `lhs` in place and return nothing.
    static func += (lhs: inout Self, rhs: Self)
    static func - (lhs: Self, rhs: Self) -> Self
    static func -= (lhs: inout Self, rhs: Self)
    static func * (lhs: Self, rhs: Self) -> Self
    static func *= (lhs: inout Self, rhs: Self)
}

The existing Numeric will be changed to refine (inherit from) Arithmetic, keeping all of its existing behavior.

public protocol Numeric: Arithmetic, ExpressibleByIntegerLiteral {
    associatedtype Magnitude: Comparable, Numeric
    init?<T>(exactly source: T) where T: BinaryInteger
    var magnitude: Magnitude { get }
}
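The sketch below shows how a small vector type could adopt such a protocol while staying out of Numeric. The protocol is re-declared locally in trimmed form (no prefix or compound-assignment operators) so the example compiles on its own; `Vector2` and `sum(_:)` are hypothetical names introduced for illustration.

```swift
// Local stand-in for the proposed protocol, declared here only so the
// sketch compiles on its own; it is not the standard library's definition.
protocol Arithmetic: Equatable {
    static var zero: Self { get }
    static func + (lhs: Self, rhs: Self) -> Self
    static func - (lhs: Self, rhs: Self) -> Self
    static func * (lhs: Self, rhs: Self) -> Self
}

// Hypothetical fixed-size vector. It can satisfy Arithmetic, but it has no
// sensible init(integerLiteral:), which is why it cannot be Numeric.
struct Vector2: Arithmetic {
    var x: Double
    var y: Double

    static var zero: Vector2 { return Vector2(x: 0, y: 0) }

    static func + (lhs: Vector2, rhs: Vector2) -> Vector2 {
        return Vector2(x: lhs.x + rhs.x, y: lhs.y + rhs.y)
    }
    static func - (lhs: Vector2, rhs: Vector2) -> Vector2 {
        return Vector2(x: lhs.x - rhs.x, y: lhs.y - rhs.y)
    }
    // Element-wise multiplication, as assumed in the text.
    static func * (lhs: Vector2, rhs: Vector2) -> Vector2 {
        return Vector2(x: lhs.x * rhs.x, y: lhs.y * rhs.y)
    }
}

// Double already provides zero, +, - and *, so the conformance is empty.
extension Double: Arithmetic {}

// Generic code written against Arithmetic works for scalars and vectors alike.
func sum<T: Arithmetic>(_ values: [T]) -> T {
    return values.reduce(T.zero, +)
}

let total = sum([Vector2(x: 1, y: 2), Vector2(x: 3, y: 4)])
print(total.x, total.y)  // 4.0 6.0
print(sum([1.5, 2.5]))   // 4.0
```

The point of the weaker protocol is visible in `sum(_:)`: it needs only a zero and addition, so it accepts both scalars and vectors.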

The VectorNumeric protocol

After we introduce the Arithmetic protocol, which makes the standard library suitable for vector APIs and beyond, we can define a protocol that generalizes vectors. Mathematically, a vector space is a ring without unity if we endow it with * as element-wise multiplication. We represent vector spaces through the VectorNumeric protocol as follows. Scalar is the type of the elements of this vector space -- the field which the vector space is over. Shape is the shape of this vector space, which is customizable. The initializer takes a value of the Scalar type and a Shape and returns a vector of the specified shape.

/// A type that represents an unranked vector space. Values of this type are
/// elements in this vector space and with a specific shape.
public protocol VectorNumeric: Arithmetic {
    /// The type of scalars in the vector space.
    associatedtype Scalar: Numeric

    /// The type whose values specify the shape of an object in the vector
    /// space.
    associatedtype Shape

    /// Create an object in the vector space with the specified shape by
    /// repeatedly filling the object with the specified value.
    ///
    /// - Parameters:
    ///   - repeatedValue: the value to repeat for the specified shape
    ///   - shape: the shape
    init(repeating repeatedValue: Scalar, shape: Shape)

    /// The shape of this vector.
    var shape: Shape { get }

    /// Returns the vector scaled by the given scalar.
    static func * (scale: Scalar, value: Self) -> Self
}
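A conformance could look roughly like the sketch below. Both protocols are re-declared locally in trimmed form so the example stands alone, and `DenseVector` (a rank-1 vector whose shape is its element count) and `addOne(_:)` are hypothetical names used only for illustration.

```swift
// Trimmed local stand-ins for the proposed protocols so the sketch compiles
// on its own; names mirror the text but these are not library definitions.
protocol Arithmetic: Equatable {
    static func + (lhs: Self, rhs: Self) -> Self
    static func * (lhs: Self, rhs: Self) -> Self
}

protocol VectorNumeric: Arithmetic {
    associatedtype Scalar: Numeric
    associatedtype Shape
    init(repeating repeatedValue: Scalar, shape: Shape)
    var shape: Shape { get }
    static func * (scale: Scalar, value: Self) -> Self
}

// Hypothetical rank-1 vector whose shape is simply its element count.
struct DenseVector: VectorNumeric {
    typealias Scalar = Double
    typealias Shape = Int

    var elements: [Double]
    var shape: Int { return elements.count }

    init(_ elements: [Double]) { self.elements = elements }

    init(repeating repeatedValue: Double, shape: Int) {
        elements = [Double](repeating: repeatedValue, count: shape)
    }

    static func + (lhs: DenseVector, rhs: DenseVector) -> DenseVector {
        precondition(lhs.shape == rhs.shape, "shape mismatch")
        return DenseVector(zip(lhs.elements, rhs.elements).map { $0 + $1 })
    }
    static func * (lhs: DenseVector, rhs: DenseVector) -> DenseVector {
        precondition(lhs.shape == rhs.shape, "shape mismatch")
        return DenseVector(zip(lhs.elements, rhs.elements).map { $0 * $1 })
    }
    static func * (scale: Double, value: DenseVector) -> DenseVector {
        return DenseVector(value.elements.map { scale * $0 })
    }
}

// Generic code can build shape-matched constants without knowing the type.
func addOne<V: VectorNumeric>(_ v: V) -> V {
    return v + V(repeating: 1, shape: v.shape)
}

let v = 2 * DenseVector([1, 2, 3])
print(addOne(v).elements) // [3.0, 5.0, 7.0]
```

The `init(repeating:shape:)` requirement is what lets `addOne(_:)` create a vector of ones with the right shape, which an integer literal alone could not express.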

The Differentiable protocol

Now we define a protocol that "activates" a type's differentiability. At first glance, the conforming type must also be a VectorNumeric type, so we make this protocol refine VectorNumeric. Since differentiation only makes sense on real vectors, we add a constraint on the associated type Scalar such that it conforms to FloatingPoint.

public protocol Differentiable: VectorNumeric where Scalar: FloatingPoint {
}

You may notice that Differentiable looks like a dummy protocol because it doesn't have any requirements other than the ones inherited from VectorNumeric. Although under the current assumptions we can completely omit the Differentiable protocol and just have the AD system recognize VectorNumeric-conforming types whose scalar elements conform to FloatingPoint, we actually have theoretical and practical reasons to revise the Differentiable protocol later on. So we keep Differentiable as a separate protocol for now and build towards the final design at the end of this document.

Part 2: Primitive Registration

We are aiming for an open and extensible system, so we made the compiler agnostic of the actual operations - it does not have special knowledge of numeric standard library functions or distinguish between primitive operators and other functions. We recursively determine a function's differentiability based on:

  • whether a function has a primitive differentiability as specified in the standard or user-defined library, and

  • whether a function's definition (type signature and body) is differentiable by applying the chain rule of differentiation.

As such, we provide a syntactic way of specifying the differentiability of a function, using either the function's linearity properties or a separate function to specify the "tangent code", which specifies how to differentiate the function in forward mode, or the "adjoint code", which specifies how to differentiate the function in reverse mode.

The @differentiable attribute

We introduce a declaration attribute @differentiable to Swift's syntax. The full grammar of @differentiable is defined as follows:

differentiation-mode = 'forward' | 'reverse' | 'bidirectional'
differentiability = differentiation-mode | 'linear' | 'constant'
differentiability-wrt-self = 'wrt' ':' 'self'
differentiation-order = 'once'
differentiation-tangent-specifier = 'tangent' ':' declaration-name
differentiation-adjoint-specifier = 'adjoint' ':' declaration-name
differentiable-attribute = '@differentiable'
    '(' differentiability
    [ ',' differentiability-wrt-self ]
    [ ',' differentiation-order ]
    [ ',' differentiation-tangent-specifier ]
    [ ',' differentiation-adjoint-specifier ]
    ')'
declaration-attribute = differentiable-attribute

First Glance

The multiplication operator * is differentiable with respect to its two arguments. Here's how we make it differentiable in the standard library.

extension FloatingPoint {
    @differentiable(bidirectional, tangent: tangentMul, adjoint: adjointMul)
    static func * (x: Self, y: Self) -> Self { ... }
    
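    // Associated functions are static so that they take exactly the
    // arguments of the static operator, with no implicit `self`.
    internal static func tangentMul(
        x: (Self, Self), y: (Self, Self), originalResult: Self
    ) -> Self {
        return x.1 * y.0 + y.1 * x.0
    }

    internal static func adjointMul(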
        x: Self, y: Self, originalResult: Self, seed: Self
    ) -> (Self, Self) {
        return (seed * y, seed * x)
    }
}
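The two associated functions can be exercised as ordinary Swift today. The sketch below restates them as free functions on Double, which is an adaptation for illustration only: the @differentiable attribute is omitted, and the sample inputs are made up.

```swift
// Free-function versions of the tangent and adjoint of multiplication,
// following the argument conventions used in the text. The originalResult
// parameter is unused here but kept to match the expected signature.
func tangentMul(x: (Double, Double), y: (Double, Double),
                originalResult: Double) -> Double {
    // Arguments are (value, tangent) pairs: d(x·y) = dx·y + dy·x.
    return x.1 * y.0 + y.1 * x.0
}

func adjointMul(x: Double, y: Double,
                originalResult: Double, seed: Double) -> (Double, Double) {
    // seed · ∂(x·y)/∂x and seed · ∂(x·y)/∂y.
    return (seed * y, seed * x)
}

let x = 3.0, y = 4.0
let product = x * y

// Directional derivative along (dx, dy) = (1, 0) is ∂(x·y)/∂x = y.
let directional = tangentMul(x: (x, 1), y: (y, 0), originalResult: product)
print(directional) // 4.0

// With seed 1 the adjoint returns the gradient (y, x).
let gradient = adjointMul(x: x, y: y, originalResult: product, seed: 1)
print(gradient) // (4.0, 3.0)
```

The tangent function consumes one direction and yields one number, while the adjoint function consumes one seed and yields a partial derivative per argument.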

In TensorFlow, the convolution operator is only differentiable with respect to a subset of arguments. Here's how we make it differentiable so that it can be used for back-propagation.

@differentiable(reverse, adjoint: adjointConv2D)
public func conv2d(_ input: Tensor<Float>, filter: Tensor<Float>,
                   strides: @nondiff (Int32, Int32, Int32, Int32),
                   padding: @nondiff Padding) -> Tensor<Float> {
    ...
}

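// Per the adjoint signature rule, the adjoint also receives the original
// result and the back-propagated seed, and returns one partial derivative
// per differentiable argument (input and filter).
func adjointConv2D(_ input: Tensor<Float>, filter: Tensor<Float>,
                   strides: (Int32, Int32, Int32, Int32),
                   padding: Padding,
                   originalResult: Tensor<Float>,
                   seed: Tensor<Float>) -> (Tensor<Float>, Tensor<Float>) {
    ...
}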

Differentiation Parameters

Differentiation parameters are marked inline at each argument position in the function declaration. By default, the function is differentiated with respect to every argument, unless the argument is marked as @nondiff.

When a differentiable attribute is applied to a method, or to the getter of a computed property in a type, the function often needs to be differentiated with respect to the implicit self argument as well. To make a function a differentiable primitive with respect to self, one can add wrt: self to the @differentiable attribute.

Differentiability

There are five options for differentiability:

  1. Forward: @differentiable(forward, tangent: ...)

    This option says that the function is forward-mode differentiable. Forward-mode differentiation requires the "tangent code" (or tangent function) of this function, so that Swift knows how to compute the function's directional derivatives in the direction specified by the tangent vector that has been forward-propagated to the tangent function.

    The compiler will expect the name of the tangent function, with an expected type signature, to be specified later in the tangent: parameter in the attribute.

  2. Reverse: @differentiable(reverse, adjoint: ...)

    This option says that the function is reverse-mode differentiable. Reverse-mode differentiation requires the "adjoint code" (or adjoint function) of this function, so that Swift knows how to compute the function's vector-Jacobian products, where the vector, also called "adjoint vector", has been back-propagated to the adjoint function.

    The compiler will expect the identifier of the adjoint function, with an expected type signature, to be specified later in the adjoint: parameter in the attribute.

  3. Bidirectional: @differentiable(bidirectional, tangent: ..., adjoint: ...)

    This option says that the function is both forward-mode differentiable and reverse-mode differentiable. The compiler will expect both the tangent function and the adjoint function to be specified later in this attribute.

  4. Constant: @differentiable(constant)

    By definition, constant functions always have zero derivatives and are differentiable at any arbitrary order. So differentiating this function will result in a zero vector (or vectors, when the function has multiple differentiation arguments) with the same shape as each differentiation argument.

  5. Linear: @differentiable(linear)

    By definition, a linear map is always a unary function and its Jacobian is the matrix associated with this linear transformation itself. In other words, both its differential and its pullback are itself.

Associated Functions

As explained, differentiabilities have different functional requirements.

  1. forward differentiability

    When the differentiability is forward, the compiler expects a tangent: label in the attribute followed by the name (qualified or unqualified) of a tangent function that is to be associated with the original function. If the original function declaration has type (T0, ..., Tn) -> U, then the expected type of the tangent function is ((T0, T0), ..., (Tn, Tn), U) -> U. As we can see, every argument of the original function has become a "dual number" in the tangent function, represented as a tuple. The first element of such a tuple is the original argument; the second element is the forward-propagated directional derivative, namely the "vector" in "Jacobian-vector product". The last argument to the tangent function is the original function's result. The result of the tangent function is the directional derivative. If any of the original arguments is marked as @nondiff, it will not become a dual number in the tangent function's argument list but will remain as the original argument itself.

  2. reverse differentiability

    When the differentiability is reverse, the compiler expects an adjoint: label in the attribute followed by the name (qualified or unqualified) of an adjoint function that is to be associated with the original function. If the original function declaration has type (T0, ..., Tn) -> U, then the expected type of the adjoint function is (T0, ..., Tn, U, U) -> (T0, ..., Tn). As we can see, the first n + 1 arguments to the adjoint function, of types T0, ..., Tn, are the original arguments. The next argument is the original function's result. The last argument is the back-propagated partial derivatives at the original function's result, namely the "vector" in "vector-Jacobian product". The result of the adjoint function contains the partial derivatives at each argument that has not been marked as @nondiff.

  3. bidirectional differentiability

    When the differentiability is bidirectional, the compiler expects both tangent: and adjoint: arguments to be specified.

  4. Other differentiabilities

    Other differentiabilities such as constant and linear do not require any associated functions. However, users can choose to specify tangent/adjoint function(s) for their own purposes such as custom optimizations.
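The signature rules above can be illustrated with a function that has one differentiable and one non-differentiable argument. In the sketch below, `power`, `tangentPower` and `adjointPower` are hypothetical names, and the @differentiable and @nondiff annotations appear only in comments so that the code compiles with a stock Swift toolchain.

```swift
import Foundation

// Original function of type (Double, Int) -> Double. In the proposed
// syntax the exponent would be marked @nondiff; here that is only a comment.
func power(_ x: Double, _ n: Int) -> Double {
    return pow(x, Double(n))
}

// Tangent: ((Double, Double), Int, Double) -> Double.
// The differentiable argument becomes a (value, tangent) pair; the
// non-differentiable exponent is passed through unchanged.
func tangentPower(_ x: (Double, Double), _ n: Int,
                  originalResult: Double) -> Double {
    return Double(n) * pow(x.0, Double(n - 1)) * x.1
}

// Adjoint: (Double, Int, Double, Double) -> Double.
// Takes the original arguments, the original result, and the
// back-propagated adjoint; returns a partial derivative only for x.
func adjointPower(_ x: Double, _ n: Int,
                  originalResult: Double, adjoint: Double) -> Double {
    return adjoint * Double(n) * pow(x, Double(n - 1))
}

let y = power(2, 3)                                          // 2³
let jvp = tangentPower((2, 1), 3, originalResult: y)         // 3 · 2² · 1
let vjp = adjointPower(2, 3, originalResult: y, adjoint: 1)  // 1 · 3 · 2²
print(y, jvp, vjp)
```

Because the function has a single differentiable argument and a scalar result, the JVP seeded with tangent 1 and the VJP seeded with adjoint 1 coincide.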

Differentiation Order

When a function is marked as @differentiable, Swift assumes it to be higher-order differentiable, i.e. differentiable at all orders, unless once is specified in the attribute, in which case Swift will not guarantee any higher-order differentiability. If their associated functions (tangent or adjoint) are serialized, then their derivatives may be differentiable via a separate code transformation.

Differentiabilities linear and constant guarantee smoothness, and they do not have to be serialized whatsoever because their derivatives do not depend on any code transformation.

forward and reverse transitively require the tangent function and the adjoint function, respectively, to be differentiable with respect to the original arguments. When compiling such declarations, Swift will verify the tangent/adjoint function is also differentiable by static analysis. If they are not differentiable, the compiler will error out, prompting the user to insert once in the @differentiable attribute.

Example 1. Linear functions are differentiable at any order.

public extension Tensor {
    @differentiable(linear, wrt: self)
    func transposed() -> Self {
        ...
    }
}

Example 2. A forward-mode primitive-differentiable function whose tangent is closed-form is differentiable.

// Okay, the tangent function is differentiable.
@differentiable(forward, tangent: tangentFoo)
func foo(_ x: Float) -> Vector<Float> {
    return Vector(repeating: sin(x), shape: [2, 3])
}

func tangentFoo(_ dualX: (Float, Float), 
                originalResult: Vector<Float>) -> Vector<Float> {
    let (x, dx) = dualX
    // Differentiable because `Vector.init(repeating:shape:)`, `*`, `sin` and 
    // `cos` are all declared `@differentiable` and are differentiable.
    return Vector(repeating: cos(x) * dx, shape: [2, 3])
}

Example 3. A reverse-mode primitive-differentiable function is not differentiable at a higher order because its adjoint is not differentiable.

@differentiable(reverse, adjoint: adjointBar)
func bar(_ x: Vector<Float>) -> Float {
    return sin(x)[0]
}

var someGlobalVariable: Vector<Float> = [1, 1, 1]

func adjointBar(_ x: Vector<Float>, y: Float, adjoint: Float) -> Vector<Float> {
    var dydx = Vector<Float>(repeating: 0, shape: x.shape)
    someGlobalVariable[0] = cos(x[0]) * adjoint
    dydx[0] = someGlobalVariable[0]
    return dydx
}

test.swift:3:35: error: function `bar` does not support higher-order differentiation
because its adjoint is not differentiable; would you like to add `once`?
  @differentiable(reverse, adjoint: adjointBar)
                                    ^~~~~~~~~~
test.swift:8:6: note: `adjointBar` is defined here
  func adjointBar(_ x: Vector<Float>, y: Float, adjoint: Float) -> Vector<Float> {
       ^~~~~~~~~~
test.swift:10:5: note: operation is not differentiable
      someGlobalVariable[0] = cos(x[0]) * adjoint
      ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Part 3: Basic Differentiation

The application of the chain rule of differentiation gives us vector-Jacobian products or Jacobian-vector products, given by functions. Now that we have defined primitive differentiable functions, Swift can recursively differentiate any function whose body is available to the compiler.

Start Simple: Gradient and Derivatives

We start by introducing the syntax of two raw differential operators:

  • #gradient(f): Produces the gradient of f, where f: ℝⁿ → ℝ.
  • #derivatives(f): Produces derivatives of f, where f: ℝ → ℝᵐ.

The syntax of these operators looks like macros, but we will generalize them and make them look much nicer in the second half of this document.

Example:

func f(_ x: Vector<Float>, _ w: Vector<Float>) -> Float {
   return x • w // dot product
}

#gradient(f) // (Vector<Float>, Vector<Float>) -> (Vector<Float>, Vector<Float>)

func g(_ x: Float) -> (Vector<Float>, Vector<Float>) {
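   // Illustrative body (an assumption): any differentiable function of this type works here.
   return (Vector(repeating: x, shape: [2, 3]), Vector(repeating: sin(x), shape: [2, 3]))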
}

#derivatives(g) // (Float) -> (Vector<Float>, Vector<Float>)
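
To make the type of the gradient concrete, here is a minimal hand-written sketch in plain Swift of what #gradient(f) would compute for the dot-product example. It does not use the proposed operators: arrays stand in for Vector<Float>, and the names dot and gradientOfDot are hypothetical.

```swift
// Plain-Swift stand-in for `f`: the dot product of two equal-length arrays.
func dot(_ x: [Float], _ w: [Float]) -> Float {
    precondition(x.count == w.count, "shapes must match")
    return zip(x, w).reduce(Float(0)) { sum, pair in sum + pair.0 * pair.1 }
}

// Hand-written equivalent of what `#gradient(f)` would return.
// Since f(x, w) = Σ xᵢ·wᵢ, we have ∂f/∂x = w and ∂f/∂w = x.
func gradientOfDot(_ x: [Float], _ w: [Float]) -> ([Float], [Float]) {
    precondition(x.count == w.count, "shapes must match")
    return (w, x)
}

let (dfdx, dfdw) = gradientOfDot([1, 2, 3], [4, 5, 6])
// dfdx is [4, 5, 6] and dfdw is [1, 2, 3].
```

The gradient has one component per argument, each with the same shape as that argument, which is why the result type mirrors the argument list.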

The grammar of these raw differential operators is defined as follows:

derivatives-operator = '#derivatives'
gradient-operator = '#gradient'
raw-differential-operator = derivatives-operator | gradient-operator
autodiff-argument-index-specifier = '.' integer-literal
autodiff-expression =
    raw-differential-operator '(' expression [ ',' 'wrt' ':' autodiff-argument-index-specifier ] ')'
expression = autodiff-expression

Embrace Generality: Vector-Jacobian Products and Jacobian-Vector Products

Gradient and derivatives are two special cases of differentiation, where the result or the argument is a scalar, respectively. When neither is a scalar, vector-Jacobian products and Jacobian-vector products are computed with a given vector. These cases are less obvious, but they are required for modular machine learning APIs where each neural network layer defines a back-propagation method that takes a partial derivative vector back-propagated from the previous layer. As such, we add two extra differential operators which will be useful for computing these products.

  • #differential(f): Produces a function that takes the original arguments and returns the differential of f.
  • #pullback(f): Produces a function that takes the original arguments and returns the pullback of f.
jvp-operator = '#differential'
vjp-operator = '#pullback'
raw-differential-operator = jvp-operator | vjp-operator

Example:

// A random generic function that is differentiable.
func f<T0, T1, U>(_ x: T0, _ y: T1) -> U
    where T0: Differentiable, T1: Differentiable, U: Differentiable {
    return someDifferentiableFunction(20, x + y)
}

#differential(f) // (T0, T1) -> ((T0, T1)) -> (U, U)
// Description:
//   (T0, T1)       ->  ((T0, T1))  ->  (U,          U)
//    ^~~~~~             ^~~~~~~~        ^           ^
//  original args        vector       result    Jacobian-vector products

#pullback(f) // (T0, T1) -> (U, (U) -> (T0, T1))
// Description:
//   (T0, T1)       ->  (U,     (U)      ->  (T0, T1))
//    ^~~~~~             ^       ^           ^~~~~~~~
//  original args     result   vector   vector-Jacobian products
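
As a rough illustration of what these two operators return, the following plain-Swift sketch writes both by hand for the product function f(x, y) = x * y. The names product, pullbackOfProduct and differentialOfProduct are hypothetical, and the sketch assumes the differential takes a vector in the argument space, matching how #differential is applied in Part 5.

```swift
// f(x, y) = x * y, a stand-in for a differentiable function.
func product(_ x: Double, _ y: Double) -> Double {
    return x * y
}

// Hand-written analogue of `#pullback(f)`: returns the result together with a
// closure mapping a vector v on the result to the vector-Jacobian product
// v·J = (v * y, v * x).
func pullbackOfProduct(_ x: Double, _ y: Double)
    -> (result: Double, pullback: (Double) -> (Double, Double)) {
    return (result: product(x, y), pullback: { v in (v * y, v * x) })
}

// Hand-written analogue of `#differential(f)`: returns a closure mapping a
// tangent (dx, dy) on the arguments to the result and the Jacobian-vector
// product J·(dx, dy) = y * dx + x * dy.
func differentialOfProduct(_ x: Double, _ y: Double)
    -> ((Double, Double)) -> (result: Double, jvp: Double) {
    return { tangent in
        (result: product(x, y), jvp: y * tangent.0 + x * tangent.1)
    }
}

let (value, pullback) = pullbackOfProduct(3, 4)
let vjp = pullback(1)                              // (4.0, 3.0)
let forward = differentialOfProduct(3, 4)((1, 0))  // (result: 12.0, jvp: 4.0)
```

Seeding the pullback with 1 recovers the gradient, and seeding the differential with a basis vector recovers one partial derivative.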

How It Works

The compiler type-checks a #gradient(f), as well as other differential operators, by searching for the closest match given the contextual type. f is expected to have a definition in order to be differentiable, and thus cannot be a closure whose body is opaque to the compiler. If it is, Swift reports an error.

Later in the compilation pipeline, the compiler recursively transforms the code of f to its gradient function ∇f (or other functions in other modes of differentiation), and replaces #gradient(f) with ∇f. Everything composes together naturally. Now, differentiation works.

AD in Action

Automatic Differentiation based on raw differential operators is already available and being incubated temporarily on the "tensorflow" branch of Swift. Swift for TensorFlow development toolchains and tutorials are available for trying out this feature.

Part 4: Generalized Differentiability

Automatic differentiation relies on the definition (body) of a function to be able to differentiate it. Differential operators like #gradient trigger the differentiation of a function, and the differentiability of the function is determined as differentiation proceeds. This works well for the examples so far, but has a number of problems.

Issues with Definition-Based Differentiability

Syntactic Weirdness

Raw differential operators adopt the pound-keyword syntax, which has been previously used for accessing compiler builtins, e.g. #file and #dsohandle, referring to IDE-specific objects, e.g. #colorLiteral and #imageLiteral, and interoperating with "stringly-typed" Objective-C key paths, e.g. #keyPath(...). The pound-keyword syntax does not have native parsing support for syntactic features like trailing closures, so it is hard to make the closure code short under differential operators like #gradient.

Example:

// Ideal
let dydx = gradient { x in
    sin(x) + cos(x)
}

// Reality
let dydx = #gradient({ x in
    sin(x) + cos(x)
})

A Higher-Order Function, But Not Quite

When we introduced AD in Swift earlier in this document, we defined the differential operator as a higher-order function. Type checking and type inference were just expected to work like any other functions.

However, since the compiler needs to reject functions that are not differentiable and differentiability is not part of the type system, even if we were to redefine #gradient as a higher-order function named gradient(of:), the compiler would still have to maintain dedicated knowledge about this function in order to reject invalid arguments.

Cross-Module Differentiability, Without Serialization

As of now, the differentiability of a function is determined solely through two tests:

  • Is the function a primitive-differentiable function (@differentiable)?
  • Can the function's body be differentiated in the differentiation mode associated with the differential operator applied?

This simple system works perfectly when differentiating concrete functions defined in a local module, but does not allow differentiation of opaque function values or methods required by protocols. While being free of serialization is not a strict requirement for numerical computing libraries, not supporting differentiation on protocol requirements fundamentally obstructs composable high-level APIs that rely on AD, such as machine learning model APIs.

Opaque Closures are Non-Differentiable

There is no way to define a higher-order function that differentiates its argument using #gradient. Here's an example:

func foo(_ f: (Float) -> Float) -> Float {
    return #gradient(f)(0)
}
test.swift:2:22: error: cannot differentiate an opaque closure
    return #gradient(f)(0)
           ~~~~~~~~~~^~
test.swift:1:12: note: value defined here
func foo(_ f: (Float) -> Float) -> Float {
           ^~~~~~~~~~~~~~~~~~~

Closure arguments and dynamic dispatch are non-differentiable through direct source code transformation. The compiler does not statically know where f is coming from, nor can it delegate the task of differentiation of argument f to each callsite of foo because it cannot be expressed in the type system.

Solution: Differentiability in Function Types

As we can see, the core of the problem with definition-based differentiability is the opacity of functions. The restriction that the differential operator must see the full definition of a function makes it impossible to write protocol-oriented differentiable code, and is the primary hindrance to modular, composable differentiation APIs.

It turns out that this is not a new problem: we can learn from how Swift deals with calling conventions. Functions with different calling conventions have different type signatures, e.g. @convention(thick) and @convention(thin), and functions are implicitly converted back and forth through conversion thunks.

// A "thin" function that captures no variables.
// Its representation is `@convention(thin)` by default.
func f(x: Int) -> Int {
    return x
}

var globalVar = 30

// A "thick" function that captures the value of `globalVar`.
// Its representation is `@convention(thick)` by default.
let g = { x in globalVar + x }

// A higher-order function.
// The closure argument `h`'s representation is `@convention(thick)`, because it should
// be able to take closures that capture variables.
func takeFunc(_ h: (Int) -> Int) { ... }

takeFunc(f) // Implicitly converted function `f` to a `@convention(thick)` closure by
            // creating a conversion thunk.
takeFunc(g) // `g` is thick already. No conversion needed.

Sometimes, different conventions have different binary representations for storing captured variables and such, just like the example with f and g above. In AD, the only difference between a non-differentiable function and a differentiated function (say, in reverse mode) is whether the function carries a few other function pointers that represent the function's adjoint code, so we can model differentiable functions using a "thicker" function type, which bundles the original function representation along with pointers to the original function's Jacobian-vector product functions and/or vector-Jacobian product functions. When a normal function with a visible body gets passed as an @autodiff function, the function will be differentiated.

// `f` is a normal function that has type `(Float) -> Float`.
func f(x: Float) -> Float {
   return sin(x)
}

// `f` gets implicitly converted (or more accurately, differentiated).
let g = f as @autodiff (Float) -> Float

func takesFunc(_ someFunc: @autodiff (Float) -> Float) {
    #derivatives(someFunc)
    ...
}

// At the callsite of `takesFunc(_:)`, `f` gets implicitly differentiated to become
// `@autodiff (Float) -> Float`.
takesFunc(f)
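
The "thicker" function representation can be imitated in today's Swift with an ordinary struct. The sketch below is only an analogy under stated assumptions: ReverseDifferentiableFunction and derivativeAtZero are hypothetical names, and the vector-Jacobian product that the compiler would generate at the conversion site is written by hand.

```swift
import Foundation

// Hypothetical stand-in for an `@autodiff(reverse) (Double) -> Double` value:
// the original function bundled with its vector-Jacobian product function.
struct ReverseDifferentiableFunction {
    let original: (Double) -> Double
    // Returns the result and a pullback from an output vector to an input vector.
    let vjp: (Double) -> (value: Double, pullback: (Double) -> Double)

    func callAsFunction(_ x: Double) -> Double {
        return original(x)
    }
}

// What differentiating `sin` at the conversion site would produce,
// written out by hand for this sketch.
let differentiableSin = ReverseDifferentiableFunction(
    original: { sin($0) },
    vjp: { x in (value: sin(x), pullback: { v in v * cos(x) }) }
)

// A higher-order function can differentiate its argument without seeing its
// body, because the derivative travels with the value.
func derivativeAtZero(of f: ReverseDifferentiableFunction) -> Double {
    return f.vjp(0).pullback(1)
}

let slopeAtZero = derivativeAtZero(of: differentiableSin)  // cos(0) = 1.0
```

The point of the design is visible in derivativeAtZero: it never inspects how f was defined, which is exactly what definition-based differentiability could not offer.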

If a normal function does not have a visible body, then it cannot be passed as an @autodiff function. Swift will show an error at compile-time.

var normalFuncWithOpaqueBody: (Float) -> Float = ...

takesFunc(normalFuncWithOpaqueBody)
test.swift:19:11: error: function is not differentiable, but the contextual type is 
'@autodiff (Float) -> Float'
  takesFunc(normalFuncWithOpaqueBody)
            ^~~~~~~~~~~~~~~~~~~~~~~~

test.swift:17:4: note: value defined here
  var normalFuncWithOpaqueBody: (Float) -> Float = ...
      ^~~~~~~~~~~~~~~~~~~~~~~~

At first glance, this could even be an addition to the existing @convention attribute, as something like @convention(autodiff). However, differentiability does not align semantically with @convention. First, when a function becomes its differentiable (or differentiated) form, its original calling convention is not changed. Second, functions with any convention are technically differentiable, including thin, thick, method, etc. Third, differentiability is not the only information that needs to be encoded: there is also the order of differentiation. Therefore, we need a separate dimension of "thickness" in the function type: differentiability.

We define a new formalization of differentiability in Swift's type system, including an @autodiff function type attribute, an extension to functions' layout, and new syntax for selecting differentiable arguments.

The @autodiff Function Type Attribute

The @autodiff attribute on a function type specifies the function's differentiability and differentiation order, just like @differentiable on function declarations. The biggest differences are:

  • @differentiable contains associated functions (tangent/adjoint) statically, but @autodiff functions carry those extra function pointers in their binary representation as a runtime property. Any user of this function will be able to differentiate it, with differentiability guaranteed formally by the type system. With this addition to the type system, serialization/inlinability is no longer necessary because functions can be passed around without losing differentiability.

  • Differentiation order is no longer once vs. infinite. Instead, @autodiff functions can specify a maximum order at which this function can be differentiated, unless the function is linear or constant. This is because function-representation-based differentiability requires functions to be differentiated ahead of becoming a value and being passed around.

The grammar for @autodiff is defined as follows:

differentiation-order = 'order' ':' integer-literal
differentiability = 'forward' | 'reverse' | 'linear' | 'constant' | 'bidirectional'
autodiff-attr-arguments = differentiability [ ',' differentiation-order ] | differentiation-order
autodiff-attr = '@autodiff' [ '(' autodiff-attr-arguments ')' ]

When a differentiability is specified on a function type, the differentiation behavior of functions of that type matches what is defined for the @differentiable declaration attribute. If no differentiability is specified, the function is both forward-mode and reverse-mode differentiable (same as bidirectional).

Creating @autodiff Functions

It becomes increasingly clear that first-order differentiation will not, and should not, require serialization; only higher-order differentiation should, because of code size. In order to make the system consistent, we make each @differentiable function declaration result in an @autodiff function.

Since we want to support differentiating opaque functions, we must support creating one. The fact is, the user does not even need to know about @autodiff or intentionally create differentiable functions if they are working with functions in the current module. Whenever a local function declaration gets used where the contextual type has an @autodiff attribute on it, Swift differentiates it. If differentiation fails, Swift reports an error at compile-time.

For public APIs, we relax the constraint on @differentiable so that it can be applied to any function declaration without specifying a tangent or adjoint even when the differentiability is forward/reverse. This is when Swift tries to differentiate functions and export the derivatives as part of those public APIs: If the function gets differentiated, its default type signature has @autodiff attribute on it; otherwise, Swift reports an error to the user showing what's non-differentiable.

Higher-Order Differentiation of Opaque Closures

In order for modular libraries to support opaque higher-order differentiation, the differentiation order must be specified in the closure type signature, so that the closure ABI is guaranteed to contain the higher-order derivative.

@autodiff(reverse, order: 2) (T) -> U

For example, function g takes a differentiable function that is differentiable up to at least the 3rd order, then differentiates it 3 times in the body.

// In a separate module:
func g(_ h: @autodiff(reverse, order: 3) (Float) -> Float) -> Float {
    return #gradient(h)(1) +
           #gradient(#gradient(h))(1) +
           #gradient(#gradient(#gradient(h)))(1)
}

We also extend the @differentiable attribute so that a primitive-differentiable function can be forced to be differentiated up to a specific order ahead of time. For example, when Swift compiles function f below, this function will have been differentiated 6 times, and gradient functions will be preserved in f's ABI so that its derivatives can be called from anywhere (any other Swift module, or even C). f's default type signature is @autodiff(reverse, order: 6) (Float) -> Float.

@differentiable(reverse, order: 6)
public func f(_ x: Float) -> Float {
    return pow(x, 6)
}

Differentiable functions with a maximum differentiation order can be implicitly "down-ordered", that is, differentiable functions with a higher maximum differentiation order can be implicitly converted to a function with a lower maximum differentiation order. For example, we can directly pass f as an argument to g.

g(f) // 156
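
The value 156 can be checked by hand. The sketch below, in plain Swift with the derivatives of x⁶ written out explicitly (the names d1, d2 and d3 are hypothetical), evaluates the three terms that g adds up at x = 1.

```swift
import Foundation

// The first three derivatives of f(x) = x^6, written by hand.
let d1: (Double) -> Double = { x in 6 * pow(x, 5) }    // f'(x)
let d2: (Double) -> Double = { x in 30 * pow(x, 4) }   // f''(x)
let d3: (Double) -> Double = { x in 120 * pow(x, 3) }  // f'''(x)

// g(f) = f'(1) + f''(1) + f'''(1) = 6 + 30 + 120
let total = d1(1) + d2(1) + d3(1)
```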

Conversion Between Differentiabilities

Because of their mathematical properties, differentiabilities can be converted to one another statically without runtime overhead. For example, a constant function is also a linear function when it's unary; a linear function is a bidirectional-differentiable function whose tangent and adjoint are both themselves; any differentiability can be completely dropped from a function type, forming a "normal" function. This allows us to define generic algorithms using differentiation, without specializing them on function types of each differentiability.

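The following table shows whether each differentiability (as a row label) can be converted to another (as a column label).

| Convertible to: | None | Linear | Constant | Forward | Reverse | Bidirectional |
|-----------------|------|--------|----------|---------|---------|---------------|
| None            | ✔    |        |          |         |         |               |
| Linear          | ✔    | ✔      |          | ✔       | ✔       | ✔             |
| Constant        | ✔    | ✔ (unary) | ✔     | ✔       | ✔       | ✔             |
| Forward         | ✔    |        |          | ✔       |         |               |
| Reverse         | ✔    |        |          |         | ✔       |               |
| Bidirectional   | ✔    |        |          | ✔       | ✔       | ✔             |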

What does differentiability conversion look like in real code? Just like @convention conversion, differentiability conversion is implicit and has little mental overhead to the user.

let linear: @autodiff(linear) (Float) -> Float = ...
let bidir: @autodiff (Float) -> Float = ...
let const: @autodiff(constant) (Float) -> Float = ...

func foo(_: @autodiff(reverse) (Float) -> Float) { ... }

foo(linear) // Okay! Implicitly converted to `@autodiff(reverse)`.
foo(bidir) // Okay! Implicitly converted to `@autodiff(reverse)`.
foo(const) // Okay! Implicitly converted to `@autodiff(reverse)`.
...
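
One way to picture why such a conversion needs no runtime work is the scalar case, sketched below in plain Swift. LinearFunction, ReverseFunction and asReverse are hypothetical names for this illustration, and the sketch only covers scalar linear maps, where the adjoint coincides with the function itself.

```swift
// Hypothetical stand-in for an `@autodiff(reverse) (Double) -> Double` value.
struct ReverseFunction {
    let call: (Double) -> Double
    // Given a point x, returns the pullback (adjoint) at x.
    let pullback: (Double) -> (Double) -> Double
}

// Hypothetical stand-in for an `@autodiff(linear) (Double) -> Double` value.
struct LinearFunction {
    let call: (Double) -> Double

    // A scalar linear map is its own adjoint at every point, so the conversion
    // only reuses `call` and performs no differentiation.
    var asReverse: ReverseFunction {
        let call = self.call
        return ReverseFunction(call: call, pullback: { _ in call })
    }
}

func slopeAtZero(_ f: ReverseFunction) -> Double {
    return f.pullback(0)(1)
}

let triple = LinearFunction(call: { 3 * $0 })
let slope = slopeAtZero(triple.asReverse)  // 3.0
```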

Part 5: True Differential Operators

Generalized Differentiability enabled us to define custom differential operators in a functional way. Now it's time to define the true differential operators.

Derivatives and Gradient

We start with functions that take a function and produce a function that computes derivatives or gradients. Recall that we already have the built-in syntax #gradient and #derivatives for computing gradients and derivatives, but we now explore more expressive APIs enabled by Generalized Differentiability, which lets us differentiate arguments that are themselves functions.

Forward Differential Operators

We define two forward-mode differential operators for computing basic derivatives:

  • derivatives(of:) computes a derivatives function that takes a value and returns derivatives evaluated at the given value.
  • derivatives(at:in:) computes derivatives of a closure at a given value.
/// Computes derivatives of `body`.
func derivatives<T: FloatingPoint, R: Differentiable>(
    of body: @autodiff(forward) (T) throws -> R
) rethrows -> (T) -> R {
    return { x in #differential(body)(x)(1).1 } // seed = dx/dx = 1
}

/// Computes derivatives of `body` at scalar `x`.
func derivatives<T: FloatingPoint, R: Differentiable>(
    at x: T, in body: @autodiff(forward) (T) throws -> R
) rethrows -> R {
    return derivatives(of: body)(x)
}

Reverse Differential Operators

We also define two reverse-mode differential operators for computing basic gradients:

  • gradient(of:) computes a gradient function that takes a value and returns the gradient evaluated at the given value.
  • gradient(at:in:) computes the gradient of a closure evaluated at a given value.
/// Computes the gradient of `body`.
func gradient<T: Differentiable, R: FloatingPoint>(
    of body: @autodiff(reverse) (T) throws -> R
) rethrows -> (T) -> T {
    return { x in #pullback(body)(x).1(1) } // seed = dy/dy = 1
}

/// Computes the gradient of `body` at `x`.
func gradient<T: Differentiable, R: FloatingPoint>(
    at x: T, in body: @autodiff(reverse) (T) throws -> R
) rethrows -> T {
    return gradient(of: body)(x)
}
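
The shape of these library definitions can be tried out in today's Swift by passing the pullback explicitly. In the sketch below, ReverseDifferentiable is a hypothetical typealias standing in for an @autodiff(reverse) function value, and the pullback of the example function is written by hand rather than generated by the compiler.

```swift
// Hypothetical stand-in for an `@autodiff(reverse) (T) -> Double` value:
// calling it yields the result and a pullback, like `#pullback(body)(x)`.
typealias ReverseDifferentiable<T> = (T) -> (value: Double, pullback: (Double) -> T)

/// Computes the gradient function of `body`.
func gradient<T>(of body: @escaping ReverseDifferentiable<T>) -> (T) -> T {
    return { x in body(x).pullback(1) }  // seed = dy/dy = 1
}

/// Computes the gradient of `body` at `x`.
func gradient<T>(at x: T, in body: @escaping ReverseDifferentiable<T>) -> T {
    return gradient(of: body)(x)
}

// f(x) = x², with its pullback v ↦ 2x·v written by hand.
let square: ReverseDifferentiable<Double> = { x in
    (value: x * x, pullback: { v in 2 * x * v })
}

let slopeAtThree = gradient(at: 3, in: square)  // 6.0
```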

As we can see, since we can now differentiate a higher-order function's argument (thanks to Generalized Differentiability), we can define derivatives(of:) and gradient(of:) as Swift functions in terms of the more general raw differential operators, #differential and #pullback, and use them to replace #derivatives and #gradient!

These differential operators work seamlessly with closure captures, error-throwing functions, and arbitrary side-effecting code that does not contribute to the closure result. This looks quite like value-based automatic differentiation, while the math is actually fully functional. It achieves a level of expressivity similar to imperative-style automatic differentiation libraries: instead of writing gradient(...) at the bottom of a forward pass, one writes it at the top and has a trailing closure close over the forward pass.

Example: Train a simple 2-layer perceptron. The snippet computes the gradient w.r.t. each parameter at each training step, prints a loss, and optimizes parameters.

struct Parameters: Differentiable, ParameterGroup {
    var w1 = Tensor<Float>(randomNormal: [784, 30])
    var b1 = Tensor<Float>(zeros: [30])
    var w2 = Tensor<Float>(randomNormal: [30, 10])
    var b2 = Tensor<Float>(zeros: [10])
}

var params = Parameters()
let minibatches = Dataset(...)
var optimizer = StochasticGradientDescent(learningRate: 0.1)
for (x, y) in minibatches {
    let grads = gradient(at: params) { params in
        let h1 = tanh(matmul(x, params.w1) + params.b1)
        let ŷ = sigmoid(matmul(h1, params.w2) + params.b2)
        let loss = (y - ŷ).squared().mean()
        print("Loss is \(loss)")
        return loss
    }
    optimizer.fit(&params, gradients: grads)
}

Preserving Original Result

Since the trailing closure is an argument to gradient(at:in:), the forward computation is just as customizable as within operator-overloading AD systems. Users can do whatever they want to intermediate values or the result in the primal computation.

That said, we would like to provide a way to have the differentiation API return the original result directly. Because of Generalized Differentiability, these APIs can be defined entirely as library functions using primitive differential operators.

/// Computes `body(x)` and derivatives of each scalar output of `body` at `x`.
func valueWithDerivatives<T: FloatingPoint, R: Differentiable>(
    at x: T, in body: @autodiff(forward) (T) throws -> R
) rethrows -> (value: R, derivatives: R) {
    return #differential(body)(x)(1)
}

/// Computes `body(x)` and the gradient of `body` at `x`.
func valueWithGradient<T: Differentiable, R: FloatingPoint>(
    at x: T, in body: @autodiff(reverse) (T) throws -> R
) rethrows -> (value: R, gradient: T) {
    let (y, pullback) = #pullback(body)(x)
    return (y, pullback(1))
}

Jacobian-Vector Products and Vector-Jacobian Products

Jacobian-vector products (forward-mode) and vector-Jacobian products (reverse-mode) are extremely useful differential operators for lots of tasks in numerical computing.

/// Computes Jacobian-vector products of `body` at `x`.
func jacobianVectorProducts<T: Differentiable, R: Differentiable>(
    at x: T, vector: T,
    in body: @autodiff(forward) (T) throws -> R
) rethrows -> R {
    return #differential(body)(x)(vector).1
}

/// Computes the vector-Jacobian products of `body` at `x`.
func vectorJacobianProducts<T: Differentiable, R: Differentiable>(
    at x: T, vector: R,
    in body: @autodiff(reverse) (T) throws -> R
) rethrows -> T {
    return #pullback(body)(x).1(vector)
}
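
For intuition about what the two products are, consider a small function from ℝ² to ℝ² whose Jacobian can be written down directly. The plain-Swift sketch below uses hypothetical names (Vec2, h, jvpOfH, vjpOfH) and hand-derived formulas; it is not generated by the proposed operators.

```swift
typealias Vec2 = (Double, Double)

// h(x, y) = (x * y, x + y), whose Jacobian at (x, y) is
//   J = | y  x |
//       | 1  1 |
func h(_ p: Vec2) -> Vec2 {
    return (p.0 * p.1, p.0 + p.1)
}

// Jacobian-vector product J·v (forward mode): pushes an input tangent forward.
func jvpOfH(at p: Vec2, vector v: Vec2) -> Vec2 {
    return (p.1 * v.0 + p.0 * v.1, v.0 + v.1)
}

// Vector-Jacobian product uᵀ·J (reverse mode): pulls an output vector back.
func vjpOfH(at p: Vec2, vector u: Vec2) -> Vec2 {
    return (u.0 * p.1 + u.1, u.0 * p.0 + u.1)
}

let p: Vec2 = (2, 3)
let column = jvpOfH(at: p, vector: (1, 0))  // first column of J: (3.0, 1.0)
let row = vjpOfH(at: p, vector: (1, 0))     // first row of J: (3.0, 2.0)
```

A forward-mode seed picks out a column of the Jacobian and a reverse-mode seed picks out a row, which is why gradients of scalar-valued functions are cheapest in reverse mode.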

Differentials and Pullbacks

In some cases, computational tasks rely on fully extensible differential operators as well as maximum efficiency, e.g. computing vector-Jacobian products together with the original function's result. Luckily, the two operators we mentioned at the very beginning, when we introduced Jacobians, are the ones we need: differential and pullback. Their raw operators, #differential and #pullback, are already supported in the syntax, but we can make them nicer by redefining them as Swift functions.

Function differential(at:in:) computes the differential of a closure at a certain point, and returns a linear map that takes a vector and returns Jacobian-vector products.

/// Computes the differential of `body` at `x`.
func differential<T: Differentiable, R: Differentiable>(
    at x: T, in body: @autodiff(forward) (T) throws -> R
) rethrows -> @autodiff(linear) (T) -> R {
    return { v in #differential(body)(x)(v).1 }
}

Function differentialWithResult(at:in:) computes the differential of a closure at a certain point, and returns a linear map that takes a vector and returns both the original function's result and Jacobian-vector products.

/// Computes the differential of `body` at `x` that also computes the value of
/// `body(x)`.
func differentialWithResult<T: Differentiable, R: Differentiable>(
    at x: T, in body: @autodiff(forward) (T) throws -> R
) rethrows -> @autodiff(linear) (T) -> (originalResult: R, derivatives: R) {
    return #differential(body)(x)
}

Function pullback(at:in:) computes the pullback of a closure at a certain point, and returns a linear map that takes a vector and returns vector-Jacobian products.

/// Computes the pullback of `body` at `x`.
func pullback<T: Differentiable, R: Differentiable>(
    at x: T, in body: @autodiff(reverse) (T) throws -> R
) rethrows -> @autodiff(linear) (R) -> T {
    return #pullback(body)(x).1
}

Function resultWithPullback(at:in:) computes the pullback of a closure at a certain point, and returns the original function's result and a linear map that takes a vector and returns vector-Jacobian products.

/// Computes the original value of `body(x)` and the pullback at `x`.
func resultWithPullback<T: Differentiable, R: Differentiable>(
    at x: T, in body: @autodiff(reverse) (T) throws -> R
) rethrows -> (originalResult: R, pullback: @autodiff(linear) (R) -> T) {
    return #pullback(body)(x)
}

It is amazing that we are able to define every differential operator in terms of other differential operators. #differential and #pullback have become unnecessary because the functional form is so much nicer, so we can teach the compiler to recognize the Swift functions differential(at:in:) and pullback(at:in:) as the builtin "canonical" differential operators, and remove all raw differential operators that start with a # from the language.

Examples:

  1. Chain directional derivatives freely using differentials.

    let x = 0.5
    let df = differential(at: x) { x in
        sin(cos(x))
    }
    df(1) // df/dx
    df(derivatives(of: log)(t)) // df/dt
    df(derivatives(at: t, in: log)) // df/dt
  2. Chain gradients freely using pullbacks.

    let x = 0.5
    let (y, df) = resultWithPullback(at: x) { x in
        cos(sin(x))
    }
    
    df(1) // dy/dx
    df(gradient(of: log)(t)) // dy/dt
    df(gradient(at: t, in: log)) // dy/dt

Hessian-Vector Products

Second-order optimization methods in machine learning make use of Hessians and Hessian-vector products, which can be hard to compute. Many AD libraries such as Autograd already support Hessians by supporting arbitrarily nested forward-mode/reverse-mode differentiation. Hessian-vector products can be efficiently computed by applying "forward-on-reverse", namely applying the composition of the forward-mode differential operator and the reverse-mode differential operator on a function.

Just like other differential operators, we can define the Hessian-vector products operator in a simple, functional way.

func hvp<T: Differentiable, R: FloatingPoint>(
    at x: T, in f: @autodiff(order: 2) (T) -> R
) -> @autodiff(linear) (T) -> T {
    return differential(at: x, in: gradient(of: f))
}
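
The forward-on-reverse idea can be tried in plain Swift with dual numbers. The sketch below is an illustration under stated assumptions, not the compiler's implementation: Dual, gradientOfF and hessianVectorProduct are hypothetical names, and the gradient of f(x, y) = x²y + y³ that reverse mode would produce is written by hand, then pushed through forward mode.

```swift
// A dual number carries a value and a tangent (directional derivative).
struct Dual {
    var value: Double
    var tangent: Double

    static func + (a: Dual, b: Dual) -> Dual {
        return Dual(value: a.value + b.value, tangent: a.tangent + b.tangent)
    }

    // Product rule: (ab)' = a'b + ab'.
    static func * (a: Dual, b: Dual) -> Dual {
        return Dual(value: a.value * b.value,
                    tangent: a.tangent * b.value + a.value * b.tangent)
    }
}

// Gradient of f(x, y) = x²y + y³, i.e. (2xy, x² + 3y²), written over `Dual`
// so that forward mode can differentiate through it.
func gradientOfF(_ x: Dual, _ y: Dual) -> (Dual, Dual) {
    let two = Dual(value: 2, tangent: 0)
    let three = Dual(value: 3, tangent: 0)
    return (two * x * y, x * x + three * y * y)
}

// Hessian-vector product: seed the inputs with the vector `v` and read the
// tangents of the gradient.
func hessianVectorProduct(x: Double, y: Double,
                          vector v: (Double, Double)) -> (Double, Double) {
    let (gx, gy) = gradientOfF(Dual(value: x, tangent: v.0),
                               Dual(value: y, tangent: v.1))
    return (gx.tangent, gy.tangent)
}

// At (1, 2) the Hessian is [[4, 2], [2, 12]].
let firstColumn = hessianVectorProduct(x: 1, y: 2, vector: (1, 0))   // (4.0, 2.0)
let secondColumn = hessianVectorProduct(x: 1, y: 2, vector: (0, 1))  // (2.0, 12.0)
```

Only one forward pass over the gradient is needed per vector, so the full Hessian is never materialized.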

Nested differentiation without a careful implementation is prone to a bug known as perturbation confusion [1] [2]. Language-integrated AD in Swift will enforce tagging in compiler-generated code to guarantee the correctness of higher-order derivatives.

Standard Library or an AutomaticDifferentiation Module?

Earlier in this document, we discussed enhancements to standard library protocols and extensions to the standard library to model differentiable types. These protocols are general enough for standard library types such as floating point scalars (Float, Double, and Float80) and potentially SIMD vectors. However, in any general-purpose programming language, there is always a question of how much math the standard library should have.

We think basic differential operators like gradient(of:) and derivatives(of:) should be included in the standard library, because they are common operators that one would find in college calculus, and they will make AD feel more language-integrated along with standard library protocols VectorNumeric and Differentiable.

We do believe that other operators that contain terms like "Jacobian" and "differential" should be in a separate module, possibly called "AutomaticDifferentiation" that ships with the Swift language.

Part 6: Generalized Types for Differentiation

We introduced the Differentiable protocol that makes a type represent a vector space and be differentiable. However, there are a few scenarios where such a protocol won't work well.

  1. Customizable weight type

    Orthogonal weight matrices have shown advantages in neural network training [1] [2]. When differentiating through these networks, gradients with respect to weights will no longer stay orthogonal - instead, they are skew-symmetric matrices. While we can represent both orthogonal matrices and skew-symmetric matrices as values of a Matrix or Tensor type and programmatically ensure their orthogonality, some researchers have been seeking a way to represent this natively in the type system of a programming language and still have AD produce the correct derivative.

  2. Quantized training

    Quantization techniques store and calculate numbers in more compact formats, e.g. a fixed-point data type. Conceptually, a quantized tensor for a real-valued Tensor can be defined as the following struct:

    public struct Quantized<Dequantized: Quantizable, QuantizedScalar: FixedWidthInteger> {
        var data: Quantizable
        var range: Range<Dequantized.Scalar>
        var scale: QuantizedScalar
        var zeroPoint: Int
    }

    We can think of a scenario where the developer defines a neural network as a function whose parameters are of type Quantized<Tensor<Float>>. When training the parameters of this neural network, gradients need to flow at a significantly higher precision, but today's system cannot achieve that because it assumes gradients to have the same type as the original arguments.

  3. Generic optimizers

    Optimization problems in machine learning can be generalized as optimization on manifolds. Optimizers in most libraries assume the original space and the loss space both to be vector spaces, and perform an implicit conversion from cotangent vectors to tangent vectors and another conversion from tangent vectors to the original weight type when performing θ -= η * ∂L/∂θ. While this works for most cases, it won't generalize over typed orthogonal matrices, because orthogonal matrices do not form a vector space, and a conversion from an orthogonal matrix to a skew-symmetric matrix cannot be implicit.

Revise Differentiable Protocol

To address the concerns raised above, we have found a more general way to model differentiable types. Instead of requiring them to be vector spaces (VectorNumeric), we model them as differentiable manifolds. Reverse-mode differentiation of a function over a manifold produces gradient vectors in the manifold's cotangent bundle; forward-mode differentiation produces derivatives in its tangent bundle. Note that we cannot represent tangent/cotangent bundles separately from the tangent/cotangent spaces inside each bundle, because Swift does not have dependent types. By removing the restriction to VectorNumeric, Differentiable is now fully extensible.

/// A type that mathematically represents a differentiable manifold whose
/// tangent spaces are finite-dimensional.
///
/// In automatic differentiation, differentiation will produce a Jacobian whose
/// elements are of `TangentVector` type.
public protocol Differentiable {
    /// The tangent vector space of this differentiable manifold.
    associatedtype TangentVector: VectorNumeric
        where TangentVector.Scalar: FloatingPoint

    /// The cotangent space of this differentiable manifold.
    associatedtype CotangentVector: VectorNumeric
        where CotangentVector.Scalar: FloatingPoint

    /// Returns `self` moved along the value space towards the given tangent
    /// vector. In Riemannian geometry (mathematics), this is usually equivalent
    /// to retraction or exponential map.
    func moved(toward direction: TangentVector) -> Self

    /// Convert a cotangent vector to its corresponding tangent vector.
    func tangentVector(from cotangent: CotangentVector) -> TangentVector
}

When the tangent vector of a differentiable manifold is equal to its cotangent vector, we can simply provide a default implementation of tangentVector(from:), which is just the identity function.

public extension Differentiable where TangentVector == CotangentVector { 
    func tangentVector(from cotangent: CotangentVector) -> TangentVector { 
        return cotangent 
    } 
} 

When a differentiable manifold is a vector space, its tangent space is usually itself. In these cases, we simply define moved(toward:) as vector addition.

public extension Differentiable 
    where Self: VectorNumeric, TangentVector == Self { 
    func moved(toward direction: TangentVector) -> Self { 
        return self + direction 
    } 
} 
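For a manifold that is not a vector space, moved(toward:) has to be written by hand. The following is a minimal sketch in plain Swift that needs no compiler support for AD. `SketchDifferentiable` is a simplified stand-in for the protocol above (it drops the VectorNumeric constraints), and `Rotation2D` is a hypothetical type introduced only for illustration.

```swift
import Foundation

// Simplified stand-in for the `Differentiable` protocol above. The
// `VectorNumeric` constraints are omitted so the sketch is self-contained.
protocol SketchDifferentiable {
    associatedtype TangentVector
    associatedtype CotangentVector
    func moved(toward direction: TangentVector) -> Self
    func tangentVector(from cotangent: CotangentVector) -> TangentVector
}

// Hypothetical type: a rotation of the plane, i.e. a point on the manifold
// SO(2). Rotation matrices cannot be added or scaled like vectors, but the
// tangent space is one-dimensional, so a `Double` serves as tangent vector.
struct Rotation2D: SketchDifferentiable {
    var angle: Double

    typealias TangentVector = Double
    typealias CotangentVector = Double

    // Exponential map on SO(2): rotate further by `direction` radians.
    // The result is always a valid rotation.
    func moved(toward direction: Double) -> Rotation2D {
        return Rotation2D(angle: angle + direction)
    }

    // Tangent and cotangent vectors are identified here, so this is the
    // identity function.
    func tangentVector(from cotangent: Double) -> Double {
        return cotangent
    }

    // The rotation as a 2x2 matrix, stored as an array of rows.
    var matrix: [[Double]] {
        let c = cos(angle), s = sin(angle)
        return [[c, -s], [s, c]]
    }
}

let start = Rotation2D(angle: 0)
let rotated = start.moved(toward: 0.5)

// The moved value is still a rotation: its determinant stays 1.
let m = rotated.matrix
let determinant = m[0][0] * m[1][1] - m[0][1] * m[1][0]
```

Subtracting a scaled gradient matrix from the rotation matrix would generally leave the set of rotations, whereas moving along the manifold never does.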

Deriving Conformances to VectorNumeric and Differentiable

Numerical computing commonly deals with many parameters, each of which is a vector or a matrix. In these cases, instead of manually specifying each input in a differential operator's argument list, users would often like to differentiate through structures and obtain a structure of partial derivatives. It is therefore important for Swift to provide derived conformances for the core protocols for numerical computing: Differentiable and VectorNumeric.

Mathematically, it is straightforward to represent product types. A struct or tuple in Swift corresponds to a product of sets; an enum in Swift corresponds to a sum (disjoint union) of sets.

struct Parameters: VectorNumeric, Differentiable {
    var a: Vector<Float>
    var b: Float
}

Struct Parameters is equivalent to a product of sets Vector<Float> and Float, or a product of a real vector space ℝⁿ and a scalar field ℝ, namely ℝⁿ × ℝ, which is also a vector space. To give Parameters the traits of a vector space, we extend the compiler to derive a conformance to VectorNumeric, similar to how Codable and Hashable conformances are derived. When a conformance clause is given in the current file and all stored properties conform to VectorNumeric with the same Scalar, the compiler synthesizes AST to make this type conform, with all protocol requirements applied property-wise.

After deriving conformances to VectorNumeric:

struct Parameters: VectorNumeric {
    var a: Vector<Float>
    var b: Float

    // derived:
    typealias Scalar = Float

    // derived:
    struct Shape {
        var a: Vector<Float>.Shape
        var b: Float.Shape
    }

    // derived:
    static func + (lhs: Parameters, rhs: Parameters) -> Parameters {
        return Parameters(a: lhs.a + rhs.a, b: lhs.b + rhs.b)
    }
    // ...
}
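To see what "property-wise" means in practice, the sketch below writes out by hand the kind of members a derived conformance would produce. It is plain Swift under stated assumptions: `SketchVector` is a reduced stand-in for VectorNumeric (only zero, addition, and scaling), and `Vector2` is a hypothetical fixed-size replacement for Vector<Float>.

```swift
// Reduced stand-in for `VectorNumeric`: a zero, addition, and scaling.
protocol SketchVector {
    associatedtype Scalar
    static var zero: Self { get }
    static func + (lhs: Self, rhs: Self) -> Self
    static func * (lhs: Scalar, rhs: Self) -> Self
}

// `Float` already has `zero`, `+`, and `*` in the standard library.
extension Float: SketchVector {
    typealias Scalar = Float
}

// Hypothetical fixed-size vector standing in for `Vector<Float>`.
struct Vector2: SketchVector, Equatable {
    typealias Scalar = Float
    var x: Float
    var y: Float

    static var zero: Vector2 { return Vector2(x: 0, y: 0) }
    static func + (lhs: Vector2, rhs: Vector2) -> Vector2 {
        return Vector2(x: lhs.x + rhs.x, y: lhs.y + rhs.y)
    }
    static func * (lhs: Float, rhs: Vector2) -> Vector2 {
        return Vector2(x: lhs * rhs.x, y: lhs * rhs.y)
    }
}

// Every requirement is forwarded to each stored property in turn.
struct Parameters: SketchVector, Equatable {
    typealias Scalar = Float
    var a: Vector2
    var b: Float

    static var zero: Parameters {
        return Parameters(a: Vector2.zero, b: 0)
    }
    static func + (lhs: Parameters, rhs: Parameters) -> Parameters {
        return Parameters(a: lhs.a + rhs.a, b: lhs.b + rhs.b)
    }
    static func * (lhs: Float, rhs: Parameters) -> Parameters {
        return Parameters(a: lhs * rhs.a, b: lhs * rhs.b)
    }
}

let p = Parameters(a: Vector2(x: 1, y: 2), b: 3)
let q = Parameters(a: Vector2(x: 10, y: 20), b: 30)
let sum = p + q
let scaled = Float(2) * p
```

Because the members only forward to the stored properties, the same pattern nests: a struct of such structs is again a vector space.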

In order for Parameters to be differentiable, it must also conform to Differentiable. Deriving conformances to Differentiable can follow the same rules.

struct MyShapes: Differentiable {
    var a: Circle // conforms to Differentiable
    var b: Square // conforms to Differentiable
}

After deriving conformances to Differentiable:

struct MyShapes: Differentiable {
    var a: Circle
    var b: Square

    // derived:
    struct TangentVector: VectorNumeric {
        var a: Circle.TangentVector
        var b: Square.TangentVector
    }
    // derived:
    struct CotangentVector: VectorNumeric {
        var a: Circle.CotangentVector
        var b: Square.CotangentVector
    }

    // derived:
    func moved(toward direction: TangentVector) -> MyShapes {
        return MyShapes(a: a.moved(toward: direction.a),
                        b: b.moved(toward: direction.b))
    }

    // derived:
    func tangentVector(from cotangent: CotangentVector) -> TangentVector {
        return TangentVector(a: a.tangentVector(from: cotangent.a),
                             b: b.tangentVector(from: cotangent.b))
    }
}

With derived conformances to these protocols, the user can now write arbitrarily nested structs of differentiable manifolds, and make them differentiable with trivial effort, greatly simplifying the development.

Generalized Differential Operators

In the new Differentiable protocol, we added the TangentVector and CotangentVector associated types to represent the types of Jacobian-vector products and vector-Jacobian products, respectively. We make the following changes to the existing differential operators we introduced.

  • Differential operators that return T as a forward-differentiated derivative will return T.TangentVector instead.
  • Differential operators that return T as a reverse-differentiated derivative will return T.CotangentVector instead.
  • Vectors T for computing Jacobian-vector products will become T.TangentVector.
  • Vectors T for computing vector-Jacobian products will become T.CotangentVector.

Here we list a few updated differential operators.

Jacobian-Vector Products and Vector-Jacobian Products

Jacobian-vector products (forward-mode) and vector-Jacobian products (reverse-mode) are extremely useful differential operators for lots of tasks in numerical computing.

/// Computes Jacobian-vector products of `body` at `x`.
func jacobianVectorProducts<T: Differentiable, R: Differentiable>(
    at x: T, vector: T.TangentVector,
    in body: @autodiff(forward) (T) throws -> R
) rethrows -> R.TangentVector {
    return #differential(body)(x)(vector)
}

/// Computes the vector-Jacobian products of `body` at `x`.
func vectorJacobianProducts<T: Differentiable, R: Differentiable>(
    at x: T, vector: R.CotangentVector,
    in body: @autodiff(reverse) (T) throws -> R
) rethrows -> T.CotangentVector {
    return #pullback(body)(x)(vector)
}
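As a concrete reference for what these two operators compute, here is a small hand-derived example in plain Swift. It does not use the proposed `#differential` or `#pullback` syntax; the functions `f`, `jvp`, and `vjp` are hypothetical and written out manually from the Jacobian.

```swift
// f(x, y) = (x * y, x + y). Its Jacobian at (x, y) is
//   [ y  x ]
//   [ 1  1 ]
func f(_ x: Double, _ y: Double) -> (Double, Double) {
    return (x * y, x + y)
}

// Hand-written Jacobian-vector product J * v (what forward mode computes).
// `v` lives in the tangent space of the input.
func jvp(at x: Double, _ y: Double,
         vector v: (Double, Double)) -> (Double, Double) {
    return (y * v.0 + x * v.1, v.0 + v.1)
}

// Hand-written vector-Jacobian product wᵀ * J (what reverse mode computes).
// `w` lives in the cotangent space of the output.
func vjp(at x: Double, _ y: Double,
         vector w: (Double, Double)) -> (Double, Double) {
    return (w.0 * y + w.1, w.0 * x + w.1)
}

// Seeding with a basis vector extracts one column (JVP) or one row (VJP)
// of the Jacobian at the point (3, 4).
let pushed = jvp(at: 3, 4, vector: (1, 0))   // first column: (4, 1)
let pulled = vjp(at: 3, 4, vector: (1, 0))   // first row: (4, 3)
```

One forward pass yields a column of the Jacobian and one reverse pass yields a row, which is why reverse mode suits functions with many inputs and a scalar output.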

Differentials and Pullbacks

/// Computes the differential of `body` at `x`.
func differential<T: Differentiable, R: Differentiable>(
    at x: T, in body: @autodiff(forward) (T) throws -> R
) rethrows -> @autodiff(linear) (T.TangentVector) -> R.TangentVector {
    return #differential(body)(x).1
}

/// Computes the differential of `body` at `x` that also computes the value of
/// `body(x)`.
func differentialWithResult<T: Differentiable, R: Differentiable>(
    at x: T, in body: @autodiff(forward) (T) throws -> R
) rethrows -> @autodiff(linear) (T.TangentVector) -> (originalResult: R, derivatives: R.TangentVector) {
    return #differential(body)(x)
}

/// Computes the pullback of `body` at `x`.
func pullback<T: Differentiable, R: Differentiable>(
    at x: T, in body: @autodiff(reverse) (T) throws -> R
) rethrows -> @autodiff(linear) (R.CotangentVector) -> T.CotangentVector {
    return #pullback(body)(x).1
}

/// Computes the value of `body(x)` and the pullback at `x`.
func resultWithPullback<T: Differentiable, R: Differentiable>(
    at x: T, in body: @autodiff(reverse) (T) throws -> R
) rethrows -> (originalResult: R, pullback: @autodiff(linear) (R.CotangentVector) -> T.CotangentVector) {
    return #pullback(body)(x)
}
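The "result with pullback" shape can be illustrated without compiler support by writing the pullbacks by hand and composing them in reverse order. The sketch below is plain Swift; `Pullback` and the three functions are hypothetical names, and scalars stand in for the general TangentVector and CotangentVector types.

```swift
import Foundation

// A pullback maps a cotangent of the output to a cotangent of the input.
typealias Pullback = (Double) -> Double

// y = x², with pullback v ↦ v * 2x.
func squareWithPullback(_ x: Double) -> (value: Double, pullback: Pullback) {
    return (x * x, { v in v * 2 * x })
}

// y = sin(x), with pullback v ↦ v * cos(x).
func sinWithPullback(_ x: Double) -> (value: Double, pullback: Pullback) {
    return (sin(x), { v in v * cos(x) })
}

// z = sin(x²). The forward pass runs left to right; the composed pullback
// applies the individual pullbacks right to left (the reverse-mode chain rule).
func sinOfSquareWithPullback(_ x: Double) -> (value: Double, pullback: Pullback) {
    let (y, pullbackOfSquare) = squareWithPullback(x)
    let (z, pullbackOfSin) = sinWithPullback(y)
    return (z, { v in pullbackOfSquare(pullbackOfSin(v)) })
}

let result = sinOfSquareWithPullback(1.5)
// Seeding the pullback with 1 gives d/dx sin(x²) = 2x * cos(x²) at x = 1.5.
let gradient = result.pullback(1)
```

Returning the value together with the pullback lets the caller reuse the forward computation instead of evaluating the function twice.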

Back to the Problems

Recall that the motivation of introducing a general, future-proof Differentiable protocol is to be able to model the following use cases.

  1. Neural networks with orthogonal weights can now be differentiable. We can define a type called OrthogonalMatrix that conforms to Differentiable, and another type SkewSymmetricMatrix that conforms to both Differentiable and VectorNumeric.

    struct SkewSymmetricMatrix: Differentiable, VectorNumeric {
        typealias Scalar = Float
        ...
    }
    struct OrthogonalMatrix: Differentiable {
        ...
        typealias TangentVector = SkewSymmetricMatrix
        typealias CotangentVector = SkewSymmetricMatrix
    }

    When we differentiate a function (OrthogonalMatrix) -> Float using the reverse-mode differential operator, we'll get a function (OrthogonalMatrix) -> SkewSymmetricMatrix. Everything falls out, without type safety compromises.

  2. Differentiating a quantized network is now possible with AD.

    // `Quantized` is a vector space when the dequantized type is one.
    extension Quantized: VectorNumeric where Dequantized: VectorNumeric {
        typealias Scalar = Dequantized.Scalar
        static func + (lhs: Quantized, rhs: Quantized) -> Quantized {
            // Custom code: Dequantize, add, and requantize!
        }
        static func * (lhs: Scalar, rhs: Quantized) -> Quantized {
            // Custom code: Dequantize, scale, and requantize!
        }
    }
    
    // `Quantized` is a differentiable manifold when the dequantized type is one.
    extension Quantized: Differentiable where Dequantized: Differentiable {
        typealias TangentVector = Dequantized.TangentVector
        typealias CotangentVector = Dequantized.CotangentVector
    
        func moved(toward tangent: Dequantized.TangentVector) -> Quantized {
            // Custom code: Dequantize, optimize, and requantize!
        }
    }

    With Quantized conforming to the new Differentiable protocol, when we differentiate a function of type (Quantized<Tensor<Float>, Int8>) -> U, AD produces a function of type (Quantized<Tensor<Float>, Int8>) -> Tensor<Float>, which is very close to what we need for quantized training of neural networks.

  3. Generic optimizers can be defined in terms of manifold optimization functions, without implicit casting.

    extension SGD {
        func fit(_ parameters: inout Parameters, gradients: Parameters) {
            parameters.update(withGradients: gradients) { θ, g in
                θ = θ.moved(toward: -θ.tangentVector(from: g) * learningRate)
            }
        }
    }
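The update rule in the last item can be written once against the protocol and reused for any conforming type. Below is a minimal sketch in plain Swift, assuming a simplified `SketchDifferentiable` protocol (no VectorNumeric constraints) and a hypothetical free function `sgdStep`; the tangent vector type is fixed to Double only to keep scaling and negation available without extra protocols.

```swift
// Simplified stand-in for the `Differentiable` protocol.
protocol SketchDifferentiable {
    associatedtype TangentVector
    associatedtype CotangentVector
    func moved(toward direction: TangentVector) -> Self
    func tangentVector(from cotangent: CotangentVector) -> TangentVector
}

// A vector space is its own tangent space, so moving is plain addition.
extension Double: SketchDifferentiable {
    typealias TangentVector = Double
    typealias CotangentVector = Double
    func moved(toward direction: Double) -> Double { return self + direction }
    func tangentVector(from cotangent: Double) -> Double { return cotangent }
}

// One step of gradient descent expressed only through the protocol:
// convert the gradient (a cotangent vector) to a tangent vector, scale it,
// and move along the manifold. No implicit casts are involved.
func sgdStep<P: SketchDifferentiable>(
    _ parameter: P, gradient: P.CotangentVector, learningRate: Double
) -> P where P.TangentVector == Double {
    let direction = -parameter.tangentVector(from: gradient) * learningRate
    return parameter.moved(toward: direction)
}

// Minimize L(θ) = (θ - 3)², whose gradient is 2 * (θ - 3).
var theta = 0.0
for _ in 0..<100 {
    let gradient = 2 * (theta - 3)
    theta = sgdStep(theta, gradient: gradient, learningRate: 0.1)
}
```

A type such as OrthogonalMatrix would plug into the same function by supplying its own moved(toward:) and tangentVector(from:), with no change to the optimizer.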

Part 7. Customizable Differentiation

Some machine learning models require manipulating the gradient with respect to certain values, e.g. gradient clipping. Recurrent neural networks often suffer from the "exploding gradient" problem, and a typical solution is to force the gradient of an RNN to not exceed a certain value by performing gradient clipping. Tangent, a source-to-source AD library for Python, provides such a feature as a syntax extension.

func prediction(for input: Tensor<Float>) -> Float {
    var prediction = input
    for _ in 0...5 {
        // Clip gradient.
        prediction = prediction.withCustomizedGradient { grad in
            max(min(grad, 1), -1)
        }
        prediction = lstm.prediction(for: input)
    }
    return prediction
}

The APIs withCustomizedGradient(_:) and withCustomizedDerivatives(_:) look like compiler-known functions that make Swift run customized code in differentiated code. However, because of the generality of the differential registration mechanism, they can be defined entirely as ordinary Swift functions with no special support from the compiler. Here is the implementation of these APIs.

public extension Differentiable {
    @differentiable(forward, wrt: self, tangent: tangentCustomizingDerivatives)
    func withCustomizedDerivatives(
        _ body: @nondiff (TangentVector) -> TangentVector
    ) -> Self {
        return self
    }

    internal func tangentCustomizingDerivatives(
        body: (TangentVector) -> TangentVector,
        originalResult: Self,
        tangent: TangentVector
    ) -> TangentVector {
        return body(tangent)
    }

    @differentiable(reverse, wrt: self, adjoint: adjointCustomizingGradient)
    func withCustomizedGradient(
        _ body: @nondiff (CotangentVector) -> CotangentVector
    ) -> Self {
        return self
    }

    internal func adjointCustomizingGradient(
        body: (CotangentVector) -> CotangentVector,
        originalResult: Self,
        adjoint: CotangentVector
    ) -> CotangentVector {
        return body(adjoint)
    }
}
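The idea behind these APIs, an identity function whose derivative is replaced by user code, can be mimicked with hand-written pullbacks. The sketch below is plain Swift and does not use the `@differentiable` attribute; the free function `withCustomizedGradient` and the `Pullback` alias are hypothetical stand-ins for the methods above.

```swift
// A pullback maps a cotangent of the output to a cotangent of the input.
typealias Pullback = (Double) -> Double

// Identity in the forward direction; in the reverse direction the incoming
// gradient is passed through `body` instead of being returned unchanged.
func withCustomizedGradient(
    _ x: Double, _ body: @escaping Pullback
) -> (value: Double, pullback: Pullback) {
    return (x, body)
}

// z = 10 * x, but with the gradient flowing back into x clipped to [-1, 1].
func scaledWithClippedGradient(_ x: Double) -> (value: Double, pullback: Pullback) {
    let (y, clippingPullback) = withCustomizedGradient(x) { grad in
        max(min(grad, 1), -1)
    }
    // The pullback of z = 10 * y multiplies by 10 before clipping applies.
    return (10 * y, { v in clippingPullback(10 * v) })
}

let clipped = scaledWithClippedGradient(2)
// The forward value is unchanged (20); the gradient is 1 instead of 10.
let clippedGradient = clipped.pullback(1)
```

The forward computation is untouched, so the customization affects only what flows backward through that point.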

This API supports many gradient manipulation tasks in machine learning optimization. For example, the user can make gradient computation trigger a break from the loop.

var prediction = input
for _ in 0...5 {
    // Stop loop when necessary.
    var shouldStop = false
    prediction = prediction.withCustomizedGradient { grad in
        if grad < lowerBound {
            shouldStop = true
        }
        return grad
    }
    if shouldStop {
        break
    }
    prediction = lstm.prediction(for: input)
}

Setting a mutable flag is not the most user-friendly approach. We can create APIs that wrap withCustomizedDerivatives(_:) and withCustomizedGradient(_:) and return a Bool, so that later code can decide whether to break from the loop based on the returned value. Or better, if Swift supported non-local control flow, i.e. a branch out of a nested closure, the code could be written with a plain break.

var prediction = input
for _ in 0...5 {
    // Stop loop when necessary.
    prediction = prediction.withCustomizedGradient { grad in
        if grad < lowerBound {
            break
        }
        return grad
    }
    prediction = lstm.prediction(for: input)
}

Acknowledgements

The author would like to thank Dan Zheng, Chris Lattner, Alex Wiltschko, Bart van Merriënboer, Gordon Plotkin, Dougal Maclaurin, Matthew Johnson, Casey Chu, Tim Harley, Marc Rasi, and Dmitri Gribenko for their input to the initial design of this powerful language feature.
