@marcrasi
Last active February 25, 2021 13:25
Unifying the differentiation modes

Today, Swift AD has two separate transforms that produce forward mode derivatives (JVPs) and reverse mode derivatives (VJPs).

We can generalize this to a single transform that produces a single "derivative function" that can evaluate derivatives in both modes.

The "derivative function" is generic over a "derivative type" that determines the derivative mode.

The advantages over the existing system are:

  1. One transform is simpler than two transforms. (high confidence)
  2. We can implement other modes (e.g. immediate mode, Jacobian mode) very easily, in Swift. (high confidence)
  3. Derivatives can (statically) operate on different tangent representations. (medium confidence)
  4. It might be easier to make it optimize well. (needs further investigation)
  5. Maybe we can implement higher order differentiation with this. (needs further investigation)

This is a big, possibly API-breaking, change, so I'm not proposing that we do it any time soon.

I do propose:

  1. Let's work more on the design and investigate some of the open questions.
  2. Before embarking on any big AD projects related to this idea, let's consider whether we could achieve them by implementing this idea.

Details

See "demo.swift" for a full working example. Here's a pseudocode summary.

/// Original function.
func foo(_ x: Float, _ y: Float) -> Float

/// Derivative function.
func dFoo<D<_>: Derivative>(_ x: Float, _ y: Float, _ dargs: D<(Float, Float)>) -> (out: Float, dout: D<Float>)

D<_> determines the derivative mode:

  • The usual tangent vector gives you the immediate forward mode derivative.
  • A matrix gives you the immediate Jacobian.
  • A differential closure gives you the delayed forward mode derivative.
  • A pullback closure gives you the reverse mode derivative.

Swift doesn't let me actually write D<_>: Derivative. I can think of ways to get around this, but I'm not sure whether any of them are "good enough"; this needs more investigation. The demo does something simple but not very general.
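
For concreteness, here is a hedged sketch of what a derivative function looks like under the demo's workaround: a Derivative protocol with one associated derivative type per value type (DFloat playing the role of D<Float>, DFloat2 playing the role of D<(Float, Float)>), and derivative functions written as static members in a protocol extension. dFoo and the body x * y are hypothetical; the demo's sum, product, and square below follow the same pattern.

/// Hypothetical: `dFoo` for `foo(x, y) = x * y`, written against the `Derivative`
/// and `Vector` protocols defined in demo.swift below.
extension Derivative {
  static func dFoo(_ x: Float, _ y: Float, _ dargs: DFloat2) -> (out: Float, dout: DFloat) {
    // Product rule: d(x * y) = y * dx + x * dy, expressed with the mode's linear operations.
    (x * y, y * d0(dargs) + x * d1(dargs))
  }
}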

Open questions

Is this similar to anything in the literature? Can we learn from that?

What's the right way to make the derivative functions generic over the derivative values? (aka how to express something like D<_>: Derivative).

Can it optimize well?

What should the user-facing APIs look like?

  • how to register derivatives
  • how to take derivatives
  • how do derived conformances work?
  • how do differentiable function values work?
  • how to specify new differentiation modes

Can we implement all of the current AD system's features in terms of this?

  • fully customizable derivatives
  • derivatives of mutating functions
  • mutating derivative functions

Can we implement higher order differentiation?

Motivating use case

In SwiftFusion, we need full Jacobians of functions (A) -> B. Today we compute the full Jacobians like this:

  1. Call the JVP, which returns a closure (A.TangentVector) -> B.TangentVector, called the "differential".
  2. Evaluate the "differential" on the basis vectors [1, 0, 0, ..., 0], [0, 1, 0, ..., 0], etc., to get the columns of the Jacobian.

In the new system, we can get the same result with less redundant computation by calling the derivative function with a matrix.
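
For concreteness, here is a hedged sketch of steps 1-2 expressed with the ForwardTangent mode defined in demo.swift below, using example2 as a stand-in for a concrete (Float, Float) -> Float function. example2ForwardJacobian in the demo then computes the same derivatives in a single call by seeding the ForwardJacobian mode with a 2x2 identity matrix.

// Hypothetical helper: assemble the Jacobian column by column, one basis tangent
// vector per call. This is the strategy SwiftFusion uses today.
func example2JacobianByColumns(_ x: Float, _ y: Float) -> [Float] {
  let (_, dColX) = ForwardTangent.example2(x, y, .init(1, 0))  // seed the basis vector (1, 0)
  let (_, dColY) = ForwardTangent.example2(x, y, .init(0, 1))  // seed the basis vector (0, 1)
  return [dColX.t, dColY.t]
}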

demo.swift

import TensorFlow

// MARK: - Protocols

/// Element of an abstract vector space.
protocol Vector {
  static func * (lhs: Float, rhs: Self) -> Self
  static func + (lhs: Self, rhs: Self) -> Self
  static var zero: Self { get }
}

/// An environment containing all the types and linear operations required for defining
/// derivatives.
protocol Derivative {
  /// The derivative of a `Float`-valued function.
  associatedtype DFloat: Vector

  /// The derivative of a `(Float, Float)`-valued function.
  associatedtype DFloat2: Vector

  /// Returns the derivative of `f(_).0`, given the derivative `t` of `f(_)`.
  static func d0(_ t: DFloat2) -> DFloat

  /// Returns the derivative of `f(_).1`, given the derivative `t` of `f(_)`.
  static func d1(_ t: DFloat2) -> DFloat

  /// Returns the derivative of `(f0(_), f1(_))` given the derivative `t0` of `f0` and the
  /// derivative `t1` of `f1`.
  static func dInit(_ t0: DFloat, _ t1: DFloat) -> DFloat2
}

// It is unfortunate that I have to define `associatedtype D*` for all types that I want to use
// in differentiation.

// MARK: - Derivative functions
// These look almost exactly like JVPs.
// It should be easy to automatically produce them with a forward mode autodiff transform.
// The exciting thing is that we can evaluate them in any differentiation mode we want, by
// defining different `Derivative`s.
extension Derivative {
  static func sum(_ x: Float, _ y: Float, _ tIn: DFloat2) -> (value: Float, tOut: DFloat) {
    (x + y, d0(tIn) + d1(tIn))
  }

  static func product(_ x: Float, _ y: Float, _ tIn: DFloat2) -> (value: Float, tOut: DFloat) {
    (x * y, y * d0(tIn) + x * d1(tIn))
  }

  static func square(_ x: Float, _ tIn: DFloat) -> (value: Float, tOut: DFloat) {
    product(x, x, dInit(tIn, tIn))
  }

  /// x^2 + 3x
  static func example1(_ x: Float, _ tIn: DFloat) -> (value: Float, tOut: DFloat) {
    let (v1, t1) = square(x, tIn)
    let (v2, t2) = product(3, x, dInit(DFloat.zero, tIn))
    return sum(v1, v2, dInit(t1, t2))
  }

  /// x^2 + 10y^2 + xy
  static func example2(_ x: Float, _ y: Float, _ tIn: DFloat2) -> (value: Float, tOut: DFloat) {
    let dx = d0(tIn)
    let dy = d1(tIn)
    let (xSq, dxSq) = square(x, dx)
    let (ySq, dySq) = square(y, dy)
    let (ySq10, dySq10) = product(10, ySq, dInit(DFloat.zero, dySq))
    let (xy, dxy) = product(x, y, dInit(dx, dy))
    let (r1, dr1) = sum(xSq, ySq10, dInit(dxSq, dySq10))
    return sum(r1, xy, dInit(dr1, dxy))
  }
}
// MARK: - Derivative modes
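/// Derivative mode: the usual tangent vector, giving the immediate forward mode derivative.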
enum ForwardTangent: Derivative {
  struct DFloat: Vector {
    var t: Float
    init(_ t: Float) { self.t = t }
    static func * (lhs: Float, rhs: Self) -> Self { .init(lhs * rhs.t) }
    static func + (lhs: Self, rhs: Self) -> Self { .init(lhs.t + rhs.t) }
    static var zero: Self { .init(0) }
  }

  struct DFloat2: Vector {
    var t0: Float
    var t1: Float
    init(_ t0: Float, _ t1: Float) { self.t0 = t0; self.t1 = t1 }
    static func * (lhs: Float, rhs: Self) -> Self { .init(lhs * rhs.t0, lhs * rhs.t1) }
    static func + (lhs: Self, rhs: Self) -> Self { .init(lhs.t0 + rhs.t0, lhs.t1 + rhs.t1) }
    static var zero: Self { .init(0, 0) }
  }

  static func d0(_ t: DFloat2) -> DFloat { .init(t.t0) }
  static func d1(_ t: DFloat2) -> DFloat { .init(t.t1) }
  static func dInit(_ t0: DFloat, _ t1: DFloat) -> DFloat2 { .init(t0.t, t1.t) }
}
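/// Derivative mode: matrix-valued tangents; seeding with an identity matrix gives the immediate Jacobian.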
enum ForwardJacobian: Derivative {
  struct DFloat: Vector {
    // Shape [1, n], or the zero scalar.
    var t: Tensor<Float>
    init(_ t: Tensor<Float>) {
      precondition(t == Tensor(0) || (t.rank == 2 && t.shape[0] == 1))
      self.t = t
    }
    static func * (lhs: Float, rhs: Self) -> Self { .init(lhs * rhs.t) }
    static func + (lhs: Self, rhs: Self) -> Self { .init(lhs.t + rhs.t) }
    static var zero: Self { .init(Tensor(0)) }
  }

  struct DFloat2: Vector {
    // Shape [2, n], or the zero scalar.
    var t: Tensor<Float>
    init(_ t: Tensor<Float>) {
      precondition(t == Tensor(0) || (t.rank == 2 && t.shape[0] == 2))
      self.t = t
    }
    static func * (lhs: Float, rhs: Self) -> Self { .init(lhs * rhs.t) }
    static func + (lhs: Self, rhs: Self) -> Self { .init(lhs.t + rhs.t) }
    static var zero: Self { .init(Tensor(0)) }
  }

  static func d0(_ t: DFloat2) -> DFloat {
    .init(t.t[0].expandingShape(at: 0))
  }

  static func d1(_ t: DFloat2) -> DFloat {
    .init(t.t[1].expandingShape(at: 0))
  }

  static func dInit(_ t0: DFloat, _ t1: DFloat) -> DFloat2 {
    if t0.t == Tensor(0) && t1.t == Tensor(0) { return .init(Tensor(0)) }
    let row0 = t0.t == Tensor(0) ? Tensor(zerosLike: t1.t) : t0.t
    let row1 = t1.t == Tensor(0) ? Tensor(zerosLike: t0.t) : t1.t
    return .init(Tensor(concatenating: [row0, row1]))
  }
}
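/// Derivative mode: pullback closures over a base mode, giving the reverse mode derivative.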
enum Reverse<Base: Derivative, In: Vector>: Derivative {
  struct Pullback<Out: Vector>: Vector {
    var f: (Out) -> In
    init(_ f: @escaping (Out) -> In) { self.f = f }
    static func * (lhs: Float, rhs: Self) -> Self {
      .init({ rhs.f(lhs * $0) })
    }
    static func + (lhs: Self, rhs: Self) -> Self {
      .init({ lhs.f($0) + rhs.f($0) })
    }
    static var zero: Self { .init({ _ in .zero }) }
  }

  typealias DFloat = Pullback<Base.DFloat>
  typealias DFloat2 = Pullback<Base.DFloat2>

  static func d0(_ t: DFloat2) -> DFloat {
    .init({ t.f(Base.dInit($0, .zero)) })
  }

  static func d1(_ t: DFloat2) -> DFloat {
    .init({ t.f(Base.dInit(.zero, $0)) })
  }

  static func dInit(_ t0: DFloat, _ t1: DFloat) -> DFloat2 {
    .init({ t0.f(Base.d0($0)) + t1.f(Base.d1($0)) })
  }
}
// MARK: - Example invocations
// Optimizes to plain float operations.
func example1ForwardDerivative(_ x: Float) -> Float {
  let (_, d) = ForwardTangent.example1(x, .init(1))
  return d.t
}

// Optimizes to:
// - plain float operations that produce the result
// - a bunch of closure allocations and deallocations that are not used in the result (!?)
func example1ReverseDerivative(_ x: Float) -> Float {
  typealias System = Reverse<ForwardTangent, ForwardTangent.DFloat>
  let (_, d) = System.example1(x, .init({ $0 }))
  return d.f(.init(1)).t
}

func example2ReverseDerivative(_ x: Float, _ y: Float) -> ForwardTangent.DFloat2 {
  typealias System = Reverse<ForwardTangent, ForwardTangent.DFloat2>
  let (_, d) = System.example2(x, y, .init({ $0 }))
  return d.f(.init(1))
}

func example2ForwardJacobian(_ x: Float, _ y: Float) -> Tensor<Float> {
  let (_, d) = ForwardJacobian.example2(x, y, .init(eye(rowCount: 2)))
  return d.t
}

func example2ReverseJacobian(_ x: Float, _ y: Float) -> Tensor<Float> {
  typealias System = Reverse<ForwardJacobian, ForwardJacobian.DFloat2>
  let (_, d) = System.example2(x, y, .init({ $0 }))
  return d.f(.init(eye(rowCount: 1))).t
}
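// Expected values: d/dx (x^2 + 3x) at x = 2 is 7, and the partial derivatives of
// x^2 + 10y^2 + xy at (x, y) = (1, 5) are (2x + y, 20y + x) = (7, 101).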
print(example1ForwardDerivative(2))
print(example1ReverseDerivative(2))
print(example2ReverseDerivative(1, 5))
print(example2ForwardJacobian(1, 5))
print(example2ReverseJacobian(1, 5))
@oxinabox

I don't remember why I am reading this. I just remember encountering it and adding it to my "To Read" list.
Do excuse me if I am commenting unwelcomely. I am almost certainly going to say things you already know.

This is very neat.

Is this similar to anything in the literature?

I haven't seen this whole overall approach before, but I am not that well-versed.

I have seen the approach of computing Jacobians via propagation of a matrix before, called "chunked mode", though I don't know if that is a standard name.
E.g. in ForwardDiff.jl.
It's legit and should be faster, because matmul is faster than a loop of matrix-vector multiplications (once matrices are big enough that Strassen's algorithm applies).
It further improves cache locality and makes redundant work more obvious to the compiler for common subexpression elimination, etc.

It pairs nicely with work on sparsity, which computes the Jacobian on a compressed basis (though you can do that via looping also).

In general it makes sense because one is constructing a functional form of the Jacobian matrix.
Basis transforms are applied to matrices via matrix-matrix products.
The pushforward function is the functional form of that linear operator and can be written as pushforward(x) = J*x, which is identical code regardless of whether x is a matrix or a vector.

Can we implement higher order differentiation?

I think so?
Seems you just need zero, + and * defined.
Which is well defined on polynomials.
Full details are in chapter 13 of Andreas Griewank and Andrea Walther, Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation (2nd ed., 2008),
but the actual operations are not obscure:
addition is termwise addition of coefficients, and multiplication is convolution of coefficients (or just multiply them out by hand and throw away higher-order terms).
See e.g. TaylorSeries.jl, a Julia package that does higher-order AD like this.
So one can define a HigherOrderTangent that does that, and it looks like that would just work with your code?


Everything gets fiddly, though, once you extend this to work not on scalars but on vectors or structs.
