@luiarthur
Created November 9, 2022 00:20
Automatic Differentiation Benchmarks
# Comparing AD backends:
# - ForwardDiff, ReverseDiff, Zygote, Tracker, and complex-step differentiation
using ForwardDiff, ReverseDiff, Zygote, Tracker
using StatsFuns
using Random
using ProgressMeter
using BenchmarkTools
import LinearAlgebra.I as eye

# Log-density of the standard Normal distribution.
normlpdf(z) = -0.5 * (log(2 * pi) + z^2)

# Return a function that approximates the gradient of `f` via complex-step
# differentiation: imag(f(x .+ im * eps() * I)) / eps(). Here `f` must accept
# the resulting `dim × dim` complex matrix and return a length-`dim` vector.
function grad_complexstep(f, dim)
    EYE = eye(dim) * im * eps()
    return x -> imag(f(x .+ EYE)) / eps()
end
function run_experiment(N; seed=0)
    Random.seed!(seed)

    # Simulate N observations from Normal(m, s).
    data = let
        m, s = (2, 1)
        y = randn(N) * s .+ m
        (N=N, m=m, s=s, y=y)
    end

    theta = randn(N)

    # Log-likelihood of theta (one mean per observation, known scale s).
    f(theta) = sum(normlogpdf.(theta, data.s, data.y))

    # Complex-step-friendly version: broadcasts over an N×N complex matrix
    # and reduces to a length-N vector.
    fcs(theta) = vec(sum(normlpdf.((theta .- data.y) / data.s), dims=2))

    grad_cs = grad_complexstep(fcs, data.N)
    grad_forward(theta) = ForwardDiff.gradient(f, theta)
    grad_reverse(theta) = ReverseDiff.gradient(f, theta)
    grad_zygote(theta) = Zygote.gradient(f, theta)
    grad_tracker(theta) = Tracker.gradient(f, Tracker.param(theta))[1]

    theta = randn(data.N)
    @assert grad_forward(theta) ≈ grad_reverse(theta)

    print("f: ")
    # @time for _ in 1:N f(theta) end
    @btime $f($theta) seconds=1
    print("Tracker: ")
    @btime $grad_tracker($theta) seconds=1
    print("ReverseDiff: ")
    @btime $grad_reverse($theta) seconds=1
    print("Zygote: ")
    @btime $grad_zygote($theta) seconds=1
    print("ForwardDiff: ")
    @btime $grad_forward($theta) seconds=1
    print("Complex-step: ")
    @btime $grad_cs($theta) seconds=1
end
# JIT-compile
println("JIT compile:")
run_experiment(1)
println()

# Speed test:
for n in (10, 100, 1000)
    println("Speed test for (n=$n):")
    run_experiment(n)
    println()
end
# NOTE: Sanity check that the gradients are correct: run gradient ascent on f
# (inside `run_experiment`) and compare est_theta to data.y.
# est_theta = let
#     t = zeros(data.N)
#     eta = 1e-2
#     @showprogress for _ in 1:10000
#         # g = grad_reverse(t)
#         # g = grad_forward(t)
#         g = grad_zygote(t)[1]
#         # g = grad_cs(t)
#         t .+= eta * g
#     end
#     t
# end
# [est_theta data.y]
Speed test for (n=10):
f: 106.360 ns (1 allocation: 144 bytes)
Tracker: 13.962 μs (140 allocations: 7.18 KiB)
ReverseDiff: 1.026 μs (21 allocations: 1.78 KiB)
Zygote: 2.092 μs (55 allocations: 2.42 KiB)
ForwardDiff: 743.102 ns (7 allocations: 3.80 KiB)
Complex-step: 1.101 μs (13 allocations: 7.72 KiB)

Speed test for (n=100):
f: 725.620 ns (1 allocation: 896 bytes)
Tracker: 18.124 μs (140 allocations: 15.26 KiB)
ReverseDiff: 4.828 μs (21 allocations: 8.39 KiB)
Zygote: 5.455 μs (55 allocations: 8.22 KiB)
ForwardDiff: 29.474 μs (15 allocations: 106.38 KiB)
Complex-step: 50.554 μs (17 allocations: 628.86 KiB)

Speed test for (n=1000):
f: 7.058 μs (1 allocation: 7.94 KiB)
Tracker: 58.194 μs (140 allocations: 92.95 KiB)
ReverseDiff: 42.516 μs (21 allocations: 71.95 KiB)
Zygote: 37.722 μs (56 allocations: 64.52 KiB)
ForwardDiff: 2.513 ms (175 allocations: 8.44 MiB)
Complex-step: 9.184 ms (17 allocations: 61.07 MiB)

luiarthur commented Nov 9, 2022

Consider the logpdf of the standard Normal distribution:

$$ \phi(z) = -(\log(2\pi) + z^2) / 2. $$

We compute timings for taking the gradient of

$$ f(\theta) = \sum_{i=1}^N \phi((y_i - \theta_i) / s) $$

w.r.t. the vector $\theta$, keeping constant the vector $y$ and scalar $s$.
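
As a quick sanity check of this setup (a minimal sketch with made-up toy data, separate from the benchmarked script), the AD gradient can be compared against the analytic gradient $\partial f/\partial\theta_i = (y_i - \theta_i)/s^2$, which follows from $\phi'(z) = -z$ and the chain rule:

```julia
using ForwardDiff

ϕ(z) = -0.5 * (log(2 * pi) + z^2)        # standard Normal logpdf
f(θ, y, s) = sum(ϕ.((y .- θ) ./ s))      # objective, summed over observations

y = [1.5, -0.3, 2.2]; s = 1.0; θ = zeros(3)   # illustrative toy data
g_ad       = ForwardDiff.gradient(t -> f(t, y, s), θ)
g_analytic = (y .- θ) ./ s^2             # from ϕ'(z) = -z and the chain rule
@assert g_ad ≈ g_analytic
```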

The AD backends used are:

- ForwardDiff.jl
- complex-step differentiation
- ReverseDiff.jl
- Zygote.jl
- Tracker.jl

Note that the first two implement forward-mode differentiation (complex-step is not strictly AD, but behaves like forward mode), whereas the last three implement reverse-mode AD (at least by default).

Zygote is the newer successor to Tracker.

Forward-mode AD is generally preferred when the number of parameters is much smaller than the number of outputs; reverse-mode AD is preferred when the number of parameters is much larger than the number of outputs. (Hence, reverse mode is preferred for most ML applications, where a scalar cost function is optimized w.r.t. a large number of parameters.) Forward-mode AD is easier to implement but requires a forward pass through the cost function ($f$) for each parameter; reverse-mode AD requires, for each output, a forward pass through the cost function (to compute and record intermediate values) and a backward pass through a graph or trace of the cost function (to apply the chain rule).
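
As a rough illustration of the two primitives (a sketch on a toy function, not the gist's benchmark code): forward mode yields one directional derivative per pass, while reverse mode yields the full gradient of a scalar output from one forward-plus-backward pass.

```julia
using ForwardDiff, Zygote

g(x) = sum(abs2, x)                  # toy scalar-valued function
x = randn(5); v = randn(5)

# Forward mode: one pass yields a single directional derivative (here along v),
# so a full gradient needs one such pass per input dimension.
dv = ForwardDiff.derivative(t -> g(x .+ t .* v), 0.0)

# Reverse mode: one forward pass plus one backward pass yields the full
# gradient of the scalar output, regardless of the input dimension.
val, back = Zygote.pullback(g, x)
grad = back(1.0)[1]

@assert dv ≈ sum(grad .* v)          # directional derivative = ∇g ⋅ v
```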

Below, we vary the size $n$ of $\theta$ and compute the gradient of $f$ w.r.t. $\theta$. Note that when $n=10$, ForwardDiff is actually the fastest, because reverse-mode AD pays the overhead of building a graph/trace and doing a backward pass. Complex-step differentiation does redundant work here (it evaluates the objective on an $n \times n$ complex matrix), so it is slower. ReverseDiff is slightly faster than complex-step, followed by Zygote and Tracker.
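
For reference, complex-step differentiation approximates a derivative as $f'(x) \approx \operatorname{Im}\,f(x + i\varepsilon)/\varepsilon$; the `grad_complexstep` helper above vectorizes this by evaluating `fcs` on an $n \times n$ complex perturbation of $\theta$, hence the extra work. A scalar sketch (illustrative only):

```julia
# Complex-step derivative of a scalar function: Im(f(x + iϵ)) / ϵ.
cs_derivative(f, x; ϵ=eps()) = imag(f(x + im * ϵ)) / ϵ

ϕ(z) = -0.5 * (log(2 * pi) + z^2)
@assert cs_derivative(ϕ, 0.7) ≈ -0.7    # matches ϕ'(z) = -z at z = 0.7
```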

The benefits of reverse-mode AD show once $n \geq 100$. In terms of computational efficiency (fastest to slowest) at $n = 1000$:

Zygote > ReverseDiff > Tracker > ForwardDiff > complex-step.

At $n = 100$ the ordering is the same except that ReverseDiff edges out Zygote.

The literature indicates that, in general, taking the gradient of $f$ w.r.t. the parameters with a good implementation of reverse-mode AD requires at most about 6x the FLOPs of evaluating $f$ itself; in practice, some packages achieve 3-4x.
