Automatic Differentiation Benchmarks
```julia
# Comparing AD backends:
# Zygote, ForwardDiff, ReverseDiff, Tracker, and complex-step differentiation.
using ForwardDiff, ReverseDiff, Zygote, Tracker
using StatsFuns
using Random
using ProgressMeter
using BenchmarkTools
import LinearAlgebra.I as eye

# Log-density of the standard Normal distribution.
normlpdf(z) = -0.5 * (log(2 * pi) + z^2)

# Complex-step differentiation: perturb each coordinate along the imaginary
# axis (a diagonal matrix with im * eps() on the diagonal) and read the
# gradient off the imaginary part of f.
function grad_complexstep(f, dim)
    EYE = eye(dim) * im * eps()
    return x -> imag(f(x .+ EYE)) / eps()
end

function run_experiment(N; seed=0)
    Random.seed!(seed)

    # Simulate data: y ~ Normal(m, s).
    data = let
        m, s = (2, 1)
        y = randn(N) * s .+ m
        (N=N, m=m, s=s, y=y)
    end
    theta = randn(N)

    # Objective: f(theta) = sum_i log Normal(y_i; theta_i, s).
    f(theta) = sum(normlogpdf.(theta, data.s, data.y))

    # Matrix-valued version of f for complex-step differentiation.
    fcs(theta) = vec(sum(normlpdf.((theta .- data.y) / data.s), dims=2))

    grad_cs = grad_complexstep(fcs, data.N)
    grad_forward(theta) = ForwardDiff.gradient(f, theta)
    grad_reverse(theta) = ReverseDiff.gradient(f, theta)
    grad_zygote(theta) = Zygote.gradient(f, theta)
    grad_tracker(theta) = Tracker.gradient(f, Tracker.param(theta))[1]

    theta = randn(data.N)
    @assert grad_forward(theta) ≈ grad_reverse(theta)

    print("f: ")
    # @time for _ in 1:n f(theta) end
    @btime $f($theta) seconds=1
    print("Tracker: ")
    @btime $grad_tracker($theta) seconds=1
    print("ReverseDiff: ")
    @btime $grad_reverse($theta) seconds=1
    print("Zygote: ")
    @btime $grad_zygote($theta) seconds=1
    print("ForwardDiff: ")
    @btime $grad_forward($theta) seconds=1
    print("Complex-step: ")
    @btime $grad_cs($theta) seconds=1
end

# First call JIT-compiles everything so later timings are clean.
println("JIT compile:")
run_experiment(1)
println()

# Speed test:
for n in (10, 100, 1000)
    println("Speed test for (n=$n):")
    run_experiment(n)
    println()
end

# NOTE: Sanity check that the gradients are correct. (Commented out; assumes
# `data` and the gradient closures from `run_experiment` are in scope, e.g.
# after evaluating the function body at the top level.)
# est_theta = let
#     t = zeros(data.N)
#     eta = 1e-2
#     @showprogress for _ in 1:10000
#         # g = grad_reverse(t)
#         # g = grad_forward(t)
#         g = grad_zygote(t)[1]
#         # g = grad_cs(t)
#         t .+= eta * g
#     end
#     t
# end
# [est_theta data.y]
```
Benchmark results:

```
Speed test for (n=10):
f: 106.360 ns (1 allocation: 144 bytes)
Tracker: 13.962 μs (140 allocations: 7.18 KiB)
ReverseDiff: 1.026 μs (21 allocations: 1.78 KiB)
Zygote: 2.092 μs (55 allocations: 2.42 KiB)
ForwardDiff: 743.102 ns (7 allocations: 3.80 KiB)
Complex-step: 1.101 μs (13 allocations: 7.72 KiB)

Speed test for (n=100):
f: 725.620 ns (1 allocation: 896 bytes)
Tracker: 18.124 μs (140 allocations: 15.26 KiB)
ReverseDiff: 4.828 μs (21 allocations: 8.39 KiB)
Zygote: 5.455 μs (55 allocations: 8.22 KiB)
ForwardDiff: 29.474 μs (15 allocations: 106.38 KiB)
Complex-step: 50.554 μs (17 allocations: 628.86 KiB)

Speed test for (n=1000):
f: 7.058 μs (1 allocation: 7.94 KiB)
Tracker: 58.194 μs (140 allocations: 92.95 KiB)
ReverseDiff: 42.516 μs (21 allocations: 71.95 KiB)
Zygote: 37.722 μs (56 allocations: 64.52 KiB)
ForwardDiff: 2.513 ms (175 allocations: 8.44 MiB)
Complex-step: 9.184 ms (17 allocations: 61.07 MiB)
```
Consider the logpdf of the standard Normal distribution:

$$\log \phi(z) = -\frac{1}{2}\left(\log(2\pi) + z^2\right).$$

We compute timings for taking the gradient of

$$f(\theta) = \sum_{i=1}^{n} \log \mathcal{N}(y_i \mid \theta_i, s^2) = \sum_{i=1}^{n} \left[ \log \phi\!\left(\frac{y_i - \theta_i}{s}\right) - \log s \right]$$

w.r.t. the vector $\theta$, keeping constant the vector $y$ and scalar $s$.
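Since the model is Gaussian, this gradient is available in closed form, $\partial f / \partial \theta_i = (y_i - \theta_i)/s^2$, which gives a quick correctness check. A minimal sketch (not part of the gist's code; assumes ForwardDiff and StatsFuns are installed):

```julia
using ForwardDiff, StatsFuns

y, s = randn(5) .+ 2, 1.0
f(theta) = sum(normlogpdf.(theta, s, y))
theta = randn(5)

# Closed-form gradient of the Gaussian log-likelihood w.r.t. theta.
@assert ForwardDiff.gradient(f, theta) ≈ (y .- theta) / s^2
```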
The AD backends used are:

- ForwardDiff.jl
- complex-step differentiation (via `grad_complexstep` above)
- Tracker.jl
- ReverseDiff.jl
- Zygote.jl

Note that the first two implement forward-mode AD, whereas the last three implement reverse-mode AD (at least by default).
Zygote is the successor to Tracker.
Forward-mode AD is generally preferred when the number of parameters is much smaller than the number of outputs; reverse-mode AD is preferred when the number of parameters is much larger than the number of outputs. (Hence, reverse mode is preferred for most ML applications, where a scalar cost function is optimized w.r.t. a large number of parameters.) Forward-mode AD is easier to implement and requires one forward pass through the cost function ($f$) per parameter; reverse-mode AD requires, for each output, one forward pass through the cost function (to compute and record intermediate values) and one backward pass through a graph or trace of the cost function (to apply the chain rule).
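A small sketch of that tradeoff (hedged: exact timings depend on the machine and package versions, and `Zygote.jacobian` assumes Zygote ≥ 0.6):

```julia
using BenchmarkTools, ForwardDiff, Zygote

# Many inputs, one output: reverse mode needs a single forward+backward sweep,
# while forward mode needs roughly n/chunksize forward sweeps.
g(x) = sum(abs2, x)
x = randn(10_000)
@btime ForwardDiff.gradient($g, $x)
@btime Zygote.gradient($g, $x)

# One input, many outputs: forward mode needs a single sweep,
# while reverse mode needs one backward pass per output.
h(t) = [sin(t), cos(t), exp(t)]
@btime ForwardDiff.derivative($h, 1.0)
@btime Zygote.jacobian($h, 1.0)
```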
In the benchmarks above, we vary the size $n$ of $\theta$ and compute the gradient of $f$ w.r.t. $\theta$. Note that when $n=10$, ForwardDiff is actually the fastest, because reverse-mode AD pays the overhead of building a graph or trace and running a backward pass. Complex-step differentiation is slower than ForwardDiff here because the implementation above evaluates the function on an $n \times n$ matrix of perturbed inputs, most of which is redundant work. ReverseDiff is slightly faster than complex-step, followed by Zygote and Tracker.
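For reference, complex-step differentiation in its simplest scalar form is a one-liner (a sketch, independent of the benchmark code; it requires $f$ to be complex-analytic at the evaluation point):

```julia
# Perturb along the imaginary axis; the derivative appears in the imaginary
# part with no subtractive cancellation, unlike finite differences.
dcs(f, x) = imag(f(x + im * eps())) / eps()

@assert dcs(sin, 1.0) ≈ cos(1.0)
```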
The benefits of reverse-mode AD show by $n=100$: in terms of computational efficiency, ReverseDiff and Zygote lead, Tracker follows, and ForwardDiff and complex-step fall increasingly far behind as $n$ grows.
The literature indicates that, in general, taking the gradient of $f$ w.r.t. the parameters with a good implementation of reverse-mode AD takes at most about 6x the FLOPs of computing $f$ itself; in practice, some packages achieve 3-4x. (At $n=1000$ above, for example, Zygote's gradient takes 37.7 μs against 7.1 μs for $f$, a ratio of about 5.3.)
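A hedged sketch for reproducing that ratio directly (not in the gist; timings will vary by machine):

```julia
using BenchmarkTools, StatsFuns, Zygote

# Measure the gradient-to-function time ratio for the n = 1000 case.
y = randn(1000) .+ 2
f(theta) = sum(normlogpdf.(theta, 1.0, y))
theta = randn(1000)

t_f = @belapsed $f($theta)
t_g = @belapsed Zygote.gradient($f, $theta)
println("gradient / function time ratio ≈ ", round(t_g / t_f, digits=1))
```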