This example is a 4-dimensional geometric Brownian motion. The code for the torchsde version is pulled directly from the torchsde README so that it would be a fair comparison against the author's own code. The only change to that example is the addition of a dt choice so that the simulation method and time step match between the two programs. The SDE is solved 100 times. The results are summarized as follows:
- torchsde: 1.87 seconds
- DifferentialEquations.jl: 0.00115 seconds
This demonstrates a roughly 1,600x performance difference in favor of Julia on the Python library's own README example. Further testing against torchsde could not be completed because of these performance issues.
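For context, the structure of such a benchmark can be sketched in pure Python. This is a hypothetical illustration, not the actual benchmark code: it assumes a diagonal-noise GBM, dX_t = mu X_t dt + sigma X_t dW_t, integrated with fixed-step Euler-Maruyama, and the parameter values and step size here are made up for the sketch.

```python
import math
import random

def euler_maruyama_gbm(x0, mu, sigma, t1, dt, rng):
    """Integrate a diagonal-noise GBM from t=0 to t=t1 with fixed step dt."""
    x = list(x0)
    n_steps = round(t1 / dt)
    for _ in range(n_steps):
        for i in range(len(x)):
            dW = rng.gauss(0.0, math.sqrt(dt))  # Brownian increment for component i
            x[i] += mu * x[i] * dt + sigma * x[i] * dW
    return x

rng = random.Random(0)
# Solve the same 4-dimensional SDE 100 times, mirroring the benchmark loop.
results = [euler_maruyama_gbm([1.0] * 4, 0.05, 0.2, 1.0, 2**-6, rng)
           for _ in range(100)]
print(len(results), len(results[0]))
```

The point of the fixed `dt` is visible here: with the method and step size pinned down, both libraries perform the same number of drift and diffusion evaluations, so the timing difference reflects solver overhead rather than algorithm choice.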
We note that the performance difference in the context of neural SDEs is likely smaller, since there most of the runtime is spent in matrix multiplication kernels. However, given that full SDE training examples like the one demonstrated here generally take about a minute, we still expect a major performance difference, but we currently do not have the compute time to run a full demonstration.
Thanks! I updated the timing: it was closer to 2 seconds on my machine. Nicely done.
I would be cautious about saying that only your applications are real: there are probably orders of magnitude more people doing mathematical finance, model-informed drug development, and systems biology with SDEs than training neural SDEs for image processing (at least right now), and those disciplines naturally arrive at more heterogeneous models that cannot always be expressed in matmuls. So the statement that the only real SDEs are those dominated by matmuls is probably more inflammatory than you mean it to be, and I'd tone it down a bit.
That of course doesn't contradict the fact that as f or g grows in cost, like in a neural SDE, the cost of the other pieces is dwarfed by the matmuls, so you can just count drift and diffusion calls to estimate the total cost: that is asymptotically true. I don't think it's an issue to point out that this can be a problem for someone just trying to pick up torchsde as a general-purpose SDE solver, even though it is outside of the realm you generally focus on. I did this little bit of benchmarking because some particle physicists were curious how it would benchmark on 3 complex SDE systems (so 6 real SDEs), meaning that this benchmark is quite close to their use case and is a notable result.

That said, the fact that torchsde does not store the noise for the reverse pass of an adjoint does mean that if the noise is large enough not to fit in memory (a bigger issue for GPU memory), then there are cases where torchsde will be faster. It's not contradictory to say that different algorithms are optimized for different performance regimes.

I will also note that the current results are being nice to torchsde by not fully specializing on this domain. It is known that the noise format used in StochasticDiffEq is ~5x from optimal on fixed-time-step problems, so I could accelerate these benchmarks further by dispatching over to another implementation which is only for fixed time steps (https://diffeq.sciml.ai/stable/solvers/sde_solve/#BridgeDiffEq.jl-1), but that felt unnecessary to benchmark because the case I wanted to get to was adaptivity anyway (though for small enough equations I think the difference is clear enough). That just goes to show that it's not about single implementations but about optimizing a whole range of implementations for different performance regimes under different assumptions.
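The call-counting estimate above can be made concrete with a small sketch (purely hypothetical code, not from torchsde or StochasticDiffEq): wrap f and g with counters, and once those evaluations dominate, the total solve cost is approximately (f calls) * cost(f) + (g calls) * cost(g), regardless of solver overhead.

```python
import math
import random

def counted(fn):
    """Wrap a function so its number of invocations is tracked."""
    def wrapper(*args):
        wrapper.calls += 1
        return fn(*args)
    wrapper.calls = 0
    return wrapper

@counted
def f(t, x):  # drift: stand-in for an expensive network evaluation
    return 0.05 * x

@counted
def g(t, x):  # diffusion: likewise a stand-in
    return 0.2 * x

# A fixed-step Euler-Maruyama solve calls f and g once per step, so an
# N-step solve costs roughly N * (cost(f) + cost(g)) when f and g dominate.
x, dt, rng = 1.0, 2**-6, random.Random(0)
for _ in range(64):
    x += f(0.0, x) * dt + g(0.0, x) * rng.gauss(0.0, math.sqrt(dt))
print(f.calls, g.calls)  # 64 64
```

For a cheap scalar f and g like this, the per-step bookkeeping is a large fraction of the runtime, which is exactly the regime where the benchmarked libraries diverge; for a neural SDE the same call counts multiply against a much larger per-call cost, and the overhead washes out.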
What to do about torchsde on these CPU-only jobs? I'm not sure: the torch JIT doesn't seem to do very well in this regime at all. I'd take the IR and write my own SLP vectorizer on it.