@sdwfrost · Last active July 2, 2022

Comparing simple simulations in Julia, Nim, C++ and R

This gist compares the performance of Julia, Nim, C++ and R (the latter using either POMP or LibBi) on a simple simulation of an SIR epidemiological model. In addition to keeping track of susceptibles, infecteds and recovereds, I also store the cumulative number of infections. Time moves in discrete steps, and the algorithm avoids language-specific syntax features to keep the comparison as fair as possible, including using the same algorithm for generating binomial random numbers and the same random number generator. The exceptions are the R versions: POMP uses R's standard Mersenne Twister random number generator, and I'm not sure what LibBi uses. The algorithm for generating random binomial numbers is only really suitable for small np.
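
Concretely, every implementation takes the same chain-binomial step at each time increment (with parameters β, γ, ι, N and step size δt as in the code below):

    λ = β (I + ι) / N
    infection ~ Binomial(S, 1 - exp(-λ δt))
    recovery  ~ Binomial(I, 1 - exp(-γ δt))
    (S, I, R, Y) → (S - infection, I + infection - recovery, R + recovery, Y + infection)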

Benchmarks were run on a Mac Pro (Late 2013), with a 3 GHz 8-core Intel Xeon E5, 64GB of 1866 MHz RAM, running OS X v10.11.3 (El Capitan), using Julia v0.6.1, Nim v0.17.3, clang v5.0.0, emscripten v1.37.34, node v8.9.1, and R v3.3.2.

Nim version

Native

Compile using:

nim c -d:release --passC="-flto -ffast-math" sir

Run:

./sir

Output:

0.038819
20.75

The first number is the time taken in seconds; the second is the average cumulative number of infections (Y in the code).

Native, optimised version

Based on the comments below, I wrote an optimised version, sir_opt.nim.

nim c -d:release --passC="-flto -ffast-math" sir_opt
./sir_opt

Output:

0.031404
20.964

WebAssembly version

Nim can also compile to WebAssembly via Emscripten, using the nim.cfg shown below:

nim c -d:release -d:wasm -o:sir_opt.js  sir_opt
node ./sir_opt.js

Output:

0.052
20.964

Julia version

Run:

julia sir.jl

Output:

0.058961 seconds (2.09 k allocations: 232.998 KiB)
21.019

There is also a modified version, sir_opt.jl, that uses tuples (and hence is also a bit neater), adds a couple of macros, and uses an algorithm for generating random binomials that is faster for small np. Thanks to @DNF2 and @ChrisRackauckas (see below) for the tips.

julia sir_opt.jl
0.045802 seconds (87 allocations: 14.123 KiB)
21.028

C++ version

Native build

Compilation:

g++ -std=c++11 -O3 -o sir_cpp sir.cpp

Running:

./sir_cpp

Output:

0.050181
20.708

WebAssembly

To compile the C++ code to WebAssembly, I use Emscripten.

Compilation:

em++ -s WASM=1 -s SINGLE_FILE=1 -std=c++11 -O3 -o sir_cpp.js sir.cpp

Running:

node sir_cpp.js

Output:

0.07
20.708

This C++ code uses code from random_real.c, copyright Taylor G. Campbell (2014), and a portable implementation of clzll.
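
For context, Campbell's random_real maps a 64-bit integer to a double with full resolution near zero; a simpler, coarser mapping that is often used instead keeps only the top 53 bits. A minimal Julia sketch of that simpler alternative (for illustration only; it is not what the benchmarked code uses):

    # Map a UInt64 to a Float64 in [0, 1) using its top 53 bits.
    tofloat(z::UInt64) = (z >> 11) * 2.0^-53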

R, using POMP

Running:

Rscript sir_pomp.R

Output:

Loading required package: methods
   user  system elapsed 
  0.741   0.108   0.849

R, using LibBi

Running:

Rscript sir_bi.R

Output:

Model:  sir 
Run time:  0.073982  seconds
Number of samples:  1000 
[1] 20.251

Note that in practice rbi takes much longer to run the model than the time above suggests, due to the need for compilation, which takes far longer than Julia's compilation does. Thanks to @sbfnk for some excellent examples.

Any suggestions for improvements welcome!

nim.cfg

@if asmjs or wasm:
  d:emscripten
@end
@if emscripten or asmjs or wasm:
  o:"index.html"
  @if not wasm:
    d:asmjs
  @end
  cc = clang
  clang.exe = "emcc"
  clang.linkerexe = "emcc"
  clang.options.linker = ""
  cpu = "i386"
  @if wasm:
    passC = "-s WASM=1 -s 'BINARYEN_METHOD=\"native-wasm\"' -Iemscripten -s SINGLE_FILE=1 "
    passL = "-s WASM=1 -s 'BINARYEN_METHOD=\"native-wasm\"' -Lemscripten -s SINGLE_FILE=1 -s ALLOW_MEMORY_GROWTH=1 "
  @elif asmjs:
    passC = "-s ASM_JS=1 -Iemscripten" # -s USE_PTHREADS=1
    passL = "-s ASM_JS=1 -Lemscripten" # -s ALLOW_MEMORY_GROWTH=1
  @end
  @if release:
    passC %= "-O3 -flto -ffast-math"
    passL %= "-O3 -flto -ffast-math"
  @end
  # Fix _setjmp/longjmp problem. https://irclogs.nim-lang.org/24-09-2017.html#12:19:50
  d:nimStdSetjmp # https://irclogs.nim-lang.org/24-09-2017.html#20:13:18
@end

sir.cpp

#include <array>
#include <vector>
#include <cmath>
#include <cstdint>
#include <ctime>
#include <iostream>
#include <climits>
#include <limits>
#include <iterator>
#include <numeric>
#include <chrono>

using namespace std;

// The following is used to turn the uint64 random numbers into a double
// between 0.0 and 1.0
int clz64(uint64_t v) {
  const unsigned char y[] = {
     0, 47,  1, 56, 48, 27,  2, 60, 57, 49, 41, 37, 28, 16,  3, 61,
    54, 58, 35, 52, 50, 42, 21, 44, 38, 32, 29, 23, 17, 11,  4, 62,
    46, 55, 26, 59, 40, 36, 15, 53, 34, 51, 20, 43, 31, 22, 10, 45,
    25, 39, 14, 33, 19, 30,  9, 24, 13, 18,  8, 12,  7,  6,  5, 63
  };
  v |= v >> 1;
  v |= v >> 2;
  v |= v >> 4;
  v |= v >> 8;
  v |= v >> 16;
  v |= v >> 32;
  return ((sizeof(uint64_t) * CHAR_BIT) - 1) -
    y[(uint64_t)(v * 0x03F79D71B4CB0A89ULL) >> 58];
}

double random_real(uint64_t x)
{
  int exponent = -64;
  uint64_t significand;
  unsigned shift;
  while ((significand = x) == 0) {
    exponent -= 64;
    if (exponent < -1074) return 0;
  }
  shift = clz64(significand);
  if (shift != 0) {
    exponent -= shift;
    significand <<= shift;
    significand |= (x >> (64 - shift));
  }
  significand |= 1;
  return ldexp((double)significand, exponent);
}

// Random number generator
uint64_t xorshift128plus(array<uint64_t,2> &rng) {
  uint64_t x = rng[0];
  uint64_t const y = rng[1];
  rng[0] = y;
  x ^= x << 23; // a
  rng[1] = x ^ y ^ (x >> 17) ^ (y >> 26); // b, c
  uint64_t z = rng[1] + y;
  return(z);
}

double randunif(array<uint64_t,2> &rng){
  uint64_t z = xorshift128plus(rng);
  double dz = random_real(z);
  return dz;
}

int64_t randbn(int64_t n, double p, array<uint64_t,2> &rng)
{
  const double log_q = log(1.0 - p);
  int64_t x = 0;
  double sum = 0.0;
  while(1){
    sum += log(randunif(rng))/static_cast<double>(n - x);
    if(sum < log_q) break;
    x += 1;
  }
  return(x);
}

void sir(int64_t t, array<int64_t,4> &u, array<int64_t,4> &du, const array<double,5> parms, array<uint64_t,2> &rng)
{
  int64_t S=u[0];
  int64_t I=u[1];
  int64_t R=u[2];
  int64_t Y=u[3];
  double beta=parms[0];
  double gamma=parms[1];
  double iota=parms[2];
  double N=parms[3];
  double dt=parms[4];
  double lambd=beta*(static_cast<double>(I)+iota)/N;
  double ifrac=1.0-exp(-lambd*dt);
  double rfrac=1.0-exp(-gamma*dt);
  int64_t infection=randbn(S,ifrac,rng);
  int64_t recovery=randbn(I,rfrac,rng);
  du[0]=S-infection;
  du[1]=I+infection-recovery;
  du[2]=R+recovery;
  du[3]=Y+infection;
  return;
}

double simulate()
{
  array<double,5> parms = {2.62110617498984, 0.5384615384615384, 0.5, 403.0, 0.1};
  int64_t tf=540;
  int64_t nsims = 1000;
  array<int64_t,1000> yvec;
  array<uint64_t,2> rng = {123,456};
  for(int64_t i=0;i<nsims;i++){
    array<int64_t,4> u = {60,1,342,0};
    array<int64_t,4> du = {0,0,0,0};
    for(int64_t j=1;j<=tf;j++){
      sir(j,u,du,parms,rng);
      u=du;
    }
    yvec[i] = u[3];
  }
  int64_t numerator = accumulate(begin(yvec),end(yvec),int64_t(0));
  double result = static_cast<double>(numerator)/static_cast<double>(nsims);
  return(result);
}

int main(int argc, char* argv[]) {
  std::clock_t startcputime = std::clock();
  double m=simulate();
  double cpu_duration = (std::clock() - startcputime) / (double)CLOCKS_PER_SEC;
  cout << cpu_duration << "\n";
  cout << m << "\n";
}

sir.jl

using RandomNumbers

function randbn(n, p, rng)
    log_q = log(1.0 - p)
    x = 0
    sum = 0.0
    while true
        sum += log(rand(rng)) / (n - x)
        sum < log_q && break
        x += 1
    end
    return x
end

function sir(t,u,du,parms,rng)
    S=u[1]
    I=u[2]
    R=u[3]
    Y=u[4]
    β=parms[1]
    γ=parms[2]
    ι=parms[3]
    N=parms[4]
    δt=parms[5]
    λ=β*(convert(Float64,I)+ι)/N
    ifrac=1.0-exp(-λ*δt)
    rfrac=1.0-exp(-γ*δt)
    infection=randbn(S,ifrac,rng)
    recovery=randbn(I,rfrac,rng)
    du[1]=S-infection
    du[2]=I+infection-recovery
    du[3]=R+recovery
    du[4]=Y+infection
end

function simulate()
    parms = [2.62110617498984, 0.5384615384615384, 0.5, 403.0, 0.1]
    seed=123
    r = Xorshifts.Xorshift128Plus(seed)
    tf = 540
    nsims = 1000
    yvec = Vector{Int64}(nsims)
    for i in 1:nsims
        u = [60,1,342,0]
        du = [0,0,0,0]
        for j in 1:tf
            sir(j,u,du,parms,r)
            u=du
        end
        yvec[i]=du[4]
    end
    convert(Float64,sum(yvec))/convert(Float64,nsims)
end

simulate()
@time m=simulate()
println(m)

sir.nim

import random, random.xorshift, random.common
import math
import times

proc randbn(n: int64; p: float64; rng: var RNG): int64 =
  let logq: float64 = math.ln(1.0-p)
  var x: int64 = 0
  var sum: float64 = 0.0
  while true:
    sum += math.ln(rng.random())/float64(n - x)
    if sum < logq:
      return x
    x = x+1

proc sir(t: int64; u: array[4,int64]; du: var array[4,int64]; parms: array[5,float64]; r: var RNG): int64 {.discardable.} =
  let S=u[0]
  let I=u[1]
  let R=u[2]
  let Y=u[3]
  let beta=parms[0]
  let gamma=parms[1]
  let iota=parms[2]
  let N=parms[3]
  let dt=parms[4]
  let lambd=beta*(float64(I)+iota)/N
  let ifrac=1.0-exp(-lambd*dt)
  let rfrac=1.0-exp(-gamma*dt)
  let infection=int(randbn(S,ifrac,r))
  let recovery=int(randbn(I,rfrac,r))
  du[0]=S-infection
  du[1]=I+infection-recovery
  du[2]=R+recovery
  du[3]=Y+infection
  result = 1

proc simulate(): float64 =
  let parms: array[5,float64] = [2.62110617498984, 0.5384615384615384, 0.5, 403.0, 0.1]
  var seed: uint64 = 123
  var r = initXorshift128Plus(seed)
  let tf: int64 = 540
  let nsims: int64 = 1000
  var yvec: array[1000,int64]
  for i in 1..nsims:
    var u: array[4,int64] = [60.int64,1,342,0]
    var du: array[4,int64] = [0.int64,0,0,0]
    for j in 1..tf:
      discard sir(j,u,du,parms,r)
      u=du
    yvec[i-1] = u[3]
  result = float64(sum(yvec))/float64(nsims)

let t0=cpuTime()
let m = simulate()
let t1=cpuTime()
echo t1-t0
echo m

sir_bi.R

library(rbi)
sir.bi <- "
model sir {
  const N = 403
  const timestep = 0.1
  state S, I, R, Y
  param bet, gamm, iota
  noise infection, recovery
  sub parameter {
    bet <- 2.62110617498984
    gamm <- 0.5384615384615384
    iota <- 0.5
  }
  sub initial {
    S <- 60
    I <- 1
    R <- 342
    Y <- 0
  }
  sub transition (delta=timestep) {
    inline lambd = bet*(I+iota)/N
    inline ifrac = 1.0-exp(-lambd*timestep)
    inline rfrac = 1.0-exp(-gamm*timestep)
    infection ~ binomial(S,ifrac)
    recovery ~ binomial(I,rfrac)
    S <- S - infection
    I <- I + infection - recovery
    R <- R + recovery
    Y <- Y + infection
  }
}
"
sir.bi.model <- bi_model(lines=sir.bi)
sir.bi.obj <- libbi(model=sir.bi.model)
sir.bi.sample <- sample(sir.bi.obj,
                        target="joint",
                        end_time=54.0,
                        nsamples=1000,
                        seed=1)
sir.bi.sample
sir.bi.results <- bi_read(sir.bi.sample$output_file_name)
mean(sir.bi.results$Y$value)

sir_opt.jl

using RandomNumbers

@inline @fastmath function randbn(n,p,rng)
    q = 1.0 - p
    s = p/q
    a = (n+1)*s
    r = exp(n*log(q))
    x = 0
    u = rand(rng)
    while true
        if (u < r)
            return x
        end
        u -= r
        x += 1
        r *= (a/x)-s
    end
end

@inline @fastmath function sir(t, u, parms, rng)
    (S, I, R, Y) = u
    (β, γ, ι, N, δt) = parms
    λ = β * (I + ι) / N
    ifrac = 1.0 - exp(-λ * δt)
    rfrac = 1.0 - exp(-γ * δt)
    infection = randbn(S, ifrac, rng)
    recovery = randbn(I, rfrac, rng)
    return (S - infection, I + infection - recovery, R + recovery, Y + infection)
end

function simulate()
    parms = (2.62110617498984, 0.5384615384615384, 0.5, 403.0, 0.1)
    seed = 123
    r = Xorshifts.Xorshift128Plus(seed)
    tf = 540
    nsims = 1000
    yvec = Vector{Int64}(nsims)
    u0 = (60, 1, 342, 0)
    for i in 1:nsims
        u = u0
        for j in 1:tf
            u = sir(j, u, parms, r)
        end
        yvec[i] = u[4]
    end
    return sum(yvec) / nsims
end

simulate()
@time m=simulate()
println(m)

sir_opt.nim

import random, random.xorshift, random.common
import math
import times

type
  Params = tuple[beta,gamma,iota,N,dt: float64]
  State = tuple[S,I,R,Y: int]

proc randbn(n: int; p: float64; rng: var RNG): int {.inline.} =
  let q: float64 = 1.0-p
  let s: float64 = p/q
  let a: float64 = (float64(n)+1.0)*s
  var r: float64 = math.exp(float64(n)*math.ln(q))
  var x: int = 0
  var u: float64 = rng.random()
  while true:
    if u < r:
      return x
    u = u-r
    x = x+1
    r = r*((a/float64(x))-s)

proc sir(t: int; u: State; du: var State; parms: Params; r: var RNG): int {.inline.} =
  let (S,I,R,Y) = u
  let (beta,gamma,iota,N,dt) = parms
  let lambd=beta*(float64(I)+iota)/N
  let ifrac=1.0-exp(-lambd*dt)
  let rfrac=1.0-exp(-gamma*dt)
  let infection=int(randbn(S,ifrac,r))
  let recovery=int(randbn(I,rfrac,r))
  du.S=S-infection
  du.I=I+infection-recovery
  du.R=R+recovery
  du.Y=Y+infection
  result = 1

proc simulate(): float64 =
  let parms: Params = (2.62110617498984, 0.5384615384615384, 0.5, 403.0, 0.1)
  var seed: array[2,uint64] = [123.uint64,456]
  var r = initXorshift128Plus(seed)
  let tf: int = 540
  let nsims: int = 1000
  var yvec: array[1000,int]
  for i in 1..nsims:
    var u: State = (60,1,342,0)
    var du: State = (0,0,0,0)
    for j in 1..tf:
      discard sir(j,u,du,parms,r)
      u=du
    yvec[i-1] = u[3]
  result = float64(sum(yvec))/float64(nsims)

when isMainModule:
  let t0 = cpuTime()
  let m = simulate()
  let t1 = cpuTime()
  echo t1 - t0
  echo m

sir_pomp.R

library(pomp)
sir.step <- "
  double lambd = bet * (I+iota) / N;
  double ifrac = 1.0 - exp(-lambd*dt);
  double rfrac = 1.0 - exp(-gamm*dt);
  double infection = rbinom(S, ifrac);
  double recovery = rbinom(I, rfrac);
  S += -infection;
  I += infection - recovery;
  R += recovery;
  Y += infection;
"
rmeas <- "
  Z = Y;
"
pomp(
  data=data.frame(time=seq(0,54,by=0.1),Z=rep(0,541)),
  times="time",
  t0=0,
  rmeasure=Csnippet(rmeas),
  rprocess=euler.sim(
    step.fun=Csnippet(sir.step),
    delta.t=0.1
  ),
  statenames=c("S","I","R","Y"),
  paramnames=c("bet","gamm","iota","N"),
  initializer=function(params, t0, ...) {
    x0 <- c(S=60,I=1,R=342,Y=0)
    x0
  },
  params=c(bet=2.62110617498984, gamm=0.5384615384615384, iota=0.5, N=403.0)
) -> sir

set.seed(1)
system.time(replicate(1000,simulate(sir)))

@ChrisRackauckas

There is random sampling from binomials available through Distributions.jl, which should probably be preferred here.
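
For reference, a minimal sketch of that approach (this uses Distributions.jl's Binomial with Julia's default global RNG; the n and p values are made up for illustration):

    using Distributions

    # Draw one binomial variate; by default this samples from Julia's global
    # Mersenne Twister rather than the Xorshift generator used elsewhere here.
    infection = rand(Binomial(60, 0.25))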

@mratsim commented Dec 2, 2017

Here is the Nim code that should be equivalent to the Julia one (no printing of the array result, and a first warmup call to simulate to make sure the CPU scheduler is in max performance mode).

Note that Nim's int defaults to 64-bit on this platform, so int and int64 are equivalent here.

import random, random.xorshift, random.common
import math
import times

proc randbn(n: int; p: float64; rng: var RNG): int =
  let logq: float64 = ln(1.0-p)
  var x = 0
  var sum: float64 = 0.0
  while true:
    sum += ln(rng.random())/float64(n - x)
    if sum < log_q:
      return x
    x=x+1

proc sir(t:int; u:array[4,int]; du: var array[4,int]; parms: array[5,float64]; r: var RNG): int =
    let
      S = u[0]
      I = u[1]
      R = u[2]
      Y = u[3]
      beta = parms[0]
      gamma = parms[1]
      iota = parms[2]
      N = parms[3]
      dt = parms[4]
      lambd = beta*(float64(I)+iota)/N
      ifrac = 1.0 - exp(-lambd*dt)
      rfrac = 1.0 - exp(-gamma*dt)
      infection = int(randbn(S,ifrac,r))
      recovery = int(randbn(I,rfrac,r))
    du[0] = S - infection
    du[1] = I + infection - recovery
    du[2] = R + recovery
    du[3] = Y + infection
    result = 1

proc simulate():float64 =
    let parms: array[5,float64] = [2.62110617498984, 0.5384615384615384, 0.5, 403.0, 0.1]
    var seed: uint = 123
    var r = initXorshift128Plus(seed)
    let tf = 540
    let nsims = 1000
    var yvec: array[1000,int]

    for i in 1..nsims:
      var u = [60,1,342,0]
      var du = [0,0,0,0]
      for j in 1 .. tf:
        discard sir(j,u,du,parms,r)
        u=du
      yvec[i-1] = u[3]
    # echo yvec ## This is very slow
    result = float64(sum(yvec))/float64(nsims)

# Warmup: necessary if cpu is on "ondemand" governor
discard simulate()

let t0 = cpuTime()
let m = simulate()
let t1 = cpuTime()
echo t1-t0
echo m

Speed for both on my i5-5257U (3.8 GHz mobile Broadwell) is 0.056-0.063s.

@mratsim commented Dec 2, 2017

I can cut the Nim code down to 0.04s by using OpenMP:

proc simulate():float64 =
    let parms: array[5,float64] = [2.62110617498984, 0.5384615384615384, 0.5, 403.0, 0.1]
    var seed: uint = 123
    var r = initXorshift128Plus(seed)
    let tf = 540
    let nsims = 1000
    var yvec: array[1000,int]

    for i in 1 || nsims: # OpenMP parallel loop --> compile with --passC:-fopenmp --passL:-fopenmp
      var u = [60,1,342,0]
      var du = [0,0,0,0]
      for j in 1 .. tf:
        discard sir(j,u,du,parms,r)
        u=du
      yvec[i-1] = u[3]
    # echo yvec ## This is very slow
    result = float64(sum(yvec))/float64(nsims)

@sdwfrost (author) commented Dec 3, 2017

Dear All,

Thanks for all the comments, that's really helpful. I'll go through them in turn.

  1. @hoehleatsu @ChrisRackauckas There is a higher-level function for sampling from binomials in Julia - see this gist for the 'standard' way of doing it. However, you can only use Distributions.jl with the standard random number generator (a double precision Mersenne Twister), which also happens to not be thread-safe. So, instead I use RandomNumbers.jl, which does have thread-safe generators, including Xorshift ones, that are much faster. This is at the cost of not being able to access the standard binomial draw in Distributions.jl - see this thread. The implementation for the random binomial is in LightGraphs.jl.

  2. @DNF2 @giordano The Julia code can be cleaned up a lot - see this gist again for tuple unpacking. I wanted the code to be as comparable as possible between Nim and Julia, so I brought the Julia code closer to Nim, which doesn't do array unpacking out of the box (see here). I also wanted to keep arrays rather than tuples for the state variables, as I wanted to convert the code to operate on arrays of state variables to do things in parallel. Thanks for the tip about StaticArrays!

  3. @mratsim Thanks for the code cleanup - however, on my laptop, I still get c. 0.2s for the Nim code vs. 0.06s for the Julia code.

  4. @mratsim Parallelisation! This is easy in both Julia (using Threads.@threads in front of the loop) and Nim. I haven't determined the best way to parallelise yet.

  5. @hoehleatsu R's nmath module does use a different algorithm (actually a set of algorithms) for drawing random binomial numbers. I'll try some porting and look around for some faster alternatives.

  6. @DNF2 I want to keep changing the state variables, du, in-place to avoid allocations.

@sdwfrost (author) commented Dec 3, 2017

FYI, here are comparisons of drawing 10^6 random binomials in Nim and Julia; Julia is c. 4x faster in this case:

bntest.nim:

import random, random.xorshift, random.common
import math
import times

proc randbn(n: int; p: float64; rng: var RNG): int =
  let logq: float64 = ln(1.0-p)
  var x = 0
  var sum: float64 = 0.0
  while true:
    sum += ln(rng.random())/float64(n - x)
    if sum < log_q:
      return x
    x=x+1

proc simulate():int {.discardable.} =
  var seed: uint = 123
  var r = initXorshift128Plus(seed)
  var s = 0
  for i in 0..999999:
    s = randbn(100,0.05,r)
  result=s

let t0 = cpuTime()
let m = simulate()
let t1 = cpuTime()
echo t1-t0

bntest.jl:

using RandomNumbers

function randbn(n, p, rng)
    log_q = log(1.0 - p)
    x = 0
    sum = 0.0
    while (sum += log(rand(rng)) / (n - x)) >= log_q
        x += 1
    end
    return x
end

function simulate()
    seed=123
    r = Xorshifts.Xorshift128Plus(seed)
    s = 0
    for i in 1:1000000
        s = randbn(100,0.05,r)
    end
    return s
end

simulate()
@time simulate()

@sdwfrost (author) commented Dec 3, 2017

As it turns out, the benchmarks were much closer on my Mac Pro, so I've edited the readme accordingly. Still trying to work out why there was a difference on my laptop...

@DNF2 commented Dec 3, 2017

@DNF2. I want to keep changing the state variables, du, in-place to avoid allocations.

This is a misunderstanding. My code dramatically reduces allocations, exactly by making the state variables into tuples. Check it out for yourself! :) It also does not preclude parallelization, e.g. by @threads.

@sdwfrost (author) commented Dec 4, 2017

That's interesting; I thought I would avoid allocations, even using dynamic arrays, by changing things in place rather than using immutable tuples. I also see that SVector (in StaticArrays) uses a tuple under the hood.

@ChrisRackauckas commented Dec 4, 2017

I created a version which uses stack-allocated variables instead of dynamic arrays.

using RandomNumbers
using StaticArrays

@inline @fastmath function randbn(n, p, rng)
    log_q = log(1.0 - p)
    x = 0
    while true
        log_q -= log(rand(rng)) / (n - x)
        log_q > 0.0 && break
        x += 1
    end
    return x
end

@inline @fastmath function sir(t,u,parms,rng)
    S=u[1]
    I=u[2]
    R=u[3]
    Y=u[4]
    β=parms[1]
    γ=parms[2]
    ι=parms[3]
    N=parms[4]
    δt=parms[5]
    λ=β*(I+ι)/N
    ifrac=1.0-exp(-λ*δt)
    rfrac=1.0-exp(-γ*δt)
    infection=randbn(S,ifrac,rng)
    recovery=randbn(I,rfrac,rng)
    @SVector [S-infection,I+infection-recovery,R+recovery,Y+infection]
end

function simulate()
    parms = @SVector [2.62110617498984, 0.5384615384615384, 0.5, 403.0, 0.1]
    seed=123
    r = Xorshifts.Xorshift128Plus(seed)
    tf = 540
    nsims = 1000
    yvec = Vector{Int64}(nsims)
    for i in 1:nsims
        u = @SVector [60,1,342,0]
        for j in 1:tf
            u = sir(j,u,parms,r)
        end
        yvec[i]=u[4]
    end
    sum(yvec)/nsims
end

simulate()
@time m=simulate()
println(m)

I get down to ~0.045 seconds now from your original ~0.060 seconds. Getting to this level also required inlining, because things were going fast enough that call overhead mattered. Chopping off 25% should make it easily beat C here.

@ChrisRackauckas

@hoehleatsu @ChrisRackauckas There is a higher-level function for sampling from binomials in Julia - see this gist for the 'standard' way of doing it. However, you can only use Distributions.jl with the standard random number generator (a double precision Mersenne Twister), which also happens to not be thread-safe. So, instead I use RandomNumbers.jl, which does have thread-safe generators, including Xorshift ones, that are much faster.

Makes sense. Yes, RandomNumbers.jl is a great library and Distributions is a dinosaur that needs to be updated to support the more generic code people use these days.

@ChrisRackauckas

For fun I multithreaded it:

function simulate()
    parms = @SVector [2.62110617498984, 0.5384615384615384, 0.5, 403.0, 0.1]
    seed=123
    r = Xorshifts.Xorshift128Plus(seed)
    tf = 540
    nsims = 1000
    yvec = Vector{Int64}(nsims)
    Threads.@threads for i in 1:nsims
        u = @SVector [60,1,342,0]
        for j in 1:tf
            u = sir(j,u,parms,r)
        end
        yvec[i]=u[4]
    end
    sum(yvec)/nsims
end

simulate()
@time m=simulate()
println(m)

and it should be the most efficient way to do it now that it doesn't allocate. I'm now at 0.025651 seconds (8 allocations: 8.234 KiB).

@ChrisRackauckas

I should note, though, that these tricks I did at the end are problem-dependent. Static vectors in this kind of usage will stack-allocate, and thus avoid heap allocations, when the immutables are small enough (<16 elements?). Thus for many small problems they are the right choice. This gets a little tricky to support in generic algorithms, because one case needs to be in-place with mutables while the other needs to be out-of-place with immutables.

For DiffEq we detect at problem construction time (https://github.com/JuliaDiffEq/DiffEqBase.jl/blob/master/src/problems/ode_problems.jl#L27) whether your function is f(t,u) or f(t,u,du) for pre-allocation. Then we have a fast path for each algorithm on equations on immutables/SVectors/etc. (https://github.com/JuliaDiffEq/OrdinaryDiffEq.jl/blob/master/src/integrators/low_order_rk_integrators.jl#L12-L30) vs mutables (https://github.com/JuliaDiffEq/OrdinaryDiffEq.jl/blob/master/src/integrators/low_order_rk_integrators.jl#L42-L60). It's a bit of work but it's nifty. The result is hyper-specialization of the small commonly solved problems, and scaling to large problems.

One other thing to note is that a Vector{SVector} is bit-equivalent to a matrix where the SVectors are columns (since an SVector is immutable and thus inlines the numbers), so it can be converted to a matrix via reinterpret without allocating. These kinds of tricks can be used by the compiler as well, and this pull request (JuliaLang/julia#24362) will enable a lot more constant propagation and optimisation of static constants. So the JIT, paired with this constant propagation, should make Julia even faster on this sort of problem (and able to JIT-optimise some things that would be hard to do with static compilation).
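
A minimal sketch of the reinterpret trick (the syntax below is for Julia 1.6+; the 0.6-era API used elsewhere in this gist differed):

    using StaticArrays

    v = [SVector(1.0, 2.0), SVector(3.0, 4.0)]
    # View the Vector{SVector{2,Float64}} as a 2x2 matrix without copying;
    # each SVector becomes a column.
    m = reinterpret(reshape, Float64, v)
    @assert m[1, 2] == 3.0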

Small examples like this are easy to get right; it's scaling to library code that can get difficult. But Julia's JIT strategy plus IPO should handle a lot of that. I haven't found anything else that can build packages so well. Cheers!

@DNF2 commented Dec 4, 2017

You can save allocations by doing:

function simulate()
    parms = @SVector [2.62110617498984, 0.5384615384615384, 0.5, 403.0, 0.1]
    seed=123
    r = Xorshifts.Xorshift128Plus(seed)
    tf = 540
    nsims = 1000
    yvec = zeros(Int64, Threads.nthreads())  # <- shorter vector
    Threads.@threads for i in 1:nsims
        u = @SVector [60,1,342,0]
        for j in 1:tf
            u = sir(j,u,parms,r)
        end
        yvec[Threads.threadid()] += u[4]
    end
    return sum(yvec)/nsims
end

But you might as well use tuples instead of StaticArrays and save a dependency (and have cleaner code), with no performance loss.

(Note, by the way, that even if you fix the random seed, you get different results each time with threading, since the execution order is non-deterministic.)

@sdwfrost (author) commented Dec 4, 2017

Thanks @DNF2 @ChrisRackauckas

I was trying to use the DiffEq interface, for just the sort of specialisation mentioned, but things aren't working - see this gist.

@sdwfrost (author) commented Dec 4, 2017

Harmonising across comments, here's what I have for Julia. The @inline and @fastmath macros help a lot too.

using RandomNumbers

@inline @fastmath function randbn(n,p,rng)
    q = 1.0 - p
    s = p/q
    a = (n+1)*s
    r = exp(n*log(q))
    x = 0
    u = rand(rng)
    while true
        if (u < r)
            return x
        end
        u -= r
        x += 1
        r *= (a/x)-s
    end
end

@inline @fastmath function sir(t, u, parms, rng)
    (S, I, R, Y) = u
    (β, γ, ι, N, δt) = parms
    λ = β * (I + ι) / N
    ifrac = 1.0 - exp(-λ * δt)
    rfrac = 1.0 - exp(-γ * δt)
    infection = randbn(S, ifrac, rng)
    recovery = randbn(I, rfrac, rng)
    return (S - infection, I + infection - recovery, R + recovery, Y + infection)
end

function simulate()
    parms = (2.62110617498984, 0.5384615384615384, 0.5, 403.0, 0.1)
    seed = 123
    r = Xorshifts.Xorshift128Plus(seed)
    tf = 540
    nsims = 1000
    yvec = Vector{Int64}(nsims)
    u0 = (60, 1, 342, 0)
    for i in 1:nsims
        u = u0
        for j in 1:tf
            u = sir(j, u, parms, r)
        end
        yvec[i] = u[4]
    end
    return sum(yvec) / nsims
end

simulate()
@time m=simulate()
println(m)

@sdwfrost (author) commented Dec 4, 2017

Here is my threaded version, with only a minor change to the above; however, it tends to be slower with multiple threads...

using RandomNumbers

@inline @fastmath function randbn(n,p,rng)
    q = 1.0 - p
    s = p/q
    a = (n+1)*s
    r = exp(n*log(q))
    x = 0
    u = rand(rng)
    while true
        if (u < r)
            return x
        end
        u -= r
        x += 1
        r *= (a/x)-s
    end
end

@inline @fastmath function sir(t, u, parms, rng)
    (S, I, R, Y) = u
    (β, γ, ι, N, δt) = parms
    λ = β * (I + ι) / N
    ifrac = 1.0 - exp(-λ * δt)
    rfrac = 1.0 - exp(-γ * δt)
    infection = randbn(S, ifrac, rng)
    recovery = randbn(I, rfrac, rng)
    return (S - infection, I + infection - recovery, R + recovery, Y + infection)
end

function simulate()
    parms = (2.62110617498984, 0.5384615384615384, 0.5, 403.0, 0.1)
    seed = 123
    r = Xorshifts.Xorshift128Plus(seed)
    tf = 540
    nsims = 10000
    yvec = Vector{Int64}(nsims)
    u0 = (60, 1, 342, 0)
    Threads.@threads for i in 1:nsims
        u = u0
        for j in 1:tf
            u = sir(j, u, parms, r)
        end
        yvec[i] = u[4]
    end
    return sum(yvec) / nsims
end

simulate()
@time m=simulate()
println(m)
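
One suspect (not verified here) is that all threads share the single Xorshift128Plus state r, so the threads contend on, and race over, the same two words of RNG state. A hedged sketch of one possible fix, reusing sir and randbn from above and giving each thread its own generator (the per-thread seeds are arbitrary, and the vector constructor follows the 0.6-era syntax used above):

    using RandomNumbers

    function simulate_threaded()
        parms = (2.62110617498984, 0.5384615384615384, 0.5, 403.0, 0.1)
        tf = 540
        nsims = 10000
        # One independent generator per thread, each with its own seed.
        rngs = [Xorshifts.Xorshift128Plus(123 + i) for i in 1:Threads.nthreads()]
        yvec = Vector{Int64}(nsims)
        u0 = (60, 1, 342, 0)
        Threads.@threads for i in 1:nsims
            r = rngs[Threads.threadid()]
            u = u0
            for j in 1:tf
                u = sir(j, u, parms, r)
            end
            yvec[i] = u[4]
        end
        return sum(yvec) / nsims
    end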

@DNF2 commented Dec 4, 2017

Have you made sure to set the environment variable for the number of threads? (https://docs.julialang.org/en/stable/manual/parallel-computing/#Multi-Threading-(Experimental)-1) On my computer I get a >2x speed-up with 4 threads.

And, as mentioned, you don't need to make yvec that long; it's enough for it to have length equal to the number of threads (and for the single-threaded case it can just be a scalar). Otherwise it can scale quite badly if you, say, want to run it 100 million times.

@ChrisRackauckas

    (S, I, R, Y) = u
    (β, γ, ι, N, δt) = parms

is nicer, but I measured a penalty from it.

@sdwfrost (author) commented Dec 4, 2017

Yes, sadly tuple unpacking comes with a penalty...

@sdwfrost (author) commented Dec 4, 2017

@DNF2 Yes, I'm setting the threads OK, and I've tried on Linux and OSX.

@sdwfrost (author)

Here's my tidied Nim code, which, on my Mac Pro, runs faster than the Julia version above (0.035s vs 0.045s for 1000 simulations). Compile with:

nim c -d:release --passC=-flto sir

import random, random.xorshift, random.common
import math
import times

type
  Params = tuple[beta,gamma,iota,N,dt: float64]
  State = tuple[S,I,R,Y: int]

proc randbn(n: int; p: float64; rng: var RNG): int {.inline.} =
  let q:float64 = 1.0-p
  let s:float64 = p/q
  let a:float64 = (float64(n)+1.0)*s
  var r:float64 = math.exp(float64(n)*math.ln(q))
  var x:int = 0
  var u:float64 = rng.random()
  while true:
    if u < r:
      return x
    u = u-r
    x = x+1
    r = r*((a/float64(x))-s)

proc sir(t:int; u: State;du: var State; parms: Params; r: var RNG): int {.inline.}=
  let (S,I,R,Y) = u
  let (beta,gamma,iota,N,dt)=parms
  let lambd=beta*(float64(I)+iota)/N
  let ifrac=1.0-exp(-lambd*dt)
  let rfrac=1.0-exp(-gamma*dt)
  let infection=int(randbn(S,ifrac,r))
  let recovery=int(randbn(I,rfrac,r))
  du.S=S-infection
  du.I=I+infection-recovery
  du.R=R+recovery
  du.Y=Y+infection
  result = 1

proc simulate():float64 =
  let parms:Params = (2.62110617498984, 0.5384615384615384, 0.5, 403.0, 0.1)
  var seed:array[2,uint64] = [123.uint64,456]
  var r = initXorshift128Plus(seed)
  let tf:int=540
  let nsims:int = 1000
  var yvec:array[1000,int]

  for i in 1..nsims:
    var u:State = (60,1,342,0)
    var du:State = (0,0,0,0)
    for j in 1..tf:
      discard sir(j,u,du,parms,r)
      u=du
    yvec[i-1] = u[3]
  result = float64(sum(yvec))/float64(nsims)

let t0 = cpuTime()
let m = simulate()
let t1 = cpuTime()
echo t1 - t0
echo m

@mratsim commented Dec 14, 2017

What's the result without -flto? -flto tends to blow up compile times, unfortunately :/

@sdwfrost (author) commented Dec 29, 2017

Results with and without -flto and -ffast-math:

nim c -d:release sir
0.04359
nim c -d:release --passC="-flto" sir
0.035472
nim c -d:release --passC="-ffast-math" sir
0.038581
nim c -d:release --passC="-flto -ffast-math" sir
0.031059

The above are with clang. Here's what I get with Homebrew gcc-6.

nim c --cc:gcc --gcc.exe=gcc-6 -d:release --passC="-flto -ffast-math" sir
0.02515

@sebmenard commented Jul 26, 2019

I was interested in comparing actual performance between Nim and Crystal.

I cleaned up the original version a bit and fixed the errors:

import math, random, times

type
  Params = tuple[beta, gamma, iota, N, dt: float64]
  State = tuple[S, I, R, Y: int]

proc randbn(n: int, p: float64, rng: var Rand): int {.inline.} =
  let q = 1.0 - p
  let s = p / q
  let a = (float64(n) + 1.0) * s
  var r = math.exp(float64(n) * math.ln(q))
  var x = 0
  var u = rng.rand(1.0)
  while true:
    if u < r:
      return x
    u -= r
    inc x
    r *= (a / float64(x)) - s

proc sir(t: int, u: State, du: var State, parms: Params, r: var Rand) {.
  inline.} =
  let lambd = parms.beta * (float64(u.I) + parms.iota) / parms.N
  let ifrac = 1.0 - math.exp(-lambd * parms.dt)
  let rfrac = 1.0 - math.exp(-parms.gamma * parms.dt)
  let infection = int(randbn(u.S, ifrac, r))
  let recovery = int(randbn(u.I, rfrac, r))
  du.S = u.S - infection
  du.I = u.I + infection - recovery
  du.R = u.R + recovery
  du.Y = u.Y + infection

proc simulate(): float64 =
  let parms: Params = (2.62110617498984, 0.5384615384615384, 0.5, 403.0, 0.1)
  let seed = 123
  var r = initRand(seed)
  let tf = 540
  let nsims = 1000
  var yvec: array[1000, int]

  for i in 1 .. nsims:
    var u: State = (60, 1, 342, 0)
    var du: State = (0, 0, 0, 0)
    for j in 1 .. tf:
      sir(j, u, du, parms, r)
      u = du
    yvec[i - 1] = u[3]
  result = float64(sum(yvec)) / float64(nsims)

let t0 = cpuTime()
let m = simulate()
let t1 = cpuTime()
echo t1 - t0
echo m

Translated version in Crystal using NamedTuples:

def randbn(n, p, ra)
  q = 1.0 - p
  s = p / q
  a = (n.to_f + 1.0) * s
  r = Math.exp(n.to_f * Math.log(q))
  x = 0
  u = ra.rand
  while true
    if u < r
      return x
    end
    u -= r
    x += 1
    r *= (a / x.to_f) - s
  end
end

def sir(t, u, du, parms, ra)
  lambd = parms[:beta] * (u[:i].to_f + parms[:iota]) / parms[:n]
  ifrac = 1.0 - Math.exp(-lambd * parms[:dt])
  rfrac = 1.0 - Math.exp(-parms[:gamma] * parms[:dt])
  infection = randbn(u[:s], ifrac, ra).to_i
  recovery = randbn(u[:i], rfrac, ra).to_i
  {s: u[:s] - infection, i: u[:i] + infection - recovery, r: u[:r] + recovery, y: u[:y] + infection}
end

def simulate
  parms = {beta: 2.62110617498984, gamma: 0.5384615384615384, iota: 0.5, n: 403.0, dt: 0.1}
  seed = 123
  ra = Random.new(seed)
  tf = 540
  nsims = 1000
  yvec = StaticArray(Int32, 1000).new(0)

  1.upto(nsims) do |i|
    u = {s: 60, i: 1, r: 342, y: 0}
    du = {s: 0, i: 0, r: 0, y: 0}
    1.upto(tf) do |j|
      du = sir(j, u, du, parms, ra)
      u = du
    end
    yvec[i - 1] = u[:y]
  end

  yvec.sum.to_f / nsims.to_f
end

m = 0.0

elapsed_time = Time.measure do
  m = simulate()
end

puts elapsed_time
puts m
$ nim -v
Nim Compiler Version 0.20.2 [Linux: amd64]
Compiled at 2019-07-17
Copyright (c) 2006-2019 by Andreas Rumpf

git hash: 88a0edba4b1a3d535b54336fd589746add54e937
active boot switches: -d:release
$ nim c -d:release -r sirfix.nim
0.031255943
20.817

Bonus:

$ nim c -d:release --newruntime -r sirfix.nim
0.031317962
20.817

To compare with Crystal on an equal footing:

$ nim c --cc:clang --gc:boehm -d:release -r sirfix.nim
0.029965718
20.817
$ crystal -v
Crystal 0.29.0 (2019-06-06)

LLVM: 6.0.1
Default target: x86_64-unknown-linux-gnu
$ crystal sirtup.cr --release
00:00:00.031871625
20.66

@sdwfrost (author)

Thanks @gcat432! I keep meaning to add more benchmarks to this (Go, hey, even V...). I haven't had the chance to play much with Crystal.

@sdwfrost (author)

Here's an example in V:

import math
import time

fn clz64(v u64) u32 {
    return C.__builtin_clzll(v)
}

fn xorshift128plus(rng mut []u64) u64 {
    mut x := rng[0]
    y := rng[1]
    rng[0] = y
    x ^= x << 23
    rng[1] = u64(x) ^ u64(y) ^ u64(x >> u64(17)) ^ u64(u64(y) >> u64(26))
    z := rng[1] + y
    return z
}

fn randunif(rng mut []u64) f64 {
    mut significand := xorshift128plus(mut rng)
    mut exponent := int(-64)
    for significand == 0 {
        exponent -= 64
        if exponent < -1074 {
            return f64(0.0)
        }
        significand = xorshift128plus(mut rng)
    }
    shift := clz64(significand)
    if shift != 0 {
        exponent -= int(shift)
        significand <<= u64(shift)
        significand |= u64(significand >> u64(64 - int(shift)))
    }
    significand |= u64(1)
    return C.ldexp(f64(significand),exponent)
}

fn randbn(n i64, p f64, rng mut []u64 ) i64 {
    log_q := math.log(1.0-p)
    mut x := i64(0)
    mut sum := f64(0.0)
    for sum >= log_q {
        sum += math.log(randunif(mut rng))/f64(n-x)
        if sum < log_q {
            break
        }
        x += 1
    }
    return x
}

fn sir(u []i64, du mut []i64, parms []f64, rng mut []u64) {
    s := u[0]
    i := u[1]
    r := u[2]
    y := u[3]
    beta := parms[0]
    gamma := parms[1]
    iota := parms[2]
    n := parms[3]
    dt := parms[4]
    lambd := beta*(f64(i)+iota)/n
    ifrac := 1.0-math.exp(-lambd*dt)
    rfrac := 1.0-math.exp(-gamma*dt)
    infection := randbn(s, ifrac, mut rng)
    recovery := randbn(i, rfrac, mut rng)
    du[0] = s - infection
    du[1] = i + infection - recovery
    du[2] = r + recovery
    du[3] = y + infection
}

fn sum(y []i64) i64 {
    mut sm := i64(0)
    for x in y {
        sm += x
    }
    return sm
}

fn simulate() f64 {
    parms := [f64(2.62110617498984), 0.5384615384615384, 0.5, 403.0, 0.1]
    tf := 540
    nsims := 1000
    mut yvec := [i64(0)].repeat(nsims)
    mut rng := [u64(123),456]
    mut i := 0
    mut j := 1
    mut u := [i64(0),0,0,0]
    mut du := [i64(0),0,0,0]
    i = 0
    for i < nsims {
        u = [i64(60),1,342,0]
        du = [i64(0),0,0,0]
        j = 1
        for j <= tf {
            sir(u, mut du, parms, mut rng)
            u = du
            j++
        }
        yvec[i] = u[3]
        i++
    }
    numerator := sum(yvec)
    result := f64(numerator)/f64(nsims)
    return result
}

fn main(){
    t0 := time.ticks()
    m := simulate()
    t1 := time.ticks()
    tdiff := f32(t1-t0)/1000.0
    println('$tdiff')
    println('$m')
}

@sdwfrost (author)

And a vanilla C version.

// gcc -O3 -o sir sir.c -lm
#include <math.h>
#include <limits.h>
#include <stdio.h>
#include <stdint.h>

// The following is used to turn the uint64 random numbers into a double
// between 0.0 and 1.0

int clz64(uint64_t v) {
  const unsigned char y[] = {
     0, 47,  1, 56, 48, 27,  2, 60, 57, 49, 41, 37, 28, 16,  3, 61,
    54, 58, 35, 52, 50, 42, 21, 44, 38, 32, 29, 23, 17, 11,  4, 62,
    46, 55, 26, 59, 40, 36, 15, 53, 34, 51, 20, 43, 31, 22, 10, 45,
    25, 39, 14, 33, 19, 30,  9, 24, 13, 18,  8, 12,  7,  6,  5, 63
  };

  v |= v >> 1;
  v |= v >> 2;
  v |= v >> 4;
  v |= v >> 8;
  v |= v >> 16;
  v |= v >> 32;

  return ((sizeof(uint64_t) * CHAR_BIT) - 1) -
	y[(uint64_t)(v * 0x03F79D71B4CB0A89ULL) >> 58];
}

double random_real(uint64_t x)
{
	int exponent = -64;
	uint64_t significand;
	unsigned shift;
	while ((significand = x) == 0) {
		exponent -= 64;
		if (exponent < -1074) return 0;
	}
	shift = clz64(significand);
	if (shift != 0) {
		exponent -= shift;
		significand <<= shift;
		significand |= (x >> (64 - shift));
	}
	significand |= 1;
	return ldexp((double)significand, exponent);
}

// Random number generator
uint64_t xorshift128plus(uint64_t *rng) {
	uint64_t x = rng[0];
	uint64_t const y = rng[1];
	rng[0] = y;
	x ^= x << 23; // a
	rng[1] = x ^ y ^ (x >> 17) ^ (y >> 26); // b, c
	uint64_t z = rng[1] + y;
  return(z);
}

double randunif(uint64_t *rng){
	uint64_t z = xorshift128plus(rng);
	double dz = random_real(z);
	return dz;
}

int64_t randbn(int64_t n, double p, uint64_t *rng)
{
    const double log_q = log(1.0 - p);
    int64_t x = 0;
    double sum = 0.0;
    while(1){
        sum += log(randunif(rng))/(double)(n - x);
        if(sum < log_q) break;
        x += 1;
    }
    return(x);
}

void sir(int64_t t, int64_t *u, int64_t *du, const double *parms, uint64_t *rng)
{
    int64_t S=u[0];
    int64_t I=u[1];
    int64_t R=u[2];
    int64_t Y=u[3];
    double beta=parms[0];
    double gamma=parms[1];
    double iota=parms[2];
    double N=parms[3];
    double dt=parms[4];
    double lambd=beta*((double)(I)+iota)/N;
    double ifrac=1.0-exp(-lambd*dt);
    double rfrac=1.0-exp(-gamma*dt);
    int64_t infection=randbn(S,ifrac,rng);
    int64_t recovery=randbn(I,rfrac,rng);
    du[0]=S-infection;
    du[1]=I+infection-recovery;
    du[2]=R+recovery;
    du[3]=Y+infection;
    return;
}

double simulate()
{
    double parms[5] = {2.62110617498984, 0.5384615384615384, 0.5, 403.0, 0.1};
    int64_t tf=540;
    int64_t nsims = 1000;
    int64_t yvec[1000];
    uint64_t rng[2] = {123,456};
    for(int64_t i=0;i<nsims;i++){
        int64_t u[4] = {60,1,342,0};
        int64_t du[4] = {0,0,0,0};
        for(int64_t j=1;j<=tf;j++){
           sir(j,u,du,parms,rng);
           for(int k=0;k<4;k++){
             u[k]=du[k];
           }
        }
    yvec[i] = u[3];
  }
  int64_t numerator = 0;
  for(int i=0;i<1000;i++){
      numerator += yvec[i];
  }
  double result = (double)(numerator)/(double)(nsims);
  return(result);
}

int main(int argc, char* argv[]) {
  double m=simulate();
  printf("%f\n",m);
}

@snawaz commented Aug 9, 2020

@sdwfrost,

To be fair, you should use -ffast-math for the C (and C++) programs as well. There are a couple of other things you could also use to improve the C and C++ code. I believe the C and C++ programs can be made more than 2x faster. Also, the C++ code can do the whole computation at compile time if you tweak a few things (and so it can run without taking any time!).

Not sure if you wanna incorporate them.

@sdwfrost (author) commented Aug 9, 2020

Feel free to suggest additions/changes! I'm no expert in C++.

@snawaz commented Aug 10, 2020

@sdwfrost,

I did the following things:

  • I copied the C++ code and compiled it with clang++ -O3, as you did; on my machine it took ~0.056s. I consider this the baseline speed on my machine. Now let's see how much we can improve it with minor tweaks.
  • Adding -ffast-math, and nothing else, brought it to ~0.036s.
  • Adding -march=native as well brought it to ~0.029s.
  • Using __attribute__((always_inline)) on all functions except main() and simulate() brought it to ~0.024s.

So going from 0.056s to 0.024s is more than a 2x speedup.
