Use clang -Ofast xoshiro256plusplus.c -march=native -shared -fpic
to compile.
Use gcc -Ofast xoshiro256plusplus.c -march=native -S
or clang -Ofast xoshiro256plusplus.c -march=native -emit-llvm -S
to see the generated assembly or LLVM IR, respectively, and weep.
Output of julia ./xoshiro256plusplus.jl
on my admittedly anemic machine:
julia dsfmt 1024 x UInt64
2.209 μs (0 allocations: 0 bytes)
ref impl, 1 x interleaved, 1024 x UInt64
1.579 μs (0 allocations: 0 bytes)
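
For orientation, here is a minimal sketch of the sort of scalar ("1 x interleaved") routine timed above, assuming the standard public-domain xoshiro256++ step by Blackman and Vigna. The names xoshiro256pp_state, xoshiro256pp_next and xoshiro256pp_fill are made up for illustration and are not taken from xoshiro256plusplus.c, whose actual contents may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names; not taken from xoshiro256plusplus.c. */
typedef struct { uint64_t s[4]; } xoshiro256pp_state;

/* Rotate x left by k bits (0 < k < 64). */
static inline uint64_t rotl(uint64_t x, int k) {
    return (x << k) | (x >> (64 - k));
}

/* One step of xoshiro256++: returns one output and advances the state. */
static inline uint64_t xoshiro256pp_next(xoshiro256pp_state *st) {
    uint64_t *s = st->s;
    const uint64_t result = rotl(s[0] + s[3], 23) + s[0];
    const uint64_t t = s[1] << 17;

    s[2] ^= s[0];
    s[3] ^= s[1];
    s[1] ^= s[2];
    s[0] ^= s[3];
    s[2] ^= t;
    s[3] = rotl(s[3], 45);

    return result;
}

/* Fill out[0..n) with successive outputs from a single generator stream. */
void xoshiro256pp_fill(xoshiro256pp_state *st, uint64_t *out, size_t n) {
    assert(st != NULL && (out != NULL || n == 0));
    for (size_t i = 0; i < n; i++)
        out[i] = xoshiro256pp_next(st);
}
```

Built into a shared library with the clang command above, a function shaped like xoshiro256pp_fill could presumably be called from Julia through ccall to fill a Vector{UInt64}. The state must be seeded to something other than all zeros before use.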