Comparison of the performance of FFI vs XS zeromq bindings. For FFI, the ZMQ::FFI bindings are used, first using FFI::Raw on the backend and then using FFI::Platypus. For XS, ZMQ::LibZMQ3 is used.
Comparison is done using the zeromq weather station example, first by timing wuclient.pl using the various implementations, and then by profiling wuserver.pl using Devel::NYTProf. When profiling, the server is changed to simply publish 1 million messages and exit.
The weather station example code was lightly optimized (e.g. variables are no longer declared inside loops) and modified to be more consistent across implementations.
Additionally, a more direct benchmark and comparison of FFI::Platypus vs XS xsubs is done.
C and Python implementation results are provided as a baseline for performance.
All the code that was created or modified for these benchmarks is listed at the end (C/Python wuclient/wuserver code can be found in the zmq guide).
CPU: Intel Core Quad i7-2600K CPU @ 3.40GHz
Mem: 4GB
OS: Arch Linux
ZMQ: 4.0.5
Perl: 5.20.1
ZMQ::FFI = 0.19 (FFI::Raw backend), dev (FFI::Platypus backend)
FFI::Raw = 0.32
FFI::Platypus = 0.31
ZMQ::LibZMQ3 = 1.19
I've removed all questionable optimizations (compiler flags, PDO, __builtin_expect) and the results are still pretty good, though I've seen no repeat of that 6% number. I've also hacked in a few more tests with the loop itself in C (Inline or TinyCC), Perl, or Python. However, while the results I'm getting are good, the variance is huge, possibly because the system is not really idle.
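Since the variance is the main problem here, one way to make runs comparable is to repeat each measurement several times and report the minimum and median rather than a single number. The following is only a rough sketch of that idea in Python using the standard library; it is not part of the benchmark code, and the function names (time_command, summarize) and the placeholder workload are made up for illustration. Substitute whatever command is actually being measured.

```python
import statistics
import subprocess
import sys
import time


def time_command(cmd, runs=5):
    """Run cmd `runs` times and return the wall-clock seconds of each run."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        # check=True raises if the command fails, so a broken run is not timed silently.
        subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
        samples.append(time.perf_counter() - start)
    return samples


def summarize(samples):
    """Return the minimum, median and sample standard deviation of the timings."""
    return {
        "min": min(samples),
        "median": statistics.median(samples),
        "stdev": statistics.stdev(samples) if len(samples) > 1 else 0.0,
    }


if __name__ == "__main__":
    # Hypothetical placeholder workload; replace with the real benchmark command.
    cmd = [sys.executable, "-c", "sum(range(100000))"]
    stats = summarize(time_command(cmd, runs=5))
    print("min {min:.4f}s  median {median:.4f}s  stdev {stdev:.4f}s".format(**stats))
```

The minimum is usually the least sensitive to background load on a system that is not idle, while a large standard deviation relative to the median is a hint that differences of a few percent between implementations may just be noise.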
The precise revisions I'm using are
https://gist.github.com/pipcet/1644cbd05e3300e5cec4/365ca95b5556bdbfc381b578cde88335353891d1 and
https://github.com/pipcet/FFI-Platypus/tree/27e3159ef365161128360c74d8e779f8138a2424
Example output:
There are quite a few things that are weird about that, including how all tests (including the unchanged XS test) seem to be somewhat slower than yesterday. Is it the phase of the moon?
ETA: it occurs to me that this might not be the phase of the moon, but the size of %main::, the package stash/hash, which increased as I added other tests. Which only supports the contention that we're overoptimizing, because Perl's speed matters more than how many clock cycles dispatching to ffi_call takes once we hit XS.