Comparison of the performance of FFI vs XS zeromq bindings. For FFI the ZMQ::FFI bindings are used, first with FFI::Raw on the backend and then with FFI::Platypus. For XS, ZMQ::LibZMQ3 is used.
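To make the setup concrete, here is a minimal sketch of the subscriber side through ZMQ::FFI, adapted from the zmq guide's weather station example (the endpoint and topic filter are the guide's defaults). The point is that the calling code is the same regardless of whether FFI::Raw or FFI::Platypus is doing the work underneath:

```perl
use strict;
use warnings;

use ZMQ::FFI;
use ZMQ::FFI::Constants qw(ZMQ_SUB);

# Same high-level API no matter which FFI backend is in use
my $ctx = ZMQ::FFI->new();

my $sub = $ctx->socket(ZMQ_SUB);
$sub->connect('tcp://localhost:5556');
$sub->subscribe('10001 ');    # filter updates for a single zipcode

my $update = $sub->recv();
my ($zipcode, $temperature, $relhumidity) = split ' ', $update;
```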
Comparison is done using the zeromq weather station example, first by timing wuclient.pl under the various implementations, and then by profiling wuserver.pl using Devel::NYTProf. When profiling, the server is changed to simply publish 1 million messages and exit.
The weather station example code was lightly optimized (e.g. variables are not declared inside loops) and modified to be more consistent across implementations; a sketch of the modified server loop follows.
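This is roughly what the profiled publish loop might look like (shown here with ZMQ::FFI; the endpoint and message format come from the zmq guide, and the variables are hoisted out of the loop per the optimization above):

```perl
use strict;
use warnings;

use ZMQ::FFI;
use ZMQ::FFI::Constants qw(ZMQ_PUB);

my $ctx = ZMQ::FFI->new();
my $pub = $ctx->socket(ZMQ_PUB);
$pub->bind('tcp://*:5556');

# Declare outside the loop rather than on every iteration
my ($zipcode, $temperature, $relhumidity);

# For profiling: publish 1 million updates, then exit
for (1..1_000_000) {
    $zipcode     = 10_000 + int(rand(90_000));
    $temperature = int(rand(215)) - 80;
    $relhumidity = int(rand(50)) + 10;

    $pub->send("$zipcode $temperature $relhumidity");
}
```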
Additionally, a more direct benchmark and comparison of FFI::Platypus vs XS xsubs is also done, as sketched below.
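A rough sketch of how such a head-to-head benchmark can be set up: attach a libzmq function through FFI::Platypus and pit it against the corresponding ZMQ::LibZMQ3 xsub. Using zmq_version as the function under test and FFI::CheckLib to locate libzmq are assumptions made for this illustration:

```perl
use strict;
use warnings;

use Benchmark qw(cmpthese);
use FFI::Platypus;
use FFI::CheckLib qw(find_lib);
use ZMQ::LibZMQ3 qw(zmq_version);    # XS xsub; returns ($major, $minor, $patch)

my $ffi = FFI::Platypus->new;
$ffi->lib( find_lib(lib => 'zmq') );

# Attach void zmq_version(int*, int*, int*) under a different
# name so the FFI and XS versions can coexist in one package
$ffi->attach(
    [ zmq_version => 'ffi_version' ]
        => ['int*', 'int*', 'int*'] => 'void'
);

cmpthese(-3, {
    ffi => sub { my ($maj, $min, $pat); ffi_version(\$maj, \$min, \$pat) },
    xs  => sub { my @v = zmq_version() },
});
```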
C and Python implementation results are provided as a baseline for performance.
All the code that was created or modified for these benchmarks is listed at the end (C/Python wuclient/wuserver code can be found in the zmq guide).
CPU: Intel Core i7-2600K (quad core) @ 3.40GHz
Mem: 4GB
OS: Arch Linux
ZMQ: 4.0.5
Perl: 5.20.1
ZMQ::FFI = 0.19 (FFI::Raw backend), dev (FFI::Platypus backend)
FFI::Raw = 0.32
FFI::Platypus = 0.31
ZMQ::LibZMQ3 = 1.19
I've been able to reduce the difference very slightly on my lazy branch at https://github.com/pipcet/FFI-Platypus/tree/lazy, if you forgive the shameless self-promotion:
(I tried to select a run that got close to the maximum value for both FFI2 and XS).
FFI2 is the relevant test. However, the optimizations required for that are somewhat ugly (but given the predominance of int/long/pointer return types over the others I think we can live with that), and I see very little further room for improvement. We're talking about 1015 vs 1300 CPU clock cycles per call in my case, and most of those are spent inside the Perl and ZMQ code.
Optimizations:
If we switch to multiple implementations, one of those implementations might very well generate C code at runtime for the FFI XSUB, and use either TinyCC or (if you don't mind the huge start-up delay) Inline::C to compile it; in this case, the result would look virtually identical to the XS code.
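As a toy illustration of the runtime-codegen idea using Inline's bind interface (the wrapper below is hypothetical; a real implementation would emit the marshalling code for the attached function's actual signature):

```perl
use strict;
use warnings;

use Inline;

# C source assembled at runtime; imagine this being generated
# from an FFI signature like 'long (long)'
my $c_src = <<'END_C';
long add_one(long x) { return x + 1; }
END_C

# Compile and bind into the current package; Inline caches the
# compiled object, so only the first run pays the compile cost
Inline->bind( C => $c_src );

print add_one(41), "\n";    # 42
```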
I'd be very curious to see what the difference is like on other machines; the lazy branch is using indirect function calls (one per argument; we're skipping the one for the return value), while @plicease's branch is using switch statements.
The prefetch thing demonstrates something, though: we're better than XS, even if we aren't faster right now, because it's actually a feasible project to tweak a few lines in the call routine to do prefetches, while no one is going to sift through a large XS library doing that in thousands of places. Similarly, someone with more patience than me might figure out just the right compiler switches (applied in one place, to code that is compiled only three times) to make the call routine as fast as possible.