@calid
Last active October 3, 2022 10:45

ØMQ Perl Performance Comparison: FFI vs XS bindings

A comparison of the performance of FFI vs XS zeromq bindings. On the FFI side the ZMQ::FFI bindings are used, first with FFI::Raw on the backend and then with FFI::Platypus. On the XS side ZMQ::LibZMQ3 is used.

The comparison uses the zeromq weather station example: first wuclient.pl is timed under each implementation, then wuserver.pl is profiled with Devel::NYTProf. When profiling, the server is changed to simply publish 1 million messages and exit.

The weather station example code was lightly optimized (e.g. variables are not declared inside the loop) and modified to be more consistent across implementations.

Additionally, a more direct benchmark comparing FFI::Platypus xsubs against native XS xsubs is included.

C and Python implementation results are provided as a baseline for performance.

All the code that was created or modified for these benchmarks is listed at the end (C/Python wuclient/wuserver code can be found in the zmq guide).

Test box

CPU:  Intel Core i7-2600K (quad core) @ 3.40GHz
Mem:  4GB
OS:   Arch Linux
ZMQ:  4.0.5
Perl: 5.20.1

ZMQ::FFI      = 0.19 (FFI::Raw backend), dev (FFI::Platypus backend)
FFI::Raw      = 0.32
FFI::Platypus = 0.31
ZMQ::LibZMQ3  = 1.19

wuclient.pl Time Comparison

FFI::Raw Implementation

$ perl wuserver.pl &
$ time perl wuclient.pl
Collecting updates from weather station...
Average temperature for zipcode '10001 ' was 21F

real    1m22.818s
user    0m0.070s
sys     0m0.023s

FFI::Platypus Implementation

$ perl wuserver.pl &
$ time perl wuclient.pl
Collecting updates from weather station...
Average temperature for zipcode '10001 ' was 38F

real    0m12.813s
user    0m0.083s
sys     0m0.033s

XS Implementation (ZMQ::LibZMQ3)

$ perl wuserver.pl &
$ time perl wuclient.pl
Collecting updates from weather server...
Average temperature for zipcode '10001 ' was 34F

real    0m10.051s
user    0m0.017s
sys     0m0.010s

C Reference Implementation

$ ./wuserver &
$ time ./wuclient
Collecting updates from weather server...
Average temperature for zipcode '10001 ' was 26F

real    0m2.842s
user    0m0.000s
sys     0m0.023s

Python Reference Implementation

I was initially impressed with the performance of the Python example:

$ python -V
Python 3.4.2
$ python -c 'import zmq; print(zmq.pyzmq_version())'
14.5.0

$ python wuserver.py &
$ time python wuclient.py
Collecting updates from weather server...
Average temperature for zipcode '10001' was 49F

real    0m4.599s
user    0m0.063s
sys     0m0.020s

Wow, that's almost as fast as C! But then I noticed:

# Process 5 updates
total_temp = 0
for update_nbr in range(5):
    ...

So where the C and Perl implementations process 100 updates, the Python version only processes 5, i.e. 1/20 as many. What if we use 100 updates like the other languages?

$ python wuserver.py &
$ time python wuclient.py
Collecting updates from weather server...
Average temperature for zipcode '10001' was 17F

real    1m41.108s
user    0m0.077s
sys     0m0.017s

If nothing else, at least the Perl bindings blow the doors off the Python ones :)

wuserver.pl Hot Spot Comparison (Devel::NYTProf)

FFI::Raw Implementation

$self->_zmq3_ffi->{zmq_send}->($self->_socket, $msg, $length, $flags)
# spent 19.9s making 1000000 calls to FFI::Raw::__ANON__[FFI/Raw.pm:94], avg 20µs/call
# spent 5.72s making 2000000 calls to FFI::Raw::coderef, avg 3µs/call
# spent 2.90s making 1000000 calls to ZMQ::FFI::ZMQ3::Socket::_zmq3_ffi, avg 3µs/call

FFI::Platypus Implementation

zmq_send($socket, $msg, $length, $flags)
# spent 1.33s making 1000000 calls to ZMQ::FFI::ZMQ3::Socket::zmq_send, avg 1µs/call

sub ZMQ::FFI::ZMQ3::Socket::zmq_send; # xsub
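With the Platypus backend, zmq_send resolves to a real xsub attached directly into the socket class, so there are no intervening Perl layers for the profiler to see. A minimal sketch of what such an attach looks like (illustrative only, not ZMQ::FFI's actual code):

attach(
    ['zmq_send' => 'ZMQ::FFI::ZMQ3::Socket::zmq_send']
    => ['pointer', 'string', 'size_t', 'int'] => 'int'
);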

XS Implementation (ZMQ::LibZMQ3)

zmq_send($socket, $string, -1);
# spent 1.23s making 1000000 calls to ZMQ::LibZMQ3::zmq_send, avg 1µs/call

sub ZMQ::LibZMQ3::zmq_send; # xsub

Direct xsub Comparison

The weather station example inevitably has layers between sending the messages and the underlying xsub calls. This is fine for comparing the two high level APIs ZMQ::FFI vs ZMQ::LibZMQ3, but we also want to compare the FFI::Platypus vs XS xsub performance directly.

So, as much as possible, the intervening layers are stripped out to determine the raw performance of the two (see zmq-bench.pl in the listings at the end).

Benchmark.pm results

$ perl zmq-bench.pl
FFI ZMQ Version: 4.0.5
XS  ZMQ Version: 4.0.5

Benchmark: timing 10000000 iterations of FFI, XS...
       FFI:  4 wallclock secs ( 3.31 usr +  0.01 sys =  3.32 CPU) @ 3012048.19/s (n=10000000)
        XS:  2 wallclock secs ( 2.16 usr +  0.00 sys =  2.16 CPU) @ 4629629.63/s (n=10000000)

         Rate   FFI    XS     C
FFI 3012048/s    --  -35%  -82%
XS  4629630/s   54%    --  -73%
C* 16835017/s  559%  364%    --

*the C results are 'faked' into the table by hand (computed from the timing below) so there is an easy baseline to compare against

$ time zmq-bench-c
C ZMQ Version: 4.0.5

real    0m0.594s
user    0m0.570s
sys     0m0.017s

$ echo '10000000 / 0.594' | bc -lq
16835016.835 # Rate

Devel::NYTProf profiling results

For the profiling and shell timing below, messages are sent in a plain for loop instead of via Benchmark, as sketched here:
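A minimal sketch of that loop variant (assuming the same setup and attach() calls as zmq-bench.pl in the listings below):

for (1 .. 10_000_000) {
    die 'ffi send error' if -1 == zmqffi_send($ffi_socket, 'ohhai', 5, 0);
    # or, for the XS run:
    # die 'xs send error' if -1 == zmq_send($xs_socket, 'ohhai', 5, 0);
}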

sub main::zmqffi_send; # xsub
# spent 15.5s within main::zmqffi_send which was called 10000000 times, avg 2µs/call

sub ZMQ::LibZMQ3::zmq_send; # xsub
# spent 15.6s within ZMQ::LibZMQ3::zmq_send which was called 10000000 times, avg 2µs/call

Q: Why does the profiler indicate basically identical performance for the xsubs, while Benchmark reports a performance difference?

A: ???

Time in shell

$ time perl zmq-bench.pl
FFI ZMQ Version: 4.0.5

real    0m3.541s
user    0m3.510s
sys     0m0.027s

$ echo '10000000 / 3.541' | bc -lq
2824060.999 # Rate

$ time perl zmq-bench.pl
XS ZMQ Version: 4.0.5

real    0m2.390s
user    0m2.363s
sys     0m0.020s

$ echo '10000000 / 2.390' | bc -lq
4184100.418 # Rate

XS is ~48% faster when timed in the shell.

Code Listings

/*
 * zmq-bench-c: C baseline for the direct send benchmark
 * (build, assuming gcc/clang: cc -std=c99 -o zmq-bench-c zmq-bench-c.c -lzmq)
 */
#include <zmq.h>
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <assert.h>
#include <string.h>

int main(void)
{
    void *ctx = zmq_ctx_new();
    assert(ctx);

    void *socket = zmq_socket(ctx, ZMQ_PUB);
    assert(socket);

    /* bind to a pid-unique ipc endpoint */
    pid_t p = getpid();
    char *endpoint = malloc(256);
    sprintf(endpoint, "ipc:///tmp/zmq-c-bench-%d", p);
    assert( -1 != zmq_bind(socket, endpoint) );

    int major, minor, patch;
    zmq_version(&major, &minor, &patch);
    printf("C ZMQ Version: %d.%d.%d\n", major, minor, patch);

    /* send 10 million 5-byte messages */
    for ( int i = 0; i < (10 * 1000 * 1000); i++ ) {
        assert( -1 != zmq_send(socket, "ohhai", 5, 0) );
    }

    return 0;
}

#
# zmq-bench.pl: directly compare FFI::Platypus vs XS xsubs
#
use strict;
use warnings;
use v5.10;

use FFI::Platypus::Declare;
use ZMQ::LibZMQ3;
use ZMQ::FFI::Constants qw(:all);
use Benchmark qw(:all);

lib 'libzmq.so';

# attach the raw libzmq functions via FFI::Platypus, renamed with a
# zmqffi_ prefix to avoid clashing with the ZMQ::LibZMQ3 functions
attach(
    ['zmq_ctx_new' => 'zmqffi_ctx_new']
    => [] => 'pointer'
);

attach(
    ['zmq_socket' => 'zmqffi_socket']
    => ['pointer', 'int'] => 'pointer'
);

attach(
    ['zmq_bind' => 'zmqffi_bind']
    => ['pointer', 'string'] => 'int'
);

attach(
    ['zmq_send' => 'zmqffi_send']
    => ['pointer', 'string', 'size_t', 'int'] => 'int'
);

attach(
    ['zmq_version' => 'zmqffi_version']
    => ['int*', 'int*', 'int*'] => 'void'
);

my $ffi_ctx = zmqffi_ctx_new();
die 'ffi ctx error' unless $ffi_ctx;

my $ffi_socket = zmqffi_socket($ffi_ctx, ZMQ_PUB);
die 'ffi socket error' unless $ffi_socket;

my $rv;

$rv = zmqffi_bind($ffi_socket, "ipc:///tmp/zmq-ffi-bench-$$");
die 'ffi bind error' if $rv == -1;

my $xs_ctx = zmq_ctx_new();
die 'xs ctx error' unless $xs_ctx;

my $xs_socket = zmq_socket($xs_ctx, ZMQ_PUB);
die 'xs socket error' unless $xs_socket;

$rv = zmq_bind($xs_socket, "ipc:///tmp/zmq-xs-bench-$$");
die 'xs bind error' if $rv == -1;

my ($major, $minor, $patch);
zmqffi_version(\$major, \$minor, \$patch);

say "FFI ZMQ Version: " . join(".", $major, $minor, $patch);
say "XS ZMQ Version: " . join(".", ZMQ::LibZMQ3::zmq_version());

my $r = timethese 10_000_000, {
    'XS' => sub {
        die 'xs send error' if -1 == zmq_send($xs_socket, 'ohhai', 5, 0);
    },
    'FFI' => sub {
        die 'ffi send error' if -1 == zmqffi_send($ffi_socket, 'ohhai', 5, 0);
    },
};

cmpthese($r);

# wuclient.pl (ZMQ::FFI version): weather update client
use strict;
use warnings;
use v5.10;

use ZMQ::FFI;
use ZMQ::FFI::Constants qw(ZMQ_SUB);

say "Collecting updates from weather station...";

my $context    = ZMQ::FFI->new();
my $subscriber = $context->socket(ZMQ_SUB);
$subscriber->connect("tcp://localhost:5556");

my $filter = $ARGV[0] // "10001 ";
$subscriber->subscribe($filter);

my $update_nbr = 100;
my $total_temp = 0;

# declare loop vars once, outside the loop
my ($string, $zipcode, $temperature, $relhumidity);

for (1..$update_nbr) {
    $string = $subscriber->recv();
    ($zipcode, $temperature, $relhumidity) = split ' ', $string;
    $total_temp += $temperature;
}

printf "Average temperature for zipcode '%s' was %dF\n",
    $filter, int($total_temp / $update_nbr);

# wuserver.pl (ZMQ::FFI version): weather update server
use strict;
use warnings;

use ZMQ::FFI;
use ZMQ::FFI::Constants qw(ZMQ_PUB);

my $context   = ZMQ::FFI->new();
my $publisher = $context->socket(ZMQ_PUB);
$publisher->bind("tcp://*:5556");

my ($zipcode, $temperature, $relhumidity, $update);

# for (1..1_000_000) { # publish constant number when profiling
while (1) {
    $zipcode     = rand(100_000);
    $temperature = rand(215) - 80;
    $relhumidity = rand(50) + 10;

    $update = sprintf(
        '%05d %d %d',
        $zipcode, $temperature, $relhumidity
    );

    $publisher->send($update);
}

# wuclient.pl (ZMQ::LibZMQ3 version): weather update client
use strict;
use warnings;
use v5.10;

use ZMQ::LibZMQ3;
use ZMQ::Constants qw(ZMQ_SUB ZMQ_SUBSCRIBE);
use zhelpers;

say 'Collecting updates from weather server...';

my $context    = zmq_init();
my $subscriber = zmq_socket($context, ZMQ_SUB);
zmq_connect($subscriber, 'tcp://localhost:5556');

my $filter = @ARGV ? $ARGV[0] : '10001 ';
zmq_setsockopt($subscriber, ZMQ_SUBSCRIBE, $filter);

my $update_nbr = 100;
my $total_temp = 0;

my ($string, $zipcode, $temperature, $relhumidity);

for (1 .. $update_nbr) {
    $string = s_recv($subscriber);
    ($zipcode, $temperature, $relhumidity) = split ' ', $string;
    $total_temp += $temperature;
}

printf "Average temperature for zipcode '%s' was %dF\n",
    $filter, int($total_temp / $update_nbr);

# wuserver.pl (ZMQ::LibZMQ3 version): weather update server
use strict;
use warnings;

use ZMQ::LibZMQ3;
use ZMQ::Constants qw(ZMQ_PUB);
use zhelpers;

my $context   = zmq_init();
my $publisher = zmq_socket($context, ZMQ_PUB);
zmq_bind($publisher, 'tcp://*:5556');

my ($zipcode, $temperature, $relhumidity, $update);

# for (1..1_000_000) { # publish constant number when profiling
while (1) {
    $zipcode     = rand(100_000);
    $temperature = rand(215) - 80;
    $relhumidity = rand(50) + 10;

    $update = sprintf(
        '%05d %d %d',
        $zipcode, $temperature, $relhumidity
    );

    s_send($publisher, $update);
}

pipcet commented Mar 9, 2015

I've been able to reduce the difference very slightly on my lazy branch at https://github.com/pipcet/FFI-Platypus/tree/lazy, if you forgive the shameless self-promotion:

FFI ZMQ Version: 4.0.5
XS  ZMQ Version: 4.0.5
Benchmark: timing 10000000 iterations of FFI, FFI2, XS...
       FFI: 10 wallclock secs ( 8.54 usr +  0.00 sys =  8.54 CPU) @ 1170960.19/s (n=10000000)
      FFI2:  6 wallclock secs ( 5.89 usr +  0.00 sys =  5.89 CPU) @ 1697792.87/s (n=10000000)
        XS:  4 wallclock secs ( 4.73 usr +  0.00 sys =  4.73 CPU) @ 2114164.90/s (n=10000000)
          Rate  FFI FFI2   XS
FFI  1170960/s   -- -31% -45%
FFI2 1697793/s  45%   -- -20%
XS   2114165/s  81%  25%   --

(I tried to select a run that got close to the maximum value for both FFI2 and XS).

FFI2 is the relevant test. However, the optimizations required for that are somewhat ugly (but given the predominance of int/long/pointer return types over the others I think we can live with that), and I see very little further room for improvement. We're talking about 1015 vs 1300 CPU clock cycles per call in my case, and most of those are spent inside the Perl and ZMQ code.

Optimizations:

  • use TARG rather than creating a new SV for our return value
  • compiler switches (that's cheating, I admit, since I didn't change those for XS)
  • use PERL_NO_GET_CONTEXT. That's likely to have more of an effect when there are many arguments.
  • __builtin_expect to get gcc to follow the main branch. This doesn't appear to make a difference.
  • __builtin_prefetch on the SVs we're going to look at. Adding further __builtin_prefetch instructions appears to result in a slowdown; however, that might be because in this simple test case, everything is in the cache anyway.

If we switch to multiple implementations, one of those implementations might very well generate C code at runtime for the FFI XSUB, and use either TinyCC or (if you don't mind the huge start-up delay) Inline::C to compile it into code that would look virtually identical to the XS code, in this case.

I'd be very curious to see what the difference is like on other machines; the lazy branch is using indirect function calls (one per argument, we're skipping the one for the return value), while @plicease's branch is using switch statements.

The prefetch thing demonstrates something, though: we're better than XS, even if we aren't faster right now, because it's actually a feasible project to tweak a few lines in the call routine to do prefetches, while no one is going to sift through a large XS library doing that in thousands of places. Similarly, someone with more patience than me might figure out just the right compiler switches, (in one place, used to compile code only three times), to make the call routine as fast as possible.


pipcet commented Mar 9, 2015

Further optimization:

  • profile-driven optimization based on the benchmark. This is the big one.
  • -march=native -mtune=native to adjust CPU type
  • rewrote some code to make GCC not emit two integer division instructions. That helped more than I thought.
  • removed prefetches as they didn't seem to be helping any
  • removed current_argv, which I'd rather reimplement in a different way
  • use __attribute__((regparm(6))) on our internal methods. Doesn't appear to make a difference.
           Rate FFI FFI2g FFI2b FFI2k FFI2c FFI2e FFI2i FFI2f FFI2d FFI2j FFI2h FFI2a   XS
FFI   1808318/s  --  -18%  -18%  -21%  -22%  -22%  -24%  -24%  -25%  -25%  -25%  -27% -31%
FFI2g 2207506/s 22%    --   -0%   -3%   -5%   -5%   -7%   -7%   -8%   -9%   -9%  -11% -16%
FFI2b 2217295/s 23%    0%    --   -3%   -4%   -4%   -7%   -7%   -8%   -8%   -8%  -10% -16%
FFI2k 2277904/s 26%    3%    3%    --   -2%   -2%   -4%   -4%   -5%   -6%   -6%   -8% -13%
FFI2c 2314815/s 28%    5%    4%    2%    --   -0%   -3%   -3%   -4%   -4%   -4%   -6% -12%
FFI2e 2314815/s 28%    5%    4%    2%    0%    --   -3%   -3%   -4%   -4%   -4%   -6% -12%
FFI2i 2380952/s 32%    8%    7%    5%    3%    3%    --   -0%   -1%   -1%   -2%   -4% -10%
FFI2f 2380952/s 32%    8%    7%    5%    3%    3%    0%    --   -1%   -1%   -2%   -4% -10%
FFI2d 2409639/s 33%    9%    9%    6%    4%    4%    1%    1%    --   -0%   -0%   -3%  -8%
FFI2j 2415459/s 34%    9%    9%    6%    4%    4%    1%    1%    0%    --   -0%   -2%  -8%
FFI2h 2421308/s 34%   10%    9%    6%    5%    5%    2%    2%    0%    0%    --   -2%  -8%
FFI2a 2475248/s 37%   12%   12%    9%    7%    7%    4%    4%    3%    2%    2%    --  -6%
XS    2631579/s 46%   19%   19%   16%   14%   14%   11%   11%    9%    9%    9%    6%   --

(different machine; the FFI2[a-k] versions are identical except for their names.)

Is that a fair comparison? No, but I suspect that the profile data is pretty close to the same for most common applications—the main factor is likely to be that there is a single integer return value, and that's true for a lot of functions. In other words, my prediction is we'll get most of the improvement incorporating far fewer changes than I've actually made.

All of this really depends on having multiple Platypus "implementations", as discussed at PerlFFI/FFI-Platypus#44. However, I think I've made a good start on that already.


calid commented Mar 10, 2015

@pipcet, what would be interesting is if you could identify which optimizations had the most impact. If it's something like custom gcc flags, that's probably not all that useful since different compilers/platforms need to be supported. But if it's something that's easy to apply in general and provides a significant bump that would be great. Also I don't think anything below 20% (unless it's super trivial) is worth the trouble.


pipcet commented Mar 10, 2015

I've removed all questionable optimizations (compiler flags, PDO, __builtin_expect) and the results are still pretty good, though I've seen no repeat of that 6% number. I've also hacked in a few more tests with the loop itself in C (Inline or TinyCC), Perl, or Python. However, while the results I'm getting are good, the variance is huge, possibly because the system is not really idle.

The precise revisions I'm using are
https://gist.github.com/pipcet/1644cbd05e3300e5cec4/365ca95b5556bdbfc381b578cde88335353891d1 and
https://github.com/pipcet/FFI-Platypus/tree/27e3159ef365161128360c74d8e779f8138a2424

Example output:

                         Rate class method class method(hash) Python method xsub(hash) xsub Perl exec   XS TinyCC Inline
class method         740192/s           --               -27%   -42%   -47%       -56% -64%      -66% -70%   -92%   -93%
class method(hash)  1018330/s          38%                 --   -20%   -27%       -40% -50%      -53% -59%   -90%   -90%
Python              1272912/s          72%                25%     --    -8%       -25% -38%      -41% -49%   -87%   -87%
method              1388889/s          88%                36%     9%     --       -18% -32%      -35% -44%   -86%   -86%
xsub(hash)          1692047/s         129%                66%    33%    22%         -- -17%      -21% -32%   -83%   -83%
xsub                2044990/s         176%               101%    61%    47%        21%   --       -5% -18%   -79%   -80%
Perl exec           2151926/s         191%               111%    69%    55%        27%   5%        -- -13%   -78%   -79%
XS                  2487562/s         236%               144%    95%    79%        47%  22%       16%   --   -75%   -75%
TinyCC              9784736/s        1222%               861%   669%   605%       478% 378%      355% 293%     --    -2%
Inline             10010010/s        1252%               883%   686%   621%       492% 389%      365% 302%     2%     --

There are quite a few things that are weird about that, including how all tests (including the unchanged XS test) seem to be somewhat slower than yesterday. Is it the phase of the moon?

ETA: it occurs to me that this might not be the phase of the moon, but the size of %main::, the package stash/hash, which increased as I added other tests. Which only supports the contention that we're overoptimizing, because Perl's speed matters more than how many clock cycles dispatching to ffi_call takes once we hit XS.


calid commented Mar 11, 2015

Which only supports the contention that we're overoptimizing, because Perl's speed matters more than how many clock cycles dispatching to ffi_call takes once we hit XS.

Yeah, honestly the tests with TinyCC/Inline are interesting, but again if you want that kind of speedup and are willing to give up the convenience of ffi just write pure native code IMO.

What I'm really curious about is the performance difference between FFI::Platypus's xsub and Perl's native one. Why do they perform differently? Without knowing anything about the implementation details, I would have thought once the ffi xsub is loaded into memory it should be just as fast as the XS one.

To me that's kind of the holy grail: Can we get the FFI xsub to be as fast as the XS one?


pipcet commented Mar 11, 2015

What I'm really curious about is the performance difference between FFI::Platypus's xsub and Perl's native one. Why do they perform differently? Without knowing anything about the implementation details, I would have thought once the ffi xsub is loaded into memory it should be just as fast as the XS one.

Platypus doesn't actually generate an xsub at run-time for each function you attach (that would make attach() really slow!), but uses a generic C function for all of them, which looks at a function descriptor structure and handles the arguments individually, using either a big switch statement (in the main branch) or indirect function pointer calls (on my all-tests-pass branch, which is obviously a branch I only commit to when, er, all tests pass). Switch statements are slow. Function pointer calls used to be really really slow, but I think they've been downgraded to only really slow by now. Then it handles the return value in another big switch statement/function pointer call.
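To make that concrete, here is a rough Perl analogy of the generic-function dispatch (purely illustrative; the real implementation is C inside Platypus, and every name below is invented):

# one handler per argument type; the generic entry point looks each
# one up at call time instead of having per-function compiled code
my %marshal = (
    pointer => sub { 0 + shift },   # coerce to a plain integer address
    string  => sub { '' . shift },  # coerce to a string
    int     => sub { int shift },
    size_t  => sub { int shift },
);

sub generic_call {
    my ($sig, @args) = @_;
    # one lookup + indirect call (or switch branch) per argument
    my @c_args = map { $marshal{ $sig->{args}[$_] }->($args[$_]) } 0 .. $#args;
    return \@c_args;   # the real code hands these to libffi
}

# e.g.:
# generic_call({ args => ['pointer', 'string', 'size_t', 'int'] },
#              $socket, 'ohhai', 5, 0);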

Furthermore, libffi copies all arguments one more time, though that's probably in the cache and should be relatively fast.

So in a way, the holy grail is unachievable: anything we can do in our generic function (if you want to have a look, it's at https://github.com/pipcet/FFI-Platypus/blob/lazy/include/ffi_platypus_rtypes_call.h and https://github.com/pipcet/FFI-Platypus/blob/lazy/include/ffi_platypus_call.h), XS can do. In theory. In practice, each XSUB has to be optimized by hand for things we can do automatically. For example, assume there is a string argument to our native function. It's likely, then, that the native function will actually look at the string; we can exploit that by prefetching the string contents to the CPU cache before we even look at the other arguments. That kind of optimization is really hard to do for a hundred, or even a dozen, XSUBs, but we only have to do it once. As cache misses are a major source of latency in program execution, it's quite possible we can beat XS just by avoiding one of them.

The only problem with that example is that it's realistic: it describes actual programs, which have cache misses, rather than simple benchmark programs, which have everything in cache already and for which prefetch instructions are unnecessary overhead.

The Platypus user isn't affected by all that. All they see is the ->attach call with a harmless 'string' in it, and they never know we're prefetching that string for them.

There is another way in which we're faster than XS, though I'm not entirely happy with it: XS and the rest of Perl use blessed references to integer scalars for pointers, while we use plain integers. That's a pointer deref we don't have to do, but because only references can be blessed, that means you cannot free an opaque pointer in its DESTROY method.
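For illustration, the two pointer styles side by side (a sketch; the class name and address here are made up):

# XS style: the address lives inside a blessed reference, so freeing
# can be hooked via DESTROY, at the cost of a deref on every call
my $addr   = 0x1234;
my $xs_ptr = bless \$addr, 'My::Opaque';
sub My::Opaque::DESTROY { }   # could free the underlying pointer here

# Platypus opaque style: the plain integer address itself; no deref,
# but nothing to hang a DESTROY on
my $ffi_ptr = 0x1234;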

Sorry this got a bit long. In summary, Platypus gives up some low-level tweak-the-assembler-code optimization but gains the ability of automating exotic optimization techniques. Those paltry 200 CPU cycles you might be losing today are nothing compared to what you gain in maintainability, and potentially in future performance.


calid commented Mar 12, 2015

Sorry this got a bit long

Not at all. I actually have a very limited understanding of XS internals, so I appreciated the explanation.


pipcet commented Mar 12, 2015

Thanks!

One of the other ideas for "optimization" is to generate C (not XS) code at runtime and compile it with FFI::TinyCC. I had assumed that wouldn't work because we need to parse the Perl header files (and TinyCC doesn't implement compiler directives that GCC does implement; but there's no easy/portable way to get Perl's build flags translated to another compiler), but it does! (I had to #define __builtin_expect, but that's hardly a major issue). And looking at the generated assembler code and benchmark results, it's good code and a little faster than ZMQ::LibZMQ3 (full results at https://gist.github.com/pipcet/1644cbd05e3300e5cec4#file-04-results-md)
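For reference, the basic compile-at-runtime-and-attach pattern looks roughly like this (a minimal sketch along the lines of the FFI::TinyCC synopsis, not the actual generated wrapper code):

use v5.10;
use FFI::TinyCC;
use FFI::Platypus;

# compile a tiny C function in memory at runtime
my $tcc = FFI::TinyCC->new;
$tcc->compile_string(q{
    int square(int value) { return value * value; }
});

# attach by address: the symbol lives in the TinyCC-compiled image
my $ffi = FFI::Platypus->new;
$ffi->attach([$tcc->get_symbol('square') => 'square'] => ['int'] => 'int');

say square(4);   # prints 16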

What this demonstrates is that we can beat a real-world XS library in a non-real-world benchmarking situation, by using PERL_NO_GET_CONTEXT. So on threaded Perl, holy grail. Well, some coding still required, but I'm now convinced we can do it.


pipcet commented Mar 13, 2015

Okay, I've implemented just enough JIT compiling (though that's really a big word for a tiny little hack like this) to cover this one example, and the speed difference compared to XS is still 10%:

time perl ./zmq-bench-xsexec.pl; time perl ./zmq-bench-ffiexec.pl
perl ./zmq-bench-xsexec.pl  52.33s user 0.05s system 97% cpu 53.624 total
FFI ZMQ Version: 4.0.5
perl ./zmq-bench-ffiexec.pl  45.33s user 0.05s system 99% cpu 45.431 total

Where the code is essentially:

attach(
    ['zmq_send' => 'zmqffi_send']
    => ['pointer', 'string', 'size_t', 'int'] => 'int'
);

my $i = 0;
while (1) {
    $i++;
    die if -1 == zmqffi_send($ffi_socket, 'ohhai', 5, 0);
    exit if $i == 100_000_000;
}

The difference is more pronounced without error checking, but that's something I'm nervous about, at least in this case.
