Last active October 3, 2022 10:45
ZeroMQ Perl Performance Comparison: FFI vs XS bindings

ØMQ Perl Performance Comparison: FFI vs XS bindings

Comparison of the performance of FFI vs XS zeromq bindings. For FFI the ZMQ::FFI bindings are used, first using FFI::Raw on the backend and then using FFI::Platypus. For XS ZMQ::LibZMQ3 is used.

Comparison is done using the zeromq weather station example, first by timing the various implementations, and then by profiling them with Devel::NYTProf. When profiling, the server is changed to simply publish 1 million messages and exit.

Weather station example code was lightly optimized (e.g. don't declare vars in loop) and modified to be more consistent.

A more direct benchmark comparing FFI::Platypus and XS xsubs is also included.

C and Python implementation results are provided as a baseline for performance.

All the code that was created or modified for these benchmarks is listed at the end (C/Python wuclient/wuserver code can be found in the zmq guide).

Test box

CPU:  Intel Core Quad i7-2600K CPU @ 3.40GHz
Mem:  4GB
OS:   Arch Linux
ZMQ:  4.0.5
Perl: 5.20.1

ZMQ::FFI      = 0.19 (FFI::Raw backend), dev (FFI::Platypus backend)
FFI::Raw      = 0.32
FFI::Platypus = 0.31
ZMQ::LibZMQ3  = 1.19

Time Comparison

FFI::Raw Implementation

$ perl &
$ time perl
Collecting updates from weather station...
Average temperature for zipcode '10001 ' was 21F

real    1m22.818s
user    0m0.070s
sys     0m0.023s

FFI::Platypus Implementation

$ perl &
$ time perl
Collecting updates from weather station...
Average temperature for zipcode '10001 ' was 38F

real    0m12.813s
user    0m0.083s
sys     0m0.033s

XS Implementation (ZMQ::LibZMQ3)

$ perl &
$ time perl
Collecting updates from weather server...
Average temperature for zipcode '10001 ' was 34F

real    0m10.051s
user    0m0.017s
sys     0m0.010s

C Reference Implementation

$ ./wuserver &
$ time ./wuclient
Collecting updates from weather server...
Average temperature for zipcode '10001 ' was 26F

real    0m2.842s
user    0m0.000s
sys     0m0.023s

Python Reference Implementation

I was initially impressed with the performance of the Python example:

$ python -V
Python 3.4.2
$ python -c 'import zmq; print(zmq.pyzmq_version())'

$ python &
$ time python
Collecting updates from weather server...
Average temperature for zipcode '10001' was 49F

real    0m4.599s
user    0m0.063s
sys     0m0.020s

Wow, that's almost as fast as C! But then I noticed:

# Process 5 updates
total_temp = 0
for update_nbr in range(5):

So where the C and Perl implementations process 100 updates, the Python version only processes 5, or 1/20 as many. What if we use 100 updates like the other languages?

$ python &
$ time python
Collecting updates from weather server...
Average temperature for zipcode '10001' was 17F

real    1m41.108s
user    0m0.077s
sys     0m0.017s
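Because the runs above do not all process the same number of updates, dividing wall-clock time by the update count gives a fairer comparison. The sketch below does only that, using the `real` times reported above; the dictionary layout and names are purely illustrative.

```python
# Wall-clock ("real") seconds and number of updates processed,
# copied from the timing runs above.
runs = {
    "FFI::Raw": (82.818, 100),
    "FFI::Platypus": (12.813, 100),
    "XS (ZMQ::LibZMQ3)": (10.051, 100),
    "C": (2.842, 100),
    "Python (5 updates)": (4.599, 5),
    "Python (100 updates)": (101.108, 100),
}

# Seconds of wall-clock time per processed update.
per_update = {name: secs / count for name, (secs, count) in runs.items()}

# Print fastest first.
for name, secs in sorted(per_update.items(), key=lambda kv: kv[1]):
    print(f"{name:<22} {secs:.3f} s/update")
```

On this measure the 5-update Python run costs about as much per update as the 100-update one, so its short total time comes from doing less work rather than from faster bindings.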

If nothing else, at least the Perl bindings blow the doors off the Python ones :)

Hot Spot Comparison (Devel::NYTProf)

FFI::Raw Implementation

$self->_zmq3_ffi->{zmq_send}->($self->_socket, $msg, $length, $flags)
# spent 19.9s making 1000000 calls to FFI::Raw::__ANON__[FFI/], avg 20µs/call
# spent 5.72s making 2000000 calls to FFI::Raw::coderef, avg 3µs/call
# spent 2.90s making 1000000 calls to ZMQ::FFI::ZMQ3::Socket::_zmq3_ffi, avg 3µs/call

FFI::Platypus Implementation

zmq_send($socket, $msg, $length, $flags)
# spent 1.33s making 1000000 calls to ZMQ::FFI::ZMQ3::Socket::zmq_send, avg 1µs/call

sub ZMQ::FFI::ZMQ3::Socket::zmq_send; # xsub

XS Implementation (ZMQ::LibZMQ3)

zmq_send($socket, $string, -1);
# spent 1.23s making 1000000 calls to ZMQ::LibZMQ3::zmq_send, avg 1µs/call

sub ZMQ::LibZMQ3::zmq_send; # xsub
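To compare the three hot spots on one scale, the time attributed to the send line can be summed per implementation. This is a rough sketch that assumes the three FFI::Raw entries do not overlap (an assumption, since the profile excerpt does not say); the variable names are illustrative.

```python
CALLS = 1_000_000  # messages published in the profiled server run

# Seconds attributed to the send line, per implementation, from the profiles above.
# Assumption: the three FFI::Raw entries are disjoint and can simply be added.
hot_spots = {
    "FFI::Raw": [19.9, 5.72, 2.90],  # anonymous wrapper + coderef + accessor
    "FFI::Platypus": [1.33],
    "XS (ZMQ::LibZMQ3)": [1.23],
}

totals = {name: sum(parts) for name, parts in hot_spots.items()}
baseline = totals["XS (ZMQ::LibZMQ3)"]

for name, total in totals.items():
    per_call_us = total / CALLS * 1e6  # microseconds per send
    print(f"{name:<18} {total:6.2f} s  {per_call_us:5.2f} us/send  "
          f"{total / baseline:5.1f}x XS")
```

Under that assumption the FFI::Raw send path costs about 28.5 s per million sends, roughly 23 times the XS figure, while FFI::Platypus is within about 8% of XS.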

Direct xsub Comparison

The weather station example inevitably has layers between sending the messages and the underlying xsub calls. This is fine for comparing the two high level APIs ZMQ::FFI vs ZMQ::LibZMQ3, but we also want to compare the FFI::Platypus vs XS xsub performance directly.

So strip out as many intervening layers as possible to determine the raw performance of the two.

Results

$ perl
FFI ZMQ Version: 4.0.5
XS  ZMQ Version: 4.0.5

Benchmark: timing 10000000 iterations of FFI, XS...
       FFI:  4 wallclock secs ( 3.31 usr +  0.01 sys =  3.32 CPU) @ 3012048.19/s (n=10000000)
        XS:  2 wallclock secs ( 2.16 usr +  0.00 sys =  2.16 CPU) @ 4629629.63/s (n=10000000)

         Rate   FFI    XS     C
FFI 3012048/s    --  -35%  -82%
XS  4629630/s   54%    --  -73%
C* 16835017/s  459%  264%    --

*The C row is not Benchmark output; it is added by hand from the timing below so a baseline is easy to compare.

$ time zmq-bench-c
C ZMQ Version: 4.0.5

real    0m0.594s
user    0m0.570s
sys     0m0.017s

$ echo '10000000 / 0.594' | bc -lq
16835016.835 # Rate
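Each cell in a Benchmark comparison table is, as far as I understand the module, the row's rate relative to the column's rate, expressed as (row / column - 1) x 100. The sketch below recomputes the cells from the CPU times and the C wall-clock time reported above; it is an illustration in Python, not part of the original benchmark, and `relative_pct` is a made-up helper name.

```python
# Rates in sends per second, derived from the times reported above:
# Benchmark CPU seconds for FFI and XS, shell wall-clock seconds for C.
ITERATIONS = 10_000_000
rates = {
    "FFI": ITERATIONS / 3.32,
    "XS": ITERATIONS / 2.16,
    "C": ITERATIONS / 0.594,
}

def relative_pct(row: str, col: str) -> float:
    """Percent by which `row` is faster (+) or slower (-) than `col`."""
    return (rates[row] / rates[col] - 1.0) * 100.0

names = list(rates)
print(f"{'':>4}" + "".join(f"{n:>9}" for n in names))
for row in names:
    cells = ["--" if row == col else f"{relative_pct(row, col):.1f}%"
             for col in names]
    print(f"{row:>4}" + "".join(f"{c:>9}" for c in cells))
```

Note that the table shows a difference, not a ratio: a rate about 5.59 times the FFI rate appears as roughly +459%.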

Devel::NYTProf profiling results

For profiling, and for the shell timing below, the sends run in a plain for loop instead of via Benchmark.

sub main::zmqffi_send; # xsub
# spent 15.5s within main::zmqffi_send which was called 10000000 times, avg 2µs/call

sub ZMQ::LibZMQ3::zmq_send; # xsub
# spent 15.6s within ZMQ::LibZMQ3::zmq_send which was called 10000000 times, avg 2µs/call

Q: Why does the profiler indicate basically identical performance for the two xsubs, while Benchmark reports a clear difference?

A: ???
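One possible reading, offered as a guess rather than an answer: the per-call time attributed to each xsub under the profiler is several times larger than an entire unprofiled loop iteration, so whatever the profiler adds per call may be large enough to hide a gap of about 0.1 µs between the two bindings. The sketch below only redoes that arithmetic with the numbers already reported here; the names are illustrative.

```python
CALLS = 10_000_000

# Seconds: time attributed to the xsub under Devel::NYTProf, and total CPU
# time for the whole loop as reported by Benchmark (no profiler attached).
profiled = {"FFI": 15.5, "XS": 15.6}
unprofiled = {"FFI": 3.32, "XS": 2.16}

def per_call_us(seconds: float) -> float:
    """Average microseconds per call over CALLS calls."""
    return seconds / CALLS * 1e6

for name in profiled:
    p = per_call_us(profiled[name])
    u = per_call_us(unprofiled[name])
    print(f"{name}: {p:.2f} us/call profiled, {u:.3f} us/call unprofiled, "
          f"difference {p - u:.2f} us")

# Gap between the two bindings when no profiler is attached.
gap_us = per_call_us(unprofiled["FFI"]) - per_call_us(unprofiled["XS"])
print(f"FFI vs XS gap without the profiler: {gap_us:.3f} us/call")
```

If that reading is right, the profiler figures say more about instrumentation cost than about the xsubs themselves, and the Benchmark and shell timings are the better guide to the difference.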

Time in shell

$ time perl
FFI ZMQ Version: 4.0.5

real    0m3.541s
user    0m3.510s
sys     0m0.027s

$ echo '10000000 / 3.541' | bc -lq
2824060.999 # Rate

$ time perl
XS ZMQ Version: 4.0.5

real    0m2.390s
user    0m2.363s
sys     0m0.020s

$ echo '10000000 / 2.390' | bc -lq
4184100.418 # Rate

XS is 48% faster when timed from the shell.

C benchmark (run above as zmq-bench-c)

#include <zmq.h>
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <assert.h>
#include <string.h>

int main(void)
{
    void *ctx = zmq_ctx_new();
    void *socket = zmq_socket(ctx, ZMQ_PUB);

    /* Bind to a per-process ipc endpoint */
    pid_t p = getpid();
    char *endpoint = malloc(256);
    sprintf(endpoint, "ipc:///tmp/zmq-c-bench-%d", p);
    assert( -1 != zmq_bind(socket, endpoint) );

    int major, minor, patch;
    zmq_version(&major, &minor, &patch);
    printf("C ZMQ Version: %d.%d.%d\n", major, minor, patch);

    /* Send 10 million 5-byte messages; build without NDEBUG so the asserts run */
    for ( int i = 0; i < (10 * 1000 * 1000); i++ ) {
        assert( -1 != zmq_send(socket, "ohhai", 5, 0) );
    }
}

Perl benchmark (FFI::Platypus vs XS)

# Directly compare FFI::Platypus vs XS xsubs
use strict;
use warnings;
use v5.10;

use FFI::Platypus::Declare;
use ZMQ::LibZMQ3;
use ZMQ::FFI::Constants qw(:all);
use Benchmark qw(:all);

# The library name is empty in this listing; set it to your libzmq shared object
lib '';

# Attach the libzmq functions under zmqffi_* names so they don't clash with the XS ones
attach
    ['zmq_ctx_new' => 'zmqffi_ctx_new']
    => [] => 'pointer';
attach
    ['zmq_socket' => 'zmqffi_socket']
    => ['pointer', 'int'] => 'pointer';
attach
    ['zmq_bind' => 'zmqffi_bind']
    => ['pointer', 'string'] => 'int';
attach
    ['zmq_send' => 'zmqffi_send']
    => ['pointer', 'string', 'size_t', 'int'] => 'int';
attach
    ['zmq_version' => 'zmqffi_version']
    => ['int*', 'int*', 'int*'] => 'void';

# FFI context and socket
my $ffi_ctx = zmqffi_ctx_new();
die 'ffi ctx error' unless $ffi_ctx;

my $ffi_socket = zmqffi_socket($ffi_ctx, ZMQ_PUB);
die 'ffi socket error' unless $ffi_socket;

my $rv;
$rv = zmqffi_bind($ffi_socket, "ipc:///tmp/zmq-ffi-bench-$$");
die 'ffi bind error' if $rv == -1;

# XS context and socket
my $xs_ctx = zmq_ctx_new();
die 'xs ctx error' unless $xs_ctx;

my $xs_socket = zmq_socket($xs_ctx, ZMQ_PUB);
die 'xs socket error' unless $xs_socket;

$rv = zmq_bind($xs_socket, "ipc:///tmp/zmq-xs-bench-$$");
die 'xs bind error' if $rv == -1;

my ($major, $minor, $patch);
zmqffi_version(\$major, \$minor, \$patch);
say "FFI ZMQ Version: " . join(".", $major, $minor, $patch);
say "XS ZMQ Version: " . join(".", ZMQ::LibZMQ3::zmq_version());

my $r = timethese 10_000_000, {
    'XS' => sub {
        die 'xs send error ' if -1 == zmq_send($xs_socket, 'ohhai', 5, 0);
    },
    'FFI' => sub {
        die 'ffi send error' if -1 == zmqffi_send($ffi_socket, 'ohhai', 5, 0);
    },
};

# Comparison table, as shown in the results above
cmpthese($r);

Weather client (ZMQ::FFI)

use strict;
use warnings;
use v5.10;

use ZMQ::FFI;
use ZMQ::FFI::Constants qw(ZMQ_SUB);

say "Collecting updates from weather station...";

my $context = ZMQ::FFI->new();
my $subscriber = $context->socket(ZMQ_SUB);
# connect and subscribe are absent from this listing; restored to match the XS client
$subscriber->connect("tcp://localhost:5556");

my $filter = $ARGV[0] // "10001 ";
$subscriber->subscribe($filter);

my $update_nbr = 100;
my $total_temp = 0;
my ($string, $zipcode, $temperature, $relhumidity);

for (1..$update_nbr) {
    $string = $subscriber->recv();
    ($zipcode, $temperature, $relhumidity) = split ' ', $string;
    $total_temp += $temperature;
}

printf "Average temperature for zipcode '%s' was %dF\n",
    $filter, int($total_temp / $update_nbr);

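Weather server (ZMQ::FFI)

use strict;
use warnings;

use ZMQ::FFI;
use ZMQ::FFI::Constants qw(ZMQ_PUB);

my $context = ZMQ::FFI->new();
my $publisher = $context->socket(ZMQ_PUB);
# bind, the sprintf arguments and send are absent from this listing; restored to match the XS server
$publisher->bind("tcp://*:5556");

my ($zipcode, $temperature, $relhumidity, $update);

# for (1..1_000_000) { # publish constant number when profiling
while (1) {
    $zipcode = rand(100_000);
    $temperature = rand(215) - 80;
    $relhumidity = rand(50) + 10;
    $update = sprintf(
        '%05d %d %d',
        $zipcode, $temperature, $relhumidity
    );
    $publisher->send($update);
}
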
Weather client (ZMQ::LibZMQ3)

use strict;
use warnings;
use v5.10;

use ZMQ::LibZMQ3;
use ZMQ::Constants qw(ZMQ_SUB ZMQ_SUBSCRIBE);
use zhelpers;

say 'Collecting updates from weather server...';

my $context = zmq_init();
my $subscriber = zmq_socket($context, ZMQ_SUB);
zmq_connect($subscriber, 'tcp://localhost:5556');

# Subscribe to a zipcode, default is NYC
my $filter = @ARGV ? $ARGV[0] : '10001 ';
zmq_setsockopt($subscriber, ZMQ_SUBSCRIBE, $filter);

my $update_nbr = 100;
my $total_temp = 0;
my ($string, $zipcode, $temperature, $relhumidity);

for (1 .. $update_nbr) {
    $string = s_recv($subscriber);
    ($zipcode, $temperature, $relhumidity) = split ' ', $string;
    $total_temp += $temperature;
}

printf "Average temperature for zipcode '%s' was %dF\n",
    $filter, int($total_temp / $update_nbr);

Weather server (ZMQ::LibZMQ3)

use strict;
use warnings;

use ZMQ::LibZMQ3;
use ZMQ::Constants qw(ZMQ_PUB);
use zhelpers;

my $context = zmq_init();
my $publisher = zmq_socket($context, ZMQ_PUB);
zmq_bind($publisher, 'tcp://*:5556');

my ($zipcode, $temperature, $relhumidity, $update);

# for (1..1_000_000) { # publish constant number when profiling
while (1) {
    $zipcode = rand(100_000);
    $temperature = rand(215) - 80;
    $relhumidity = rand(50) + 10;
    # sprintf arguments are absent from this listing; restored from the format string
    $update = sprintf(
        '%05d %d %d',
        $zipcode, $temperature, $relhumidity
    );
    s_send($publisher, $update);
}


pipcet commented Mar 12, 2015


One of the other ideas for "optimization" is to generate C (not XS) code at runtime and compile it with FFI::TinyCC. I had assumed that wouldn't work because we need to parse the Perl header files (and TinyCC doesn't implement compiler directives that GCC does implement; but there's no easy/portable way to get Perl's build flags translated to another compiler), but it does! (I had to #define __builtin_expect, but that's hardly a major issue). And looking at the generated assembler code and benchmark results, it's good code and a little faster than ZMQ::LibZMQ3 (full results at

What this demonstrates is that we can beat a real-world XS library in a non-real-world benchmarking situation, by using PERL_NO_GET_CONTEXT. So on threaded Perl, holy grail. Well, some coding still required, but I'm now convinced we can do it.


pipcet commented Mar 13, 2015

Okay, I've implemented just enough JIT compiling (though that's really a big word for a tiny little hack like this) to cover this one example, and the speed difference compared to XS is still 10%:

time perl ./; time perl ./
perl ./  52.33s user 0.05s system 97% cpu 53.624 total
FFI ZMQ Version: 4.0.5
perl ./  45.33s user 0.05s system 99% cpu 45.431 total

Where the code is essentially:

  ['zmq_send' => 'zmqffi_send']
  => ['pointer', 'string', 'size_t', 'int'] => 'int',

while(1) {
  die if -1 == zmqffi_send($ffi_socket, 'ohhai', 5, 0);
  $i++;  # counter increment is missing from the listing; assumed here
  exit if $i == 100_000_000;
}

The difference is more pronounced without error checking, but that's something I'm nervous about, at least in this case.
