Gist by @melo, forked from anonymous/intro.md. Last active February 18, 2016.
# Compare VPack to Sereal - *Size* only

At $work, we are looking to replace JSON encoding with another format, to increase encode/decode speed and reduce storage size.

Requirements, in order of importance for our use case (YMMV):

- **No schema requirement:** the data is JSON-compatible and deeply nested in places, and we don't have a schema to start from;
- **Smallest size:** we store the objects in memory in Redis databases, so size is the main factor;
- **Fast decode:** we can trade slower encoding for smaller size, but decoding should be fast;
- **Language support:** our stack is Perl, Go, and JavaScript. PHP is a plus, but not required.

We are testing msgpack, CBOR, Sereal, and others, but here I wanted to compare just Sereal (the current front-runner) with the new VPack from the ArangoDB project.

We used the sample files from the VPack project's tests/jsonSample/ directory, and for VPack I took the best results from the Performance.md file (last column, VPack-c). Please note: we are only comparing *size* at the moment; that was enough for our use case, where size matters most (YMMV).

Please don't turn this into a "mine is better" competition; this is based on our criteria, for our use case.

If you find a bug in our methodology, I would appreciate a note here or at @pedromelo.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use JSON::XS;
use Path::Tiny;
use Sereal::Encoder qw( SRL_SNAPPY SRL_ZLIB SRL_UNCOMPRESSED );
use Text::Table;

die "Usage: json2sereal.pl <dir>\n\n  Scans <dir> for .json files, converts them to Sereal and compares sizes\n"
  unless @ARGV;

## One encoder per compression setting; string dedup on for all but the defaults
my $enc_snappy = Sereal::Encoder->new({ compress => SRL_SNAPPY,       dedupe_strings => 1 });
my $enc_zlib   = Sereal::Encoder->new({ compress => SRL_ZLIB,         dedupe_strings => 1 });
my $enc_none   = Sereal::Encoder->new({ compress => SRL_UNCOMPRESSED, dedupe_strings => 1 });
my $enc_def    = Sereal::Encoder->new();

## Best VPack sizes, taken from the last column (VPack-c) of Performance.md
my %best_vpack = (
  'api-docs.json'       => 994160,
  'commits.json'        => 20789,
  'countries.json'      => 956786,
  'directory-tree.json' => 244716,
  'doubles.json'        => 899982,
  'doubles-small.json'  => 89998,
  'file-list.json'      => 133536,
  'object.json'         => 118630,
  'pass1.json'          => 804,
  'pass2.json'          => 51,
  'pass3.json'          => 108,
  'random1.json'        => 6836,
  'random2.json'        => 5815,
  'random3.json'        => 51515,
  'sample.json'         => 153187,
  'small.json'          => 30,
);

my $it = path(@ARGV)->iterator;
my (@rows, %totals);
while (my $f = $it->()) {
  my $b = $f->basename;
  next unless $f->is_file and $b =~ m/[.]json$/;

  my $c = eval { decode_json($f->slurp_raw) };
  debug("Skip file '$b', could not JSON-parse it: $@"), next unless defined $c;

  my $v = $best_vpack{$b};
  debug("Skip file '$b', no VPack comparison"), next unless $v;

  my $s = $f->stat->size;
  my ($def, $none, $snap, $zlib) = (
    length($enc_def->encode($c)),    length($enc_none->encode($c)),
    length($enc_snappy->encode($c)), length($enc_zlib->encode($c)),
  );

  $totals{json} += $s;
  $totals{vpack} += $v;
  $totals{def}  += $def;
  $totals{none} += $none;
  $totals{snap} += $snap;
  $totals{zlib} += $zlib;

  push @rows, table_row($b, $s, $v, $def, $none, $snap, $zlib);
}

push @rows,
  table_row('-- Total --', $totals{json}, $totals{vpack}, $totals{def}, $totals{none}, $totals{snap}, $totals{zlib});

my $tb = Text::Table->new(
  'File',     'JSON Size', 'VPack best', '% JSON',
  'Defaults', '% JSON',    '% VPack',
  'No Compr', '% JSON',    '% VPack',
  'Snappy',   '% JSON',    '% VPack',
  'ZLib',     '% JSON',    '% VPack',
);
$tb->load(@rows);
print $tb;

sub debug {
  return unless $ENV{DEBUG};
  print STDERR "[DEBUG] @_\n";
}

sub table_row {
  my ($b, $s, $v, $def, $none, $snap, $zlib) = @_;
  return [
    $b, $s,
    $v,    sprintf('%.2f%%', $v / $s * 100),
    $def,  sprintf('%.2f%%', $def / $s * 100),  sprintf('%.2f%%', $def / $v * 100),
    $none, sprintf('%.2f%%', $none / $s * 100), sprintf('%.2f%%', $none / $v * 100),
    $snap, sprintf('%.2f%%', $snap / $s * 100), sprintf('%.2f%%', $snap / $v * 100),
    $zlib, sprintf('%.2f%%', $zlib / $s * 100), sprintf('%.2f%%', $zlib / $v * 100),
  ];
}
```
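The script above only measures encoded sizes. As a quick sanity check, not part of the original benchmark, the same encoder options can be round-tripped through Sereal::Decoder; a minimal sketch (the sample data structure is made up for illustration):

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Sereal::Encoder qw( SRL_ZLIB );
use Sereal::Decoder;

# Round-trip sanity check: encode with the same options the benchmark
# script uses (Zlib compression + string dedup), decode, and compare.
my $data = { name => 'sample', tags => [ 'a', 'b', 'a' ], nested => { n => 42 } };

my $enc  = Sereal::Encoder->new({ compress => SRL_ZLIB, dedupe_strings => 1 });
my $blob = $enc->encode($data);

my $back = Sereal::Decoder->new->decode($blob);
printf "blob is %d bytes, nested n is %d\n", length($blob), $back->{nested}{n};
```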
Legend:

- **File**: name of the file;
- **JSON Size**: size of the original JSON-encoded file;
- **VPack best**: size of the VPack encoding, best result from Performance.md in the GitHub repo;
- **Defaults**: Sereal encoder result, default settings;
- **No Compr**: Sereal encoder result, no compression + string dedup;
- **Snappy**: Sereal encoder result, Snappy compression + string dedup;
- **ZLib**: Sereal encoder result, Zlib compression (level 6, the Sereal default) + string dedup.

The **% JSON** columns are compared to JSON Size, and the **% VPack** columns to VPack best. Below 100% is better.
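For concreteness, the percentage columns can be recomputed by hand; a tiny sketch, using the commits.json sizes copied from this gist:

```perl
use strict;
use warnings;

# Recompute the "% JSON" and "% VPack" columns for the Sereal-defaults
# result on commits.json (sizes copied from the results table below).
my ($json_size, $vpack_best, $sereal_def) = (25216, 20789, 9732);

my $pct_json  = sprintf '%.2f%%', $sereal_def / $json_size  * 100;
my $pct_vpack = sprintf '%.2f%%', $sereal_def / $vpack_best * 100;

print "$pct_json vs JSON, $pct_vpack vs VPack\n";  # 38.59% vs JSON, 46.81% vs VPack
```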
```
File                JSON Size VPack best % JSON Defaults % JSON % VPack No Compr % JSON % VPack Snappy % JSON % VPack ZLib   % JSON % VPack
api-docs.json       1205964   994160     82.44% 962926   79.85% 96.86%  908679   75.35% 91.40%  210957 17.49% 21.22%  114777 9.52%  11.55%
commits.json        25216     20789      82.44% 9732     38.59% 46.81%  9484     37.61% 45.62%  6365   25.24% 30.62%  4691   18.60% 22.56%
countries.json      1134029   956786     84.37% 585916   51.67% 61.24%  527862   46.55% 55.17%  323064 28.49% 33.77%  220710 19.46% 23.07%
directory-tree.json 297695    244716     82.20% 179021   60.14% 73.15%  168528   56.61% 68.87%  92232  30.98% 37.69%  64377  21.63% 26.31%
doubles-small.json  158706    89998      56.71% 89990    56.70% 99.99%  89990    56.70% 99.99%  80815  50.92% 89.80%  52183  32.88% 57.98%
doubles.json        1187062   899982     75.82% 899876   75.81% 99.99%  899876   75.81% 99.99%  804100 67.74% 89.35%  423361 35.66% 47.04%
file-list.json      151317    133536     88.25% 122334   80.85% 91.61%  111793   73.88% 83.72%  60120  39.73% 45.02%  40459  26.74% 30.30%
object.json         157781    118630     75.19% 118756   75.27% 100.11% 118756   75.27% 100.11% 87212  55.27% 73.52%  54979  34.85% 46.34%
pass1.json          1441      804        55.79% 806      55.93% 100.25% 806      55.93% 100.25% 806    55.93% 100.25% 806    55.93% 100.25%
pass2.json          52        51         98.08% 38       73.08% 74.51%  38       73.08% 74.51%  38     73.08% 74.51%  38     73.08% 74.51%
pass3.json          148       108        72.97% 110      74.32% 101.85% 110      74.32% 101.85% 110    74.32% 101.85% 110    74.32% 101.85%
random1.json        9672      6836       70.68% 6094     63.01% 89.15%  5863     60.62% 85.77%  4033   41.70% 59.00%  3096   32.01% 45.29%
random2.json        8239      5815       70.58% 5192     63.02% 89.29%  4981     60.46% 85.66%  3445   41.81% 59.24%  2694   32.70% 46.33%
random3.json        72953     51515      70.61% 45064    61.77% 87.48%  42271    57.94% 82.06%  25288  34.66% 49.09%  18224  24.98% 35.38%
sample.json         687491    153187     22.28% 98172    14.28% 64.09%  83121    12.09% 54.26%  83008  12.07% 54.19%  75831  11.03% 49.50%
small.json          82        30         36.59% 54       65.85% 180.00% 54       65.85% 180.00% 54     65.85% 180.00% 54     65.85% 180.00%
-- Total --         5097848   3676943    72.13% 3124081  61.28% 84.96%  2972212  58.30% 80.83%  1781647 34.95% 48.45% 1076390 21.11% 29.27%
```
**@neunhoef** commented:
Not that I want to start a battle of "mine is smaller" or anything, but for the sake of completeness we have added two more columns to our performance table, where we have taken the compact VPack version and run "gzip -9" and snappy compression, respectively. This now allows a sensible comparison of compressed Sereal with compressed VelocyPack.

See https://github.com/arangodb/velocypack/blob/master/Performance.md for details.

The reason we have not built compression into the VPack format itself is that, for us, the main advantage of VPack is that one can quickly access subvalues without parsing or deserialization. That is of course no longer possible after compression. On the other hand, if the aim is only compact storage, one can easily put compression on top of VPack, outside of the format specification.
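For what it's worth, the "compression on top of the format" approach described above can be sketched in a few lines of Perl with the core IO::Compress::Gzip module (the blob here is a made-up stand-in for an encoded value, not real VPack):

```perl
use strict;
use warnings;
use IO::Compress::Gzip qw( gzip $GzipError );

# Compress an already-encoded blob outside of the format itself.
# The payload is a stand-in; any VPack/Sereal/msgpack blob works the same way.
my $blob = join '', map { "key$_:value$_;" } 1 .. 500;

gzip \$blob => \my $packed, Level => 9
  or die "gzip failed: $GzipError";

printf "raw: %d bytes, gzip -9: %d bytes\n", length($blob), length($packed);
```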

**@melo** (author) commented Feb 18, 2016:
There are two columns above, Defaults and No Compr, that show Sereal without any compression. It still saves some space. But yes, this was never meant as a "mine is smaller" contest, just a comparison of two tools.

Doing MessagePack next...
