...to be turned into a blog post later. These are notes with references to commits, the blog post will have snippets of code so folks don't have to look things up.
Filo is an extreme serialization library for vector data. Think of it as the good parts of Parquet without the HDFS and file format garbage -- just the serdes and fast columnar storage.
I recently added a JMH benchmark for reading a Filo binary buffer containing 10,000 Ints using the simplest apply() method to sum up all the Ints.
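The benchmark shape looks roughly like this. This is a hedged sketch: the real `BasicFiloBenchmark` lives in the Filo repo, an `IndexedSeq` stands in here for the ColumnWrapper deserialized from the Filo binary buffer, and the JMH annotations are shown as comments so the snippet is self-contained:

```scala
// Sketch of the benchmark shape. In the real code this class carries JMH
// annotations (@State(Scope.Thread) on the class, @Benchmark on the method);
// they appear as comments here so the snippet stands alone. `sc` stands in
// for the ColumnWrapper read from a Filo binary buffer of 10,000 Ints.
class SumIntsBenchmark {
  val numValues = 10000
  val sc: IndexedSeq[Int] = 0 until numValues

  // @Benchmark
  def sumAllInts(): Int = {
    var total = 0
    var i = 0
    while (i < numValues) { total += sc(i); i += 1 }
    total
  }
}
```

JMH divides the measured time by the number of invocations, so the per-read cost is the reported time divided by 10,000.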
Oh, and before we get started: avoid throwing exceptions in inner loops, especially Try(...).getOrElse(...) patterns. Even if the exceptions occur only occasionally, they can be extremely expensive.
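To make the cost concrete, here is a hedged sketch contrasting the two styles (the function names are illustrative, not from Filo). Constructing a `Try` around failing code builds an exception, and filling in the stack trace is the expensive part:

```scala
import scala.util.Try

// Allocates a Try on every call, and constructs an exception (with a
// stack trace) on every failure -- costly inside a hot loop.
def parseWithTry(s: String): Int = Try(s.toInt).getOrElse(-1)

// Cheaper: check the input first and skip the exception machinery
// entirely. (Sketch only -- ignores overflow and negative numbers.)
def parseChecked(s: String): Int =
  if (s.nonEmpty && s.forall(_.isDigit)) s.toInt else -1
```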
This translates to 27.7ns per lookup -- not very good for an "extreme" deserialization lib. It turns out the Google FlatBuffers code for looking up a single element of a vector is not very efficient: it does lots of computation, and ByteBuffer's read methods are known to be slow.
The first real speedup was to implement FastBufferReader, which uses sun.misc.Unsafe and its intrinsics to speed up reading from the ByteBuffer vector. 7.1ns per read is much better, but I knew we could push the envelope much further.
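The core idea can be sketched like this (a hedged illustration, not Filo's actual FastBufferReader; it assumes a heap ByteBuffer backed by a byte array, and reads in native byte order):

```scala
import java.nio.ByteBuffer
import sun.misc.Unsafe

// Illustrative sketch of the Unsafe-read idea. Unsafe.getInt compiles down
// to a JIT intrinsic, skipping ByteBuffer's per-call bounds checking and
// byte-order dispatch. Note: reads use native byte order.
object UnsafeReader {
  private val unsafe: Unsafe = {
    val f = classOf[Unsafe].getDeclaredField("theUnsafe")
    f.setAccessible(true)
    f.get(null).asInstanceOf[Unsafe]
  }
  private val arrayOffset = unsafe.arrayBaseOffset(classOf[Array[Byte]])

  // Read the index-th Int from a heap (array-backed) ByteBuffer
  def readInt(buf: ByteBuffer, index: Int): Int =
    unsafe.getInt(buf.array(),
                  arrayOffset + buf.arrayOffset() + buf.position() + index * 4L)
}
```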
Next came a really small speedup: avoiding an unnecessary add in the lookup logic.
At this point I believed the main culprit was the VectorExtractor and the multiple layers of calls underneath the apply method. VectorExtractor returns a function, which probably involves virtual dispatch and is not very efficient.
The next big speedup involved rewriting how the ColumnWrapper was built: instead of apply calling another method, which in turn called the function returned by VectorExtractor, I wrote a ColumnMaker type class which directly mixes in a final def apply(i: Int) method. Final def methods can avoid more expensive virtual method invocations. Not too bad, but I knew we could still do better than 4.3ns per lookup.
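The before/after call shape can be sketched as follows (illustrative class names, not Filo's actual ones):

```scala
// Before (sketch): each read goes through a function value returned by
// the extractor -- a Function1.apply call with virtual dispatch that the
// JIT struggles to inline.
class SlowIntColumn(extract: Int => Int) {
  def apply(i: Int): Int = extract(i)
}

// After (sketch): a concrete final method reading directly from the data,
// which the JVM can devirtualize and inline.
class FastIntColumn(data: Array[Int]) {
  final def apply(i: Int): Int = data(i)
}
```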
I used JMH's -prof stack profiler with -jvmArgsAppend -Djmh.stack.lines=3 to analyze where the time was being spent. One line contained UnboxInt -- hmm... where are we doing boxing? That had to be the remaining performance hurdle.
Using javap to disassemble the BasicFiloBenchmark class, I found this:
13: aload_0
14: invokevirtual #48 // Method sc:()Lorg/velvia/filo/ColumnWrapper;
17: iload_1
18: invokeinterface #54, 2 // InterfaceMethod org/velvia/filo/ColumnWrapper.apply:(I)Ljava/lang/Object;
23: invokestatic #60 // Method scala/runtime/BoxesRunTime.unboxToInt:(Ljava/lang/Object;)I
Hmm... I think I've got it. The ColumnWrapper trait has this definition: def apply(index: Int): A. The problem: because type A in the trait is not specialized, the Java bytecode has to return a java.lang.Object, which must be unboxed on every read!
Adding @specialized to ColumnWrapper did the final trick, allowing the test code to avoid unboxing and read an Int directly... leading to an incredible 0.5ns per read!
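The fix looks roughly like this (a minimal sketch following the names in the post; the real trait in Filo has more to it):

```scala
// Annotating the type parameter makes scalac generate a primitive-Int
// variant of the trait, so apply returns a raw int instead of a boxed
// java.lang.Object that must be unboxed on every read.
trait ColumnWrapper[@specialized(Int, Long, Double) A] {
  def apply(index: Int): A
}

class IntWrapper(data: Array[Int]) extends ColumnWrapper[Int] {
  final def apply(index: Int): Int = data(index)
}
```

With specialization, a call site typed as ColumnWrapper[Int] dispatches to the generated apply$mcI$sp method, which returns int directly -- the BoxesRunTime.unboxToInt call in the bytecode above disappears.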
Note that I had tried specialization earlier -- before 36dc42b streamlined the method reads -- and it made no difference. We had to get past the virtual method invocations before the boxing overhead could be uncovered.