...to be turned into a blog post later. These are notes with references to commits, the blog post will have snippets of code so folks don't have to look things up.
Filo is an extreme serialization library for vector data. Think of it as the good parts of Parquet without the HDFS and file format garbage -- just the serdes and fast columnar storage.
I recently added a JMH benchmark for reading a Filo binary buffer containing 10,000 Ints using the simplest apply() method to sum up all the Ints.
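The benchmark shape looks roughly like this. This is a hedged sketch: the real `BasicFiloBenchmark` lives in the Filo repo, an `IndexedSeq` stands in here for the ColumnWrapper deserialized from the Filo binary buffer, and the JMH annotations are shown as comments so the snippet is self-contained:

```scala
// Sketch of the benchmark shape. In the real code this class carries JMH
// annotations (@State(Scope.Thread) on the class, @Benchmark on the method);
// they appear as comments here so the snippet stands alone. `sc` stands in
// for the ColumnWrapper read from a Filo binary buffer of 10,000 Ints.
class SumIntsBenchmark {
  val numValues = 10000
  val sc: IndexedSeq[Int] = 0 until numValues

  // @Benchmark
  def sumAllInts(): Int = {
    var total = 0
    var i = 0
    while (i < numValues) { total += sc(i); i += 1 }
    total
  }
}
```

JMH divides the measured time by the number of invocations, so the per-read cost is the reported time divided by 10,000.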
Oh, and before we get started: avoid throwing exceptions in inner loops, especially Try(...).getOrElse(...) patterns. Even if the exceptions occur only occasionally, they can be extremely expensive.
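To make the cost concrete, here is a hedged sketch contrasting the two styles (the function names are illustrative, not from Filo). Constructing a `Try` around failing code builds an exception, and filling in the stack trace is the expensive part:

```scala
import scala.util.Try

// Allocates a Try on every call, and constructs an exception (with a
// stack trace) on every failure -- costly inside a hot loop.
def parseWithTry(s: String): Int = Try(s.toInt).getOrElse(-1)

// Cheaper: check the input first and skip the exception machinery
// entirely. (Sketch only -- ignores overflow and negative numbers.)
def parseChecked(s: String): Int =
  if (s.nonEmpty && s.forall(_.isDigit)) s.toInt else -1
```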
This translates to 27.7ns per lookup -- not very good for an "extreme" deserialization lib. It turns out the Google FlatBuffers code for looking up a single element of a vector is not very efficient: it does lots of computation, and ByteBuffer's read methods are known to be slow.
The first real speedup was to implement FastBufferReader, which uses sun.misc.Unsafe and its intrinsics to speed up reading from the ByteBuffer vector. 7.1ns per read is much better, but I knew we could push the envelope much further.
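The core idea can be sketched like this (a hedged illustration, not Filo's actual FastBufferReader; it assumes a heap ByteBuffer backed by a byte array, and reads in native byte order):

```scala
import java.nio.ByteBuffer
import sun.misc.Unsafe

// Illustrative sketch of the Unsafe-read idea. Unsafe.getInt compiles down
// to a JIT intrinsic, skipping ByteBuffer's per-call bounds checking and
// byte-order dispatch. Note: reads use native byte order.
object UnsafeReader {
  private val unsafe: Unsafe = {
    val f = classOf[Unsafe].getDeclaredField("theUnsafe")
    f.setAccessible(true)
    f.get(null).asInstanceOf[Unsafe]
  }
  private val arrayOffset = unsafe.arrayBaseOffset(classOf[Array[Byte]])

  // Read the index-th Int from a heap (array-backed) ByteBuffer
  def readInt(buf: ByteBuffer, index: Int): Int =
    unsafe.getInt(buf.array(),
                  arrayOffset + buf.arrayOffset() + buf.position() + index * 4L)
}
```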
Next came a really small speedup: avoiding an unnecessary add in the lookup logic.
At this point I believed the main culprit was the VectorExtractor and the multiple layers of calls underneath the apply method. VectorExtractor returns a function, which probably involves virtual dispatch and is not very efficient.
The next big speedup involved rewriting how the ColumnWrapper was built: instead of apply calling another method, which in turn called the function returned by VectorExtractor, I wrote a ColumnMaker type class which directly mixes in a final def apply(i: Int) method. Final def methods can avoid more expensive virtual method invocations. Not too bad, but I knew we could still do better than 4.3ns per lookup.
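The before/after call shape can be sketched as follows (illustrative class names, not Filo's actual ones):

```scala
// Before (sketch): each read goes through a function value returned by
// the extractor -- a Function1.apply call with virtual dispatch that the
// JIT struggles to inline.
class SlowIntColumn(extract: Int => Int) {
  def apply(i: Int): Int = extract(i)
}

// After (sketch): a concrete final method reading directly from the data,
// which the JVM can devirtualize and inline.
class FastIntColumn(data: Array[Int]) {
  final def apply(i: Int): Int = data(i)
}
```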
I used JMH's -prof stack profiler with -jvmArgsAppend -Djmh.stack.lines=3 to analyze where the time was being spent. One line contained UnboxInt -- hmm... where are we doing boxing? That had to be the remaining performance hurdle.
Using javap to disassemble the BasicFiloBenchmark class, I found this:
13: aload_0
14: invokevirtual #48 // Method sc:()Lorg/velvia/filo/ColumnWrapper;
17: iload_1
18: invokeinterface #54, 2 // InterfaceMethod org/velvia/filo/ColumnWrapper.apply:(I)Ljava/lang/Object;
23: invokestatic #60 // Method scala/runtime/BoxesRunTime.unboxToInt:(Ljava/lang/Object;)I
Hmm... I think I've got it. The ColumnWrapper trait has this definition: def apply(index: Int): A. The problem: because type A in the trait is not specialized, the Java bytecode has to return a java.lang.Object, which must be unboxed on every read!
Adding @specialized to ColumnWrapper did the final trick, allowing the test code to avoid unboxing and read an Int directly... leading to an incredible 0.5ns per read!
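The fix looks roughly like this (a minimal sketch following the names in the post; the real trait in Filo has more to it):

```scala
// Annotating the type parameter makes scalac generate a primitive-Int
// variant of the trait, so apply returns a raw int instead of a boxed
// java.lang.Object that must be unboxed on every read.
trait ColumnWrapper[@specialized(Int, Long, Double) A] {
  def apply(index: Int): A
}

class IntWrapper(data: Array[Int]) extends ColumnWrapper[Int] {
  final def apply(index: Int): Int = data(index)
}
```

With specialization, a call site typed as ColumnWrapper[Int] dispatches to the generated apply$mcI$sp method, which returns int directly -- the BoxesRunTime.unboxToInt call in the bytecode above disappears.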
Note that I had tried specialization earlier -- before 36dc42b streamlined the method reads -- and it made no difference. We had to get past the virtual method invocations before the boxing overhead could be uncovered.