Skip to content

Instantly share code, notes, and snippets.

@skippy
Last active August 29, 2015 14:20
Show Gist options
  • Save skippy/00c23e4e7bdac5124fe4 to your computer and use it in GitHub Desktop.
Save skippy/00c23e4e7bdac5124fe4 to your computer and use it in GitHub Desktop.
thoughts & goals on google/protobuf

tl;dr

Update the google/protobuf 3.0-alpha library in 2 concrete areas, and 1 opinionated way:

  • remove inconsistencies between the c- and java-backed ruby libraries. Most inconsistencies are bugs in one implementation and not the other, but there was at least one helper method which existed in one and not the other.
  • Improve the test suite. Since the bulk of the logic exists in lower-level C and java libraries, items which usually aren't tested because they come from the standard lib, like #push, should be tested as this library uses plenty of low-level handcrafted code
  • Opinionated changes -- make it feel more ruby-like. This means being able to accept string or symbols in initializers, returning nil instead of raising an exception when an array index out-of-bounds occurs, have common names and aliases for methods (e.g. #size and #length), and having RepeatedFields quack like an array, and Map quack like a hash. There may be other examples.

Disclaimer

First off, I'm very excited to see this library take shape. Having a canonical protobuf library is a good thing. We use python, go, ruby (MRI & jruby), and java in our stack, and having a common library that has similar behavior and functionality across languages is toast-worthy.

Challenge:

  • Performance optimizations: Ruby, historically, does not put performance as a first-level concern and rather puts its emphasis elsewhere. Many of us who use ruby on a regular basis are fine with this trade off. The google/protobuf library has put an unofficial stake in the ground on performance. The challenge will be keeping the app maintainable and 'ruby-like' while keeping performance as a top-level concern.

  • There are three levels of code: C, Java, and Ruby. In an ideal world, the main library will be ruby with optimized and optional sub-components in Java and C. However, moving even basic functionality to ruby and out of native, like #to_h which iterates through the fields and adds them to a hash, can be upwards of 9x slower gist.

Proposal

  • make the library exhibit the same behavior under both jRuby and C
  • hook up the ruby library into travis and test Ruby 2.0.0, 2.1.0, 2.2.0, jruby-1.7.19, and jruby-9.0.0.0.pre2
  • significantly beef up the test suite, and move it from being a happy-path focused suite to the conanical definition for the ruby library. And because much of the functionality is outside the ruby core and stdlib, unit testing of array and map-like functionality should be deeply tested
  • evaluate three approaches to make the library behave more like standard Array and Map objects:
    • move common function functionality to ruby.
    • write helper methods that are 100% API-compatible with Array and Maps
    • in the Java/C libraries, move to using Hash and Array objects that can then be exposed into the Ruby lib and then delegated against.

Summary I think the choice for making it more 'ruby-like' will either be:

  • Do it up to, but not including ducktyping RepeatedFields and Map, or
  • add the functionality in a ruby module that is shared across both C and Java.

I am starting to understand why ducktyping RepeatedFields and Map is a challenging target, as it may add a lot of work. But I also feel that not having it behave similarly to existing protobuf libs will be a significant pain point. I'm guessing, so having a much higher degree of confidence before taking on the burden of being 100% Array and Map api compatible, would be great.

Work done to date

  • I'm fixing inconsistencies as I find them, though without a comprehensive test suite I'm sure I'm missing some
  • I experimented with a few different approaches for duck-typing:
    • moving common functionality to ruby: I got this to work quite well, but the performance was noticeably bad compared to the performance as it exists today. We're talking 10x faster.
    • I explored writing a ruby-only version, which would replace the jruby implementation and make the C-exts as optional. wow, was this painful, and it quickly becomes unattractive, if for no other reason than a full protobuf encoder/decoder would need to be maintained in parallel with the primary c-lib. That seems untenable to me.
    • I modified RepeatedFields.c to populate a rb_ary object in parallel. I looked at replacing the internal data structure with a rb_ary, but that was a non-starter as it uses upd, which is appropriate. Also, there are performance enhancements, such as memcpy that are not available if rb_ary becomes the primary data model. Having rb_ary as a secondary data model that mirrors the internal structures works, and works well, BUT I strongly suspect it is quite efficient. I stopped before doing detail benchmarking or analysis of memory usage, so I don't know if there will be a lot of mem copying, pointers, references, etc... And it still runs into the problem of #! methods... keeping a secondary rb_ary will make any non-modifying array behavior much faster than to keep calling #to_ary, but it won't help any modification methods.
    • I started to add methods that would allow 100% api-compatibility with Array and Map. This is a combination of C/Java functionality and helper methods in ruby. The risks are keeping this inline with different versions of ruby, and the #! methods. The #! can be implemented in ruby, but they will really not be efficient. I don't know what 'good enough' is. And, test cases will be needed to double check all this functionality; functionality which is not a part of the core protobuf lib
@cfallin
Copy link

cfallin commented May 8, 2015

Thanks for this summary and braindump of what you're thinking and hoping to do! I like the general direction, and overall we've been really happy to have community feedback like this and get improvements.

A few more specific thoughts:

  • I'm fine with adding a "top layer" in pure Ruby that builds on the lower level primitives. There is definitely value in maintainability as long as the most common fast paths (field access and repeated field iteration) are fast/native. E.g. your PR #340 that delegates inspect and the like is nice.
  • Re: "opinionated" Ruby-like array behavior, and nil rather than exception on out-of-bounds: I am probably much less deep into the ways of Ruby than you are, so I may be missing some subtlety that makes this a hard choice, but honestly it seems perfectly reasonable to me to go with nil. Throwing an exception on out-of-bounds was the default "safe" behavior we came to. It's technically a breaking API change, but it's probably unlikely that someone is relying on that exception yet, as we're still alpha.
  • Re: keeping a parallel native Ruby array for read-only accesses: the thing I would worry about here is that the array could get out of sync if a raw reference to it somehow leaked and the user were able to mutate it. There's also 2x overhead to mutations, and it forces boxing of native types (e.g. Ruby integers rather than native int64_t/int32_t).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment