- Deserialization of Ruby objects is slow.
- Filter plugins are slow.
A Record class is implemented in C. It holds Ruby objects as a binary serialized with MessagePack.
When a field is accessed (e.g. record['key']), it returns a deserialized object. Only the requested value is deserialized, so performance improves when only a few fields are accessed.
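The lazy field access described above can be sketched in plain Ruby. This is an illustration only, not the proposed C implementation: the class name LazyRecord and its decode_count counter are hypothetical, Marshal stands in for MessagePack so that no gem is needed, and an explicit key-to-offset index replaces scanning the msgpack map.

```ruby
# Hypothetical pure-Ruby stand-in for the C Record class.
# Assumptions: Marshal replaces MessagePack, and a key => [offset, length]
# index replaces walking the serialized map to find a field.
class LazyRecord
  attr_reader :decode_count

  def initialize(hash)
    @index = {}
    buffer = String.new(encoding: Encoding::BINARY)
    hash.each do |key, value|
      bytes = Marshal.dump(value)
      @index[key] = [buffer.bytesize, bytes.bytesize]
      buffer << bytes
    end
    @buffer = buffer.freeze # the serialized form is never modified
    @decode_count = 0
  end

  # Deserializes only the requested field; other fields stay serialized.
  def [](key)
    offset, length = @index[key]
    return nil if offset.nil?

    @decode_count += 1
    Marshal.load(@buffer.byteslice(offset, length))
  end
end

record = LazyRecord.new({ 'a' => 1, 'b' => { 'c' => 2 }, 'big' => 'x' * 1000 })
p record['a']          # 1 (only the bytes for 'a' are decoded)
p record.decode_count  # 1
```

The counter exists only to make the laziness observable: the 'b' and 'big' fields are never decoded unless they are read.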
For further optimization, when a field access returns a String, it uses copy-on-write as follows:
- A Record stores the msgpack binary in memory as a String object. The String object should be frozen.
- When a field is accessed and the value is a string, the Record calls String#substr to return a subset of the entire binary. This is copy-on-write in CRuby. Thus, as long as subsequent processing doesn't modify the object, no data is copied.
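The behaviour that callers see can be illustrated from Ruby itself. In this sketch String#byteslice is an assumed stand-in for the substring call made inside the C extension. Whether the slice actually shares the source buffer is a CRuby implementation detail that this example does not measure. It only shows that the slice is independently mutable while the frozen source stays intact.

```ruby
# Assumption: byteslice stands in for the substring call in the C extension.
binary = ('header' + 'payload' * 100).b.freeze

value = binary.byteslice(6, 7)  # may share binary's buffer in CRuby
p value                         # "payload"
p value.frozen?                 # false: the slice itself is mutable

value << '!'                    # first write; any shared data is copied now
p value                         # "payload!"
p binary.byteslice(6, 7)        # "payload" (the frozen source is unchanged)
```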
A Page class is implemented in C. It holds a sequence of Records (or maybe a map of String to Record) as a binary serialized with MessagePack. Creating a Page from a msgpack binary is fast because no objects have to be deserialized.
A Page also has a list of Proc objects that convert records. Those Proc objects are called only when records are actually used.
record1 = {'a' => 1, 'b' => 2}
record2 = {'c' => 1, 'e' => 2}
page = Page.build([record1, record2]) # internally, Page holds a msgpack binary
p page[1] #=> {'c' => 1, 'e' => 2} # deserializes record2 without deserializing record1
page.map! {|r| {'modified' => 1} } # the proc is not called here
p page[1] #=> {'modified' => 1} # the proc is called now: record2 is deserialized (record1 is not), the proc is applied, and the result is returned
page.map! {|r| {'modified' => 1} } # each map! call appends its proc to the internal list
record1 = {'a' => 1, 'b' => {'c' => 1}}
page = Page.build([record1, record2])
p page.dig('b', 'c') # deserializes only the value 1, without deserializing the enclosing Hash
record1 = {'a' => 1, 'b' => {'c' => 1}}
page = Page.build([record1, record2])
page.dig!('b', 'c') {|v| v + 1 } # the proc is not called here
p page[0] #=> {'a' => 1, 'b' => {'c' => 2}} # the proc is called here
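The deferred conversion behaviour of map! and dig! can be sketched in plain Ruby as follows. This is a minimal illustration under stated assumptions, not the proposed C implementation: the class name LazyPage is hypothetical, Marshal stands in for MessagePack, each record is kept as its own serialized blob, and results are not cached, so the queued procs run again on every access.

```ruby
# Hypothetical pure-Ruby stand-in for the C Page class.
# Assumptions: Marshal replaces MessagePack; one serialized blob per record.
class LazyPage
  def self.build(records)
    new(records.map { |r| Marshal.dump(r).freeze })
  end

  def initialize(blobs)
    @blobs = blobs
    @procs = []
  end

  # Queues a conversion; nothing is deserialized or called here.
  def map!(&block)
    @procs << block
    self
  end

  # Queues an update of one nested value as a record-level conversion.
  # Records that lack the path are passed through unchanged.
  def dig!(*keys, &block)
    map! do |record|
      *path, last = keys
      parent = path.empty? ? record : record.dig(*path)
      next record unless parent.is_a?(Hash) && parent.key?(last)

      parent[last] = block.call(parent[last])
      record
    end
  end

  # Deserializes only the requested record, then applies queued procs in order.
  def [](index)
    record = Marshal.load(@blobs.fetch(index))
    @procs.reduce(record) { |r, conv| conv.call(r) }
  end
end

calls = 0
page = LazyPage.build([{ 'a' => 1, 'b' => { 'c' => 1 } }, { 'c' => 1, 'e' => 2 }])
page.dig!('b', 'c') { |v| calls += 1; v + 1 }
p calls    # 0: the proc has not run yet
p page[0]  # the proc runs now; the nested 'c' becomes 2
p calls    # 1
```

Expressing dig! as a queued map! keeps a single ordered list of conversions, which matches the "list of Proc objects" described above. A real implementation would presumably avoid deserializing the whole record for dig!, which this sketch does not attempt.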