Skip to content

Instantly share code, notes, and snippets.

@slfritchie
Created September 18, 2014 10:01
Show Gist options
  • Star 3 You must be signed in to star a gist
  • Fork 1 You must be signed in to fork a gist
  • Save slfritchie/809d3c666ec98ed15bd1 to your computer and use it in GitHub Desktop.
Save slfritchie/809d3c666ec98ed15bd1 to your computer and use it in GitHub Desktop.
Tokyo Source code reading #27 notes

Riak Source Code Reading #27: Riak KV backend API and storage

The Riak KV backend store API

behaviour_info(callbacks) ->
    [
     {api_version,0},
     {capabilities, 1},  % (State)
     {capabilities, 2},  % (Bucket, State)
     {start,2},          % (Partition, Config)
     {stop,1},           % (State)
     {get,3},            % (Bucket, Key, State)
     {put,5},            % (Bucket, Key, IndexSpecs, Val, State)
     {delete,4},         % (Bucket, Key, IndexSpecs, State)
     {drop,1},           % (State)
     {fold_buckets,4},   % (FoldBucketsFun, Acc, Opts, State),
                         %   FoldBucketsFun(Bucket, Acc)
     {fold_keys,4},      % (FoldKeysFun, Acc, Opts, State),
                         %   FoldKeysFun(Bucket, Key, Acc)
     {fold_objects,4},   % (FoldObjectsFun, Acc, Opts, State),
                         %   FoldObjectsFun(Bucket, Key, Object, Acc)
     {is_empty,1},       % (State)
     {status,1},         % (State)
     {callback,3}];      % (Ref, Msg, State) ->
behaviour_info(_Other) ->
    undefined.

A very simple version of the 'yessir' backend

First, look at this simple version of the 'yessir' back end.

  • The 'yessir' back end is only used for debugging and for performance analysis.

  • The 'yessir' back end does not store anything. It always lies.

  • The intent is to be very simple and fast.

Early version of riak_kv_yessir_backend.erl, June, 2012

  • Being fast /= being short & simple code.

Let's look at the 'memory' backend

This code is more complicated than the 'yessir' backend. Why?

  • It actually has a real memory.
  • It actually supports Riak KV vnode fold operations
    • fold_buckets, fold_keys, fold_objects
    • syncronous fold, asyncronous fold
    • The fold API looks very simple & easy to use...
    • The fold API internal plumbing is ... not pretty.
  • It supports Riak KV's 2I ("secondary indexes")
  • It supports age limits ("time to live" or "TTL")
  • It supports memory size accounting (instead of using ETS counters)
  • It supports maximum memory limit:
    • If limit is exceeded, then delete oldest object(s)

Here is the version of today's develop branch of the 'memory' backend: https://github.com/basho/riak_kv/blob/8ebfeb6700b451772e779c5f3c5601ca1ab4ec92/src/riak_kv_memory_backend.erl

Note #1 about folding

  • Very nice, pure, and easy:
lists:foldl(fun(_X, Count) -> Count + 1 end,
            0,
            [a, b, c, d, e, f, 42.0, "yo, kewl"]).
  • The backend and vnode folding funcs are not pure and easy. (sadpanda)

Note #2 about folding: general patterns & themes

  • General pattern: filter
filter_wrapper_fun(Item, InnerFun, Acc) ->
    case is_this_item_invisible(Item) of
        true ->
            Acc;
        false ->
            InnerFun(Item, Acc)
    end.
  • General pattern: transform
transform_wrapper_fun(Item, InnerFun, Acc) ->
    RealItem = transform_somehow(Item),
    InnerFun(RealItem, Acc).
  • General pattern: transform from riak_kv_vnode.erl
handle_command(?FOLD_REQ{foldfun=FoldFun, acc0=Acc0,
                         forwardable=_Forwardable, opts=Opts}, Sender, State) ->
    FoldWrapper = fun(Bucket, Key, Value, Acc) ->
                          FoldFun({Bucket, Key}, Value, Acc)
                  end,

Now, let's look at the 'memory' backend's fold functions: buckets, keys, objects, ...

What is a backend "folder" function?

  • A backend "folder" is a function that implements a backend's fold.

    • ... Or like lists:foldl() implements a lists fold.
    • ... Just like ets:fold() implements the ETS fold.
  • General folder pattern: stop the fold early

perhaps_stop_early_wrapper_fun(Item, InnerFun, Acc) ->
    try
        InnerFun(RealItem, Acc)
    catch
        stop_fold ->
            Acc
    end.

%%% ...

my_fold_function({Key, _Val}=BKey, InnerFun, Acc) ->
    if
       Key >= <<"n">> ->    % Avoid keys >= "n..."
           throw stop_fold;
       true ->
           InnerFun(BKey, Acc)
    end.

Bitcask, the Riak KV backend

Source code: https://github.com/basho/riak_kv/blob/develop/src/riak_kv_bitcask_backend.erl

This is the glue code between Riak KV and the Bitcask app.

Some additional things that the Bitcask backend does that the memory backend does not:

  • Merge operations, to reclaim unused/"dead" disk space.
  • Check for data upgrade/file format conversions

Bitcask, the app

See also: the original Bitcask design summary, April 2010.

This design doc is still mostly correct ... many small details have changed in the last 4 years, though.

  • Part 1: the "keydir" is stored in RAM always.
    • When a bitcask is opened, the keydir is read from disk -> RAM.
    • You must have enough RAM to store all keys.
  • Part 2: the data on disk.
    • Key + value data is stored on disk

Bitcask, part 1: the keydir in RAM

  • The keydir is a key/value store
    • key: same as the Bitcask key
    • value: where on disk to find the Bitcask value
      • file_id #, value size, value offset, timestamp
  • The keydir is a hash table
    • Key order is not preserved
    • Key iteration is done via fold
    • Severe limits on the # of simultaneous folds
  • The keydir is implemented as a NIF
    • NIF <-> C code

Bitcask, part 2: data files on disk

  • Bitcask files are append-only.
  • To delete or replace data, new records are appended to the active file.
  • "Dead data" lives on disk for a period of time...
  • ... until "merge" can copy "alive data" to a new data file.
    • For each copied old key + value, update the RAM keydir
    • When all "alive data" is copied, then delete the old data file.
  • Only two files may be open at one time:
    • The active file: new writes only
    • (maybe) The merge file: old data copied from old files
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment