@lenary
Last active August 29, 2015 13:56
Guidance Notes for CRDT Performance Issues

I've had one question about subpar performance in Riak 2.0's CRDTs. I thought I'd write this so that people can more easily diagnose these issues, without the CRDT team having to step in every time.

An example: a client was having problems with the performance of fetching and updating sets. The issue showed up as poor fetch performance.

So, how do you go about debugging/diagnosing this?

Start by gathering a few details. Which ones you need depends on the data type. For sets, we want to know the average number of elements in the set and their average size. This also gives a rough idea of the size of the Riak object the set is stored in. The total size should be below about 1 MB, which matches our guidance for regular Riak objects. Bear in mind, though, that CRDTs carry a not-insignificant overhead on top of the raw element data.
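As a back-of-the-envelope check on the numbers above, here is a minimal sketch in Python. The function names and the `overhead_factor` parameter are hypothetical: the notes only say the overhead is "not insignificant", so the factor is something you would have to measure yourself, not a known constant.

```python
ONE_MB = 1024 * 1024  # the "about 1 MB" guidance from the notes


def estimate_set_bytes(element_count, avg_element_bytes, overhead_factor=1.0):
    """Rough size of a set: count x average element size x overhead.

    overhead_factor is an assumed multiplier for CRDT metadata;
    1.0 means "raw element data only".
    """
    return int(element_count * avg_element_bytes * overhead_factor)


def within_guidance(element_count, avg_element_bytes, overhead_factor=1.0):
    """True if the estimate stays under the ~1 MB object-size guidance."""
    size = estimate_set_bytes(element_count, avg_element_bytes, overhead_factor)
    return size < ONE_MB
```

For example, 10,000 elements of about 50 bytes each come to roughly 500 KB of raw data, which is under the guidance, but an assumed overhead factor of 2.5 would push the same set over it.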

Other data types:

  • Maps: know rough number of keys, and rough depth (how far the map recurses) and width of the map (max number of keys at any level).
  • Counters: know what n value people are using
  • Booleans: meh (roughly constant size)
  • LWW-Registers: meh (constant size)

Next up, collect the relevant CRDT stats. The full list is here: https://github.com/basho/riak_kv/commit/26fb10f88094f990ce51ccd7e5f3559ab4fe98b9 (in a stat key, the data type is the top-level data type). I don't know which pre-release the commit landed in, but it's certainly in there somewhere.

This should give us lots of helpful information about rough fetch/update load and various operation latencies. It might also be useful to know the distribution of updates on the datatype, i.e. are most updates happening to only a few of the keys/elements, or are updates fairly evenly divided across the whole datatype? Unfortunately, we don't measure this, and we haven't yet worked out whether it affects anything.
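Since the update distribution is not one of the collected stats, one way to get a rough picture is to log which keys/elements your application updates and summarise the log yourself. The sketch below assumes such a client-side log exists as a plain list; `top_share` is a hypothetical helper, not part of Riak or any client library.

```python
from collections import Counter


def top_share(updated_elements, top_n=10):
    """Fraction of all updates that hit the top_n most-updated elements.

    A value near 1.0 means updates are concentrated on a few elements;
    a value near top_n / (number of distinct elements) means they are
    spread fairly evenly.
    """
    counts = Counter(updated_elements)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    hottest = sum(n for _, n in counts.most_common(top_n))
    return hottest / total
```

With a log of eight updates to element "a" and one each to "b" and "c", `top_share(log, top_n=1)` gives 0.8, i.e. a heavily skewed workload.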

Ok, now for some diagnosis.

If it's a set/map that's large, the problem is almost certainly ordsets/orddict. We want to look at this more. Unfortunately, swapping in a different implementation is not a simple search/replace. @seancribbs has ported HashSet/HashDict from Elixir to Erlang (https://github.com/seancribbs/hashtypes), which we may well use. We could swap to sets/dict, but their equality is broken (two structures holding the same elements are not guaranteed to compare equal). Converting to hashtypes is currently the lowest-hanging fruit for increasing performance, though that claim needs a benchmark behind it rather than just words on a page from me.

This is almost certainly the problem in the example GET performance issue above. CRDTs go through a nontrivial decode from riak_object -> riak_dt_* -> { protobuffs | http/json }. In the conversion from the dt datatype to protobuffs or JSON, we call riak_dt_*:value/1, which does a large orddict traversal. Hello O(n).
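To see why list-backed structures hurt as they grow: Erlang's ordsets and orddict keep their contents in an ordered list, so work on them scales with the number of entries. The Python sketch below is only a model of that behaviour, not the Erlang implementation. It inserts into a sorted, duplicate-free list by walking from the head and counts how many elements it had to inspect.

```python
def sorted_list_add(elements, new):
    """Add `new` to a sorted, duplicate-free list by scanning from the head.

    Returns (resulting_list, steps), where steps is the number of
    existing elements inspected before the position was found.
    """
    steps = 0
    for i, current in enumerate(elements):
        steps += 1
        if new == current:
            # Already present: the set is unchanged.
            return elements, steps
        if new < current:
            # Found the insertion point; splice the new element in.
            return elements[:i] + [new] + elements[i:], steps
    # Larger than everything seen: append at the end.
    return elements + [new], steps
```

Adding an element that sorts last into a 1,000-element list inspects all 1,000 entries. A full traversal such as the one in value/1 touches every entry in the same way, which is where the O(n) comes from.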

As for general approaches to diagnosis:

  • Issues with get? Instrument riak_dt_*:value/1
  • Issues with put? Instrument riak_dt_*:update/3
  • These don't turn up anything? Instrument riak_dt_*:to_binary/1 and riak_dt_*:from_binary/1

Which riak_dt_* module?

  • Counters: riak_dt_pncounter
  • Sets: riak_dt_orswot
  • Maps: riak_dt_map
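The two lists above combine into a simple lookup: pick the module from the data type, then the function from the symptom, with the binary conversion functions as the fallback. The sketch below just encodes that as strings. The module and function names are the ones from these notes; the helper and its dictionary keys are made up for illustration.

```python
# Data type -> riak_dt module, as listed in the notes.
MODULES = {
    "counter": "riak_dt_pncounter",
    "set": "riak_dt_orswot",
    "map": "riak_dt_map",
}

# Symptom -> first function to instrument.
FIRST_LOOK = {"get": "value/1", "put": "update/3"}

# Where to look if the first function turns up nothing.
FALLBACK = ["to_binary/1", "from_binary/1"]


def instrumentation_targets(data_type, symptom):
    """Return Module:Function/Arity strings in the order to instrument them."""
    module = MODULES[data_type]
    calls = [FIRST_LOOK[symptom]] + FALLBACK
    return [f"{module}:{call}" for call in calls]
```

So for the slow set fetches in the example, `instrumentation_targets("set", "get")` starts with riak_dt_orswot:value/1 and then falls back to to_binary/1 and from_binary/1 on the same module.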

There's also riak_kv_crdt. This is the module that does stats and proxies calls through to the right riak_dt_* module using information from the cluster. It's probably not worth debugging. I realise it uses orddict in places, but these should have a maximum of 3 or so entries (one for each type), so they can be ignored.
