lenary/00-intro.md

## 00-intro.md

      
I've had one question about subpar performance in Riak 2.0's CRDTs. I thought I'd write this up so that people can more easily diagnose these issues without the CRDT team having to step in every time.

An example: a client was having problems with the performance of fetching and updating sets. The issue manifested itself as poor fetch performance.

So, how do you go about debugging/diagnosing this?

  
## 01-data-gathering.md

      
Start off by getting a few details. These depend on the particular data type. For sets, we want to know the average count and size of elements in the set. This also gives a rough idea of the size of the riak object we're storing the set in. The total size should be below about 1MB, which corresponds with our guidance on regular riak objects. However, CRDTs do have a not-insignificant overhead, so the stored object will be larger than the raw elements alone.

Other data types:

- Maps: know the rough number of keys, the rough depth (how far the map nests), and the width of the map (max number of keys at any level).
- Counters: know what n value people are using.
- Booleans: meh (roughly constant size).
- LWW-Registers: meh (constant size).
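
For the set case, here's a back-of-the-envelope check against the roughly 1M guideline. It's only a sketch in Python: the function names and the `per_element_overhead` parameter are made up for illustration, and the overhead value is a placeholder you'd have to measure yourself, not a real Riak figure.

```python
# Rough size check for a set stored as a Riak data type.
# ASSUMPTION: per_element_overhead is a placeholder, not a measured Riak number.
SIZE_GUIDELINE_BYTES = 1_000_000  # "about 1M", per the guidance above


def estimate_set_bytes(element_count, avg_element_bytes, per_element_overhead=0):
    """Return a rough estimate of the stored size of a set, in bytes."""
    return element_count * (avg_element_bytes + per_element_overhead)


def exceeds_guideline(element_count, avg_element_bytes, per_element_overhead=0):
    """True if the rough estimate is over the size guideline."""
    estimate = estimate_set_bytes(element_count, avg_element_bytes, per_element_overhead)
    return estimate > SIZE_GUIDELINE_BYTES
```

With the default overhead of 0 this is a lower bound, so a set that already fails the check here is definitely too big.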

Next up, collect the relevant CRDT stats. The list of them all is here: https://github.com/basho/riak_kv/commit/26fb10f88094f990ce51ccd7e5f3559ab4fe98b9 (in a stat key, the data type is the top-level data type). I don't know which pre-release the commit landed in, but it's certainly in there somewhere.

This should give us lots of helpful information about rough fetch/update load and various operation latencies. It might also be useful for us to know the distribution of updates on the datatype, i.e. whether most updates hit only a few of the keys/elements or are spread fairly evenly across the whole datatype. Unfortunately, we haven't yet worked out if this affects anything, and we don't measure it.
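
Since the update distribution isn't measured for you, you'd have to collect it yourself. Assuming you can log which key/element each update touches (that log is an assumption, nothing in Riak produces it), a rough skew measure might look like this in Python. `top_share` is a hypothetical helper name.

```python
from collections import Counter


def top_share(updated_elements, top_n=10):
    """Fraction of all updates that hit the top_n most-updated elements.

    updated_elements: iterable with one entry per update, naming the
    key/element that update touched (ASSUMPTION: you log this yourself).
    """
    counts = Counter(updated_elements)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    top = sum(count for _, count in counts.most_common(top_n))
    return top / total
```

A value close to 1.0 for a small `top_n` means a few hot elements take most of the updates. A value close to `top_n` divided by the number of distinct elements means updates are spread fairly evenly.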

  
## 02-diagnosis.md

      
Ok, now for some diagnosis.

If it's a set/map that's large, the problem is almost certainly ordsets/orddict. We want to look at this more. Unfortunately, swapping in a different implementation isn't a simple search/replace. @seancribbs has ported HashSet/HashDict from Elixir to Erlang (https://github.com/seancribbs/hashtypes), which we may well use. We could swap to sets/dict, but their equality is broken. Converting to hashtypes is currently the lowest-hanging fruit for increasing performance (though that claim needs a benchmark behind it, not just words on a page from me).

This is almost certainly the problem in the example GET performance issue. CRDTs go through a nontrivial decode from riak_object -> riak_dt_* -> { protobuffs | http/json }. In the conversion from the dt datatype to protobuffs or json, we call riak_dt_*:value/1, which does a large orddict traversal. Hello O(n).
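
To make the O(n) point concrete, here's a toy Python model of looking up a key in a key-sorted list of pairs (which is how orddict lays out its data) next to a hash table lookup. It only illustrates the access pattern and is not the Erlang implementation. `orddict_find` and `dict_find` are names invented for this sketch, and counting a hash lookup as one step is a simplification.

```python
def orddict_find(pairs, key):
    """Linear lookup in a key-sorted list of (key, value) pairs.

    Returns (value_or_None, steps), where steps is the number of
    entries inspected before the search finished.
    """
    steps = 0
    for k, v in pairs:
        steps += 1
        if k == key:
            return v, steps
        if k > key:
            # Sorted order lets us give up early, but the walk is still O(n).
            break
    return None, steps


def dict_find(table, key):
    """Hash lookup, modelled as a single step regardless of table size."""
    return table.get(key), 1
```

Looking up the last key in a 1000-entry list inspects all 1000 entries, while the hash lookup stays at one step in this model. That gap is the reason hashtypes looks attractive for large sets and maps.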
As for general approaches to diagnosis:

- Issues with get? Instrument riak_dt_*:value/1
- Issues with put? Instrument riak_dt_*:update/3
- These don't turn up anything? Instrument riak_dt_*:to_binary/1 and riak_dt_*:from_binary/1

Which riak_dt_* module?

- Counters: riak_dt_pncounter
- Sets: riak_dt_orswot
- Maps: riak_dt_map

There's also riak_kv_crdt. This is the module that does stats and proxies calls through to the right riak_dt_* module using information from the cluster. It's probably not worth debugging. I realise it uses orddict in places, but these should have at most 3 or so entries (one for each type), so they can be ignored.