@macintux
Last active December 16, 2015 10:09
Draft of a blog post (or possibly a series of posts) diving into the meat of several key Riak configuration parameters. Internal links do not work, but I haven't worried about that since I'm not sure how closely GFM maps to Basho's blogging platform.
@coderoshi

Really nice work. I love this article. I only have a few comments.

Start with a description of what the document is. What is the user reading, why should they want to, what are they going to learn?

"but allows the operator and even the developer to tune read and write requests to better meet the business needs for any given set of data."
"even the developer" sounds kind of surprising (tune how?), and "better meet the business needs" isn't very descriptive. How about:
"but allows tuning read and write requests to trade higher availability for increased consistency, depending on the business needs of the data set. These choices can be made by both operators and developers."

"CAP Tuning" isn't really used anymore (I'm sure there are wrong docs somewhere that still use it, and they should be changed too). It came from a misunderstanding of Riak vis-a-vis Dynamo and was perpetuated. Strictly speaking, Riak trades availability for latency and durability; no setting can make Riak truly consistent.

Under PR and PW: "may drop significantly" is vague. More precisely, the odds of any given request failing due to unavailability increase.

Under DW: "value be written to the backend" should probably be clarified by comparison with W, which does not wait for the backend to reply. Better to be a little too pedantic than not enough in this case.

@macintux
Author

Deletion just got more complicated too; the reaper apparently will remove the Riak tombstones if and only if all primary nodes are available and all of them hold the same tombstone. MDC complicates this further.

@macintux
Author

To summarize what I think I know about notfound_ok vs basic_quorum:

If R=1, the number of notfound responses required to trigger a notfound to the client:

  • Default behavior: N
  • basic_quorum=true: quorum
  • notfound_ok=false: 1

@seancribbs

@macintux, your third point should be notfound_ok=true, since it counts a notfound toward the request quorum instead of against it.

@macintux
Author

I suppose to generalize it...

The number of notfound responses required to trigger a notfound to the client:

  • Both false: N - (R - 1)
  • basic_quorum=true: quorum - (R - 1)
  • notfound_ok=true (default): R

Will try to verify that in the code.

@macintux
Author

Reasonably happy with it at this point, seeking more intensive engineering review.

@dmitrizagidulin

I second @coderoshi's comment on the DW section: "value be written to the backend" should be clarified by comparison with W, which does not wait for the backend to reply. Better to be a little too pedantic than not enough in this case.

In the "Readin' and Writin' (R and W)" section, it should emphasize that "successfully read or write" means X nodes have to acknowledge the request, but not necessarily write it to the back end (except for at least 1, since DW is set to a minimum of 1). That is more confusing, but necessary to emphasize.

@macintux
Author

Thanks Dmitri. I've updated the DW section since then, but will look at W.

@dmitrizagidulin

In general, I want to say - this is an amazing writeup, and I learned a ton from it. I'm excited that this is going out on the blog (and into the docs).

One other tweak request: the notfound tuning / basic_quorum section is a bit unclear. (It took me much re-reading, and I'm still not quite sure I get it.) Would it be possible to add even a one-sentence explanation of basic_quorum? And in the table illustrating the various behaviors of notfound_ok and basic_quorum, can you explicitly expand the leftmost column to list the various combinations? Something like:
| notfound_ok=true and basic_quorum=false (Default/standard behavior) |
| notfound_ok=false and basic_quorum=true |
| notfound_ok=false and basic_quorum=false |

And maybe add an explanation of which use cases / situations would warrant the other two non-default (slower) settings?

@macintux
Author

Thanks. I'll work on that. Part of my problem is that I only vaguely grasp the impact of basic_quorum, so conveying its meaning is probably beyond my ken.

@macintux
Author

I blew it: the section on notfound_ok is badly wrong. I'm not sure how to generalize the results, but here's what I found for the least optimal scenario.

First: there's a key parameter called FailThreshold (the failure threshold), which is used to terminate requests early. If notfound_ok is set to false, that comes into play when dealing with missing keys.

The body values in this table indicate the value for the failure threshold based on true/false values for basic_quorum and the R value for a request:

| basic_quorum | R=1 | R=2 | R=3 |
|--------------|-----|-----|-----|
| true         | 2   | 2   | 1   |
| false        | 3   | 2   | 1   |
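
The table above can be reproduced with a small Python model. This is only a sketch: the formula is an assumption that happens to match the table for N=3 (a request fails once N - R + 1 vnodes have failed, and basic_quorum caps that at a simple majority). The function name `fail_threshold` is hypothetical and is not taken from the Riak source.

```python
def fail_threshold(n: int, r: int, basic_quorum: bool) -> int:
    """Number of failed replies after which the request stops waiting."""
    # Assumed formula, not stated in this thread: once N - R + 1 vnodes
    # have failed, R successes are no longer possible.
    threshold = n - r + 1
    if basic_quorum:
        # Assumption: basic_quorum caps the threshold at a simple majority.
        quorum = n // 2 + 1
        threshold = min(quorum, threshold)
    return threshold


for basic_quorum in (True, False):
    print(basic_quorum, [fail_threshold(3, r, basic_quorum) for r in (1, 2, 3)])
# True [2, 2, 1]
# False [3, 2, 1]
```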

As Sean indicated, the key R value that basic_quorum is designed to work with is R=1. It's irrelevant for any other R value.

To carry this analysis forward, let's assume we're asking 3 vnodes for a key, and the first 2 to respond do not have a value for that key, but the 3rd does.

This next table indicates the success vs. failure tallies accumulated by the FSM after each vnode responds; notfound_ok comes into play here, because setting that to true means that it's "ok" to get a notfound response.

| notfound_ok | vnode 1 | vnode 2 | vnode 3 |
|-------------|---------|---------|---------|
| true        | 1/0     | 2/0     | 3/0     |
| false       | 0/1     | 0/2     | 1/2     |
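
The running tallies can be sketched in Python as follows. This is an illustrative model of the scenario described here, not the FSM's actual code; `tallies` and the `NOTFOUND` marker are hypothetical names.

```python
NOTFOUND = None  # stand-in for a vnode reply that has no value for the key


def tallies(replies, notfound_ok):
    """Return the running (successes, failures) tally after each reply."""
    ok = fail = 0
    history = []
    for reply in replies:
        # A notfound counts as a success only when notfound_ok is true.
        if reply is not NOTFOUND or notfound_ok:
            ok += 1
        else:
            fail += 1
        history.append((ok, fail))
    return history


# First two vnodes have no value for the key; the third does.
replies = [NOTFOUND, NOTFOUND, "value"]
print(tallies(replies, notfound_ok=True))   # [(1, 0), (2, 0), (3, 0)]
print(tallies(replies, notfound_ok=False))  # [(0, 1), (0, 2), (1, 2)]
```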

(So if we look at notfound_ok=false, after the 2nd vnode has replied with its notfound, the current tally is 0 successes and 2 failures. It's not until we get to the 3rd vnode's response with a value for the requested key that we finally have a 1 in the success tally.)

To decide when the client gets a response, and whether that response is a success or a failure, we have to (I think) look at enough/1 in riak_kv_get_core.erl. Abbreviated a bit, its logic reads:

if
    NumOk >= R ->
        true;
    NumFail >= FailThreshold ->
        true;

In other words, as soon as the success tally is >= R or the failure tally is >= FailThreshold, we'll stop waiting for responses and provide a response to the client.
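
As a rough Python model of that stopping rule (a sketch, not a translation of the real function: the abbreviated Erlang omits its remaining clauses, so this assumes the answer is false in every other case):

```python
def enough(num_ok, num_fail, r, fail_threshold):
    """True once the request can stop waiting for more vnode replies."""
    # Assumption: any case not shown in the abbreviated snippet means
    # "keep waiting".
    return num_ok >= r or num_fail >= fail_threshold


# notfound_ok=false, R=1, basic_quorum=false (threshold 3 in the table above)
print(enough(0, 2, r=1, fail_threshold=3))  # False: keep waiting
print(enough(1, 2, r=1, fail_threshold=3))  # True: R successes reached
# Same tally with basic_quorum=true (threshold 2)
print(enough(0, 2, r=1, fail_threshold=2))  # True: failure threshold reached
```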

When we merge the data above into a table that indicates the number of vnodes that must respond before the client is informed of the results, we get interesting results.

If we want the value that's only present on the 3rd vnode, then, we have to wait for the response from all 3 vnodes: those are the cells with a value of 3 in the table below.

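| basic_quorum | notfound_ok | R=1   | R=2 | R=3   |
|--------------|-------------|-------|-----|-------|
| true         | true        | 1     | 2   | **3** |
| false        | true        | 1     | 2   | **3** |
| true         | false       | 2     | 2   | 1     |
| false        | false       | **3** | 2   | 1     |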
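
Putting the pieces together, the merged table can be reproduced with a short simulation. This is a sketch under stated assumptions: the threshold formula is the same guess that matches the FailureThreshold table for N=3, "the client gets the value" is modeled simply as "the value arrived before the request stopped waiting", and all function names are hypothetical.

```python
NOTFOUND = None  # stand-in for a vnode reply that has no value for the key


def fail_threshold(n, r, basic_quorum):
    # Assumed formula; it matches the threshold table for N=3.
    threshold = n - r + 1
    if basic_quorum:
        threshold = min(n // 2 + 1, threshold)
    return threshold


def replies_needed(replies, r, basic_quorum, notfound_ok):
    """Return (replies consumed, whether the value was seen in time)."""
    n = len(replies)
    threshold = fail_threshold(n, r, basic_quorum)
    ok = fail = 0
    saw_value = False
    for count, reply in enumerate(replies, start=1):
        if reply is not NOTFOUND:
            ok += 1
            saw_value = True
        elif notfound_ok:
            ok += 1
        else:
            fail += 1
        # Same stopping rule as the abbreviated enough/1 logic.
        if ok >= r or fail >= threshold:
            return count, saw_value
    return n, saw_value


replies = [NOTFOUND, NOTFOUND, "value"]
for bq, nf_ok in [(True, True), (False, True), (True, False), (False, False)]:
    row = [replies_needed(replies, r, bq, nf_ok) for r in (1, 2, 3)]
    print(bq, nf_ok, [count for count, _ in row])
# True True [1, 2, 3]
# False True [1, 2, 3]
# True False [2, 2, 1]
# False False [3, 2, 1]
```

In this model, the value is seen only in the three cases where all 3 vnodes are consulted: R=3 with notfound_ok=true (either basic_quorum setting), and R=1 with both settings false.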

In other words, in this scenario, there are only 3 (really, 2) combinations of configuration values which will return the desired value to the client. (To be clear, all combinations will fix it via read repair after the request is completed.)

If we set R=3 and notfound_ok=true, we'll always get the value.

Alternatively, we can set R=1, basic_quorum=false, and notfound_ok=false.

Configuring the request to fail quickly rather than scan all vnodes unnecessarily is trivial: set notfound_ok=true, which is our default.

@macintux
Author

Replaced all of the notfound tuning content. Definitely a rough draft.

@macintux
Author

Extended it a bit to talk about why R=1 is interesting for basic_quorum as an illustration for the performance rationale.
