Draft of a blog post (or possibly a series of posts) diving into the meat of several key Riak configuration parameters. Internal links do not work, but I haven't worried about that since I'm not sure how closely GFM maps to Basho's blogging platform.

coderoshi Apr 19, 2013

Really nice work. I love this article. I only have a few comments.

Start with a description of what the document is. What is the user reading, why should they want to, what are they going to learn?

"but allows the operator and even the developer to tune read and write requests to better meet the business needs for any given set of data."
"Even the developer" sounds kind of surprising (tune how?), and "better meet the business needs" isn't very descriptive. How about:
"but allows tuning read and write requests to trade higher availability for increased consistency, depending on the business needs of the data set. These choices can be made by both operators and developers."

"CAP Tuning" isn't really used anymore (I'm sure there are wrong docs somewhere that still use it, and they should be changed too). It came from a misunderstanding of Riak vis-a-vis Dynamo and was perpetuated. Strictly speaking, Riak trades availability for latency and durability; no setting can make Riak truly consistent.

Under PR and PW: "may drop significantly" is vague. Better to say that the odds of any given request failing due to unavailability increase.

Under DW: "value be written to the backend" should probably be clarified with a comparison: W does not wait for the backend to reply. Better to be a little too pedantic than not enough in this case.

macintux Apr 21, 2013

Deletion just got more complicated too: the reaper apparently will only remove the Riak tombstones if and only if all primary nodes are available and all of them hold the same tombstone. MDC complicates this further.

macintux Apr 21, 2013

To summarize what I think I know about notfound_ok vs basic_quorum:

If R=1, the number of notfound responses required to trigger a notfound to the client:

  • Default behavior: N
  • basic_quorum=true: quorum
  • notfound_ok=false: 1

seancribbs Apr 21, 2013

@macintux, your third point should be notfound_ok=true, since that setting counts a notfound toward the request quorum instead of against it.

macintux Apr 21, 2013

I suppose, to generalize it...

The number of notfound responses required to trigger a notfound to the client:

  • Both false: N - (R - 1)
  • basic_quorum=true: quorum - (R - 1)
  • notfound_ok=true (default): R

Will try to verify that in the code.

macintux Apr 21, 2013

Reasonably happy with it at this point, seeking more intensive engineering review.

dmitrizagidulin Apr 22, 2013

I second @coderoshi's comment about "Under DW: "value be written to the backend" should probably clarify by comparing W does not wait for the backend to reply. Better to be a little too pedantic than not enough in this case."

In the "Readin' and Writin' (R and W)" section, it should emphasize that "successfully read or write" means X nodes have to acknowledge the request, but not necessarily write it to the backend (except for at least 1, since DW is set to a minimum of 1). That is more confusing, but necessary to emphasize.

macintux Apr 22, 2013

Thanks Dmitri. I've updated the DW section since then, but will look at W.

dmitrizagidulin Apr 22, 2013

In general, I want to say - this is an amazing writeup, and I learned a ton from it. I'm excited that this is going out on the blog (and into the docs).

One other tweak request: The notfound tuning / basic_quorum section is a bit unclear. (It took me much re-reading, and I'm still not quite sure I get it.) Would it be possible to add even a one-sentence explanation of basic_quorum? And in the table illustrating the various behaviors of notfound_ok and basic_quorum, can you explicitly expand the leftmost column to list the various combinations? Something like:
| notfound_ok=true and basic_quorum=false (Default/standard behavior) |
| notfound_ok=false and basic_quorum=true |
| notfound_ok=false and basic_quorum=false |

And maybe add an explanation of which use cases / situations would warrant the other two non-default (slower) settings?

macintux Apr 22, 2013

Thanks. I'll work on that. Part of my problem is that I only vaguely grasp the impact of basic_quorum, so conveying its meaning is probably beyond my ken.

macintux Apr 23, 2013

I blew it: the section on notfound_ok is badly wrong. I'm not sure how to generalize the results, but here's what I found for the least optimal scenario.

First: there's a key parameter called FailThreshold which is used to terminate requests early. If notfound_ok is set to false, that comes into play when dealing with missing keys.

The body values in this table indicate the value for the failure threshold based on true/false values for basic_quorum and the R value for a request:

| basic_quorum | R=1 | R=2 | R=3 |
|--------------|-----|-----|-----|
| true         | 2   | 2   | 1   |
| false        | 3   | 2   | 1   |

As Sean indicated, the key R value that basic_quorum is designed to work with is R=1. It's irrelevant for any other R value.
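The threshold table can be reproduced with a small model. The sketch below is Python rather than Riak's Erlang, and the formula inside `fail_threshold` is an assumption chosen because it matches the table for N=3; it is not quoted from the Riak source. The function and parameter names are hypothetical.

```python
def fail_threshold(n: int, r: int, basic_quorum: bool) -> int:
    """Model of the failure threshold for a read (assumed formula).

    Without basic_quorum, a read is hopeless once too few vnodes remain
    to reach R successes, i.e. after n - r + 1 failures. With basic_quorum,
    the threshold is additionally capped at a simple majority of n.
    """
    hopeless_after = n - r + 1
    if basic_quorum:
        quorum = n // 2 + 1
        return min(quorum, hopeless_after)
    return hopeless_after


# Rebuild the table above for N=3.
for bq in (True, False):
    print(bq, [fail_threshold(3, r, bq) for r in (1, 2, 3)])
```

Under this model the two rows differ only in the R=1 column, which is consistent with the remark that basic_quorum only matters for R=1 (at least when N=3).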

To carry this analysis forward, let's assume we're asking 3 vnodes for a key, and the first 2 to respond do not have a value for that key, but the 3rd does.

This next table indicates the success vs. failure tallies accumulated by the FSM after each vnode responds; notfound_ok comes into play here, because setting that to true means that it's "ok" to get a notfound response.

| notfound_ok | vnode 1 | vnode 2 | vnode 3 |
|-------------|---------|---------|---------|
| true        | 1/0     | 2/0     | 3/0     |
| false       | 0/1     | 0/2     | 1/2     |

(So if we look at notfound_ok=false, after the 2nd vnode has replied with its notfound, the current tally is 0 successes and 2 failures. It's not until we get to the 3rd vnode's response with a value for the requested key that we finally have a 1 in the success tally.)
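The tally bookkeeping can be sketched in a few lines. This is a simplified Python model under the assumption that each reply is either a value or a notfound; the `running_tallies` name and the string reply labels are hypothetical, not taken from Riak.

```python
def running_tallies(replies, notfound_ok):
    """Return the (successes, failures) tally after each vnode reply.

    A reply carrying a value always counts as a success. A notfound
    counts as a success only when notfound_ok is true.
    """
    num_ok = 0
    num_fail = 0
    history = []
    for reply in replies:
        if reply == "value" or notfound_ok:
            num_ok += 1
        else:
            num_fail += 1
        history.append((num_ok, num_fail))
    return history


# Scenario from the text: the first two vnodes have no value, the third does.
scenario = ["notfound", "notfound", "value"]
print(running_tallies(scenario, notfound_ok=True))
print(running_tallies(scenario, notfound_ok=False))
```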

To determine when the client gets a response, and whether that response is a success or a failure, we have to (I think) look at enough/1 in riak_kv_get_core.erl. Abbreviated a bit, its logic reads:

```erlang
if
    NumOk >= R ->
        true;
    NumFail >= FailThreshold ->
        true;
    %% assumed fallthrough (not shown in the excerpt): keep waiting
    true ->
        false
end
```

In other words, as soon as the success tally reaches R or the failure tally reaches FailThreshold, we stop waiting for responses and reply to the client.

When we merge the data above into a table that indicates the number of vnodes that must respond before the client is informed of the results, we get interesting results.

If we want the value that's only present on the 3rd vnode, then, we have to wait for the response from all 3 vnodes, as highlighted below.

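| basic_quorum | notfound_ok | R=1   | R=2 | R=3   |
|--------------|-------------|-------|-----|-------|
| true         | true        | 1     | 2   | **3** |
| false        | true        | 1     | 2   | **3** |
| true         | false       | 2     | 2   | 1     |
| false        | false       | **3** | 2   | 1     |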

In other words, in this scenario, there are only 3 (really, 2) combinations of configuration values which will return the desired value to the client. (To be clear, all combinations will fix it via read repair after the request is completed.)

If we set R=3 and notfound_ok=true, we'll always get the value.

Alternatively, we can set R=1, basic_quorum=false, and notfound_ok=false.

Configuring the request to fail quickly rather than scan all vnodes unnecessarily is trivial: set notfound_ok=true, which is our default.
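Putting the pieces together, the merged table can be checked with a short simulation. This is a Python model, not Riak code: the threshold formula is an assumption that matches the threshold table above for N=3, the stopping rule mirrors the abbreviated enough/1 logic, and names such as `responses_before_reply` are hypothetical.

```python
# Scenario from the text: two notfound replies arrive before the one value.
REPLIES = ["notfound", "notfound", "value"]


def responses_before_reply(r, basic_quorum, notfound_ok, replies=REPLIES):
    """Count vnode replies consumed before the client gets an answer."""
    n = len(replies)
    # Assumed threshold formula; it reproduces the threshold table for N=3.
    threshold = n - r + 1
    if basic_quorum:
        threshold = min(n // 2 + 1, threshold)

    num_ok = 0
    num_fail = 0
    for count, reply in enumerate(replies, start=1):
        if reply == "value" or notfound_ok:
            num_ok += 1
        else:
            num_fail += 1
        # Stop as soon as either tally reaches its limit.
        if num_ok >= r or num_fail >= threshold:
            return count
    return n


for bq, nf_ok in [(True, True), (False, True), (True, False), (False, False)]:
    row = [responses_before_reply(r, bq, nf_ok) for r in (1, 2, 3)]
    print(f"basic_quorum={bq} notfound_ok={nf_ok} -> {row}")
```

Only the cells that come out as 3 correspond to requests that wait long enough to see the third vnode's value.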

macintux Apr 23, 2013

Replaced all of the notfound tuning content. Definitely a rough draft.

macintux Apr 23, 2013

Extended it a bit to talk about why R=1 is interesting for basic_quorum as an illustration for the performance rationale.
