Created July 8, 2010 15:48
jdbl # Anyone around to chat about some theoretical failure scenarios / corner cases? Trying to wrap my head around what Riak will do in certain circumstances.
seancribbs # jdbl: ask away
jdbl # So, first to check an assumption. Even with hinted handoffs, it is possible for a write with W>1 to fail to succeed, right? In certain extreme circumstances, that is.
benblack # sure
seancribbs # jdbl: So… yes.
benblack # fail to succeed means fail, right? ;)
jdbl # Sure. Reported as failed by Riak, I guess.
seancribbs # A write can "fail" to the client, but still actually succeed in the cluster.
jdbl # Alright, seancribbs, that's what I'm getting at.
benblack # a write can partially succeed, and then be cleaned up later with repair or HH
seancribbs # e.g. if it times out
jdbl # Will a read repair later on squash that failed write or propagate it despite it being reported as failed?
benblack # it will not be removed; there is nothing to indicate it should be
seancribbs # jdbl: vector clocks prevent the squash
benblack # if you allow multi, both versions can remain in the system
jdbl # Could that "failed write" end up overwriting data on other nodes that never received that write but have data that is a direct ancestor of the failed write? In the vector clock ancestor sense.
benblack # it would be a neat trick to never see a write but create a successor to it
seancribbs # jdbl: the effect is the same
benblack # what you are describing is a branch, not direct lineage
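The descendant-vs-branch distinction can be sketched with a toy vector clock comparison. This is illustrative Python (clocks as actor-to-counter dicts), not Riak's actual Erlang representation:

```python
def descends(a, b):
    """True if clock a is equal to, or a direct descendant of, clock b."""
    return all(a.get(actor, 0) >= count for actor, count in b.items())

def relation(a, b):
    """Classify two vector clocks: equal, direct lineage, or a branch (siblings)."""
    if descends(a, b) and descends(b, a):
        return "equal"
    if descends(a, b):
        return "a descends b"
    if descends(b, a):
        return "b descends a"
    return "siblings (branch)"

a1 = {"node1": 1}
a2 = {"node1": 2}                     # a write that built on a1
a2_other = {"node1": 1, "node2": 1}   # a concurrent write that never saw a2

print(relation(a2, a1))        # "a descends b": direct lineage, a1 can be replaced
print(relation(a2, a2_other))  # "siblings (branch)": neither dominates, both kept
```

A write that never saw another write cannot descend it, which is why the "overwrite without seeing it" scenario produces a branch rather than direct lineage.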
jdbl # Alright, I guess I'm just getting confused about vclocks in Riak.
seancribbs # jdbl: most times you don't need to worry about them
jdbl # So, if I have 3 replicas on 3 nodes, those replicas will have different vclocks for the same set of data?
seancribbs # jdbl: no
benblack # jdbl: not Riak specific, if i understand your question properly. just how vclocks work
jdbl # Okay, just to be clear. I'm wrong in the following example then: I have data A1 with 3 replicas. I write data A2 with W=3, but something occurs such that only one replica writes A2. The write is reported as a failure to the client. If the replicas share the same vclock, isn't this A2 a direct descendant of A1 across all the replicas, and will it "win" on the next read repair? Even with multi values?
benblack # jdbl: the vclock is not changed by replication. the version of a replica is defined by its vclock. all replicas of a given version of a given doc will have the same vclock.
benblack # jdbl: sure, the other replicas will be repaired and have the version reported to the client as failed.
benblack # and the client(s) will see that on the read, since the vclock is given.
jdbl # Would only A2 be shown in this case, since it's a direct descendant?
frank06 # seancribbs: works now - i was doing something silly. thanks for your help!
benblack # they would all be repaired to A2, yes.
seancribbs # also, the get_fsm does some vclock resolution on read, even before repair
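How repair converges on the "failed" write can be sketched the same way: a replica's value survives only if no other replica holds a clock that strictly dominates it. A minimal illustration, not Riak's actual get_fsm logic:

```python
def descends(a, b):
    """True if clock a is equal to, or a descendant of, clock b."""
    return all(a.get(k, 0) >= n for k, n in b.items())

def read_repair(replicas):
    """Keep only values whose clocks are not dominated by another replica's clock."""
    return [
        (clock, value) for clock, value in replicas
        if not any(other != clock and descends(other, clock)
                   for other, _ in replicas)
    ]

# Two replicas still hold A1; one holds the write reported as failed, A2.
replicas = [({"n1": 1}, "A1"), ({"n1": 1}, "A1"), ({"n1": 2}, "A2")]
print(read_repair(replicas))  # [({'n1': 2}, 'A2')] -- every replica repairs to A2
```

A1's clock is a direct ancestor of A2's, so nothing marks A1 as worth keeping; only a true branch would survive as a sibling.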
jdbl # Alright. So, despite being reported as failed to the client (since the requested W quorum wasn't met), the value nevertheless eventually propagates around the cluster and overwrites the previous value. So, W-vals / related failure has more to do with guaranteed durability than consistency?
seancribbs # jdbl: yes
benblack # jdbl: you are only looking at half the equation
seancribbs # however, you have to have a really serious failure to reach that condition
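The failure mode under discussion can be sketched with a toy quorum write: the coordinator sends to all replicas, counts acks, and reports failure when acks fall short of W, even though the value has already landed on the live replicas. Hypothetical helper, not Riak's API:

```python
def quorum_write(replicas, key, value, w):
    """Send the write to every replica; report success only if >= w acks arrive."""
    acks = 0
    for up, store in replicas:
        if up:                  # a down or partitioned replica never acks
            store[key] = value  # but live replicas keep the value regardless
            acks += 1
    return acks >= w

# Three replicas, two unreachable: W=3 cannot be met.
replicas = [(True, {}), (False, {}), (False, {})]
print(quorum_write(replicas, "k", "A2", w=3))  # False: reported failed to the client
print(replicas[0][1])                          # {'k': 'A2'}: yet the write is durable
```

This is the durability-not-consistency point: the failed status only means "fewer than W acks", not "the value is gone".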
jdbl # Sure, it's an extreme case, esp. with hinted handoffs. Was actually thinking about it as a follow-up to a previous discussion here / feature request about "strict quorum" writes that didn't employ hinted handoffs. In such a case, this scenario could be more likely. Leading me to wonder if that's even a valid feature request given Riak's design.
seancribbs # jdbl: that would increase failures, yes
seancribbs # and also, if quorum wasn't met, how do you reliably tell the other nodes NOT to write? seems too error-prone
benblack # it's exactly the same problem as getting the writes consistent
seancribbs # what ben said
jdbl # Sure, makes sense given Riak's design. The only solution that might work would be to have some setting that only merges on read repair (never invalidates direct ancestors) and lets the application sort things out. E.g. if my app always writes with W=majority, I can likely handle the in-app merging with some sort of majority approach. In any case, I'm not suggesting such a feature. Was just wanting to check my thoughts on the matter.
benblack # jdbl: we could call it allow multi with client-side resolution
jdbl # If that's something you see value in supporting, go for it. Might allow for some clever higher-level functionality built on top of Riak that supports different consistency models for those who care, while keeping Riak "pure".
seancribbs # jdbl: benblack was being silly - we already expose conflicts to the client, when allow_mult is set to true
benblack # i am a silly billy.
jdbl # Ah, I was thinking he was referring to a stronger multi-version option that doesn't invalidate direct vclock ancestors and leaves vclock resolution to the client, simply responding with a merged set of all existing values.
seancribbs # jdbl: but you won't overwrite a direct ancestor unless you actually supply the ancestor vclock. technically that's not a conflict
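With allow_mult set, sibling resolution falls to the client: read all siblings, merge them by whatever rule the application defines, then write the merged value back with the vclock from the read so it supersedes the siblings. A minimal sketch with a hypothetical helper, not a real Riak client API:

```python
def resolve_siblings(siblings, merge):
    """Client-side resolution: collapse sibling values into one merged value.
    The caller then writes it back with the vclock returned by the read, so
    the merged value supersedes every sibling it was built from."""
    if len(siblings) == 1:
        return siblings[0]
    return merge(siblings)

# e.g. a set-union merge, shopping-cart style
merged = resolve_siblings([{"a"}, {"a", "b"}],
                          merge=lambda values: set().union(*values))
print(merged)  # {'a', 'b'}
```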
benblack # one would assume you initiated the write in the first place because you wanted it to succeed.
riak_core_util:start_app_deps in core and kv
jdbl # Well, sure. Unless you have different expectations with W=1 and W=3. You might want a write to only succeed if all W-vals are met. We already discussed how riak can report failure, but still allow that write to propagate. For some apps, that might be a logic error, and the app would rather "clean up" that failed write on the next read. It can't roll back to the previous ancestor because the previous ancestor is already gone.
benblack # jdbl: that would require a global synchronization mechanism, like paxos.
seancribbs # jdbl: if that's the use-case then you shouldn't use a BASE system like Riak, you should use an ACID system. again, what ben said
seancribbs # we usually advocate the "write now, resolve later" attitude
benblack # there are times you want that behavior. if you require it, you should definitely use a system that supports it.
seancribbs # (for riak)
benblack # like all dynamo systems, riak is eventually consistent. transactions just aren't in the cards.
seancribbs # think of write quorums more like soft guarantees that the data was written
jdbl # Yeah, I understand that. Roughly 90% or more of my operations work fine with BASE (hence the focus on Riak). I was just contemplating mapping the other 10% to riak with slightly more consistency guarantees and client logic to handle it. Makes more sense from an ops point than two separate datastores. I don't need full ACID anyway, just for a write failure to not propagate. I believe Cassandra with strict quorums seems to have this behavior.
benblack # it doesn't. no dynamo systems do. transactions are not a good match for them.
seancribbs # jdbl: an RDBMS or kv store with atomic operations (like Redis) would fit those use-cases. expecting something to fit 100% of your needs is part of the problem
benblack # jdbl: what you are describing _requires_ global synchronization, regardless of whether you want "full ACID". the entire design is structured to avoid global synchronization
seancribbs # there are a number of CP systems out there - Scalaris for example
jdbl # Any thoughts on how Amazon maps their newish SimpleDB optimistic concurrency logic to an underlying Dynamo store? If I submit an OCC request to 3 nodes, 1 node succeeds, the others fail the precondition, wouldn't future read repair have trouble merging them? Then again, I imagine SimpleDB is more than just Dynamo + fancy API.
Kenstigator joined #riak
benblack # amazon has not released enough information on the internals of that to answer. so, yes, one could add a global synchronization option to the system, but it'd be pretty invasive, i expect.
jdbl # Yeah, I understand. Was thinking of a crude approach that used a single master node (allowing a single point of failure) to order writes/reads, and always used W=N/2+1, R=N/2+1. If a write succeeds, great. If it fails, then a majority of the nodes would have the old value. But read repair on a "failed write" broke the logic. In any case, I'll stop trying to shoehorn out-of-scope use cases into riak and look into a secondary datastore.
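The arithmetic behind that majority scheme: with R = W = N/2 + 1, any read quorum and any write quorum must share at least one replica (R + W > N), so a reader always contacts at least one node holding the latest majority-committed write. The scheme breaks exactly because the minority "failed write" can still win read repair. A sketch of the overlap condition:

```python
def quorums_overlap(n, r, w):
    """Any r-node read set and w-node write set share a replica iff r + w > n."""
    return r + w > n

n = 3
majority = n // 2 + 1                          # R = W = 2 for N = 3
print(quorums_overlap(n, majority, majority))  # True: overlap guaranteed
print(quorums_overlap(n, 1, 1))                # False: R=W=1 gives no such guarantee
```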
benblack # one interesting approach might be to have global sync as a per-bucket option so you never have to deal with mixed buckets. all ops on the sync bucket are sync
seancribbs # benblack: sounds like it would require 2PC
benblack # yes, or related mechanisms. paxos, VS, whatever. but by doing it per bucket, and having the bucket behavior be the same for all ops, you simplify it such that it is tractable
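Classic two-phase commit, as referenced here, can be sketched in a few lines: a prepare round collects votes, and the value commits only if every participant votes yes; otherwise all abort. An illustrative toy with no timeouts or coordinator-failure handling, which are the hard parts of real 2PC:

```python
class Node:
    """A toy 2PC participant that can stage and then commit a value."""
    def __init__(self, healthy=True):
        self.healthy, self.value, self.staged = healthy, None, None
    def prepare(self, value):
        if self.healthy:
            self.staged = value
        return self.healthy      # vote yes only if able to commit
    def commit(self, value):
        self.value, self.staged = value, None
    def abort(self):
        self.staged = None

def two_phase_commit(participants, value):
    """Phase 1: every participant votes. Phase 2: all commit, or all abort."""
    votes = [p.prepare(value) for p in participants]
    if all(votes):
        for p in participants:
            p.commit(value)
        return True
    for p in participants:
        p.abort()
    return False

nodes = [Node(), Node(), Node(healthy=False)]
print(two_phase_commit(nodes, "A2"))   # False: one veto aborts everyone
print([n.value for n in nodes])        # [None, None, None]: nothing committed
```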
benblack # trying to have some transactional stuff on a bucket while other stuff is not transactional would get hairy
seancribbs # true
jdbl # BTW, riak is a great product (esp. with bitcask), and I appreciate all the work. My frustration with certain properties of other k/v stores was the motivation behind trying to map my sideline use cases to riak.