Created
September 17, 2010 16:57
| 07:50 <outerim> if allow_mult is not set on a bucket what happens with 2 conflicting writes | |
| (assuming the create case) ie client A writes first with rw=quorum client B writes next with rw=quorum? | |
| 07:50 <outerim> I presume that A wins because B's write would not have a descending | |
| vector clock | |
| 07:51 <outerim> and B would get some error response back. what code? | |
| 07:51 <seancribbs> outerim: depends on what vector clock is submitted | |
| 07:51 <seancribbs> and no error responses | |
| 07:51 <outerim> seancribbs: what if no vector clock is submitted, doesn't riak | |
| generate an initial one based on the client id? | |
| 07:52 <seancribbs> basically.. it assumes the empty vector-clock | |
| 07:52 <outerim> seancribbs: so you'd just have to do a readbody and test if it was what | |
| you wrote? | |
| 07:52 <seancribbs> yes | |
| 07:52 <seancribbs> returnbody | |
| 07:54 <outerim> seancribbs: so correct me if I'm wrong but with quorum r/rw values you are | |
| guaranteed that the first write will win (in the create case) and that the second will | |
| receive the first's body with returnbody? | |
| 07:54 <outerim> seancribbs: (thanks for your help btw) | |
| 07:55 <seancribbs> outerim: mmm not sure exactly how to answer that | |
| 07:55 <outerim> seancribbs: also worth mentioning that I'm using riak-client from ripple, | |
| haven't looked to see if it creates an initial vclock that riak understands... | |
| 07:55 <seancribbs> depends on the ordering of events | |
| 07:55 <seancribbs> it won't send the vclock the first time because there is none | |
| 07:55 <outerim> seancribbs: do tell if you don't mind : | |
| 07:55 <seancribbs> i.e. the empty vclock | |
| 07:55 <outerim> right | |
| 07:57 <seancribbs> outerim: probably best shown in an example. I'll gist some IRB output | |
| in a minute | |
| 07:57 <outerim> seancribbs: that would be excellent | |
| 08:00 <seancribbs> outerim: http://gist.github.com/580770 | |
| 08:00 <outerim> looking | |
| 08:00 <seancribbs> so… the answer is, when allow_mult is false, it takes the one with the | |
| latest timestamp | |
| 08:01 <outerim> seancribbs: is that only true in the case where both clients are using the same client id? | |
| 08:02 <seancribbs> no, those are using different client ids | |
| 08:02 <outerim> oh yeah I see 2 clients... hmm | |
| 08:03 <outerim> seancribbs: so what is the point of last_write_wins on a bucket? | |
| 08:03 <outerim> and what does it default to? | |
| 08:03 <seancribbs> ignoring vector clocks entirely has significant speed implications | |
| 08:03 <seancribbs> it means you don't ever compare or update them | |
| 08:04 <outerim> so isn't it ignoring vector clocks when it says (as in your example) the last | |
| write wins? | |
| 08:04 <seancribbs> so…if i can explain this correctly, in the gist this worked because both versions | |
| descended from the same vclock (the empty one) | |
| 08:04 <outerim> or since there are no vclocks at that point.. | |
| 08:05 <seancribbs> so if you updated obj2 a couple of times, then updated obj1 with a stale vclock, | |
| it should not take | |
| 08:05 <outerim> but the second didn't descend from the first one which should have been stored | |
| at the time the second write occurred right? | |
| 08:05 <seancribbs> but they had the same parent | |
| 08:07 <outerim> hmm, so it allows a write to be lost if 2 conflicting writes have the same parent? | |
| that doesn't seem right to me :\ | |
| 08:07 <seancribbs> soo…. as i was telling people last night, if you're trying to prevent concurrent | |
| creations from clobbering each other, don't expect Riak to stop you, unless you have allow_mult turned on | |
| 08:11 <outerim> anyway so if that's not going to work, what if allow_mult is true? presumably | |
| client 2's write would have gotten a multiple choices back. in my application I need to be | |
| able to guarantee that if one client claimed a key by creating it that cannot be clobbered | |
| 08:11 <outerim> how would you do that? | |
| 08:11 <seancribbs> (updated my gist) | |
| 08:11 <outerim> can I trust the timestamps (hopefully my machines clocks are in sync) :\ | |
| 08:12 <seancribbs> there's no 100% way to guarantee your clients won't clobber each other | |
| 08:12 <seancribbs> allow_mult just exposes it to you | |
| 08:13 <seancribbs> and the timestamps are internal to riak, so if your riak machines are reasonably | |
| synced via NTP, it should be mostly ok | |
| 08:13 <seancribbs> but we usually suggest other means for resolving conflicts | |
| 08:15 <outerim> seancribbs: in this case, think of someone claiming something akin to a subdomain in a | |
| system for themselves. first write succeeds and the client sees no conflict; second write generates a conflict. | |
| we want the second one to lose because the first one was already told he'd claimed it. is there | |
| a better way than timestamps for this case? | |
| 08:15 <seancribbs> well, if this is in the context of web requests, check for its existence first | |
| (maybe via ajax) | |
| 08:16 <seancribbs> and also check for its existence when they submit the form | |
| 08:16 <outerim> seancribbs: API actually, and yeah I will of course check for existence before | |
| attempting to write | |
| 08:17 <seancribbs> ok | |
| 08:17 <seancribbs> you can also use If-None-Match: * header for a little extra checking | |
| 08:17 <seancribbs> i should add that to the client | |
| 08:18 <seancribbs> something like obj.store_conditionally = true | |
| 08:18 <seancribbs> or obj.prevent_stale_writes = true | |
| 08:19 <outerim> seancribbs: ooh, that could be good... how would riak handle that internally if 2 writes | |
| literally came in on different nodes at ~ the exact same time | |
| 08:19 <seancribbs> outerim: you would be hard pressed to deal with that without allow_mult | |
| 08:20 <seancribbs> the scope of the critical section on the Riak side is very small, but there is a | |
| read before the write | |
| 08:21 <outerim> seancribbs: I am having a hard time even visualizing the madness on the | |
| riak side, so request comes in and it's committed to at least N nodes where N is my rw value | |
| 08:21 <seancribbs> right, so this is why we have vector clocks | |
| 08:21 <seancribbs> to remove the madness | |
| 08:22 <seancribbs> or at least, to make it deterministic | |
| 08:22 <outerim> any one of those could either fail because of If-None-Match or generate a | |
| conflict right? | |
| 08:22 <outerim> if allow_mult is true | |
| 08:22 <seancribbs> if-none-match is checked after the read, in the web endpoint | |
| 08:23 <outerim> seancribbs: by web endpoint you mean the host the client is connected | |
| to for its session | |
| 08:23 <seancribbs> yes, i mean the webmachine resource sitting at /riak | |
| 08:24 <outerim> so even with if-none-match * you're only guaranteed that there wasn't | |
| anything there when your write entered the system to be committed. 2 could have come in, | |
| both passed the if-none-match check and been committed (either generating a conflict with allow_mult, | |
| or one of them winning based on timestamp) | |
| 08:24 <outerim> do I understand that correctly | |
| 08:24 <seancribbs> yep | |
| 08:25 <seancribbs> ordering of those events is unpredictable | |
| 08:25 <seancribbs> even on a per-replica basis | |
| 08:25 <outerim> seancribbs: so is it possible that neither could get back a conflict | |
| message at the time of their write (assuming rw/r=quorum)? | |
| 08:26 <seancribbs> yes | |
| 08:26 <outerim> seancribbs: and if the ordering per replica is unpredictable is it | |
| possible for different replicas to have different values | |
| 08:26 <seancribbs> although, i doubt it's that likely | |
| 08:26 <seancribbs> outerim: yes, absolutely. | |
| 08:26 <outerim> seancribbs: oh good lord... that scares me ;) | |
| 08:27 <seancribbs> the thing is, the window for these conflicts will be really small | |
| 08:27 <seancribbs> much smaller than a single request through your API | |
| 08:27 <outerim> seancribbs: yeah, totally understood, but it still exists :) | |
| 08:28 <outerim> so those things are fixed up in post by read repairs? | |
| 08:28 <seancribbs> outerim: right, so I would say use allow_mult for the "just in case" scenarios | |
| 08:28 <seancribbs> yes | |
| 08:28 <outerim> when idle is riak checking those sorts of things proactively by chance? | |
| 08:28 <seancribbs> no | |
| 08:29 <outerim> read repair was one thing from the docs I wasn't entirely clear on, | |
| with an r value of 'all' and inconsistent data between 2 replicas would my read fail or | |
| would the data be repaired | |
| 08:29 <outerim> ? | |
| 08:30 <seancribbs> read repair happens regardless of whether your request quorum was | |
| met, and after a reply has been returned | |
| 08:30 <seancribbs> outerim: http://github.com/seancribbs/ripple/issues#issue/61 | |
| 08:30 <seancribbs> so you might see one request fail, and the following one succeed | |
| 08:34 <outerim> seancribbs: so the request quorum is the # of hosts that have to agree on a value | |
| for the read to succeed right? assuming the bucket's n val is 3 and one host has stale data if my | |
| read has r=2 I may see a failed read (if it picks a good and stale host) or succeed (picked 2 good | |
| hosts) only in the first case will the stale host be repaired and subsequent reads should not fail? | |
| 08:35 <outerim> seancribbs: would a pull request be appreciated for issue 61 if I were to whip something up? | |
| 08:36 <seancribbs> outerim: yes, absolutely | |
| 08:39 <outerim> seancribbs: and not 2 replicas but 2 partition owners right? (just clarifying) | |
| 08:39 <outerim> seancribbs: you said "so you might see one request fail, and the following one | |
| succeed" under what conditions might a request fail? only if the data hasn't been replicated | |
| to enough partitions or a host is down? | |
| 08:40 <outerim> so if I write with an rw=1 and try to read with an r=3 it | |
| might fail? other cases? | |
| 08:40 <seancribbs> 2 replicas == 2 partition owners | |
| 08:41 <seancribbs> the request might fail if not all of the replicas are | |
| populated | |
| 08:41 <seancribbs> i.e. if enough of them come back with "not found" so that the | |
| quorum can't be satisfied | |
| 08:41 <seancribbs> but if the last one comes back with a value (which happens frequently | |
| because it's slower to send the value than send "not found") | |
| 08:41 <seancribbs> it will be read-repaired back to the other nodes | |
| 08:41 <seancribbs> err partitions | |
| 08:42 <seancribbs> then on the next request, you'll get a success | |
| 08:42 <outerim> seancribbs: and does riak always try the reads from all replicas or just whatever | |
| your r value is. | |
| 08:42 <outerim> seancribbs: that makes sense and was how I understood it I think.. | |
| 08:43 <seancribbs> always requests all replicas, replies to client as soon as R is met or is known | |
| not to be satisfiable | |
| 08:43 <outerim> by all replicas I mean all the replicas that own a specific key | |
| 08:43 <seancribbs> outerim: yes, we're on the same page | |
| 08:44 <outerim> seancribbs: cool, appreciate your help very much it's clarified a lot for me. | |
| 08:44 <seancribbs> no problem. if you know some Erlang and want more definitive | |
| answers to how this works, read the riak_kv_get_fsm module | |
| 08:45 <seancribbs> it's illuminating | |
| 08:45 <seancribbs> also the get_fsm_qc QuickCheck tests | |
| 08:46 <outerim> seancribbs: I've attempted a few times to learn erlang. perhaps I'll have a look |
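
The create-case behaviour discussed above (two writes that both start from the empty vclock) can be sketched with a toy model. This is not Riak's implementation: `ToyBucket`, `descends`, `increment`, and the dict-based vclock are hypothetical names made up for illustration, timestamps are passed in by hand, and the toy keeps the winner's own vclock rather than merging clocks. It only shows why neither write descends from the other, and how `allow_mult` changes what is kept.

```python
from dataclasses import dataclass


def descends(a, b):
    """True if toy vclock `a` has seen every event recorded in `b`."""
    return all(a.get(actor, 0) >= count for actor, count in b.items())


def increment(vclock, client_id):
    """Return a copy of `vclock` with `client_id`'s counter bumped by one."""
    new = dict(vclock)
    new[client_id] = new.get(client_id, 0) + 1
    return new


@dataclass
class Version:
    value: str
    vclock: dict
    timestamp: float


class ToyBucket:
    """Hypothetical single-replica model of one bucket (illustration only)."""

    def __init__(self, allow_mult):
        self.allow_mult = allow_mult
        self.versions = {}  # key -> list of Version

    def put(self, key, value, client_id, timestamp, vclock=None):
        # A write without a vclock is treated as starting from the empty one.
        incoming = Version(value, increment(vclock or {}, client_id), timestamp)
        stored = self.versions.get(key, [])
        # Stored versions that the incoming write descends from are superseded.
        survivors = [v for v in stored if not descends(incoming.vclock, v.vclock)]
        # If a stored version already descends from the incoming one, it is stale.
        if any(descends(v.vclock, incoming.vclock) for v in survivors):
            return self.get(key)
        candidates = survivors + [incoming]
        if not self.allow_mult and len(candidates) > 1:
            # Concurrent versions and no siblings allowed: latest timestamp wins.
            candidates = [max(candidates, key=lambda v: v.timestamp)]
        self.versions[key] = candidates
        return self.get(key)

    def get(self, key):
        return [v.value for v in self.versions.get(key, [])]


if __name__ == "__main__":
    for allow_mult in (False, True):
        bucket = ToyBucket(allow_mult=allow_mult)
        bucket.put("claim", "from A", "A", timestamp=1.0)
        print(allow_mult, bucket.put("claim", "from B", "B", timestamp=2.0))
```

With `allow_mult=False` the second create silently replaces the first because both are concurrent children of the empty vclock and B has the later timestamp. With `allow_mult=True` both values are kept as siblings, which is the "allow_mult just exposes it to you" point from the log.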