@PharkMillups
Created July 8, 2010 15:48
jdbl # Anyone around to chat about some theoretical failure
scenarios / corner cases? Trying to wrap my head around what
Riak will do in certain circumstances.
seancribbs # jdbl: ask away
jdbl # So, first to check an assumption. Even with hinted handoffs,
it is possible for a write with W>1 to fail to succeed, right?
In certain extreme circumstances, that is.
benblack # sure
seancribbs # jdbl: So… yes.
benblack # fail to succeed means fail, right? ;)
jdbl # Sure. Reported as failed by Riak, I guess.
seancribbs # A write can "fail" to the client, but still actually
succeed in the cluster.
jdbl # Alright, seancribbs, that's what I'm getting at.
benblack # a write can partially succeed, and then be cleaned up
later with repair or HH
seancribbs # e.g. if it times out
jdbl # Will a read repair later on squash that failed write or
propagate it despite it being reported as failed?
benblack # it will not be removed
there is nothing to indicate it should be
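A minimal sketch of the scenario being described, with made-up names and a hand-wavy repair step standing in for Riak's read repair / hinted handoff; it is not the Riak API. A W=3 write that only reaches one replica is reported to the client as a failure, but nothing removes the value, so a later repair pass spreads it anyway.

    # Illustrative only: a "failed" quorum write is not rolled back.
    N, W = 3, 3
    replicas = [{"val": "A1"} for _ in range(N)]

    def put(value, w):
        acks = 0
        for i, rep in enumerate(replicas):
            if i == 0:              # pretend only replica 0 is reachable
                rep["val"] = value
                acks += 1
        return acks >= w            # 1 >= 3 is False: client sees a failed write

    def read_repair():
        # crude stand-in for vclock comparison during repair / hinted handoff:
        # the newest value wins and is copied to every replica
        newest = max(rep["val"] for rep in replicas)
        for rep in replicas:
            rep["val"] = newest

    ok = put("A2", W)
    print("client saw success?", ok)          # False
    read_repair()
    print([rep["val"] for rep in replicas])   # ['A2', 'A2', 'A2'] -- the failed write propagated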
seancribbs # jdbl: vector clocks prevent the squash
benblack # if you allow multi, both versions can remain in the system
jdbl # Could that "failed write" end up over writing data
win other nodes that never received that werite but have data that
is a direct ancestor of the failed write? In vector clock ancestor sense.
benblack # it would be a neat trick to never see a write
but create a successor to it
seancribbs # jdbl: the effect is the same
benblack # what you are describing is a branch, not direct lineage
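A toy illustration of the distinction benblack is drawing, using a made-up descends() helper rather than riak_core's actual vclock module: a write built on a clock that has seen A1 is a direct successor, while a write issued without ever reading A1 is a concurrent branch.

    def descends(a, b):
        """True if clock a has seen everything clock b has (a is b or a successor of b)."""
        return all(a.get(actor, 0) >= n for actor, n in b.items())

    a1 = {"client_x": 1}
    a2 = {"client_x": 2}        # written with a1's clock in hand -> direct successor
    branch = {"client_y": 1}    # written without ever reading a1 -> concurrent sibling

    print(descends(a2, a1))                            # True: A2 dominates A1
    print(descends(branch, a1), descends(a1, branch))  # False, False: a branch, not lineage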
jdbl # Alright, I guess I'm just getting confused about vclocks in Riak.
seancribbs # jdbl: most times you don't need to worry about them
jdbl # So, if I have 3 replicas on 3 nodes, those replicas will
have different vclocks for the same set of data?
seancribbs # jdbl: no
benblack # jdbl: not Riak specific, if i understand your question properly.
just how vclocks work
jdbl # Okay, just to be clear. I'm wrong in the following example then:
I have data A1 with 3 replicas. I write data A2 with W=3, but
something occurs such that only one replica writes A2. The write
is reported as a failure to the client. If the replicas share the
same vclock, isn't this A2 a direct descendant of A1 across all the
replicas, and won't it "win" on the next read repair? Even with multi values?
benblack # jdbl: the vclock is not changed by replication. the version
of a replica is defined by its vclock. all replicas of a given version
of a given doc will have the same vclock. jdbl: sure, the other replicas
will be repaired to the version that was reported to the client as failed.
benblack # and the client(s) will see that on the read,
since the vclock is given.
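A sketch of what jdbl's A1/A2 example converges to, using the same toy clocks as above rather than Riak internals: the one replica that took the "failed" write holds the dominating vclock, so repair copies A2 to the others.

    def descends(a, b):
        return all(a.get(k, 0) >= n for k, n in b.items())

    A1 = ("A1", {"c": 1})
    A2 = ("A2", {"c": 2})            # direct descendant of A1
    replicas = [A2, A1, A1]          # only one replica took the "failed" W=3 write

    def repair(reps):
        winner = reps[0]
        for val, clock in reps[1:]:
            if descends(clock, winner[1]):
                winner = (val, clock)
        return [winner] * len(reps)

    print(repair(replicas))          # every replica ends up at A2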
jdbl # Would only A2 be shown in this case, since it's a direct descendant?
frank06 # seancribbs: works now - i was doing something silly. thanks for
your help!
benblack # they would all be repaired to A2, yes.
seancribbs # also, the get_fsm does some vclock resolution on read,
even before repair
jdbl # Alright. So, despite being reported as failed to the client
(since the requested W quorum wasn't met), the value nevertheless
eventually propagates around the cluster and overwrites the previous
value. So, W-vals / related failure has more to do with guaranteed
durability than consistency?
seancribbs # jdbl: yes
benblack # jdbl: you are only looking at half the equation
seancribbs # however, you have to have a really serious failure to
reach that condition
jdbl # Sure, it's an extreme case, esp. with hinted handoffs. Was
actually thinking about it as a follow up to a previous discussion
here / feature request about "strict quorum" writes that didn't employ
hinted handoffs. In such a case, this scenario could be more likely.
Leading me to wonder if that's even a valid feature request given Riak's design.
seancribbs # jdbl: that would increase failures, yes
and also, if quorum wasn't met, how do you reliably tell the other nodes NOT to write?
seems too error-prone
benblack # it's exactly the same problem as getting the writes consistent
seancribbs # what ben said
jdbl # Sure, makes sense given Riak's design. The only solution that
might work would be to have some setting that only merges on read
repair (never invalidates direct ancestors) and lets the application sort
things out. E.g. if my app always writes with W=majority, I can likely
handle the in-app merging with some sort of majority approach. In any case,
I'm not suggesting such a feature. Was just wanting to check my thoughts on the matter.
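A sketch of the client-side majority resolution jdbl is describing. This is purely hypothetical application logic, and it assumes the app can see which value each replica returned, which a plain Riak read does not expose.

    from collections import Counter

    def majority_merge(per_replica_values, n=3):
        # Given the value seen on each replica, keep the one present on a majority
        # and treat the minority value as the remnant of a failed write.
        value, seen_on = Counter(per_replica_values).most_common(1)[0]
        if seen_on > n // 2:
            return value
        raise ValueError("no majority -- fall back to some other merge strategy")

    print(majority_merge(["A1", "A1", "A2"]))   # 'A1': the lone A2 came from a failed write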
benblack # jdbl: we could call it allow multi with client-side resolution
jdbl # If that's something you see value in supporting, go for it. Might
allow for some clever higher level functionality built on top of Riak that
supports different consistency models for those who care, while keeping Riak
"pure".
seancribbs # jdbl: benblack was being silly - we already expose conflicts
to the client, when allow_mult is set to true
benblack # i am a silly billy.
jdbl # Ah, I was thinking he was referring to a stronger multi-version option
that doesn't invalidate direct vclock ancestors and leaves vclock resolution
to the client, simply responding with a merged set of all existing values.
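For contrast, a sketch of the behavior Riak already offers with allow_mult set to true, written as generic pseudocode rather than any particular client library: the read hands back all sibling values plus a vclock, the application merges them, and the merge written back under that vclock supersedes both siblings.

    def resolve(siblings):
        # Example merge policy for set-valued data: union the siblings.
        merged = set()
        for value in siblings:
            merged |= value
        return merged

    siblings = [{"alice"}, {"alice", "bob"}]    # two concurrent writes
    vclock = "opaque-vclock-from-the-read"      # returned alongside the siblings

    merged = resolve(siblings)
    print(merged)                               # {'alice', 'bob'}
    # storing (key, merged) with vclock attached would then supersede both siblings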
seancribbs # jdbl: but you won't overwrite a direct ancestor unless you
actually supply the ancestor vclock; technically that's not a conflict
benblack # one would assume you initiated the write in the first place
because you wanted it to succeed.
benblack # riak_core_util:start_app_deps in core and kv
jdbl # Well, sure. Unless you have different expectations with
W=1 and W=3. You might want a write to only succeed if all W-vals are met.
We already discussed how riak can report failure, but still allow that
write to propagate. For some apps, that might be a logic error, and the
app would rather "cleanup" that failed write on the next read. It can't
roll back to the previous ancestor because the previous ancestor is already gone.
benblack # jdbl: that would require a global synchronization mechanism,
like paxos.
seancribbs # jdbl: if that's the use-case then you shouldn't use a
BASE system like Riak; you should use an ACID system. again, what ben said
seancribbs # we usually advocate the "write now, resolve later"
attitude
benblack # there are times you want that behavior. if you require
it, you should definitely use a system that supports it.
seancribbs # (for riak)
benblack # like all dynamo systems, riak is eventually consistent.
transactions just aren't in the cards.
seancribbs # think of write quorums more like soft guarantees that the
data was written
jdbl # Yeah, I understand that. Roughly 90% or more of my operations work
fine with BASE (hence the focus on Riak). I was just contemplating mapping
the other 10% to riak with slightly more consistency guarantees and client
logic to handle it. Makes more sense from an ops point of view than two separate
datastores. I don't need full ACID anyway, just for a write failure to
not propagate. I believe Cassandra with strict quorums has this behavior.
benblack # it doesn't
no dynamo systems do. transactions are not a good match for them.
seancribbs # jdbl: an RDBMS or kv store with atomic operations
(like Redis) would fit those use-cases. expecting something to fit 100% of your
needs is part of the problem
benblack # jdbl: what you are describing _requires_ global synchronization,
regardless of whether you want "full ACID". the entire design is structured
to avoid global synchronization
seancribbs # there are a number of CP systems out there - Scalaris for example
jdbl # Any thoughts on how Amazon maps their newish SimpleDB optimistic
concurrency logic to an underlying Dynamo store? If I submit an OCC request
to 3 nodes, 1 node succeeds, the others fail the precondition, wouldn't future
read repair have trouble merging them? Then again, I imagine SimpleDB
is more than just Dynamo + fancy API.
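An illustrative check-and-set sketch; it says nothing about SimpleDB's real internals. If each replica evaluates the precondition independently, a conditional write can be accepted by some replicas and rejected by others, which is exactly the partial-success state jdbl is asking about and why it needs coordination.

    replicas = [{"version": 1, "val": "old"} for _ in range(3)]
    replicas[1]["version"] = 2     # this replica already saw a newer write

    def conditional_put(rep, expected_version, new_val):
        # Each replica checks the precondition on its own local state.
        if rep["version"] == expected_version:
            rep["version"] += 1
            rep["val"] = new_val
            return True
        return False

    results = [conditional_put(r, expected_version=1, new_val="new") for r in replicas]
    print(results)                          # [True, False, True] -- partial success
    print([r["val"] for r in replicas])     # the replicas now disagree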
Kenstigator joined #riak
benblack # amazon has not released enough information on the internals
of that to answer. so, yes, one could add a global synchronization
option to the system, but it'd be pretty invasive, i expect.
jdbl # Yeah, I understand. Was thinking of a crude approach that used
a single master node (accepting a single point of failure) to order
writes/reads, and always used W=N/2+1, R=N/2+1. If a write succeeds,
great. If it fails, then a majority of the nodes would have the old
value. But, read repair on a "failed write" broke the logic. In any
case, I'll stop trying to shoehorn out of scope use cases into riak and
look into a secondary datastore.
benblack # one interesting approach might be to have global sync as
a per-bucket option so you never have to deal with mixed buckets.
all ops on the sync bucket are sync
seancribbs # benblack: sounds like it would require 2PC
benblack # yes, or related mechanisms. paxos, VS, whatever. but by doing
it per bucket, and having the bucket behavior be the same for
all ops, you simplify it such that it is tractable
benblack # trying to have some transactional stuff on a bucket while other stuff is not transactional would get hairy
seancribbs # true
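A toy two-phase-commit sketch of the per-bucket "sync" idea being floated; Riak has no such mode, and the names here are made up. The point is that the write becomes visible only if every replica votes yes in the prepare phase, so nothing is ever left half-written.

    def two_phase_commit(replicas, value):
        # Phase 1: prepare -- every replica must promise it can apply the write.
        if not all(rep["up"] for rep in replicas):
            return False                       # abort: no replica holds the new value
        # Phase 2: commit on all replicas.
        for rep in replicas:
            rep["val"] = value
        return True

    replicas = [{"up": True, "val": "A1"},
                {"up": False, "val": "A1"},
                {"up": True, "val": "A1"}]
    print(two_phase_commit(replicas, "A2"))    # False -- and no replica holds A2
    print([r["val"] for r in replicas])        # ['A1', 'A1', 'A1']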
jdbl # BTW, riak is a great product (esp. with bitcask), and I
appreciate all the work. My frustration with certain properties
of other k/v stores was the motivation behind trying to map my
sideline use cases to riak.