Created July 8, 2010 15:48
jdbl # Anyone around to chat about some theoretical failure scenarios / corner cases? Trying to wrap my head around what Riak will do in certain circumstances.
seancribbs # jdbl: ask away
jdbl # So, first to check an assumption. Even with hinted handoffs, it is possible for a write with W>1 to fail to succeed, right? In certain extreme circumstances, that is.
benblack # sure
seancribbs # jdbl: So… yes.
benblack # fail to succeed means fail, right? ;)
jdbl # Sure. Reported as failed by Riak, I guess.
seancribbs # A write can "fail" to the client, but still actually succeed in the cluster.
jdbl # Alright, seancribbs, that's what I'm getting at.
benblack # a write can partially succeed, and then be cleaned up later with repair or HH
seancribbs # e.g. if it times out
jdbl # Will a read repair later on squash that failed write or propagate it despite it being reported as failed?
benblack # it will not be removed; there is nothing to indicate it should be
seancribbs # jdbl: vector clocks prevent the squash
benblack # if you allow multi, both versions can remain in the system
jdbl # Could that "failed write" end up overwriting data on other nodes that never received that write but have data that is a direct ancestor of the failed write? In the vector clock ancestor sense.
benblack # it would be a neat trick to never see a write but create a successor to it
seancribbs # jdbl: the effect is the same
benblack # what you are describing is a branch, not direct lineage
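The descendant-vs-branch distinction can be sketched with a toy vector clock comparison. This is illustrative Python (clocks as actor-to-counter dicts), not Riak's actual Erlang representation:

```python
def descends(a, b):
    """True if clock a is equal to, or a direct descendant of, clock b."""
    return all(a.get(actor, 0) >= count for actor, count in b.items())

def relation(a, b):
    """Classify two vector clocks: equal, direct lineage, or a branch (siblings)."""
    if descends(a, b) and descends(b, a):
        return "equal"
    if descends(a, b):
        return "a descends b"
    if descends(b, a):
        return "b descends a"
    return "siblings (branch)"

a1 = {"node1": 1}
a2 = {"node1": 2}                     # a write that built on a1
a2_other = {"node1": 1, "node2": 1}   # a concurrent write that never saw a2

print(relation(a2, a1))        # "a descends b": direct lineage, a1 can be replaced
print(relation(a2, a2_other))  # "siblings (branch)": neither dominates, both kept
```

A write that never saw another write cannot descend it, which is why the "overwrite without seeing it" scenario produces a branch rather than direct lineage.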
jdbl # Alright, I guess I'm just getting confused about vclocks in Riak.
seancribbs # jdbl: most times you don't need to worry about them
jdbl # So, if I have 3 replicas on 3 nodes, those replicas will have different vclocks for the same set of data?
seancribbs # jdbl: no
benblack # jdbl: not Riak specific, if i understand your question properly. just how vclocks work
jdbl # Okay, just to be clear. I'm wrong in the following example then: I have data A1 with 3 replicas. I write data A2 with W=3, but something occurs such that only one replica writes A2. The write is reported as a failure to the client. If the replicas share the same vclock, isn't this A2 a direct descendant of A1 across all the replicas, and will it "win" on the next read repair? Even with multi values?
benblack # jdbl: the vclock is not changed by replication. the version of a replica is defined by its vclock. all replicas of a given version of a given doc will have the same vclock.
benblack # jdbl: sure, the other replicas will be repaired and have the version reported to the client as failed.
benblack # and the client(s) will see that on the read, since the vclock is given.
jdbl # Would only A2 be shown in this case, since it's a direct descendant?
frank06 # seancribbs: works now - i was doing something silly. thanks for your help!
benblack # they would all be repaired to A2, yes.
seancribbs # also, the get_fsm does some vclock resolution on read, even before repair
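How repair converges on the "failed" write can be sketched the same way: a replica's value survives only if no other replica holds a clock that strictly dominates it. A minimal illustration, not Riak's actual get_fsm logic:

```python
def descends(a, b):
    """True if clock a is equal to, or a descendant of, clock b."""
    return all(a.get(k, 0) >= n for k, n in b.items())

def read_repair(replicas):
    """Keep only values whose clocks are not dominated by another replica's clock."""
    return [
        (clock, value) for clock, value in replicas
        if not any(other != clock and descends(other, clock)
                   for other, _ in replicas)
    ]

# Two replicas still hold A1; one holds the write reported as failed, A2.
replicas = [({"n1": 1}, "A1"), ({"n1": 1}, "A1"), ({"n1": 2}, "A2")]
print(read_repair(replicas))  # [({'n1': 2}, 'A2')] -- every replica repairs to A2
```

A1's clock is a direct ancestor of A2's, so nothing marks A1 as worth keeping; only a true branch would survive as a sibling.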
jdbl # Alright. So, despite being reported as failed to the client (since the requested W quorum wasn't met), the value nevertheless eventually propagates around the cluster and overwrites the previous value. So, W-vals / related failure has more to do with guaranteed durability than consistency?
seancribbs # jdbl: yes
benblack # jdbl: you are only looking at half the equation
seancribbs # however, you have to have a really serious failure to reach that condition
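The failure mode under discussion can be sketched with a toy quorum write: the coordinator sends to all replicas, counts acks, and reports failure when acks fall short of W, even though the value has already landed on the live replicas. Hypothetical helper, not Riak's API:

```python
def quorum_write(replicas, key, value, w):
    """Send the write to every replica; report success only if >= w acks arrive."""
    acks = 0
    for up, store in replicas:
        if up:                  # a down or partitioned replica never acks
            store[key] = value  # but live replicas keep the value regardless
            acks += 1
    return acks >= w

# Three replicas, two unreachable: W=3 cannot be met.
replicas = [(True, {}), (False, {}), (False, {})]
print(quorum_write(replicas, "k", "A2", w=3))  # False: reported failed to the client
print(replicas[0][1])                          # {'k': 'A2'}: yet the write is durable
```

This is the durability-not-consistency point: the failed status only means "fewer than W acks", not "the value is gone".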
jdbl # Sure, it's an extreme case, esp. with hinted handoffs. Was actually thinking about it as a follow-up to a previous discussion here / feature request about "strict quorum" writes that didn't employ hinted handoffs. In such a case, this scenario could be more likely. Leading me to wonder if that's even a valid feature request given Riak's design.
seancribbs # jdbl: that would increase failures, yes
seancribbs # and also, if quorum wasn't met, how do you reliably tell the other nodes NOT to write? seems too error-prone
benblack # it's exactly the same problem as getting the writes consistent
seancribbs # what ben said
jdbl # Sure, makes sense given Riak's design. The only solution that might work would be to have some setting that only merges on read repair (never invalidates direct ancestors) and lets the application sort things out. E.g. if my app always writes with W=majority, I can likely handle the in-app merging with some sort of majority approach. In any case, I'm not suggesting such a feature. Was just wanting to check my thoughts on the matter.
benblack # jdbl: we could call it allow multi with client-side resolution
jdbl # If that's something you see value in supporting, go for it. Might allow for some clever higher-level functionality built on top of Riak that supports different consistency models for those who care, while keeping Riak "pure".
seancribbs # jdbl: benblack was being silly - we already expose conflicts to the client, when allow_mult is set to true
benblack # i am a silly billy.
jdbl # Ah, I was thinking he was referring to a stronger multi-version option that doesn't invalidate direct vclock ancestors and leaves vclock resolution to the client, simply responding with a merged set of all existing values.
seancribbs # jdbl: but you won't overwrite a direct ancestor unless you actually supply the ancestor vclock. technically that's not a conflict
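With allow_mult set, sibling resolution falls to the client: read all siblings, merge them by whatever rule the application defines, then write the merged value back with the vclock from the read so it supersedes the siblings. A minimal sketch with a hypothetical helper, not a real Riak client API:

```python
def resolve_siblings(siblings, merge):
    """Client-side resolution: collapse sibling values into one merged value.
    The caller then writes it back with the vclock returned by the read, so
    the merged value supersedes every sibling it was built from."""
    if len(siblings) == 1:
        return siblings[0]
    return merge(siblings)

# e.g. a set-union merge, shopping-cart style
merged = resolve_siblings([{"a"}, {"a", "b"}],
                          merge=lambda values: set().union(*values))
print(merged)  # {'a', 'b'}
```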
benblack # one would assume you initiated the write in the first place because you wanted it to succeed.
riak_core_util:start_app_deps in core and kv
jdbl # Well, sure. Unless you have different expectations with W=1 and W=3. You might want a write to only succeed if all W-vals are met. We already discussed how riak can report failure, but still allow that write to propagate. For some apps, that might be a logic error, and the app would rather "clean up" that failed write on the next read. It can't roll back to the previous ancestor because the previous ancestor is already gone.
benblack # jdbl: that would require a global synchronization mechanism, like paxos.
seancribbs # jdbl: if that's the use-case then you shouldn't use a BASE system like Riak, you should use an ACID system. again, what ben said
seancribbs # we usually advocate the "write now, resolve later" attitude
benblack # there are times you want that behavior. if you require it, you should definitely use a system that supports it.
seancribbs # (for riak)
benblack # like all dynamo systems, riak is eventually consistent. transactions just aren't in the cards.
seancribbs # think of write quorums more like soft guarantees that the data was written
jdbl # Yeah, I understand that. Roughly 90% or more of my operations work fine with BASE (hence the focus on Riak). I was just contemplating mapping the other 10% to riak with slightly more consistency guarantees and client logic to handle it. Makes more sense from an ops point than two separate datastores. I don't need full ACID anyway, just for a write failure to not propagate. I believe Cassandra with strict quorums seems to have this behavior.
benblack # it doesn't. no dynamo systems do. transactions are not a good match for them.
seancribbs # jdbl: an RDBMS or kv store with atomic operations (like Redis) would fit those use-cases. expecting something to fit 100% of your needs is part of the problem
benblack # jdbl: what you are describing _requires_ global synchronization, regardless of whether you want "full ACID". the entire design is structured to avoid global synchronization
seancribbs # there are a number of CP systems out there - Scalaris for example
jdbl # Any thoughts on how Amazon maps their newish SimpleDB optimistic concurrency logic to an underlying Dynamo store? If I submit an OCC request to 3 nodes, 1 node succeeds, the others fail the precondition, wouldn't future read repair have trouble merging them? Then again, I imagine SimpleDB is more than just Dynamo + fancy API.
Kenstigator joined #riak
benblack # amazon has not released enough information on the internals of that to answer. so, yes, one could add a global synchronization option to the system, but it'd be pretty invasive, i expect.
jdbl # Yeah, I understand. Was thinking of a crude approach that used a single master node (allowing a single point of failure) to order writes/reads, and always used W=N/2+1, R=N/2+1. If a write succeeds, great. If it fails, then a majority of the nodes would have the old value. But read repair on a "failed write" broke the logic. In any case, I'll stop trying to shoehorn out-of-scope use cases into riak and look into a secondary datastore.
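The arithmetic behind that majority scheme: with R = W = N/2 + 1, any read quorum and any write quorum must share at least one replica (R + W > N), so a reader always contacts at least one node holding the latest majority-committed write. The scheme breaks exactly because the minority "failed write" can still win read repair. A sketch of the overlap condition:

```python
def quorums_overlap(n, r, w):
    """Any r-node read set and w-node write set share a replica iff r + w > n."""
    return r + w > n

n = 3
majority = n // 2 + 1                          # R = W = 2 for N = 3
print(quorums_overlap(n, majority, majority))  # True: overlap guaranteed
print(quorums_overlap(n, 1, 1))                # False: R=W=1 gives no such guarantee
```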
benblack # one interesting approach might be to have global sync as a per-bucket option so you never have to deal with mixed buckets. all ops on the sync bucket are sync
seancribbs # benblack: sounds like it would require 2PC
benblack # yes, or related mechanisms. paxos, VS, whatever. but by doing it per bucket, and having the bucket behavior be the same for all ops, you simplify it such that it is tractable
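Classic two-phase commit, as referenced here, can be sketched in a few lines: a prepare round collects votes, and the value commits only if every participant votes yes; otherwise all abort. An illustrative toy with no timeouts or coordinator-failure handling, which are the hard parts of real 2PC:

```python
class Node:
    """A toy 2PC participant that can stage and then commit a value."""
    def __init__(self, healthy=True):
        self.healthy, self.value, self.staged = healthy, None, None
    def prepare(self, value):
        if self.healthy:
            self.staged = value
        return self.healthy      # vote yes only if able to commit
    def commit(self, value):
        self.value, self.staged = value, None
    def abort(self):
        self.staged = None

def two_phase_commit(participants, value):
    """Phase 1: every participant votes. Phase 2: all commit, or all abort."""
    votes = [p.prepare(value) for p in participants]
    if all(votes):
        for p in participants:
            p.commit(value)
        return True
    for p in participants:
        p.abort()
    return False

nodes = [Node(), Node(), Node(healthy=False)]
print(two_phase_commit(nodes, "A2"))   # False: one veto aborts everyone
print([n.value for n in nodes])        # [None, None, None]: nothing committed
```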
benblack # trying to have some transactional stuff on a bucket while other stuff is not transactional would get hairy
seancribbs # true
jdbl # BTW, riak is a great product (esp. with bitcask), and I appreciate all the work. My frustration with certain properties of other k/v stores was the motivation behind trying to map my sideline use cases to riak.