florian # hi guys
seancribbs # howdy
florian # i'm an engineer at Twitter and have started to look into Riak a little bit
benblack # florian: do you work with rk?
florian # what's the largest cluster that you guys know of that's currently running?
seancribbs # 60 nodes is the largest we know of
florian # great. how is churn handled in riak?
benblack # churn?
seancribbs # what specifically about churn?
florian # i.e. if you have a cluster that's set up for multi-tenancy
seancribbs # you mean like concurrent updates?
benblack # set up for multi-tenancy in what way?
florian # nope, more like nodes joining and leaving pretty regularly
seancribbs # vector clocks with sibling/divergent objects. oh - we don't recommend rapidly changing cluster sizes
benblack # nodes joining and leaving should be infrequent events
seancribbs # causes too much handoff
florian # could you elaborate on how this is handled?
benblack # all dynamo-like systems dislike cluster membership churn
benblack # florian: have you read the amazon dynamo paper?
florian # yes
benblack # that's how it is handled.
florian # i see
benblack # florian: do you work with rk or jmhodges?
florian # they're both on the infrastructure team, but yeah, somewhat - this is a one-off project
seancribbs # florian: basically each leaving node is going to have to hand off ownership of partitions to other nodes
and the inverse for joining nodes
benblack # which means lots of data transfer and cleanup
benblack # membership changes are very expensive
seancribbs # too much leaving/joining may also cause aggressive reshuffling, which hurts
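[editor's note: a minimal sketch of the consistent-hashing idea behind the partition handoff discussed above. This is not Riak's actual claim algorithm; the node names, key names, and ring construction are illustrative. It shows that when a node joins, only the keys that land on the new node's arc change owner, so everything that moves goes to the joiner.]

```python
import hashlib

RING = 2**32  # positions on the consistent-hash ring (illustrative size)

def position(name):
    """Deterministically place a name on the ring via a hash."""
    return int(hashlib.sha1(name.encode()).hexdigest(), 16) % RING

def owner(key, nodes):
    """A key is owned by the first node clockwise from its ring position."""
    k = position(key)
    pts = sorted((position(n), n) for n in nodes)
    for pt, n in pts:
        if pt >= k:
            return n
    return pts[0][1]  # wrap around the ring

keys = [f"key{i}" for i in range(1000)]
before = {k: owner(k, ["n1", "n2", "n3"]) for k in keys}
after = {k: owner(k, ["n1", "n2", "n3", "n4"]) for k in keys}  # n4 joins

moved = sum(1 for k in keys if before[k] != after[k])
print(f"{moved}/1000 keys changed owner when n4 joined")
```

Every key that changed owner moved to n4; a leave is the mirror image, which is the data transfer and cleanup cost benblack describes.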
florian # so i've worked with coherence in the past and have been burned by their "dealing" with nodes joining / leaving
i.e. with coherence, if your network was interrupted for just a bit and then reconnected, the shuffling around / re-balancing of the keyspace would often take down the entire cluster
benblack # florian: do you mean permanent leaving of nodes or just temporary?
arg # the overhead comes when nodes *permanently* join/leave the network
florian # temporarily
arg # partitions don't cause much overhead at all
benblack # right, then that's easier
seancribbs # just hinted handoff, which is ez
arg # reads/writes will be handled by other nodes and any written data will be handed off when the node rejoins
^^
florian # would someone explain to me the algorithm used to a) decide when a partition is handed off, b) how it is handed off?
benblack # florian: the partition is not handed off
benblack # in case of temporary failure the mutations to a partition are given to other nodes
"hints"
arg # each partition is represented by a process that periodically checks to see if it is "home", i.e. it owns the partition in the consistent hashing state
benblack # hence the name hinted handoff
seancribbs # partition (network failure) != partition (subdivision of the keyspace)
arg # if it doesn't own it, it either means membership changed or it has taken some handoff data
benblack # when the node rejoins, the hints are handed off
* arg # stands down
benblack # good point, seancribbs
seancribbs # that may have been confusing
* arg # stands up
seancribbs # we usually mean the keyspace one when we say partition
florian # ok cool.
benblack # a network partition is the thing that can cause hints to be stored on other nodes for later handoff when the partition is healed
florian # so what will happen if i have a cluster with 6 nodes, 2 of them are temporarily disconnected (let's say for 10 minutes), then they come back?
arg # other nodes will take writes and answer reads
when the nodes come back, the nodes that took the writes will see them come back
arg # and hand back the data they accepted for them while they were down/partitioned
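[editor's note: a hypothetical sketch of the hinted-handoff flow arg and benblack just described: a write destined for a down node is accepted by a fallback node along with a "hint" naming the intended owner, and handed back when the owner returns. All class and function names are invented for illustration.]

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.up = True
        self.data = {}   # key -> value this node owns
        self.hints = {}  # intended owner name -> {key: value} held for them

    def put(self, key, value, intended_owner=None):
        if intended_owner is not None and intended_owner is not self:
            # We are a fallback: keep the data plus a hint for later handoff.
            self.hints.setdefault(intended_owner.name, {})[key] = value
        else:
            self.data[key] = value

def cluster_put(key, value, owner, fallback):
    """Route a write to the owner, or to a fallback node with a hint."""
    if owner.up:
        owner.put(key, value)
    else:
        fallback.put(key, value, intended_owner=owner)

def handoff(fallback, owner):
    """Owner has rejoined: the fallback hands its hinted data back."""
    for key, value in fallback.hints.pop(owner.name, {}).items():
        owner.put(key, value)

a, b = Node("a"), Node("b")
a.up = False                   # a is temporarily partitioned away
cluster_put("k1", "v1", a, b)  # b accepts the write with a hint for a
a.up = True
handoff(b, a)                  # hint handed off when a returns
print(a.data)                  # {'k1': 'v1'}
```

The real system's periodic "am I home?" check (per-partition processes noticing they hold data for a partition they don't own) is what triggers the `handoff` step here.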
* seancribbs # steps away since arg and benblack have this covered
florian # so does it generally work pretty well across datacenters?
benblack # florian: the dynamo paper really does cover this
florian # benblack: i think that's also what coherence would say :)
seancribbs # florian: we generally consider a cluster to be within a single DC
florian # this is awesome info
benblack # coherence would say it is covered in the dynamo paper?
arg # florian: in the enterprise product, yes. currently multi-datacenter is a for-pay feature, but that might not always be the case.
florian # ok thanks guys
roidrage # indeed, thanks! it's like interactive learning, reading through this conversation :)
seancribbs # roidrage: Gradus ad Parnassum
arg # but to actually answer the multi-datacenter question, it is handled well. each object is tagged with a vector clock that is used to detect/expose conflicting/non-causally-related writes, in which case you can either get all the conflicting objects or choose to let the latest timestamp win for a given bucket/key pair
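[editor's note: a small illustration of the vector-clock comparison arg describes, assuming clocks are maps of actor -> update counter. If one clock has seen everything the other has, it wins; otherwise the writes are non-causally related and you either keep both as siblings or let the latest timestamp win. The actor names are made up; this is not Riak's internal representation.]

```python
def descends(a, b):
    """True if clock a has seen every update that clock b has."""
    return all(a.get(actor, 0) >= count for actor, count in b.items())

def resolve(a, b):
    if descends(a, b):
        return "a wins"
    if descends(b, a):
        return "b wins"
    return "conflict: keep both as siblings (or pick latest timestamp)"

print(resolve({"client1": 2}, {"client1": 1}))                # a wins
print(resolve({"client1": 1}, {"client1": 1, "client2": 1}))  # b wins
print(resolve({"client1": 2}, {"client2": 1}))                # conflict
```

The third case is the interesting one: neither clock descends from the other, so the writes are concurrent, and that is when the per-bucket choice between exposing siblings and last-timestamp-wins applies.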