florian # hi guys
seancribbs # howdy
florian # i'm an engineer at Twitter and have started to look into Riak a little bit
benblack # florian: do you work with rk?
florian # what's the largest cluster that you guys know of that's currently running?
seancribbs # 60 nodes is the largest we know of
florian # great. how is churn handled in riak?
benblack # churn?
seancribbs # what specifically about churn?
florian # i.e. if you have a cluster that's set up for multi-tenancy
seancribbs # you mean like concurrent updates?
benblack # set up for multi-tenancy in what way?
florian # nope, more like nodes joining and leaving pretty regularly
seancribbs # vector clocks with sibling/divergent objects. oh - we don't recommend rapidly changing cluster sizes
benblack # nodes joining and leaving should be infrequent events
seancribbs # causes too much handoff
florian # could you elaborate on how this is handled?
benblack # all dynamo-like systems dislike cluster membership churn
benblack # florian: have you read the amazon dynamo paper?
florian # yes
benblack # that's how it is handled.
florian # i see
benblack # florian: do you work with rk or jmhodges?
florian # they're both on the infrastructure team, but yeah, somewhat - this is a one-off project
seancribbs # florian: basically each leaving node is going to have to hand off ownership of partitions to other nodes
and the inverse for joining nodes
benblack # which means lots of data transfer and cleanup
benblack # membership changes are very expensive
seancribbs # too much leaving/joining may also cause aggressive reshuffling, which hurts
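[editor's note: a minimal sketch of the consistent-hashing idea behind the partition handoff discussed above. This is not Riak's actual claim algorithm; the node names, key names, and ring construction are illustrative. It shows that when a node joins, only the keys that land on the new node's arc change owner, so everything that moves goes to the joiner.]

```python
import hashlib

RING = 2**32  # positions on the consistent-hash ring (illustrative size)

def position(name):
    """Deterministically place a name on the ring via a hash."""
    return int(hashlib.sha1(name.encode()).hexdigest(), 16) % RING

def owner(key, nodes):
    """A key is owned by the first node clockwise from its ring position."""
    k = position(key)
    pts = sorted((position(n), n) for n in nodes)
    for pt, n in pts:
        if pt >= k:
            return n
    return pts[0][1]  # wrap around the ring

keys = [f"key{i}" for i in range(1000)]
before = {k: owner(k, ["n1", "n2", "n3"]) for k in keys}
after = {k: owner(k, ["n1", "n2", "n3", "n4"]) for k in keys}  # n4 joins

moved = sum(1 for k in keys if before[k] != after[k])
print(f"{moved}/1000 keys changed owner when n4 joined")
```

Every key that changed owner moved to n4; a leave is the mirror image, which is the data transfer and cleanup cost benblack describes.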
florian # so i've worked with coherence in the past and have been burned by their "dealing" with nodes joining / leaving
i.e. with coherence, if your network was interrupted for just a bit and then reconnected, the shuffling around / re-balancing of the keyspace would often take down the entire cluster
benblack # florian: do you mean permanent leaving of nodes or just temporary?
arg # the overhead comes when nodes *permanently* join/leave the network
florian # temporarily
arg # partitions don't cause much overhead at all
benblack # right, then that's easier
seancribbs # just hinted handoff, which is ez
arg # reads/writes will be handled by other nodes and any written data will be handed off when the node rejoins
^^
florian # would someone explain to me the algorithm used to a) decide when a partition is handed off, b) how it is handed off?
benblack # florian: the partition is not handed off
benblack # in case of temporary failure the mutations to a partition are given to other nodes
"hints"
arg # each partition is represented by a process that periodically checks to see if it is "home", i.e. it owns the partition in the consistent hashing state
benblack # hence the name hinted handoff
seancribbs # partition (network failure) != partition (subdivision of the keyspace)
arg # if it doesn't own it, it either means membership changed or it has taken some handoff data
benblack # when the node rejoins, the hints are handed off
* arg # stands down
benblack # good point, seancribbs
seancribbs # that may have been confusing
* arg # stands up
seancribbs # we usually mean the keyspace one when we say partition
florian # ok cool.
benblack # a network partition is the thing that can cause hints to be stored on other nodes for later handoff when the partition is healed
florian # so what will happen if i have a cluster with 6 nodes, 2 of them are temporarily disconnected (let's say for 10 minutes), then they come back?
arg # other nodes will take writes and answer reads
when the nodes come back, the nodes that took the writes will see them come back
arg # and hand back the data they accepted for them while they were down/partitioned
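[editor's note: a hypothetical sketch of the hinted-handoff flow arg and benblack just described: a write destined for a down node is accepted by a fallback node along with a "hint" naming the intended owner, and handed back when the owner returns. All class and function names are invented for illustration.]

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.up = True
        self.data = {}   # key -> value this node owns
        self.hints = {}  # intended owner name -> {key: value} held for them

    def put(self, key, value, intended_owner=None):
        if intended_owner is not None and intended_owner is not self:
            # We are a fallback: keep the data plus a hint for later handoff.
            self.hints.setdefault(intended_owner.name, {})[key] = value
        else:
            self.data[key] = value

def cluster_put(key, value, owner, fallback):
    """Route a write to the owner, or to a fallback node with a hint."""
    if owner.up:
        owner.put(key, value)
    else:
        fallback.put(key, value, intended_owner=owner)

def handoff(fallback, owner):
    """Owner has rejoined: the fallback hands its hinted data back."""
    for key, value in fallback.hints.pop(owner.name, {}).items():
        owner.put(key, value)

a, b = Node("a"), Node("b")
a.up = False                   # a is temporarily partitioned away
cluster_put("k1", "v1", a, b)  # b accepts the write with a hint for a
a.up = True
handoff(b, a)                  # hint handed off when a returns
print(a.data)                  # {'k1': 'v1'}
```

The real system's periodic "am I home?" check (per-partition processes noticing they hold data for a partition they don't own) is what triggers the `handoff` step here.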
* seancribbs # steps away since arg and benblack have this covered
florian # so does it generally work pretty well across datacenters?
benblack # florian: the dynamo paper really does cover this
florian # benblack: i think that's also what coherence would say :)
seancribbs # florian: we generally consider a cluster to be within a single DC
florian # this is awesome info
benblack # coherence would say it is covered in the dynamo paper?
arg # florian: in the enterprise product, yes. currently multi-datacenter is a for-pay feature, but that might not always be the case.
florian # ok thanks guys
roidrage # indeed, thanks! it's like interactive learning, reading through this conversation :)
seancribbs # roidrage: Gradus ad Parnassum
arg # but to actually answer the multi-datacenter question, it is handled well. each object is tagged with a vector clock that is used to detect/expose conflicting/non-causally-related writes, in which case you can either get all the conflicting objects or choose to let the latest timestamp win for a given bucket/key pair
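[editor's note: a small illustration of the vector-clock comparison arg describes, assuming clocks are maps of actor -> update counter. If one clock has seen everything the other has, it wins; otherwise the writes are non-causally related and you either keep both as siblings or let the latest timestamp win. The actor names are made up; this is not Riak's internal representation.]

```python
def descends(a, b):
    """True if clock a has seen every update that clock b has."""
    return all(a.get(actor, 0) >= count for actor, count in b.items())

def resolve(a, b):
    if descends(a, b):
        return "a wins"
    if descends(b, a):
        return "b wins"
    return "conflict: keep both as siblings (or pick latest timestamp)"

print(resolve({"client1": 2}, {"client1": 1}))                # a wins
print(resolve({"client1": 1}, {"client1": 1, "client2": 1}))  # b wins
print(resolve({"client1": 2}, {"client2": 1}))                # conflict
```

The third case is the interesting one: neither clock descends from the other, so the writes are concurrent, and that is when the per-bucket choice between exposing siblings and last-timestamp-wins applies.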