Created October 6, 2010 17:25
09:56 <kagato> Knock, knock. Can somebody give me a primer on ring_creation_size?
09:57 <drev1> kagato: sure
09:57 <drev1> ring_creation_size is the number of partitions the entire cluster manages
09:57 <kagato> So, can it ever be changed (i.e. with downtime)?
09:57 <drev1> this value remains constant for the life of the cluster
09:57 <drev1> you would need to build a new cluster and transfer the data
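A rough sketch of why the partition count is fixed for the life of the cluster: keys are consistently hashed onto a ring that is pre-divided into ring_creation_size equal intervals, so changing that value would remap nearly every key. This is an illustrative model, not Riak's actual code (Riak hashes onto a 160-bit SHA-1 ring; the key format here is made up):

```python
import hashlib

RING_SIZE = 64  # ring_creation_size, fixed at cluster creation

def partition_for(key: bytes) -> int:
    # Hash the key onto the 160-bit ring, then map it to one of
    # RING_SIZE equal intervals. If RING_SIZE ever changed, this
    # mapping would change for almost every key -- which is why the
    # value is permanent for the cluster.
    h = int.from_bytes(hashlib.sha1(key).digest(), "big")
    return h * RING_SIZE >> 160

p = partition_for(b"bucket/key")
assert 0 <= p < RING_SIZE
```

The same hash-to-fixed-interval scheme is why migrating to a cluster with a different ring_creation_size means rewriting the data, not just copying files.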
09:58 <kagato> Runtime life of the cluster? Or life of the data on disk?
09:58 <benblack> life of the cluster.
09:58 <kagato> Is that automated?
09:58 <benblack> no
09:59 <kagato> ?
09:59 <benblack> it's not automated
09:59 <kagato> So, the creation size causes a permanent impact on how the data is stored?
10:00 <drev1> yes
10:00 <kagato> Let me describe my use-case, and you can tell me if this is the right mindset.
10:00 <benblack> drev1: really?
10:01 <roidrage> arg_: ping!
10:01 <drev1> benblack: ?
10:01 <kagato> I have a Cloud. It has physical nodes. Those nodes need to store their runtime statistics.
10:01 <kagato> Each node has a copy of riak, they form a ring. Number of nodes could reach to the order of 10^5.
10:01 <benblack> how does it impact how data is stored on disk?
10:01 <kagato> Is it better to use a dedicated cluster? Does riak hyper-connect or is it logarithmic like Chord and family?
10:01 <benblack> kagato: i don't believe anyone has ever constructed a ring that large.
10:02 <benblack> is the topology going to be stable?
10:02 <roidrage> i guess arg_ took that nap he was talking about ;)
10:03 <drev1> benblack: ring_creation_size determines how many partitions the data is broken into
10:03 <drev1> it's not permanent for the data
10:03 <drev1> but it's permanent for the cluster
10:03 <kagato> On the order of hours, yes.
10:03 <kagato> drev1: So, if I took down the entire cluster and changed the value in the config, and reloaded it.... what happens?
10:03 <kagato> Other than a storm of automatic rebalancing.
10:05 <drev1> Riak will keep the old value unless you delete the ring file
10:05 <kagato> The thing is that the latter is very doable if we ever need to change it. Migrating between clusters is not. We have the automation to make the former really easy.
10:05 <kagato> If I delete the ring file will it trash the data?
10:05 <kagato> Or is the ring file basically a cache of who-had-what at last run?
10:05 <drev1> the data will be on disk but the reference to the data will be lost
10:06 <kagato> drev1: Hmmmmm, that sounds bad.
10:06 <benblack> it is bad
10:07 <kagato> What's the impact if I start out with a massively larger creation size? Say, 65536 for a cluster of 128?
10:07 <kagato> I don't have a problem with the data being spread really thin, as we're doing realtime reporting on it and we'll be generating it at exactly the same number of nodes as we'd have in the ring. We expect to map-reduce the hell out of it.
10:08 <kagato> Also, what's the impact of "the topology not being stable" as mentioned above?
10:09 <benblack> the ring must stabilize between topology changes
10:09 <benblack> that is mostly determined by gossip interval
10:09 <kagato> Is it unavailable during that period?
10:09 <benblack> of course not
10:09 <drev1> each partition a node is responsible for will consume a certain amount of system resources, typically the limiting factor is allowed number of open files
10:10 <drev1> a larger ring_creation_size imposed on a small number of machines would mean each individual machine would be responsible for more partitions
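The arithmetic behind drev1's point, using the 65536-partition / 128-node figures from the question above (the open-files-per-partition multiplier is a hypothetical illustration, not a measured Riak number):

```python
# Each node owns roughly ring_creation_size / num_nodes partitions,
# since Riak spreads partitions evenly across the cluster.
ring_creation_size = 65536
num_nodes = 128

partitions_per_node = ring_creation_size // num_nodes
print(partitions_per_node)  # 512

# Hypothetical resource check: if each partition's storage backend
# held, say, 4 files open, one node would need ~2048 descriptors
# for partition storage alone -- compare against `ulimit -n`.
open_files_per_partition = 4
print(partitions_per_node * open_files_per_partition)  # 2048
```

This is the "512 partitions per node" split kagato asks about just below.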
10:10 <benblack> but if you have constant topology churn you probably need a short gossip interval and a very large n_val
10:10 <benblack> both of which work against your goals, i think.
10:10 <kagato> So, let's say the split was 512 partitions per node. What's realistic?
10:11 <benblack> that's only part of the issue in doing something like that, imo.
10:12 <drev1> 512 partitions per node is possible
10:12 <kagato> As it stands right now, riak wins hands-down on everything but this partition thing. I gather it's an artifact of choosing consistent hash intervals over dynamic intervals. Right now, I'm trying very hard not to end up with the mess that is Cassandra, so I really need some hard information.
10:12 <kagato> Is there a guide on this?
10:12 <kagato> Or am I going to write it?
10:12 <kagato> How does the gossip traffic grow? Is it per node? Per partition?
10:12 <drev1> http://wiki.basho.com/display/RIAK/An+Introduction+to+Riak
10:13 <drev1> gossip is per node
10:13 <kagato> Sorry for the interrogation, but, y'know, it's how I roll.
10:13 <benblack> per node and by gossip interval
10:14 <kagato> So, number of gossip packets is proportional to nodes. Size of gossip packets is proportional to number of partitions?
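A toy model of the scaling kagato is proposing here: message rate grows with node count over the gossip interval, and message size grows with partition count. The constants below are made up for illustration, not measured Riak behavior:

```python
def gossip_bytes_per_sec(nodes, partitions, gossip_interval_s,
                         bytes_per_partition_entry=32):
    # Each node gossips about once per interval, so cluster-wide
    # message rate scales with nodes / interval; each message
    # carries the ring state, whose size scales with the number
    # of partitions.
    messages_per_sec = nodes / gossip_interval_s
    message_size = partitions * bytes_per_partition_entry
    return messages_per_sec * message_size

# 128 nodes, 65536 partitions, 60s gossip interval
print(gossip_bytes_per_sec(128, 65536, 60))
```

Under this model, a shorter gossip interval (as benblack suggests for churny topologies) multiplies total gossip traffic, which is one way "a very large ring" and "constant topology churn" work against each other.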
10:15 <kagato> That document is pretty light on the details. It tells me what a partition is, and how it fits into replication, but it doesn't really tell me much about how they affect the lifecycle. :(
10:15 <kagato> Would you all suggest that having a riak cluster span, say, four datacenters is a bad thing?
10:16 <drev1> we don't have documentation with that level of detail as far as I know
10:16 <drev1> yes
10:16 <kagato> Again, not concerned about consistency or speed, only availability.
10:16 <benblack> depends on latency between them
10:16 <benblack> if latency is high, it's not a good thing
10:16 <kagato> Latency on order of 300ms.
10:16 <benblack> then no
10:16 <benblack> that's not a good thing
10:17 <kagato> Good latency is...?
10:17 <benblack> small number of milliseconds
10:17 <kagato> Microseconds?
10:17 <benblack> what you'd have with multiple datacenters in the same metro area
10:17 <kagato> Okay.
10:18 <kagato> Just to clarify, I'm searching for deeper competitive advantages versus Cassandra.
10:19 <benblack> cassandra would not be a good choice for that many nodes and with that much topology change
10:19 <benblack> at all
10:20 <kagato> I doubt it would be either.
10:20 <kagato> Right now I'm setting the requirements.
10:20 <kagato> If I set the requirements right, I don't have to worry about having to muck about with Cassandra. As such, I'm looking for useful things that Riak can do that Cassandra can't, so they can be requirements, ...
10:20 <benblack> what you are describing can be done, just realize you are the first to attempt it.
10:21 <benblack> kagato: node recovery distributes repair load across the rest of the nodes would be a useful requirement there.
10:21 <kagato> Exciting. Our PR people may be interested in some cross-promotional whitepaper then.
10:21 <kagato> benblack: Are nodes expendable? I.e. can one die forever?
10:21 <benblack> certainly, you just remove it from the ring
10:22 <kagato> Does that break replication guarantees (i.e. stuff that was stored thrice is now stored twice in some cases)?
10:22 <kagato> Or does riak intelligently re-replicate?
10:22 <benblack> it is until the partitions are redistributed and the n_val restored
10:22 <benblack> which happens automatically
10:23 <benblack> if you haven't read the dynamo paper, that'd be a good thing to do
10:23 <kagato> Gotcha. That actually brilliantly illustrates the point of partitions.
10:23 <kagato> I've read the Dynamo paper, but Q&A helps.
10:24 <kagato> My biggest problem is that I was a user of Dynomite, so I have to be extra careful to remember which thing does what. :(
10:24 <benblack> what, a user of dynomite? summon moonpolysoft
10:25 <kagato> benblack: I have visions of a giant dong on the clouds, in the style of the batsignal.
10:25 <benblack> man, that's a good idea
10:27 <kagato> Thanks guys. I'll check back later when my cluster 'splodes.
10:27 <moonpolysoft> hah
10:27 <* kagato> waves at moonpolysoft.
10:27 <moonpolysoft> sup dude
10:28 <kagato> moonpolysoft: Not much. Just workin'.
10:29 <kagato> Recently been working on Erlang-style actors in Python, since that's what I've got to work with at <client redacted>.
10:29 <moonpolysoft> that sounds horrible
10:30 <moonpolysoft> might as well be node.js for all of the concurrency you'll get out of python
10:30 <kagato> Well, having classes is nicer than modules/records for dealing with tons of loosely associated data and functions that deal with it. However, the lack of a decent syntax for pattern-matching will always make it painful.
10:31 <kagato> The plan is to make a bridge into Erlang, convince them to use it, then migrate stuff across.
10:31 <moonpolysoft> the wages of consulting
10:31 <kagato> After the semi-success of BERT, I figure it's time for PERT or something (cue shampoo jokes).
10:31 <moonpolysoft> well good luck with that
10:32 <kagato> Thanks.