@PharkMillups
Created October 6, 2010 17:25
09:56 <kagato> Knock, knock. Can somebody give me a primer on ring_creation_size?
09:57 <drev1> kagato: sure
09:57 <drev1> ring_creation_size is the number of partitions the entire cluster manages
09:57 <kagato> So, can it ever be changed (i.e. with downtime)?
09:57 <drev1> this value remains constant for the life of the cluster
09:57 <drev1> you would need to build a new cluster and transfer the data
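For reference, ring_creation_size lives in the riak_core section of app.config and has to be chosen before the cluster's first start; a minimal sketch, with an illustrative value:

    %% app.config (riak_core section); the value is illustrative, not a recommendation
    {riak_core, [
        %% total number of partitions in the ring; fixed for the life
        %% of the cluster, and conventionally a power of two
        {ring_creation_size, 64}
    ]}.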
09:58 <kagato> Runtime life of the cluster? Or life of the data on disk?
09:58 <benblack> life of the cluster.
09:58 <kagato> Is that automated?
09:58 <benblack> no
09:59 <kagato> ?
09:59 <benblack> it's not automated
09:59 <kagato> So, the creation size causes a permanent impact on how the data is stored?
10:00 <drev1> yes
10:00 <kagato> Let me describe my use-case, and you can tell me if this is the right mindset.
10:00 <benblack> drev1: really?
10:01 <roidrage> arg_: ping!
10:01 <drev1> benblack: ?
10:01 <kagato> I have a Cloud. It has physical nodes. Those nodes need to store their runtime statistics.
10:01 <kagato> Each node has a copy of riak; they form a ring. The number of nodes could reach the order of 10^5.
10:01 <benblack> how does it impact how data is stored on disk?
10:01 <kagato> Is it better to use a dedicated cluster? Does riak hyper-connect or is it logarithmic like Chord and family?
10:01 <benblack> kagato: i don't believe anyone has ever constructed a ring that large.
10:02 <benblack> is the topology going to be stable?
10:02 <roidrage> i guess arg_ took that nap he was talking about ;)
10:03 <drev1> benblack: ring_creation_size determines how many partitions the data is broken into
10:03 <drev1> it's not permanent for the data
10:03 <drev1> but it's permanent for the cluster
10:03 <kagato> On the order of hours, yes.
10:03 <kagato> drev1: So, if I took down the entire cluster, changed the value in the config, and reloaded it... what happens?
10:03 <kagato> Other than a storm of automatic rebalancing.
10:05 <drev1> Riak will keep the old value unless you delete the ring file
10:05 <kagato> The thing is that the latter is very doable if we ever need to change it. Migrating between clusters is not. We have the automation to make the former really easy.
10:05 <kagato> If I delete the ring file will it trash the data?
10:05 <kagato> Or is the ring file basically a cache of who-had-what at last run?
10:05 <drev1> the data will be on disk but the reference to the data will be lost
10:06 <kagato> drev1: Hmmmmm, that sounds bad.
10:06 <benblack> it is bad
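The ring file mentioned here is the ring state each node persists under riak_core's ring_state_dir setting; a sketch, with the path assumed to be the usual default:

    %% app.config (riak_core section); path shown is an assumed default
    {riak_core, [
        %% on-disk ring state; deleting files here loses the mapping
        %% from partitions to the data already on disk
        {ring_state_dir, "data/ring"}
    ]}.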
10:07 <kagato> What's the impact if I start out with a massively larger creation size? Say, 65536 for a cluster of 128?
10:07 <kagato> I don't have a problem with the data being spread really thin, as we're doing realtime reporting on it and we'll be generating it at exactly the same number of nodes as we'd have in the ring. We expect to map-reduce the hell out of it.
10:08 <kagato> Also, what's the impact of "the topology not being stable" as mentioned above?
10:09 <benblack> the ring must stabilize between topology changes
10:09 <benblack> that is mostly determined by gossip interval
10:09 <kagato> Is it unavailable during that period?
10:09 <benblack> of course not
10:09 <drev1> each partition a node is responsible for will consume a certain amount of system resources; typically the limiting factor is the allowed number of open files
10:10 <drev1> a larger ring_creation_size imposed on a small number of machines would mean each individual machine would be responsible for more partitions
10:10 <benblack> but if you have constant topology churn you probably need a short gossip interval and a very large n_val
10:10 <benblack> both of which work against your goals, i think.
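Both knobs are plain configuration: gossip_interval is a riak_core setting in milliseconds, and n_val is a per-bucket property. A hedged sketch, with values and the client call illustrative of the era's API:

    %% app.config (riak_core section); 60000 ms is an illustrative value
    {riak_core, [
        {gossip_interval, 60000}
    ]}.

    %% n_val is set per bucket, e.g. from the Erlang PB client
    %% (bucket name and value are illustrative)
    riakc_pb_socket:set_bucket(Pid, <<"node_stats">>, [{n_val, 5}]).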
10:10 <kagato> So, let's say the split was 512 partitions per node. What's realistic?
10:11 <benblack> that's only part of the issue in doing something like that, imo.
10:12 <drev1> 512 partitions per node is possible
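The arithmetic behind that split, checked in an Erlang shell using the figures from the discussion:

    %% 65536 partitions spread across 128 nodes
    1> 65536 div 128.
    512
    %% per drev1 above, each of those 512 vnodes consumes system
    %% resources such as open files, so per-node limits are what to watch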
10:12 <kagato> As it stands right now, riak wins hands-down on everything but this partition thing. I gather it's an artifact of choosing consistent hash intervals over dynamic intervals. Right now, I'm trying very hard not to end up with the mess that is Cassandra, so I really need some hard information.
10:12 <kagato> Is there a guide on this?
10:12 <kagato> Or am I going to write it?
10:12 <kagato> How does the gossip traffic grow? Is it per node? Per partition?
10:12 <drev1> http://wiki.basho.com/display/RIAK/An+Introduction+to+Riak
10:13 <drev1> gossip is per node
10:13 <kagato> Sorry for the interrogation, but, y'know, it's how I roll.
10:13 <benblack> per node and by gossip interval
10:14 <kagato> So, number of gossip packets is proportional to nodes. Size of gossip packets is proportional to number of partitions?
10:15 <kagato> That document is pretty light on the details. It tells me what a partition is, and how it fits into replication, but it doesn't really tell me much about how they affect the lifecycle. :(
10:15 <kagato> Would you all suggest that having a riak cluster span, say, four datacenters is a bad thing?
10:16 <drev1> we don't have documentation with that level of detail as far as I know
10:16 <drev1> yes
10:16 <kagato> Again, not concerned about consistency or speed, only availability.
10:16 <benblack> depends on latency between them
10:16 <benblack> if latency is high, it's not a good thing
10:16 <kagato> Latency on order of 300ms.
10:16 <benblack> then no
10:16 <benblack> that's not a good thing
10:17 <kagato> Good latency is...?
10:17 <benblack> small number of milliseconds
10:17 <kagato> Microseconds?
10:17 <benblack> what you'd have with multiple datacenters in the same metro area
10:17 <kagato> Okay.
10:18 <kagato> Just to clarify, I'm searching for deeper competitive advantages versus Cassandra.
10:19 <benblack> cassandra would not be a good choice for that many nodes and with that much topology change
10:19 <benblack> at all
10:20 <kagato> I doubt it would be either.
10:20 <kagato> Right now I'm setting the requirements.
10:20 <kagato> If I set the requirements right, I don't have to worry about having to muck about with Cassandra. As such, I'm looking for useful things that Riak can do that Cassandra can't, so they can be requirements, ...
10:20 <benblack> what you are describing can be done, just realize you are the first to attempt it.
10:21 <benblack> kagato: node recovery distributing repair load across the rest of the nodes would be a useful requirement there.
10:21 <kagato> Exciting. Our PR people may be interested in some cross-promotional whitepaper then.
10:21 <kagato> benblack: Are nodes expendable? I.e. can one die forever?
10:21 <benblack> certainly, you just remove it from the ring
10:21 <kagato> Does that break replication guarantees (i.e. stuff that was stored thrice is now stored twice in some cases)?
10:22 <kagato> Or does riak intelligently re-replicate?
10:22 <benblack> it is until the partitions are redistributed and the n_val restored
10:22 <benblack> which happens automatically
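Removing a node is an admin-tool operation; a sketch using riak-admin, with the node name illustrative and the exact command names an assumption to verify against the installed version's riak-admin help:

    # on the departing node, for a graceful exit with handoff
    riak-admin leave

    # or, from a surviving node, force out a node that died
    riak-admin remove riak@oldnode.example.com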
10:23 <benblack> if you haven't read the dynamo paper, that'd be a good thing to do
10:23 <kagato> Gotcha. That actually brilliantly illustrates the point of partitions.
10:23 <kagato> I've read the Dynamo paper, but Q&A helps.
10:24 <kagato> My biggest problem is that I was a user of Dynomite, so I have to be extra careful to remember which thing does what. :(
10:24 <benblack> what, a user of dynomite? summon moonpolysoft
10:25 <kagato> benblack: I have visions of a giant dong on the clouds, in the style of the batsignal.
10:25 <benblack> man, that's a good idea
10:27 <kagato> Thanks guys. I'll check back later when my cluster 'splodes.
10:27 <moonpolysoft> hah
10:27 * kagato waves at moonpolysoft.
10:27 <moonpolysoft> sup dude
10:28 <kagato> moonpolysoft: Not much. Just workin'.
10:29 <kagato> Recently been working on Erlang-style actors in Python, since that's what I've got to work with at <client redacted>.
10:29 <moonpolysoft> that sounds horrible
10:30 <moonpolysoft> might as well be node.js for all of the concurrency you'll get out of python
10:30 <kagato> Well, having classes is nicer than modules/records for dealing with tons of loosely associated data and functions that deal with it. However, the lack of a decent syntax for pattern-matching will always make it painful.
10:31 <kagato> The plan is to make a bridge into Erlang, convince them to use it, then migrate stuff across.
10:31 <moonpolysoft> the wages of consulting
10:31 <kagato> After the semi-success of BERT, I figure it's time for PERT or something (cue shampoo jokes).
10:31 <moonpolysoft> well good luck with that
10:32 <kagato> Thanks.