Created October 6, 2010 17:25
09:56 <kagato> Knock, knock. Can somebody give me a primer on ring_creation_size?
09:57 <drev1> kagato: sure
09:57 <drev1> ring_creation_size is the number of partitions the entire cluster manages
09:57 <kagato> So, can it ever be changed (i.e. with downtime)?
09:57 <drev1> this value remains constant for the life of the cluster
09:57 <drev1> you would need to build a new cluster and transfer the data
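A rough sketch of why the partition count is fixed for the life of the cluster: keys are consistently hashed onto a ring that is pre-divided into ring_creation_size equal intervals, so changing that value would remap nearly every key. This is an illustrative model, not Riak's actual code (Riak hashes onto a 160-bit SHA-1 ring; the key format here is made up):

```python
import hashlib

RING_SIZE = 64  # ring_creation_size, fixed at cluster creation

def partition_for(key: bytes) -> int:
    # Hash the key onto the 160-bit ring, then map it to one of
    # RING_SIZE equal intervals. If RING_SIZE ever changed, this
    # mapping would change for almost every key -- which is why the
    # value is permanent for the cluster.
    h = int.from_bytes(hashlib.sha1(key).digest(), "big")
    return h * RING_SIZE >> 160

p = partition_for(b"bucket/key")
assert 0 <= p < RING_SIZE
```

The same hash-to-fixed-interval scheme is why migrating to a cluster with a different ring_creation_size means rewriting the data, not just copying files.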
09:58 <kagato> Runtime life of the cluster? Or life of the data on disk?
09:58 <benblack> life of the cluster.
09:58 <kagato> Is that automated?
09:58 <benblack> no
09:59 <kagato> ?
09:59 <benblack> it's not automated
09:59 <kagato> So, the creation size causes a permanent impact on how the data is stored?
10:00 <drev1> yes
10:00 <kagato> Let me describe my use-case, and you can tell me if this is the right mindset.
10:00 <benblack> drev1: really?
10:01 <roidrage> arg_: ping!
10:01 <drev1> benblack: ?
10:01 <kagato> I have a Cloud. It has physical nodes. Those nodes need to store their runtime statistics.
10:01 <kagato> Each node has a copy of riak, they form a ring. Number of nodes could reach to the order of 10^5.
10:01 <benblack> how does it impact how data is stored on disk?
10:01 <kagato> Is it better to use a dedicated cluster? Does riak hyper-connect or is it logarithmic like Chord and family?
10:01 <benblack> kagato: i don't believe anyone has ever constructed a ring that large.
10:02 <benblack> is the topology going to be stable?
10:02 <roidrage> i guess arg_ took that nap he was talking about ;)
10:03 <drev1> benblack: ring_creation_size determines how many partitions the data is broken into
10:03 <drev1> it's not permanent for the data
10:03 <drev1> but it's permanent for the cluster
10:03 <kagato> On the order of hours, yes.
10:03 <kagato> drev1: So, if I took down the entire cluster and changed the value in the config, and reloaded it.... what happens?
10:03 <kagato> Other than a storm of automatic rebalancing.
10:05 <drev1> Riak will keep the old value unless you delete the ring file
10:05 <kagato> The thing is that the latter is very doable if we ever need to change it. Migrating between clusters is not. We have the automation to make the former really easy.
10:05 <kagato> If I delete the ring file will it trash the data?
10:05 <kagato> Or is the ring file basically a cache of who-had-what at last run?
10:05 <drev1> the data will be on disk but the reference to the data will be lost
10:06 <kagato> drev1: Hmmmmm, that sounds bad.
10:06 <benblack> it is bad
10:07 <kagato> What's the impact if I start out with a massively larger creation size? Say, 65536 for a cluster of 128?
10:07 <kagato> I don't have a problem with the data being spread really thin, as we're doing realtime reporting on it and we'll be generating it at exactly the same number of nodes as we'd have in the ring. We expect to map-reduce the hell out of it.
10:08 <kagato> Also, what's the impact of "the topology not being stable" as mentioned above?
10:09 <benblack> the ring must stabilize between topology changes
10:09 <benblack> that is mostly determined by gossip interval
10:09 <kagato> Is it unavailable during that period?
10:09 <benblack> of course not
10:09 <drev1> each partition a node is responsible for will consume a certain amount of system resources, typically the limiting factor is allowed number of open files
10:10 <drev1> a larger ring_creation_size imposed on a small number of machines would mean each individual machine would be responsible for more partitions
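The arithmetic behind drev1's point, using the 65536-partition / 128-node figures from the question above (the open-files-per-partition multiplier is a hypothetical illustration, not a measured Riak number):

```python
# Each node owns roughly ring_creation_size / num_nodes partitions,
# since Riak spreads partitions evenly across the cluster.
ring_creation_size = 65536
num_nodes = 128

partitions_per_node = ring_creation_size // num_nodes
print(partitions_per_node)  # 512

# Hypothetical resource check: if each partition's storage backend
# held, say, 4 files open, one node would need ~2048 descriptors
# for partition storage alone -- compare against `ulimit -n`.
open_files_per_partition = 4
print(partitions_per_node * open_files_per_partition)  # 2048
```

This is the "512 partitions per node" split kagato asks about just below.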
10:10 <benblack> but if you have constant topology churn you probably need a short gossip interval and a very large n_val
10:10 <benblack> both of which work against your goals, i think.
10:10 <kagato> So, let's say the split was 512 partitions per node. What's realistic?
10:11 <benblack> that's only part of the issue in doing something like that, imo.
10:12 <drev1> 512 partitions per node is possible
10:12 <kagato> As it stands right now, riak wins hands-down on everything but this partition thing. I gather it's an artifact of choosing consistent hash intervals over dynamic intervals. Right now, I'm trying very hard not to end up with the mess that is Cassandra, so I really need some hard information.
10:12 <kagato> Is there a guide on this?
10:12 <kagato> Or am I going to write it?
10:12 <kagato> How does the gossip traffic grow? Is it per node? Per partition?
10:12 <drev1> http://wiki.basho.com/display/RIAK/An+Introduction+to+Riak
10:13 <drev1> gossip is per node
10:13 <kagato> Sorry for the interrogation, but, y'know, it's how I roll.
10:13 <benblack> per node and by gossip interval
10:14 <kagato> So, number of gossip packets is proportional to nodes. Size of gossip packets is proportional to number of partitions?
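A toy model of the scaling kagato is proposing here: message rate grows with node count over the gossip interval, and message size grows with partition count. The constants below are made up for illustration, not measured Riak behavior:

```python
def gossip_bytes_per_sec(nodes, partitions, gossip_interval_s,
                         bytes_per_partition_entry=32):
    # Each node gossips about once per interval, so cluster-wide
    # message rate scales with nodes / interval; each message
    # carries the ring state, whose size scales with the number
    # of partitions.
    messages_per_sec = nodes / gossip_interval_s
    message_size = partitions * bytes_per_partition_entry
    return messages_per_sec * message_size

# 128 nodes, 65536 partitions, 60s gossip interval
print(gossip_bytes_per_sec(128, 65536, 60))
```

Under this model, a shorter gossip interval (as benblack suggests for churny topologies) multiplies total gossip traffic, which is one way "a very large ring" and "constant topology churn" work against each other.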
10:15 <kagato> That document is pretty light on the details. It tells me what a partition is, and how it fits into replication, but it doesn't really tell me much about how they affect the lifecycle. :(
10:15 <kagato> Would you all suggest that having a riak cluster span, say, four datacenters is a bad thing?
10:16 <drev1> we don't have documentation with that level of detail as far as I know
10:16 <drev1> yes
10:16 <kagato> Again, not concerned about consistency or speed, only availability.
10:16 <benblack> depends on latency between them
10:16 <benblack> if latency is high, it's not a good thing
10:16 <kagato> Latency on order of 300ms.
10:16 <benblack> then no
10:16 <benblack> that's not a good thing
10:17 <kagato> Good latency is...?
10:17 <benblack> small number of milliseconds
10:17 <kagato> Microseconds?
10:17 <benblack> what you'd have with multiple datacenters in the same metro area
10:17 <kagato> Okay.
10:18 <kagato> Just to clarify, I'm searching for deeper competitive advantages versus Cassandra.
10:19 <benblack> cassandra would not be a good choice for that many nodes and with that much topology change
10:19 <benblack> at all
10:20 <kagato> I doubt it would be either.
10:20 <kagato> Right now I'm setting the requirements.
10:20 <kagato> If I set the requirements right, I don't have to worry about having to muck about with Cassandra. As such, I'm looking for useful things that Riak can do that Cassandra can't, so they can be requirements, ...
10:20 <benblack> what you are describing can be done, just realize you are the first to attempt it.
10:21 <benblack> kagato: node recovery distributes repair load across the rest of the nodes would be a useful requirement there.
10:21 <kagato> Exciting. Our PR people may be interested in some cross-promotional whitepaper then.
10:21 <kagato> benblack: Are nodes expendable? I.e. can one die forever?
10:21 <benblack> certainly, you just remove it from the ring
10:22 <kagato> Does that break replication guarantees (i.e. stuff that was stored thrice is now stored twice in some cases)?
10:22 <kagato> Or does riak intelligently re-replicate?
10:22 <benblack> it is until the partitions are redistributed and the n_val restored
10:22 <benblack> which happens automatically
10:23 <benblack> if you haven't read the dynamo paper, that'd be a good thing to do
10:23 <kagato> Gotcha. That actually brilliantly illustrates the point of partitions.
10:23 <kagato> I've read the Dynamo paper, but Q&A helps.
10:24 <kagato> My biggest problem is that I was a user of Dynomite, so I have to be extra careful to remember which thing does what. :(
10:24 <benblack> what, a user of dynomite? summon moonpolysoft
10:25 <kagato> benblack: I have visions of a giant dong on the clouds, in the style of the batsignal.
10:25 <benblack> man, that's a good idea
10:27 <kagato> Thanks guys. I'll check back later when my cluster 'splodes.
10:27 <moonpolysoft> hah
10:27 <* kagato> waves at moonpolysoft.
10:27 <moonpolysoft> sup dude
10:28 <kagato> moonpolysoft: Not much. Just workin'.
10:29 <kagato> Recently been working on Erlang-style actors in Python, since that's what I've got to work with at <client redacted>.
10:29 <moonpolysoft> that sounds horrible
10:30 <moonpolysoft> might as well be node.js for all of the concurrency you'll get out of python
10:30 <kagato> Well, having classes is nicer than modules/records for dealing with tons of loosely associated data and functions that deal with it. However, the lack of a decent syntax for pattern-matching will always make it painful.
10:31 <kagato> The plan is to make a bridge into Erlang, convince them to use it, then migrate stuff across.
10:31 <moonpolysoft> the wages of consulting
10:31 <kagato> After the semi-success of BERT, I figure it's time for PERT or something (cue shampoo jokes).
10:31 <moonpolysoft> well good luck with that
10:32 <kagato> Thanks.