Created October 20, 2010 19:13
15:38 <traceback0> Any information on someone using Riak at scale?
15:39 <seancribbs> traceback0: at what scale
15:39 <traceback0> seancribbs: 1000 writes a sec
15:39 <seancribbs> oh that's easy
15:39 <seancribbs> see the Mozilla blog post on the subject
15:39 <traceback0> ok how about IO issues when you add a new node
15:39 <traceback0> Any compacting going on under the covers?
15:39 <pharkmillups> traceback0: http://blog.mozilla.com/data/2010/08/16/benchmarking-riak-for-the-mozilla-test-pilot-project/
15:40 <traceback0> actually I am not worried about reads or writes, I am worried about how it scales via IO
15:40 <seancribbs> traceback0: compacting in what sense
15:40 <traceback0> Like Cassandra
15:40 <traceback0> We're thinking about using riak to replace cassandra
15:40 <traceback0> we most definitely need to KILL cassandra
15:40 <traceback0> Riak sounds like the best fit
15:41 <traceback0> seancribbs: We're basically creating a lookup table to map users to a shard_id
15:41 <seancribbs> traceback0: compaction is orthogonal to adding nodes… not really sure how it applies
15:41 <traceback0> seancribbs: so lots of md5(key) -> shard_id (5)
15:41 <traceback0> seancribbs: I don't know much about Riak so my questions are naive
15:41 <seancribbs> traceback0: Riak doesn't let you control which nodes get replicas
15:42 <traceback0> seancribbs: with cassandra, they basically compact SSTables laid out on disk into 1 SSTable
15:42 <traceback0> seancribbs: that causes IO issues if you have cloud-based disks, which aren't the best at IO
15:42 <traceback0> seancribbs: subsequently, since the IO is bad, the cassandra node doesn't return reads because it's too busy
15:42 <seancribbs> traceback0: bitcask, our default backend, is much simpler than SSTables
15:42 <traceback0> seancribbs: I want to avoid a datastore that does compactions and pegs my IO
15:43 <traceback0> seancribbs: ok
15:43 <seancribbs> there's no sorting, but there are occasional merges that remove tombstones
15:43 <traceback0> seancribbs: what are the first bottlenecks people see when their working set no longer fits in riak?
15:43 <seancribbs> traceback0: Riak is generally IO bound
15:43 <seancribbs> and bitcask relies a lot on the OS filesystem cache
15:44 <seancribbs> i.e. it doesn't try to cache stuff itself
15:44 <traceback0> do you guys pass O_DIRECT?
15:44 <seancribbs> it's even lower level than that
15:44 <seancribbs> that's an inno thing iirc
15:45 <traceback0> seancribbs: one issue with cassandra is that the files in the disk cache get blown away when the merge happens, because the pointer no longer references those files since they were merged and deleted.
15:45 <traceback0> Do you guys do anything like that
15:45 <seancribbs> traceback0: you'll have to ask justinsheehy or dizzyd about the details of the bitcask implementation
15:45 <traceback0> seancribbs: and when you say Riak is generally IO bound, it doesn't fit keys and values in memory?
15:46 <traceback0> oh I see
15:46 <traceback0> you mean in terms of a bottleneck
15:46 <traceback0> sorry
15:46 <traceback0> first thing to bottleneck is IO
15:46 <seancribbs> traceback0: it's IO bound in the sense that there's very little computation to do
15:46 <traceback0> what part of IO bottlenecks?
15:46 <seancribbs> network and disk
15:46 <traceback0> yeah so what disk-related issues?
15:47 <traceback0> paging in/out keys/values?
15:47 <seancribbs> not saying it's an issue, i'm saying that's the shape of the app
15:47 <traceback0> I guess I just want to know the first thing most users see a problem with for Riak
15:47 <traceback0> Is there an idea of working set data?
15:47 <seancribbs> traceback0: slow client libraries ;)
15:47 <traceback0> ha
15:47 <traceback0> what's the best python one ;)
15:47 <seancribbs> protobuffs all the way
15:48 <seancribbs> it doesn't have as many features as HTTP, but it's smoking fast in comparison
15:48 <traceback0> ah straight protobuf
15:49 <seancribbs> for some reason, the HTTP-based Java client is really slow
15:49 <seancribbs> but one of our devs is working on the pb version
15:49 <traceback0> providing you have a great client library
15:49 <traceback0> when does riak start to fall down?
15:49 <traceback0> when does riak start to fall down? | |
15:49 <seancribbs> traceback0: when you exhaust memory | |
15:50 <traceback0> do all the keys have to fit in memory? | |
15:50 <seancribbs> just the keys, not the values | |
15:50 <traceback0> ALL of the keys? | |
15:50 <traceback0> what happens when the keys don't fit in memory? | |
15:50 <seancribbs> traceback0: each key takes ~40bytes | |
15:51 <seancribbs> so it takes A LOT of keys to exhaust memory | |
15:51 <traceback0> seancribbs: how does each key take 40bytes | |
15:51 <traceback0> seancribbs: if my key is above 40 bytes what happens? | |
15:51 <seancribbs> sorry, 40 bytes + key length | |
15:51 <traceback0> =) | |
15:51 <traceback0> we have lots of keys :P | |
15:51 <seancribbs> try me | |
15:51 <seancribbs> how many? | |
15:52 <traceback0> so what happens when keys don't fit in memory? | |
15:52 <traceback0> 100s of millions | |
15:52 <seancribbs> pfft | |
15:52 <jdmaturen> bah | |
15:52 <jdmaturen> ;) | |
15:52 <seancribbs> 30 million keys over 5 nodes takes about 1.6GB per node | |
15:52 <seancribbs> and that's not all keys | |
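The sizing math in this exchange can be sketched as a quick back-of-the-envelope calculation. This is a rough model, not bitcask's actual accounting: the ~40-byte per-entry overhead comes from the chat, and the replication factor (`n_val=3`, Riak's default) is an assumption.

```python
def bitcask_ram_per_node(total_keys, avg_key_len, nodes, n_val=3, overhead=40):
    """Rough per-node keydir RAM estimate for bitcask.

    Each replica of a key costs roughly `overhead` bytes plus the key
    itself, and every key is stored n_val times across the cluster.
    """
    entries_per_node = total_keys * n_val / nodes
    return entries_per_node * (overhead + avg_key_len)

# e.g. 100 million 16-byte keys on a 5-node cluster:
gb = bitcask_ram_per_node(100_000_000, 16, 5) / 1e9
print(f"{gb:.1f} GB per node")  # ≈ 3.4 GB per node
```

Plugging in traceback0's "100s of millions" of keys shows why seancribbs shrugs: even at that scale, the keydir fits comfortably in a few GB of RAM per node.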
15:53 <traceback0> regardless that's today, tomorrow it will be billions
15:53 <traceback0> so what happens when they don't fit in memory!
15:53 <seancribbs> your node crashes
15:53 <traceback0> hahaha
15:53 <seancribbs> what else happens when you fill RAM?
15:53 <traceback0> that's why you were avoiding the question ;)
15:54 <traceback0> seancribbs: swapping?
15:54 <traceback0> IO goes to the moon?
15:54 <Tv> well, it'll start failing *if you access all the keys often*
15:54 <traceback0> shit doesn't crash though
15:54 <seancribbs> traceback0: becomes unavailable, same difference
15:54 <Tv> this is nosql, pile on more hardware
15:54 <traceback0> also why do all your keys have to fit in memory if only 30% of them are accessed within a week?
15:54 <seancribbs> if you exhaust your RAM, you're doing it wrong
15:55 <seancribbs> i.e. your problem is not Riak
15:55 <seancribbs> traceback0: it's a consequence of the way bitcask is built
15:55 <jnewland> it's capacity planning :)
15:55 <seancribbs> jnewland: thank you
15:55 <traceback0> seancribbs: things swap and show signs of slowness before they become unresponsive
15:55 <jnewland> traceback0: there's another datastore if you don't like this requirement, innostore
15:56 <traceback0> seancribbs: riak just crashes
15:56 <traceback0> jnewland: how does it differ in this regard?
15:56 <seancribbs> traceback0: actually that's not quite true. it has a separate process which tells it to shut down
15:56 <jnewland> traceback0: it doesn't require all keys in memory
15:56 <seancribbs> but latency is in general more erratic
15:56 <seancribbs> (for inno)
15:56 <jnewland> yah
15:56 <traceback0> seancribbs: thing is, my keys don't all get used very often, maybe 10% are the working set but the other 90% need to be around
15:57 <traceback0> I am losing out on memory here.
15:57 <seancribbs> traceback0: i think you're focusing on the wrong thing… that 10% is going to be fast to access
15:57 <seancribbs> no DATA is stored in memory
15:57 <seancribbs> just keys
15:57 <jnewland> seancribbs: if i understand correctly, bitcask allocates more memory as keys are added, so the thing to monitor here is memory usage, correct?
15:58 <seancribbs> yes
15:58 <jnewland> rad
15:58 <jnewland> traceback0: what size cassandra cluster are you looking to replace?
15:58 <jnewland> and how large is the dataset?
15:58 <traceback0> jnewland: cassandra is a piece of shit, there is no comparison
15:59 <jnewland> heyo
15:59 <jdmaturen> :/
15:59 <pharkmillups> ha
15:59 <traceback0> but uh 128G is?
15:59 <traceback0> ish*
16:00 <jnewland> how many nodes?
16:00 <traceback0> 4
16:00 <jdmaturen> what RF?
16:00 <traceback0> 2
16:00 <traceback0> I accounted for rf =)
16:00 <jdmaturen> so 64G per node?
16:01 <jdmaturen> what hardware?
16:01 <traceback0> crap hardware: rackspace cloud
16:01 <traceback0> crap for IO anyway
16:01 <jnewland> dumb question: have you considered adding more nodes to scale IO?
16:01 <seancribbs> traceback0: you will probably find this useful — https://spreadsheets.google.com/ccc?key=0Ak4OBkABJPsxdEowYXc2akxnYU9xNkJmbmZscnhaTFE&hl=en&authkey=CMHw8tYO#gid=0
16:02 <traceback0> jnewland: Would rather not argue about cassandra, we just got done with it after 8 months of agony =)
16:02 <jdmaturen> traceback0: i/o problems are not limited to cassandra
16:02 <jnewland> that would be my recommendation if you ran out of IO capacity / memory on riak as well
16:02 <seancribbs> IO on virtualized hardware generally sucks
16:03 <traceback0> again would really rather not argue about cassandra =)
16:03 <jdmaturen> this is about what happens in 8 months when you run away from riak ;)
16:03 <pharkmillups> jdmaturen: ha
16:04 <traceback0> unless you blow files in the OS cache by merging and compacting SSTables, shouldn't be a huge issue
16:04 <Tv> traceback0: i think a lot of people here feel that with that attitude, you'll be trolling #voldemort or something in about 8 months ;)
16:04 <traceback0> avoiding things built in Java =)
16:04 <Tv> every storage system causes IO
16:04 <traceback0> Tv: thanks, aware =)
16:04 <seancribbs> tautology!
16:04 <Tv> there's tradeoffs, but you can't avoid it
16:05 <Tv> my point is, if you avoid SSTable merging or something, you'll pay for it elsewhere
16:05 <jdmaturen> traceback0: this may be helpful: https://docs.google.com/viewer?url=http://downloads.basho.com/papers/bitcask-intro.pdf
16:05 <traceback0> Tv: the problem with cassandra is a lot deeper than you realize.
16:05 <jdmaturen> [as an aside]
16:06 <Tv> traceback0: i've been using cassandra since 0.3, i don't think i've missed that many bumps in the road so far..
16:06 <traceback0> anyway, so what happens when you add a riak node to move keys over if you're seeing memory usage build up?
16:06 <traceback0> Tv: how much data, how many nodes?
16:07 <traceback0> does riak create a tmp file, then ship it over?
16:07 <traceback0> tmp file with keys to ship
16:07 <seancribbs> traceback0: each original machine will get an approximately equal decrease in RAM and disk usage, and traffic
16:07 <seancribbs> traceback0: no, it async sends the data over the network
16:07 <seancribbs> but that's best to witness in person
16:08 <Tv> i think traceback0 is asking "is riak subject to the infamous mongodb meltdown that was just blogged about"
16:08 <seancribbs> no
16:08 <traceback0> Tv: actually I am not but what are you referring to?
16:08 <seancribbs> other replicas are STILL available, even when some are being transferred
16:08 <Tv> http://nosql.mypopescu.com/post/1251523059/mongodb-auto-sharding-and-foursquare-downtime etc
16:09 <traceback0> seancribbs: async over network and in batches, correct? not all at once?
16:09 <seancribbs> right, you can control the number of concurrent transfers
16:09 dizzyd joined
16:09 <seancribbs> default is 4
16:09 <jnewland> no matter the specific implementation
16:09 <jnewland> that causes IO
16:10 <jnewland> if you add a node when you've already exhausted IO capacity
16:10 <seancribbs> things will get worse for a time, heh
16:10 <Tv> jnewland: this is where the cassandra guys get smug and say for them, shifting data to another node is 4 reads + sendfile to start the transfer
16:10 <jnewland> it won't instantly help due to magical async fairy transfers
16:10 <jnewland> add nodes at an off-peak time, let data make its way around, profit
16:10 <jnewland> also: capacity planning
16:10 <jnewland> :)
16:10 <jdmaturen> +1
16:11 <Tv> capacity planning is nice and all but people screw up, and how well you tolerate screw ups is very important
16:11 <seancribbs> Tv: there's essentially no more overhead for riak's handoff than "stream the data off disk and across the network", but that's still two IO paths
16:11 <traceback0> Tv: exactly thank you =)
16:11 <traceback0> Tv: it's when you screw up that is the scary part--we're not perfect.
16:12 <traceback0> Do you guys recommend having two disks per riak node, one to handle writes, one to handle reads where the data sits
16:12 <Tv> also, don't know about you guys but the audiences my stuff has seen have always been very much flash mobs, not a steady growth
16:12 <traceback0> Tv: btw I don't think you understand the reason for the foursquare downtime =)
16:12 <traceback0> Tv: had nothing to do with data transfer
16:13 <Tv> traceback0: oh yeah it was that shifting data off the machine didn't really help the RAM load
16:13 <Tv> traceback0: err not ram, paging
16:13 <traceback0> you're not understanding why though still
16:13 <traceback0> Tv: but you can read my explanation: http://www.quora.com/What-caused-Foursquares-downtime-on-October-4-2010
16:13 <dizzyd> traceback0: if you're using inno, yes you want a dedicated disk for logging and data storage
16:13 <Tv> traceback0: well thank you for reading my mind -- now where did i put my dental insurance membership card?
16:14 <dizzyd> traceback0: yes, bitcask requires all keys to be in memory, but can provide some pretty decent latency as a tradeoff for a uniformly random access pattern
16:14 <jdmaturen> traceback0: you weren't familiar with the 4sq meltdown but you have your own dissection of it? :)
16:14 <traceback0> jdmaturen: I thought Tv was referring to a different meltdown
16:14 <dizzyd> if you have lots of keys but a smallish hot set, inno will be decent when tuned
16:14 <traceback0> jdmaturen: mostly because he thought I was asking a question that had nothing to do with it
16:15 <traceback0> brb
16:15 <Tv> "Solution might be to move to MySQL in the event mongo does not get fixed."
16:16 <seancribbs> traceback0: so the main concern i would have for you coming over from Cassandra would not be an operational one. it would be dealing with differences in the data model
16:16 <Tv> i think that's the nazi mention for nosql conversations somebody was looking for
16:16 <jnewland> monty's law?
16:17 <Tv> if monty is a nosql NIH rewrite of godwin, then yes
16:18 <jdmaturen> traceback0: are you @ mixpanel?
16:18 <seancribbs> traceback0: fwiw, if you're on leased/virtualized hardware you might consider using Joyent. they have preconfigured Riak machines
16:19 <seancribbs> and it's only sorta-virtualized, being Solaris zones, so the IO penalty is less than Xen or KVM.
16:20 <traceback0> jdmaturen: yep
17:47 <traceback0> back
17:47 <traceback0> jdmaturen: yes I am at mixpanel
17:47 <traceback0> seancribbs: i'll deploy riak basically and see how it does, try adding a new node, and see what happens
17:47 <traceback0> but i just wanted an honest explanation of places where riak falls down
17:48 <traceback0> so i don't have to come crying in the channel =)
17:48 <seancribbs> traceback0: keep in mind it's not going to mean much below 3 nodes
17:48 <traceback0> i'll probably use inno to avoid having to keep all my keys in memory
17:48 <traceback0> seancribbs: I'll do a 2 node test, wait for the nodes to accumulate data, and then add node 3 and see what happens
17:49 <seancribbs> i'm telling you, 2 nodes is a degenerate situation when N=3
17:49 <traceback0> but i want to know what happens when we take our eye off the ball and add a new node when two other nodes are slammed.
17:49 <traceback0> seancribbs: why?
17:49 <traceback0> i'll also probably limit memory to 2G to see where it bottlenecks fastest
17:50 <traceback0> i mean unless you can give me a more honest explanation of where riak falls then I'll reconsider
17:50 <traceback0> we have to test regardless but you told me the worst part about riak is slow client libs
17:50 <traceback0> which is ridiculous imo
17:50 <traceback0> every data store has its issues
17:51 <seancribbs> traceback0: that's true, and we have our warts
17:51 <traceback0> yeah just asking what they are ;)
17:51 <traceback0> =)
17:51 <traceback0> it's not going to deter me, it's going to make me want to use you more because you explain them
17:51 <traceback0> at least I know "hey these guys aren't bullshitting me"
17:51 <seancribbs> so, here's a few things
17:52 <seancribbs> large clusters sometimes have difficulty converging on a single ring state
17:52 <traceback0> i am going to go test and not have a clue where riak falls down w/o slamming it
17:53 <seancribbs> certain combinations of cluster size and ring size result in degenerate preflists
17:53 <seancribbs> which require manual intervention
17:54 <seancribbs> also, feel free to look at issues.basho.com for our other warts
17:54 <traceback0> what does converging on a single ring state mean?
17:54 <seancribbs> meaning, as you add nodes, they don't agree on who owns what partitions
17:55 <seancribbs> but it can be remedied
17:56 <seancribbs> drev1 has more experience with that issue
17:57 <seancribbs> here's another
17:57 <seancribbs> not really a wart, but a limitation
17:58 <seancribbs> you need to pick your ring size large enough at the beginning, you can't change it later
17:58 <seancribbs> i.e. if you choose 64 partitions, you'll reach a practical cluster size limit at around 8 nodes
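The ring-size limitation described here can be made concrete with a quick calculation. The threshold of roughly 8 partitions per node is an assumption inferred from the 64-partitions-to-8-nodes figure in the chat, not a hard rule:

```python
def max_practical_nodes(ring_size, min_partitions_per_node=8):
    """Rough upper bound on cluster size for a fixed ring size.

    Riak's ring size (number of partitions) is fixed at cluster
    creation; once each node owns fewer than about
    min_partitions_per_node partitions, data distribution gets lumpy.
    """
    return ring_size // min_partitions_per_node

print(max_practical_nodes(64))   # 8 nodes, matching the figure in the chat
print(max_practical_nodes(256))  # 32 nodes
```

The practical upshot: picking a larger ring size up front costs a little keydir overhead per node but leaves headroom to grow the cluster later.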
17:59 <seancribbs> anyway, those are some. if you have more questions, please email the list or ask here
18:00 <jdmaturen> traceback0: are you planning on using it as straight up k/v?
18:00 <jdmaturen> otherwise you'll want to brush up on the limitations / usage of m/r and link walking
18:01 <jdmaturen> things like "how do I know what keys to m/r over?" etc
18:01 <traceback0> just k-v
18:02 <traceback0> distributed k-v store
18:02 <traceback0> we're mapping keys to shard_ids
18:02 <traceback0> so if i get a user_id I map them to a shard_id
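The use case described above, a lookup table from hashed user ids to shard ids, can be sketched roughly as follows. This is a toy illustration: the `store` dict stands in for any k/v client (such as a Riak client), and all names are hypothetical.

```python
import hashlib

store = {}  # stand-in for a distributed k/v store like Riak

def shard_key(user_id: str) -> str:
    # md5(key) -> shard_id, as described in the chat
    return hashlib.md5(user_id.encode()).hexdigest()

def assign_shard(user_id: str, num_shards: int) -> int:
    """Look up a user's shard, pinning them to one on first sight."""
    key = shard_key(user_id)
    shard_id = store.get(key)
    if shard_id is None:
        shard_id = int(key, 16) % num_shards
        store[key] = shard_id
    return shard_id

# a user always resolves to the same shard once assigned
assert assign_shard("user42", 5) == assign_shard("user42", 5)
```

Storing the assignment rather than recomputing it is what makes this a lookup table: shards can later be rebalanced without the hash function dictating placement forever.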
18:02 <jdmaturen> huh, cool
18:07 <traceback0> ?
18:08 <jdmaturen> nice straightforward use case
18:08 <jdmaturen> no timeseries data or 3D spatial mapping, etc
18:08 <traceback0> nope
18:09 <traceback0> i mean it's more than user_id to shard_id
18:09 <traceback0> but imagine user_ids don't always login
18:10 <jdmaturen> how do you handle concurrent updates? "last" write wins or client-side vector clock resolution?
18:11 <traceback0> i don't want old users taking up old memory space
18:11 <traceback0> mongo mmaps files
18:11 <traceback0> so it'll swap them in and out
18:11 <traceback0> and if your working set goes above memory then you need a new node
18:11 <seancribbs> traceback0: the problem is not your datastore then
18:11 <traceback0> jdmaturen: doesn't riak handle concurrent updates?
18:12 <seancribbs> but, as dizzyd said earlier, maybe inno will work well for you since you can cap its memory usage
18:14 <jdmaturen> traceback0: no, if two writes come in with the same parent either you'll have two items stored or [the default] riak will determine which one won
18:25 <jdmaturen> see last Q on http://blog.basho.com/2010/07/09/webinar-recap---schema-design-for-riak/ for a more elaborate explanation, traceback0
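The concurrent-update behavior jdmaturen describes can be sketched with a toy model. This is not the Riak client API; only the `allow_mult` name follows Riak's bucket property of the same name, and everything else is illustrative.

```python
def put_concurrent(existing_siblings, new_value, allow_mult):
    """Toy model of two writes arriving with the same parent vector clock.

    With allow_mult=True, both values are kept as "siblings" for the
    client to resolve on the next read; with allow_mult=False, the
    store picks a single winner (roughly last-write-wins).
    """
    if allow_mult:
        return existing_siblings + [new_value]  # keep both, client resolves
    return [new_value]  # store decides a winner

assert put_concurrent(["v1"], "v2", allow_mult=True) == ["v1", "v2"]
assert put_concurrent(["v1"], "v2", allow_mult=False) == ["v2"]
```

For the shard-lookup use case in this chat, sibling resolution is easy: any deterministic rule (e.g. lowest shard_id wins) keeps all clients converging on the same assignment.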