@PharkMillups
Created October 20, 2010 19:13
15:38 <traceback0> Any information on someone using Riak at scale?
15:39 <seancribbs> traceback0: at what scale
15:39 <traceback0> seancribbs: 1000 writes a sec
15:39 <seancribbs> oh that's easy
15:39 <seancribbs> see the Mozilla blog post on the subject
15:39 <traceback0> ok how about IO issues when you add a new node
15:39 <traceback0> Any compacting going on under the covers?
15:39 <pharkmillups> traceback0: http://blog.mozilla.com/data/2010/08/16/benchmarking-riak-for-the-mozilla-test-pilot-project/
15:40 <traceback0> actually I am not worried about reads or writes,
I am worried about how it scales via IO
15:40 <seancribbs> traceback0: compacting in what sense
15:40 <traceback0> Like Cassandra
15:40 <traceback0> We're thinking about using riak to replace cassandra
15:40 <traceback0> we most definitely need to KILL cassandra
15:40 <traceback0> Riak sounds like the best fit
15:41 <traceback0> seancribbs: We're basically creating a look up table to
map users to a shard_id
15:41 <seancribbs> traceback0: compaction is orthogonal to adding nodes… not
really sure how it applies
15:41 <traceback0> seancribbs: so lots of md5(key) -> shard_id (5)
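(Editor's note: the lookup pattern traceback0 describes maps a hashed user id to a shard id. Below is a minimal sketch using the 2010-era riak-python-client, assuming a local node on the default HTTP port; the bucket name and helper functions are illustrative, not anything from the conversation.)

```python
import hashlib

import riak  # legacy riak-python-client (2010-era API)

client = riak.RiakClient(host='127.0.0.1', port=8098)
bucket = client.bucket('user_shards')  # illustrative bucket name

def set_shard(user_id, shard_id):
    """Store md5(user_id) -> shard_id."""
    key = hashlib.md5(str(user_id).encode()).hexdigest()
    bucket.new(key, data=shard_id).store()

def get_shard(user_id):
    """Look up the shard for a user; None if unknown."""
    key = hashlib.md5(str(user_id).encode()).hexdigest()
    obj = bucket.get(key)
    return obj.get_data() if obj.exists() else None
```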
15:41 <traceback0> seancribbs: I don't know much about Riak so my questions are naive
15:41 <seancribbs> traceback0: Riak doesn't let you control which nodes get replicas
15:42 <traceback0> seancribbs: with cassandra, they basically compact SSTables laid
out on disk into one SSTable
15:42 <traceback0> seancribbs: that causes IO issues if you have cloud-based disks
which aren't the best IO
15:42 <traceback0> seancribbs: subsequently since the IO is bad, the cassandra node
doesn't return reads because it's too busy
15:42 <seancribbs> traceback0: bitcask, our default backend, is much simpler than SSTables
15:42 <traceback0> seancribbs: I want to avoid a datastore that does compactions and pegs my IO
15:43 <traceback0> seancribbs: ok
15:43 <seancribbs> there's no sorting, but there are occasional merges that remove tombstones
15:43 <traceback0> seancribbs: what are the first bottlenecks people see when their
working set no longer fits in riak?
15:43 <seancribbs> traceback0: Riak is generally IO bound
15:43 <seancribbs> and bitcask relies a lot on the OS filesystem cache
15:44 <seancribbs> i.e. it doesn't try to cache stuff itself
15:44 <traceback0> do you guys pass O_DIRECT?
15:44 <seancribbs> it's even lower level than that
15:44 <seancribbs> that's an inno thing iirc
15:45 <traceback0> seancribbs: one issue with cassandra is that the files that are
in the disk cache get blown when the merge happens because the pointer no longer references
those files since they were merged and deleted.
15:45 <traceback0> Do you guys do anything like that
15:45 <seancribbs> traceback0: you'll have to ask justinsheehy or dizzyd about the details
of bitcask implementation
15:45 <traceback0> seancribbs: and when you say Riak is generally IO bound, it doesn't fit
keys and values in memory?
15:46 <traceback0> oh I see
15:46 <traceback0> you mean in terms of a bottleneck
15:46 <traceback0> sorry
15:46 <traceback0> first thing to bottle is IO
15:46 <seancribbs> traceback0: it's IO bound in the sense that there's very
little computation to do
15:46 <traceback0> what part of IO bottles?
15:46 <seancribbs> network and disk
15:46 <traceback0> yeah so what disk related issues?
15:47 <traceback0> paging in/out keys/values?
15:47 <seancribbs> not saying it's an issue, i'm saying that's the shape of the app
15:47 <traceback0> I guess I just want to know the first thing most users see a
problem with for Riak
15:47 <traceback0> Is there an idea of working set data?
15:47 <seancribbs> traceback0: slow client libraries ;)
15:47 <traceback0> ha
15:47 <traceback0> what's the best python one ;)
15:47 <seancribbs> protobuffs all the way
15:48 <seancribbs> it doesn't have as many features as HTTP, but it's smoking fast
in comparison
15:48 <traceback0> ah straight protobuf
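(Editor's note: a sketch of switching the legacy Python client to the protocol buffers transport seancribbs recommends. 8087 is the conventional PBC port, and the transport_class argument is from the old client API as best I recall it, so verify against your client version.)

```python
import riak

# Protocol buffers transport instead of HTTP; assumes a local node
# listening on the conventional PBC port 8087.
client = riak.RiakClient(host='127.0.0.1', port=8087,
                         transport_class=riak.RiakPbcTransport)
```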
15:49 <seancribbs> for some reason, the HTTP-based Java client is really slow
15:49 <seancribbs> but one of our devs is working on the pb version
15:49 <traceback0> providing you have a great client library
15:49 <traceback0> when does riak start to fall down?
15:49 <seancribbs> traceback0: when you exhaust memory
15:50 <traceback0> do all the keys have to fit in memory?
15:50 <seancribbs> just the keys, not the values
15:50 <traceback0> ALL of the keys?
15:50 <traceback0> what happens when the keys don't fit in memory?
15:50 <seancribbs> traceback0: each key takes ~40 bytes
15:51 <seancribbs> so it takes A LOT of keys to exhaust memory
15:51 <traceback0> seancribbs: how does each key take 40 bytes?
15:51 <traceback0> seancribbs: if my key is above 40 bytes what happens?
15:51 <seancribbs> sorry, 40 bytes + key length
15:51 <traceback0> =)
15:51 <traceback0> we have lots of keys :P
15:51 <seancribbs> try me
15:51 <seancribbs> how many?
15:52 <traceback0> so what happens when keys don't fit in memory?
15:52 <traceback0> 100s of millions
15:52 <seancribbs> pfft
15:52 <jdmaturen> bah
15:52 <jdmaturen> ;)
15:52 <seancribbs> 30 million keys over 5 nodes takes about 1.6GB per node
15:52 <seancribbs> and that's not all keys
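(Editor's note: a back-of-envelope version of that estimate. Bitcask keeps every key in RAM with roughly 40 bytes of keydir overhead apiece, and each key is replicated n_val times across the cluster. The function below is illustrative; the ~1.2 GB it computes for 30 million 32-byte keys is in the same ballpark as the 1.6 GB figure quoted above, which presumably includes bucket names and other per-entry overhead.)

```python
def keydir_gb_per_node(num_keys, avg_key_len, n_val=3, nodes=5, overhead=40):
    # Every replica of every key lives in some node's bitcask keydir,
    # costing roughly (overhead + key length) bytes of RAM.
    total_bytes = num_keys * n_val * (overhead + avg_key_len)
    return total_bytes / nodes / 1024 ** 3

# 30 million keys, 32-byte keys (e.g. md5 hex digests), N=3, 5 nodes:
print(round(keydir_gb_per_node(30_000_000, 32), 2))  # ~1.21
```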
15:53 <traceback0> regardless that's today, tomorrow it will be billions
15:53 <traceback0> so what happens when they don't fit in memory!
15:53 <seancribbs> your node crashes
15:53 <traceback0> hahaha
15:53 <seancribbs> what else happens when you fill RAM?
15:53 <traceback0> that's why you were avoiding the question ;)
15:54 <traceback0> seancribbs: swapping?
15:54 <traceback0> IO goes to the moon?
15:54 <Tv> well, it'll start failing *if you access all the keys often*
15:54 <traceback0> shit doesn't crash though
15:54 <seancribbs> traceback0: becomes unavailable, same difference
15:54 <Tv> this is nosql, pile on more hardware
15:54 <traceback0> also why do all your keys have to fit in memory if only 30% of
them are accessed within a week?
15:54 <seancribbs> if you exhaust your RAM, you're doing it wrong
15:55 <seancribbs> i.e. your problem is not Riak
15:55 <seancribbs> traceback0: it's a consequence of the way bitcask is built
15:55 <jnewland> it's capacity planning :)
15:55 <seancribbs> jnewland: thank you
15:55 <traceback0> seancribbs: things swap and show signs of slowness before they become unresponsive
15:55 <jnewland> traceback0: there's another datastore if you don't like this requirement, innostore
15:56 <traceback0> seancribbs: riak just crashes
15:56 <traceback0> jnewland: how does it differ in this regard?
15:56 <seancribbs> traceback0: actually that's not quite true. it has a separate process which
tells it to shut down
15:56 <jnewland> traceback0: it doesn't require all keys in memory
15:56 <seancribbs> but latency is in general more erratic
15:56 <seancribbs> (for inno)
15:56 <jnewland> yah
15:56 <traceback0> seancribbs: thing is, my keys don't all get used very often, maybe 10%
are working set but the other 90% need to be around
15:57 <traceback0> I am losing out on memory here.
15:57 <seancribbs> traceback0: i think you're focusing on the wrong thing… that 10% is
going to be fast to access
15:57 <seancribbs> no DATA is stored in memory
15:57 <seancribbs> just keys
15:57 <jnewland> seancribbs: if i understand correctly, bitcask allocates more memory as
keys are added, so the thing to monitor here is memory usage, correct?
15:58 <seancribbs> yes
15:58 <jnewland> rad
15:58 <jnewland> traceback0: what size cassandra cluster are you looking to replace?
15:58 <jnewland> and how large is the dataset?
15:58 <traceback0> jnewland: cassandra is a piece of shit, there is no comparison
15:59 <jnewland> heyo
15:59 <jdmaturen> :/
15:59 <pharkmillups> ha
15:59 <traceback0> but uh 128G-ish
16:00 <jnewland> how many nodes?
16:00 <traceback0> 4
16:00 <jdmaturen> what RF?
16:00 <traceback0> 2
16:00 <traceback0> I accounted for rf =)
16:00 <jdmaturen> so 64G per node?
16:01 <jdmaturen> what hardware?
16:01 <traceback0> crap hardware: rackspace cloud
16:01 <traceback0> crap for IO anyway
16:01 <jnewland> dumb question: have you considered adding more nodes to scale IO?
16:01 <seancribbs> traceback0: you will probably find this useful
— https://spreadsheets.google.com/ccc?key=0Ak4OBkABJPsxdEowYXc2akxnYU9xNkJmbmZscnhaTFE&hl=en&authkey=CMHw8tYO#gid=0
16:02 <traceback0> jnewland: Would rather not argue about cassandra, we just got done with
it after 8 months of agony =)
16:02 <jdmaturen> traceback0: i/o problems are not limited to cassandra
16:02 <jnewland> that would be my recommendation if you ran out of io capacity / memory on riak as well
16:02 <seancribbs> IO on virtualized hardware generally sucks
16:03 <traceback0> again would really rather not argue about cassandra =)
16:03 <jdmaturen> this is about what happens in 8 months when you run away from riak ;)
16:03 <pharkmillups> jdmaturen: ha
16:04 <traceback0> unless you blow files in the OS cache by merging and compacting SSTables,
shouldn't be a huge issue
16:04 <Tv> traceback0: i think a lot of people here feel that with that attitude,
you'll be trolling #voldemort or something in about 8 months ;)
16:04 <traceback0> avoiding things built in Java =)
16:04 <Tv> every storage system causes IO
16:04 <traceback0> Tv: thanks, aware =)
16:04 <seancribbs> tautology!
16:04 <Tv> there's tradeoffs, but you can't avoid it
16:05 <Tv> my point is, if you avoid SSTable merging or something, you'll pay for it elsewhere
16:05 <jdmaturen> traceback0: this may be helpful: https://docs.google.com/viewer?url=http://downloads.basho.com/papers/bitcask-intro.pdf
16:05 <traceback0> Tv: the problem with cassandra is a lot deeper than you realize.
16:05 <jdmaturen> [as an aside]
16:06 <Tv> traceback0: i've been using cassandra since 0.3, i don't think i've
missed that many bumps in the road so far..
16:06 <traceback0> anyway, so what happens when you add a riak node to move keys over
if you're seeing memory usage build up?
16:06 <traceback0> Tv: how much data, how many nodes?
16:07 <traceback0> does riak create a tmp file, then ship it over?
16:07 <traceback0> tmp file with keys to ship
16:07 <seancribbs> traceback0: each original machine will get an approximately
equal decrease in RAM and disk usage, and traffic
16:07 <seancribbs> traceback0: no, it async sends the data over the network
16:07 <seancribbs> but that's best to witness in person
16:08 <Tv> i think traceback0 is asking "is riak subject to the infamous mongodb meltdown
that was just blogged about"
16:08 <seancribbs> no
16:08 <traceback0> Tv: actually I am not but what are you referring to?
16:08 <seancribbs> other replicas are STILL available, even when some are being transferred
16:08 <Tv> http://nosql.mypopescu.com/post/1251523059/mongodb-auto-sharding-and-foursquare-downtime etc
16:09 <traceback0> seancribbs: async over network and in batches, correct?
not all at once?
16:09 <seancribbs> right, you can control the number of concurrent transfers
16:09 dizzyd joined
16:09 <seancribbs> default is 4
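(Editor's note: the knob seancribbs is describing is, to the best of my recollection, riak_core's handoff_concurrency setting in app.config; treat the snippet as a hedged sketch and confirm the name against your Riak version's documentation.)

```erlang
%% app.config sketch -- setting name assumed from riak_core
{riak_core, [
    {handoff_concurrency, 4}   %% max simultaneous partition transfers
]}.
```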
16:09 <jnewland> no matter the specific implementation
16:09 <jnewland> that causes IO
16:10 <jnewland> if you add a node when you've already exhausted IO capacity
16:10 <seancribbs> things will get worse for a time, heh
16:10 <Tv> jnewland: this is where the cassandra guys get smug and say for them,
shifting data to another node is 4 reads + sendfile to start the transfer
16:10 <jnewland> it won't instantly help due to magical async fairy transfers
16:10 <jnewland> add nodes at an off-peak time, let data make its way around, profit
16:10 <jnewland> also: capacity planning
16:10 <jnewland> :)
16:10 <jdmaturen> +1
16:11 <Tv> capacity planning is nice and all but people screw up, and how well you
tolerate screw ups is very important
16:11 <seancribbs> Tv: there's essentially no more overhead for riak's handoff than
"stream the data off disk and across the network", but that's still two IO paths
16:11 <traceback0> Tv: exactly thank you =)
16:11 <traceback0> Tv: it's when you screw up that is the scary part--we're not perfect.
16:12 <traceback0> Do you guys recommend having two disks per riak node, one to handle
writes, one to handle reads where the data sits
16:12 <Tv> also, don't know about you guys but the audiences my stuff has seen have always
been very much flash mobs, not a steady growth
16:12 <traceback0> Tv: btw I don't think you understand the reason for foursquare downtime =)
16:12 <traceback0> Tv: had nothing to do with data transfer
16:13 <Tv> traceback0: oh yeah it was that shifting data off the machine didn't really
help the RAM load
16:13 <Tv> traceback0: err not ram, paging
16:13 <traceback0> you're not understanding why though still
16:13 <traceback0> Tv: but you can read my explanation:
http://www.quora.com/What-caused-Foursquares-downtime-on-October-4-2010
16:13 <dizzyd> traceback0: if you're using inno, yes you want a dedicated disk for logging and data storage
16:13 <Tv> traceback0: well thank you for reading my mind -- now where did i put my dental
insurance membership card?
16:14 <dizzyd> traceback0: yes, bitcask requires all keys to be in memory, but can provide
some pretty decent latency as a tradeoff for a uniformly random access pattern
16:14 <jdmaturen> traceback0: you weren't familiar with the 4sq meltdown but you have your own dissection of it? :)
16:14 <traceback0> jdmaturen: I thought Tv was referring to a different meltdown
16:14 <dizzyd> if you have lots of keys but a smallish hot set, inno will be decent when tuned
16:14 <traceback0> jdmaturen: mostly because he thought I was asking a question
that had nothing to do with it
16:15 <traceback0> brb
16:15 <Tv> "Solution might be to move to MySQL in the event mongo does not get fixed."
16:16 <seancribbs> traceback0: so the main concern i would have for you coming over from
Cassandra would not be an operational one. it would be dealing with differences in the data model
16:16 <Tv> i think that's the nazi mention for nosql conversations somebody was looking for
16:16 <jnewland> monty's law?
16:17 <Tv> if monty is a nosql NIH rewrite of godwin, then yes
16:18 <jdmaturen> traceback0: are you @ mixpanel?
16:18 <seancribbs> traceback0: fwiw, if you're on leased/virtualized hardware you might
consider using Joyent. they have preconfigured Riak machines
16:19 <seancribbs> and it's only sorta-virtualized, being Solaris zones, so the IO
penalty is less than Xen or KVM.
16:20 <traceback0> jdmaturen: yep
17:47 <traceback0> back
17:47 <traceback0> jdmaturen: yes I am at mixpanel
17:47 <traceback0> seancribbs: i'll deploy riak basically and see how it does, try adding a
new node, and see what happens
17:47 <traceback0> but i just wanted an honest explanation of places where riak falls down
17:48 <traceback0> so i don't have to come crying in the channel =)
17:48 <seancribbs> traceback0: keep in mind it's not going to mean much below 3 nodes
17:48 <traceback0> i'll probably use inno to avoid having to keep all my keys in memory
17:48 <traceback0> seancribbs: I'll do a 2 node test wait for the nodes to accumulate data
and then add node 3 and see what happens
17:49 <seancribbs> i'm telling you, 2 nodes is a degenerate situation when N=3
17:49 <traceback0> but i want to know what happens when we take our eye off the ball and
add a new node when two other nodes are slammed.
17:49 <traceback0> seancribbs: why?
17:49 <traceback0> i'll also probably limit memory to 2G to see where it bottles fastest
17:50 <traceback0> i mean unless you can give me a more honest explanation of where riak
falls down, then I'll reconsider
17:50 <traceback0> we have to test regardless but you told me the worst part about riak
is slow client libs
17:50 <traceback0> which is ridiculous imo
17:50 <traceback0> every data store has its issues
17:51 <seancribbs> traceback0: that's true, and we have our warts
17:51 <traceback0> yeah just asking what they are ;)
17:51 <traceback0> =)
17:51 <traceback0> it's not going to deter me, it's going to make me want to use you more
because you explain them
17:51 <traceback0> at least I know "hey these guys aren't bullshitting me"
17:51 <seancribbs> so, here's a few things
17:52 <seancribbs> large clusters sometimes have difficulty converging on a single ring state
17:52 <traceback0> i am going to go test and not have a clue where riak falls down w/o slamming it
17:53 <seancribbs> certain combinations of cluster size and ring size result in
degenerate preflists
17:53 <seancribbs> which require manual intervention
17:54 <seancribbs> also, feel free to look at issues.basho.com for our other warts
17:54 <traceback0> what does converging on a single ring state mean?
17:54 <seancribbs> meaning, as you add nodes, they don't agree on who owns what partitions
17:55 <seancribbs> but it can be remedied
17:56 <seancribbs> drev1 has more experience with that issue
17:57 <seancribbs> here's another
17:57 <seancribbs> not really a wart, but a limitation
17:58 <seancribbs> you need to pick your ring size large enough at the beginning, you can't
change it later
17:58 <seancribbs> i.e. if you choose 64 partitions, you'll reach a practical cluster size
limit at around 8 nodes
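(Editor's note: the 64-partitions-to-8-nodes figure follows from keeping roughly eight or more partitions per node; a quick illustrative calculation, with the per-node minimum as an assumed rule of thumb.)

```python
def max_practical_nodes(ring_size, min_partitions_per_node=8):
    # The ring is fixed at creation; each node owns about
    # ring_size / num_nodes partitions, so once that ratio drops
    # below ~8 the cluster stops spreading load well.
    return ring_size // min_partitions_per_node

print(max_practical_nodes(64))    # 8, matching the limit quoted above
print(max_practical_nodes(1024))  # 128: pick a big ring up front
```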
17:59 <seancribbs> anyway, those are some. if you have more questions, please email the
list or ask here
18:00 <jdmaturen> traceback0: are you planning on using it as straight up k/v?
18:00 <jdmaturen> otherwise you'll want to brush up on the limitations /
usage of m/r and link walking
18:01 <jdmaturen> things like "how do I know what keys to m/r over?" etc
18:01 <traceback0> just k-v
18:02 <traceback0> distributed k-v store
18:02 <traceback0> we're mapping keys to shard_ids
18:02 <traceback0> so if i get a user_id I map them to a shard_id
18:02 <jdmaturen> huh, cool
18:07 <traceback0> ?
18:08 <jdmaturen> nice straightforward use case
18:08 <jdmaturen> no timeseries data or 3D spatial mapping, etc
18:08 <traceback0> nope
18:09 <traceback0> i mean it's more than user_id to shard_id
18:09 <traceback0> but imagine user_ids don't always login
18:10 <jdmaturen> how do you handle concurrent updates? "last" write wins or client-side
vector clock resolution?
18:11 <traceback0> i don't want old users taking up old memory space
18:11 <traceback0> mongo mmaps files
18:11 <traceback0> so it'll swap them in and out
18:11 <traceback0> and if your working set goes above memory then you need a new node
18:11 <seancribbs> traceback0: the problem is not your datastore then
18:11 <traceback0> jdmaturen: doesn't riak handle concurrent updates?
18:12 <seancribbs> but, as dizzyd said earlier, maybe inno will work well for you since you
can cap its memory usage
18:14 <jdmaturen> traceback0: no, if two writes come in with the same parent either you'll
have two items stored or [the default] riak will determine which one won
18:25 <jdmaturen> see last Q on
http://blog.basho.com/2010/07/09/webinar-recap---schema-design-for-riak/ for a more elaborate explanation, traceback0
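(Editor's note: a sketch of the two options jdmaturen describes, using the legacy Python client: leave allow_mult off for last-write-wins, or turn it on and resolve siblings client-side. Method names are from the old client as I remember them, and resolve() is a hypothetical merge function you would supply.)

```python
import riak

client = riak.RiakClient(host='127.0.0.1', port=8098)
bucket = client.bucket('user_shards')
bucket.set_allow_multiples(True)  # keep siblings instead of last-write-wins

def resolve(values):
    # Hypothetical merge policy; for a user->shard map, any
    # deterministic choice (e.g. the lowest shard id) would do.
    return min(values)

obj = bucket.get('some-user-key')
if obj.has_siblings():
    values = [s.get_data() for s in obj.get_siblings()]
    obj.set_data(resolve(values))
    obj.store()  # storing the fetched object reuses its vector clock
```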