@PharkMillups
Created October 20, 2010 19:13
15:38 <traceback0> Any information on someone using Riak at scale?
15:39 <seancribbs> traceback0: at what scale
15:39 <traceback0> seancribbs: 1000 writes a sec
15:39 <seancribbs> oh that's easy
15:39 <seancribbs> see the Mozilla blog post on the subject
15:39 <traceback0> ok how about IO issues when you add a new node
15:39 <traceback0> Any compacting going on under the covers?
15:39 <pharkmillups> traceback0: http://blog.mozilla.com/data/2010/08/16/benchmarking-riak-for-the-mozilla-test-pilot-project/
15:40 <traceback0> actually I am not worried about reads or writes,
I am worried about how it scales via IO
15:40 <seancribbs> traceback0: compacting in what sense
15:40 <traceback0> Like Cassandra
15:40 <traceback0> We're thinking about using riak to replace cassandra
15:40 <traceback0> we most definitely need to KILL cassandra
15:40 <traceback0> Riak sounds like the best fit
15:41 <traceback0> seancribbs: We're basically creating a look up table to
map users to a shard_id
15:41 <seancribbs> traceback0: compaction is orthogonal to adding nodes… not
really sure how it applies
15:41 <traceback0> seancribbs: so lots of md5(key) -> shard_id (5)
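(Editor's note: the lookup pattern traceback0 describes maps a hashed user id to a shard id. Below is a minimal sketch using the 2010-era riak-python-client, assuming a local node on the default HTTP port; the bucket name and helper functions are illustrative, not anything from the conversation.)

```python
import hashlib

import riak  # legacy riak-python-client (2010-era API)

client = riak.RiakClient(host='127.0.0.1', port=8098)
bucket = client.bucket('user_shards')  # illustrative bucket name

def set_shard(user_id, shard_id):
    """Store md5(user_id) -> shard_id."""
    key = hashlib.md5(str(user_id).encode()).hexdigest()
    bucket.new(key, data=shard_id).store()

def get_shard(user_id):
    """Look up the shard for a user; None if unknown."""
    key = hashlib.md5(str(user_id).encode()).hexdigest()
    obj = bucket.get(key)
    return obj.get_data() if obj.exists() else None
```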
15:41 <traceback0> seancribbs: I don't know much about Riak so my questions are naive
15:41 <seancribbs> traceback0: Riak doesn't let you control which nodes get replicas
15:42 <traceback0> seancribbs: with cassandra, they basically compact SSTables laid
out on disk into one SSTable
15:42 <traceback0> seancribbs: that causes IO issues if you have cloud-based disks
which aren't the best IO
15:42 <traceback0> seancribbs: subsequently since the IO is bad, the cassandra node
doesn't return reads because it's too busy
15:42 <seancribbs> traceback0: bitcask, our default backend, is much simpler than SSTables
15:42 <traceback0> seancribbs: I want to avoid a datastore that does compactions and pegs my IO
15:43 <traceback0> seancribbs: ok
15:43 <seancribbs> there's no sorting, but there are occasional merges that remove tombstones
15:43 <traceback0> seancribbs: what are the first bottlenecks people see when their
working set no longer fits in riak?
15:43 <seancribbs> traceback0: Riak is generally IO bound
15:43 <seancribbs> and bitcask relies a lot on the OS filesystem cache
15:44 <seancribbs> i.e. it doesn't try to cache stuff itself
15:44 <traceback0> do you guys pass O_DIRECT?
15:44 <seancribbs> it's even lower level than that
15:44 <seancribbs> that's an inno thing iirc
15:45 <traceback0> seancribbs: one issue with cassandra is that the files that are
in the disk cache get blown when the merge happens because the pointer no longer references
those files since they were merged and deleted.
15:45 <traceback0> Do you guys do anything like that
15:45 <seancribbs> traceback0: you'll have to ask justinsheehy or dizzyd about the details
of bitcask implementation
15:45 <traceback0> seancribbs: and when you say Riak is generally IO bound, it doesn't fit
keys and values in memory?
15:46 <traceback0> oh I see
15:46 <traceback0> you mean in terms of a bottleneck
15:46 <traceback0> sorry
15:46 <traceback0> first thing to bottle is IO
15:46 <seancribbs> traceback0: it's IO bound in the sense that there's very
little computation to do
15:46 <traceback0> what part of IO bottles?
15:46 <seancribbs> network and disk
15:46 <traceback0> yeah so what disk related issues?
15:47 <traceback0> paging in/out keys/values?
15:47 <seancribbs> not saying it's an issue, i'm saying that's the shape of the app
15:47 <traceback0> I guess I just want to know the first thing most users see a
problem with for Riak
15:47 <traceback0> Is there an idea of working set data?
15:47 <seancribbs> traceback0: slow client libraries ;)
15:47 <traceback0> ha
15:47 <traceback0> what's the best python one ;)
15:47 <seancribbs> protobuffs all the way
15:48 <seancribbs> it doesn't have as many features as HTTP, but it's smoking fast
in comparison
15:48 <traceback0> ah straight protobuf
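(Editor's note: a sketch of switching the legacy Python client to the protocol buffers transport seancribbs recommends. 8087 is the conventional PBC port, and the transport_class argument is from the old client API as best I recall it, so verify against your client version.)

```python
import riak

# Protocol buffers transport instead of HTTP; assumes a local node
# listening on the conventional PBC port 8087.
client = riak.RiakClient(host='127.0.0.1', port=8087,
                         transport_class=riak.RiakPbcTransport)
```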
15:49 <seancribbs> for some reason, the HTTP-based Java client is really slow
15:49 <seancribbs> but one of our devs is working on the pb version
15:49 <traceback0> providing you have a great client library
15:49 <traceback0> when does riak start to fall down?
15:49 <seancribbs> traceback0: when you exhaust memory
15:50 <traceback0> do all the keys have to fit in memory?
15:50 <seancribbs> just the keys, not the values
15:50 <traceback0> ALL of the keys?
15:50 <traceback0> what happens when the keys don't fit in memory?
15:50 <seancribbs> traceback0: each key takes ~40 bytes
15:51 <seancribbs> so it takes A LOT of keys to exhaust memory
15:51 <traceback0> seancribbs: how does each key take 40 bytes?
15:51 <traceback0> seancribbs: if my key is above 40 bytes what happens?
15:51 <seancribbs> sorry, 40 bytes + key length
15:51 <traceback0> =)
15:51 <traceback0> we have lots of keys :P
15:51 <seancribbs> try me
15:51 <seancribbs> how many?
15:52 <traceback0> so what happens when keys don't fit in memory?
15:52 <traceback0> 100s of millions
15:52 <seancribbs> pfft
15:52 <jdmaturen> bah
15:52 <jdmaturen> ;)
15:52 <seancribbs> 30 million keys over 5 nodes takes about 1.6GB per node
15:52 <seancribbs> and that's not all keys
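(Editor's note: a back-of-envelope version of that estimate. Bitcask keeps every key in RAM with roughly 40 bytes of keydir overhead apiece, and each key is replicated n_val times across the cluster. The function below is illustrative; the ~1.2 GB it computes for 30 million 32-byte keys is in the same ballpark as the 1.6 GB figure quoted above, which presumably includes bucket names and other per-entry overhead.)

```python
def keydir_gb_per_node(num_keys, avg_key_len, n_val=3, nodes=5, overhead=40):
    # Every replica of every key lives in some node's bitcask keydir,
    # costing roughly (overhead + key length) bytes of RAM.
    total_bytes = num_keys * n_val * (overhead + avg_key_len)
    return total_bytes / nodes / 1024 ** 3

# 30 million keys, 32-byte keys (e.g. md5 hex digests), N=3, 5 nodes:
print(round(keydir_gb_per_node(30_000_000, 32), 2))  # ~1.21
```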
15:53 <traceback0> regardless that's today, tomorrow it will be billions
15:53 <traceback0> so what happens when they don't fit in memory!
15:53 <seancribbs> your node crashes
15:53 <traceback0> hahaha
15:53 <seancribbs> what else happens when you fill RAM?
15:53 <traceback0> that's why you were avoiding the question ;)
15:54 <traceback0> seancribbs: swapping?
15:54 <traceback0> IO goes to the moon?
15:54 <Tv> well, it'll start failing *if you access all the keys often*
15:54 <traceback0> shit doesn't crash though
15:54 <seancribbs> traceback0: becomes unavailable, same difference
15:54 <Tv> this is nosql, pile on more hardware
15:54 <traceback0> also why do all your keys have to fit in memory if only 30% of
them are accessed within a week?
15:54 <seancribbs> if you exhaust your RAM, you're doing it wrong
15:55 <seancribbs> i.e. your problem is not Riak
15:55 <seancribbs> traceback0: it's a consequence of the way bitcask is built
15:55 <jnewland> it's capacity planning :)
15:55 <seancribbs> jnewland: thank you
15:55 <traceback0> seancribbs: things swap and show signs of slowness before they become unresponsive
15:55 <jnewland> traceback0: there's another datastore if you don't like this requirement, innostore
15:56 <traceback0> seancribbs: riak just crashes
15:56 <traceback0> jnewland: how does it differ in this regard?
15:56 <seancribbs> traceback0: actually that's not quite true. it has a separate process which
tells it to shut down
15:56 <jnewland> traceback0: it doesn't require all keys in memory
15:56 <seancribbs> but latency is in general more erratic
15:56 <seancribbs> (for inno)
15:56 <jnewland> yah
15:56 <traceback0> seancribbs: thing is, my keys don't all get used very often, maybe 10%
are working set but the other 90% need to be around
15:57 <traceback0> I am losing out on memory here.
15:57 <seancribbs> traceback0: i think you're focusing on the wrong thing… that 10% is
going to be fast to access
15:57 <seancribbs> no DATA is stored in memory
15:57 <seancribbs> just keys
15:57 <jnewland> seancribbs: if i understand correctly, bitcask allocates more memory as
keys are added, so the thing to monitor here is memory usage, correct?
15:58 <seancribbs> yes
15:58 <jnewland> rad
15:58 <jnewland> traceback0: what size cassandra cluster are you looking to replace?
15:58 <jnewland> and how large is the dataset?
15:58 <traceback0> jnewland: cassandra is a piece of shit, there is no comparison
15:59 <jnewland> heyo
15:59 <jdmaturen> :/
15:59 <pharkmillups> ha
15:59 <traceback0> but uh 128G-ish
16:00 <jnewland> how many nodes?
16:00 <traceback0> 4
16:00 <jdmaturen> what RF?
16:00 <traceback0> 2
16:00 <traceback0> I accounted for rf =)
16:00 <jdmaturen> so 64G per node?
16:01 <jdmaturen> what hardware?
16:01 <traceback0> crap hardware: rackspace cloud
16:01 <traceback0> crap for IO anyway
16:01 <jnewland> dumb question: have you considered adding more nodes to scale IO?
16:01 <seancribbs> traceback0: you will probably find this useful
— https://spreadsheets.google.com/ccc?key=0Ak4OBkABJPsxdEowYXc2akxnYU9xNkJmbmZscnhaTFE&hl=en&authkey=CMHw8tYO#gid=0
16:02 <traceback0> jnewland: Would rather not argue about cassandra, we just got done with
it after 8 months of agony =)
16:02 <jdmaturen> traceback0: i/o problems are not limited to cassandra
16:02 <jnewland> that would be my recommendation if you ran out of io capacity / memory on riak as well
16:02 <seancribbs> IO on virtualized hardware generally sucks
16:03 <traceback0> again would really rather not argue about cassandra =)
16:03 <jdmaturen> this is about what happens in 8 months when you run away from riak ;)
16:03 <pharkmillups> jdmaturen: ha
16:04 <traceback0> unless you blow files in the OS cache by merging and compacting SSTables,
shouldn't be a huge issue
16:04 <Tv> traceback0: i think a lot of people here feel that with that attitude,
you'll be trolling #voldemort or something in about 8 months ;)
16:04 <traceback0> avoiding things built in Java =)
16:04 <Tv> every storage system causes IO
16:04 <traceback0> Tv: thanks, aware =)
16:04 <seancribbs> tautology!
16:04 <Tv> there's tradeoffs, but you can't avoid it
16:05 <Tv> my point is, if you avoid SSTable merging or something, you'll pay for it elsewhere
16:05 <jdmaturen> traceback0: this may be helpful: https://docs.google.com/viewer?url=http://downloads.basho.com/papers/bitcask-intro.pdf
16:05 <traceback0> Tv: the problem with cassandra is a lot deeper than you realize.
16:05 <jdmaturen> [as an aside]
16:06 <Tv> traceback0: i've been using cassandra since 0.3, i don't think i've
missed that many bumps in the road so far..
16:06 <traceback0> anyway, so what happens when you add a riak node to move keys over
if you're seeing memory usage build up?
16:06 <traceback0> Tv: how much data, how many nodes?
16:07 <traceback0> does riak create a tmp file, then ship it over?
16:07 <traceback0> tmp file with keys to ship
16:07 <seancribbs> traceback0: each original machine will get an approximately
equal decrease in RAM and disk usage, and traffic
16:07 <seancribbs> traceback0: no, it async sends the data over the network
16:07 <seancribbs> but that's best to witness in person
16:08 <Tv> i think traceback0 is asking "is riak subject to the infamous mongodb meltdown
that was just blogged about"
16:08 <seancribbs> no
16:08 <traceback0> Tv: actually I am not but what are you referring to?
16:08 <seancribbs> other replicas are STILL available, even when some are being transferred
16:08 <Tv> http://nosql.mypopescu.com/post/1251523059/mongodb-auto-sharding-and-foursquare-downtime etc
16:09 <traceback0> seancribbs: async over network and in batches, correct?
not all at once?
16:09 <seancribbs> right, you can control the number of concurrent transfers
16:09 dizzyd joined
16:09 <seancribbs> default is 4
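(Editor's note: the knob seancribbs is describing is, to the best of my recollection, riak_core's handoff_concurrency setting in app.config; treat the snippet as a hedged sketch and confirm the name against your Riak version's documentation.)

```erlang
%% app.config sketch -- setting name assumed from riak_core
{riak_core, [
    {handoff_concurrency, 4}   %% max simultaneous partition transfers
]}.
```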
16:09 <jnewland> no matter the specific implementation
16:09 <jnewland> that causes IO
16:10 <jnewland> if you add a node when you've already exhausted IO capacity
16:10 <seancribbs> things will get worse for a time, heh
16:10 <Tv> jnewland: this is where the cassandra guys get smug and say for them,
shifting data to another node is 4 reads + sendfile to start the transfer
16:10 <jnewland> it won't instantly help due to magical async fairy transfers
16:10 <jnewland> add nodes at an off-peak time, let data make its way around, profit
16:10 <jnewland> also: capacity planning
16:10 <jnewland> :)
16:10 <jdmaturen> +1
16:11 <Tv> capacity planning is nice and all but people screw up, and how well you
tolerate screw ups is very important
16:11 <seancribbs> Tv: there's essentially no more overhead for riak's handoff than
"stream the data off disk and across the network", but that's still two IO paths
16:11 <traceback0> Tv: exactly thank you =)
16:11 <traceback0> Tv: it's when you screw up that is the scary part--we're not perfect.
16:12 <traceback0> Do you guys recommend having two disks per riak node, one to handle
writes, one to handle reads where the data sits
16:12 <Tv> also, don't know about you guys but the audiences my stuff has seen have always
been very much flash mobs, not a steady growth
16:12 <traceback0> Tv: btw I don't think you understand the reason for foursquare downtime =)
16:12 <traceback0> Tv: had nothing to do with data transfer
16:13 <Tv> traceback0: oh yeah it was that shifting data off the machine didn't really
help the RAM load
16:13 <Tv> traceback0: err not ram, paging
16:13 <traceback0> you're not understanding why though still
16:13 <traceback0> Tv: but you can read my explanation:
http://www.quora.com/What-caused-Foursquares-downtime-on-October-4-2010
16:13 <dizzyd> traceback0: if you're using inno, yes you want a dedicated disk for logging and data storage
16:13 <Tv> traceback0: well thank you for reading my mind -- now where did i put my dental
insurance membership card?
16:14 <dizzyd> traceback0: yes, bitcask requires all keys to be in memory, but can provide
some pretty decent latency as a tradeoff for a uniformly random access pattern
16:14 <jdmaturen> traceback0: you weren't familiar with the 4sq meltdown but you have your own dissection of it? :)
16:14 <traceback0> jdmaturen: I thought Tv was referring to a different meltdown
16:14 <dizzyd> if you have lots of keys but a smallish hot set, inno will be decent when tuned
16:14 <traceback0> jdmaturen: mostly because he thought I was asking a question
that had nothing to do with it
16:15 <traceback0> brb
16:15 <Tv> "Solution might be to move to MySQL in the event mongo does not get fixed."
16:16 <seancribbs> traceback0: so the main concern i would have for you coming over from
Cassandra would not be an operational one. it would be dealing with differences in the data model
16:16 <Tv> i think that's the nazi mention for nosql conversations somebody was looking for
16:16 <jnewland> monty's law?
16:17 <Tv> if monty is a nosql NIH rewrite of godwin, then yes
16:18 <jdmaturen> traceback0: are you @ mixpanel?
16:18 <seancribbs> traceback0: fwiw, if you're on leased/virtualized hardware you might
consider using Joyent. they have preconfigured Riak machines
16:19 <seancribbs> and it's only sorta-virtualized, being Solaris zones, so the IO
penalty is less than Xen or KVM.
16:20 <traceback0> jdmaturen: yep
17:47 <traceback0> back
17:47 <traceback0> jdmaturen: yes I am at mixpanel
17:47 <traceback0> seancribbs: i'll deploy riak basically and see how it does, try adding a
new node, and see what happens
17:47 <traceback0> but i just wanted an honest explanation of places where riak falls down
17:48 <traceback0> so i don't have to come crying in the channel =)
17:48 <seancribbs> traceback0: keep in mind it's not going to mean much below 3 nodes
17:48 <traceback0> i'll probably use inno to avoid having to keep all my keys in memory
17:48 <traceback0> seancribbs: I'll do a 2 node test wait for the nodes to accumulate data
and then add node 3 and see what happens
17:49 <seancribbs> i'm telling you, 2 nodes is a degenerate situation when N=3
17:49 <traceback0> but i want to know what happens when we take our eye off the ball and
add a new node when two other nodes are slammed.
17:49 <traceback0> seancribbs: why?
17:49 <traceback0> i'll also probably limit memory to 2G to see where it bottles fastest
17:50 <traceback0> i mean unless you can give me a more honest explanation of where riak
falls down, then I'll reconsider
17:50 <traceback0> we have to test regardless but you told me the worst part about riak
is slow client libs
17:50 <traceback0> which is ridiculous imo
17:50 <traceback0> every data store has its issues
17:51 <seancribbs> traceback0: that's true, and we have our warts
17:51 <traceback0> yeah just asking what they are ;)
17:51 <traceback0> =)
17:51 <traceback0> it's not going to deter me, it's going to make me want to use you more
because you explain them
17:51 <traceback0> at least I know "hey these guys aren't bullshitting me"
17:51 <seancribbs> so, here's a few things
17:52 <seancribbs> large clusters sometimes have difficulty converging on a single ring state
17:52 <traceback0> i am going to go test and not have a clue where riak falls down w/o slamming it
17:53 <seancribbs> certain combinations of cluster size and ring size result in
degenerate preflists
17:53 <seancribbs> which require manual intervention
17:54 <seancribbs> also, feel free to look at issues.basho.com for our other warts
17:54 <traceback0> what does converging on a single ring state mean?
17:54 <seancribbs> meaning, as you add nodes, they don't agree on who owns what partitions
17:55 <seancribbs> but it can be remedied
17:56 <seancribbs> drev1 has more experience with that issue
17:57 <seancribbs> here's another
17:57 <seancribbs> not really a wart, but a limitation
17:58 <seancribbs> you need to pick your ring size large enough at the beginning, you can't
change it later
17:58 <seancribbs> i.e. if you choose 64 partitions, you'll reach a practical cluster size
limit at around 8 nodes
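(Editor's note: the 64-partitions-to-8-nodes figure follows from keeping roughly eight or more partitions per node; a quick illustrative calculation, with the per-node minimum as an assumed rule of thumb.)

```python
def max_practical_nodes(ring_size, min_partitions_per_node=8):
    # The ring is fixed at creation; each node owns about
    # ring_size / num_nodes partitions, so once that ratio drops
    # below ~8 the cluster stops spreading load well.
    return ring_size // min_partitions_per_node

print(max_practical_nodes(64))    # 8, matching the limit quoted above
print(max_practical_nodes(1024))  # 128: pick a big ring up front
```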
17:59 <seancribbs> anyway, those are some. if you have more questions, please email the
list or ask here
18:00 <jdmaturen> traceback0: are you planning on using it as straight up k/v?
18:00 <jdmaturen> otherwise you'll want to brush up on the limitations /
usage of m/r and link walking
18:01 <jdmaturen> things like "how do I know what keys to m/r over?" etc
18:01 <traceback0> just k-v
18:02 <traceback0> distributed k-v store
18:02 <traceback0> we're mapping keys to shard_ids
18:02 <traceback0> so if i get a user_id I map them to a shard_id
18:02 <jdmaturen> huh, cool
18:07 <traceback0> ?
18:08 <jdmaturen> nice straightforward use case
18:08 <jdmaturen> no timeseries data or 3D spatial mapping, etc
18:08 <traceback0> nope
18:09 <traceback0> i mean it's more than user_id to shard_id
18:09 <traceback0> but imagine user_ids don't always login
18:10 <jdmaturen> how do you handle concurrent updates? "last" write wins or client-side
vector clock resolution?
18:11 <traceback0> i don't want old users taking up old memory space
18:11 <traceback0> mongo mmaps files
18:11 <traceback0> so it'll swap them in and out
18:11 <traceback0> and if your working set goes above memory then you need a new node
18:11 <seancribbs> traceback0: the problem is not your datastore then
18:11 <traceback0> jdmaturen: doesn't riak handle concurrent updates?
18:12 <seancribbs> but, as dizzyd said earlier, maybe inno will work well for you since you
can cap its memory usage
18:14 <jdmaturen> traceback0: no, if two writes come in with the same parent either you'll
have two items stored or [the default] riak will determine which one won
18:25 <jdmaturen> see last Q on
http://blog.basho.com/2010/07/09/webinar-recap---schema-design-for-riak/ for a more elaborate explanation, traceback0
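(Editor's note: a sketch of the two options jdmaturen describes, using the legacy Python client: leave allow_mult off for last-write-wins, or turn it on and resolve siblings client-side. Method names are from the old client as I remember them, and resolve() is a hypothetical merge function you would supply.)

```python
import riak

client = riak.RiakClient(host='127.0.0.1', port=8098)
bucket = client.bucket('user_shards')
bucket.set_allow_multiples(True)  # keep siblings instead of last-write-wins

def resolve(values):
    # Hypothetical merge policy; for a user->shard map, any
    # deterministic choice (e.g. the lowest shard id) would do.
    return min(values)

obj = bucket.get('some-user-key')
if obj.has_siblings():
    values = [s.get_data() for s in obj.get_siblings()]
    obj.set_data(resolve(values))
    obj.store()  # storing the fetched object reuses its vector clock
```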