PharkMillups/gist:641257

## gistfile1.txt
14:46 <traceback0> for the inno backend for riak, are the writes random?

14:47 <traceback0> how do writes work in riak and how do they scale

14:47 <benblack> how they work depends on which backend you are using

14:47 <benblack> bitcask is append-only

14:47 <benblack> inno is not

14:47 <benblack> if you want highest write performance you should be using bitcask

14:48 <traceback0> talking strictly about inno

14:48 <benblack> writes scale as in all dynamo systems: approximately linearly on the
number of nodes

14:48 <traceback0> does that mean writes using inno are random?

14:48 <benblack> it means writes in inno are in place

14:49 <benblack> as opposed to append only

14:49 <traceback0> ok so more random

14:49 <benblack> whether that implies random rather depends on your access patterns

14:49 <traceback0> no updates, just inserts in this case

14:49 <benblack> then why would you use inno?

14:49 <traceback0> I don't know how Riak handles writes for inno

14:50 <benblack> riak doesn't handle writes for inno (or any other storage engine).
the storage engine does.

14:50 <traceback0> benblack: So that I don't have to monitor memory in the event I take my eye
off the ball with Bitcask

14:50 <benblack> what do you think happens with bitcask if you exceed memory?

14:51 <traceback0> benblack: and I can better more efficiently use memory due to
restricting it to working set

14:51 <traceback0> benblack: riak crashes

14:51 <benblack> does it

14:51 <benblack> that is an interesting theory

14:51 <traceback0> That's what I was told by contributors of Riak

14:51 <traceback0> the other day =)

14:51 <traceback0> all keys have to fit in memory

14:51 <benblack> the answer you get depends very much on the exact question you ask

14:51 <traceback0> else it crashes

14:52 <traceback0> but riak supposedly holds a shit ton of keys

14:52 <traceback0> 40 bytes + key length

14:52 <traceback0> but it'll crash

14:52 <traceback0> which is distressing =)

14:52 <traceback0> you get what constrain for though

14:52 <traceback0> what you*

14:52 <traceback0> not complaining just don't think that fits our needs
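
The "40 bytes + key length" figure quoted above can be turned into a rough capacity estimate. The sketch below is only back-of-the-envelope arithmetic: the 40-byte per-key overhead is the number mentioned in the chat and may differ between Bitcask versions, and the function names and the 24-byte average key length are assumptions made for illustration.

```python
GIB = 1024 ** 3


def keydir_bytes(num_keys, avg_key_len, per_key_overhead=40):
    """Approximate RAM needed to hold every key in memory.

    per_key_overhead=40 is the figure quoted in the chat, not a measured value.
    """
    return num_keys * (per_key_overhead + avg_key_len)


def max_keys(ram_budget_bytes, avg_key_len, per_key_overhead=40):
    """Approximate number of keys that fit in a given RAM budget."""
    return ram_budget_bytes // (per_key_overhead + avg_key_len)


if __name__ == "__main__":
    # Hypothetical workload: 100 million keys averaging 24 bytes each.
    print(keydir_bytes(100_000_000, 24) / GIB, "GiB for the key index")
    print(max_keys(8 * GIB, 24), "keys fit in 8 GiB")
```

An estimate like this only covers the in-memory key index; the values themselves stay on disk.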

14:53 <benblack> http://blog.mozilla.com/data/2010/08/16/benchmarking-riak-for-the-mozilla-test-pilot-project/

14:53 <benblack> have you seen that from august?

14:54 <benblack> traceback0: was your question about _physical_ memory?

14:54 <traceback0> strange it says it'll swap with bitcask on the blog

14:54 <traceback0> wonder why someone said it crashes

14:54 <benblack> like i said

14:54 <benblack> the answer you get depends on the exact question you ask

14:54 <benblack> if you asked "exceeds memory" the answer is crash

14:54 <bingeldac> not that we condone swapping

14:54 <bingeldac> ever.

14:55 <benblack> if you asked "exceeds physical memory" the answer is "swapping"

14:55 <traceback0> what is the difference in this case?

14:55 <benblack> between swapping and crashing?

14:56 <traceback0> exceeds memory and exceeds physical memory?

14:56 <benblack> virtual memory != physical memory

14:58 <traceback0> so exceed memory assumed I meant exceed virtual memory?

14:59 <benblack> right

14:59 <benblack> once you consume all available memory, life is hard

15:00 <benblack> and really, once you consume all physical memory, life is hard

15:00 <bingeldac> brutish and short

15:00 <traceback0> ok presumably most systems have an insane amount of virtual memory?

15:00 <benblack> you do not want to live in swap

15:00 <benblack> but if your capacity planning and monitoring are not up to the task,
it can buy you some time

15:01 <traceback0> why might someone pick inno over bitcask ever?

15:06 <skeptomai> benblack: crucially swaps to ssd

15:06 <skeptomai> (as last resort)

15:07 <crucially> i would say my macbook swaps to ssd all the time

15:07 <skeptomai> ah, good point

15:07 <crucially> but yeah, all our new machines have ssds for root/boot/swap

15:07 <skeptomai> Don't you also configure some boxes in production to do so?
(Am I remembering correctly?)

15:07 <crucially> of course, a lot of our machines have no swap configured too

15:08 <bingeldac> we debate that internally all the time

15:08 <bingeldac> to have or not have swap

15:09 <pharkmillups> traceback0: you can dig around riak.markmail.org for various
"inno vs. bitcask" discussions

15:09 <bingeldac> rasputnik: I don't think it is a bad idea if that is what you
have to work with

15:09 <pharkmillups> traceback0: this one isn't bad

15:09 <pharkmillups> http://gist.github.com/438065

15:09 <skeptomai> crashing sucks, but swap really hides the problem. the service may be so
impacted that it's not really functioning when it swaps and your monitoring might not reflect that

15:10 <benblack> traceback0: if you are constantly updating the same keys (for example, doing
some sort of counter-style thing), bitcask will consume a lot more space between merges

15:10 <benblack> inno updates in place, so take up less space

15:18 <seancribbs> the tradeoff for constrained memory usage is higher/more erratic latency

15:18 <seancribbs> (with inno)
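
The space trade-off described above (append-only writes growing until a merge, in-place updates staying flat) can be illustrated with a toy model. This is not how Bitcask or Inno are implemented; both classes are hypothetical stand-ins that only count stored records.

```python
class AppendOnlyStore:
    """Toy append-only store: every put adds a record; merge drops stale ones."""

    def __init__(self):
        self.log = []    # every record ever written, oldest first
        self.index = {}  # key -> position of the newest record in the log

    def put(self, key, value):
        self.index[key] = len(self.log)
        self.log.append((key, value))

    def get(self, key):
        return self.log[self.index[key]][1]

    def merge(self):
        # Keep only the newest record for each key, then rebuild the index.
        live = [(key, self.log[pos][1]) for key, pos in self.index.items()]
        self.log = live
        self.index = {key: i for i, (key, _) in enumerate(live)}


class InPlaceStore:
    """Toy in-place store: a put overwrites the existing record."""

    def __init__(self):
        self.records = {}

    def put(self, key, value):
        self.records[key] = value

    def get(self, key):
        return self.records[key]


if __name__ == "__main__":
    appended, in_place = AppendOnlyStore(), InPlaceStore()
    for count in range(1000):  # counter-style workload on a single key
        appended.put("hits", count)
        in_place.put("hits", count)
    print(len(appended.log), len(in_place.records))  # records held before merge
    appended.merge()
    print(len(appended.log))                         # records held after merge
```

With insert-only workloads the two models hold the same number of records, which is why the update pattern matters for the choice.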


22:11 <traceback0> seancribbs: ping

22:12 <seancribbs> pong, but not for long

22:21 <traceback0> seancribbs: oh oops

22:21 <traceback0> seancribbs: reading up on bitcask

22:22 <traceback0> How is this possible: http://cl.ly/880e7e97abcf7aca7796

22:22 <benblack> virtual memory

22:22 <benblack> _swap_

22:22 <seancribbs> datasets > RAM = data is not stored in RAM

22:22 <benblack> size of _keys_/index != size of dataset

22:23 <traceback0> so a 10ms disk seek to where the data is

22:23 <traceback0> since the key is inherently a pointer to the physical data

22:23 <benblack> not quite

22:23 <seancribbs> worst-case. if it HAS to go to disk

22:23 <benblack> if working set fits in OS buffer cache, there is no disk activity

22:23 <seancribbs> ^^

22:23 <traceback0> how big is a OS buffer cache typically?

22:24 <traceback0> 1G? 100MB?

22:24 <benblack> how much RAM is in the box?

22:24 <seancribbs> Total RAM - RSS of other programs

22:24 <seancribbs> (if you fill it)

22:24 <seancribbs> s/other/running/

22:24 <traceback0> 8G

22:25 <traceback0> so i have 20G of data

22:25 <traceback0> 8G of physical memory

22:25 <traceback0> riak has been running for a while so 8G is full

22:25 <benblack> do you understand the difference between index size, working set size,
and total dataset size?

22:25 <traceback0> yes

22:26 <benblack> great

22:26 <benblack> all questions answered!

22:26 <traceback0> alright so once working set exceeds, swap happens because it's
seeking for data on disk

22:26 <traceback0> each seek is ~10ms so with enough swap you die

22:26 <benblack> sorry, don't understand a thing you just said

22:27 <benblack> but i don't think so

22:27 <traceback0> just saying once your working set exceeds physical memory

22:27 <traceback0> you swap

22:27 <traceback0> since each key is a single seek

22:27 <benblack> no, that's just disk activity

22:27 <traceback0> we're talking each look up is a ~10ms look up

22:27 <benblack> means OS buffer caches are being invalidated and new stuff pulled in

22:27 <traceback0> what is just disk activity?

22:28 <traceback0> guess I don't understand how OS buffer caches work

22:28 <benblack> the riak process memory taking up more space can cause swap (as is normal
for any set of processes that exceeds available RAM)

22:29 <benblack> in some operating systems, buffer cache and VM are merged to avoid
redundant/conflicting activity. they are still independent from the perspective of the processes, though

22:29 <benblack> and linux is not such an OS

22:29 <benblack> http://www.faqs.org/docs/linux_admin/buffer-cache.html

22:30 <benblack> http://tldp.org/LDP/tlk/fs/filesystem.html
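
The distinction drawn above (reads that miss the buffer cache are ordinary disk activity, while swap comes from process memory outgrowing RAM) can be written down as a coarse model. It uses the "Total RAM - RSS of running programs" rule of thumb from the chat; the function names and the example sizes other than 8G of RAM and 20G of data are assumptions, and a real kernel behaves far less crisply.

```python
GIB = 1024 ** 3


def cache_budget(total_ram, process_rss):
    """RAM left for the OS buffer cache: total RAM minus RSS of running programs."""
    return max(total_ram - process_rss, 0)


def classify(total_ram, process_rss, working_set):
    """Very coarse guess at what a read-heavy box is doing."""
    if process_rss > total_ram:
        return "swap"          # process memory itself no longer fits in RAM
    if working_set <= cache_budget(total_ram, process_rss):
        return "cached reads"  # working set fits in the buffer cache
    return "disk reads"        # cache entries get evicted and re-read from disk


if __name__ == "__main__":
    # 8G of RAM as in the chat; the 2G process RSS is a made-up figure.
    print(classify(8 * GIB, 2 * GIB, 4 * GIB))
    print(classify(8 * GIB, 2 * GIB, 20 * GIB))
    print(classify(8 * GIB, 9 * GIB, 4 * GIB))
```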

22:54 <traceback0> benblack: that buffer cache page was very helpful thanks

22:54 <traceback0> i learned a lot

22:55 <traceback0> so basically there will be no buffer cache left once all the keys
exhaust physical memory

22:55 <benblack> reading is fundamental

22:55 <traceback0> and swap will occur

22:55 <benblack> approximately so

22:57 <traceback0> benblack: any information related to how files get mapped to the buffer cache?

22:57 <traceback0> i.e. i read data file A and all of A gets thrown into the buffer cache?

22:57 <benblack> files aren't mapped to buffer cache

22:57 <benblack> you might be thinking of mmap()

22:58 <traceback0> well when you read file A

22:58 <traceback0> its data is in the buffer cache right?

22:58 <benblack> the parts you read (and whatever else got read because of read ahead behavior)

22:58 <traceback0> sure

22:58 <benblack> reading a little bit does not cause the whole thing to be pulled into cache

22:59 <traceback0> so when you read file A again in the same spot how does the OS know its
in memory (cached)?

22:59 <traceback0> that specific region of data in that file is in the cache

22:59 <benblack> because all your requests are going through the same VFS system

22:59 <traceback0> is it based on some position on the disk?

23:00 <traceback0> oh maybe i should read http://tldp.org/LDP/tlk/fs/filesystem.html i guess?

23:00 <benblack> where disk may be a logical volume

23:00 <benblack> starting to understand why things like O_SYNC, O_DIRECT are important for
merge-based storage?

23:01 <benblack> (among other things)

23:03 <traceback0> will read up on those

23:03 <traceback0> I honestly have no idea what O_DIRECT does

23:03 <* traceback0> googles

23:04 <benblack> tells the VFS layer to bypass buffer cache for those operations

23:06 <traceback0> is this for writes?

23:06 <traceback0> as in don't write through this data basically?

23:06 <benblack> http://amailbox.org/node/7563

23:07 <benblack> (much of the advise calls linus says should be used instead aren't
even implemented, so...)

23:07 <benblack> speaking of madvise(), it's crucially

23:13 <crucially> hi

23:13 <traceback0> benblack: was i right about O_DIRECT bypasses buffer cache for writes so they don't
 write through and consume your buffer cache?

23:14 <benblack> hi

23:14 <benblack> traceback0: right (except the concern is less consuming buffer cache than invalidating
entries from it)

23:14 <traceback0> benblack: ah ok yeah
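
A minimal sketch of an O_DIRECT write, for readers who want to see what the flag looks like in practice. It assumes Linux, a 4096-byte alignment (real devices may require something else), and a hypothetical helper name `write_block`. O_DIRECT needs an aligned buffer and an aligned length, so the sketch borrows page-aligned memory from an anonymous mmap and falls back to a normal buffered write where the flag is missing or the filesystem rejects it.

```python
import mmap
import os

BLOCK = 4096  # assumed alignment and transfer size; check your device


def _write_all(path, flags, buf):
    fd = os.open(path, flags, 0o600)
    try:
        os.write(fd, buf)
    finally:
        os.close(fd)


def write_block(path, payload):
    """Write one zero-padded block; return True if O_DIRECT was actually used."""
    if len(payload) > BLOCK:
        raise ValueError("payload larger than one block")
    buf = mmap.mmap(-1, BLOCK)  # anonymous mapping gives page-aligned memory
    try:
        buf.write(payload)      # remaining bytes stay zero
        base = os.O_WRONLY | os.O_CREAT | os.O_TRUNC
        direct = getattr(os, "O_DIRECT", 0)  # not defined on every platform
        if direct:
            try:
                _write_all(path, base | direct, buf)
                return True
            except OSError:
                pass            # e.g. the filesystem does not support O_DIRECT
        _write_all(path, base, buf)
        return False
    finally:
        buf.close()
```

A call such as `write_block("/tmp/example.blk", b"hello")` (the path is only an example) reports whether the cache was really bypassed, since many temporary filesystems refuse the flag.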

23:17 <crucially> you can also open the block device directly, bypasses the buffer cache

23:18 <crucially> madvise random/sequential are implemented
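
The random/sequential madvise hints mentioned above can be tried from Python 3.8+, which exposes them on mmap objects. This is a small sketch assuming a platform where `mmap.MADV_RANDOM` and `mmap.MADV_SEQUENTIAL` exist; `read_at_offsets` is a hypothetical helper, and the hint is advisory, so the kernel is free to ignore it.

```python
import mmap


def read_at_offsets(path, offsets, length, random_access=True):
    """Map a file read-only, hint the expected access pattern, read slices."""
    name = "MADV_RANDOM" if random_access else "MADV_SEQUENTIAL"
    advice = getattr(mmap, name, None)  # constant is missing on some platforms
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            if advice is not None and hasattr(mapped, "madvise"):
                # MADV_RANDOM asks the kernel to skip read-ahead;
                # MADV_SEQUENTIAL asks for more aggressive read-ahead.
                mapped.madvise(advice)
            return [mapped[off:off + length] for off in offsets]
```

Hinting random access is one way to keep scattered point reads from dragging unneeded neighbouring pages into the cache.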