@PharkMillups
Created October 22, 2010 19:56
14:46 <traceback0> for the inno backend for riak, are the writes random?
14:47 <traceback0> how do writes work in riak and how do they scale
14:47 <benblack> how they work depends on which backend you are using
14:47 <benblack> bitcask is append-only
14:47 <benblack> inno is not
14:47 <benblack> if you want highest write performance you should be using bitcask
14:48 <traceback0> talking strictly about inno
14:48 <benblack> writes scale as in all dynamo systems: approximately linearly on the
number of nodes
14:48 <traceback0> does that mean writes using inno are random?
14:48 <benblack> it means writes in inno are in place
14:49 <benblack> as opposed to append only
14:49 <traceback0> ok so more random
14:49 <benblack> whether that implies random rather depends on your access patterns
14:49 <traceback0> no updates, just inserts in this case
14:49 <benblack> then why would you use inno?
14:49 <traceback0> I don't know how Riak handles writes for inno
14:50 <benblack> riak doesn't handle writes for inno (or any other storage engine).
the storage engine does.
14:50 <traceback0> benblack: So that I don't have to monitor memory in the event I take my eye
off the ball with Bitcask
14:50 <benblack> what do you think happens with bitcask if you exceed memory?
14:51 <traceback0> benblack: and I can use memory more efficiently by
restricting it to the working set
14:51 <traceback0> benblack: riak crashes
14:51 <benblack> does it
14:51 <benblack> that is an interesting theory
14:51 <traceback0> That's what I was told by contributors of Riak
14:51 <traceback0> the other day =)
14:51 <traceback0> all keys have to fit in memory
14:51 <benblack> the answer you get depends very much on the exact question you ask
14:51 <traceback0> else it crashes
14:52 <traceback0> but riak supposedly holds a shit ton of keys
14:52 <traceback0> 40 bytes + key length
14:52 <traceback0> but it'll crash
14:52 <traceback0> which is distressing =)
14:52 <traceback0> you get what constrain for though
14:52 <traceback0> what you*
14:52 <traceback0> not complaining just don't think that fits our needs
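Editor's note: the "40 bytes + key length" figure quoted above allows a rough estimate of how much memory the bitcask key index would need on one node. The sketch below is back-of-the-envelope arithmetic only. The 40-byte overhead is taken from the chat rather than measured, and the function name and the 32-byte average key length are assumptions made for illustration.

```python
# Rough estimate of bitcask key-index memory on one node.
# PER_KEY_OVERHEAD_BYTES is the figure quoted in the chat, not a measured value;
# real overhead depends on the bitcask version and the allocator.
PER_KEY_OVERHEAD_BYTES = 40


def keydir_bytes(num_keys: int, avg_key_len: int) -> int:
    """Return the estimated index size in bytes for num_keys keys."""
    return num_keys * (PER_KEY_OVERHEAD_BYTES + avg_key_len)


if __name__ == "__main__":
    # avg_key_len=32 is a hypothetical key size chosen for the example.
    for n in (10_000_000, 100_000_000, 1_000_000_000):
        gib = keydir_bytes(n, avg_key_len=32) / 2**30
        print(f"{n:>13,} keys -> about {gib:.1f} GiB of index")
```

Comparing the result with the RAM available per node shows whether the "all keys have to fit in memory" constraint is a practical concern for a given key count.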
14:53 <benblack> http://blog.mozilla.com/data/2010/08/16/benchmarking-riak-for-the-mozilla-test-pilot-project/
14:53 <benblack> have you seen that from august?
14:54 <benblack> traceback0: was your question about _physical_ memory?
14:54 <traceback0> strange it says it'll swap with bitcask on the blog
14:54 <traceback0> wonder why someone said it crashes
14:54 <benblack> like i said
14:54 <benblack> the answer you get depends on the exact question you ask
14:54 <benblack> if you asked "exceeds memory" the answer is crash
14:54 <bingeldac> not that we condone swapping
14:54 <bingeldac> ever.
14:55 <benblack> if you asked "exceeds physical memory" the answer is "swapping"
14:55 <traceback0> what is the difference in this case?
14:55 <benblack> between swapping and crashing?
14:56 <traceback0> exceeds memory and exceeds physical memory?
14:56 <benblack> virtual memory != physical memory
14:58 <traceback0> so exceed memory assumed I meant exceed virtual memory?
14:59 <benblack> right
14:59 <benblack> once you consume all available memory, life is hard
15:00 <benblack> and really, once you consume all physical memory, life is hard
15:00 <bingeldac> brutish and short
15:00 <traceback0> ok presumably most systems have an insane amount of virtual memory?
15:00 <benblack> you do not want to live in swap
15:00 <benblack> but if your capacity planning and monitoring are not up to the task,
it can buy you some time
15:01 <traceback0> why might someone pick inno over bitcask ever?
15:06 <skeptomai> benblack: crucially swaps to ssd
15:06 <skeptomai> (as last resort)
15:07 <crucially> i would say my macbook swaps to ssd all the time
15:07 <skeptomai> ah, good point
15:07 <crucially> but yeah, all our new machines have ssds for root/boot/swap
15:07 <skeptomai> Don't you also configure some boxes in production to do so?
(Am I remembering correctly?)
15:07 <crucially> of course, a lot of our machines have no swap configured too
15:08 <bingeldac> we debate that internally all the time
15:08 <bingeldac> to have or not have swap
15:09 <pharkmillups> traceback0: you can dig around riak.markmail.org for various
"inno vs. bitcask" discussions
15:09 <bingeldac> rasputnik: I don't think it is a bad idea if that is what you
have to work with
15:09 <pharkmillups> traceback0: this one isn't bad
15:09 <pharkmillups> http://gist.github.com/438065
15:09 <skeptomai> crashing sucks, but swap really hides the problem. the service may be so
impacted that it's not really functioning when it swaps and your monitoring might not reflect that
15:10 <benblack> traceback0: if you are constantly updating the same keys (for example, doing
some sort of counter-style thing), bitcask will consume a lot more space between merges
15:10 <benblack> inno updates in place, so it takes up less space
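Editor's note: the space tradeoff described above can be illustrated with two toy stores. This is a minimal sketch and not how bitcask or inno are implemented: `AppendOnlyStore` and `InPlaceStore` are hypothetical classes that keep everything in memory and only count the bytes each write strategy would leave on disk.

```python
import io


class AppendOnlyStore:
    """Toy append-only store: every put appends, an index points at the latest value."""

    def __init__(self):
        self.log = io.BytesIO()
        self.index = {}  # key -> (offset, length) of the latest value

    def put(self, key, value):
        offset = self.log.seek(0, io.SEEK_END)  # always write at the end
        self.log.write(value)
        self.index[key] = (offset, len(value))

    def get(self, key):
        offset, length = self.index[key]
        self.log.seek(offset)
        return self.log.read(length)

    def size_on_disk(self):
        return self.log.seek(0, io.SEEK_END)

    def merge(self):
        """Rewrite the log so it holds only the live value for each key."""
        live = {key: self.get(key) for key in self.index}
        self.log = io.BytesIO()
        self.index = {}
        for key, value in live.items():
            self.put(key, value)


class InPlaceStore:
    """Toy in-place store: a put overwrites the previous value for the key."""

    def __init__(self):
        self.data = {}

    def put(self, key, value):
        self.data[key] = value

    def get(self, key):
        return self.data[key]

    def size_on_disk(self):
        return sum(len(value) for value in self.data.values())


if __name__ == "__main__":
    append_only, in_place = AppendOnlyStore(), InPlaceStore()
    for i in range(1000):  # counter-style workload: one key, many updates
        value = i.to_bytes(8, "big")
        append_only.put("counter", value)
        in_place.put("counter", value)
    print("append-only before merge:", append_only.size_on_disk())
    print("in-place:", in_place.size_on_disk())
    append_only.merge()
    print("append-only after merge:", append_only.size_on_disk())
```

With 1000 updates of an 8-byte value to one key, the append-only log holds 8000 bytes until the merge and 8 bytes afterwards, while the in-place store holds 8 bytes throughout.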
15:18 <seancribbs> the tradeoff for constrained memory usage is higher/more erratic latency
15:18 <seancribbs> (with inno)
22:11 <traceback0> seancribbs: ping
22:12 <seancribbs> pong, but not for long
22:21 <traceback0> seancribbs: oh oops
22:21 <traceback0> seancribbs: reading up on bitcask
22:22 <traceback0> How is this possible: http://cl.ly/880e7e97abcf7aca7796
22:22 <benblack> virtual memory
22:22 <benblack> _swap_
22:22 <seancribbs> datasets > RAM = data is not stored in RAM
22:22 <benblack> size of _keys_/index != size of dataset
22:23 <traceback0> so a 10ms disk seek to where the data is
22:23 <traceback0> since the key is inherently a pointer to the physical data
22:23 <benblack> not quite
22:23 <seancribbs> worst-case. if it HAS to go to disk
22:23 <benblack> if working set fits in OS buffer cache, there is no disk activity
22:23 <seancribbs> ^^
22:23 <traceback0> how big is a OS buffer cache typically?
22:24 <traceback0> 1G? 100MB?
22:24 <benblack> how much RAM is in the box?
22:24 <seancribbs> Total RAM - RSS of other programs
22:24 <seancribbs> (if you fill it)
22:24 <seancribbs> s/other/running/
22:24 <traceback0> 8G
22:25 <traceback0> so i have 20G of data
22:25 <traceback0> 8G of physical memory
22:25 <traceback0> riak has been running for a while so 8G is full
22:25 <benblack> do you understand the difference between index size, working set size,
and total dataset size?
22:25 <traceback0> yes
22:26 <benblack> great
22:26 <benblack> all questions answered!
22:26 <traceback0> alright so once working set exceeds, swap happens because it's
seeking for data on disk
22:26 <traceback0> each seek is ~10ms so with enough swap you die
22:26 <benblack> sorry, don't understand a thing you just said
22:27 <benblack> but i don't think so
22:27 <traceback0> just saying once your working set exceeds physical memory
22:27 <traceback0> you swap
22:27 <traceback0> since each key is a single seek
22:27 <benblack> no, that's just disk activity
22:27 <traceback0> we're talking each look up is a ~10ms look up
22:27 <benblack> means OS buffer caches are being invalidated and new stuff pulled in
22:27 <traceback0> what is just disk activity?
22:28 <traceback0> guess I don't understand how OS buffer caches work
22:28 <benblack> the riak process memory taking up more space can cause swap (as is normal
for any set of processes that exceeds available RAM)
22:29 <benblack> in some operating systems, buffer cache and VM are merged to avoid
redundant/conflicting activity. they are still independent from the perspective of the processes, though
22:29 <benblack> and linux is not such an OS
22:29 <benblack> http://www.faqs.org/docs/linux_admin/buffer-cache.html
22:30 <benblack> http://tldp.org/LDP/tlk/fs/filesystem.html
22:54 <traceback0> benblack: that buffer cache page was very helpful thanks
22:54 <traceback0> i learned a lot
22:55 <traceback0> so basically there will be no buffer cache left once all the keys
exhaust physical memory
22:55 <benblack> reading is fundamental
22:55 <traceback0> and swap will occur
22:55 <benblack> approximately so
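Editor's note: the distinction between index size, working set size and total dataset size can be put into a small classifier. This is a simplified model built on the explanation above, not a description of any real kernel. The function name, the `other_rss` parameter and the sizes in the example are assumptions; only the 8G of RAM comes from the chat.

```python
GIB = 2**30


def memory_picture(ram, index, working_set, other_rss=0):
    """Classify where reads are likely served from. All sizes are in bytes.

    Simplified model: process memory (index plus other resident programs)
    is claimed first, and whatever RAM is left acts as buffer cache.
    """
    process_memory = index + other_rss
    if process_memory > ram:
        return "swap: process memory alone exceeds physical RAM"
    cache = ram - process_memory
    if working_set <= cache:
        return "cache: working set fits in the buffer cache, little disk activity"
    return "disk: cache entries are evicted and re-read, which is not swap"


if __name__ == "__main__":
    ram = 8 * GIB  # the box from the chat; dataset size does not enter the model
    print(memory_picture(ram, index=2 * GIB, working_set=4 * GIB))
    print(memory_picture(ram, index=2 * GIB, working_set=10 * GIB))
    print(memory_picture(ram, index=9 * GIB, working_set=4 * GIB))
```

The total dataset size never appears in the function. In this model a 20G dataset on an 8G box matters only through how much of it is read often (the working set) and how large its index is.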
22:57 <traceback0> benblack: any information related to how files get mapped to the buffer cache?
22:57 <traceback0> i.e. i read data file A and all of A gets thrown into the buffer cache?
22:57 <benblack> files aren't mapped to buffer cache
22:57 <benblack> you might be thinking of mmap()
22:58 <traceback0> well when you read file A
22:58 <traceback0> its data is in the buffer cache right?
22:58 <benblack> the parts you read (and whatever else got read because of read ahead behavior)
22:58 <traceback0> sure
22:58 <benblack> reading a little bit does not cause the whole thing to be pulled into cache
22:59 <traceback0> so when you read file A again in the same spot how does the OS know its
in memory (cached)?
22:59 <traceback0> that specific region of data in that file is in the cache
22:59 <benblack> because all your requests are going through the same VFS system
22:59 <traceback0> is it based on some position on the disk?
23:00 <traceback0> oh maybe i should read http://tldp.org/LDP/tlk/fs/filesystem.html i guess?
23:00 <benblack> where disk may be a logical volume
23:00 <benblack> starting to understand why things like O_SYNC, O_DIRECT are important for
merge-based storage?
23:01 <benblack> (among other things)
23:03 <traceback0> will read up on those
23:03 <traceback0> I honestly have no idea what O_DIRECT does
23:03 <* traceback0> googles
23:04 <benblack> tells the VFS layer to bypass buffer cache for those operations
23:06 <traceback0> is this for writes?
23:06 <traceback0> as in don't write through this data basically?
23:06 <benblack> http://amailbox.org/node/7563
23:07 <benblack> (much of the advise calls linus says should be used instead aren't
even implemented, so...)
23:07 <benblack> speaking of madvise(), it's crucially
23:13 <crucially> hi
23:13 <traceback0> benblack: was i right about O_DIRECT bypasses buffer cache for writes so they don't
write through and consume your buffer cache?
23:14 <benblack> hi
23:14 <benblack> traceback0: right (except the concern is less consuming buffer cache than invalidating
entries from it)
23:14 <traceback0> benblack: ah ok yeah
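Editor's note: a minimal sketch of a write that bypasses the buffer cache, so that bulk writes (a merge, for example) do not push hot entries out of it. It is not taken from riak or bitcask. `write_block` and the 4096-byte `BLOCK` size are assumptions. `O_DIRECT` is not available on every platform or filesystem and usually needs aligned buffers, so the code falls back to a plain `O_SYNC` write when the direct write is rejected.

```python
import mmap
import os

BLOCK = 4096  # assumed alignment; the real requirement depends on device and filesystem


def write_block(path, payload):
    """Write one zero-padded block to path. Returns the mode used."""
    if len(payload) > BLOCK:
        raise ValueError("payload larger than one block")
    buf = mmap.mmap(-1, BLOCK)  # anonymous mapping, page-aligned
    try:
        buf.write(payload.ljust(BLOCK, b"\0"))
        flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC
        direct = getattr(os, "O_DIRECT", 0)  # missing on some platforms
        if direct:
            try:
                fd = os.open(path, flags | direct, 0o600)
                try:
                    if os.write(fd, buf) == BLOCK:
                        return "O_DIRECT"
                finally:
                    os.close(fd)
            except OSError:
                pass  # the filesystem refused O_DIRECT; fall through
        # Fallback: goes through the buffer cache, but is flushed before returning.
        fd = os.open(path, flags | getattr(os, "O_SYNC", 0), 0o600)
        try:
            if os.write(fd, buf) != BLOCK:
                raise OSError("short write")
            return "O_SYNC"
        finally:
            os.close(fd)
    finally:
        buf.close()


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        print(write_block(os.path.join(tmp, "block.dat"), b"hello"))
```

The two flags do different jobs: `O_SYNC` makes each write wait until the data reaches storage, while `O_DIRECT` asks the kernel to skip the cache for that file descriptor.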
23:17 <crucially> you can also open the block device directly, bypasses the buffer cache
23:18 <crucially> madvise random/sequential are implemented
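Editor's note: a minimal sketch of the madvise random/sequential hints on a memory-mapped file. `read_with_advice` is a hypothetical helper. `mmap.madvise` and the `MADV_*` constants exist only on Python 3.8+ and on platforms that provide them, so the hint is skipped when they are missing. The hint is advisory, and the kernel is free to ignore it.

```python
import mmap


def read_with_advice(path, sequential, nbytes=16):
    """Map a non-empty file read-only, hint the access pattern, return the first nbytes."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            name = "MADV_SEQUENTIAL" if sequential else "MADV_RANDOM"
            advice = getattr(mmap, name, None)
            if advice is not None and hasattr(m, "madvise"):
                # Sequential typically encourages read-ahead, random discourages it.
                m.madvise(advice)
            return m[:nbytes]


if __name__ == "__main__":
    import os
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "data.bin")
        with open(path, "wb") as out:
            out.write(bytes(range(256)) * 32)
        print(read_with_advice(path, sequential=True))
        print(read_with_advice(path, sequential=False))
```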