Created
October 22, 2010 19:56
14:46 <traceback0> for the inno backend for riak, are the writes random? | |
14:47 <traceback0> how do writes work in riak and how do they scale | |
14:47 <benblack> how they work depends on which backend you are using | |
14:47 <benblack> bitcask is append-only | |
14:47 <benblack> inno is not | |
14:47 <benblack> if you want highest write performance you should be using bitcask | |
14:48 <traceback0> talking strictly about inno | |
14:48 <benblack> writes scale as in all dynamo systems: approximately linearly on the | |
number of nodes | |
14:48 <traceback0> does that mean writes using inno are random? | |
14:48 <benblack> it means writes in inno are in place | |
14:49 <benblack> as opposed to append only | |
14:49 <traceback0> ok so more random | |
14:49 <benblack> whether that implies random rather depends on your access patterns | |
14:49 <traceback0> no updates, just inserts in this case | |
14:49 <benblack> then why would you use inno? | |
14:49 <traceback0> I don't know how Riak handles writes for inno | |
14:50 <benblack> riak doesn't handle writes for inno (or any other storage engine). | |
the storage engine does. | |
14:50 <traceback0> benblack: So that I don't have to monitor memory in the event I take my eye | |
off the ball with Bitcask | |
14:50 <benblack> what do you think happens with bitcask if you exceed memory? | |
14:51 <traceback0> benblack: and I can use memory more efficiently by | |
restricting it to the working set | |
14:51 <traceback0> benblack: riak crashes | |
14:51 <benblack> does it | |
14:51 <benblack> that is an interesting theory | |
14:51 <traceback0> That's what I was told by contributors of Riak | |
14:51 <traceback0> the other day =) | |
14:51 <traceback0> all keys have to fit in memory | |
14:51 <benblack> the answer you get depends very much on the exact question you ask | |
14:51 <traceback0> else it crashes | |
14:52 <traceback0> but riak supposedly holds a shit ton of keys | |
14:52 <traceback0> 40 bytes + key length | |
14:52 <traceback0> but it'll crash | |
14:52 <traceback0> which is distressing =) | |
14:52 <traceback0> you get what constrain for though | |
14:52 <traceback0> what you* | |
14:52 <traceback0> not complaining just don't think that fits our needs | |
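The "40 bytes + key length" figure above can be turned into a rough sizing estimate. This is only a back-of-the-envelope sketch: the 40-byte per-key overhead is the number quoted in the chat, and the function name and the example workload are made up for illustration.

```python
def keydir_bytes(num_keys, avg_key_len, per_key_overhead=40):
    """Rough RAM needed to hold every key in memory.

    per_key_overhead=40 is the figure quoted in the chat, not a measured value.
    """
    return num_keys * (per_key_overhead + avg_key_len)


# Hypothetical workload: 100 million keys averaging 30 bytes each.
estimate = keydir_bytes(100_000_000, 30)
print(f"{estimate / 2**30:.2f} GiB")  # prints 6.52 GiB
```

Under those assumed numbers the key index alone would take most of an 8G box, which is why key count matters more than value size for this kind of backend.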
14:53 <benblack> http://blog.mozilla.com/data/2010/08/16/benchmarking-riak-for-the-mozilla-test-pilot-project/ | |
14:53 <benblack> have you seen that from august? | |
14:54 <benblack> traceback0: was your question about _physical_ memory? | |
14:54 <traceback0> strange it says it'll swap with bitcask on the blog | |
14:54 <traceback0> wonder why someone said it crashes | |
14:54 <benblack> like i said | |
14:54 <benblack> the answer you get depends on the exact question you ask | |
14:54 <benblack> if you asked "exceeds memory" the answer is crash | |
14:54 <bingeldac> not that we condone swapping | |
14:54 <bingeldac> ever. | |
14:55 <benblack> if you asked "exceeds physical memory" the answer is "swapping" | |
14:55 <traceback0> what is the difference in this case? | |
14:55 <benblack> between swapping and crashing? | |
14:56 <traceback0> exceeds memory and exceeds physical memory? | |
14:56 <benblack> virtual memory != physical memory | |
14:58 <traceback0> so exceed memory assumed I meant exceed virtual memory? | |
14:59 <benblack> right | |
14:59 <benblack> once you consume all available memory, life is hard | |
15:00 <benblack> and really, once you consume all physical memory, life is hard | |
15:00 <bingeldac> brutish and short | |
15:00 <traceback0> ok presumably most systems have an insane amount of virtual memory? | |
15:00 <benblack> you do not want to live in swap | |
15:00 <benblack> but if your capacity planning and monitoring are not up to the task, | |
it can buy you some time | |
15:01 <traceback0> why might someone pick inno over bitcask ever? | |
15:06 <skeptomai> benblack: crucially swaps to ssd | |
15:06 <skeptomai> (as last resort) | |
15:07 <crucially> i would say my macbook swaps to ssd all the time | |
15:07 <skeptomai> ah, good point | |
15:07 <crucially> but yeah, all our new machines have ssds for root/boot/swap | |
15:07 <skeptomai> Don't you also configure some boxes in production to do so? | |
(Am I remembering correctly?) | |
15:07 <crucially> of course, a lot of our machines have no swap configured too | |
15:08 <bingeldac> we debate that internally all the time | |
15:08 <bingeldac> to have or not have swap | |
15:09 <pharkmillups> traceback0: you can dig around riak.markmail.org for various | |
"inno vs. bitcask" discussions | |
15:09 <bingeldac> rasputnik: I don't think it is a bad idea if that is what you | |
have to work with | |
15:09 <pharkmillups> traceback0: this one isn't bad | |
15:09 <pharkmillups> http://gist.github.com/438065 | |
15:09 <skeptomai> crashing sucks, but swap really hides the problem. the service may be so | |
impacted that it's not really functioning when it swaps and your monitoring might not reflect that | |
15:10 <benblack> traceback0: if you are constantly updating the same keys (for example, doing | |
some sort of counter-style thing), bitcask will consume a lot more space between merges | |
15:10 <benblack> inno updates in place, so it takes up less space | |
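The space tradeoff benblack describes can be shown with a toy model. This is a minimal sketch, not how bitcask or inno are implemented: `AppendOnlyStore` is a hypothetical class, a `bytearray` stands in for the on-disk data file, and a plain dict stands in for an update-in-place engine.

```python
class AppendOnlyStore:
    """Toy append-only store: every put appends, an in-memory index tracks the latest copy."""

    def __init__(self):
        self.log = bytearray()   # stands in for the on-disk data file
        self.index = {}          # key -> (offset, length) of the live value

    def put(self, key, value):
        self.index[key] = (len(self.log), len(value))
        self.log += value        # old versions stay in the log as dead bytes

    def get(self, key):
        offset, length = self.index[key]
        return bytes(self.log[offset:offset + length])

    def merge(self):
        """Rewrite the log keeping only live values, then swap it in."""
        new_log, new_index = bytearray(), {}
        for key, (offset, length) in self.index.items():
            new_index[key] = (len(new_log), length)
            new_log += self.log[offset:offset + length]
        self.log, self.index = new_log, new_index


store = AppendOnlyStore()
in_place = {}                    # stands in for an update-in-place engine
for n in range(1000):            # counter-style workload: same key, 1000 updates
    value = n.to_bytes(8, "big")
    store.put("hits", value)
    in_place["hits"] = value

size_before_merge = len(store.log)   # 8000 bytes: all 1000 versions are still there
store.merge()
size_after_merge = len(store.log)    # 8 bytes: only the live value remains
```

For an insert-only workload the two approaches hold the same amount of live data, so the space argument for updating in place only applies when keys are rewritten often.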
15:18 <seancribbs> the tradeoff for constrained memory usage is higher/more erratic latency | |
15:18 <seancribbs> (with inno) | |
22:11 <traceback0> seancribbs: ping | |
22:12 <seancribbs> pong, but not for long | |
22:21 <traceback0> seancribbs: oh oops | |
22:21 <traceback0> seancribbs: reading up on bitcask | |
22:22 <traceback0> How is this possible: http://cl.ly/880e7e97abcf7aca7796 | |
22:22 <benblack> virtual memory | |
22:22 <benblack> _swap_ | |
22:22 <seancribbs> datasets > RAM = data is not stored in RAM | |
22:22 <benblack> size of _keys_/index != size of dataset | |
22:23 <traceback0> so a 10ms disk seek to where the data is | |
22:23 <traceback0> since the key is inherently a pointer to the physical data | |
22:23 <benblack> not quite | |
22:23 <seancribbs> worst-case. if it HAS to go to disk | |
22:23 <benblack> if working set fits in OS buffer cache, there is no disk activity | |
22:23 <seancribbs> ^^ | |
22:23 <traceback0> how big is an OS buffer cache typically? | |
22:24 <traceback0> 1G? 100MB? | |
22:24 <benblack> how much RAM is in the box? | |
22:24 <seancribbs> Total RAM - RSS of other programs | |
22:24 <seancribbs> (if you fill it) | |
22:24 <seancribbs> s/other/running/ | |
22:24 <traceback0> 8G | |
22:25 <traceback0> so i have 20G of data | |
22:25 <traceback0> 8G of physical memory | |
22:25 <traceback0> riak has been running for a while so 8G is full | |
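seancribbs's rule of thumb (total RAM minus the RSS of running programs) can be applied to the 8G / 20G numbers above. This is a rough sketch: the function name is hypothetical, and the 3 GiB resident size is an assumed figure, since the chat never says how much memory the processes on the box actually hold.

```python
GIB = 2**30


def page_cache_budget(total_ram, process_rss):
    """Upper bound on the RAM the OS can spend caching file data."""
    return max(0, total_ram - sum(process_rss))


total_ram = 8 * GIB
dataset = 20 * GIB
rss = [3 * GIB]          # hypothetical: one process holding 3 GiB resident

budget = page_cache_budget(total_ram, rss)
cacheable_fraction = min(1.0, budget / dataset)
```

With these assumed numbers at most a quarter of the dataset can sit in cache at once, so reads stay fast only while the working set is smaller than that budget.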
22:25 <benblack> do you understand the difference between index size, working set size, | |
and total dataset size? | |
22:25 <traceback0> yes | |
22:26 <benblack> great | |
22:26 <benblack> all questions answered! | |
22:26 <traceback0> alright so once working set exceeds, swap happens because it's | |
seeking for data on disk | |
22:26 <traceback0> each seek is ~10ms so with enough swap you die | |
22:26 <benblack> sorry, don't understand a thing you just said | |
22:27 <benblack> but i don't think so | |
22:27 <traceback0> just saying once your working set exceeds physical memory | |
22:27 <traceback0> you swap | |
22:27 <traceback0> since each key is a single seek | |
22:27 <benblack> no, that's just disk activity | |
22:27 <traceback0> we're talking each look up is a ~10ms look up | |
22:27 <benblack> means OS buffer caches are being invalidated and new stuff pulled in | |
22:27 <traceback0> what is just disk activity? | |
22:28 <traceback0> guess I don't understand how OS buffer caches work | |
22:28 <benblack> the riak process memory taking up more space can cause swap (as is normal | |
for any set of processes that exceeds available RAM) | |
22:29 <benblack> in some operating systems, buffer cache and VM are merged to avoid | |
redundant/conflicting activity. they are still independent from the perspective of the processes, though | |
22:29 <benblack> and linux is not such an OS | |
22:29 <benblack> http://www.faqs.org/docs/linux_admin/buffer-cache.html | |
22:30 <benblack> http://tldp.org/LDP/tlk/fs/filesystem.html | |
22:54 <traceback0> benblack: that buffer cache page was very helpful thanks | |
22:54 <traceback0> i learned a lot | |
22:55 <traceback0> so basically there will be no buffer cache left once all the keys | |
exhaust physical memory | |
22:55 <benblack> reading is fundamental | |
22:55 <traceback0> and swap will occur | |
22:55 <benblack> approximately so | |
22:57 <traceback0> benblack: any information related to how files get mapped to the buffer cache? | |
22:57 <traceback0> i.e. i read data file A and all of A gets thrown into the buffer cache? | |
22:57 <benblack> files aren't mapped to buffer cache | |
22:57 <benblack> you might be thinking of mmap() | |
22:58 <traceback0> well when you read file A | |
22:58 <traceback0> its data is in the buffer cache right? | |
22:58 <benblack> the parts you read (and whatever else got read because of read ahead behavior) | |
22:58 <traceback0> sure | |
22:58 <benblack> reading a little bit does not cause the whole thing to be pulled into cache | |
22:59 <traceback0> so when you read file A again in the same spot how does the OS know its | |
in memory (cached)? | |
22:59 <traceback0> that specific region of data in that file is in the cache | |
22:59 <benblack> because all your requests are going through the same VFS system | |
22:59 <traceback0> is it based on some position on the disk? | |
23:00 <traceback0> oh maybe i should read http://tldp.org/LDP/tlk/fs/filesystem.html i guess? | |
23:00 <benblack> where disk may be a logical volume | |
23:00 <benblack> starting to understand why things like O_SYNC, O_DIRECT are important for | |
merge-based storage? | |
23:01 <benblack> (among other things) | |
23:03 <traceback0> will read up on those | |
23:03 <traceback0> I honestly have no idea what O_DIRECT does | |
23:03 <* traceback0> googles | |
23:04 <benblack> tells the VFS layer to bypass buffer cache for those operations | |
23:06 <traceback0> is this for writes? | |
23:06 <traceback0> as in don't write through this data basically? | |
23:06 <benblack> http://amailbox.org/node/7563 | |
23:07 <benblack> (much of the advise calls linus says should be used instead aren't | |
even implemented, so...) | |
23:07 <benblack> speaking of madvise(), it's crucially | |
23:13 <crucially> hi | |
23:13 <traceback0> benblack: was i right about O_DIRECT bypasses buffer cache for writes so they don't | |
write through and consume your buffer cache? | |
23:14 <benblack> hi | |
23:14 <benblack> traceback0: right (except the concern is less consuming buffer cache than invalidating | |
entries from it) | |
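What an O_DIRECT write looks like in practice can be sketched as below. This is a hedged illustration, not code from any storage engine: `write_block` is a hypothetical helper, the 4096-byte block size is an assumption (real code should use the device's logical block size), and the function falls back to an ordinary buffered write where O_DIRECT is missing or the filesystem rejects it.

```python
import mmap
import os
import tempfile

BLOCK = 4096  # assumed alignment unit, not queried from the device


def write_block(path, payload, block=BLOCK):
    """Write one zero-padded block, bypassing the buffer cache when possible.

    Returns True if the write went through O_DIRECT, False if it fell back.
    """
    if len(payload) > block:
        raise ValueError("payload larger than one block")
    # Anonymous mmap memory is page-aligned, which O_DIRECT buffers need to be.
    buf = mmap.mmap(-1, block)
    buf[:len(payload)] = payload
    base = os.O_WRONLY | os.O_CREAT | os.O_TRUNC
    direct = getattr(os, "O_DIRECT", 0)   # absent on some platforms
    try:
        for flags, is_direct in ((base | direct, bool(direct)), (base, False)):
            try:
                fd = os.open(path, flags, 0o644)
            except OSError:
                continue                   # e.g. filesystem refuses O_DIRECT
            try:
                os.write(fd, buf)
                return is_direct
            except OSError:
                continue                   # retry without O_DIRECT
            finally:
                os.close(fd)
        raise OSError("could not write " + path)
    finally:
        buf.close()


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "block.bin")
    used_direct = write_block(target, b"hello")
    with open(target, "rb") as f:
        written = f.read()
```

The point for merge-based storage is the one benblack makes: a large merge written this way does not push the hot read set out of the cache.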
23:14 <traceback0> benblack: ah ok yeah | |
23:17 <crucially> you can also open the block device directly, bypasses the buffer cache | |
23:18 <crucially> madvise random/sequential are implemented |
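The random/sequential hints crucially mentions can be tried from Python, which exposes `madvise()` on mmap objects since 3.8. This is a small sketch with a hypothetical helper name; the constants are looked up defensively because they only exist on platforms that provide madvise, and the hint is advisory, so the kernel is free to ignore it.

```python
import mmap
import os
import tempfile


def map_readonly(path, random_access=True):
    """Map a file read-only and hint the expected access pattern to the kernel."""
    with open(path, "rb") as f:
        mapped = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    name = "MADV_RANDOM" if random_access else "MADV_SEQUENTIAL"
    advice = getattr(mmap, name, None)     # missing where madvise is unsupported
    if advice is not None and hasattr(mapped, "madvise"):
        mapped.madvise(advice)
    return mapped


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.bin")
    with open(path, "wb") as f:
        f.write(b"abcd" * 2048)
    mapped = map_readonly(path, random_access=True)
    first = mapped[:4]
    mapped.close()
```

A random hint tells the kernel that read-ahead is unlikely to pay off, which matches key lookups that jump around a data file; a sequential hint suits scans such as a merge pass.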