Created: October 27, 2010 20:08
09:47 <jonas11235> hi all
09:47 <jonas11235> I have a question about how map/reduce works in riak
09:48 <jonas11235> does it recalculate all documents in the bucket every time I run the map/reduce, or does it use the information in the vector clock to compute just the new values?
09:50 <jonas11235> in my context the end user can perform an action and I register each action as a document; we will have lots of inserts (and this part is very critical), but I will need to query how many actions each user made
09:51 <jonas11235> will riak process the whole bucket again, or just the new actions?
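jonas11235's use case (one stored document per user action, queried for per-user counts) is a classic map/reduce count. A minimal local sketch in Python, with made-up data, purely to illustrate the computation the two phases perform for this query; this is not Riak's API, just the logic it would distribute:

```python
from collections import Counter

# Hypothetical action documents, one per user action, as jonas11235 describes.
actions = [
    {"user": "alice", "action": "click"},
    {"user": "bob", "action": "view"},
    {"user": "alice", "action": "view"},
]

def map_phase(doc):
    # Each map call sees one stored document and emits a (user, 1) pair.
    return [(doc["user"], 1)]

def reduce_phase(pairs):
    # The reduce phase sums the emitted counts per user.
    counts = Counter()
    for user, n in pairs:
        counts[user] += n
    return dict(counts)

mapped = [pair for doc in actions for pair in map_phase(doc)]
print(reduce_phase(mapped))  # {'alice': 2, 'bob': 1}
```

In Riak the map calls run next to the data on each vnode; only their (small) outputs travel to the reduce, which is what makes the per-object map caching discussed below worthwhile.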
09:51 <seancribbs> jonas11235: there is a modest amount of caching for map results
09:51 <seancribbs> so your second request across the same data set with the same spec will be somewhat faster
09:52 <jonas11235> good, that is what I need to know :)
09:52 <seancribbs> but it's not incremental like CouchDB's m/r
09:52 <jonas11235> what do you mean by "modest amount"? what limitations does this cache implementation have?
09:53 <jonas11235> sorry about asking this here
09:53 <seancribbs> i believe there are some knobs you can turn… digging
09:53 <jonas11235> I looked for documentation and I didn't find anything about caching in the map/reduce
09:54 <jonas11235> I did some digging; do you have any reference where I should look?
09:54 <seancribbs> no, give me a moment
09:54 <jonas11235> ok, thank you
09:55 <jonas11235> I think this cache stuff is important in terms of application behavior
09:56 <seancribbs> right, so the default is 100 results per vnode (partition)
09:56 <jonas11235> the only reference I found was one paragraph in the release notes of 0.13
09:57 <jonas11235> "In addition to this, the caching layer for JavaScript MapReduce has been completely re-implemented. This results in performance gains when repeating the same MapReduce jobs. Specifically, this work includes a new in-memory vnode LRU cache solely for map operations. The size of the cache is now configurable (via the 'vnode_cache_entries' entry in the riak_kv section of app.config) and defaults to 1000 objects."
09:57 <seancribbs> right
09:57 <seancribbs> so in the riak_kv section of app.config
09:57 <seancribbs> you can set/change the vnode_cache_entries key
09:57 <seancribbs> {vnode_cache_entries, 250},
09:58 <seancribbs> etc
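Assembled from the snippets above, the relevant part of app.config would look roughly like this; the surrounding entries are elided, and 250 is just the example value seancribbs gave (the release notes quoted earlier say the default is 1000):

```erlang
%% app.config (fragment) -- riak_kv section
{riak_kv, [
    %% Maximum cached map results per vnode (LRU eviction beyond this).
    {vnode_cache_entries, 250}
]}
```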
09:58 <seancribbs> however, it would be best to benchmark your queries and see how much it improves things
09:58 <seancribbs> must measure to know
10:00 <jonas11235> ok, let me see if I understood it right: it will cache the map/reduce result and the vector clock for these objects; if the vector clock is still the same it will use the cache and not calculate again, and if the vector clock is outdated (or the element is not in the cache) it will calculate it again?
10:01 <seancribbs> not exactly
10:01 <seancribbs> only map results are cached
10:01 <jonas11235> and is this cache used in each map phase, or does it evaluate from the last phase back to the first?
10:02 <seancribbs> and when a new value is stored, then all entries for that key are purged
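The semantics seancribbs describes (a bounded per-vnode LRU cache of map results, with every entry for a key purged when a new value is stored under that key) can be sketched as follows. This is not Riak's actual Erlang implementation, just an illustration of that behaviour; the class and key layout are invented:

```python
from collections import OrderedDict

class MapResultCache:
    """Illustrative LRU cache for map-phase results, keyed by (object key, map fn).

    Not Riak's code -- a sketch of the behaviour described above:
    bounded size, least-recently-used eviction, purge-on-write.
    """

    def __init__(self, max_entries=1000):  # 1000 is the documented default
        self.max_entries = max_entries
        self.entries = OrderedDict()

    def get(self, key, map_fn):
        result = self.entries.get((key, map_fn))
        if result is not None:
            self.entries.move_to_end((key, map_fn))  # mark as recently used
        return result

    def put(self, key, map_fn, result):
        self.entries[(key, map_fn)] = result
        self.entries.move_to_end((key, map_fn))
        while len(self.entries) > self.max_entries:
            self.entries.popitem(last=False)  # evict least recently used

    def purge(self, key):
        # Called when a new value is stored under `key`: drop every cached
        # map result for that key, whatever map function produced it.
        for cache_key in [k for k in self.entries if k[0] == key]:
            del self.entries[cache_key]

cache = MapResultCache(max_entries=2)
cache.put("user/alice", "count_map", [("alice", 1)])
cache.put("user/bob", "count_map", [("bob", 1)])
cache.purge("user/alice")                    # a write to alice purges her entry
print(cache.get("user/alice", "count_map"))  # None (must re-run the map)
print(cache.get("user/bob", "count_map"))    # [('bob', 1)] (served from cache)
```

Purge-on-write is why no vector-clock check is needed at read time: a cached entry can only exist if the object has not changed since the map ran.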
10:03 <jonas11235> if it finds the last phase in the cache, will it skip evaluation of the earlier map phases?
10:03 <seancribbs> no, they are evaluated on a per-phase basis, if i understand your question
10:04 <jonas11235> I see: you don't cache the reduce phase because you don't have the vector clock (or actually you have more than one because of the aggregation)
10:04 <jonas11235> yes, you understood; that is one approach I was thinking of in terms of aggregation
10:04 <jonas11235> sorry, not aggregation but optimization
10:07 <seancribbs> reduce phases are run on a single node, and can't really be cached
10:07 <jonas11235> really?? you don't do distributed reduces?
10:07 <bingeldac> it is coming
10:07 <bingeldac> soon
10:07 <jonas11235> all nodes send the map results to the same node
10:07 <jonas11235> ah ok :)
10:08 <jonas11235> ok, I think it would be nice to put these questions about how the cache works somewhere in the project wiki
10:09 <jonas11235> I'm starting a new application; I need some complex queries and good scalability
10:11 <jonas11235> I'm evaluating couchdb and riak. couch already has all this caching sorted out, but it doesn't scale well: all documents have to be on all nodes (which actually makes it easy for them to implement the cache). riak rocks in terms of scalability, though, and I like it better
10:12 <jonas11235> but I can't afford it if the map/reduce starts to take too long
10:13 <jonas11235> I think you guys are going in the right direction
10:14 <jonas11235> I would only add to your wiki the map/reduce comparison and a roadmap for where you are going in terms of caching
10:16 <pharkmillups> jonas11235: great suggestions. thanks!
10:17 <jonas11235> no, thank YOU! you are doing a great job on this project