@PharkMillups
Created October 27, 2010 20:08
09:47 <jonas11235> hi ALL
09:47 <jonas11235> I have a question about how map/reduce works in riak
09:48 <jonas11235> does it recalculate all documents in the bucket every time I run the map/reduce, or
does it use the information in the vector clock to compute only the new values?
09:50 <jonas11235> in my context the end user can perform an action and I register each action
as a document. we will have lots of inserts (and this part is very critical), but I will need
to query how many actions each user has made
09:51 <jonas11235> will riak process the whole bucket again or just the new actions?
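
[For reference: a minimal sketch of JavaScript map and reduce functions that a "count actions per
user" job like this could use. The "user_id" field and the assumption that each action is stored as a
JSON document are illustrative only; Riak.mapValuesJson is one of Riak's built-in JavaScript helpers.]

    // map phase: decode the action document and emit {user_id: 1}
    // (assumes each action is stored as JSON with a "user_id" field -- hypothetical schema)
    function(value, keyData, arg) {
      var doc = Riak.mapValuesJson(value)[0];  // built-in helper that JSON-decodes the object value
      var count = {};
      count[doc.user_id] = 1;
      return [count];
    }

    // reduce phase: merge the per-user counts produced by the map phase (safe to re-reduce)
    function(values, arg) {
      var totals = {};
      values.forEach(function(v) {
        for (var user in v) {
          totals[user] = (totals[user] || 0) + v[user];
        }
      });
      return [totals];
    }
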
09:51 <seancribbs> jonas11235: there is a modest amount of caching for map results
09:51 <seancribbs> so your second request across the same data set with the same spec will be somewhat faster
09:52 <jonas11235> good that is what I need to know :)
09:52 <seancribbs> but it's not incremental like CouchDB's m/r
09:52 <jonas11235> what do you mean by "modest amount"? what limitations does this cache
implementation have?
09:53 <jonas11235> sorry about asking this here
09:53 <seancribbs> i believe there are some knobs you can turn… digging
09:53 <jonas11235> I looked for documentation and I didn't find anything about caching in
the map/reduce
09:54 <jonas11235> I did some digging; do you have any reference for where I should look?
09:54 <seancribbs> no, give me a moment
09:54 <jonas11235> ok, thank you
09:55 <jonas11235> I think this cache stuff is important in terms of application behavior
09:56 <seancribbs> right, so the default is 100 results per vnode (partition)
09:56 <jonas11235> the only reference I found was one paragraph in the release notes of 0.13
09:57 <jonas11235> "In addition to this, the caching layer for JavaScript MapReduce has been completely
re-implemented. This results in performance gains when repeating the same MapReduce jobs. Specifically,
this work includes a new in-memory vnode LRU cache solely for map operations. The size of the
cache is now configurable (via the 'vnode_cache_entries' entry in the riak_kv section of app.config) and
defaults to 1000 objects."
09:57 <seancribbs> right
09:57 <seancribbs> so in the riak_kv section of app.config
09:57 <seancribbs> you can set/change the vnode_cache_entries key
09:57 <seancribbs> {vnode_cache_entries, 250},
09:58 <seancribbs> etc
09:58 <seancribbs> however, it would be best to benchmark your queries, see how much it
improves things
09:58 <seancribbs> must measure to know
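
[For reference: a minimal app.config excerpt showing where the setting discussed above lives. The
value 250 mirrors the example given here; the 0.13 release notes quoted above cite a default of 1000.]

    %% app.config (excerpt) -- the vnode_cache_entries key lives in the riak_kv section
    {riak_kv, [
        %% other riak_kv settings omitted
        {vnode_cache_entries, 250}   %% max number of cached map results per vnode (partition)
    ]}
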
10:00 <jonas11235> ok, let me see if I understood it correctly: it will cache the map/reduce result
and the vector clock for these objects; if the vector clock is still the same it will use the cache and not
calculate again, and if the vector clock is outdated (or the element is not in the cache)
it will calculate it again?
10:01 <seancribbs> not exactly
10:01 <seancribbs> only map results are cached
10:01 <jonas11235> and is this cache used in each map phase, or does it evaluate from
the last phase back to the first?
10:02 <seancribbs> and when a new value is stored, then all entries for that key are purged
10:03 <jonas11235> if it finds the last map phase in the cache, will it skip evaluation of
the earlier map phases?
10:03 <seancribbs> no, they are evaluated on a per-phase basis, if i understand your question
10:04 <jonas11235> I see you don't cache the reduce phase because you don't have
the vector clock (or actually you have more than one because of the aggregation)
10:04 <jonas11235> y, you understood; that is one approach I was thinking of in terms
of aggregation
10:04 <jonas11235> sorry, not aggregation but optimization
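
[For reference: a sketch of a /mapred job spec illustrating the "per-phase" point above: each entry in
the "query" list is its own phase. The map function names MyApp.extractUser and MyApp.toCount are
hypothetical placeholders for functions loaded from a JavaScript source directory; Riak.reduceSum is a
built-in reduce that sums the numbers emitted by the last map phase.]

    {"inputs": "actions",
     "query": [
       {"map":    {"language": "javascript", "name": "MyApp.extractUser"}},
       {"map":    {"language": "javascript", "name": "MyApp.toCount"}},
       {"reduce": {"language": "javascript", "name": "Riak.reduceSum"}}
     ]}
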
10:07 <seancribbs> reduce phases are run on a single node, and can't
really be cached
10:07 <jonas11235> y?? you don't do distributed reduces?
10:07 <bingeldac> it is coming
10:07 <bingeldac> soon
10:07 <jonas11235> all nodes send the map results to the same node
10:07 <jonas11235> ah ok :)
10:08 <jonas11235> ok, I think it would be nice to put these questions about how the
cache works somewhere in the project wiki
10:09 <jonas11235> I'm starting a new application; I need some complex queries and good scalability
10:11 <jonas11235> I'm evaluating couchdb and riak. couch already has all this
caching sorted out, but it doesn't scale well, all documents have to be on all nodes
(which actually makes it easy for them to implement the cache), while riak rocks in terms of
scalability and I like it better
10:12 <jonas11235> but I can't afford it if the map/reduce starts to take too long
10:13 <jonas11235> I think you guys are going in the right direction
10:14 <jonas11235> I would only add to your wiki the map/reduce comparison and a roadmap
for where you are going in terms of caching
10:16 <pharkmillups> jonas11235: great suggestions. thanks!
10:17 <jonas11235> no, Thank You! you are doing a great job on this project