@PharkMillups
Created October 27, 2010 20:08
09:47 <jonas11235> hi ALL
09:47 <jonas11235> I have a question about how map/reduce works in riak
09:48 <jonas11235> does it recalculate all documents in the bucket every time I run the map/reduce, or
does it use the information in the vector clock to compute only the new values?
09:50 <jonas11235> in my context the end user can perform an action and I register each action
as a document. we will have lots of inserts (and this part is very critical), but I will need
to query how many actions each user has made
09:51 <jonas11235> will riak process the whole bucket again or just the new actions?
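
[For reference: a minimal sketch of JavaScript map and reduce functions that a "count actions per
user" job like this could use. The "user_id" field and the assumption that each action is stored as a
JSON document are illustrative only; Riak.mapValuesJson is one of Riak's built-in JavaScript helpers.]

    // map phase: decode the action document and emit {user_id: 1}
    // (assumes each action is stored as JSON with a "user_id" field -- hypothetical schema)
    function(value, keyData, arg) {
      var doc = Riak.mapValuesJson(value)[0];  // built-in helper that JSON-decodes the object value
      var count = {};
      count[doc.user_id] = 1;
      return [count];
    }

    // reduce phase: merge the per-user counts produced by the map phase (safe to re-reduce)
    function(values, arg) {
      var totals = {};
      values.forEach(function(v) {
        for (var user in v) {
          totals[user] = (totals[user] || 0) + v[user];
        }
      });
      return [totals];
    }
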
09:51 <seancribbs> jonas11235: there is a modest amount of caching for map results
09:51 <seancribbs> so your second request across the same data set with the same spec will be somewhat faster
09:52 <jonas11235> good that is what I need to know :)
09:52 <seancribbs> but it's not incremental like CouchDB's m/r
09:52 <jonas11235> what do you mean by "modest amount"? what limitations does this cache
implementation have?
09:53 <jonas11235> sorry about asking this here
09:53 <seancribbs> i believe there are some knobs you can turn… digging
09:53 <jonas11235> I looked for documentation and I didn't find anything about caching in
the map/reduce
09:54 <jonas11235> I did some digging; do you have any reference for where I should look?
09:54 <seancribbs> no, give me a moment
09:54 <jonas11235> ok, thank you
09:55 <jonas11235> I think this cache stuff is important in terms of application behavior
09:56 <seancribbs> right, so the default is 100 results per vnode (partition)
09:56 <jonas11235> the only reference I found was one paragraph in the release notes of 0.13
09:57 <jonas11235> "In addition to this, the caching layer for JavaScript MapReduce has been completely
re-implemented. This results in performance gains when repeating the same MapReduce jobs. Specifically,
this work includes a new in-memory vnode LRU cache solely for map operations. The size of the
cache is now configurable (via the 'vnode_cache_entries' entry in the riak_kv section of app.config) and
defaults to 1000 objects."
09:57 <seancribbs> right
09:57 <seancribbs> so in the riak_kv section of app.config
09:57 <seancribbs> you can set/change the vnode_cache_entries key
09:57 <seancribbs> {vnode_cache_entries, 250},
09:58 <seancribbs> etc
09:58 <seancribbs> however, it would be best to benchmark your queries, see how much it
improves things
09:58 <seancribbs> must measure to know
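
[For reference: a minimal app.config excerpt showing where the setting discussed above lives. The
value 250 mirrors the example given here; the 0.13 release notes quoted above cite a default of 1000.]

    %% app.config (excerpt) -- the vnode_cache_entries key lives in the riak_kv section
    {riak_kv, [
        %% other riak_kv settings omitted
        {vnode_cache_entries, 250}   %% max number of cached map results per vnode (partition)
    ]}
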
10:00 <jonas11235> ok, let me see if I understood it correctly: it will cache the map/reduce result
and the vector clock for these objects; if the vector clock is still the same it will use the cache and not
calculate again, and if the vector clock is outdated (or the element is not in the cache)
it will calculate it again?
10:01 <seancribbs> not exactly
10:01 <seancribbs> only map results are cached
10:01 <jonas11235> and is this cache used in each map phase, or does it evaluate from
the last phase back to the first?
10:02 <seancribbs> and when a new value is stored, then all entries for that key are purged
10:03 <jonas11235> if it finds the last map phase in the cache, will it skip evaluation of
the earlier map phases?
10:03 <seancribbs> no, they are evaluated on a per-phase basis, if i understand your question
10:04 <jonas11235> I see you don't cache the reduce phase because you don't have
the vector clock (or actually you have more than one because of the aggregation)
10:04 <jonas11235> y, you understood; that is one approach I was thinking of in terms
of aggregation
10:04 <jonas11235> sorry, not aggregation but optimization
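
[For reference: a sketch of a /mapred job spec illustrating the "per-phase" point above: each entry in
the "query" list is its own phase. The map function names MyApp.extractUser and MyApp.toCount are
hypothetical placeholders for functions loaded from a JavaScript source directory; Riak.reduceSum is a
built-in reduce that sums the numbers emitted by the last map phase.]

    {"inputs": "actions",
     "query": [
       {"map":    {"language": "javascript", "name": "MyApp.extractUser"}},
       {"map":    {"language": "javascript", "name": "MyApp.toCount"}},
       {"reduce": {"language": "javascript", "name": "Riak.reduceSum"}}
     ]}
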
10:07 <seancribbs> reduce phases are run on a single node, and can't
really be cached
10:07 <jonas11235> y?? you don't do distributed reduces?
10:07 <bingeldac> it is coming
10:07 <bingeldac> soon
10:07 <jonas11235> all nodes send the map results to the same node
10:07 <jonas11235> ah ok :)
10:08 <jonas11235> ok, I think it would be nice to put these questions about how the
cache works somewhere in the project wiki
10:09 <jonas11235> I'm starting a new application; I need some complex queries and good scalability
10:11 <jonas11235> I'm evaluating couchdb and riak. couch already has all this
caching sorted out, but it doesn't scale well, all documents have to be on all nodes
(which actually makes it easy for them to implement the cache), while riak rocks in terms of
scalability and I like it better
10:12 <jonas11235> but I can't afford it if the map/reduce starts to take too long
10:13 <jonas11235> I think you guys are going in the right direction
10:14 <jonas11235> I would only add to your wiki the map/reduce comparison and a roadmap
for where you are going in terms of caching
10:16 <pharkmillups> jonas11235: great suggestions. thanks!
10:17 <jonas11235> no, Thank You! you are doing a great job on this project