# Riak KV 0.14 will add key-filters to MapReduce queries. riak-client needs
# a nice and efficient syntax for this feature, please leave a comment with
# the format that you like best.
# For more info on how key-filters work see:
# http://www.slideshare.net/hemulen/riak-mapred-preso

# Preliminaries so you know what we're talking about
client = Riak::Client.new
mr = Riak::MapReduce.new(client)

# Option 1: method_missing hack to support key-filters by name
# Pros: simple to add new filters as they come available
# Cons: leaky abstraction, doesn't enforce need to be entire-bucket query
mr.add("bucket").tokenize("-", 3).string_to_int.between(2009,2010)

# Option 2: simple filter method
# Pros: feels more like phase additions
# Cons: doesn't enforce need to be entire-bucket query
mr.add("bucket").filter(:tokenize, "-", 3).filter(:string_to_int).filter(:between, 2009, 2010)

# Option 3: DSL-ish block syntax
# Pros: enforces entire-bucket query, encapsulation of filter sequence
# Cons: verbose, instance_eval can be ugly/problematic (instance_eval optional)
mr.filter("bucket") do
  tokenize "-", 3
  string_to_int
  between 2009, 2010
end

# Option 4: dumb pass-through
# Pros: simplest to implement
# Cons: hard to verify or constrain the format of inputs
mr.add("bucket", [[:tokenize, "-", 3],[:string_to_int],[:between, 2009, 2010]])
To me #4 looks ugly; #1 looks closest to Arel, which I like. Could you not use define_method instead of method_missing? Also, I agree that #3 would be cool regardless. It might be more verbose, but it's easy to read and see what's going on. Finally, you might want to investigate writing an outputter for Arel; since the rewrite, that's supposedly easier. You could output at least the simple where queries and such. Would be super cool!
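For illustration, here is a minimal sketch of what the define_method variant of Option 1 could look like. The class name `KeyFilterChain` and the fixed list of filter names are assumptions made for this example (the names are simply the ones used in the gist); none of this is riak-client's actual API.

```ruby
# Hypothetical stand-in for a MapReduce input builder; not riak-client's API.
class KeyFilterChain
  # Filter names taken from the examples in this gist; the real list may differ.
  FILTER_NAMES = [:tokenize, :string_to_int, :between].freeze

  attr_reader :bucket, :filters

  def initialize(bucket)
    @bucket = bucket
    @filters = []
  end

  FILTER_NAMES.each do |name|
    # Each generated method records [name, *args] and returns self for chaining.
    define_method(name) do |*args|
      @filters << [name, *args]
      self
    end
  end
end

chain = KeyFilterChain.new("bucket").tokenize("-", 3).string_to_int.between(2009, 2010)
p chain.filters
# prints [[:tokenize, "-", 3], [:string_to_int], [:between, 2009, 2010]]
```

Compared with method_missing, an unknown filter name raises NoMethodError immediately, at the cost of having to extend the list whenever a new filter appears.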
Option 4 seems reasonable.
I was looking at the current syntax without filters versus the new one with filters:
"inputs": [["alice1", "p1"], ["alice2", "p2"], ["alice3", "p5"]]
"inputs": [{
"bucket": "msft1",
"key_filters": [["tokenize", "-", 1], ["string_to_int"], ["between", 2009, 2010]]
},
{
"bucket": "msft2",
"key_filters": [["tokenize", "-", 1], ["string_to_int"], ["between", 2009, 2010]]
}]
I suppose that Riak KV 0.14 will stay backward compatible. But will it offer a new syntax for the first scenario? Like:
"inputs": {
"bucket": "alice",
"keys": ["p1", "p2", "p5"]
}
By the way, this new format does not allow specifying different buckets. It would be more like:
"inputs": [
{
"bucket": "alice",
"keys": ["p1", "p2", "p5"]
}
]
I personally prefer having a dedicated method for filtering and one for adding. Especially if you can use them in the same query.
Option 3 is very beautiful when you build a chain of filter/add/map/reduce methods.
I don't know if multiple buckets can be given as inputs. If so, then options 1 and 2 would hardly work, unless you can do something like:
mr.add("bucket1").tokenize("-", 3).string_to_int.between(2009,2010).add("bucket2").tokenize("-", 3).string_to_int.between(2009,2010)
I think that the first two options are weird as they add information to sub-instructions. I mean that when you call "add" and "filter", it will update the "inputs" key, and when you call "map", it will update the "map" key.
Anyway :) I would vote for a mix of options 3 and 4. It's not so "beautiful", but it does not stray far from the final JSON. When you do some debugging with curl, it's still useful to be able to translate the Ruby code you have into JSON instructions without too much thinking.
mr.filter("bucket", [[:tokenize, "-", 3], [:string_to_int], [:between, 2009, 2010]])
mr.filter("bucket", [[:tokenize, "-", 3], [:string_to_int], [:between, 2009, 2010]]).add("bucket2", "item1")
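To make the point about staying close to the JSON concrete, here is a rough sketch of how the pass-through arrays could be turned into a request body for debugging with curl. The helper name `key_filter_inputs` is made up for this example, and it assumes the single-bucket form with "bucket" and "key_filters" keys quoted earlier in this thread; the final Riak KV 0.14 format may differ.

```ruby
require 'json'

# Hypothetical helper: builds the "inputs" value for one key-filtered bucket.
# Assumes the {"bucket": ..., "key_filters": [...]} shape quoted in this thread.
def key_filter_inputs(bucket, filters)
  {
    "bucket" => bucket,
    # Convert symbol filter names to the plain strings used in the JSON form.
    "key_filters" => filters.map { |name, *args| [name.to_s, *args] }
  }
end

filters = [[:tokenize, "-", 3], [:string_to_int], [:between, 2009, 2010]]
json = JSON.generate("inputs" => key_filter_inputs("bucket", filters))
puts json
# prints {"inputs":{"bucket":"bucket","key_filters":[["tokenize","-",3],["string_to_int"],["between",2009,2010]]}}
```

Since the Ruby arrays and the JSON arrays have the same shape, the conversion is a single map over the filter list.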
Or I would vote for option 2, if the previous argument is invalid. But I feel that option 2 will never be able to reproduce every possibility. I had that problem with the MongoMapper DSL.
It seems too dangerous to build an Arel-like DSL now, because I expect the Riak KV query syntax to evolve a lot. That's not the case for SQL.
Yeah, I would also vote for a mix of 3 and 4. Though something more Arel-like would be nice, I agree with nfo that the Riak KV syntax will evolve quite a bit. Option 2 looks pretty nice too, though.
@nfo: I was told that you can only use one bucket, and you either have a bucket with filters, or the previously available options (including the riak_search invocation). But I'm sure Kevin would appreciate the other ideas for his future improvements.
Implementation is started on a branch: https://github.com/seancribbs/ripple/compare/master...key-filters
@bbhoss: define_method prevented a number of stack level too deep errors I got, and feels cleaner (although I might switch to class_eval since it's faster).
Current feeling - use #4, but have #3 as syntactic sugar on top.
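As a rough sketch of that layering (not the code on the branch): the block form collects a filter sequence and then hands it to the pass-through form. `FilterBuilder` and `SketchMapReduce` are hypothetical names invented for this example, and the filter list is just the one used in the gist. The block is only instance_eval'd when it takes no argument, which is one way to keep instance_eval optional.

```ruby
# Hypothetical stand-ins; class and method names are illustrative only
# and are not riak-client's actual API.
class FilterBuilder
  # Filter names taken from the examples above; the real set may differ.
  [:tokenize, :string_to_int, :between].each do |name|
    define_method(name) do |*args|
      sequence << [name, *args]
      self
    end
  end

  # The filters recorded so far, in call order.
  def sequence
    @sequence ||= []
  end
end

class SketchMapReduce
  attr_reader :inputs

  # Option 4: store the filter arrays as given.
  def add(bucket, filters)
    @inputs = { "bucket" => bucket, "key_filters" => filters }
    self
  end

  # Option 3: collect filters from a block, then delegate to #add.
  def filter(bucket, &block)
    builder = FilterBuilder.new
    if block.arity == 1
      block.call(builder)            # block takes the builder explicitly
    else
      builder.instance_eval(&block)  # bare filter names inside the block
    end
    add(bucket, builder.sequence)
  end
end

mr = SketchMapReduce.new
mr.filter("bucket") do
  tokenize "-", 3
  string_to_int
  between 2009, 2010
end

explicit = SketchMapReduce.new
explicit.filter("bucket") { |f| f.tokenize("-", 3); f.string_to_int; f.between(2009, 2010) }
```

Both calls end up with the same inputs as calling `add("bucket", [[:tokenize, "-", 3], [:string_to_int], [:between, 2009, 2010]])` directly, so the sugar adds no new state of its own.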