@seancribbs
Created December 7, 2010 17:10
# Riak KV 0.14 will add key-filters to MapReduce queries. riak-client needs
# a nice and efficient syntax for this feature, please leave a comment with
# the format that you like best.
# For more info on how key-filters work see:
# http://www.slideshare.net/hemulen/riak-mapred-preso
# Preliminaries so you know what we're talking about
client = Riak::Client.new
mr = Riak::MapReduce.new(client)
# Option 1: method_missing hack to support key-filters by name
# Pros: simple to add new filters as they become available
# Cons: leaky abstraction, doesn't enforce need to be entire-bucket query
mr.add("bucket").tokenize("-", 3).string_to_int.between(2009,2010)
# Option 2: simple filter method
# Pros: feels more like phase additions
# Cons: doesn't enforce need to be entire-bucket query
mr.add("bucket").filter(:tokenize, "-", 3).filter(:string_to_int).filter(:between, 2009, 2010)
# Option 3: DSL-ish block syntax
# Pros: enforces entire-bucket query, encapsulation of filter sequence
# Cons: verbose, instance_eval can be ugly/problematic (instance_eval optional)
mr.filter("bucket") do
  tokenize "-", 3
  string_to_int
  between 2009, 2010
end
# Option 4: dumb pass-through
# Pros: simplest to implement
# Cons: hard to verify or constrain the format of inputs
mr.add("bucket", [[:tokenize, "-", 3],[:string_to_int],[:between, 2009, 2010]])
@seancribbs
Author

Current feeling - use #4, but have #3 as a syntax sugar on top.
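A minimal sketch of how #3 could sit on top of #4: a block is evaluated against a small builder that records each call as an array in the pass-through format. `FilterBuilder` and its `build` method are hypothetical names for illustration, not riak-client API, and the filter list only covers the three filters used in this gist.

```ruby
# Hypothetical helper, not part of riak-client: records each filter call
# as an array in the shape used by Option 4.
class FilterBuilder
  # Only the filter names that appear in this gist; the real list is an assumption.
  FILTERS = [:tokenize, :string_to_int, :between].freeze

  attr_reader :sequence

  def initialize
    @sequence = []
  end

  FILTERS.each do |name|
    define_method(name) do |*args|
      @sequence << [name, *args]
      self # returning self also allows chained calls
    end
  end

  # Evaluate the block against a fresh builder and return the raw sequence.
  def self.build(&block)
    builder = new
    builder.instance_eval(&block)
    builder.sequence
  end
end

filters = FilterBuilder.build do
  tokenize "-", 3
  string_to_int
  between 2009, 2010
end
```

The resulting `filters` array could then be handed to whatever method implements #4, so the block form stays a thin layer over the raw format.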

@bbhoss

bbhoss commented Dec 7, 2010

To me #4 looks ugly; #1 looks closest to Arel, which I like. Could you not use define_method instead of method_missing? Also, I agree that #3 would be cool regardless. It might be more verbose, but it's easy to read and see what's going on. Finally, you might want to investigate writing an outputter for Arel; since the rewrite, that is supposedly easier. You could output at least the simple where queries and such. Would be super cool!
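To make the define_method versus method_missing question concrete, here is a rough comparison. Both classes are hypothetical and exist only for illustration. The method_missing version accepts any method name, including typos, while the define_method version only responds to the filters it was given.

```ruby
# Hypothetical classes for illustration; neither is riak-client code.
class MissingFilters
  attr_reader :filters

  def initialize
    @filters = []
  end

  # Any unknown method name is recorded as a filter.
  def method_missing(name, *args)
    @filters << [name, *args]
    self
  end

  def respond_to_missing?(name, include_private = false)
    true
  end
end

class DefinedFilters
  attr_reader :filters

  def initialize
    @filters = []
  end

  # Only the names listed here become filter methods.
  [:tokenize, :string_to_int, :between].each do |name|
    define_method(name) do |*args|
      @filters << [name, *args]
      self
    end
  end
end

# The misspelled filter name is silently accepted.
loose = MissingFilters.new.tokenize("-", 3).string_to_itn

strict = DefinedFilters.new.tokenize("-", 3).string_to_int.between(2009, 2010)

# The same typo fails immediately with define_method.
typo_caught = false
begin
  DefinedFilters.new.string_to_itn
rescue NoMethodError
  typo_caught = true
end
```

The trade-off is that the define_method list has to be updated whenever a new filter is added, which is the convenience Option 1 was trying to keep.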

@danoyoung

Option 4 seems reasonable.

@nfo

nfo commented Dec 7, 2010

I was looking at the current syntax without filters versus the new one with filters:

"inputs": [["alice1", "p1"], ["alice2", "p2"], ["alice3", "p5"]]

"inputs": [{
  "bucket": "msft1",
  "key_filters": [["tokenize", "-", 1], ["string_to_int"], ["between", 2009, 2010]]
},
{
  "bucket": "msft2",
  "key_filters": [["tokenize", "-", 1], ["string_to_int"], ["between", 2009, 2010]]
}]

I suppose that Riak KV 0.14 will keep backward compatibility. But will it offer a new syntax for the first scenario? Like:

"inputs": {
  "bucket": "alice",
  "keys": ["p1", "p2", "p5"]
}

By the way, this new format does not allow giving different buckets. It would be more like:

"inputs": [
  {
    "bucket": "alice",
    "keys": ["p1", "p2", "p5"]
  }
]

I personally prefer having a dedicated method for filtering and one for adding. Especially if you can use them in the same query.

Option 3 is very beautiful when you build a chain of filter/add/map/reduce methods.

I don't know if multiple buckets can be given as inputs. If so, then options 1 and 2 would hardly work, unless you can do something like:

mr.add("bucket1").tokenize("-", 3).string_to_int.between(2009,2010).add("bucket2").tokenize("-", 3).string_to_int.between(2009,2010)

I think that the first two options are weird as they add information to sub-instructions. I mean that when you call "add" and "filter", it will update the "inputs" key, and when you call "map", it will update the "map" key.

Anyway :) I would vote for a mix of options 3 and 4. It's not so "beautiful", but it does not stray too far from the final JSON. When you do some debugging with curl, it's still useful to be able to translate the Ruby code you have into JSON instructions without too much thinking.

mr.filter("bucket", [[:tokenize, "-", 3], [:string_to_int], [:between, 2009, 2010]])
mr.filter("bucket", [[:tokenize, "-", 3], [:string_to_int], [:between, 2009, 2010]]).add("bucket2", "item1")
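As a rough illustration of how close the array form is to the JSON, the sketch below serializes a filter list with Ruby's standard `json` library. `key_filter_inputs` is a hypothetical helper, and the single-hash shape of `"inputs"` is an assumption based on the `"bucket"` / `"key_filters"` keys quoted above.

```ruby
require 'json'

# Hypothetical helper: wraps a bucket name and a filter list in a hash with
# the "bucket" and "key_filters" keys quoted earlier in this thread.
def key_filter_inputs(bucket, filters)
  { "bucket" => bucket, "key_filters" => filters }
end

filters = [[:tokenize, "-", 3], [:string_to_int], [:between, 2009, 2010]]

# Symbols are serialized as JSON strings, so the Ruby arrays map one-to-one
# onto the JSON arrays.
payload = JSON.generate("inputs" => key_filter_inputs("bucket", filters))
puts payload
```

The printed string can be pasted into a curl request body alongside the query phases when debugging by hand.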

Or I would vote for option 2, if the previous argument is invalid. But I feel that option 2 will never be able to reproduce every possibility. I had that problem with the MongoMapper DSL.

@nfo

nfo commented Dec 7, 2010

It seems too dangerous to build an Arel-like DSL now, because the Riak KV query syntax will evolve a lot, I guess. That's not the case for SQL.

@johnae

johnae commented Dec 7, 2010

Yeah, I would also vote for a mix of 3 and 4. Though something more Arel-like would be nice, I agree with nfo that the Riak KV syntax will evolve quite a bit. Option 2 looks pretty nice too, I think.

@seancribbs
Author

@nfo: I was told that you can only use one bucket, and you either have a bucket with filters, or the previously available options (including the riak_search invocation). But I'm sure Kevin would appreciate the other ideas for his future improvements.
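One way a client could enforce that restriction is to refuse to mix the two kinds of input. This is only a sketch: `Inputs` is a hypothetical stand-in for the inputs part of a MapReduce job, not the riak-client class, and the error messages are made up.

```ruby
# Hypothetical stand-in for the inputs portion of a MapReduce job.
class Inputs
  attr_reader :value

  def initialize
    @value = nil
  end

  # Explicit bucket/key pairs; rejected once a filtered bucket is set.
  def add(bucket, key)
    if @value.is_a?(Hash)
      raise ArgumentError, "cannot mix explicit keys with a key-filtered bucket"
    end
    @value ||= []
    @value << [bucket, key]
    self
  end

  # A single bucket with key filters; must be the only input.
  def filter(bucket, filters)
    unless @value.nil?
      raise ArgumentError, "key filters must be the only input"
    end
    @value = { "bucket" => bucket, "key_filters" => filters }
    self
  end
end

inputs = Inputs.new.filter("bucket", [[:tokenize, "-", 3], [:string_to_int]])

# Adding a bucket/key pair after a filter is rejected.
rejected = nil
begin
  inputs.add("bucket2", "item1")
rescue ArgumentError => e
  rejected = e.message
end
```

Under this sketch, chaining `filter` and then `add` in one query fails early on the client instead of producing inputs the server would not accept.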

@seancribbs
Author

Implementation has started on a branch: https://github.com/seancribbs/ripple/compare/master...key-filters

@bbhoss: define_method prevented a number of stack level too deep errors I got, and feels cleaner (although I might switch to class_eval since it's faster).
