@PharkMillups
Created June 24, 2010 12:18
danoyoung # when passing in keys into a map/reduce job, is there a way to
send in a wildcard pattern, something like 2009_* as the key argument?
seancribbs # danoyoung: no, but you can send a whole bucket
danoyoung # I have a bucket called nsidc_0452 and store keys for
various years of data, i.e. 1998->2009
seancribbs # and then filter out ones you don't want
danoyoung # yea, I know I can send the whole bucket...I was just curious
if I could do a bucket,key (wildcard) combo.... ok, thanx sean
justinsheehy # if the bucket is much bigger than the set you want, you
might (or might not) do better by having a streaming list_keys operation
sent to a process that filters keys, which then streams the smaller set to a MR job.
seancribbs # ^^ justinsheehy
justinsheehy # since that won't require loading the objects from disk before filtering
danoyoung # hmm..interesting, yes...I'll look into that
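[editor's sketch] A rough Python illustration of the approach justinsheehy describes: filter an already-listed set of keys against a wildcard such as 2009_*, then hand only the matches to a map/reduce job as explicit bucket/key inputs. The helper names (select_keys, build_mapreduce_job) and the sample keys are hypothetical. The job layout is assumed to follow Riak's JSON map/reduce format, with Riak.mapValuesJson as the map function. How the keys get listed (REST or Protocol Buffers) is left out.

```python
import fnmatch
import json


def select_keys(keys, pattern):
    """Yield only the keys matching a shell-style wildcard such as '2009_*'."""
    for key in keys:
        # fnmatchcase compares case-sensitively on every platform.
        if fnmatch.fnmatchcase(key, pattern):
            yield key


def build_mapreduce_job(bucket, keys):
    """Build a map/reduce job body whose inputs are explicit [bucket, key] pairs.

    Assumption: the body follows Riak's JSON map/reduce format and would be
    submitted to the cluster's map/reduce endpoint by whatever client is in use.
    """
    return {
        "inputs": [[bucket, key] for key in keys],
        "query": [
            {"map": {"language": "javascript", "name": "Riak.mapValuesJson"}}
        ],
    }


# Hypothetical keys, standing in for the output of a list_keys call.
listed = ["2008_365_a", "2009_001_a", "2009_002_b"]
job = build_mapreduce_job("nsidc_0452", select_keys(listed, "2009_*"))
print(json.dumps(job, indent=2))
```

Since select_keys is a generator, it can consume keys as they stream in rather than waiting for the full list, and no object is read from disk until the job runs on the smaller set.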
danoyoung # we're going to have potentially hundreds of thousands of
objects in the bucket...not sure yet if I should break out the data by
year or not, something like nsidc_0452_2009, nsidc_0452_2008, etc....
danoyoung # is the list_keys functionality only in erlang client?
drev1 # keys can be listed from the REST API and proto buffs client
danoyoung # cool, does anyone know if there's a proto buff client for ruby yet?
drev1 # someone has been working on one - http://github.com/aitrus/riak-pbclient
danoyoung # got it, just searched github and saw this...thanx.
seancribbs # danoyoung: yeah, you might also look into generating the
key-list if you have adequate information about what the keys will be named
danoyoung # what we have for this current dataset (and they're all different)
is satellite measurements per day, and each day the satellite generates
eight different measurements for the area of coverage we're interested in...i.e. Greenland
danoyoung # so I'm thinking I need a key that can be unique enough for a
given day...so I was fooling around with something like 2009_001_<uuid>...but
that really doesn't give me enough info on the key to pass it onto a m/r
function...I would need to know the uuid ahead of time.
drev1 # danoyoung: is your data set always going to be 8 readings a day per area?
seancribbs # ooh here's an idea
keep some meta records about your sources of info
they might tell what periods of time are available, etc
then you could use those in an initial map phase to generate the keys
for the actual data
danoyoung # for at least this dataset, yes.
yea, I was keeping data w/n the json struct with the year, day, etc....
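[editor's sketch] One way seancribbs's meta-record idea could look, in Python. Everything here is an assumption made for illustration: the shape of the meta record, the keys_from_meta helper, and above all the key scheme, which swaps the <uuid> suffix for a reading index (1 through 8) so that keys are predictable. The chat suggests generating the keys in an initial map phase; this version does the same expansion client-side instead.

```python
def keys_from_meta(meta, readings_per_day=8):
    """Expand a meta record into the data keys it describes.

    Assumed key scheme: <year>_<day-of-year, zero padded>_<reading index>.
    """
    keys = []
    for period in meta["periods"]:
        # Day ranges are inclusive on both ends.
        for day in range(period["first_day"], period["last_day"] + 1):
            for reading in range(1, readings_per_day + 1):
                keys.append(f"{period['year']}_{day:03d}_{reading}")
    return keys


# Hypothetical meta record: which periods of data exist for this source.
meta = {
    "bucket": "nsidc_0452",
    "periods": [{"year": 2009, "first_day": 1, "last_day": 2}],
}

keys = keys_from_meta(meta)
inputs = [[meta["bucket"], key] for key in keys]
print(len(inputs), inputs[0], inputs[-1])
```

With predictable keys like these, the inputs list can be built without listing the bucket at all, which is the appeal of the approach when the bucket holds hundreds of thousands of objects.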