@PharkMillups
Created June 30, 2010 14:03
sanity # I'm evaluating Riak for an application where I need to store tens of millions of
JSON objects (each about 1k in size) and then retrieve all of them at a later time. Can
Riak do something like this efficiently (retrieve very large numbers of objects and stream
them to a client for processing)?
benblack # you want to regularly retrieve every single object at once?
sanity # benblack: yes
benblack # do you access them individually other than that?
sanity # benblack: the data is the input to a machine learning algorithm
sanity # benblack: perhaps, although I could probably avoid it if necessary
benblack # then i don't know why you are talking about using a database at all
shove them straight into hdfs, process with hadoop as needed
sanity # benblack: hmmm
benblack: I'm sure there was a reason I couldn't do that, but right now I can't think of one :-)
benblack: hypothetically though, how would Riak perform if I did use it in that way?
benblack # full scans are inefficient on every useful database i've ever used. you
would just be adding overhead.
sanity # ah wait, i remember the issue - I need to be able to go back and update
items in the training data
benblack: i did something similar in mysql and it worked ok
benblack # then your data set is small
sanity # benblack: it was about 5 million rows
benblack # then your data set is small
sanity # benblack: similar order of magnitude to the application I described at the outset
benblack # at that scale, almost anything will work.
sanity # benblack: so basically you are saying that Riak would not deal well with
retrieving tens of millions of objects in a single GET request?
benblack # yeah
sanity # benblack: is that because of the hashing? the objects would be spread
across multiple nodes at random?
sanity # benblack: any idea if Cassandra would be better at it?
benblack # might be. still doesn't sound like you need a database, though.
sanity # benblack: without a database how do i update specific entries in the training set?
benblack # by having each one in a different file?
sanity # benblack: tens of millions, where each file is about 1k in size? would
hdfs deal with that effectively?
benblack # no idea.
sanity # benblack: ok, thanks for your help
allen joined #riak
benblack # http://www.cloudera.com/blog/2009/02/the-small-files-problem/
so, you chunk your total training set, and update whole chunks at once
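
A minimal sketch of that chunking idea, assuming newline-delimited JSON records that carry a unix 'timestamp' field (the field name and output directory are illustrative; uploading the finished chunks into HDFS is left out):

    import json
    import os
    from collections import defaultdict
    from datetime import datetime

    def chunk_impressions(records, out_dir):
        # Group ~1 KB JSON impression records into one file per hour so HDFS
        # sees a handful of large files instead of millions of tiny ones.
        buckets = defaultdict(list)
        for rec in records:
            # assumes each record carries a unix 'timestamp' field
            hour = datetime.utcfromtimestamp(rec["timestamp"]).strftime("%Y-%m-%dT%H")
            buckets[hour].append(rec)

        os.makedirs(out_dir, exist_ok=True)
        for hour, recs in buckets.items():
            # updating the training set means rewriting a whole hourly chunk,
            # never touching individual 1 KB records
            with open(os.path.join(out_dir, f"impressions-{hour}.json"), "w") as f:
                for rec in recs:
                    f.write(json.dumps(rec) + "\n")
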
seancribbs # sanity: yes, a filesystem would be easiest for that (whether HDFS or something else)
"filesystems - the original NoSQL"
benblack # indeed
seancribbs # so… in general I agree with benblack. however, if you still want to use riak,
we can discuss how to tackle your problem
sanity # seancribbs: i don't think HDFS is appropriate, its default minimum block size is 64MB :-/
seancribbs # ah
sanity # seancribbs: while evaluating Cassandra I saw that you can specify a way for it to sort internally by keys,
and then you can do range queries on the keys. I assume that isn't possible with Riak given
that it locates stuff by hashing the key? i'd like to be able to say "give me every object from this
date and time to this date and time" but it seems like in Riak that would require an exhaustive search
seancribbs # sanity: you would do that with map-reduce of some sort
sanity # seancribbs: right, but it would basically need to check every value to see whether it is
within the range - right?
seancribbs # if you know the granularity/range ahead of time, it'll be easier
sanity: maybe
sanity # seancribbs: it could be an hourly granularity, for example
seancribbs # where you would only want a single hour?
sanity # seancribbs: no, you might want 48 hours, for example
seancribbs # ah and the data itself doesn't directly correspond one-to-one to a given range/timestamp?
sanity # the data is basically impressions on a website
sanity # or, more specifically, ad impressions
seancribbs # ok
sanity # we're building an ad network and we have a machine learning algorithm
(similar to a neural network) that we need to feed several days worth of impression/click/conversion
data to, so it can learn to predict which ad will make us the most money :-)
seancribbs # so, some of this will be solved with riak search. but that can't help you right now
sounds cool
sanity # seancribbs: riak search? that is a feature currently being worked on?
benblack # what you are describing is a classic hadoop problem
seancribbs # benblack +1
sanity: yeah, hadoop is going to be better at large queries
benblack # seems like you just need to figure out a way to slightly abstract how training set
updates happen so you don't create millions of tiny files in hdfs
sanity # our machine learning algorithm is fairly CPU intensive, we can't really
write it with Javascript :-)
benblack # the link i gave earlier is really useful on that point
seancribbs # hadoop is Java and a bunch of higher-level languages
sanity # seancribbs: but Hadoop isn't a database
benblack # you really aren't describing a database problem, my man
seancribbs # sanity: I don't see any way you're not either going to have a huge index or
need hadoop's processing power
sanity # the processing can be done on a single machine in a few hours - that isn't what
worries me. i just need somewhere to keep the data, such that it can be updated efficiently, and
retrieved en-bulk when needed
benblack # sanity: did you read that article i linked to on hdfs small files?
sanity # benblack: sorry, overlooked it - let me look now
benblack # it describes several options i suspect at least one could work for you
sanity # benblack: ok, thanks - I'll investigate. i just wish I could use a single solution for my
persistence needs (both the training data, and other more conventional data that I think
would fit perfectly into the Riak paradigm)
seancribbs # and also, spooling up data from a chunk of time and simply storing that as a single
object might help. if your smallest granularity is an hour, make 1-hour objects for each ad/network/whatever
then you can derive the keys
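
A sketch of what "derive the keys" could look like against Riak's HTTP interface, assuming hourly objects are stored under keys like 2010-06-30T14 in an "impressions" bucket (bucket, host, and key format are assumptions, not anything from the conversation):

    import json
    import urllib.error
    import urllib.request
    from datetime import datetime, timedelta

    RIAK = "http://localhost:8098/riak"  # assumed default Riak HTTP interface

    def hourly_keys(end, hours):
        # derive the keys for the `hours` one-hour objects ending at `end`
        return [(end - timedelta(hours=i)).strftime("%Y-%m-%dT%H") for i in range(hours)]

    def fetch_window(bucket, end, hours=48):
        # GET each hourly object by its derived key; hours with no data (404) are skipped
        records = []
        for key in hourly_keys(end, hours):
            try:
                with urllib.request.urlopen(f"{RIAK}/{bucket}/{key}") as resp:
                    records.extend(json.load(resp))
            except urllib.error.HTTPError as err:
                if err.code != 404:
                    raise
        return records

    # e.g. the last 48 hours of ad impressions for the learner
    training_data = fetch_window("impressions", datetime.utcnow(), hours=48)
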
benblack # sanity: part of nosql is using the right tool for the job. that means multiple tools.
sanity # benblack: true, although it's better to use a single tool if possible
benblack # if there is a good fit in riak, you should definitely use it. if hadoop/hdfs works,
you get the full hadoop stack, which is pretty compelling for machine learning.
seancribbs # sanity: i think that's a dangerous thing to say
benblack # sanity: i would not agree with that.
seancribbs # although we love Riak, i'd never say it fits everybody's use-case
sanity # seems like HBase might be effective - "Hbase .... is a good choice if you need to do
MapReduce style streaming analyses with the occasional random look up" -
from http://www.cloudera.com/blog/2009/02/the-small-files-problem/
benblack # forcing things to fit the wrong paradigm is exactly what's wrong with a
lot of how relational databases are abused
seancribbs # ^^
sanity # yes, obviously you don't want to fit a square peg into a round hole - but at the same
time you don't want to use 15 different technologies in a project if you can help it
it's a compromise
sanity # in this case I'm trying to figure out if using Riak for all my persistence
needs is a realistic compromise, but it sounds like it isn't
seancribbs # sanity: truth is, riak just doesn't do range queries
(without data gymnastics or expensive queries). if that's key to your problem, don't pick it
sanity # seancribbs: no, that isn't so important. it's more the issue of being able to
stream very large numbers of results out of riak
seancribbs # sanity: if you require them in order, it's still a problem
if you don't, then it might be doable
m/r and link-walking queries stream
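
For reference, a hedged sketch of such a query against Riak's /mapred HTTP endpoint, filtering a whole bucket by timestamp; the bucket name, timestamp field, and window are illustrative, and the whole-bucket input is exactly the expensive scan discussed above:

    import json
    import urllib.request

    # map phase runs against every object in the bucket; it parses the stored JSON
    # and keeps only records inside the requested window (timestamp field assumed)
    MAP_JS = """
    function(value, keyData, arg) {
      var rec = Riak.mapValuesJson(value)[0];
      if (rec.timestamp >= arg.from && rec.timestamp < arg.to) { return [rec]; }
      return [];
    }
    """

    query = {
        "inputs": "impressions",  # whole-bucket input: this is the expensive full scan
        "query": [{"map": {"language": "javascript",
                           "source": MAP_JS,
                           "arg": {"from": 1277769600, "to": 1277942400}}}],
    }

    req = urllib.request.Request(
        "http://localhost:8098/mapred",
        data=json.dumps(query).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        results = json.load(resp)
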
sanity # seancribbs: no, order is irrelevant to the machine learning algo
seancribbs # ok then
but you still have to filter them according to timestamp?
sanity # seancribbs: except the machine learning algorithm only cares about relatively
recent data - so i'll need to archive older data somehow
seancribbs # ok, this is why i was suggesting spooling up the data
sanity # seancribbs: only at a high level of granularity - perhaps per-day
seancribbs # hmm
sanity # what do you mean by "spooling up"?
seancribbs # as in, put multiple impressions in the same object, collected over the time period
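
A sketch of that spooling step, matching the hourly key scheme from the earlier sketch: buffer an hour's impressions and PUT them as one JSON object (the sample record fields are made up; concurrent writers and conflict handling are ignored):

    import json
    import urllib.request
    from datetime import datetime

    RIAK = "http://localhost:8098/riak"  # assumed default Riak HTTP interface

    def store_hourly_batch(bucket, hour_key, impressions):
        # write one hour's worth of impressions as a single JSON object,
        # e.g. key "2010-06-30T14" holding a list of impression dicts
        req = urllib.request.Request(
            f"{RIAK}/{bucket}/{hour_key}",
            data=json.dumps(impressions).encode(),
            headers={"Content-Type": "application/json"},
            method="PUT",
        )
        urllib.request.urlopen(req).close()

    batch = [{"ad_id": 42, "timestamp": 1277906580, "clicked": False}]  # made-up record
    store_hourly_batch("impressions", datetime.utcnow().strftime("%Y-%m-%dT%H"), batch)
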
sanity # seancribbs: does Riak do a good job with very large objects?
seancribbs # how large is "very large"
sanity # seancribbs: tens or hundreds of MBs?
seancribbs # tens are fine (although slow). hundreds will be a problem until
our "big files" feature lands
sanity # seancribbs: in what way would it be slow?
seancribbs # it's replicated… by default 3 replicas
sanity # seancribbs: ie. slow like "it would work but I wouldn't recommend it"?
seancribbs: but you could have 0 replicas if it wasn't critical data, right?
seancribbs # and decisions about the state of the object presented to the client are
mediated by retrieving it from every replica. you mean 1 replica
sanity # seancribbs: if there is only one of them then it's not a replica :-)
benblack # sanity: all copies are replicas, there is no master.
even if there is 1, it's a replica.
seancribbs # what benblack said
benblack # think of it as the platonic view of distributed storage
there is an idealized version of your object
sanity # benblack: we can debate the plain meaning of replica, but there isn't much point
benblack # everything in the system is a mere copy
sanity: i am giving you the operative definition for all dynamo systems. probably useful.
seancribbs # sanity: anyway, the current upper-limit for object sizes is 64MB
sanity # benblack: i'm thinking in terms of the english language definition of the
word, but it doesn't matter
benblack # sanity: don't worry, pedantic nerds are definitely welcome here, i'm proof.
sanity # seancribbs: can map-reduce be used to efficiently modify such objects
(if they are JSON)? ie. without completely replacing them just to update one field?
seancribbs # sanity: not recommended
map-reduce in Riak is for querying
sanity # seancribbs: ah, ok - good to know
sanity # seancribbs: but can the output of a map-reduce be inserted back into Riak?
seancribbs: ...without going through the client
seancribbs # sanity: yes, but again it's not recommended
for a number of reasons
sanity # seancribbs: what are they?
seancribbs # 1) map-reduce queries have timeouts which don't include the timeouts of the get/put processes
2) your whole query might crash if something doesn't work out in one function
3) reduce phases aren't run only once
and I'm sure there's other reasons
sanity # gotcha
seancribbs # the primary reason is latency, though.
(#1)
sanity # ok, well thank you for your help, I think I have a good picture of Riak's
pros and cons now for my application. Hopefully I'll be able to find a way to use
it because it really seems neat. At this point I think it may have to be Cassandra though :-/