Created
June 30, 2010 14:03
sanity # I'm evaluating Riak for an application where I need to store tens of millions of JSON objects (each about 1k in size) and then retrieve all of them at a later time. Can Riak do something like this efficiently (retrieve very large numbers of objects and stream them to a client for processing)?
benblack # you want to regularly retrieve every single object at once?
sanity # benblack: yes
benblack # do you access them individually other than that?
sanity # benblack: the data is the input to a machine learning algorithm
sanity # benblack: perhaps, although I could probably avoid it if necessary
benblack # then i don't know why you are talking about using a database at all. shove them straight into hdfs, process with hadoop as needed
sanity # benblack: hmmm
benblack: I'm sure there was a reason I couldn't do that, but right now I can't think of one :-)
benblack: hypothetically though, how would Riak perform if I did use it in that way?
benblack # full scans are inefficient on every useful database i've ever used. you would just be adding overhead.
sanity # ah wait, i remember the issue - I need to be able to go back and update items in the training data
benblack: i did something similar in mysql and it worked ok
benblack # then your data set is small
sanity # benblack: it was about 5 million rows
benblack # then your data set is small
sanity # benblack: similar order of magnitude to the application I described at the outset
benblack # at that scale, almost anything will work.
sanity # benblack: so basically you are saying that Riak would not deal well with retrieving tens of millions of objects in a single GET request?
benblack # yeah
sanity # benblack: is that because of the hashing? the objects would be spread across multiple nodes at random?
sanity # benblack: any idea if Cassandra would be better at it?
benblack # might be. still doesn't sound like you need a database, though.
sanity # benblack: without a database how do i update specific entries in the training set?
benblack # by having each one in a different file?
sanity # benblack: tens of millions, where each file is about 1k in size? would hdfs deal with that effectively?
benblack # no idea.
sanity # benblack: ok, thanks for your help
allen joined #riak
benblack # http://www.cloudera.com/blog/2009/02/the-small-files-problem/
so, you chunk your total training set, and update whole chunks at once
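
A rough sketch of benblack's chunking suggestion (the file layout, the `ts` field, and the hour-bucket scheme are illustrative assumptions, not anything prescribed in the linked article): group the ~1k JSON records into one newline-delimited file per hour, so updating the training set means rewriting one chunk instead of touching millions of tiny files.

```python
import json
import os
from collections import defaultdict

def write_hourly_chunks(records, out_dir):
    """Group ~1k JSON records into one file per hour, so HDFS sees a
    few thousand chunk files instead of tens of millions of tiny ones.
    Assumes each record carries a 'ts' field like '2010-06-30T14:03:00'."""
    by_hour = defaultdict(list)
    for rec in records:
        by_hour[rec["ts"][:13]].append(rec)   # '2010-06-30T14' = hour bucket
    os.makedirs(out_dir, exist_ok=True)
    for hour, recs in by_hour.items():
        # newline-delimited JSON: updating the training set means
        # rewriting (and re-uploading) just the affected chunk
        with open(os.path.join(out_dir, f"impressions-{hour}.json"), "w") as f:
            for rec in recs:
                f.write(json.dumps(rec) + "\n")
```

The resulting per-hour files are large enough to upload to HDFS as-is (e.g. with `hadoop fs -put`), sidestepping the small-files problem the article describes.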
seancribbs # sanity: yes, a filesystem would be easiest for that (whether HDFS or something else)
"filesystems - the original NoSQL"
benblack # indeed
seancribbs # so… in general I agree with benblack. however, if you still want to use riak, we can discuss how to tackle your problem
sanity # seancribbs: i don't think HDFS is appropriate, its default minimum block size is 64MB :-/
seancribbs # ah
sanity # seancribbs: with Cassandra you can specify a way for it to sort internally by keys, and then you can do range queries on the keys. I assume that isn't possible with Riak given that it locates stuff by hashing the key? i'd like to be able to say "give me every object from this date and time to this date and time" but it seems like in Riak that would require an exhaustive search
seancribbs # sanity: you would do that with map-reduce of some sort
sanity # seancribbs: right, but it would basically need to check every value to see whether it is within the range - right?
seancribbs # if you know the granularity/range ahead of time, it'll be easier
sanity: maybe
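
To make the shape of that concrete, here is a hedged sketch of such a map-reduce job against Riak's documented HTTP `/mapred` endpoint. The bucket name, the `ts` field, and the timestamp format are invented for illustration, and note that the map function still runs over every object in the bucket, which is exactly the exhaustive scan sanity is worried about.

```python
import json
import urllib.request

# JavaScript map function executed inside Riak; it emits only objects
# whose (assumed) 'ts' field falls inside the requested window
MAP_FN = """
function(value, keyData, arg) {
  var obj = JSON.parse(value.values[0].data);
  if (obj.ts >= arg.from && obj.ts < arg.to) { return [obj]; }
  return [];
}
"""

job = {
    "inputs": "impressions",          # hypothetical bucket name
    "query": [{"map": {"language": "javascript",
                       "source": MAP_FN,
                       "arg": {"from": "2010-06-28T00", "to": "2010-06-30T00"},
                       "keep": True}}],
}

req = urllib.request.Request(
    "http://127.0.0.1:8098/mapred",   # Riak's default HTTP port
    data=json.dumps(job).encode(),
    headers={"Content-Type": "application/json"},
)
results = json.load(urllib.request.urlopen(req))  # every matching object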
sanity # seancribbs: it could be an hourly granularity, for example
seancribbs # where you would only want a single hour?
sanity # seancribbs: no, you might want 48 hours, for example
seancribbs # ah and the data itself doesn't directly correspond one-to-one to a given range/timestamp?
sanity # the data is basically impressions on a website
sanity # or, more specifically, ad impressions
seancribbs # ok
sanity # we're building an ad network and we have a machine learning algorithm (similar to a neural network) that we need to feed several days' worth of impression/click/conversion data to, so it can learn to predict which ad will make us the most money :-)
seancribbs # so, some of this will be solved with riak search. but that can't help you right now
sounds cool
sanity # seancribbs: riak search? is that a feature currently being worked on?
benblack # what you are describing is a classic hadoop problem
seancribbs # benblack +1
sanity: yeah, hadoop is going to be better at large queries
benblack # seems like you just need to figure out a way to slightly abstract how training set updates happen so you don't create millions of tiny files in hdfs
sanity # our machine learning algorithm is fairly CPU-intensive, so we can't really write it in JavaScript :-)
benblack # the link i gave earlier is really useful on that point
seancribbs # hadoop is Java and a bunch of higher-level languages
sanity # seancribbs: but Hadoop isn't a database
benblack # you really aren't describing a database problem, my man
seancribbs # sanity: I don't see any way you're not either going to have a huge index or need hadoop's processing power
sanity # the processing can be done on a single machine in a few hours - that isn't what worries me. i just need somewhere to keep the data, such that it can be updated efficiently, and retrieved in bulk when needed
benblack # sanity: did you read that article i linked to on hdfs small files?
sanity # benblack: sorry, overlooked it - let me look now
benblack # it describes several options. i suspect at least one could work for you
sanity # benblack: ok, thanks - I'll investigate. i just wish I could use a single solution for my persistence needs (both the training data, and other more conventional data that I think would fit perfectly into the Riak paradigm)
seancribbs # and also, spooling up data from a chunk of time and simply storing that as a single object might help. if your smallest granularity is an hour, make 1-hour objects for each ad/network/whatever. then you can derive the keys
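
A sketch of that key-derivation idea, with an invented key scheme: if impressions are spooled into one object per ad per hour, the keys for any 48-hour window can be computed directly and fetched with plain GETs against Riak's HTTP interface, so no range query is needed.

```python
import json
import urllib.error
import urllib.request
from datetime import datetime, timedelta

RIAK = "http://127.0.0.1:8098/riak"   # Riak's default HTTP interface

def hourly_keys(ad_id, start, hours):
    """Derive per-hour object keys, e.g. 'ad42-2010-06-30T14'
    (this key scheme is invented for the example)."""
    return [f"{ad_id}-{start + timedelta(hours=h):%Y-%m-%dT%H}"
            for h in range(hours)]

def fetch_window(bucket, ad_id, start, hours):
    """Yield impressions for a time window via plain key lookups."""
    for key in hourly_keys(ad_id, start, hours):
        try:
            with urllib.request.urlopen(f"{RIAK}/{bucket}/{key}") as resp:
                yield from json.load(resp)   # each object is a list of impressions
        except urllib.error.HTTPError as err:
            if err.code != 404:              # hours with no traffic just don't exist
                raise

# e.g. 48 hours of impressions for one ad:
# for imp in fetch_window("impressions", "ad42", datetime(2010, 6, 28), 48): ...
```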
benblack # sanity: part of nosql is using the right tool for the job. that means multiple tools.
sanity # benblack: true, although it's better to use a single tool if possible
benblack # if there is a good fit in riak, you should definitely use it. if hadoop/hdfs works, you get the full hadoop stack, which is pretty compelling for machine learning.
seancribbs # sanity: i think that's a dangerous thing to say
benblack # sanity: i would not agree with that.
seancribbs # although we love Riak, i'd never say it fits everybody's use-case
sanity # seems like HBase might be effective - "Hbase .... is a good choice if you need to do MapReduce style streaming analyses with the occasional random look up" - from http://www.cloudera.com/blog/2009/02/the-small-files-problem/
benblack # forcing things to fit the wrong paradigm is exactly what's wrong with a lot of how relational databases are abused
seancribbs # ^^
sanity # yes, obviously you don't want to fit a square peg into a round hole - but at the same time you don't want to use 15 different technologies in a project if you can help it. it's a compromise
sanity # in this case I'm trying to figure out if using Riak for all my persistence needs is a realistic compromise, but it sounds like it isn't
seancribbs # sanity: truth is, riak just doesn't do range queries (without data gymnastics or expensive queries). if that's key to your problem, don't pick it
sanity # seancribbs: no, that isn't so important. it's more the issue of being able to stream very large numbers of results out of riak
seancribbs # sanity: if you require them in order, it's still a problem
if you don't, then it might be doable
m/r and link-walking queries stream
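
For reference, streaming map-reduce results over Riak's HTTP interface looks roughly like the following. The job uses Riak's built-in `Riak.mapValuesJson` function; the bucket name is invented, and with `chunked=true` the results arrive as a multipart/mixed stream instead of one buffered response.

```python
import json
import urllib.request

# same job shape as the earlier sketch, using Riak's built-in
# Riak.mapValuesJson to return each object's decoded JSON value
job = {"inputs": "impressions",
       "query": [{"map": {"language": "javascript",
                          "name": "Riak.mapValuesJson",
                          "keep": True}}]}

# chunked=true asks Riak to stream phase results back as they are
# produced, instead of buffering the entire result set first
req = urllib.request.Request(
    "http://127.0.0.1:8098/mapred?chunked=true",
    data=json.dumps(job).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    for chunk in resp:      # a real client would split on the MIME
        print(chunk)        # boundary rather than iterate raw lines
```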
sanity # seancribbs: no, order is irrelevant to the machine learning algo
seancribbs # ok then
but you still have to filter them according to timestamp?
sanity # seancribbs: except the machine learning algorithm only cares about relatively recent data - so i'll need to archive older data somehow
seancribbs # ok, this is why i was suggesting spooling up the data
sanity # seancribbs: only at a high level of granularity - perhaps per-day
seancribbs # hmm
sanity # what do you mean by "spooling up"?
seancribbs # as in, put multiple impressions in the same object, collected over the time period
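
A minimal sketch of that spooling pattern on the write side. Bucket and key names are invented, and this naive read-modify-write assumes a single writer: concurrent updates would need vector clocks and sibling resolution, which this sketch skips.

```python
import json
import urllib.error
import urllib.request

RIAK = "http://127.0.0.1:8098/riak"

def append_impressions(bucket, key, new_impressions):
    """Read-modify-write one hour object, appending a batch of
    impressions collected over that period."""
    url = f"{RIAK}/{bucket}/{key}"
    try:
        with urllib.request.urlopen(url) as resp:
            spool = json.load(resp)
    except urllib.error.HTTPError as err:
        if err.code != 404:
            raise
        spool = []                            # first write for this hour
    spool.extend(new_impressions)
    req = urllib.request.Request(url, data=json.dumps(spool).encode(),
                                 headers={"Content-Type": "application/json"},
                                 method="PUT")
    urllib.request.urlopen(req)

# e.g. append_impressions("impressions", "ad42-2010-06-30T14", [{"ts": "..."}])
```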
sanity # seancribbs: does Riak do a good job with very large objects?
seancribbs # how large is "very large"?
sanity # seancribbs: tens or hundreds of MBs?
seancribbs # tens are fine (although slow). hundreds will be a problem until our "big files" feature lands
sanity # seancribbs: in what way would it be slow?
seancribbs # it's replicated… by default 3 replicas
sanity # seancribbs: ie. slow like "it would work but I wouldn't recommend it"?
seancribbs: but you could have 0 replicas if it wasn't critical data, right?
seancribbs # and decisions about the state of the object presented to the client are mediated by retrieving it from every replica
you mean 1 replica
sanity # seancribbs: if there is only one of them then it's not a replica :-)
benblack # sanity: all copies are replicas, there is no master.
even if there is 1, it's a replica.
seancribbs # what benblack said
benblack # think of it as the platonic view of distributed storage
there is an idealized version of your object
sanity # benblack: we can debate the plain meaning of replica, but there isn't much point
benblack # everything in the system is a mere copy
sanity: i am giving you the operative definition for all dynamo systems. probably useful.
seancribbs # sanity: anyway, the current upper-limit for object sizes is 64MB
sanity # benblack: i'm thinking in terms of the english language definition of the word, but it doesn't matter
benblack # sanity: don't worry, pedantic nerds are definitely welcome here, i'm proof.
sanity # seancribbs: can map-reduce be used to efficiently modify such objects (if they are JSON)? ie. without completely replacing them just to update one field?
seancribbs # sanity: not recommended
map-reduce in Riak is for querying
sanity # seancribbs: ah, ok - good to know
sanity # seancribbs: but can the output of a map-reduce be inserted back into Riak?
seancribbs: ...without going through the client
seancribbs # sanity: yes, but again it's not recommended
for a number of reasons
sanity # seancribbs: what are they?
seancribbs # 1) map-reduce queries have timeouts which don't include the timeouts of the get/put processes
2) your whole query might crash if something doesn't work out in one function
3) reduce phases aren't run only once
and I'm sure there are other reasons
sanity # gotcha
seancribbs # the primary reason is latency, though.
(#1)
sanity # ok, well thank you for your help, I think I have a good picture of Riak's pros and cons now for my application. Hopefully I'll be able to find a way to use it because it really seems neat. At this point I think it may have to be Cassandra though :-/