@PharkMillups
Created June 30, 2010 14:03
sanity # I'm evaluating Riak for an application where I need to store tens of millions of
JSON objects (each about 1k in size) and then retrieve all of them at a later time. Can
Riak do something like this efficiently (retrieve very large numbers of objects and stream
them to a client for processing)?
benblack # you want to regularly retrieve every single object at once?
sanity # benblack: yes
benblack # do you access them individually other than that?
sanity # benblack: the data is the input to a machine learning algorithm
sanity # benblack: perhaps, although I could probably avoid it if necessary
benblack # then i don't know why you are talking about using a database at all
shove them straight into hdfs, process with hadoop as needed
sanity # benblack: hmmm
benblack: I'm sure there was a reason I couldn't do that, but right now I can't think of one :-)
benblack: hypothetically though, how would Riak perform if I did use it in that way?
benblack # full scans are inefficient on every useful database i've ever used. you
would just be adding overhead.
sanity # ah wait, i remember the issue - I need to be able to go back and update
items in the training data
benblack: i did something similar in mysql and it worked ok
benblack # then your data set is small
sanity # benblack: it was about 5 million rows
benblack # then your data set is small
sanity # benblack: similar order of magnitude to the application I described at the outset
benblack # at that scale, almost anything will work.
sanity # benblack: so basically you are saying that Riak would not deal well with
retrieving tens of millions of objects in a single GET request?
benblack # yeah
sanity # benblack: is that because of the hashing? the objects would be spread
across multiple nodes at random?
sanity # benblack: any idea if Cassandra would be better at it?
benblack # might be. still doesn't sound like you need a database, though.
sanity # benblack: without a database how do i update specific entries in the training set?
benblack # by having each one in a different file?
sanity # benblack: tens of millions, where each file is about 1k in size? would
hdfs deal with that effectively?
benblack # no idea.
sanity # benblack: ok, thanks for your help
allen joined #riak
benblack # http://www.cloudera.com/blog/2009/02/the-small-files-problem/
so, you chunk your total training set, and update whole chunks at once
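
A minimal sketch of that chunking idea, assuming newline-delimited JSON records that carry a unix 'timestamp' field (the field name and output directory are illustrative; uploading the finished chunks into HDFS is left out):

    import json
    import os
    from collections import defaultdict
    from datetime import datetime

    def chunk_impressions(records, out_dir):
        # Group ~1 KB JSON impression records into one file per hour so HDFS
        # sees a handful of large files instead of millions of tiny ones.
        buckets = defaultdict(list)
        for rec in records:
            # assumes each record carries a unix 'timestamp' field
            hour = datetime.utcfromtimestamp(rec["timestamp"]).strftime("%Y-%m-%dT%H")
            buckets[hour].append(rec)

        os.makedirs(out_dir, exist_ok=True)
        for hour, recs in buckets.items():
            # updating the training set means rewriting a whole hourly chunk,
            # never touching individual 1 KB records
            with open(os.path.join(out_dir, f"impressions-{hour}.json"), "w") as f:
                for rec in recs:
                    f.write(json.dumps(rec) + "\n")
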
seancribbs # sanity: yes, a filesystem would be easiest for that (whether HDFS or something else)
"filesystems - the original NoSQL"
benblack # indeed
seancribbs # so… in general I agree with benblack. however, if you still want to use riak,
we can discuss how to tackle your problem
sanity # seancribbs: i don't think HDFS is appropriate, its default minimum block size is 64MB :-/
seancribbs # ah
sanity # seancribbs: while evaluating Cassandra I saw that you can specify a way for it to sort internally by keys,
and then you can do range queries on the keys. I assume that isn't possible with Riak given
that it locates stuff by hashing the key? i'd like to be able to say "give me every object from this
date and time to this date and time" but it seems like in Riak that would require an exhaustive search
seancribbs # sanity: you would do that with map-reduce of some sort
sanity # seancribbs: right, but it would basically need to check every value to see whether it is
within the range - right?
seancribbs # if you know the granularity/range ahead of time, it'll be easier
sanity: maybe
sanity # seancribbs: it could be an hourly granularity, for example
seancribbs # where you would only want a single hour?
sanity # seancribbs: no, you might want 48 hours, for example
seancribbs # ah and the data itself doesn't directly correspond one-to-one to a given range/timestamp?
sanity # the data is basically impressions on a website
sanity # or, more specifically, ad impressions
seancribbs # ok
sanity # we're building an ad network and we have a machine learning algorithm
(similar to a neural network) that we need to feed several days worth of impression/click/conversion
data to, so it can learn to predict which ad will make us the most money :-)
seancribbs # so, some of this will be solved with riak search. but that can't help you right now
sounds cool
sanity # seancribbs: riak search? that is a feature currently being worked on?
benblack # what you are describing is a classic hadoop problem
seancribbs # benblack +1
sanity: yeah, hadoop is going to be better at large queries
benblack # seems like you just need to figure out a way to slightly abstract how training set
updates happen so you don't create millions of tiny files in hdfs
sanity # our machine learning algorithm is fairly CPU intensive, we can't really
write it with Javascript :-)
benblack # the link i gave earlier is really useful on that point
seancribbs # hadoop is Java and a bunch of higher-level languages
sanity # seancribbs: but Hadoop isn't a database
benblack # you really aren't describing a database problem, my man
seancribbs # sanity: I don't see any way you're not either going to have a huge index or
need hadoop's processing power
sanity # the processing can be done on a single machine in a few hours - that isn't what
worries me. i just need somewhere to keep the data, such that it can be updated efficiently, and
retrieved en-bulk when needed
benblack # sanity: did you read that article i linked to on hdfs small files?
sanity # benblack: sorry, overlooked it - let me look now
benblack # it describes several options i suspect at least one could work for you
sanity # benblack: ok, thanks - I'll investigate. i just wish I could use a single solution for my
persistence needs (both the training data, and other more conventional data that I think
would fit perfectly into the Riak paradigm)
seancribbs # and also, spooling up data from a chunk of time and simply storing that as a single
object might help. if your smallest granularity is an hour, make 1-hour objects for each ad/network/whatever
then you can derive the keys
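
A sketch of what "derive the keys" could look like against Riak's HTTP interface, assuming hourly objects are stored under keys like 2010-06-30T14 in an "impressions" bucket (bucket, host, and key format are assumptions, not anything from the conversation):

    import json
    import urllib.error
    import urllib.request
    from datetime import datetime, timedelta

    RIAK = "http://localhost:8098/riak"  # assumed default Riak HTTP interface

    def hourly_keys(end, hours):
        # derive the keys for the `hours` one-hour objects ending at `end`
        return [(end - timedelta(hours=i)).strftime("%Y-%m-%dT%H") for i in range(hours)]

    def fetch_window(bucket, end, hours=48):
        # GET each hourly object by its derived key; hours with no data (404) are skipped
        records = []
        for key in hourly_keys(end, hours):
            try:
                with urllib.request.urlopen(f"{RIAK}/{bucket}/{key}") as resp:
                    records.extend(json.load(resp))
            except urllib.error.HTTPError as err:
                if err.code != 404:
                    raise
        return records

    # e.g. the last 48 hours of ad impressions for the learner
    training_data = fetch_window("impressions", datetime.utcnow(), hours=48)
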
benblack # sanity: part of nosql is using the right tool for the job. that means multiple tools.
sanity # benblack: true, although it's better to use a single tool if possible
benblack # if there is a good fit in riak, you should definitely use it. if hadoop/hdfs works,
you get the full hadoop stack, which is pretty compelling for machine learning.
seancribbs # sanity: i think that's a dangerous thing to say
benblack # sanity: i would not agree with that.
seancribbs # although we love Riak, i'd never say it fits everybody's use-case
sanity # seems like HBase might be effective - "Hbase .... is a good choice if you need to do
MapReduce style streaming analyses with the occasional random look up" -
from http://www.cloudera.com/blog/2009/02/the-small-files-problem/
benblack # forcing things to fit the wrong paradigm is exactly what's wrong with a
lot of how relational databases are abused
seancribbs # ^^
sanity # yes, obviously you don't want to fit a square peg into a round hole - but at the same
time you don't want to use 15 different technologies in a project if you can help it
it's a compromise
sanity # in this case I'm trying to figure out if using Riak for all my persistence
needs is a realistic compromise, but it sounds like it isn't
seancribbs # sanity: truth is, riak just doesn't do range queries
(without data gymnastics or expensive queries). if that's key to your problem, don't pick it
sanity # seancribbs: no, that isn't so important. it's more the issue of being able to
stream very large numbers of results out of riak
seancribbs # sanity: if you require them in order, it's still a problem
if you don't, then it might be doable
m/r and link-walking queries stream
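
For reference, a hedged sketch of such a query against Riak's /mapred HTTP endpoint, filtering a whole bucket by timestamp; the bucket name, timestamp field, and window are illustrative, and the whole-bucket input is exactly the expensive scan discussed above:

    import json
    import urllib.request

    # map phase runs against every object in the bucket; it parses the stored JSON
    # and keeps only records inside the requested window (timestamp field assumed)
    MAP_JS = """
    function(value, keyData, arg) {
      var rec = Riak.mapValuesJson(value)[0];
      if (rec.timestamp >= arg.from && rec.timestamp < arg.to) { return [rec]; }
      return [];
    }
    """

    query = {
        "inputs": "impressions",  # whole-bucket input: this is the expensive full scan
        "query": [{"map": {"language": "javascript",
                           "source": MAP_JS,
                           "arg": {"from": 1277769600, "to": 1277942400}}}],
    }

    req = urllib.request.Request(
        "http://localhost:8098/mapred",
        data=json.dumps(query).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        results = json.load(resp)
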
sanity # seancribbs: no, order is irrelevant to the machine learning algo
seancribbs # ok then
but you still have to filter them according to timestamp?
sanity # seancribbs: except the machine learning algorithm only cares about relatively
recent data - so i'll need to archive older data somehow
seancribbs # ok, this is why i was suggesting spooling up the data
sanity # seancribbs: only at a high level of granularity - perhaps per-day
seancribbs # hmm
sanity # what do you mean by "spooling up"?
seancribbs # as in, put multiple impressions in the same object, collected over the time period
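
A sketch of that spooling step, matching the hourly key scheme from the earlier sketch: buffer an hour's impressions and PUT them as one JSON object (the sample record fields are made up; concurrent writers and conflict handling are ignored):

    import json
    import urllib.request
    from datetime import datetime

    RIAK = "http://localhost:8098/riak"  # assumed default Riak HTTP interface

    def store_hourly_batch(bucket, hour_key, impressions):
        # write one hour's worth of impressions as a single JSON object,
        # e.g. key "2010-06-30T14" holding a list of impression dicts
        req = urllib.request.Request(
            f"{RIAK}/{bucket}/{hour_key}",
            data=json.dumps(impressions).encode(),
            headers={"Content-Type": "application/json"},
            method="PUT",
        )
        urllib.request.urlopen(req).close()

    batch = [{"ad_id": 42, "timestamp": 1277906580, "clicked": False}]  # made-up record
    store_hourly_batch("impressions", datetime.utcnow().strftime("%Y-%m-%dT%H"), batch)
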
sanity # seancribbs: does Riak do a good job with very large objects?
seancribbs # how large is "very large"
sanity # seancribbs: tens or hundreds of MBs?
seancribbs # tens are fine (although slow). hundreds will be a problem until
our "big files" feature lands
sanity # seancribbs: in what way would it be slow?
seancribbs # it's replicated… by default 3 replicas
sanity # seancribbs: ie. slow like "it would work but I wouldn't recommend it"?
seancribbs: but you could have 0 replicas if it wasn't critical data, right?
seancribbs # and decisions about the state of the object presented to the client are
mediated by retrieving it from every replica. you mean 1 replica
sanity # seancribbs: if there is only one of them then it's not a replica :-)
benblack # sanity: all copies are replicas, there is no master.
even if there is 1, it's a replica.
seancribbs # what benblack said
benblack # think of it as the platonic view of distributed storage
there is an idealized version of your object
sanity # benblack: we can debate the plain meaning of replica, but there isn't much point
benblack # everything in the system is a mere copy
sanity: i am giving you the operative definition for all dynamo systems. probably useful.
seancribbs # sanity: anyway, the current upper-limit for object sizes is 64MB
sanity # benblack: i'm thinking in terms of the english language definition of the
word, but it doesn't matter
benblack # sanity: don't worry, pedantic nerds are definitely welcome here, i'm proof.
sanity # seancribbs: can map-reduce be used to efficiently modify such objects
(if they are JSON)? ie. without completely replacing them just to update one field?
seancribbs # sanity: not recommended
map-reduce in Riak is for querying
sanity # seancribbs: ah, ok - good to know
sanity # seancribbs: but can the output of a map-reduce be inserted back into Riak?
seancribbs: ...without going through the client
seancribbs # sanity: yes, but again it's not recommended
for a number of reasons
sanity # seancribbs: what are they?
seancribbs # 1) map-reduce queries have timeouts which don't include the timeouts of the get/put processes
2) your whole query might crash if something doesn't work out in one function
3) reduce phases aren't run only once
and I'm sure there's other reasons
sanity # gotcha
seancribbs # the primary reason is latency, though.
(#1)
sanity # ok, well thank you for your help, I think I have a good picture of Riak's
pros and cons now for my application. Hopefully I'll be able to find a way to use
it because it really seems neat. At this point I think it may have to be Cassandra though :-/