Created September 17, 2010 17:15
15:25 <mheld> question to pose for y'all
15:25 <mheld> why would one use riak over mongodb?
15:26 <jdmaturen> cue benblack
15:26 <mheld> I don't mean it to be flamebait
15:26 <mheld> I really don't
15:26 <bingeldac> heh
15:26 <bingeldac> some of it is covered on the wiki
15:26 <bingeldac> in the comparison article (here ---> http://wiki.basho.com/display/RIAK/Riak+Compared+to+MongoDB)
15:27 <mheld> those are differences, but not really a "this is when you use riak and this is when you use mongo"
15:28 <ericflo> use mongo when you could use a relational db, there's really not much difference except flexible schema IMO
15:29 <benblack> mheld: think i answered this for you previously.
15:29 <mheld> benblack: if you did, I'm sorry. I've forgotton it
15:29 <mheld> forgotten
15:29 <benblack> first part of the answer is: they really have little in common, so the comparison is forced.
15:30 <mheld> mhmm
15:31 <benblack> second part is: if you want a rich query language that is familiar from the relational world, don't have that much data, don't have hard durability requirements, and don't require real distribution, then mongo is great
15:31 <benblack> if you can deal with a different data and query model, have a lot of data, care about data durability, and don't want to make managing replication and distribution a full time job, then you probably will prefer riak
15:32 <mheld> that's good :-)
15:33 <benblack> but, again, they are really not comparable and you can definitely use them effectively in combination
15:33 <benblack> think of mongo more like memcache with a better query interface and you are not far off
15:33 <mheld> do people ever use just riak as a backend?
15:33 <mheld> or am I being naive?
15:34 <benblack> just as a backend for what?
15:34 <mheld> database
15:34 <mheld> for web services
15:34 <benblack> you mean like people usually use memcache in front of mysql?
15:35 <mheld> like using riak + mysql
15:35 <mheld> vs just riak
15:35 <benblack> think you missed my question
15:36 <benblack> if your goal is to have one database to do everything you should stick with an rdbms. that isn't how you best use nosql systems.
15:36 <benblack> you should expect to use different ones in combination
15:36 <mheld> hmm
15:37 <benblack> that's why questions trying to compare mongo vs riak (or whatever else) are not straightforward
15:37 <mheld> I see
15:38 <mheld> I'd like to use just one database
15:39 <mheld> but it doesn't feel right to use a rdbms for the data munchey stuff we're doing
15:40 <benblack> why would you like to use one database?
15:41 <mheld> well, every type of data we have would have to have the same operations done to them
15:42 <mheld> I've got users which have different relationships to pieces of data
15:43 <mheld> and while it'd make sense to store the users in a rdbms
15:44 <mheld> there'd be a lot of weird data relating users to the other stuff
15:45 <benblack> you can of course try to push everything to the same db
15:45 <benblack> just letting you know it is common to have several databases
15:46 <ericflo> mheld: how much data are you dealing with?
15:48 <mheld> ericflo: now, not much... on the level of hundreds of megabytes
15:49 <mheld> but we're anticipating a much larger level within the next few weeks
15:49 <ericflo> oh wow, yeah any relational database out there can handle that
15:49 <ericflo> how much larger?
15:49 <mheld> orders of magnitude
15:49 <ericflo> mheld: how many?
15:49 <mheld> ideally terabytes
15:49 <mheld> petabytes eventually
15:51 <benblack> mheld: ok. this is a common problem.
15:51 <benblack> "well, ideally, we'll have more data than can possibly fit in anything we can conceive of designing or building!"
15:51 <benblack> you have to be realistic and specific.
15:51 <mheld> we'll we've just acquired our first major partner
15:52 <mheld> and we're in talks with another
15:52 <mheld> and we need to start acquiring more data
15:53 <benblack> sure, big stuff coming
15:53 <benblack> you still have to do capacity planning
15:53 <benblack> because you aren't putting a petabyte in riak right now
15:53 <mheld> mhmm
15:53 <benblack> (or anything else)
15:54 <benblack> folks with petabytes of data to store and process have full time engineers working on storage infrastructure.
15:54 <benblack> they aren't taking an existing system and running it as is
15:55 <mheld> I'm not doubting that, I just want to be sure I'm not looking down the wrong path
15:55 <mheld> I rather like riak
15:55 <benblack> step 1 is being realistic about data growth
15:55 <benblack> realistic and _specific_
15:56 <mheld> alright, say I've given up on the petabyte dream
15:56 <mheld> we're thinking hundreds of gigs
15:56 <benblack> ok
15:56 <benblack> and what kind of processing?
15:57 <mheld> we're essentially the history of the web and how people interact with it
15:57 <mheld> over time
15:57 <mheld> essentially capturing*
15:57 <mheld> I accidentally the verb
15:57 <mheld> we're doing trends, recommendations, and analytics
15:58 <mheld> market searching, info trending
16:01 <benblack> i don't want to scare you away from riak, which i love dearly, but you are describing the sweet spot for hadoop.
16:01 <mheld> ha, it's funny that you mention that
16:02 <mheld> because I've just purchased a few hadoop books
16:03 <mheld> why would this fit hadoop better than riak?
16:04 <benblack> because you are describing taking in a large amount of behavioral data and doing bulk analytics on it
16:04 <benblack> that's what hadoop is for
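The kind of bulk batch job benblack is pointing at can be sketched in the shape Hadoop Streaming expects: a mapper that emits key/value pairs and a reducer that aggregates them per key. This is a hedged illustration, not from the chat; the event format and field names are invented.

```python
from itertools import groupby
from operator import itemgetter

def mapper(lines):
    """Emit (url, 1) for each page-view event line: 'timestamp<TAB>user<TAB>url'."""
    for line in lines:
        _ts, _user, url = line.rstrip("\n").split("\t")
        yield url, 1

def reducer(pairs):
    """Sum counts per url; Hadoop delivers pairs grouped by key, so we sort first here."""
    for url, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield url, sum(count for _url, count in group)

events = [
    "2010-09-17T15:25\talice\thttp://example.com/a",
    "2010-09-17T15:26\tbob\thttp://example.com/a",
    "2010-09-17T15:27\talice\thttp://example.com/b",
]
print(dict(reducer(mapper(events))))
# {'http://example.com/a': 2, 'http://example.com/b': 1}
```

In a real Hadoop job the mapper and reducer run on different machines over far larger inputs; the point is only the processing shape: one full pass over all the behavioral data, aggregated offline.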
16:05 <mheld> hmm
16:13 <mheld> in my head, the only difference (data processing-wise) between riak and hadoop/hbase is the internal data structure (k/v store vs columnar rdbms)
16:13 <mheld> is that not right?
16:13 <benblack> nope, not right
16:14 <mheld> they still both do mapreduce
16:14 <benblack> java and erlang are both programming languages
16:14 <jdmaturen> benblack and I are both human
16:15 <benblack> map reduce is a processing model, it implies nothing about whether a specific implementation of that model is appropriate for a given application
16:15 <benblack> hadoop is built to do large-scale, batch processing for analysis
16:15 <benblack> riak's map reduce is primarily for interactive querying
16:16 <benblack> that one is written in java and the other (essentially) javascript is a good indicator they are not for the same purpose
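For contrast with the Hadoop style, the interactive querying benblack means looks roughly like this: Riak of this era accepts a JSON map/reduce job via HTTP POST to `/mapred`, with the map phase written as JavaScript that is shipped to the data. This is a hedged sketch assembled from Riak's documentation of the time; the bucket, keys, and JSON field `pages` are invented, and the job is only built and serialized here, not actually sent to a node.

```python
import json

# Hypothetical job: sum a numeric field across a handful of stored objects.
job = {
    "inputs": [["visits", "alice"], ["visits", "bob"]],  # explicit bucket/key pairs
    "query": [
        {"map": {"language": "javascript",
                 "source": "function(v) { return [JSON.parse(v.values[0].data).pages]; }"}},
        {"reduce": {"language": "javascript", "name": "Riak.reduceSum"}},
    ],
}
body = json.dumps(job)
# This body would be POSTed to http://localhost:8098/mapred
# with Content-Type: application/json.
print(body)
```

Note the scale of a typical input: a few named keys queried on demand, versus a Hadoop job's full scan of everything, which is the distinction being drawn above.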
16:16 <mheld> how is that different?
16:16 <mheld> can you not use riak to do batch processing?
16:17 <benblack> wow, ok
16:17 <benblack> mheld: what programming languages do you generally use?
16:18 <mheld> if I'm pissing you off, I'll leave you alone, I'm just trying to understand this and apparently failing
16:18 <mheld> day to day I'll use java, ruby, and scala
16:18 <benblack> ok
16:18 <benblack> why do you use ruby vs java?
16:19 <benblack> they are both programming languages, right?
16:19 <mheld> functional constructs that just aren't there in java
16:19 <mheld> is why I'd use ruby
16:19 <benblack> ok
16:19 <benblack> any kind of performance differences?
16:19 <mheld> java is generally faster for most things
16:19 <mheld> iirc
16:19 <benblack> ...by a lot.
16:20 <mheld> mhm
16:20 <benblack> but you want to write something straightforward, and compact
16:20 <benblack> less concerned about performance
16:20 <benblack> you'd probably use ruby, right?
16:21 <mheld> yes
16:21 <mheld> oh
16:21 <makmanalp> lambdaj is pretty neat, btw, but i can't help but feel that it's a hackjob
16:21 <makmanalp> and doesn't really give you everything
16:22 <mheld> benblack: I think I get what you're painting a picture of
16:22 <mheld> I'm just being retarded
16:22 <makmanalp> i'd be better off using something like clojure if the jvm is a must
16:22 <mheld> makmanalp: why not scala?
16:22 <makmanalp> mheld: scala looks weird to me, but that's a matter of personal taste
16:22 <makmanalp> i've used scheme a lot
16:23 <mheld> ah
16:23 <makmanalp> so clojure is familiar
16:23 <benblack> mheld: riak is really great at a lot of things (like being stupidly simple to operate, scale up and down, tuning consistency, tuning backends, etc)
16:23 <benblack> mheld: but bulk analytics on a petabyte, not so much.
16:24 <mheld> hmm
16:24 <makmanalp> and so it programming languages, etc
16:24 <makmanalp> AI, depending on the professor
16:24 <mheld> makmanalp: ah sweet
16:25 <mheld> makmanalp: HTDP?
16:49 <davidc_> I knew I was missing a channel...
18:19 <mheld> how fast are riak mapreduce queries compared to sql queries?
18:19 <mheld> well, sql queries with some magic done on them
18:21 <ericflo> mheld: sql queries don't have some set time that they all take, and neither do riak mapreduce jobs.
18:22 <ericflo> mheld: It also depends on what storage backend you use, and what version of Riak, and what language you choose for your map/reduce tasks, and if you leverage any of the built-in functions, etc.
18:24 <mheld> say I enter "greasy bacon" on google.com
18:25 <mheld> what happens?
18:25 <mheld> not as in GET requests
18:26 <mheld> as in database queries
18:27 <benblack> nothing remotely like what you will have with any database you've ever used
18:27 <benblack> they are precomputing enormous indices, then querying in parallel across a huge number of machines
18:27 <benblack> the indices and the backend queries are specifically structured for the purpose
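A toy version of the precomputed index benblack describes: an inverted index, built offline, maps each term to the documents containing it, so answering a query is an intersection of small posting lists rather than a scan over the data. The documents below are invented for illustration, and a real search engine shards this index over many machines and queries the shards in parallel.

```python
from collections import defaultdict

docs = {
    1: "greasy bacon recipes",
    2: "crispy bacon",
    3: "greasy spoon diner",
}

# Build the index offline: term -> set of doc ids (the "precomputed" part).
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

def search(query):
    """Intersect the posting lists of every query term."""
    terms = query.split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

print(sorted(search("greasy bacon")))  # [1]
```

Serving "greasy bacon" never touches the raw documents at query time; all the expensive work happened when the index was built, which is why the query can return in milliseconds.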
18:27 <mheld> how do they get it all done in such a small time?
18:28 <ericflo> 2 minutes of searching found this: http://www.ams.org/samplings/feature-column/fcarc-pagerank
18:28 <benblack> mheld: as i just said.
18:35 <mheld> anybody know anything about http://googleblog.blogspot.com/2010/06/our-new-search-index-caffeine.html ?
18:36 <benblack> nobody outside google.
18:36 * jdmaturen waits for the paper
18:36 <benblack> you might also enjoy the paper from google on Dremel
18:37 <jdmaturen> yes