PharkMillups/gist:902437

## gistfile1.txt
13:57 <kraay> Hello everyone -- I'm evaluating using riak as a document store.
We'd be storing about 300k JSON docs of identical structure/complexity.
However, I'd like to provide the end user with ability to use JPath
(a weak form of XPath) as a means of traversing the JSON doc and
returning those related bits. What would be the best way of doing
this? via link-walking? map-reduce? And can I do either in RT?

13:59 <aphyr> kraay: RT?

13:59 <kraay> Real Time

13:59 <aphyr> What do you think realtime means? :)

14:00 <kraay> Let's say 30 ms

14:00 <aphyr> For a single document? Not a problem.

14:00 <kraay> How about for 300k documents

14:00 <aphyr> So the user knows which document they want to operate on,
and provides a JPATH query?

14:01 <aphyr> Or the user wants to run JPath against the entire dataset at once?

14:02 <kraay> Well, the user would send a subset of document "ids" (which would
possibly be 300k) and we'd take that and the JPath query to return the specified fields.

14:03 <kraay> The JPath would be merely as a means of walking the JSON doc.

14:03 <aphyr> If you know of any JSON parser which can parse 300K
documents in under 30ms, I am extremely interested to hear about it. :)

14:03 <aphyr> That aside, yeah doesn't sound too hard.

14:04 <aphyr> You'd probably want to do this with mapreduce. Issue an MR query
with the list of IDs. For each ID, pass the JPath expression to the map phase
and have it run the query against that document.

14:04 <aphyr> Then reduce the set of results together.
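
The flow aphyr describes (a list of IDs in, a path expression handed to each map call, results merged in a reduce step) can be sketched locally in plain Python. This is only an illustration of the shape of the computation, not Riak's MapReduce API: the in-memory `store`, the dotted-path syntax, and the names `walk`, `map_phase`, `reduce_phase`, and `run_query` are all assumptions made for the example.

```python
from functools import reduce


def walk(doc, path):
    """Follow a dotted path such as 'user.name' through nested dicts.

    Returns None if any step of the path is missing. The dotted syntax is a
    stand-in for a real JPath expression (assumption).
    """
    node = doc
    for part in path.split("."):
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node


def map_phase(doc, path):
    """Map step: run the path query against one document, emit a list."""
    value = walk(doc, path)
    return [] if value is None else [value]


def reduce_phase(acc, partial):
    """Reduce step: concatenate the per-document result lists."""
    return acc + partial


def run_query(store, ids, path):
    """Apply the map step to every requested ID, then reduce the results."""
    mapped = (map_phase(store[i], path) for i in ids if i in store)
    return reduce(reduce_phase, mapped, [])


# Hypothetical in-memory stand-in for a bucket of JSON documents.
store = {
    "doc1": {"user": {"name": "ann", "age": 31}},
    "doc2": {"user": {"name": "bob"}},
    "doc3": {"other": 1},
}

names = run_query(store, ["doc1", "doc2", "doc3"], "user.name")
print(names)
```

In a real cluster the map step would run on the nodes that hold each document, which is where the scaling with node count comes from; here everything runs in one process.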

14:04 <aphyr> Would scale linearly with nodes and run in something like
O(documents_requested/nodes)

14:05 <aphyr> 30ms is definitely achievable for 200 documents or so

14:05 <aphyr> But I expect passing hundreds of thousands of documents into your
MR query is going to take a significant amount of time
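
As a rough back-of-the-envelope check, the "30ms for about 200 documents" figure can be extrapolated linearly, following the O(documents_requested/nodes) estimate. This is a naive sketch: it assumes cost grows strictly linearly with document count, ignores the time needed to send and deserialize a very large ID list, and the function name and `node_scale` parameter are made up for illustration.

```python
def extrapolate_ms(baseline_docs, baseline_ms, target_docs, node_scale=1.0):
    """Linearly scale a measured latency to a different document count.

    node_scale is how many times larger the cluster is than the one the
    baseline was observed on (assumption: perfect linear scaling).
    """
    return target_docs * baseline_ms / baseline_docs / node_scale


# Baseline from the conversation: roughly 200 documents in 30 ms.
same_cluster_ms = extrapolate_ms(200, 30.0, 300_000)
ten_times_nodes_ms = extrapolate_ms(200, 30.0, 300_000, node_scale=10.0)

print(f"300k docs, same cluster: ~{same_cluster_ms / 1000:.0f} s")
print(f"300k docs, 10x the nodes: ~{ten_times_nodes_ms / 1000:.1f} s")
```

Under these assumptions, 300k documents land in the tens of seconds on the same cluster, which is consistent with the point that 30 ms is not realistic for the full set.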

14:06 <kraay> Couldn't I just scale out the number of mappers?

14:07 <aphyr> Submit multiple jobs concurrently? Sure.

14:07 <aphyr> Probably would more time though.

14:07 <aphyr> *take more time*

14:07 <kraay> Sure, of course.

14:08 <aphyr> I suspect it would only be worth it if issuing the request
to Riak took more time than managing the distribution yourself would take.

14:08 <aphyr> Riak can only begin the map jobs when it's finished
receiving/deserializing your request so for 300K ids there could be a nontrivial
delay.

14:09 <aphyr> Never tried something that size but I'm sure others have. :)

14:10 <kraay> Hmmm... would the gains be worth the effort, if I used protocol
buffers or BSON?

14:10 <kraay> ...if that's even possible

14:10 <aphyr> For your document serialization?

14:10 <aphyr> I would use erlang terms.

14:10 <aphyr> The serialization would be extremely fast and it would be a
good match for an erlang MR job.

14:11 <kraay> Ah, I've been itching to dabble a bit more into erlang :)

14:11 <aphyr> Yeah, I've found huge performance gains from converting JS MR phases into erlang.

14:11 <aphyr> Be aware that erlang JSON parsing is... obtuse.

14:12 <kraay> Heh, I was monkeying around with JSON parsing in Scala --
finding a suitable library is also a bit of a challenge :)

14:12 <aphyr> mochijson2 converts objects to {struct, [{key, value}, {key, value}, ...]}
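
To make that shape concrete, here is a small Python sketch that mimics the `{struct, [{key, value}, ...]}` layout using tuples and lists. It is an analogy only, not mochijson2 itself; `to_struct` and `get_key` are hypothetical helpers, and in Erlang the keys and strings would, as far as I recall, come back as binaries rather than plain strings.

```python
def to_struct(value):
    """Convert nested dicts into ('struct', [(key, value), ...]) tuples.

    Lists are converted element by element; scalars pass through unchanged.
    """
    if isinstance(value, dict):
        return ("struct", [(k, to_struct(v)) for k, v in value.items()])
    if isinstance(value, list):
        return [to_struct(v) for v in value]
    return value


def get_key(struct, key):
    """Look up a key in a ('struct', pairs) tuple with a linear scan."""
    _tag, pairs = struct
    for k, v in pairs:
        if k == key:
            return v
    return None


doc = {"name": "ann", "tags": ["a", "b"], "address": {"city": "Oslo"}}
converted = to_struct(doc)
print(converted)

city = get_key(get_key(converted, "address"), "city")
print(city)
```

The linear scan in `get_key` is part of why walking this representation feels awkward: every nested lookup means unwrapping a tagged tuple and searching a list of pairs.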

14:13 <aphyr> Also recall that *building* an object in memory may take more
time than required; since the serialization format is linear you're already
paying an O(document size) cost

14:13 <aphyr> You might be able to do things faster by choosing your serialization
cleverly and doing it as a stream parser or something.

14:14 <aphyr> There's a generality--speed continuum; you'll have to decide the tradeoff.

14:14 <kraay> Well, then why not use a simple link-walk to represent the JSON data?

14:15 <aphyr> Depends on what shape your data is

14:15 <aphyr> Link-walking is essentially parsing a list of erlang terms stored as
metadata around the object

14:15 <aphyr> then returning specific terms as the result of a map query.

14:16 <aphyr> You'll incur an additional cost from the latency of jumping from
object to object

14:16 <aphyr> (which may be on different hosts)

14:16 <aphyr> also, riak mapreduce currently finishes *all* jobs in the same phase
before moving to the next phase

14:16 <aphyr> So you could see blocking issues

14:17 <aphyr> It would almost certainly be slower than just storing all the data as
erlang terms in the same object

14:18 <kraay> ...and then accept the overhead in having to put the entire
document in memory

14:19 <aphyr> If your documents are large that could be a problem.

14:19 <kraay> A single JSON document will be a map, with about 10 keys. Each
value will contain a nested set of maps about 3 layers deep.

14:20 <aphyr> So roughly 1000 things total?

14:20 <kraay> Sorry, no -- approx 100

14:20 <aphyr> Oh yeah that's tiny

14:21 <aphyr> I would put it all in one object

14:21 <aphyr> You probably won't even notice the deserialization time

14:21 <aphyr> If it becomes an issue you could use a stream parser to cut your
time by a constant factor

14:22 <kraay> ah, ok -- then regarding the buckets. Would it be possible to
stuff the 300k entries into a single bucket?

14:22 <aphyr> Sure!

14:22 <aphyr> 300k is small potatoes

14:22 <aphyr> (buckets by the way are just key prefixes)

14:22 <kraay> the idea being that eventually we might have different sources of
JSON data, and the intuitive idea would be to stuff them into separate buckets

14:22 <kraay> great

14:22 <aphyr> So keep your bucket names small

14:23 <aphyr> Same goes for keys

14:23 <aphyr> There's an in-memory cost of ~40bytes + key length
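
The "~40 bytes + key length" figure lends itself to a quick estimate. The sketch below assumes the bucket name counts toward the stored key length (based on the remark that buckets are just key prefixes) and that the overhead is a flat 40 bytes per key; both are simplifications, and `keydir_bytes` is a name invented for this example.

```python
PER_KEY_OVERHEAD_BYTES = 40  # approximate figure quoted in the conversation


def keydir_bytes(num_keys, bucket_len, key_len, overhead=PER_KEY_OVERHEAD_BYTES):
    """Estimate in-memory key cost: overhead plus bucket and key length, per key."""
    return num_keys * (overhead + bucket_len + key_len)


# 300k documents with short names versus long names (lengths in bytes).
short_names = keydir_bytes(300_000, bucket_len=4, key_len=8)
long_names = keydir_bytes(300_000, bucket_len=32, key_len=64)

print(f"short names: {short_names / 1e6:.1f} MB")
print(f"long names:  {long_names / 1e6:.1f} MB")
```

At 300k keys either case is small, which fits the "small potatoes" comment; the difference only starts to matter at much larger key counts.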

14:23 <kraay> Wow, this is awesome information

14:24 <kraay> Do you have any exposure with the Java client libraries to riak?

14:24 <aphyr> haha I avoid java like the plague

14:25 <kraay> What's your preferred flavor?

14:25 <aphyr> Ruby/python/erlang

14:26 <aphyr> I believe there's a lot of development going on around the java
client right now

14:26 <aphyr> Many mailing list posts

14:26 <aphyr> You might take a look and get in contact with one of them?

14:26 <kraay> Ah, yeah the last project (a SaaS) was written in ruby -- It was
great for rapid development, but really difficult to verify and test

14:27 <aphyr> I find my problem isn't so often ruby as it is external services...

14:27 <aphyr> I use bacon for testing, works great, but when you're dealing with
DBs, networks, etc... stuff hits the fan

14:27 <aphyr> Erlang is really growing on me though. Much slower to write, but the results are usually terse and predictably correct.

14:28 <aphyr> Take a look at QuickCheck

14:29 <kraay> oooh - I like it

14:30 <kraay> The strong type checking is what prompted me to look into Scala --
I guess erlang has to have the same thing.

14:31 <aphyr> Erlang is strange... doesn't really have "types" per se

14:31 <aphyr> But the static verifier is very thorough

14:31 <aphyr> I am excited about scala for sure, haven't had the chance to play with it yet.

14:33 <kraay> oops, I've got to run -- one kid just woke up and stumbled in.

14:33 <aphyr> later!

14:33 <kraay> Thank you, very, very much for all your insight
14:33 <kraay> have a good day/night/morning :)