Created April 4, 2011 21:03
13:57 <kraay> Hello everyone -- I'm evaluating using riak as a document store.
We'd be storing about 300k JSON docs of identical structure/complexity.
However, I'd like to provide the end user with the ability to use JPath
(a weak form of XPath) as a means of traversing the JSON doc and
returning the related bits. What would be the best way of doing
this? Via link-walking? Map-reduce? And can I do either in RT?
13:59 <aphyr> kraay: RT?
13:59 <kraay> Real Time
13:59 <aphyr> What do you think realtime means? :)
14:00 <kraay> Let's say 30 ms
14:00 <aphyr> For a single document? Not a problem.
14:00 <kraay> How about for 300k documents
14:00 <aphyr> So the user knows which document they want to operate on,
and provides a JPath query?
14:01 <aphyr> Or the user wants to run JPath against the entire dataset at once?
14:02 <kraay> Well, the user would send a subset of document "ids" (which would
possibly be 300k) and we'd take that and the JPath query to return the specified fields.
14:03 <kraay> The JPath would be merely a means of walking the JSON doc.
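The kind of path-based traversal kraay describes can be sketched in a few lines. This is an illustrative stand-in, not any particular JPath library: it only handles object keys and integer list indices, with no wildcards or filters.

```python
import json

def jpath_get(doc, path):
    """Walk a parsed JSON document along a dotted path like 'a.b.0.c'.

    A deliberately tiny stand-in for a JPath evaluator: object keys
    and integer list indices only.
    """
    node = doc
    for part in path.split("."):
        if isinstance(node, list):
            node = node[int(part)]  # numeric segment indexes into a list
        else:
            node = node[part]       # otherwise it names an object key
    return node

doc = json.loads('{"user": {"name": "kraay", "tags": ["riak", "json"]}}')
print(jpath_get(doc, "user.name"))    # -> kraay
print(jpath_get(doc, "user.tags.1"))  # -> json
```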
14:03 <aphyr> If you know of any JSON parser which can parse 300K
documents in under 30ms, I am extremely interested to hear about it. :)
14:03 <aphyr> That aside, yeah doesn't sound too hard.
14:04 <aphyr> You'd probably want to do this with mapreduce. Issue an MR query
with the list of IDs. For each ID, pass the JPath expression to the map phase
and have it run the query against that document.
14:04 <aphyr> Then reduce the set of results together.
14:04 <aphyr> Would scale linearly with nodes and run in something like
O(documents_requested/nodes)
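The shape of aphyr's suggestion can be simulated in plain Python: a map phase that applies a path query to each fetched document, and a reduce that merges the partial results. The in-memory `bucket` and the job-submission plumbing are stand-ins for Riak's actual key/value store and MR API, which are not shown here.

```python
from functools import reduce

# Hypothetical in-memory "bucket": key -> already-parsed JSON document.
bucket = {
    "doc1": {"meta": {"title": "first"}},
    "doc2": {"meta": {"title": "second"}},
    "doc3": {"meta": {"title": "third"}},
}

def map_phase(doc, path):
    """Per-document map: walk a dotted path and emit the value found."""
    node = doc
    for part in path.split("."):
        node = node[part]
    return [node]

def reduce_phase(left, right):
    """Merge the partial result lists from each map into one list."""
    return left + right

def run_job(keys, path):
    """Simulate an MR job submitted with a list of ids, as described above."""
    mapped = [map_phase(bucket[k], path) for k in keys]
    return reduce(reduce_phase, mapped, [])

print(run_job(["doc1", "doc3"], "meta.title"))  # -> ['first', 'third']
```

Since each map call touches only its own document, the real work parallelizes across nodes, which is where the roughly O(documents_requested/nodes) behavior comes from.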
14:05 <aphyr> 30ms is definitely achievable for 200 documents or so
14:05 <aphyr> But I expect passing hundreds of thousands of documents into your
MR query is going to take a significant amount of time
14:06 <kraay> Couldn't I just scale out the number of mappers?
14:07 <aphyr> Submit multiple jobs concurrently? Sure.
14:07 <aphyr> Probably would take more time though.
14:07 <kraay> Sure, of course.
14:08 <aphyr> I suspect it would only be worth it if issuing the request
to Riak took more time than managing the distribution yourself would take.
14:08 <aphyr> Riak can only begin the map jobs when it's finished
receiving/deserializing your request, so for 300K ids there could be a nontrivial
delay.
14:09 <aphyr> Never tried something that size but I'm sure others have. :)
14:10 <kraay> Hmmm... would the gains be worth the effort, if I used protocol
buffers or BSON?
14:10 <kraay> ...if that's even possible
14:10 <aphyr> For your document serialization?
14:10 <aphyr> I would use erlang terms.
14:10 <aphyr> The serialization would be extremely fast and it would be a
good match for an erlang MR job.
14:11 <kraay> Ah, I've been itching to dabble a bit more into erlang :)
14:11 <aphyr> Yeah, I've found huge performance gains from converting JS MR phases into erlang.
14:11 <aphyr> Be aware that erlang JSON parsing is... obtuse.
14:12 <kraay> Heh, I was monkeying around with JSON parsing in Scala --
finding a suitable library is also a bit of a challenge :)
14:12 <aphyr> mochijson2 converts objects to {struct, [{key, value}, {key, value}, ...]}
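The {struct, proplist} shape aphyr mentions can be mimicked for illustration with Python tuples and lists standing in for Erlang terms (the real mochijson2 output uses Erlang binaries for keys and strings; this is only an analogue of the structure):

```python
def to_mochijson2_shape(value):
    """Rebuild a parsed JSON value in the shape mochijson2 uses:
    objects become ('struct', [(key, value), ...]) proplists.

    A Python analogue of the Erlang terms, for illustration only.
    """
    if isinstance(value, dict):
        return ("struct", [(k, to_mochijson2_shape(v)) for k, v in value.items()])
    if isinstance(value, list):
        return [to_mochijson2_shape(v) for v in value]
    return value  # numbers, strings, booleans, null pass through

print(to_mochijson2_shape({"a": {"b": 1}}))
# -> ('struct', [('a', ('struct', [('b', 1)]))])
```

The proplist-of-pairs representation is why walking these objects from a map phase feels obtuse compared to a native dict: lookups are list scans, not hash lookups.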
14:13 <aphyr> Also recall that *building* an object in memory may take more
time than required; since the serialization format is linear you're already
paying an O(document size) cost
14:13 <aphyr> You might be able to do things faster by choosing your serialization
cleverly and doing it as a stream parser or something.
14:14 <aphyr> There's a generality--speed continuum; you'll have to decide the tradeoff.
14:14 <kraay> Well, then why not use a simple link-walk to represent the JSON data?
14:15 <aphyr> Depends on what shape your data is
14:15 <aphyr> Link-walking is essentially parsing a list of erlang terms stored as
metadata around the object
14:15 <aphyr> then returning specific terms as the result of a map query.
14:16 <aphyr> You'll incur an additional cost from the latency of jumping from
object to object
14:16 <aphyr> (which may be on different hosts)
14:16 <aphyr> also, riak mapreduce currently finishes *all* jobs in the same phase
before moving to the next phase
14:16 <aphyr> So you could see blocking issues
14:17 <aphyr> It would almost certainly be slower than just storing all the data as
erlang terms in the same object
14:18 <kraay> ...and then accept the overhead of having to put the entire
document in memory
14:19 <aphyr> If your documents are large that could be a problem.
14:19 <kraay> A single JSON document will be a map, with about 10 keys. Each
value will contain a nested set of maps about 3 layers deep.
14:20 <aphyr> So roughly 1000 things total?
14:20 <kraay> Sorry, no -- approx 100
14:20 <aphyr> Oh yeah, that's tiny
14:21 <aphyr> I would put it all in one object
14:21 <aphyr> You probably won't even notice the deserialization time
14:21 <aphyr> If it becomes an issue you could use a stream parser to cut your
time by a constant factor
14:22 <kraay> ah, ok -- then regarding the buckets. Would it be possible to
stuff the 300k entries into a single bucket?
14:22 <aphyr> Sure!
14:22 <aphyr> 300k is small potatoes
14:22 <aphyr> (buckets, by the way, are just key prefixes)
14:22 <kraay> the idea being that eventually we might have different sources of
JSON data, and the intuitive idea would be to stuff them into separate buckets
14:22 <kraay> great
14:22 <aphyr> So keep your bucket names small
14:23 <aphyr> Same goes for keys
14:23 <aphyr> There's an in-memory cost of ~40 bytes + key length
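That ~40 bytes + key length figure makes the overhead easy to estimate with back-of-the-envelope arithmetic. The 40-byte constant is aphyr's approximation, and the bucket name and key length below are made-up inputs for illustration:

```python
OVERHEAD_PER_KEY = 40  # aphyr's rough per-entry constant, in bytes

def keyspace_memory(num_keys, bucket_name, avg_key_len):
    """Estimate the in-memory cost of the key listing: ~40 bytes plus
    the bucket-prefix length plus the key length, per entry."""
    per_entry = OVERHEAD_PER_KEY + len(bucket_name) + avg_key_len
    return num_keys * per_entry

# 300k docs in a bucket named "docs" with ~16-byte keys:
total = keyspace_memory(300_000, "docs", 16)
print(total, "bytes")  # 18,000,000 bytes, i.e. roughly 17 MiB
```

At this scale the keyspace overhead is modest, which is consistent with aphyr calling 300k entries "small potatoes" -- but the per-entry constant is why he advises keeping bucket names and keys short.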
14:23 <kraay> Wow, this is awesome information
14:24 <kraay> Do you have any exposure to the Java client libraries for riak?
14:24 <aphyr> haha I avoid java like the plague
14:25 <kraay> What's your preferred flavor?
14:25 <aphyr> Ruby/python/erlang
14:26 <aphyr> I believe there's a lot of development going on around the java
client right now
14:26 <aphyr> Many mailing list posts
14:26 <aphyr> You might take a look and get in contact with one of them?
14:26 <kraay> Ah, yeah, the last project (a SaaS) was written in ruby -- it was
great for rapid development, but really difficult to verify and test
14:27 <aphyr> I find my problem isn't so often ruby as it is external services...
14:27 <aphyr> I use bacon for testing, works great, but when you're dealing with
DBs, networks, etc... stuff hits the fan
14:27 <aphyr> Erlang is really growing on me though. Much slower to write, but the results are usually terse and predictably correct.
14:28 <aphyr> Take a look at QuickCheck
14:29 <kraay> oooh - I like it
14:30 <kraay> The strong type checking is what prompted me to look into Scala --
I guess erlang has to have the same thing.
14:31 <aphyr> Erlang is strange... doesn't really have "types" per se
14:31 <aphyr> But the static verifier is very thorough
14:31 <aphyr> I am excited about scala for sure, haven't had the chance to play with it yet.
14:33 <kraay> oops, I've got to run -- one kid just woke up and stumbled in.
14:33 <aphyr> later!
14:33 <kraay> Thank you very, very much for all your insight
14:33 <kraay> have a good day/night/morning :)