@PharkMillups
Created April 4, 2011 21:03
13:57 <kraay> Hello everyone -- I'm evaluating using riak as a document store.
We'd be storing about 300k JSON docs of identical structure/complexity.
However, I'd like to provide the end user with ability to use JPath
(a weak form of XPath) as a means of traversing the JSON doc and
returning those related bits. What would be the best way of doing
this? via link-walking? map-reduce? And can I do either in RT?
13:59 <aphyr> kraay: RT?
13:59 <kraay> Real Time
13:59 <aphyr> What do you think realtime means? :)
14:00 <kraay> Let's say 30 ms
14:00 <aphyr> For a single document? Not a problem.
14:00 <kraay> How about for 300k documents
14:00 <aphyr> So the user knows which document they want to operate on,
and provides a JPATH query?
14:01 <aphyr> Or the user wants to run JPath against the entire dataset at once?
14:02 <kraay> Well, the user would send a subset of document "ids" (which would
possibly be 300k) and we'd take that and the JPath query to return the specified fields.
14:03 <kraay> The JPath would be merely as a means of walking the JSON doc.
14:03 <aphyr> If you know of any JSON parser which can parse 300K
documents in under 30ms, I am extremely interested to hear about it. :)
14:03 <aphyr> That aside, yeah doesn't sound too hard.
14:04 <aphyr> You'd probably want to do this with mapreduce. Issue an MR query
with the list of IDs. For each ID, pass the JPath expression to the map phase
and have it run the query against that document.
14:04 <aphyr> Then reduce the set of results together.
14:04 <aphyr> Would scale linearly with nodes and run in something like
O(documents_requested/nodes)
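
The map/reduce flow aphyr describes can be sketched in plain Python. This is a local simulation only, not Riak's MapReduce API: the in-memory `store` dict, the function names, and the dotted-path syntax standing in for JPath are all assumptions made for illustration.

```python
from functools import reduce


def jpath_get(doc, path):
    """Walk a parsed JSON document along a dotted path such as 'user.name'.

    The dotted syntax is a hypothetical stand-in for JPath.
    Returns None if any step along the path is missing.
    """
    node = doc
    for step in path.split("."):
        if isinstance(node, dict) and step in node:
            node = node[step]
        else:
            return None
    return node


def map_phase(doc_id, doc, path):
    """Map step: run the path query against one document."""
    return [(doc_id, jpath_get(doc, path))]


def reduce_phase(acc, results):
    """Reduce step: concatenate per-document results into one list."""
    return acc + results


def run_query(store, ids, path):
    """Apply the map step to each requested id, then reduce the results.

    `store` is a plain dict standing in for a bucket; ids not present
    in the store are skipped.
    """
    mapped = (map_phase(i, store[i], path) for i in ids if i in store)
    return reduce(reduce_phase, mapped, [])
```

Each map call touches one document and is independent of the others, which is why the work can be spread across nodes as aphyr suggests.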
14:05 <aphyr> 30ms is definitely achievable for 200 documents or so
14:05 <aphyr> But I expect passing hundreds of thousands of documents into your
MR query is going to take a significant amount of time
14:06 <kraay> Couldn't I just scale out the number of mappers out?
14:07 <aphyr> Submit multiple jobs concurrently? Sure.
14:07 <aphyr> Probably would take more time though.
14:07 <kraay> Sure, of course.
14:08 <aphyr> I suspect it would only be worth it if issuing the request
to Riak took more time than managing the distribution yourself would take.
14:08 <aphyr> Riak can only begin the map jobs when it's finished
receiving/deserializing your request so for 300K ids there could be a nontrivial
delay.
14:09 <aphyr> Never tried something that size but I'm sure others have. :)
14:10 <kraay> Hmmm... would the gains be worth the effort, if I used protocol
buffers or BSON?
14:10 <kraay> ...if that's even possible
14:10 <aphyr> For your document serialization?
14:10 <aphyr> I would use erlang terms.
14:10 <aphyr> The serialization would be extremely fast and it would be a
good match for an erlang MR job.
14:11 <kraay> Ah, I've been itching to dabble a bit more in erlang :)
14:11 <aphyr> Yeah, I've found huge performance gains from converting JS MR phases into erlang.
14:11 <aphyr> Be aware that erlang JSON parsing is... obtuse.
14:12 <kraay> Heh, I was monkeying around with JSON parsing in Scala --
finding a suitable library is also a bit of a challenge :)
14:12 <aphyr> mochijson2 converts objects to {struct, [{key, value}, {key, value}, ...]}
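
To make that shape concrete, here is a rough Python rendering that uses Python tuples in place of Erlang tuples. It only mimics the structure aphyr describes; it is not mochijson2's API, and the helper names `to_struct` and `struct_get` are hypothetical.

```python
def to_struct(value):
    """Convert parsed JSON (dicts/lists/scalars) into a nested
    ("struct", [(key, value), ...]) form resembling the shape above."""
    if isinstance(value, dict):
        return ("struct", [(k, to_struct(v)) for k, v in value.items()])
    if isinstance(value, list):
        return [to_struct(v) for v in value]
    return value


def struct_get(term, key):
    """Look up a key in a ("struct", proplist) term.

    The lookup is a linear scan over the property list, which is part
    of why working with this representation feels awkward.
    """
    tag, props = term
    if tag != "struct":
        raise ValueError("not a struct term")
    for k, v in props:
        if k == key:
            return v
    return None
```

Every nested object needs another `struct_get` call on the returned term, so a three-level path means three scans.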
14:13 <aphyr> Also recall that *building* an object in memory may take more
time than required; since the serialization format is linear you're already
paying an O(document size) cost
14:13 <aphyr> You might be able to do things faster by choosing your serialization
cleverly and doing it as a stream parser or something.
14:14 <aphyr> There's a generality--speed continuum; you'll have to decide the tradeoff.
14:14 <kraay> Well, then why not use a simple link-walk to represent the JSON data?
14:15 <aphyr> Depends on what shape your data is
14:15 <aphyr> Link-walking is essentially parsing a list of erlang terms stored as
metadata around the object
14:15 <aphyr> then returning specific terms as the result of a map query.
14:16 <aphyr> You'll incur an additional cost from the latency of jumping from
object to object
14:16 <aphyr> (which may be on different hosts)
14:16 <aphyr> also, riak mapreduce currently finishes *all* jobs in the same phase
before moving to the next phase
14:16 <aphyr> So you could see blocking issues
14:17 <aphyr> It would almost certainly be slower than just storing all the data as
erlang terms in the same object
14:18 <kraay> ...and then accept the overhead in having to put the entire
document in memory
14:19 <aphyr> If your documents are large that could be a problem.
14:19 <kraay> A single JSON document will be a map, with about 10 keys. Each
value will contain a nested set of maps about 3 layers deep.
14:20 <aphyr> So roughly 1000 things total?
14:20 <kraay> Sorry, no -- approx 100
14:20 <aphyr> Oh yeah that's tiny
14:21 <aphyr> I would put it all in one object
14:21 <aphyr> You probably won't even notice the deserialization time
14:21 <aphyr> If it becomes an issue you could use a stream parser to cut your
time by a constant factor
14:22 <kraay> ah, ok -- then regarding the buckets. Would it be possible to
stuff the 300k entries into a single bucket?
14:22 <aphyr> Sure!
14:22 <aphyr> 300k is small potatoes
14:22 <aphyr> (buckets by the way are just key prefixes)
14:22 <kraay> the idea being that eventually we might have different sources of
JSON data, and the intuitive idea would be to stuff them into separate buckets
14:22 <kraay> great
14:22 <aphyr> So keep your bucket names small
14:23 <aphyr> Same goes for keys
14:23 <aphyr> There's an in-memory cost of ~40bytes + key length
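
A back-of-the-envelope estimate of that cost for the dataset discussed here might look like the sketch below. The ~40 byte figure is aphyr's number from the chat. Counting the bucket name alongside the key follows his remark that buckets are key prefixes, but it is still an assumption, as are the example name lengths.

```python
def key_overhead_bytes(n_keys, bucket_len, key_len, fixed=40):
    """Estimate in-memory key overhead in bytes.

    fixed      -- per-key overhead (~40 bytes, per the chat)
    bucket_len -- length of the bucket name in bytes (assumed to be
                  stored with every key, since buckets are key prefixes)
    key_len    -- length of each key in bytes
    """
    return n_keys * (fixed + bucket_len + key_len)


# 300k documents, a 4-byte bucket name, 16-byte keys (example values)
estimate = key_overhead_bytes(300_000, bucket_len=4, key_len=16)
print(estimate)  # 18000000
```

Under these assumptions the overhead is about 18 MB, which is consistent with 300k being "small potatoes".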
14:23 <kraay> Wow, this is awesome information
14:24 <kraay> Do you have any exposure with the Java client libraries to riak?
14:24 <aphyr> haha I avoid java like the plague
14:25 <kraay> What's your preferred flavor?
14:25 <aphyr> Ruby/python/erlang
14:26 <aphyr> I believe there's a lot of development going on around the java
client right now
14:26 <aphyr> Many mailing list posts
14:26 <aphyr> You might take a look and get in contact with one of them?
14:26 <kraay> Ah, yeah the last project (a SaaS) was written in ruby -- It was
great for rapid development, but really difficult to verify and test
14:27 <aphyr> I find my problem isn't so often ruby as it is external services...
14:27 <aphyr> I use bacon for testing, works great, but when you're dealing with
DBs, networks, etc... stuff hits the fan
14:27 <aphyr> Erlang is really growing on me though. Much slower to write, but the results are usually terse and predictably correct.
14:28 <aphyr> Take a look at QuickCheck
14:29 <kraay> oooh - I like it
14:30 <kraay> The strong type checking is what prompted me to look into Scala --
I guess erlang has to have the same thing.
14:31 <aphyr> Erlang is strange... doesn't really have "types" per se
14:31 <aphyr> But the static verifier is very thorough
14:31 <aphyr> I am excited about scala for sure, haven't had the chance to play with it yet.
14:33 <kraay> oops, I've got to run -- one kid just woke up and stumbled in.
14:33 <aphyr> later!
14:33 <kraay> Thank you, very, very much for all your insight
14:33 <kraay> have a good day/night/morning :)