@PharkMillups
Created July 8, 2010 17:14
copious # seancribbs: are you taking questions ahead of time for tomorrow's webinar?
seancribbs # you can ask right now ;)
copious # well, I think others would be interested in the answers too.
One question: what are the ramifications of having millions of buckets?
seancribbs # depending on the intended use, I advocate them
copious # and which backends are more appropriate for a large number of small buckets vs. a small
number of large buckets? I was thinking of further uses for riak down our road,
and of solving various data access patterns, one of which would involve millions
of buckets. Which, as I understand it, would not be good to use with the innostore
backend, since that would be one InnoDB file per bucket.
seancribbs # per bucket per partition
benblack # you are going to have that problem with any useful backend at present,
i expect. is there a reason you need millions of buckets in particular, or is
it just for the uri path?
seancribbs # inno is the best choice if you're going to do full-bucket
scans (list-keys) for that reason, but at large numbers of buckets
it will be swapping file handles a lot
copious # I was pondering a schema where we have /messages/<message-id> -- this is
one bucket with billions of documents
copious # each of those messages has an author, some "authors" have many thousands
of messages and I would have /authors/<author-id> bucket, with millions of authors in the
authors bucket
seancribbs # right, so you might create an author_1284295629_messages/ bucket that
uses inno, where other things use a different backend and that bucket would just
contain lightweight objects that point to the original messages
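The layout seancribbs describes — full message documents in one big bucket, plus lightweight pointer objects in a small per-author bucket — can be sketched with an in-memory stand-in for the store. This is not a real Riak client API; the bucket and key names are illustrative, borrowing the author_1284295629_messages example from the discussion:

```python
# In-memory sketch of the two-bucket layout: a dict keyed by (bucket, key)
# stands in for the Riak keyspace.
store = {}

def put(bucket, key, value):
    store[(bucket, key)] = value

def save_message(author_id, message_id, body):
    # Full document goes in the single "messages" bucket
    # (the one with billions of entries).
    put("messages", message_id, {"author": author_id, "body": body})
    # A lightweight pointer goes in the per-author bucket; listing this
    # bucket's keys yields all of the author's message ids.
    put("author_%s_messages" % author_id, message_id,
        {"ref": "/messages/%s" % message_id})

save_message("1284295629", "m1", "hello")
save_message("1284295629", "m2", "world")

author_keys = sorted(k for (b, k) in store
                     if b == "author_1284295629_messages")
print(author_keys)  # ['m1', 'm2']
```

The point of the sketch is that the expensive operation (enumerating an author's messages) becomes a key listing over a small bucket, while the big messages bucket is only ever hit by direct key lookups.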
copious # and to tie the messages to the authors, I could use Link:, or
I could have /<author-id>/<message-id> with basically a Link to the message document
benblack # suggest actually using links rather than proliferating buckets like that
copious # well, the issue is keeping the bi-directional links in place. It
would mean updating the /authors/<author-id> document quite a bit, and a not-insignificant
portion of the authors would end up having thousands upon thousands of Links to messages.
seancribbs # benblack: the question then becomes whether you want to pay
the cost of listing keys or loading a huge object
benblack # indeed
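Riak's HTTP interface carries the Link: mechanism copious mentions as RFC-style Link headers with a riaktag parameter. A minimal sketch of building one such header; the specific tag name here is an assumption for illustration, not from the discussion:

```python
def riak_link(bucket, key, tag):
    # Riak's HTTP API expresses a link as an HTTP Link header pointing at
    # another object's /riak/<bucket>/<key> path, tagged with riaktag.
    return '</riak/%s/%s>; riaktag="%s"' % (bucket, key, tag)

# The cheap direction: each message carries one link back to its author.
print(riak_link("authors", "1284295629", "author"))
# </riak/authors/1284295629>; riaktag="author"
```

The tradeoff seancribbs names falls out of this: the message-side header is a single fixed-size value that never changes (one author per message), while the author-side equivalent would be thousands of such headers on one object.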
seancribbs # copious: i think you are fine as long as you have only one author
benblack # copious: how many authors?
seancribbs # (per message)
benblack # perhaps i misunderstood which thing was numerous
copious # 1 author per message
seancribbs # then you should never have to change the link on the message
copious # in relational terms, author has many messages, where "many" can be in
the thousands. Yup, the message -> author link is pretty trivial. It's
that many times we will want to say "I want all messages from a particular author",
so I could either list all the keys in a /author-id bucket, or follow all
the links in a /authors/author-id document. I just figured that maintaining
the /author-id bucket would be easier than maintaining the /authors/author-id document.
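The alternative copious is weighing against — keeping every message link on the single /authors/<author-id> document — can be sketched to show why that object grows with N. All names here are illustrative:

```python
# Sketch of the rejected design: one author document accumulating a link
# per message. Every new message rewrites this whole object, and for
# prolific authors it grows into the thousands of entries.
author_doc = {"links": []}

def add_message_link(doc, message_id):
    doc["links"].append('</riak/messages/%s>; riaktag="message"' % message_id)
    return doc

for i in range(3):
    add_message_link(author_doc, "m%d" % i)

print(len(author_doc["links"]))  # 3 — grows linearly with message count
```

With the per-author-bucket design, by contrast, each new message is an independent small write, and no single object ever has to be read, modified, and rewritten on the hot path.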
seancribbs # this is why knowing the cardinality is important
numerality?
i don't recall the correct word
copious # the scale of N in 1:N relationships ?
seancribbs # yeah
copious # yeah, in this case, N can become quite large
benblack # cardinality
seancribbs # then you definitely want their insertion to be bottlenecked by updating
the author. benblack: i was right the first time then, heh
benblack # indeed
seancribbs # definitely _don't_ want
copious # seancribbs: agreed, that's why I was thinking that /<author-id> buckets would
be the proper way to go in this situation
seancribbs # yes. it's not an ideal solution, but it should work
seancribbs # the message of my webinar tomorrow is "everything has tradeoffs"
copious # then its the manner of making sure the backend is okay with millions
of buckets. seancribbs: I comletely agree :-) or I should say "sucks the least
when having millions of buckets" seancribbs: thanks. benblack: thanks.