Dreddit

An architecture for a distributed, p2p reddit-like thing

Problem?

If you're a reddit user you probably know that from time to time reddit goes slow or even goes down due to unexpectedly high load (many users trying to use the site simultaneously), and afterwards users create posts like "Fix it! Can't you just buy more servers?". I once found a pretty interesting question in one of those threads -- why can't we make a distributed, p2p reddit(-like thing) so it won't depend on lots of expensive servers?

As a person who has some clue about how p2p systems work (I've implemented an ED2k client, for example), I commented that there are unresolved technical problems. P2P has been amazingly successful for some applications, notably filesharing, but using it for a highly interactive site like reddit would be quite a challenge -- reddit's operations do not map easily onto well-known constructs like distributed hash tables (DHTs), so implementing it would require inventing new stuff. (Well, you could probably hack something together with DHTs and such, but I bet it would be spectacularly slow and unusable.)

But now each time reddit goes slow I think about how such a distributed, p2p reddit-like thing could be built. And at some point I found it is not as hard as I initially thought -- it would require A LOT of work, of course, but it boils down to a few relatively simple and well-understood constructs if you think about it in the right way.

So I'd like to share my thoughts with other people. Please note that no such system has been built, and this is not even an instruction for how to build one. It is merely "thinking aloud" to initiate a potentially interesting discussion.

So what is reddit?

That's a good question. Before we design a solution we need to know the problem; that is, we need a model which describes the system well enough but omits unnecessary details. Reddit is large, and modeling all of it at once is not possible.

I think a good place to begin is the commenting subsystem -- if we can do comments, then doing the rest should be possible too. Comments are reasonably hard to implement, but they are not too hairy, which is good.

Implementing a comment system is not rocket science, of course, but there are constraints -- it should be scalable and fast at the same time. It shouldn't take more than a few seconds to load comments. And there should be no lag -- you should get new/updated comments and upvotes on time scales on the order of seconds.

Going back to the idea of using a DHT in a trivial and naive fashion -- it is possible to implement commenting this way, with one or more key/value pairs for each comment. But for a large enough thread you'd have to do thousands of DHT lookups to render the tree, and that is going to take a lot of time. Besides that, DHT updates are not particularly fast or reliable, so it might take a long time before you get new comments, which sucks.

Now, if we want it to be fast and interactive, we need to take updates and changes as the cornerstone of the model. For the purpose of this model, let's consider "reddit" to be a constant stream of events -- comments/edits/downvotes/upvotes/deletes. We want distributed network clients to be able to access those events in some way, since with these events they will be able to reproduce comment threads and other stuff.

Very basic architecture

Now we can think of the participants and the general idea of an architecture (please note -- this is just one of many ways to do it; further in this article we'll explore many different models and architectures):

 _________        _________________        _____________________
/ clients \ <--- /distributed store\ <--- /reddit origin servers\
\_________/ <--- \_________________/ <--- \_____________________/

So here's how it goes: users post comments, upvote/downvote stuff and so on via the reddit site. This way they generate a stream of events. These events are fed into a "distributed store" thing -- a DHT or something like that. Later, clients can query the distributed store to get those events without disturbing the reddit origin servers, thus generating no additional load, which is the point.

Note that clients only get a read-only view this way, which is less than perfect. But think about it -- if reddit is down, wouldn't you like to have at least a read-only copy of it? Moreover, if clients use only the distributed store rather than the origin servers, that will offload reddit's servers and thus help bring the site back online, so posting becomes available again.

Now it is not hard to sketch a protocol which implements this.

First, what is an event, and how is it structured?

Obviously, we need some way to group events, as handling all of them at once is not feasible. Let's introduce the concept of an event feed, which for the purposes of this system is just some opaque identifier that identifies a group of events. For example, each reddit post can be associated with a comment feed which can be identified by a URL like this:

http://www.reddit.com/r/technology/comments/eqhxe/pwned_by_the_owner_what_happens_when_you_steal_a/

Or it can be just an integer ID if we introduce an additional level of indirection. It doesn't matter at this point.

Now we can define an event as a tuple: (feed_id, timestamp, event_kind, details).

Let's say somebody has posted a link and it got its comment feed_id, say, 31337. (How this is implemented is out of scope for this protocol.) Then people start commenting, which is represented as a stream of events (JSON, if you do not mind):

    #a comment is posted
    {feed_id: 31337, timestamp: "17:01", event_kind: "add_comment", comment_id: 1,
        parent_comment: 0, user_id: "killerstorm", message: "Not programming!"}
    #it is downvoted by somebody
    {feed_id: 31337, timestamp: "17:02", event_kind: "downvote", comment_id: 1}
    #a reply is posted
    {feed_id: 31337, timestamp: "17:03", event_kind: "add_comment", comment_id: 2, parent_comment: 1,
        user_id: "qgyh3", message: "So what?"}
    #the comment is deleted
    {feed_id: 31337, timestamp: "17:04", event_kind: "delete_comment", comment_id: 1}

It is obvious that we can re-create the comment tree from this information, but I haven't yet addressed the issues I blamed the "naive DHT" approach for.
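To make this concrete, here's a minimal sketch (Python; the function name and the exact vote/delete handling are my own, only the event fields come from the example above) of how a client could replay such an event stream into a comment tree:

    # A sketch of replaying feed events into a comment tree. Field names
    # follow the example events above; everything else is hypothetical.
    def replay(events):
        comments = {0: {"score": 0, "children": []}}   # 0 = thread root
        for ev in sorted(events, key=lambda e: e["timestamp"]):
            kind = ev["event_kind"]
            if kind == "add_comment":
                c = {"user_id": ev["user_id"], "message": ev["message"],
                     "score": 0, "children": []}
                comments[ev["comment_id"]] = c
                comments[ev["parent_comment"]]["children"].append(c)
            elif kind == "upvote":
                comments[ev["comment_id"]]["score"] += 1
            elif kind == "downvote":
                comments[ev["comment_id"]]["score"] -= 1
            elif kind == "delete_comment":
                # keep the node so replies stay attached, just blank it
                comments[ev["comment_id"]]["message"] = "[deleted]"
        return comments[0]["children"]   # the top-level comments

The point is that once the client has the events for a feed, rebuilding the thread is a simple local fold; the hard part is getting the events cheaply, which is what the rest of this article is about.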

Scalability v1

Here's a very basic use case: you open a comment page and your client should pull the comments for that page from the distributed store, asking for events with a certain feed_id. But the distributed store is a number of servers -- which of them, specifically, should it ask?

As we have the feed_id in the query, it is quite natural to use it to divide work among servers. That is, each server should be responsible for a certain set of feed_ids, and a client should be able to find the server(s) responsible for the feed_id it is interested in (I will not describe a particular mechanism yet; so far we're interested only in the server-side workings).

Now let's guess how it might work on the server side. reddit designates a number of servers to be feed servers for the distributed store -- they will push new events into the store. Under reasonable assumptions this scales perfectly well -- it is possible to spawn as many feed servers as needed: web servers (which receive events from users) push events to feed servers, and the feed servers then feed the masses.

But it doesn't make sense for distributed store servers to connect to reddit's feed servers directly -- that would put too much load on them. Instead, they should pull events from each other. A very simple tree (or, rather, forest) topology would do:

L.2 server 1 \
L.2 server 2  <-- L.1 server 1-3\
L.2 server 3 /                   /---- Origin feed server 1-6
L.2 server 4 \                   \----
L.2 server 5  <-- L.1 server 4-6/
L.2 server 6 /

L.2 server 7 \
L.2 server 8  <-- L.1 server 7-9\
L.2 server 9 /                   /---- Origin feed server 7-12
L.2 server 10\                   \----
L.2 server 11 <-- L.1 server 10-12/
L.2 server 12/

This puts only a little strain on each particular server -- each origin feed server feeds only two downstream servers, and each level 1 server feeds three -- but the number of servers in the distributed store can grow exponentially as we add more layers. So it is limited only by the number of servers which participate; there are no bottlenecks.
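A quick back-of-the-envelope illustration of that growth (the fan-out of three matches the diagram; nothing else here is prescribed):

    # Each layer multiplies the server count by its fan-out, so a few
    # layers are enough for a large store. Fan-out 3 as in the diagram.
    fanout = 3
    for depth in range(1, 6):
        print("layer %d: up to %d servers" % (depth, fanout ** depth))
    # layer 1: up to 3 servers ... layer 5: up to 243 servers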

What about number ranges like 4-6? They are our first, very simple responsibility and server-discovery mechanism.

Let's say we're interested in feed number 31337. We know that feeds are organized into 12 buckets, so we take the modulus: 31337 mod 12 = 5. Now we know we need to contact "L.2 server 5" -- or, rather, one of them, if there are many servers responsible for bucket 5. So we need its IP address.
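Here's a sketch of that arithmetic in Python (the constants mirror the diagram above -- 12 buckets, grouped by 3 at level 1 and by 6 at the origin; in reality these would be configuration, not hard-coded):

    # Map a feed to its bucket, and the bucket to the ranges covered by
    # the servers above it in the tree. Constants mirror the diagram.
    NUM_BUCKETS = 12
    L1_GROUP = 3       # each L.1 server covers 3 buckets
    ORIGIN_GROUP = 6   # each origin feed server covers 6 buckets

    def bucket_for(feed_id):
        b = feed_id % NUM_BUCKETS
        return b or NUM_BUCKETS          # the diagram numbers buckets 1..12

    def covering_range(bucket, group):
        lo = (bucket - 1) // group * group + 1
        return lo, lo + group - 1

    bucket = bucket_for(31337)                             # 31337 mod 12 = 5
    l1_range = covering_range(bucket, L1_GROUP)            # (4, 6)
    origin_range = covering_range(bucket, ORIGIN_GROUP)    # (1, 6)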

Wait a minute, isn't this the same huge problem we started with? Not really: we can assume that the number of servers is much lower than the number of events and feeds (and that servers change infrequently), so this information is much easier and cheaper to distribute. There are many possible ways to do it -- a simple gossip protocol which spreads a flat list of servers (should work well with server counts on the scale of thousands), some form of DHT, or just DNS -- which is interesting because the infrastructure is already there and DNS records are even cached.

Here's how it could work with DNS: the client queries 5.4-6.1-6.distributed.reddit.com, which returns a list of domain names of the servers responsible for bucket 5. The client can then ask for A and SRV records for those names to find ports and IP addresses. This information is cached by nearby DNS servers, so the load on reddit's own DNS stays pretty low.
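Continuing the sketch from above, the lookup could go something like this. Keep in mind the naming scheme is just the hypothetical one from this article, and using SRV records for the first query is my own assumption, since the record type for it isn't pinned down here (dnspython does the lookups):

    # Build the hypothetical discovery name and resolve it. Requires the
    # third-party dnspython package (pip install dnspython); the name
    # scheme is made up for this article, not a real reddit service.
    import dns.resolver

    def discovery_name(bucket, l1, origin):
        return "%d.%d-%d.%d-%d.distributed.reddit.com" % (
            bucket, l1[0], l1[1], origin[0], origin[1])

    name = discovery_name(5, (4, 6), (1, 6))  # 5.4-6.1-6.distributed.reddit.com

    # Assumption: the discovery name has SRV records pointing at the
    # responsible servers; A records then give their IP addresses.
    for srv in dns.resolver.resolve(name, "SRV"):
        host = str(srv.target).rstrip(".")
        for a in dns.resolver.resolve(host, "A"):
            print(host, srv.port, a.address)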
