Skip to content

Instantly share code, notes, and snippets.

@juliangruber
Created October 6, 2012 22:51
Show Gist options
  • Star 1 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save juliangruber/3846423 to your computer and use it in GitHub Desktop.
Save juliangruber/3846423 to your computer and use it in GitHub Desktop.
Distributed Datastore

Distributed Datastore

Building a distributed database/api on an eventually consistent distribution mechanism like scuttlebutt.

Data model

There are users and (possibly anonymous) posts with permalinks that only take strings/buffers.

Think of information you...

  • need to store permanently, possibly in multiple countries at the same time
  • need to crud in realtime, possibly collaboratively

Open questions

  • storing lists like timelines in k/v-pairs
  • how to connect nodes
  • following people vs. rss feeds
  • can replication delay be kept under 2s (time to send someone a link)

Components

host

This is a server/node that stores, replicates and merges data.

It emits its

  • current revision
  • mean response time
  • geolocation
  • error rate
  • heartbeat

namestore

A seperate DB used to check the availablity of unique names like usernames or slugs consistently across the network. Its data model is a simple set that contains all registered names and can say if one name is in it and if not, add it. Queried on every registration and anonymous post creation.

If we don't have this we can only use sha's like git does, no usernames and stuff.

This needs to be a seperate DB.

Requirements:

  • Atomic: ask and write in an atomic step
  • Consistent: that's the whole point of it

router

The main http/websocket/whatever frontend (clustered).

Picks a host to route to based on score s, calculated with values from host h, dataset d and client c:

  • replication delay
  • mean reponse time
  • proximity
  • error rate

and parameters i,j,k,l:

s = 1/(i*h.replication_delay(d) + j*h.mean_reponse_time + k*|(client.location - h.location)| + l*h.error_rate)

monitor

  • Checks each host's heartbeat
  • Creates & kills hosts on demand
  • Connects hosts

Connection Strategies

Can nodes (re)connect randomly or is monitor required?

a <-> c <-> b <-> d     c = 1|2
  • if a node in the middle fails, the graph is split and monitor has to connect nodes
a <-> c <-> b <-> d     c = 2
 \---------------/
  • if a node in the middle fails, the graph degrades to the chain above
 /> b <\
a   |   d               c = 2|3
 \> c </
 /> b <\
a <-+-> d               c = 3
 \> c </
  /> b <\
 /   |   \
a <- c -> e             c = 3|4
 \   |   /
  \> d </
a              n = 1

a <-> b        n = 2

a <-> b        n = 3
 \ c /

 /  d  \
a <(+)> b      n = 4
 \  c  /

Steps

Registration

var name = 'juliangruber';
var password = 'foobar';

if (!namestore.register('user:'+name)) return false;
model.set('user:'+name, {
  name: name,
  password: sha(password)
});

New post

Post URL: /<user>/<title>/.

var user = 'juliangruber';
var title = title || new Date;
var msg = 'my post';

var path = 'user:'+user+':post:'+title;
if (model.get(path)) return false;

model.set(path, { msg: msg });
// Add to public timeline
// Add to user timeline
// Add to follower inboxes

New anonymous post

Post URL: /<slug>/.

var slug = slug || generateSHA();
var msg = 'my post';

// check title or regenerate sha
var path = 'post:'+slug;
while (!namestore.register(path)) {
  if (!isSHA(slug)) return false;
  path = 'post:'+(slug = generateSHA());
}

model.set(path, { msg: msg });
// Add to public timeline
@dominictarr
Copy link

thinking about stuff like this too, have a look at https://github.com/dominictarr/crdt

It's a subclass of scuttlebutt but has more a sophisticated model, that has partial updates, sets, and ordered sets.

I really like this approach: http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html working towards something like this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment