@max-mapper
Last active August 29, 2015 14:18
dat beta design notes
  • dat data model
    • commit graph (mutable)
      • Merkle DAG
      • integrity-check a root hash and know all of its children are ok too (efficient!)
      • deduplication of common children
      • can content address all the nodes in the data structure
      • can reveal a single path from a node to the root (with adjacent hashes) and prove the object must be in the dag
    • operation database (mutable)
      • protobufs in leveldb (see the sketch after this list)
      • protobufs support efficient encodings and backwards compatibility
      • leveldb is easily swappable for other embedded data stores
    • blob store (immutable)
      • stored on disk or any abstract-blob-store compatible thing
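
For illustration, here is a minimal sketch of the data model plumbing: a graph node protobuf-encoded, content addressed by its hash, and stored in leveldb. The `Node` schema here is made up for the sketch and is not dat's actual wire format.

```js
var protobuf = require('protocol-buffers')
var level = require('level')
var crypto = require('crypto')

// illustrative schema only -- dat's real message definitions differ
var messages = protobuf([
  'message Node {',
  '  repeated string links = 1;', // hashes of parent nodes (the dag edges)
  '  required bytes change = 2;', // the commit payload
  '}'
].join('\n'))

var db = level('./graph.db')

function putNode (node, cb) {
  var buf = messages.Node.encode(node)
  // the node's address is the hash of its encoded bytes, so verifying a
  // root hash transitively verifies every node it links to
  var hash = crypto.createHash('sha256').update(buf).digest('hex')
  db.put(hash, buf, { valueEncoding: 'binary' }, function (err) {
    cb(err, hash)
  })
}
```
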
  • replication
    • get duplex stream to remote
      • dat CLI client will default to http but anything will do
    • establish multiplexed transport
      • length-prefixed protocol buffers
      • modules: length-prefixed-stream, multiplex, protocol-buffers (wired together in the sketch after this list)
      • our protobuf lib is faster than json in most cases :)
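
A sketch of the transport layering, using a tcp socket for concreteness (the CLI defaults to http, but any duplex stream works); the channel name and message payload here are placeholders:

```js
var net = require('net')
var multiplex = require('multiplex')
var lps = require('length-prefixed-stream')

var socket = net.connect(8080) // stand-in remote; any duplex stream will do
var plex = multiplex()
plex.pipe(socket).pipe(plex) // multiplexer rides on the duplex transport

// one channel for graph exchange; more channels can be opened per commit
var graph = plex.createStream('graph')
var encode = lps.encode()
encode.pipe(graph)

// each write becomes one length-prefixed frame; in dat the frame body
// would be a protobuf-encoded message
encode.write(Buffer.from('hello'))
```
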
    • graph exchange phase
      • optimized to minimize roundtrips
      • goal is to sync hashes (graph nodes)
      • state your intent (pushing/pulling/both)
      • receive the other side's intent
      • send your heads (one head per owner)
      • each head has an owner id and local change #
      • whoever is receiving data is 'passive'
      • for each head (owner), check whether you need newer data (see the sketch after this list)
      • send back what you need
      • other end starts streaming newer data for those owners
      • if a node references a new owner, the pusher first sends the new head
      • the puller can then decide if they want to ask the pusher to resolve that owner's log
      • finally, the graph will be consistent for the requested heads
      • optimization: store node depth on insert, use to send most important (shallower) nodes first
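
The head comparison at the core of the exchange can be sketched as below. The `{ owner, change }` head shape is an assumption for illustration, not the exact protobuf message:

```js
// given the remote's heads (one per owner), figure out which owners we
// need newer data for: unknown owners, or owners with a higher change #
function wantedHeads (localHeads, remoteHeads) {
  return remoteHeads.filter(function (remote) {
    var local = localHeads[remote.owner]
    return !local || remote.change > local.change
  })
}

// e.g. a puller that is behind on alice and has never seen bob:
var need = wantedHeads(
  { alice: { owner: 'alice', change: 7 } },
  [{ owner: 'alice', change: 9 }, { owner: 'bob', change: 2 }]
)
// -> both heads are wanted; this list is sent back, and the other end
// starts streaming newer graph nodes for just those owners
```
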
    • operation fetch
      • can be done in parallel with the graph exchange phase
      • read commit data in a graph node
      • each commit is a group of atomic operations
      • ask the remote for the operation k/v pairs in the commit (see the sketch after this list)
      • uses the same multiplexed connection, with a channel per commit
      • on the puller the operations are stored in the local mutable db
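
A sketch of the fetch step, assuming a channel-per-commit naming convention ('ops/' + hash) and JSON-framed operations for brevity (the real protocol uses protobufs):

```js
var multiplex = require('multiplex')
var lps = require('length-prefixed-stream')
var level = require('level')

var db = level('./ops.db') // the local mutable db
var plex = multiplex() // assume plex is already piped to the remote

function fetchCommit (commitHash) {
  // a dedicated channel per commit on the shared multiplexed connection
  var channel = plex.createStream('ops/' + commitHash)
  channel.pipe(lps.decode()).on('data', function (frame) {
    var op = JSON.parse(frame) // stand-in encoding; dat uses protobufs
    db.put(op.key, op.value)
  })
}
```
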
    • blob fetch
      • blobs are just a special type of operation where the data is external (abstract-blob-store; see the sketch after this list)
      • if an operation references a blob then it will cause a blob fetch to kick off
      • default blob exchange is over the same multiplexed connection but operations can define custom blob sources
      • custom sources examples: bittorrent, s3, ftp, ipfs
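
A sketch of a blob fetch landing in an abstract-blob-store; fs-blob-store stands in here, but any compatible store (s3, bittorrent, etc.) is a drop-in replacement:

```js
var blobs = require('fs-blob-store')('./blobs')

// pipe whatever source the operation points at (the multiplexed
// connection by default, or a custom source) into the blob store
function fetchBlob (sourceStream, key, cb) {
  var ws = blobs.createWriteStream({ key: key }, cb)
  sourceStream.pipe(ws)
}
```
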
  • local features
    • rollback/checkout
      • accessing the dat from some point of reference (aka time travel)
      • you can 'checkout' a dat to a hash in the graph, and all operations are then done relative to that hash (sketched after this list)
      • checkouts are just a hash, so you can do: dat clone foo.com/dat/3272y398h2934h9osdfh
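
A sketch of checkout in code, assuming a dat-core-style api (the exact method names are not confirmed by these notes):

```js
var dat = require('dat-core') // api shape assumed for illustration
var db = dat('./data')

// a checkout is just a hash; reads through it see the dat as of that node
var past = db.checkout('3272y398h2934h9osdfh')
past.createReadStream().on('data', console.log)
```
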
    • branches (owner logs can have multiple heads)
      • our current design is to replicate entire graphs so you can't do e.g. "per-branch clone"
        • but, graphs are just metadata so we think this is ok (we may be wrong)
        • e.g. you can just get the graph and only resolve branch ops/blobs
    • merging
      • we want to support custom merge strategies for 3-way data merges (automated and manual)
      • dat will not merge your data for you (unless it is fast-forwardable); it is up to you to pick your strategy
      • instead of halting on conflicts we create implicit branches; we anticipate and welcome conflicts
      • streaming merge algorithm (naive implementation)
        • diff stream
          • this is a readable stream
          • first you do a diff operation against two hashes in the dat graph
          • this creates 2 dat read streams, each checked out at its respective hash
          • keys come out of leveldb in lexicographic sorted order, so technically this is a streaming mergesort
          • for example, assume both streams will first emit a key 'a'
          • if the value of 'a' is the same on both sides, there is no conflict (not emitted by default, though maybe this could be enabled in the API?)
          • if the value differs, a conflict should be emitted from the diff stream, e.g. [{key: 'a', value: 'hello', head: hash1}, {key: 'a', value: 'goodbye', head: hash2}]
          • if only one side has the key (e.g. the key is new), it should be emitted (defaults to true, but can be disabled in the API?)
          • after comparing all keys in both sides, the stream ends
        • merge stream
          • this is a writable stream
          • each piece of data written to this will be translated into dat merge calls for specific keys
          • the merge api is db.merge(versions, value)
          • diff stream and merge stream are composable; the idea is you pipe diff stream -> custom resolver stream -> merge stream (see the sketch below)
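
A sketch of that pipeline with through2, using a naive resolver that always keeps the first side's value; the diff stream source is elided, and the stand-in `fakeDb` just logs the resulting db.merge(versions, value) calls:

```js
var through = require('through2')

// custom resolver: naive strategy that keeps the first side's value
var resolver = through.obj(function (conflict, enc, cb) {
  cb(null, {
    versions: conflict.map(function (side) { return side.head }),
    value: conflict[0].value
  })
})

// merge stream: each resolution becomes a db.merge(versions, value) call
function mergeStream (db) {
  return through.obj(function (res, enc, cb) {
    db.merge(res.versions, res.value)
    cb()
  })
}

// stand-in db so the sketch runs; the real call is dat's db.merge
var fakeDb = {
  merge: function (versions, value) { console.log('merge', versions, value) }
}

// real usage: diffStream(hash1, hash2).pipe(resolver).pipe(mergeStream(db))
resolver.pipe(mergeStream(fakeDb))
resolver.write([
  { key: 'a', value: 'hello', head: 'hash1' },
  { key: 'a', value: 'goodbye', head: 'hash2' }
])
```
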
    • blob operations
      • you can dat move or dat copy a blob from inside the immutable blob store into the user's fs. move removes it from the blob store, copy makes a copy (see the sketch below).
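
A sketch of the two blob operations against an abstract-blob-store; fs-blob-store stands in for the store, and the function names are made up:

```js
var fs = require('fs')
var blobs = require('fs-blob-store')('./blobs')

// 'dat copy': stream the blob out, leaving it in the store
function copyOut (key, dest, cb) {
  blobs.createReadStream({ key: key })
    .pipe(fs.createWriteStream(dest))
    .on('finish', cb)
}

// 'dat move': copy out, then remove the blob from the store
function moveOut (key, dest, cb) {
  copyOut(key, dest, function (err) {
    if (err) return cb(err)
    blobs.remove({ key: key }, cb)
  })
}
```
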
  • etc
    • api is designed for automation/programmatic usage. source code is finicky, but data has structure.
    • all i/o is streaming
    • designed to work in a browser too (you can swap out the mutable and immutable data stores with browser-friendly alternatives)
    • we are interested in rabin fingerprinting (rolling checksums) for blob diffing (a toy sketch follows this list)
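
To illustrate why rolling checksums help with blob diffing, here is a toy rsync-style rolling sum (not rabin's polynomial fingerprint, but the same sliding-window idea): the checksum of a window updates in O(1) as it slides, so chunk boundaries can be content-defined and survive inserts/deletes.

```js
function RollingSum (size) {
  this.size = size
  this.a = 0
  this.b = 0
}

// prime with the first window of bytes
RollingSum.prototype.init = function (window) {
  for (var i = 0; i < window.length; i++) {
    this.a += window[i]
    this.b += this.a
  }
}

// slide by one byte: constant-time update instead of rehashing the window
RollingSum.prototype.roll = function (outByte, inByte) {
  this.a += inByte - outByte
  this.b += this.a - this.size * outByte
}

// e.g. declare a chunk boundary whenever (sum.b & 0xfff) === 0, giving
// ~4kb average chunks whose boundaries move with the content
```
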
  • differences from git
    • our graphs store less data (data is external)
    • you can check out an individual branch in git, but in dat you have to get the whole metadata graph (you resolve the ops/blobs in dat separately)
    • git resolves latest -> backwards, which is simpler but means you can't easily resume a failed clone. not an issue for most code repos, but it matters more with larger graphs
    • we do it oldest -> forwards, which means you can support real-time live graph updates
    • git does not store blobs externally (by reference) by default; you must use an extension such as git-annex