- dat data model
- commit graph (mutable)
- Merkle DAG
- integrity check the root hash and you know all of its children are ok too (efficient!)
- deduplication of common children
- can content address all the nodes in the data structure
- can reveal a single path from a node to the root (with the adjacent hashes) and prove the object must be in the DAG (see the sketch after this list)
- operation database (mutable)
- protobufs in leveldb
- protobufs support efficient encodings and backwards compat
- leveldb is easily swappable for other embedded data stores
- blob store (immutable)
- stored on disk or any abstract-blob-store compatible thing
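
A minimal sketch of the content addressing idea above (the node shape and hash choice here are illustrative, not dat's actual wire format):

```js
var crypto = require('crypto')

// a node is identified by a hash over its value and its children's hashes,
// so verifying the root hash transitively verifies every reachable node
function hashNode (node) {
  var h = crypto.createHash('sha256')
  h.update(node.value)
  node.links.forEach(function (childHash) {
    h.update(childHash) // child hashes are part of the parent's identity
  })
  return h.digest('hex')
}
```

Identical subtrees hash to the same value, which is what makes deduplication of common children free.
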
- replication
- get duplex stream to remote
- the dat CLI client defaults to http, but any transport will do
- establish multiplexed transport
- length prefixed protocol buffers
- modules: length-prefixed-stream, multiplex, protocol-buffers (wiring sketch below)
- our protobuf lib is faster than json in most cases :)
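
A rough wiring sketch of that transport stack using the modules named above (the message schema and channel name are made up for illustration; `remote` is any duplex stream to the other side):

```js
var multiplex = require('multiplex')
var lps = require('length-prefixed-stream')
var protobuf = require('protocol-buffers')

var messages = protobuf('message Head { required bytes owner = 1; required uint64 change = 2; }')

// incoming channels opened by the other end arrive via this handler
var plex = multiplex(function (stream, id) {
  // split the raw bytes back into whole messages, then decode each one
  stream.pipe(lps.decode()).on('data', function (buf) {
    var head = messages.Head.decode(buf)
  })
})
plex.pipe(remote).pipe(plex) // remote: tcp socket, http duplex, websocket stream...

// outgoing: open a channel, length-prefix each encoded protobuf message
var encode = lps.encode()
encode.pipe(plex.createStream('graph'))
encode.write(messages.Head.encode({ owner: new Buffer('owner-id'), change: 1 }))
```
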
- graph exchange phase
- optimized to minimize roundtrips
- goal is to sync hashes (graph nodes)
- state your intent (pushing/pulling/both)
- receive the other side's intent
- send your heads (one head per owner)
- each head has an owner id and local change #
- whoever is receiving data is 'passive'
- for each head (owner), see if you need newer data (comparison sketch after this list)
- send back what you need
- other end starts streaming newer data for those owners
- if a node references a new owner, the pusher first sends the new head
- the puller can then decide if they want to ask the pusher to resolve that owner's log
- finally, the graph will be consistent for the requested heads
- optimization: store node depth on insert, use to send most important (shallower) nodes first
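
A hypothetical sketch of the per-head comparison the passive side does (the data shapes here are assumptions, not dat's actual messages):

```js
// localHeads / remoteHeads: map of owner id -> latest known change #
function wantedHeads (localHeads, remoteHeads) {
  var wants = []
  Object.keys(remoteHeads).forEach(function (owner) {
    var have = localHeads[owner] || 0
    if (remoteHeads[owner] > have) {
      // ask the other end to stream this owner's changes after `have`
      wants.push({ owner: owner, from: have })
    }
  })
  return wants
}
```
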
- operation fetch
- can be done in parallel with the graph exchange phase
- read commit data in a graph node
- commits are a group of atomic operations
- ask the remote for the operation k/v pairs in the commit
- uses the same multiplexed connection, with a channel per commit (sketch below)
- on the puller, the operations are stored in the local mutable db
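
A sketch of the puller side, reusing the `plex`/`lps`/`messages` setup from the transport sketch above (the channel naming, `Operation` message, and `db`/`commitHash` variables are assumptions):

```js
// puller: open a channel for this commit on the existing multiplexed connection;
// the other end receives it by id and streams the commit's operations back on it
var channel = plex.createStream('ops/' + commitHash)

channel.pipe(lps.decode()).on('data', function (buf) {
  var op = messages.Operation.decode(buf) // e.g. { key: ..., value: ... }
  db.put(op.key, op.value) // store into the local mutable db (leveldb)
})
```
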
- blob fetch
- blobs are just a special type of operation where the data is external (abstract-blob-store)
- if an operation references a blob then it will cause a blob fetch to kick off
- default blob exchange is over the same multiplexed connection, but operations can define custom blob sources (sketch below)
- example custom sources: bittorrent, s3, ftp, ipfs
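
Because every source just has to satisfy the abstract-blob-store interface, swapping them is trivial. A sketch (fs-blob-store is a real compatible module; `source` stands in for the default channel, s3, bittorrent, etc.):

```js
var blobs = require('fs-blob-store')('./blobs') // local on-disk blob store

// fetch a blob from any abstract-blob-store compatible source into the local store
function fetchBlob (source, hash, cb) {
  source.createReadStream(hash)
    .pipe(blobs.createWriteStream({ key: hash }, cb))
}
```
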
- local features
- rollback/checkout
- accessing the dat from some point of reference (aka time travel)
- you can 'checkout' a dat to a hash in the graph, and all operations are done relative to that hash (sketch below)
- checkouts are just a hash, so you can do: dat clone foo.com/dat/3272y398h2934h9osdfh
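
A sketch of what that could look like (the constructor and `checkout` method here follow these notes but are assumed, not a verified API):

```js
var dat = require('dat')('./data')

// all reads now happen relative to this hash in the graph (time travel)
var old = dat.checkout('3272y398h2934h9osdfh')
old.createReadStream().on('data', function (row) {
  // rows exactly as they were at that point in the graph
})
```
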
- branches (owner logs can have multiple heads)
- our current design is to replicate entire graphs so you can't do e.g. "per-branch clone"
- but, graphs are just metadata so we think this is ok (we may be wrong)
- e.g. you can just get the graph and only resolve branch ops/blobs
- merging
- we want to support custom merge strategies for 3-way data merges (automated and manual)
- dat will not merge your data for you (unless it is fast-forwardable); it is up to you to pick your strategy
- instead of halting on conflicts we do implicit branches. we anticipate and welcome conflicts
- streaming merge algorithm (naive implementation)
- diff stream
- this is a readable stream
- first you do a diff operation against two hashes in the dat graph
- this creates 2 dat read streams, each with a checkout at their relative hash
- keys come out of leveldb in lexicographic sorted order, so technically this is a streaming mergesort (comparison sketch after this list)
- for example, assume both streams will first emit a key 'a'
- if the value of 'a' is the same on both sides, there is no conflict (not emitted by default, though maybe this could be enabled in the API?)
- if the value differs, a conflict should be emitted from the diff stream, e.g. [{key: 'a', value: 'hello', head: hash1}, {key: 'a', value: 'goodbye', head: hash2}]
- if only one side has the key (e.g. the key is new), it should be emitted (defaults to true, but can be disabled in the API?)
- after comparing all keys on both sides, the stream ends
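
A sketch of the comparison step of that streaming mergesort (stream plumbing elided; `emit` stands in for pushing into the diff stream, and both inputs arrive in sorted key order):

```js
// a and b are { key, value, head } rows from the two checkout read streams
function compareRows (a, b) {
  if (a.key < b.key) return emit([a])          // key only exists on side a
  if (a.key > b.key) return emit([b])          // key only exists on side b
  if (a.value !== b.value) return emit([a, b]) // same key, different values: conflict
  // values identical on both sides: no conflict, nothing emitted by default
}
```
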
- merge stream
- this is a writable stream
- each piece of data written to this will be translated into dat merge calls for specific keys
- the merge api is db.merge(versions, value)
- diff stream and merge stream are composable; the idea is you pipe diff stream -> custom resolver stream -> merge stream (sketch below)
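
A composition sketch with a trivial resolver that always picks the first version (through2 is a real module; the `diffStream`/`mergeStream` constructors and row shapes are assumed from the notes above):

```js
var through = require('through2')

// each row from the diff stream is an array of conflicting versions
var resolver = through.obj(function (versions, enc, cb) {
  var winner = versions[0] // a real resolver would actually inspect both sides
  cb(null, {
    versions: versions.map(function (v) { return v.head }),
    value: winner.value
  })
})

// mergeStream turns each resolved row into a db.merge(versions, value) call
diffStream.pipe(resolver).pipe(mergeStream)
```
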
- blob operations
- you can `dat move` or `dat copy` a blob from inside the immutable blob store into the user's fs. `move` removes it from the blob store, `copy` makes a copy (sketch below)
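
A sketch of what those two operations amount to against the blob store (the helper names are made up; `blobs` is any abstract-blob-store):

```js
var fs = require('fs')

function copyBlob (blobs, hash, dest, cb) {
  blobs.createReadStream(hash)
    .pipe(fs.createWriteStream(dest))
    .on('finish', cb) // blob now exists in the user's fs AND the blob store
}

function moveBlob (blobs, hash, dest, cb) {
  copyBlob(blobs, hash, dest, function (err) {
    if (err) return cb(err)
    blobs.remove(hash, cb) // move additionally deletes it from the blob store
  })
}
```
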
- etc
- api is designed for automation/programmatic usage. source code is finicky, but data has structure.
- all i/o is streaming
- designed to work in a browser too (you can swap out the mutable and immutable data stores with browser-friendly alternatives; sketch below)
- we are interested in rabin fingerprinting (rolling checksums) for blob diffing
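
For example, the mutable store is just levelup, so the browser swap is only a backend change (level-js is a real IndexedDB-backed leveldown; old-style levelup options shown, and `process.browser` is set by browserify):

```js
var levelup = require('levelup')

// on the server: leveldb on disk; in the browser: IndexedDB
var db = process.browser
  ? levelup('dat', { db: require('level-js') })
  : levelup('./dat.db')
```
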
- differences from git
- our graphs store less data (data is external)
- you can checkout an individual branch in git, but in dat you have to get the whole metadata graph (ops/blobs are resolved separately)
- git resolves latest -> backwards, which is simpler but means you can't easily resume a failed clone. not an issue for most code repos, but it matters more with larger graphs
- we do it oldest -> forwards, which means you can support real-time live graph updates
- git does not dereference blobs by default, you must use an extension such as git-annex