- dat data model
- commit graph (mutable)
- Merkle DAG
- integrity check a root hash, and know all children are ok too (efficient!)
- deduplication of common children
- can content address all the nodes in the data structure
- can reveal a single path from a node to a root (with adjacent hashes) and prove the object must be in the DAG
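
A minimal sketch of the content addressing idea (not dat's actual node format): a node's hash covers its own value plus the hashes it links to, so verifying a root hash transitively verifies everything beneath it.

```js
// minimal sketch, not dat's actual node format
var crypto = require('crypto')

function hashNode (node) {
  var h = crypto.createHash('sha256')
  h.update(JSON.stringify(node.value))
  node.links.forEach(function (link) { h.update(link) }) // hashes of linked nodes
  return h.digest('hex')
}

var child = { value: { change: 1 }, links: [] }
var root = { value: { change: 2 }, links: [hashNode(child)] }
console.log(hashNode(root)) // changing child changes this hash too
```
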
- operation database (mutable)
- protobufs in leveldb
- protobufs support efficient encodings and backwards compat
- leveldb is easily swappable for other embedded data stores
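
A sketch of the protobufs-in-leveldb idea using the protocol-buffers and level modules; the Op schema here is made up for illustration, not dat's real operation format.

```js
var protobuf = require('protocol-buffers')
var level = require('level')

// made-up operation schema, for illustration only
var messages = protobuf('message Op { required string key = 1; optional string value = 2; }')

var db = level('./operations', { valueEncoding: 'binary' })

db.put('op!1', messages.Op.encode({ key: 'hello', value: 'world' }), function (err) {
  if (err) throw err
  db.get('op!1', function (err, buf) {
    if (err) throw err
    console.log(messages.Op.decode(buf)) // { key: 'hello', value: 'world' }
  })
})
```
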
- blob store (immutable)
- stored on disk or any abstract-blob-store compatible thing
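
A sketch of the swappable blob store: anything implementing the abstract-blob-store interface works, e.g. fs-blob-store on disk (s3-blob-store and friends expose the same createWriteStream/createReadStream API).

```js
var blobs = require('fs-blob-store')('./blobs')

var ws = blobs.createWriteStream({ key: 'photo.png' }, function (err, metadata) {
  if (err) throw err
  blobs.createReadStream({ key: metadata.key }).pipe(process.stdout) // read it back
})

ws.end('binary data would go here')
```
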
- replication
- get duplex stream to remote
- dat CLI client will default to http but anything will do
- establish multiplexed transport
- length-prefixed protocol buffers
- modules: length-prefixed-stream, multiplex, protocol-buffers
- our protobuf lib is faster than json in most cases :)
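
A sketch of how those modules could be layered: one duplex connection, multiplexed into channels, with length-prefixed protobuf messages on each channel. The TCP socket and Intent schema here are placeholders for illustration.

```js
var net = require('net')
var multiplex = require('multiplex')
var lps = require('length-prefixed-stream')
var protobuf = require('protocol-buffers')

// made-up handshake message, just to have something to frame
var messages = protobuf('message Intent { required string mode = 1; }')

var socket = net.connect(6442, 'example.com') // any duplex stream will do (http, tcp, ...)
var plex = multiplex()
socket.pipe(plex).pipe(socket)                // multiplexed transport over the connection

var graph = plex.createStream('graph')        // one channel per phase/purpose
var encode = lps.encode()
encode.pipe(graph)                            // length-prefix each protobuf message
encode.write(messages.Intent.encode({ mode: 'pull' }))
```
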
- graph exchange phase
- optimized to minimize roundtrips
- goal is to sync hashes (graph nodes)
- state your intent (pushing/pulling/both)
- receive other sides intent
- send your heads (one head per owner)
- each head has an owner id and local change #
- whoever is receiving data is 'passive'
- for each head (owner) see if you need newer data
- send back what you need
- other end starts streaming newer data for those owners
- if a node references a new owner, the pusher first sends the new head
- the puller can then decide if they want to ask the pusher to resolve that owner's log
- finally, the graph will be consistent for the requested heads
- optimization: store node depth on insert, use to send most important (shallower) nodes first
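
Hypothetical message shapes for the exchange (assumptions for illustration, not dat's actual wire format), plus the check the passive side would run per head:

```js
var protobuf = require('protocol-buffers')

var messages = protobuf([
  'message Intent { required string mode = 1; }',   // "push", "pull" or "both"
  'message Head { required string owner = 1; required uint64 change = 2; required bytes hash = 3; }',
  'message Want { repeated string owners = 1; }'    // owners the passive side needs newer data for
].join('\n'))

// passive side: for each remote head, do we need newer data for that owner?
function needsNewerData (remoteHead, localChangeByOwner) {
  var local = localChangeByOwner[remoteHead.owner] || 0
  return remoteHead.change > local
}

console.log(needsNewerData({ owner: 'alice', change: 7 }, { alice: 5 })) // true
```
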
- operation fetch
- can be done in parallel with the graph exchange phase
- read commit data in a graph node
- commits are a group of atomic operations
- ask the remote for the operation k/v pairs in the commit
- uses the same multiplexed connection, with a channel per commit
- on the puller the operations are stored in the local mutable db
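
A sketch of the puller side, assuming one channel per commit on the multiplexed connection and a leveldb batch write for the decoded operations (channel naming and the Op schema are assumptions):

```js
var lps = require('length-prefixed-stream')

// plex: the multiplexed connection, db: the local mutable leveldb, messages: compiled protobufs
function fetchOperations (plex, db, messages, commitHash) {
  var channel = plex.createStream('commit/' + commitHash) // channel name is an assumption
  var decode = lps.decode()
  var batch = []

  channel.pipe(decode)
    .on('data', function (buf) {
      var op = messages.Op.decode(buf)
      batch.push({ type: 'put', key: op.key, value: op.value })
    })
    .on('end', function () {
      db.batch(batch, function (err) {
        if (err) throw err
        console.log('stored %d operations for commit %s', batch.length, commitHash)
      })
    })
}
```
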
- blob fetch
- blobs are just a special type of operation where the data is external (abstract-blob-store)
- if an operation references a blob then it will cause a blob fetch to kick off
- default blob exchange is over the same multiplexed connection but operations can define custom blob sources
- custom sources examples: bittorrent, s3, ftp, ipfs
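
A sketch of a custom blob source: because sources speak the abstract-blob-store interface, swapping in s3 (or ftp, ipfs, ...) is just swapping the store module. The bucket and key names here are made up.

```js
var aws = require('aws-sdk')
var s3blobs = require('s3-blob-store')
var fsblobs = require('fs-blob-store')

var remote = s3blobs({ client: new aws.S3(), bucket: 'my-dat-blobs' }) // custom blob source
var local = fsblobs('./blobs')                                         // local blob store

// copy one blob from the custom source into the local store
function fetchBlob (key, cb) {
  remote.createReadStream({ key: key })
    .pipe(local.createWriteStream({ key: key }, cb))
}

fetchBlob('photo.png', function (err) {
  if (err) throw err
  console.log('blob fetched')
})
```
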
- local features
- rollback/checkout
- accessing the dat from some point of reference (aka time travel)
- you can 'checkout' a dat to a hash in the graph, and all operations are done relative to that hash
- checkouts are just a hash, so you can do: dat clone foo.com/dat/3272y398h2934h9osdfh
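
A hypothetical sketch of checkout usage; the module and method names here are assumptions for illustration, not the confirmed dat API:

```js
// hypothetical usage: checkout(hash) returns a handle whose reads are
// all relative to that point in the graph
var dat = require('dat')

dat('./mydat', function (err, db) {
  if (err) throw err
  var old = db.checkout('3272y398h2934h9osdfh') // any hash in the graph
  old.createReadStream().on('data', function (row) {
    console.log(row) // rows as they looked at that hash
  })
})
```
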
- branches (owner logs can have multiple heads)
- our current design is to replicate entire graphs so you can't do e.g. "per-branch clone"
- but, graphs are just metadata so we think this is ok (we may be wrong)
- e.g. you can just get the graph and only resolve branch ops/blobs
- merging
- we want to support custom merge strategies for 3-way data merges (automated and manual)
- dat will not merge your data for you (unless it is fast-forwardable), it is up to you to pick your strategy
- instead of halting on conflicts we do implicit branches. we anticipate and welcome conflicts
- streaming merge algorithm (naive implementation)
- diff stream
- this is a readable stream
- first you do a diff operation against two hashes in the dat graph
- this creates 2 dat read streams, each with a checkout at their relative hash
- keys come out of leveldb in lexicographic sorted order, so technically this is a streaming mergesort
- for example, assume both streams will first emit a key 'a'
- if the value of 'a' is the same on both sides, there is no conflict (does not emit, maybe can be enabled in API?)
- if the value differs, a conflict should be emitted from the diff stream e.g. [{key: 'a', value: 'hello', head: hash1}, {key: 'a', value: 'goodbye', head: hash2}]
- if only one side has the key (e.g. the key is new), it should be emitted (defaults to true, but can be disabled in API?)
- after comparing all keys in both sides, the stream ends
- merge stream
- this is a writable stream
- each piece of data written to this will be translated into dat merge calls for specific keys
- the merge api is db.merge(versions, value)
- diff stream and merge stream are composable, the idea is you pipe diff stream -> custom resolver stream -> merge stream
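
A sketch of that composition with through2 as the custom resolver stream; the conflict shape and the db.merge(versions, value) call follow the notes above, but the callback argument is an assumption.

```js
var through2 = require('through2')

// custom resolver: naive "first candidate wins" strategy
function resolver () {
  return through2.obj(function (conflict, enc, cb) {
    var versions = conflict.map(function (c) { return c.head })
    cb(null, { versions: versions, value: conflict[0].value })
  })
}

// turns resolved rows into db.merge(versions, value) calls (callback assumed)
function mergeStream (db) {
  return through2.obj(function (resolved, enc, cb) {
    db.merge(resolved.versions, resolved.value, cb)
  })
}

// diffStream(hash1, hash2) is assumed to exist per the notes above:
// diffStream(hash1, hash2).pipe(resolver()).pipe(mergeStream(db))
```
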
- blob operations
- you can `dat move` or `dat copy` a blob from inside the immutable blob store into the user's fs. `move` removes it from the blob store, `copy` makes a copy.
- you can rollback/checkout, etc
- api is designed for automation/programmatic usage. source code is finicky, but data has structure.
- all i/o is streaming
- designed to work in a browser too (you can swap out the mutable and immutable data stores with browser-friendly alternatives)
- we are interested in rabin fingerprinting (rolling checksums) for blob diffing
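
A toy content-defined chunker with a simple rolling byte sum (not a real Rabin fingerprint) to show why this helps blob diffing: chunk boundaries depend on the content itself, so a local edit only changes nearby chunks and unchanged chunks keep their hashes.

```js
var crypto = require('crypto')

function chunk (buf, windowSize, mask) {
  var hashes = []
  var start = 0
  var sum = 0
  for (var i = 0; i < buf.length; i++) {
    sum += buf[i]
    if (i - start >= windowSize) sum -= buf[i - windowSize] // slide the window
    if ((sum & mask) === mask || i === buf.length - 1) {    // content-defined boundary
      var piece = buf.slice(start, i + 1)
      hashes.push(crypto.createHash('sha256').update(piece).digest('hex'))
      start = i + 1
      sum = 0
    }
  }
  return hashes
}

console.log(chunk(Buffer.from('pretend this is a big blob of data'), 16, 0x0f))
```
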
- differences from git
- our graphs store less data (data is external)
- you can checkout an individual branch in git, but in dat you have to get the whole metadata graph (you have to resolve the ops/blobs in dat separately)
- git resolves latest -> backwards, which is simpler but means you can't resume a failed clone easily. not an issue for most code repos, but is more important with larger graphs
- we do it oldest -> forwards, which means you can support real-time live graph updates
- git does not dereference blobs by default, you must use an extension such as git-annex