@max-mapper
Last active August 29, 2015 14:18
dat beta design notes
  • dat data model
    • commit graph (mutable)
      • Merkle DAG
      • integrity-check a root hash and know all of its children are ok too (efficient!)
      • deduplication of common children
      • can content address all the nodes in the data structure
      • can reveal a single path from a node to the root (with adjacent hashes) and prove the object must be in the dag
    • operation database (mutable)
      • protobufs in leveldb (see the sketch after this list)
      • protobufs support efficient encodings and backwards compatibility
      • leveldb is easily swappable for other embedded data stores
    • blob store (immutable)
      • stored on disk or any abstract-blob-store compatible thing
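
For illustration, here is a minimal sketch of the data model plumbing: a graph node protobuf-encoded, content addressed by its hash, and stored in leveldb. The `Node` schema here is made up for the sketch and is not dat's actual wire format.

```js
var protobuf = require('protocol-buffers')
var level = require('level')
var crypto = require('crypto')

// illustrative schema only -- dat's real message definitions differ
var messages = protobuf([
  'message Node {',
  '  repeated string links = 1;', // hashes of parent nodes (the dag edges)
  '  required bytes change = 2;', // the commit payload
  '}'
].join('\n'))

var db = level('./graph.db')

function putNode (node, cb) {
  var buf = messages.Node.encode(node)
  // the node's address is the hash of its encoded bytes, so verifying a
  // root hash transitively verifies every node it links to
  var hash = crypto.createHash('sha256').update(buf).digest('hex')
  db.put(hash, buf, { valueEncoding: 'binary' }, function (err) {
    cb(err, hash)
  })
}
```
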
  • replication
    • get duplex stream to remote
      • dat CLI client will default to http but anything will do
    • establish multiplexed transport
      • length-prefixed protocol buffers
      • modules: length-prefixed-stream, multiplex, protocol-buffers (wired together in the sketch after this list)
      • our protobuf lib is faster than json in most cases :)
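
A sketch of the transport layering, using a tcp socket for concreteness (the CLI defaults to http, but any duplex stream works); the channel name and message payload here are placeholders:

```js
var net = require('net')
var multiplex = require('multiplex')
var lps = require('length-prefixed-stream')

var socket = net.connect(8080) // stand-in remote; any duplex stream will do
var plex = multiplex()
plex.pipe(socket).pipe(plex) // multiplexer rides on the duplex transport

// one channel for graph exchange; more channels can be opened per commit
var graph = plex.createStream('graph')
var encode = lps.encode()
encode.pipe(graph)

// each write becomes one length-prefixed frame; in dat the frame body
// would be a protobuf-encoded message
encode.write(Buffer.from('hello'))
```
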
    • graph exchange phase
      • optimized to minimize roundtrips
      • goal is to sync hashes (graph nodes)
      • state your intent (pushing/pulling/both)
      • receive the other side's intent
      • send your heads (one head per owner)
      • each head has an owner id and local change #
      • whoever is receiving data is 'passive'
      • for each head (owner), check whether you need newer data (see the sketch after this list)
      • send back what you need
      • other end starts streaming newer data for those owners
      • if a node references a new owner, the pusher first sends the new head
      • the puller can then decide if they want to ask the pusher to resolve that owner's log
      • finally, the graph will be consistent for the requested heads
      • optimization: store node depth on insert, use to send most important (shallower) nodes first
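
The head comparison at the core of the exchange can be sketched as below. The `{ owner, change }` head shape is an assumption for illustration, not the exact protobuf message:

```js
// given the remote's heads (one per owner), figure out which owners we
// need newer data for: unknown owners, or owners with a higher change #
function wantedHeads (localHeads, remoteHeads) {
  return remoteHeads.filter(function (remote) {
    var local = localHeads[remote.owner]
    return !local || remote.change > local.change
  })
}

// e.g. a puller that is behind on alice and has never seen bob:
var need = wantedHeads(
  { alice: { owner: 'alice', change: 7 } },
  [{ owner: 'alice', change: 9 }, { owner: 'bob', change: 2 }]
)
// -> both heads are wanted; this list is sent back, and the other end
// starts streaming newer graph nodes for just those owners
```
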
    • operation fetch
      • can be done in parallel with the graph exchange phase
      • read commit data in a graph node
      • each commit is a group of atomic operations
      • ask the remote for the operation k/v pairs in the commit (see the sketch after this list)
      • uses the same multiplexed connection, with a channel per commit
      • on the puller the operations are stored in the local mutable db
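
A sketch of the fetch step, assuming a channel-per-commit naming convention ('ops/' + hash) and JSON-framed operations for brevity (the real protocol uses protobufs):

```js
var multiplex = require('multiplex')
var lps = require('length-prefixed-stream')
var level = require('level')

var db = level('./ops.db') // the local mutable db
var plex = multiplex() // assume plex is already piped to the remote

function fetchCommit (commitHash) {
  // a dedicated channel per commit on the shared multiplexed connection
  var channel = plex.createStream('ops/' + commitHash)
  channel.pipe(lps.decode()).on('data', function (frame) {
    var op = JSON.parse(frame) // stand-in encoding; dat uses protobufs
    db.put(op.key, op.value)
  })
}
```
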
    • blob fetch
      • blobs are just a special type of operation where the data is external (abstract-blob-store; see the sketch after this list)
      • if an operation references a blob then it will cause a blob fetch to kick off
      • default blob exchange is over the same multiplexed connection but operations can define custom blob sources
      • custom sources examples: bittorrent, s3, ftp, ipfs
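
A sketch of a blob fetch landing in an abstract-blob-store; fs-blob-store stands in here, but any compatible store (s3, bittorrent, etc.) is a drop-in replacement:

```js
var blobs = require('fs-blob-store')('./blobs')

// pipe whatever source the operation points at (the multiplexed
// connection by default, or a custom source) into the blob store
function fetchBlob (sourceStream, key, cb) {
  var ws = blobs.createWriteStream({ key: key }, cb)
  sourceStream.pipe(ws)
}
```
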
  • local features
    • rollback/checkout
      • accessing the dat from some point of reference (aka time travel)
      • you can 'checkout' a dat to a hash in the graph, and all operations are then done relative to that hash (sketched after this list)
      • checkouts are just a hash, so you can do: dat clone foo.com/dat/3272y398h2934h9osdfh
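
A sketch of checkout in code, assuming a dat-core-style api (the exact method names are not confirmed by these notes):

```js
var dat = require('dat-core') // api shape assumed for illustration
var db = dat('./data')

// a checkout is just a hash; reads through it see the dat as of that node
var past = db.checkout('3272y398h2934h9osdfh')
past.createReadStream().on('data', console.log)
```
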
    • branches (owner logs can have multiple heads)
      • our current design is to replicate entire graphs so you can't do e.g. "per-branch clone"
        • but, graphs are just metadata so we think this is ok (we may be wrong)
        • e.g. you can just get the graph and only resolve branch ops/blobs
    • merging
      • we want to support custom merge strategies for 3-way data merges (automated and manual)
      • dat will not merge your data for you (unless it is fast-forwardable); it is up to you to pick your strategy
      • instead of halting on conflicts we create implicit branches; we anticipate and welcome conflicts
      • streaming merge algorithm (naive implementation)
        • diff stream
          • this is a readable stream
          • first you do a diff operation against two hashes in the dat graph
          • this creates 2 dat read streams, each checked out at its respective hash
          • keys come out of leveldb in lexicographic sorted order, so technically this is a streaming mergesort
          • for example, assume both streams will first emit a key 'a'
          • if the value of 'a' is the same on both sides, there is no conflict (not emitted by default, though maybe this could be enabled in the API?)
          • if the value differs, a conflict should be emitted from the diff stream, e.g. [{key: 'a', value: 'hello', head: hash1}, {key: 'a', value: 'goodbye', head: hash2}]
          • if only one side has the key (e.g. the key is new), it should be emitted (defaults to true, but can be disabled in the API?)
          • after comparing all keys in both sides, the stream ends
        • merge stream
          • this is a writable stream
          • each piece of data written to this will be translated into dat merge calls for specific keys
          • the merge api is db.merge(versions, value)
          • diff stream and merge stream are composable; the idea is you pipe diff stream -> custom resolver stream -> merge stream (see the sketch below)
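
A sketch of that pipeline with through2, using a naive resolver that always keeps the first side's value; the diff stream source is elided, and the stand-in `fakeDb` just logs the resulting db.merge(versions, value) calls:

```js
var through = require('through2')

// custom resolver: naive strategy that keeps the first side's value
var resolver = through.obj(function (conflict, enc, cb) {
  cb(null, {
    versions: conflict.map(function (side) { return side.head }),
    value: conflict[0].value
  })
})

// merge stream: each resolution becomes a db.merge(versions, value) call
function mergeStream (db) {
  return through.obj(function (res, enc, cb) {
    db.merge(res.versions, res.value)
    cb()
  })
}

// stand-in db so the sketch runs; the real call is dat's db.merge
var fakeDb = {
  merge: function (versions, value) { console.log('merge', versions, value) }
}

// real usage: diffStream(hash1, hash2).pipe(resolver).pipe(mergeStream(db))
resolver.pipe(mergeStream(fakeDb))
resolver.write([
  { key: 'a', value: 'hello', head: 'hash1' },
  { key: 'a', value: 'goodbye', head: 'hash2' }
])
```
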
    • blob operations
      • you can dat move or dat copy a blob from inside the immutable blob store into the user's fs. move removes it from the blob store, copy makes a copy (see the sketch below).
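
A sketch of the two blob operations against an abstract-blob-store; fs-blob-store stands in for the store, and the function names are made up:

```js
var fs = require('fs')
var blobs = require('fs-blob-store')('./blobs')

// 'dat copy': stream the blob out, leaving it in the store
function copyOut (key, dest, cb) {
  blobs.createReadStream({ key: key })
    .pipe(fs.createWriteStream(dest))
    .on('finish', cb)
}

// 'dat move': copy out, then remove the blob from the store
function moveOut (key, dest, cb) {
  copyOut(key, dest, function (err) {
    if (err) return cb(err)
    blobs.remove({ key: key }, cb)
  })
}
```
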
  • etc
    • api is designed for automation/programmatic usage. source code is finicky, but data has structure.
    • all i/o is streaming
    • designed to work in a browser too (you can swap out the mutable and immutable data stores with browser-friendly alternatives)
    • we are interested in rabin fingerprinting (rolling checksums) for blob diffing (a toy sketch follows this list)
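
To illustrate why rolling checksums help with blob diffing, here is a toy rsync-style rolling sum (not rabin's polynomial fingerprint, but the same sliding-window idea): the checksum of a window updates in O(1) as it slides, so chunk boundaries can be content-defined and survive inserts/deletes.

```js
function RollingSum (size) {
  this.size = size
  this.a = 0
  this.b = 0
}

// prime with the first window of bytes
RollingSum.prototype.init = function (window) {
  for (var i = 0; i < window.length; i++) {
    this.a += window[i]
    this.b += this.a
  }
}

// slide by one byte: constant-time update instead of rehashing the window
RollingSum.prototype.roll = function (outByte, inByte) {
  this.a += inByte - outByte
  this.b += this.a - this.size * outByte
}

// e.g. declare a chunk boundary whenever (sum.b & 0xfff) === 0, giving
// ~4kb average chunks whose boundaries move with the content
```
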
  • differences from git
    • our graphs store less data (data is external)
    • you can check out an individual branch in git, but in dat you have to get the whole metadata graph (you resolve the ops/blobs in dat separately)
    • git resolves latest -> backwards, which is simpler but means you can't easily resume a failed clone. not an issue for most code repos, but it matters more with larger graphs
    • we do it oldest -> forwards, which means you can support real-time live graph updates
    • git does not store blobs externally (by reference) by default; you must use an extension such as git-annex