kellabyte/dawn.md Secret

## dawn.md

      
    Dawn draft

Dawn of a new modern distributed file system with coordination and database-like features.
Abstract

File systems and databases are both used for storing, accessing and processing data. File systems typically have a hierarchical data model, while the relational model is still the most popular for databases. The NoSQL movement has reintroduced diversity in databases, with key/value, column-family, graph and other data models gaining traction. One could argue that file systems are databases, since both solve largely the same set of problems and in many cases are implemented using the same data structures, such as B-trees or log-structured merge trees.
The VMS operating system, which dates back to 1977, provided reusable distributed locking with its Distributed Lock Manager. VMS is one of a few exceptions, though; these days operating systems and file systems have a reduced, host-like role. Most databases persist their data on a file system, but the database is the one with the bells and whistles: it provides the features that applications and services rely on for transactions, coordination and secondary indexing.
In the spirit of Unix, the ability to compose small programs and layer reusable components together is a powerful concept. Have you ever needed a job executed by a shell script to be transactional, or required coordination between commands running on different machines? I think we can drive the Unix philosophy forward to provide these things by creating a file system that offers advanced features seen in databases and that works with existing tools like shell scripts, cat, find, grep and others without modification to these programs. Furthermore, I think many of these features can be made more approachable than they are today by presenting them as file system operations users are already familiar with.
While most of the ideas in Dawn exist in local filesystem technologies such as TxF and ZFS, in distributed filesystems such as Ceph, GlusterFS and even HDFS, and in the infrastructure you can layer on top of them, none applies these concepts in unison or in the way Dawn proposes.
Features


Portable userland filesystem.
Distributed.
Explicit multi-operation transactions modelled into the filesystem itself.
Explicit distributed locking operations modelled into the filesystem itself.
Secondary indexing with rich query support.

Summary

The major innovation in Dawn is that transactions and coordination primitives are exposed as filesystem constructs such as files and directories. This enables any programming language or bash script to benefit from the following capabilities simply by interacting with a filesystem.

Create and commit transactions with MVCC-like snapshotting using multi-operations of reads and writes.
Cross-machine coordination.

Modelling these capabilities in the filesystem itself enables a whole new world of applications and scripts. For example, two bash scripts that were never meant to coordinate with each other across machines could do so with zero code changes if the computer operator configured the scripts to point to directories that represent coordination primitives in the filesystem. No special API calls, kernel calls or ZooKeeper/etcd clients are required.
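A minimal sketch of the pattern, assuming a coordination primitive that behaves like a lock directory: mkdir either creates the directory or fails, so whichever script creates it first holds the lock. Dawn's coordination primitives are not designed yet (section 4 is still TODO), so the acquire_lock and release_lock helpers and the lock name are purely illustrative. LOCK_ROOT is a local temporary directory standing in for a directory on a Dawn mount, so this only shows the idea on a single machine.

```shell
#!/bin/sh
# LOCK_ROOT stands in for a coordination directory on a Dawn mount (hypothetical).
LOCK_ROOT=$(mktemp -d)

# Try to take the named lock. mkdir fails if the directory already exists,
# so only one caller can succeed.
acquire_lock() {
    mkdir "$LOCK_ROOT/$1" 2>/dev/null
}

# Release the named lock by removing its directory.
release_lock() {
    rmdir "$LOCK_ROOT/$1"
}

if acquire_lock nightly-report; then
    echo "lock acquired, doing work"
    # A second attempt, as another script would make, is refused.
    if acquire_lock nightly-report; then
        echo "unexpected: lock taken twice"
    else
        echo "second caller must wait"
    fi
    release_lock nightly-report
fi
```

The scripts themselves only ever call mkdir and rmdir; which directory they point at is configuration.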
Dawn needs to decide on a consistent approach to operations: whether they should be modelled as filesystem directories, reuse POSIX commands, or be exposed as its own commands. The more we reuse common filesystem concepts, the less existing applications and scripts need to change to gain new capabilities. However, bending the filesystem to perform certain operations may feel less natural. For example, exposing errors becomes difficult.
1. Userland filesystem

TODO.
2. Distributed

TODO.
3. Transactions

Dawn exposes transactions as a virtual directory so that any application, whether it is a bash script or a program written in any language, can create transactions and benefit from transactional scope and transactional guarantees without binding to specific APIs. The file path is a query statement. It can contain information like the transaction scope, not just the "file" name. This can be really advantageous: any tool that accepts a path can be pointed at a transaction.
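To make "the file path is a query statement" concrete, here is a small sketch of how a path could be split into a transaction id and the path inside that transaction. It assumes the /snapshots/<transaction_id>/... layout used in this draft; parse_tx_path and the file name 1001.json are hypothetical, and a real implementation would do this inside the filesystem rather than in a script.

```shell
#!/bin/sh
# Split a path under the (assumed) /snapshots directory into the transaction
# id and the path inside that transaction's view of the root.
# Sets tx_id and tx_path; returns 1 for paths outside /snapshots.
parse_tx_path() {
    case $1 in
        /snapshots/?*) ;;
        *) return 1 ;;
    esac
    rest=${1#/snapshots/}            # strip the leading /snapshots/
    tx_id=${rest%%/*}                # everything up to the next slash
    case $rest in
        */*) tx_path=/${rest#*/} ;;  # remainder after the transaction id
        *)   tx_path=/ ;;            # the path names the transaction root
    esac
}

parse_tx_path /snapshots/sometransaction_id/foo/orders/1001.json
echo "transaction: $tx_id"
echo "path: $tx_path"
```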
Creating a transaction

Let's pretend we have this filesystem mounted at /foo. We might be storing website orders in /foo/orders/*.json. What if /.snapshots were a special directory, similar to /proc on Unix systems? This would work on both Unix and Windows.
Creating a new transaction in a bash script could look as follows.
Option 1.
mkdir /snapshots/sometransaction_id
cd /snapshots/sometransaction_id

/snapshots/sometransaction_id now represents a point-in-time snapshot of /, and any program operating in that directory is working within the transaction's isolated scope.
sometransaction_id is a new transaction, and the MVCC backend has created an isolated version of the root filesystem for us. Whatever we read or write in this directory is a representation of / as of the moment the mkdir happened. Potentially we could also support something like subtree transaction scope, for example:
mkdir /foo/.snapshots/sometransaction_id
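The isolation Option 1 describes can be approximated on an ordinary filesystem to see what a script would observe. This is an emulation, not Dawn's mechanism: begin_tx is a hypothetical helper that copies the tree with cp, where Dawn would have its MVCC backend produce the snapshot as a side effect of mkdir. ROOT and SNAPSHOTS are temporary directories standing in for / and /snapshots, and the order file and its fields are made up for illustration.

```shell
#!/bin/sh
BASE=$(mktemp -d)
ROOT=$BASE/root            # stands in for /
SNAPSHOTS=$BASE/snapshots  # stands in for /snapshots
mkdir -p "$ROOT/orders" "$SNAPSHOTS"
echo '{"id": 1001, "status": "new"}' > "$ROOT/orders/1001.json"

# Emulate "mkdir /snapshots/<id>": copy the tree as it is right now.
begin_tx() {
    cp -R "$ROOT" "$SNAPSHOTS/$1"
}

begin_tx sometransaction_id

# A later write to the live tree...
echo '{"id": 1001, "status": "shipped"}' > "$ROOT/orders/1001.json"

# ...is not visible inside the snapshot, which still shows the old status.
cat "$SNAPSHOTS/sometransaction_id/orders/1001.json"
cat "$ROOT/orders/1001.json"
```

A full copy is obviously far more expensive than an MVCC snapshot; the point is only the view each reader gets.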

Option 2.
ln /etc /snapshots/sometransaction_id
cd /snapshots/sometransaction_id

Option 3.
begin /etc /snapshots/sometransaction_id
cd /snapshots/sometransaction_id

Committing a transaction

Option 1.
mv /snapshots/sometransaction_id /commit

Option 2.
commit /snapshots/sometransaction_id

Rolling back a transaction

Option 1.
rm -rf /snapshots/sometransaction_id

Option 2.
rollback /snapshots/sometransaction_id

Since we are just dealing with directories, you can write PowerShell or bash scripts that create and commit transactions without having to change the command-line tools that read and write the data, such as grep.
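A sketch of the whole lifecycle, again emulated on an ordinary filesystem. The begin_tx, commit_tx and rollback_tx helpers are hypothetical stand-ins for the mkdir, mv and rm -rf forms shown as Option 1; the commit here naively replaces the live tree and does no conflict detection, which a real MVCC backend would need. The order file is made up for illustration.

```shell
#!/bin/sh
BASE=$(mktemp -d)
ROOT=$BASE/root            # stands in for /
SNAPSHOTS=$BASE/snapshots  # stands in for /snapshots
mkdir -p "$ROOT/orders" "$SNAPSHOTS"
echo '{"status": "new"}' > "$ROOT/orders/1001.json"

# Hypothetical helpers emulating the Option 1 operations.
begin_tx()    { cp -R "$ROOT" "$SNAPSHOTS/$1"; }
commit_tx()   { rm -rf "${ROOT:?}" && mv "$SNAPSHOTS/$1" "$ROOT"; }  # naive: last commit wins
rollback_tx() { rm -rf "${SNAPSHOTS:?}/$1"; }

# Transaction 1: write inside the transaction directory, then roll back.
begin_tx tx1
( cd "$SNAPSHOTS/tx1" && echo '{"status": "cancelled"}' > orders/1001.json )
rollback_tx tx1
cat "$ROOT/orders/1001.json"   # the live tree never saw the cancelled status

# Transaction 2: unmodified grep and echo do the work, then commit.
begin_tx tx2
( cd "$SNAPSHOTS/tx2" && grep -q '"new"' orders/1001.json \
    && echo '{"status": "shipped"}' > orders/1001.json )
commit_tx tx2
cat "$ROOT/orders/1001.json"   # the committed write is now in the live tree
```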
Risks

There's an interestingly flexible but also dangerous aspect to this. Any transaction can now access any other transaction's data just by referencing that transaction's directory path, so you could write uncommitted data into another transaction with a simple cp/copy. As long as you stay within your own transaction directory, you're safe. What makes this interesting is the finer-grained level of control. In a multi-operation transaction in an RDBMS, I don't know of any database that lets me say I want one single change to be visible to transactions I hand-pick. I can't think of a use case for this yet, but the level of control sounds interesting, and it's explicit, not accidental.
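In shell terms, the cross-transaction write is a single cp between two transaction directories. In this sketch tx_a, tx_b and pricing.conf are hypothetical names, and plain directories under a temporary SNAPSHOTS path stand in for two open Dawn transactions.

```shell
#!/bin/sh
SNAPSHOTS=$(mktemp -d)                     # stands in for /snapshots
mkdir "$SNAPSHOTS/tx_a" "$SNAPSHOTS/tx_b"  # two open transactions (plain dirs here)

# tx_a writes a change it has not committed.
echo 'discount=10' > "$SNAPSHOTS/tx_a/pricing.conf"

# Hand-pick tx_b as the one other transaction that gets to see it.
cp "$SNAPSHOTS/tx_a/pricing.conf" "$SNAPSHOTS/tx_b/pricing.conf"

cat "$SNAPSHOTS/tx_b/pricing.conf"
```

Whether Dawn should allow, restrict or audit this kind of write is exactly the open question this section raises.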
4. Coordination primitives

TODO.
5. Secondary indexing and querying

TODO. MIME types tell the filesystem how to index files. Hello WinFS? Can we do better?