Dawn of a new modern distributed file system with coordination and database-like features.
File systems and databases are both used for storing, accessing and processing data. File systems typically have a hierarchical data model, while the relational model is still the most popular for databases. The NoSQL movement has reintroduced diversity in databases, with key/value, column family, graph and other data models gaining traction. One could argue that file systems are databases, since both generally have the same set of problems to solve and in many cases are implemented using the same data structures, such as B-trees or log-structured merge-trees.
In 1977, the VMS operating system provided reusable distributed locking with its Distributed Lock Manager. There are a few exceptions like VMS, but these days operating systems and file systems have a reduced, host-like role. Most databases persist their data on a file system, but it is the database that provides the features applications and services rely on, such as transactions, coordination and secondary indexing.
In the spirit of Unix, the ability to compose small programs and layer reusable components together is a powerful concept. Have you ever needed a job executed by a shell script to be transactional, or required coordination between commands running on different machines? I think we can drive the Unix philosophy forward by creating a file system that offers advanced features seen in databases and that works with existing tools like shell scripts, cat, find, grep and others without modification to these programs. Furthermore, I think many of these features can be made more approachable than they are today by presenting them as file system operations users are already familiar with.
While most of the ideas from Dawn exist in local filesystems such as TxF and ZFS, or distributed filesystems like Ceph, GlusterFS or even HDFS and the infrastructure you can layer on top, none apply these concepts in unison or in the way Dawn proposes.
- Portable userland filesystem.
- Distributed.
- Explicit multi-operation transactions modelled into the filesystem itself.
- Explicit distributed locking operations modelled into the filesystem itself.
- Secondary indexing with rich query support.
The major innovation in Dawn is that transactions and coordination primitives are exposed as filesystem constructs such as files and directories. This enables any programming language or any bash script to benefit from the following capabilities simply by interacting with a filesystem.
- Create and commit transactions with MVCC-like snapshotting using multi-operations of reads and writes.
- Cross-machine coordination.
Modelling these capabilities in the filesystem itself enables a whole new world of applications and scripts. For example, two bash scripts that were never meant to coordinate with each other across machines could do so with zero code changes if the operator configured them to point to directories that represent coordination primitives in the filesystem. No special API calls, kernel calls or ZooKeeper/etcd clients are required.
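As a rough illustration of the idea, the sketch below uses a directory as a mutex. It relies only on the fact that mkdir fails when the directory already exists, which is atomic on a local POSIX filesystem. That a Dawn mount would extend the same guarantee across machines is an assumption about the proposal, not something that exists today. LOCK_DIR, LOCK_ATTEMPTS and with_lock are hypothetical names, and the demo points at a temporary directory so it can run anywhere.

```shell
#!/usr/bin/env bash
# Sketch only: a directory used as a lock. mkdir is atomic on a local POSIX
# filesystem; cross-machine behaviour on a Dawn mount is an assumption.

# Hypothetical lock location; an operator would point this at a Dawn path.
LOCK_DIR="${LOCK_DIR:-$(mktemp -d)/job.lock}"

# with_lock CMD...: run CMD while holding the lock, giving up after
# LOCK_ATTEMPTS failed attempts (default 5, one second apart).
with_lock() {
    local attempts="${LOCK_ATTEMPTS:-5}"
    until mkdir "$LOCK_DIR" 2>/dev/null; do
        attempts=$((attempts - 1))
        if [ "$attempts" -le 0 ]; then
            echo "could not acquire $LOCK_DIR" >&2
            return 1
        fi
        sleep 1
    done
    "$@"
    local status=$?
    rmdir "$LOCK_DIR"   # release the lock
    return "$status"
}

with_lock echo "critical section"
```

Two unrelated scripts that are both given the same LOCK_DIR would then take turns, without either of them linking against a coordination client.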
Dawn needs to decide on a consistent approach to operations: whether they should be modelled as filesystem directories, reuse POSIX commands or be exposed through its own commands. The more we reuse common filesystem concepts, the less existing applications and scripts need to change to gain new capabilities. However, bending the filesystem to perform certain operations may feel less natural. For example, exposing errors becomes difficult.
TODO.
TODO.
Dawn exposes transactions as a virtual directory so that any application, whether it is a bash script or a program written in any language, can create transactions and benefit from transactional scope and guarantees without binding to specific APIs. The file path is a query statement: it can contain information such as the transaction scope, not just the "file" name. This can be really advantageous.
Let's pretend we have this filesystem mounted to /foo. We might be storing website orders in /foo/orders/*.json. What if /.snapshots was a special directory, similar to /proc on Unix systems? This would work on both Unix and Windows.
Creating a new transaction in a bash script could work as follows.
Option 1.
mkdir /snapshots/sometransaction_id
cd /snapshots/sometransaction_id
/snapshots/sometransaction_id now represents a point-in-time snapshot of /, and any program operating in that directory is working within the transaction's isolated scope.
sometransaction_id is a new transaction, and the MVCC backend has created an isolated version of the root filesystem for us. Whatever we read or write in this directory is a representation of / from the time the mkdir happened. Potentially we could also support something like subtree transaction scope.
mkdir /foo/.snapshots/sometransaction_id
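To make the intended isolation concrete, here is a sketch that imitates Option 1 on an ordinary filesystem by copying the tree into the transaction directory. The copy merely stands in for the MVCC snapshot that Dawn would create on mkdir. The begin_txn helper, the temporary root and the order file are hypothetical names used only for illustration.

```shell
#!/usr/bin/env bash
# Emulation only: cp stands in for the snapshot Dawn would create on mkdir.

ROOT="$(mktemp -d)"                 # stand-in for the mounted filesystem
mkdir -p "$ROOT/live/orders" "$ROOT/snapshots"
echo '{"id": 1, "status": "new"}' > "$ROOT/live/orders/1.json"

# begin_txn ID: create the transaction directory and fill it with a
# point-in-time copy of the live tree (hypothetical helper).
begin_txn() {
    mkdir "$ROOT/snapshots/$1" && cp -R "$ROOT/live/." "$ROOT/snapshots/$1/"
}

begin_txn sometransaction_id
cd "$ROOT/snapshots/sometransaction_id" || exit 1

# Writes inside the transaction directory do not touch the live tree.
echo '{"id": 1, "status": "shipped"}' > orders/1.json

cat orders/1.json                   # the transaction sees its own write
cat "$ROOT/live/orders/1.json"      # the live tree still has the old content
```

A real implementation would not copy anything up front; the point here is only what a script would observe from inside and outside the transaction directory.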
Option 2.
ln /etc /snapshots/sometransaction_id
cd /snapshots/sometransaction_id
Option 3.
begin /etc /snapshots/sometransaction_id
cd /snapshots/sometransaction_id
To commit the transaction:
Option 1.
mv /snapshots/sometransaction_id /commit
Option 2.
commit /snapshots/sometransaction_id
To roll back the transaction:
Option 1.
rm -rf /snapshots/sometransaction_id
Option 2.
rollback /snapshots/sometransaction_id
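Putting begin, commit and rollback together, a script could treat the transaction directory as its working copy and decide at the end which way to go. The sketch below emulates this on an ordinary filesystem: commit is imitated by swapping the transaction directory in for the live tree with mv, and rollback is the rm -rf from Option 1. The helper names and directory layout are assumptions, and the swap is not atomic in the way a real Dawn commit would need to be.

```shell
#!/usr/bin/env bash
# Emulation only: mv and rm stand in for Dawn's commit and rollback.

ROOT="$(mktemp -d)"                 # stand-in for the mounted filesystem
mkdir -p "$ROOT/live" "$ROOT/snapshots"
echo "stock=10" > "$ROOT/live/inventory"

begin_txn()    { mkdir "$ROOT/snapshots/$1" && cp -R "$ROOT/live/." "$ROOT/snapshots/$1/"; }
rollback_txn() { rm -rf "$ROOT/snapshots/$1"; }
commit_txn() {
    # Swap the transaction directory in as the new live tree.
    mv "$ROOT/live" "$ROOT/live.old" &&
    mv "$ROOT/snapshots/$1" "$ROOT/live" &&
    rm -rf "$ROOT/live.old"
}

# run_in_txn ID CMD...: run CMD inside the transaction directory,
# commit if it succeeds, roll back if it fails.
run_in_txn() {
    local id="$1"; shift
    begin_txn "$id" || return 1
    if (cd "$ROOT/snapshots/$id" && "$@"); then
        commit_txn "$id"
    else
        rollback_txn "$id"
        return 1
    fi
}

run_in_txn t1 sh -c 'echo "stock=9" > inventory'                  # succeeds, commits
run_in_txn t2 sh -c 'echo "stock=0" > inventory; exit 1' || true  # fails, rolls back
cat "$ROOT/live/inventory"
```

The commands run inside the transaction are ordinary programs that know nothing about transactions; only the wrapper decides between commit and rollback based on the exit status.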
Since we are just dealing with directories, you can write PowerShell or bash scripts that create and commit transactions without having to change the command-line tools that read and write data, such as grep.
There's an interestingly flexible but also dangerous aspect to this. Any transaction can access any other transaction's data just by referencing that transaction's directory path, so you could write uncommitted data into another transaction with a simple cp/copy. As long as you stay within your own transaction directory, you're safe. This is interesting because it gives a finer-grained level of control. In a multi-operation transaction in an RDBMS, I don't know of any database that lets me declare that a single change should be visible to transactions I hand-pick. I can't think of a use case for this yet, but the level of control sounds interesting, and it's explicit, not accidental.
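A sketch of what that hand-picked visibility might look like, emulated with plain directories rather than a real Dawn mount. The transaction names txn_a and txn_b and the file draft.txt are made up for illustration.

```shell
#!/usr/bin/env bash
# Emulation only: two plain directories stand in for two open transactions.

SNAPSHOTS="$(mktemp -d)"            # stand-in for /snapshots
mkdir "$SNAPSHOTS/txn_a" "$SNAPSHOTS/txn_b"

# txn_a makes an uncommitted change.
echo "uncommitted change from A" > "$SNAPSHOTS/txn_a/draft.txt"

# Explicitly hand that single change to txn_b, and only to txn_b.
cp "$SNAPSHOTS/txn_a/draft.txt" "$SNAPSHOTS/txn_b/draft.txt"

cat "$SNAPSHOTS/txn_b/draft.txt"
```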
TODO.
TODO. MIME types tell the filesystem how to index files. Hello WinFS? Can we do better?
I like the ideas expressed for transactions. Since transactions represent the tree at a point in time, wouldn't cp be a more appropriate representation of a transaction, a la:
What happens when a transaction rolls back but there is a process holding the transaction's directory inode open, such as its CWD? Does the snapshot stay accessible to the process that's holding the inode, or should every operation error out?
What about merging? Two processes attempting to modify the same file in separate transactions? First commit wins? The other errors out? Because of the semantics of commit, I think it makes more sense to create a separate command for that. Rollback can probably be implemented as a simple rm -r.
Good ideas! Hope to see this developed further.