@kellabyte
Last active December 29, 2020 02:58
Dawn draft

Dawn of a new modern distributed file system with coordination and database-like features.

Abstract

File systems and databases are both used for storing, accessing and processing data. File systems typically have a hierarchical data model, while the relational model is still the most popular for databases. The NoSQL movement has re-introduced diversity in databases, with key/value, column family, graph and other data models gaining traction. One could argue that file systems are databases, since both generally have the same set of problems to solve and in many cases are implemented using the same data structures, such as B-Trees or Log-Structured Merge Trees.

In 1977, the VMS operating system provided reusable distributed locking with its Distributed Lock Manager. There are a few exceptions like VMS, but these days operating systems and file systems play a reduced, host-like role. Most databases keep their persisted data on a file system, but the database is the one with the bells and whistles that applications and services rely on for transactions, coordination and secondary indexing.

In the spirit of Unix, the ability to compose small programs and layer reusable components together is a powerful concept. Have you ever needed a job executed by a shell script to be transactional, or required coordination between commands running on different machines? I think we can drive the Unix philosophy forward to provide these things by creating a file system that offers advanced features seen in databases and that works with existing tools like shell scripts, cat, find, grep and others without modification to these programs. Furthermore, I think many of these features can be made more approachable than they are today by presenting them as file system operations users are already familiar with.

While most of the ideas from Dawn exist in local filesystems such as TxF, ZFS or distributed filesystems like Ceph, GlusterFS or even HDFS and the infrastructure you can layer on top, none apply these concepts in unison or in the way Dawn proposes.

Features

  1. Portable userland filesystem.
  2. Distributed.
  3. Explicit multi-operation transactions modelled into the filesystem itself.
  4. Explicit distributed locking operations modelled into the filesystem itself.
  5. Secondary indexing with rich query support.

Summary

The major innovation in Dawn is that transactions and coordination primitives are exposed as filesystem constructs such as files and directories. This enables any programming language or any bash script to benefit from the following capabilities simply by interacting with a filesystem.

  • Create and commit transactions with MVCC-like snapshotting using multi-operations of reads and writes.
  • Cross-machine coordination.

Modelling these capabilities in the filesystem itself enables a whole new world of applications and scripts. For example, two bash scripts that were never meant to coordinate with each other across machines could do so with zero code changes if the computer operator configured the scripts to point to directories that represent coordination primitives in the filesystem. No special API calls, kernel calls or Zookeeper/etcd clients are required.
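As a rough illustration of a directory acting as a coordination primitive, the sketch below uses the fact that mkdir on a POSIX filesystem either creates the directory or fails because it already exists. The draft does not yet say what Dawn's coordination directories look like, so LOCK_ROOT, the lock name and the acquire/release helpers are assumptions, and a temporary local directory stands in for a Dawn mount. Dawn would have to provide the same guarantee across machines.

```shell
#!/usr/bin/env bash
# Sketch only: a temporary directory stands in for a Dawn mount.
# LOCK_ROOT and the lock name "nightly-report" are hypothetical.
LOCK_ROOT="$(mktemp -d)"
LOCK="$LOCK_ROOT/nightly-report"

# mkdir either creates the directory or fails because it already exists,
# so only one caller can succeed in taking the lock.
acquire() { mkdir "$LOCK" 2>/dev/null; }
release() { rmdir "$LOCK"; }

if acquire; then first="acquired"; else first="busy"; fi
# A second attempt (imagine another script on another machine) is refused.
if acquire; then second="acquired"; else second="busy"; fi
release
# Once released, the lock can be taken again.
if acquire; then third="acquired"; else third="busy"; fi
release
rmdir "$LOCK_ROOT"

echo "$first $second $third"
```

A script that only ever calls mkdir and rmdir on a configured path needs no knowledge of what implements the lock behind that path.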

Dawn needs to decide on a consistent approach to operations: whether they should be modelled as filesystem directories, reuse POSIX commands or get their own commands. The more we reuse common filesystem concepts, the less existing applications and scripts need to change to gain new capabilities. However, bending the filesystem to do certain operations may become less natural. For example, exposing errors becomes difficult.

1. Userland filesystem

TODO.

2. Distributed

TODO.

3. Transactions

Dawn exposes transactions as a virtual directory so that any application, whether it is a bash script or a program written in any language, can create transactions and benefit from transactional scope and transactional guarantees without binding to specific APIs. The file path is a query statement. It can contain information like the transaction scope, not just the "file" name. This can be really advantageous.

Creating a transaction

Let's pretend we have this filesystem mounted to /foo. We might be storing website orders in /foo/orders/*.json. What if /.snapshots was a special directory, similar to /proc on Unix systems? This would work on both Unix and Windows.

Creating a new transaction in a bash script could look as follows.

Option 1.

mkdir /snapshots/sometransaction_id
cd /snapshots/sometransaction_id

/snapshots/sometransaction_id now represents a point-in-time snapshot of / and any program operating in that directory is working within the transaction's isolated scope.

sometransaction_id is a new transaction and the MVCC backend has created an isolated version of the root filesystem for us. Whatever we read or write in this directory is a representation of / from the time the mkdir happened. Potentially we could also support something like subtree transaction scope.

mkdir /foo/.snapshots/sometransaction_id

Option 2.

ln /etc /snapshots/sometransaction_id
cd /snapshots/sometransaction_id

Option 3.

begin /etc /snapshots/sometransaction_id
cd /snapshots/sometransaction_id
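The point-in-time behaviour described under Option 1 can be mimicked on an ordinary filesystem to see what a script would observe. This is a simulation under stated assumptions, not how Dawn would implement it: a recursive cp stands in for the MVCC snapshot, and ROOT, SNAP and the order files are made up for the example.

```shell
#!/usr/bin/env bash
# Simulation only: cp -R stands in for the MVCC snapshot that Dawn would
# create on mkdir. ROOT, SNAP and the order files are hypothetical.
ROOT="$(mktemp -d)"
mkdir -p "$ROOT/live/orders" "$ROOT/snapshots"
echo '{"id": 1, "status": "new"}' > "$ROOT/live/orders/1.json"

# "Begin": capture the tree as it is right now.
SNAP="$ROOT/snapshots/sometransaction_id"
cp -R "$ROOT/live" "$SNAP"

# A write outside the transaction, made after the snapshot was taken...
echo '{"id": 2, "status": "new"}' > "$ROOT/live/orders/2.json"
# ...and a write inside the transaction.
echo '{"id": 1, "status": "paid"}' > "$SNAP/orders/1.json"

# Count what each side can see.
live_count=$(ls "$ROOT/live/orders" | wc -l | tr -d ' ')
snap_count=$(ls "$SNAP/orders" | wc -l | tr -d ' ')
# The uncommitted "paid" status must not be visible in the live tree.
live_paid=$(grep -c '"paid"' "$ROOT/live/orders/1.json" || true)

echo "live sees $live_count orders, snapshot sees $snap_count"
rm -rf "$ROOT"
```

The snapshot does not see the order written after it was taken, and the live tree does not see the uncommitted change made inside the snapshot.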

Committing a transaction

Option 1.

mv /snapshots/sometransaction_id /commit

Option 2.

commit /snapshots/sometransaction_id

Rolling back a transaction

Option 1.

rm -rf /snapshots/sometransaction_id

Option 2.

rollback /snapshots/sometransaction_id

Since we are just dealing with directories, you can write PowerShell or bash scripts that create and commit transactions without having to change the command line tools that read and write data, like grep.
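A wrapper script along these lines might run an unmodified command inside a transaction directory, commit if the command succeeds and roll back if it fails. The sketch is a simulation on a temporary directory: begin, commit, rollback and run_in_tx are hypothetical helpers, and the copy-and-swap used here is neither atomic nor MVCC. On a Dawn mount the helpers would presumably reduce to the mkdir, mv and rm -rf calls listed in the options above.

```shell
#!/usr/bin/env bash
# Simulation only: begin/commit/rollback are hypothetical stand-ins that
# copy and swap directories; this is not atomic and not real MVCC.
ROOT="$(mktemp -d)"
mkdir -p "$ROOT/live" "$ROOT/snapshots"
echo "pending" > "$ROOT/live/status.txt"

begin()    { cp -R "$ROOT/live" "$ROOT/snapshots/$1"; }
commit()   { rm -rf "$ROOT/live" && mv "$ROOT/snapshots/$1" "$ROOT/live"; }
rollback() { rm -rf "$ROOT/snapshots/$1"; }

# Run a command inside a transaction: commit on success, roll back on failure.
run_in_tx() {
  local id="$1"; shift
  begin "$id" || return 1
  if (cd "$ROOT/snapshots/$id" && "$@"); then
    commit "$id"
  else
    rollback "$id"
    return 1
  fi
}

# The job itself knows nothing about transactions; it just writes a file.
run_in_tx tx1 sh -c 'echo shipped > status.txt'
after_commit=$(cat "$ROOT/live/status.txt")

# A failing job leaves the committed tree untouched.
run_in_tx tx2 sh -c 'echo broken > status.txt; exit 1' || true
after_rollback=$(cat "$ROOT/live/status.txt")
leftover=$(ls "$ROOT/snapshots" | wc -l | tr -d ' ')

echo "$after_commit $after_rollback $leftover"
rm -rf "$ROOT"
```

The command passed to run_in_tx only ever sees relative paths in its working directory, which is the property that lets existing tools stay unchanged.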

Risks

There's an interestingly flexible but also dangerous aspect to this. Any transaction can access any other transaction's data just by referencing that transaction's directory path, so you could write uncommitted data into another transaction with a simple cp/copy. As long as you stay within your own transaction directory you're safe. What makes this interesting is the finer-grained level of control. In a multi-operation transaction in an RDBMS, I don't know of any database that lets me say that one single change should be visible to transactions I hand-pick. I can't think of a use case for this yet, but the level of control sounds interesting, and it's explicit, not accidental.
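To make the hand-picked visibility concrete, here is a small simulation with two transaction directories. As before, plain copies in a temporary directory stand in for Dawn snapshots, and tx_a, tx_b and config.txt are names invented for the example.

```shell
#!/usr/bin/env bash
# Simulation only: plain directory copies stand in for two Dawn transactions.
# ROOT, tx_a, tx_b and config.txt are hypothetical.
ROOT="$(mktemp -d)"
mkdir -p "$ROOT/live" "$ROOT/snapshots"
echo "v1" > "$ROOT/live/config.txt"
cp -R "$ROOT/live" "$ROOT/snapshots/tx_a"
cp -R "$ROOT/live" "$ROOT/snapshots/tx_b"

# tx_a makes an uncommitted change...
echo "v2-draft" > "$ROOT/snapshots/tx_a/config.txt"
# ...and deliberately shares just that one file with tx_b.
cp "$ROOT/snapshots/tx_a/config.txt" "$ROOT/snapshots/tx_b/config.txt"

seen_by_b=$(cat "$ROOT/snapshots/tx_b/config.txt")
seen_by_live=$(cat "$ROOT/live/config.txt")

echo "tx_b sees $seen_by_b, committed tree sees $seen_by_live"
rm -rf "$ROOT"
```

The change is visible to tx_b because of an explicit copy, while the committed tree still holds the old value.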

4. Coordination primitives

TODO.

5. Secondary indexing and querying

TODO. MIME types tell the filesystem how to index files. Hello WinFS? Can we do better?

@francois
francois commented Dec 1, 2017

I like the ideas expressed for transactions. Since transactions represent the tree at a point in time, wouldn't cp be a more appropriate representation of a transaction, à la:

# open transaction
cp /etc /foo/snapshots/some_transaction_id
cd /foo/snapshots/some_transaction_id
# do some stuff
commit /foo/snapshots/some_transaction_id

What happens when a transaction rolls back but there is a process holding the transaction's directory inode open, such as its CWD? Does the snapshot stay accessible to the process that's holding the inode, or should every operation error out?

What about merging? Two processes attempting to modify the same file in separate transactions? First commit wins? The other errors out? Because of the semantics of commit, I think it makes more sense to create a separate command for that. Rollback can probably be implemented as a simple rm -r.

Good ideas! Hope to see this developed further.

@petar
petar commented Dec 1, 2017

To me, Dawn bears resemblance to the Self-certifying File System of David Mazières: en.wikipedia.org/wiki/Self-cert…

SFS takes care of global security, which would be needed by Dawn in one form or another. Dawn would then be adding transactions on top, it seems. Looks like overall the architecture of Dawn could be very much that of SFS, with local client/server daemons that act as kernel FUSE plugins.

@r-marques

If I understand correctly, the idea here is to offload common database characteristics into the underlying file system. This would make the file system behave more like a database, and it would make it easier to develop a DBMS on top of the file system.

Would this be much different from just taking a database, adding a POSIX file system API on top and using it as a file system?

@distobj
distobj commented Dec 5, 2017

I've never been a big fan of stretching the file system abstraction over the network (for all those Waldo '94 reasons). So while the bash commands are illustrative, I hope they're not prescriptive. Perhaps resources instead of files?
