warpfork/unixfsv2.md

Filesystems & IPLD

Introduction

IPLD is a system for creating decentralized systems based on content-addressable data primitives.
One of the things people frequently want to do when building applications is handle files, and describe filesystems.
As a result, we've compiled some thoughts and recommendations about how to describe filesystems in IPLD,
and also produced some specifications that we suggest systems use and build upon.
There will not be "one true way" to describe filesystems in IPLD.
IPLD is an open-ended ecosystem and there can be many different ways to accomplish goals using IPLD.
What we discuss in this document will be just one way that we've thought through particularly thoroughly.
If it fits your needs as an application designer, we hope you will use it; if not, we hope it is at least useful inspiration.
This document will also introduce some of the features of IPLD that we suggest using to describe filesystems,
and demonstrate how we apply them.  Even if you build different filesystem descriptions than ours here,
these tools and the rough idea of how we compose them should probably be highly reusable.
Unixfsv2

Unixfsv2 is the name of a set of conventions we propose for handling filesystems in IPLD.
(There is also a unixfsv1!  However, it's very different, and comes from IPFS, before IPLD was recognizably extracted.
We won't talk much more about it here.)
Unixfsv2 is designed to fit naturally within the IPLD Data Model, and is described using IPLD Schemas for clarity.
It leverages features like IPLD ADLs (Advanced Data Layouts) to solve tricky problems like sharding large data in a nicely layered way.
File

The type for files in unixfs is quite simple; a file is just a big blob of bytes:
type File bytes
  using { ADL="FBL" }

The Flexible Byte Layout (FBL) ADL is used to support arbitrarily large data.
(The FBL ADL is good at random access and is not strict about internal tree structure;
for small data, it can also simply be a bunch of bytes with no linking at all.)
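
As a rough illustration of the random-access property, the sketch below models a byte layout as either raw `bytes` or a list of `(length, part)` pairs, and reads a range by skipping whole parts using only their recorded lengths. This is a simplified in-memory model written for this document, not the FBL specification: `read_range` and the tuple shape are assumptions, and in a real layout the parts could be links to other blocks.

```python
from typing import List, Tuple, Union

# Hypothetical in-memory model: a layout is raw bytes, or a list of
# (length, part) pairs where each part is itself a layout.
Layout = Union[bytes, List[Tuple[int, "Layout"]]]


def read_range(layout: Layout, offset: int, size: int) -> bytes:
    """Return up to `size` bytes starting at `offset`, skipping unneeded parts."""
    if isinstance(layout, bytes):
        return layout[offset:offset + size]
    out = bytearray()
    for length, part in layout:
        if size <= 0:
            break
        if offset >= length:
            # The whole part lies before the requested range:
            # skip it without descending into it.
            offset -= length
            continue
        chunk = read_range(part, offset, min(size, length - offset))
        out += chunk
        size -= len(chunk)
        offset = 0
    return bytes(out)


small = b"hello"  # small data: plain bytes, no tree at all
large = [(5, b"hello"), (7, [(2, b", "), (5, b"world")]), (1, b"!")]
```

Both `small` and `large` are valid layouts in this model, and `read_range(large, 7, 5)` only descends into the second part, because the first part's recorded length is enough to skip it.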
Named Files

Some applications want to describe single named files.
(Think: attachments in an issue tracker, etc.)
This use case involves no attributes, no directory structures, etc -- just names.
We propose the following simple structure be used for this:
type NamedFile struct {
	filename String
	body     File
}

Recall that for small quantities of bytes, the File type can still just be an inline set of bytes --
meaning the entire NamedFile structure could be in one block if it wants to;
or, it could be the start of a document spanning multiple blocks, depending on the behavior of the ADL working on File.
Directories

Directories are essentially a map from filename to a file... or another directory!  They may also carry some sort of attributes.
Let's focus on the map structure first (this is just for didactic purposes; what we actually recommend is described below):
type Filename string

type File bytes
  using { ADL="FBL" }

type Directory {Filename:AnyFile}
  using { ADL="HAMT" }

type AnyFile union {
	| "f" File
	| "d" &Directory
} representation keyed

Notice how when we use a Directory type, we don't use the NamedFile type at all.
Filenames are already part of the structure of a Directory: it would be redundant to use the NamedFile type.
However, both still eventually lead to a File type -- this is the most important part to share, since it's likely the largest piece of data.
We use another ADL here: the Directory type uses a "HAMT" (Hash Array Mapped Trie).
A HAMT is a system for sharding a map across multiple blocks of data.
It's somewhat similar to a B+ tree, but it also has rules which result in a canonicalized form: the same set of entries should produce the same tree regardless of insertion order, which has nice emergent behaviors (such as deduplication) when used in decentralized systems.
Using this ADL means we can support directories of nearly any size.
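
To give a feel for how a HAMT places entries, here is a minimal sketch of the index computation only: the key is hashed, and successive groups of bits from the hash select a slot at each level of the tree. The hash function (SHA-256), the 8 bits per level, and the name `hamt_indices` are assumptions chosen for illustration; a real HAMT ADL fixes these parameters in its own specification.

```python
import hashlib
from typing import List


def hamt_indices(key: str, bit_width: int = 8, depth: int = 3) -> List[int]:
    """Return the slot index a key would use at each of the first `depth` levels."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    as_int = int.from_bytes(digest, "big")
    total_bits = len(digest) * 8
    mask = (1 << bit_width) - 1
    indices = []
    for level in range(depth):
        # Take the next `bit_width` bits of the hash, most significant first.
        shift = total_bits - bit_width * (level + 1)
        indices.append((as_int >> shift) & mask)
    return indices
```

Because the slot depends only on the key's hash, and not on when the key was inserted, two parties building a directory from the same filenames would visit the same slots.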
Note how in the AnyFile union, the Directory member is prefixed with an & symbol.
This means a link should be there -- the Directory data will be in a new block, and we'll point to it here with a CID.
Directories (for real)

Okay, let's expand that a little bit.  (This will be closer to the real thing.)
We also need to account for attributes.  Right now, let's keep that to an Attribs type, and we'll decide what it actually is in the next section.
Also, let's throw in another file type -- symlinks.
Here's what we get now:
type Filename string

type File bytes
  using { ADL="FBL" }

type Directory {Filename:DirEnt}
  using { ADL="HAMT" }

type Symlink struct {
	target String
}

type DirEnt struct {
	attribs Attribs
	content AnyFile
}

type AnyFile union {
	| "f" File
	| "d" &Directory
	| "l" Symlink
} representation keyed

type Attribs struct {
	# we'll discuss this in the next section;
	# for now, it's enough to reserve the position where it's used.
}

Here, the Attribs info is embedded into directories.
The number of blocks expected and the ways they will be sharded are the same in this schema as in the previous one, despite the added types!
Notice how easy it was to add the Symlink type to the AnyFile union, also.
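
Here is a small sketch of how data shaped like this schema might be walked. It assumes decoded blocks are plain Python dicts, uses made-up strings such as `"cid-root"` in place of real CIDs, and ignores HAMT sharding entirely (a plain dict stands in for each Directory). The names `BLOCKS`, `unwrap`, and `lookup` are hypothetical. Symlinks are returned as-is rather than followed.

```python
from typing import Any, Dict, Tuple

# Hypothetical block store: maps a CID-like string to a decoded Directory.
BLOCKS: Dict[str, Dict[str, Any]] = {
    "cid-root": {
        "README": {"attribs": {}, "content": {"f": b"hi\n"}},
        "src": {"attribs": {}, "content": {"d": "cid-src"}},
        "latest": {"attribs": {}, "content": {"l": {"target": "src/main.c"}}},
    },
    "cid-src": {
        "main.c": {"attribs": {}, "content": {"f": b"int main(){}\n"}},
    },
}


def unwrap(content: Dict[str, Any]) -> Tuple[str, Any]:
    """A keyed union is a single-entry map: the key names the member in use."""
    ((kind, value),) = content.items()
    return kind, value


def lookup(root_cid: str, path: str) -> Tuple[str, Any]:
    """Walk `path` from the root directory; return (union key, value) of the entry."""
    directory = BLOCKS[root_cid]
    *parents, leaf = path.split("/")
    for name in parents:
        kind, value = unwrap(directory[name]["content"])
        if kind != "d":
            raise NotADirectoryError(name)
        directory = BLOCKS[value]  # follow the link into the next block
    return unwrap(directory[leaf]["content"])
```

For example, `lookup("cid-root", "src/main.c")` crosses one link (the `"d"` member) and ends at an `"f"` member holding the file bytes.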
Attributes

Ahh, "attributes".  Here be dragons.
There are many different concepts of "attributes" out there.
Windows filesystem attributes.
Mac filesystem attributes.
POSIX filesystem attributes (which ones?).
Tar format attributes.
Zip format attributes.
"Extended" attributes (xattrs).
Many of these concepts of "attributes" are close -- but none of them are exactly equal to any one of the others.
What, then, do we do about this?
IPLD Schemas to the rescue: We're not going to pick a single approach here, but rather outline several:
applications can choose which of these concepts of "attributes" they want to plug into their overall schema.
For example:
type Attribs struct {
	executable Bool
}

This is one of the simpler attribute models one could use: is the file "executable"?
(This is a unix'y concept more so than a windows concept -- but it's also a brazen simplification of the unix concept.)
Or, we could describe a much larger set of attributes:
type Attribs struct {
	mtime Int # Seconds since the Unix epoch, in 1-second granularity.
	posix Int # The familiar unixy 0777 mask packing.
	sticky Bool
	setuid Bool
	setgid Bool
	uid Int
	gid Int
}
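
As a sketch of how these fields relate to a familiar unix mode value, the snippet below splits a mode integer into the `posix`, `sticky`, `setuid`, and `setgid` fields, and derives the simpler "executable" notion from the permission bits. The function names are hypothetical, and treating "any execute bit set" as executable is one possible simplification, not something this document prescribes.

```python
import stat
from typing import Dict, Union


def attribs_from_mode(mode: int, mtime: int, uid: int, gid: int) -> Dict[str, Union[int, bool]]:
    """Build a map shaped like the larger Attribs struct from a unix mode value."""
    return {
        "mtime": mtime,
        "posix": mode & 0o777,  # the rwxrwxrwx permission bits only
        "sticky": bool(mode & stat.S_ISVTX),
        "setuid": bool(mode & stat.S_ISUID),
        "setgid": bool(mode & stat.S_ISGID),
        "uid": uid,
        "gid": gid,
    }


def is_executable(attribs: Dict[str, Union[int, bool]]) -> bool:
    """Assumed simplification: executable if any of the three x bits is set."""
    return bool(attribs["posix"] & 0o111)
```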

Or, we could make a third set, which includes the posix and mtime fields above, but ditches uid/gid/setuid/setgid/sticky.
One can make two different schemas, one with each of these definitions of Attribs, and use either of them -- or both.  Or more than two!
Remember: IPLD Schemas use structural typing, not nominative typing -- which means they can be applied as pattern recognition for data.
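
To make "pattern recognition" concrete, here is a toy sketch that checks which of several Attribs definitions a piece of data fits, based purely on its shape. It is not an IPLD Schema implementation: the `SCHEMAS` table, the schema names, and the functions are hypothetical stand-ins, and each "schema" is reduced to a map from field name to expected Python type.

```python
from typing import Any, Dict, List

# Hypothetical, simplified stand-ins for three Attribs schemas.
SCHEMAS: Dict[str, Dict[str, type]] = {
    "simple": {"executable": bool},
    "posixish": {"mtime": int, "posix": int},
    "full": {
        "mtime": int, "posix": int,
        "sticky": bool, "setuid": bool, "setgid": bool,
        "uid": int, "gid": int,
    },
}


def matches(node: Dict[str, Any], fields: Dict[str, type]) -> bool:
    """True if `node` has exactly these fields, each with the expected kind."""
    if set(node) != set(fields):
        return False
    for name, expected in fields.items():
        value = node[name]
        # bool is a subclass of int in Python, so reject it for Int fields.
        if expected is int and isinstance(value, bool):
            return False
        if not isinstance(value, expected):
            return False
    return True


def matching_schemas(node: Dict[str, Any]) -> List[str]:
    """Try every known schema against the data and report the ones that fit."""
    return [name for name, fields in SCHEMAS.items() if matches(node, fields)]
```

The data carries no type name; an application simply tries the schemas it knows about and proceeds with whichever one fits.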
Other Systems

Unixfsv2 will probably not be a be-all, end-all system.
Unixfsv2 aims to provide a simple standard for filesystems and directories that can be large in size.
There are many more things Unixfsv2 is not trying to solve.  For example:

Signed (but freely readable) contents
Encryption
"Capability" systems (encrypted or otherwise)
Efficient tracking of partial mutations, or conflict resolution

... and so on!
We hope that we can make as many parts of this spec reusable (piecemeal if necessary) as possible.
Sharing the leaf structure

In general, we hope that we can have shared conventions for data structures on the leaves --
e.g., it's especially useful if we can have everyone agree on the File type,
because if those leaves on a large DAG of filesystem content are shared by two different systems,
even if the directory structure over them is distinctive, a great deal of the overall content will be deduplicatable.
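
The deduplication effect can be sketched with a toy content-addressed store. Here a SHA-256 hex digest stands in for a CID (real CIDs carry more information, such as codec and hash type), and the `store`, `put`, `dir_a`, and `dir_b` names are made up for this example.

```python
import hashlib
from typing import Dict

# Toy content-addressed store: the key is derived from the block's bytes.
store: Dict[str, bytes] = {}


def put(block: bytes) -> str:
    """Store a block under its hash and return that hash as its address."""
    key = hashlib.sha256(block).hexdigest()
    store[key] = block
    return key


photo = b"pretend these are image bytes"
notes = b"meeting notes\n"

# System A: a flat directory.
dir_a = {"photo.png": put(photo), "notes.txt": put(notes)}

# System B: a different directory structure and different names,
# but the same leaf bytes.
dir_b = {
    "images": {"holiday.png": put(photo)},
    "docs": {"notes.txt": put(notes)},
}
```

After both systems have written their files, `store` holds only two blocks: the directory structures differ, but the leaves resolve to the same addresses.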
Sharing the vocabulary and design patterns

Hopefully sharing vocabulary and patterns of design is still useful :)
Sharing topology for pathing and Selectors

// todo discuss further
Popularizing understandings of attributes

As mentioned earlier -- there probably won't be "one" "true" understanding of filesystem "attributes".
But maybe we can compile a small set of them which a wide range of different projects agree to recognize.
This can make interop between different projects easier.
Related Docs and Prior Art


Some early drafts and discussion of Unixfsv2 can be found here: https://github.com/ipld/specs/blob/caa5af41702b026683f2c35c2dc701fc88c31f98/design/history/exploration-reports/2019.06-unixfsv2-spike-01.md
This concept of various schemas is also discussed in this gist: https://gist.github.com/warpfork/3948bd951e93c0f0b4e355d78b736f83
Filesystem attributes (with a particular lean towards linux perspectives) and tree layouts were workshopped once upon a time in this shared document: https://hackmd.io/4lqtycvdQN2WTspBLpy3qw
The issues and pull-requests in this repo contain various discussions of filesystems and attributes: https://github.com/ipld/legacy-unixfs-v2/