Skip to content

Instantly share code, notes, and snippets.

@sstelfox
Last active May 8, 2024 17:54
Show Gist options
  • Save sstelfox/c6b6cfe998b6fa5377aebca3313405d5 to your computer and use it in GitHub Desktop.
Save sstelfox/c6b6cfe998b6fa5377aebca3313405d5 to your computer and use it in GitHub Desktop.
title created_at updated_at public tags
BanyanFS Internal Spec Draft
2024-01-31 18:15:01 -0500
2024-03-17 18:18:35 -0400
false
needs-refinement

BanyanFS Internal Spec Draft

  • Fully encrypted
    • Format managed write authorization control
    • Varying access encrypted keys for selective permissioning of filesystem, data, and history
    • Audited change-log built and synced as a property of the filesystem
    • Can be disabled to provide a public view while preserving the authenticity guarantees
    • Key rotation built in for access removal (applies only to current and future changes in the filesystem)
  • Sparse file system view support
    • Mounting and exploration of massive filesystems on demand
  • Privacy preserving while allow space reclamation
  • Merkle-CRDT based journal
    • Supports snapshot based journal trimming to prevent unbounded growth
    • Guaranteed convergence of disjoint views as long as there is overlap between the snapshot's journals
    • Operation aggregation
    • Provides fast delta based replication with automatic conflict resolution
  • Built-in support for external filesystem references
  • Flexible block sizes
    • Low overhead storage of large files
    • Low overhead for single or many small files
  • Data block aggregation
    • Efficiently pack many small files into a single data replication unit
  • Optimized for streaming over the network, sparse access, and long term storage
    • Selective inclusion of segments based on the application needs
    • Tolerant of out-of-order update delivery
    • Separation of metadata and data synchronization for fast sparse filesystem access
  • Built in low-overhead bit-corruption recovery for long duration storage on physical media
  • Remote automated maintenance of outdated storage information
  • Native inclusion of data references for UnixFS / IPFS
  • Does not make use of 128 bit integer to ease use in a wide variety of programming languages

Considerations

TODO: Cover all considerations and trade off I can remember making here, start noting them down

  • Unencrypted visibility of filesystem ID
  • Blind key identities for private filesystems (unable to see anything about the keys that are members of the set without being a member of the set)
  • Discover-ability of future versions (requires some kind of external coordination system or truth oracle). This is outside the scope of the format but could be a central platform or a blockchain.
  • Historical key presence: Allow reverting or selectively migrating access to the new version. Can be cleaned up by the implementer at their discretion. Provides an persistent indicator in the filesystem's history of the origin of data.
  • The format should never re-encrypt data segments without reason as it is an expensive operation and the data may not be locally available.
  • Unauthenticated parsing does not include any variable length data with the exception of a tightly bounded count of key participants present and requires no trust or complex parsing to test or inspect before being cryptographically verified.
  • All data will be stored little-endian unless otherwise specified

Cryptography

Each unique actor that will interact with this filesystem is identified by the public portion of a ECDSA key using the NIST P-384 curve (also known as the secp384r1 curve).

Internally in the format the public key is reduced to a weakened "Key ID" with final identity being tied by confirming signatures against the public key of record which is also included in the format. See the Key ID section for additional details on these identifiers.

Content IDs

Data, nodes, journal entries, and filesystem state are all represented by 32-byte blake3 hashes which are referred to as CIDs. External users or actors may be inclined to reference the same components through the use of the CIDv1 standard. Converting our internal CID format to the CIDv1 specification format requires pre-pending the bytes 0x00, 0x01, 0x1e to our internal format. These bytes have been omitted for protocol efficiency.

Symmetric Encryption / Decryption

  • 32 byte keys
  • XChaCha20Poly1035
  • 24 byte nonce
  • Authenticated data is included for some encrypted payloads as defined in the section.

Escrowed Key Sharing

When generating versions of data and to allow selective access to different data with a content segment, key material needs to be shared asynchronously with intended recipients. The recipients always have their public key present in the format and this knowledge is leveraged to perform a reproducible diffie-hellman key exchange asynchronously.

Encrypting

  • ECDH Key Exchange
    • Use private ephemeral generated and discarded one-time ECDSA P-384 curve
    • Use the public key of the intended secret recipient
  • Create an HKDF with the SHA256 digest using the derived secret material from the ECDH exchange
  • Perform an expand operation on the HKDF to produce 32 bytes out of key material.
    • Use this key material as a symmetric key for any content that needs to be encrypted and shared with the recipient
  • Share the ephemeral public key, and the encrypted data with the recipient

Decrypting

  • Receive the ephemeral public key, and the encrypted data
  • ECDH Key Exchange
    • Use your private key
    • Use the received ephemeral public key
  • Create an HKDF with the SHA256 digest using the derived secret material from the ECDH exchange
  • Perform an expand operation on the HKDF to produce 32 bytes out of key material.
    • Use this key material as a symmetric key to decrypt the data

Signature Generation

  • ECDSA keys using the NIST P-384 curve
    • Signatures using SHA3-384

Format Details

This format defines an on-disk and on-net protocol for exchanging and storing both a complete copy of the BanyanFS filesystem as well as delta and partial history snapshots of the filesystem itself. This document will refer to one complete instance of this format as a segment regardless of whether it is being stored on disk or traversing a network.

# 'Identity Header:32,Filesystem ID:16,Public Settings:16,Content Payload (1+):32?bits=32,oddchar=-'  
+---------------------------------------------------------------+  
|                        Identity Header                        |  
+---------------------------------------------------------------+  
|         Filesystem ID         |        Public Settings        |  
+---------------------------------------------------------------+  
|                      Content Payload (1+)                     |  
+---------------------------------------------------------------+

The above diagram represents the overall structure of a segment and should not be taken to scale for the data that it contains. Each section is described in detail with an accurate bit representation diagram.

Everything from the beginning of the Identity Header through the end of the Signature in the above diagram is considered to be the "Segment Header" and will be referred to as such for the remainder of this spec.

External Data Types

We encode a series of external data such as cryptographic keys into the protocol. We have chosen the most common minimum binary representation of such objects. The sizes, formats, and details of construction for these types is defined in this section and will be referenced in the subsequent sections.

  • Authentication Tag: Produced as a result of encrypting a stream with the XChaCha20Poly1305 stream cipher, this is a 16 byte field.
  • Blake3 Hash: While the blake3 specification allows for varying length outputs, this format uses the default length of 32 bytes. Key IDs are truncated versions of the full 32 byte version implemented as described in the Key ID section.
  • CIDs: Internally this format omits the multibase, version identifier, and multicodec portions of standard CIDv1. This version opinionates on the choice of no base encoding (Multibase 0x00), and blake3 (Multicodec 0x1e). This results in static 32-byte identifiers wherever we need them in the format.
  • Nonce: XChaCha20 uses an extended nonce of 24 bytes reducing the potential to re-use the same number when this is generated fully randomly. The implementor may choose to use a CSPRNG to generate these values. If guaranteed re-use resistance is desired, the current journal clock, combined with node unique information in the filesystem for whatever operation is being performed can be used as a uniqueness oracle. Future reversions of this spec will include a recommendation for this derivation.
  • Private / Signing Key: This specification makes use of NIST P-384 private code points for asynchronous private keys. These can be represented minimally as 48 byte binary strings are more commonly stored in DER or PEM format. This format does not store or encode private keys but does provide some test vectors of sample private keys. The sample keys are stored in the minimal format. Implementors may store the private key as they choose.
  • Public / Verification Key: Derived from a P-384 private or signing key. The public key when encoded is converted into the Elliptic-Curve-Point-to-Octet-String encoding described in SEC 1: Elliptic Curve Cryptography (Version 2.0) section 2.3.3 (page 10). Encoded this is 49 bytes.
  • Symmetric Key: Keys used inside this format are used with the XChaCha20Poly1305 authenticated stream cipher which requires 32 byte keys. It is expected that the implementor, when required to, is both familiar with and follows best practices for random key generation.
  • UUID: TODO doc

Key ID

Key IDs are a 2-byte truncated fingerprint of the P-384 public keys used by this format.

There is at most 255 possible key slots available at any given time. Two bytes is generally sufficient to prevent collisions but they will occur. In the event of a key ID collision when verification needs to take place, each key should be tried, oldest key first, to validate a signature before rejecting or settling on the identity of the author.

To generate a Key ID, generate a 32-byte blake3 hash over the encoded variant of the NIST P-384 Public / Verification Key (as described in the External Data Type section). The two most significant bytes of the resulting hash are that's public key's Key ID.

More complex persistent key identification and registration schemes were considered but ultimately rejected as they only greatly reduced the probability of collisions not ruling them out entirely.

Identity Header

# 'Magic Bytes (BYFS):32,R:1,Version:7?bits=40,oddchar=-'
 0                   1                   2                   3                  
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9
+-------------------------------------------------------------------------------+
|                       Magic Bytes (BYFS)                      |R|   Version   |
+-------------------------------------------------------------------------------+
  • 4 bytes, Magic Bytes (BYFS): Magic signature hard coded to the the ASCII bytes BYFS (0x42, 0x59, 0x46, 0x53). This is a loving reference to Banyan.
  • 1 bit, R: Reserved for future use. Encoders MUST NOT SET this to bit. Decoders SHOULD ignore this bit.
  • 7 bits, Version: An unsigned 7-bit integer representing the current version of the BanyanFS this segment is using.
    • Version 0 is reserved. Decoders SHOULD reject files with field set to zero (0bx000_0000).
    • Version 1 is defined in this document. Encoders following this specification MUST set this value to 1 (0bx000_0001). Decoders MUST assume segments with this value conform to this specification.

Filesystem ID

The Filesystem ID is a 16-byte little-endian encoded UUID as defined in RFC4122 and draft-peabody-dispatch-new-uuid-format-04. This value acts as a persistent identifier for a singular filesystem separate from the CID that represents the current HEAD version of the root of the filesystem.

The value must meeting the following requirements:

  • The bits encoding the filesystem ID MUST NOT consist of all zeros or all ones.
  • The format SHOULD make a best effort to not reuse filesystem IDs that have already been generated.
    • Generating random values for an ID such as generating correct instances of UUIDv4 or UUIDv8 IDs is considered sufficient
    • Utilizing monotonically increasing clock based variants such as UUIDv1, UUIDv6, UUIDv7, or UUIDv8 is considered sufficient
  • It is RECOMMENDED to generate either a correct UUIDv4 or UUIDv8 (fully-pseudorandom) for this value
  • The implementation MUST store the little-endian byte encoded version of the UUID instead of the user-readable version.

Public Settings

# 'Reserved:24,ECC:4,Pvt:4?bits=32,oddchar=-'
+---------------------------------------------------------------+  
|                    Reserved                   |  ECC  |  Pvt  |  
+---------------------------------------------------------------+
  • 6 bits, Reserved: Reserved for future use. Encoders MUST NOT SET these bits. Decoders SHOULD ignore these bits.
  • 1 bit, ECC: Whether error correction codes are present at the end of the segment.
  • 1 bit, Private: Filesystem privacy flag. This bit indicates whether this segment of the filesystem is encrypted. Adjusts the format of the Access Control section.

Content Payload

The content segments section of the segment contains the primary content of the filesystem and depending on the intended use of the construction may contain different sections. In both streaming and archival uses cases it is potentially useful to aggregate multiple content payloads into a single file or network stream.

Each unique content payload MUST be for the same filesystem identifier and MUST NOT change between public and private settings between payloads. Each Content Payload should be self-contained and should be treated as complete transactional units or snapshots.

# 'Escrowed Access Keys (Private):32,History Start:16,Permission Control:16,Realized View (Opt):32,Journal (Opt):16,Journal Index (Opt):16,Maintenance Log (Opt):16,Data (Opt):16,History End:16,Signature:16,ECC:16?bits=32,oddchar=-'
+---------------------------------------------------------------+  
|                 Escrowed Access Keys (Private)                |  
+---------------------------------------------------------------+  
|         History Start         |       Permission Control      |  
+---------------------------------------------------------------+  
|                      Realized View (Opt)                      |  
+---------------------------------------------------------------+  
|         Journal (Opt)         |      Journal Index (Opt)      |  
+---------------------------------------------------------------+  
|     Maintenance Log (Opt)     |           Data (Opt)          |  
+---------------------------------------------------------------+  
|          History End          |           Signature           |  
+---------------------------------------------------------------+  
|              ECC              |  
+-------------------------------+

Encryption for Private Filesystems

The nature of replaying the journal requires all participants to be able to know who has appropriate access to do what, and a copy of their public key. This information is stored in the Permission Control section and binds specifics identities to their inclusion. By doing a level of encryption indirection we can enhance the privacy of participants with little extra cost.

When the private bit has been set in the preceding Public Setting section, the subsequent Escrowed Access Keys, Permission Control, and Content Options sections will be encrypted with a symmetric key that is available to authorized identities via the Escrowed Access Key portion of this format.

Public versions of the filesystem still have this section but it is simplified and has slightly different but compatible semantics, instead its responsibility is to identify the valid author keys that can be used to sign journal entries both in the current snapshots and from streaming sources.

Regardless of the presence of encryption we start with a shared count of the number of keys that will be present.

# 'Key Count:8?bits=8,oddchar=-'  
0                 
0 1 2 3 4 5 6 7  
+---------------+  
|   Key Count   |  
+---------------+
  • Key Count: 1 byte

This value is the number of keys present in the Escrowed Access Keys section AND the Permission Control section. When encoding an encrypted filesystem, encoders MUST ensure that the order of keys present in the Escrowed Access Key section and the Permission Control section are the same.

Encoders SHOULD sort the key entries lexographically on the Key ID and Ephemeral Pubkey bits for consistent generation between implementations. Decoders SHOULD allow for these to be out of order.

Escrowed Access Keys

Filesystem participants gains access to the encrypted History Start and Permission Control sections by first acquiring the shared symmetric key for it that was encrypted by the content payload's author.

Decoders should calculate their Key ID as defined in the appropriate section and locate all Escrowed Access Keys that have a matching Key ID. In practice this is almost always going to be a single key but collisions are possible so all matching keys MUST be checked before assuming access is not available.

Each Escrowed Access Key is half of a Diffie-Hellman key exchange that participants private keys can complete as described the in Escrowed Key Sharing section. Decryption of the internal key is considered successful if the ChaCha20Poly1305 authentication tag validates.

For more details on the specific mechanism and operation of encryption please refer to the section "Symmetric Encryption / Decryption" in this specification.

If the decryption or validation fails and ECC information is available, recovery should be attempted following the process documented in the "Error Recovery" section.

The encrypted escrowed keys have the following format:

# 'Key ID:16,Ephemeral PubKey:392,Nonce:192,Enc Filesystem Key:256,Tag:128?bits=36,oddchar=-'  
0                   1                   2                   3             
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5  
+-----------------------------------------------------------------------+  
|             Key ID            |                                       |  
+-------------------------------+                                       +  
|                                                                       |  
+                                                                       +  
|                                                                       |  
+                                                                       +  
|                                                                       |  
+                                                                       +  
|                                                                       |  
+                                                                       +  
|                                                                       |  
+                                                                       +  
|                                                                       |  
+                                                                       +  
|                                                                       |  
+                                                                       +  
|                                                                       |  
+                                                                       +  
|                                                                       |  
+                                                                       +  
|                            Ephemeral PubKey                           |  
+                       +-----------------------------------------------+  
|                       |                                               |  
+-----------------------+                                               +  
|                                                                       |  
+                                                                       +  
|                                                                       |  
+                                                                       +  
|                                                                       |  
+                                                                       +  
|                                 Nonce                                 |  
+                                               +-----------------------+  
|                                               |                       |  
+-----------------------------------------------+                       +  
|                                                                       |  
+                                                                       +  
|                                                                       |  
+                                                                       +  
|                                                                       |  
+                                                                       +  
|                                                                       |  
+                                                                       +  
|                                                                       |  
+                                                                       +  
|                           Enc Filesystem Key                          |  
+                                                       +---------------+  
|                                                       |               |  
+-------------------------------------------------------+               +  
|                                                                       |  
+                                                                       +  
|                                                                       |  
+                                                                       +  
|                                  Tag                                  |  
+                       +-----------------------------------------------+  
|                       |  
+-----------------------+

After successful decryption the encoder will have a 32 byte key that can be used to decrypt the meta portions of the codec format which include the Permission Control, History Start and History End sections. The Permision Control section

TODO: document encrypted format (nonce + payload + tag)

TODO: document size calc nonce size (24 bytes) + history start (9 bytes) + key entries ((89 byte for crypto overhead + 16 bytes for symmetric keys) * key count) + tag size (16 bytes)

When decrypted the internal layout of the data is documented in the History Start and Permission Control sections respectively.

Permission Control

The permission control section allows selective cryptographic access for different identities to the different sections of the format in private filesystems, and allows signaling who are trusted writers to the filesystem for public filesystems.

  • Key Id
  • Public key
  • Vector clock
  • 1 byte, Access Setting: As defined immediately after this definition.
  • 1 byte presence + Filesystem key (private only)
  • 1 byte presence + Data access key (private only)
  • 1 byte presence + Maintenance view key (private only)

The Access Setting byte has the following structure:

# 'Prot:4,Ownr:4,Hist:4,Rsrv:8,FSK:4,MntK:4,DatK:4?oddchar=-'
+---------------------------------------------------------------+
|  Prot |  Ownr |  Hist |      Rsrv     |  FSK  |  MntK |  DatK |
+---------------------------------------------------------------+
  • 1 bit, Prot: Protected Bit - Key can't be deleted unless the owner revokes this bit via a journal action or the filesystem is forked with new owners.
  • 1 bit, Ownr: Owner Bit - Write access to the journal (and effectively the filesystem), authorized to sign changes, segments, and content payloads.
  • 1 bit, Hist: Historical Bit - Indicates the key had access in a previous fork but has since lost write access. Safe to remove the key.
  • 1 bit, Rsrv: Reserved for future use. Encoders MUST NOT SET this bit. Decoders SHOULD ignore this bit.
  • 1 bit, FSK: Realized View Bit - Whether the key has been granted access to the symmetric key protecting the realized view. Decoders SHOULD ignore this bit when the filesystem is public.
  • 1 bit, MntK: Maintenance Key Bit - Whether the key has been granted access to the symmetric key protecting the maintenance log. Decoders SHOULD ignore this bit when the filesystem is public.
  • 1 bit, DataK: Data Bit - Whether the key has been granted access to the symmetric key protecting the file data keys.

Content Options

1 byte, defined following this one. Indicates which optional sections are present in this content payload.

The Content Options byte has the following structure:

# 'Reserved:20,RFs:4,Mnt:4,Data:4?oddchar=-'
+---------------------------------------------------------------+
|                Reserved               |  RFs  |  Mnt  |  Data |
+---------------------------------------------------------------+
  • 5 bits, Reserved: Reserved for future use. Encoders MUST NOT SET these bits. Decoders SHOULD ignore these bits.
  • 1 bit, RFs: Whether the content payload contains a realized view of the filesystem, this can only be set in the first Content Payload section sent in a segment.
  • 1 bit, Mnt: Whether this segment includes maintenance data
  • 1 bit, Data: Whether this segment includes data blocks

The Content Options MUST have at least one of the content section bits set. This means no Content Payload is allowed to have no content present.

Journal Start

This history of the filesystem may be truncated from time to time. The Journal Start Vector and Markle Root CID indicate what the state of the world is as it stands at the beginning of this content payload. The content options fields indicate which format features are in use in later sections and which symmetric section keys to be present in the Permission Control section.

# 'Merkle Root CID:32,Journal Start Vector:32?bits=32,oddchar=-'    0                   1                   2                   3     
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1  
+---------------------------------------------------------------+  
|                        Merkle Root CID                        |  
+---------------------------------------------------------------+  
|                      Journal Start Vector                     |  
+---------------------------------------------------------------+  
  • Merkle Root CID: 32 bytes, The CID of the root node after applying all known operations to the filesystem other than those included in the journal (if any).

  • Journal Start Vector: 8 bytes, The beginning vector clock of the journal after applying all known operations to the filesystem other than those included in the journal (if any).

  • An origin filesystem (one that was freshly created) MUST have its Merkle Root CID set to all zeros.

  • An encoder SHOULD populate the journal start vector a random value. An encoder MAY start this vector at all zeros.

Realized View

If present this section is a more traditional representation of the state of the Filesystem as it appears at the beginning of the history range defined for this content payload. Any journal entries present MUST NOT have been applied to the state of the realized filesystem.

When more than than one Content Payload is present only the first one MAY contain a Realized View. Encoders MUST omit the Realized View on all subsequent Content Payloads and ensure the RFs field in the History Start's Content Option's is NOT SET.

  • Root CID, must point to a folder node
  • BTree of CID -> Node ID mapping
  • BTree of Node ID -> Block Offset

Filesystem Transfer Representation

A distinction is made here between the transfer representation and how users of the filesystem will want to operate on the data. The transfer and storage format is optimized for simplicity in encoding and decoding while being streaming friendly which are not always compatible with pure storage efficiency or have data structures efficient to perform operations over.

There are some recommendations for operational based data structure and the various capabilities they should include in the section titled Filesystem Operations and Behavior.

Common node data:

  • Derived data (not persisted):
    • Node ID
    • Parent node ID
    • CID
    • Size
  • Mandatory metadata:
    • owner: ActorId.
    • permissions: executable, immutable, owner_write_only
    • created_at: Unix Timestamp in milliseconds
    • modified_at: Unix Timestamp in milliseconds
  • Vector clock (the journal clock value when it last changed, must not be included in CIDs)

A name must be a valid UTF-8 string. It must not contain a forward slash (/) or a backslash (\), angle brackets (< >), carriage returns (\r), or new lines (\n). The name must not be . (single period) or .. (double period).

Permissions:

# 'Reserved:20,Exec:4,Immu:4,OwnW:4?bits=32,oddchar=-'  
+---------------------------------------------------------------+  
|                Reserved               |  Exec |  Immu |  OwnW |  
+---------------------------------------------------------------+
  • Reserved: 5 bits (todo)
  • Exec: 1 bit, executable (todo) only relevant to File and Link Nodes. Encoders MUST NOT set this on other nodes. Decoders SHOULD ignore this on other nodes.
  • Immu: 1 bit, immutable (todo)
  • OwnW: 1 bit, owner write only (todo)
File Node
  • File encryption key (if encrypted)

  • Content Enum

    • Stub: size, used to preregister a file path and get a unique handle on the location before data has actually been uploaded
    • Full: List of:
      • Data block CID
      • Offset
      • Length
  • Node ID: 16 bytes

  • Node Options: 1 byte.

    • 1 bit, Compressed: Whether the data in the content list is compressed or not
    • 3 bit, Reserved: Reserved bits
    • 4 bits, Node Type: See the Node Type Registry and Primitives sections for the mapping of appropriate values. Decoders that do not recognize a Node Type MUST be ignored and changes are not allowed to those object types to prevent corruption of future node types.
  • Content List: All the items present in the File node's content list. This is already ordered and the order must be maintained. Each entry is composed of:

    • 32 byte CID
    • 4 byte offset field, represented as a little endian 32 bit unsigned integer
    • 4 byte length field, represented as a little endian 32 bit unsigned integer
  • Attribute Pairs: All attributes are encoded using the rules outlined in the Attribute Registry

Associated Data Node

TODO: Low priority

Directory Node
  • Mandatory metadata:
    • permissions (simplified)
    • contents_size (cached calculated value of the size or contents_size of all children)
  • Children:
    • HashMap of name -> (NodeType, NodeId, CID)

The CID must taken into account when calculating a directory's CID.

Internal Link Node
  • Target node ID
  • Target node type
Native Mount Node
  • Size is going to be cachedof the remote view, can be set to zero
  • BanyanFS:
    • Filesystem ID of remote filesystem
    • Root CID reference?
    • Vector clock reference?
    • Escrowed access key (Optional)

Journal (Opt)

A full or partial series of operations that represent all changes made to the filesystem. These entries by their construction form a grow only CRDT set in the form of a Merkle tree. Merging any two copies will always be valid even when missing intermediate entries.

Any entry present must exist after the latest history range values in the segment.

  • Previous Journal Entry CID
  • Author Key ID
  • Author's vector clock
  • Unix Timestamp
  • Operations
    • List of CRDT Operations
  • Author Signature

Journal Index (Opt)

The journal index is an optional separate section that allows for quickly locating and retrieving known journal entries from the linear journal log by ID, testing inclusion of operations, in the tree and a mapping from the CID to the vector clock value of where that CID fits into the tree.

  • Journal entry start index
  • CID B-tree -> (Vector Clock, Index in Journal) or just Vector Clock if journal isn't present

Maintenance (Opt)

The maintenance log is an optional section that allows clients to communicate details of storage maintenance operations that need to be taken on the remote end. When decrypted by an authorized party, this contains a list of the CIDs for all data blocks that the journal entries in this content payload creates, as well as all blocks that were removed by journal entries in this content payload along with the vector clocks of the filesystem when those changes were made.

This should include all data blocks added in the duration even if the data blocks themselves are not present in the data section.

  • List of:
    • CID of data block
    • Vector clock value when change went into effect
    • Type of change (added / removed)

Data (Opt)

The data section contains blocks of the content that make up the files in the filesystem. The encryption of these blocks is controlled by the appropriate entry in the filesystem that points at their data. An individual data block does not have any identifiers linking it back to the filesystem but acts as an independent data unit.

Individual blocks intentionally do not reveal any details about the originating filesystem. They are intended to be stored independently so include their own version information about the originating filesystem.

A single file may be broken up into many blocks. When this occurs data from the filesystem internals (blocks, their ordering, and the encryption key if encrypted) is needed to reconstruct the original file even when the data itself is unencrypted.

A data block may contain many individual smaller files for efficiency. There is no delimiters or other indicators inside the data block when this occurs. The offsets and encryption details are contained within the filesystem internals.

Data blocks may be 256KiB, 8MiB, 64MiB, and 512MiB. Encoders should try to pack chunks of individual files in as few blocks as possible, utilizing only one block of a smaller type for the the last block (whatever remains). The last block should be padded up to the full size of the appropriate block.

Small files MUST only be packed into 258KiB and 8MiB chunks and SHOULD only be packed into the blocks with other smaller files. An encoder SHOULD NOT try to fit small files into the remaining space at the tail end of a large file that has been broken up.

# 'Version:8,Block Opt:8,CID:8,Len (Trunc):8,Data (Variable):32,ECC (Opt Variable):32'
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+  
|    Version    |   Block Size  |      CID      |  Len (Trunc)  |  
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+  
|                        Data (Variable)                        |  
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+  
|                       ECC (Opt Variable)                      |  
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
  • Version: 1 byte. Encoders following this specification MUST be set to 0x01
  • Block Opt:
    • 1 bit: Truncated. Encoders MUST set this only for unencrypted data from public filesystems. If this is present the data may omit
    • 1 bit: ECC present. Indicates whether ECC data for the block is included at the end of the block.
    • 5 bits: Reserved. Encoders MUST NOT set these bits. Decoders MUST ignore these bits.
    • 2 bits: Block size. 0b00 = 256KiB, 0b01 = 8MiB, 0b10 = 64MiB, 0b11 = 512MiB.
  • CID: 32 bytes. See external formats for more details.
  • Len: 4 bytes. Present when the data is unencrypted AND the Truncated bit is set. Contains the exact number of bytes in the Data section of this block.
  • Data: Will be the length of one of the four sizes for the blocks as determined by the block size bits. The raw contents of this block may be encrypted as determined by the filesystem.
  • ECC: Length is determined the size of the block. See ECC generation and validation section.

If stored on its own, the file should have the magic bytes for the ASCII string BYFD prepended to the data block.

Signature

This signature is should be calculated over the Identity Header, Filesystem ID, Public Settings, and the entirety of this Content Payload block.

# 'KeyID:16,Signature:32?bits=32,oddchar=-'  
 0                   1                   2                   3  
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+---------------------------------------------------------------+
|             KeyID             |           Signature           |
+---------------------------------------------------------------+
|                               |
+-------------------------------+
  • Key ID: 2 byte Key ID indicating the identity of the key of the signer
  • Signature: 96 byte signature. Calculated as documented in the Cryptography section.

ECC (Opt)

# 'Parity Locator:80,Parity Chunks (variable):16,Chunk Checksums (variable):32?bits=32,oddch  
ar=-'  
0                   1                   2                   3     
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1  
+---------------------------------------------------------------+  
|                                                               |  
+                                                               +  
|                         Parity Locator                        |  
+                               +-------------------------------+  
|                               |    Parity Chunks (variable)   |  
+---------------------------------------------------------------+  
|                   Chunk Checksums (variable)                  |  
+---------------------------------------------------------------+
  • Parity Locator: A fixed bit pattern that can be used in the event corruption to identify the beginning of the error recovery section.
  • Parity Chunks:
    • Hashes over fixed length chunks of bytes. Longer segments will have more parity chunks. Length and count TBD, need to run some numbers.
    • Each chunk will become a piece of a reed-soloman erasure chunk creating additional parity chunks of the same size.
    • Should tune to allow for up to 10% data corruption.
  • Chunk Checksums: For each chunk (original and parity), there will be a blake3 hash over the content of the chunk to act as an integrity rejection check when recovering corrupted segments using the parity chunks.

Filesystem Operations and Behavior

When working with CRDTs, the most important consideration is the intended semantics of how conflicts are automatically resolved (or even if they're left up to the user). To that end, each composite definition is going to include a note about the intended semantics, while the primitive section is going to describe the technical behavior or the building blocks making up the composites.

Primitives

TODO

Associated Data Type Registry

Associated Data Types are a 2 byte sequence that can be mapped to a little-endian encoded unsigned integer to map against the following table. These values are used in File object type to include derived data from the file such as image previews

Associated Data Type Key String Identifier Description
Reserved 0
Image Preview image_preview 1 TODO
Video Preview video_preview 2 TODO
Reserved 3-255
Text Content Vector Embedding text_vector_embedding 256 TODO
Image Content Vector Embedding image_vector_embedding 257 TODO
Reserved 258-8191
Experimental / User Defined 8192-32767 Custom types defined by applications that are not in the registry.
Reserved 32768-65535 Top bit is reserved for future expansion

Applications that make use of custom types should fail safely (pretending as if the value isn't present) if the value fails to meet the custom application decoding requirements as the stored type may be from another application.

Attribute Registry

Attribute Name Byte Identifier Value Length Custom Attribute Key Description
Reserved 0x00
Owner 0x01 32 bytes The encoded owner's Actor ID (todo: validate this size)
Permissions 0x02 1 byte The encoded permissions of the file node as described in the format details.
Created At 0x03 8 bytes 64 little endian signed integer representing the number of milliseconds since the unix epoch when this node was initially created. Is not mutable.

This value is always in UTC.
Modified At 0x04 8 bytes 64 little endian signed integer representing the number of milliseconds since the unix epoch when this node was last modified. It is only modified when a change other than this attribute is made to the filesystem node. Matches the created_at when the node is initially created.

This value is always in UTC.
Mime Type 0x05 Variable mime_type A standardized mime type matching HTTP Content Type formats. May be up to 255 character in length.
Custom Attribute 0xff Variable For any attribute not listed in this registry, it's considered a custom attribute and should should be encoded as described following this registry table.
  • Registered attribute may only appear once each
  • Custom attributes can not have multiple entries with the same key
  • Custom attributes MUST NOT allow a reserved name

Custom attribute encoding:

# 'Attr Type:8,Key Length:8,Key:8,Value Length:8,Value:8?bits=40,oddchar=-'  
+-------------------------------------------------------------------------------+
|   Attr Type   |   Key Length  |      Key      |  Value Length |     Value     |
+-------------------------------------------------------------------------------+
  • Attr Type: The attribute registry identifier, any unnamed attribute encodes as a Custom Attribute.
  • Key Length (Opt): 1 byte. Only present for custom attributes. The number of bytes that make up the key as an unsigned 8 bit integer.
  • Key Bytes (Opt): Only present for custom attributes. Variable length, should be the length described in the Key Length attribute.
  • Value Length (Opt): 1 byte. Onlyt present for attributes listed as have a variable type in the table above. The number of bytes that make up the value as an unsigned 8 bit integer.
  • Value Bytes: Variable length, should be the length described in the Value Length attribute.

Node Type Registry

Each unique identifier within a composite type needs only to be unique within its type. We also allow in some circumstances a mixed collection of different composite types to be present in the same list. To facilitate parsing each composite type is given a four bit core identifier that can be used to identify the type of object, before parsing the mapping of types are documented in this section.

Composite Type Byte Identifier
Reserved 0x0
File 0x1
Associated Data 0x2
Directory 0x3
Internal Link 0x4
Native Mount 0x5

Composites

We have mandatory attributes that we know specific details about and can customize our CRDT decisions to have the most accurate semantics for the specific type. We also include arbitrary user metadata as extended attributes which we can not assume the semantics of so must make a more general decision with those.

Our filesystem nodes must have a unique reference within the filesystem to be individually addressable, this does not include details such as the path of the file or its name as the identifier shouldn't change when the entry itself is moved or renamed. This identifier needs to be stable within the filesystem even when its metadata changes.

To form a meaningful Merkle DAG of the filesystem, each node in our tree needs to have a unique ID that is generated from all of the data that make up its semantic content. Any attribute that is not-derivable from other data, controls how the data should be interpreted, or result from an operation that a user has taken should be captured within this identifier.

These two requirements are odds with each other so both are used in the representation itself. Journal entries will apply to node IDs and result in a new CID for that node ID. All CRDT operations will need to target the node ID, and thus that needs to be the stable reference for directories, links, and other node types.

CID Calculation

  • Layout uses the unencrypted / public encoding version of the file node and perform a blake3 hash over the results.

Files

The core data contents of a file is made up of two important components, an optional encryption key which needs to be managed very carefully. There is no operations that makes sense within the filesystem to change the encryption key without also changing the list of content blocks as they will need to be re-encrypted. When re-encrypting existing files, systems operating on the journal entries need to be able to simultaneously change the list of content blocks.

When reconstructing files from content blocks we must know the order of the files, but also the offset and size of the actual data within the data blocks that contain the content. CIDs can be considered universally unique, and the location of the data is irrelevant to the semantics of the data.

The final data stored within a File is associated data of the File node itself. This is data such as smaller preview versions of a images and video, different forms of search indexes, content descriptions, etc. A File may have many derived types even of the same type such as file previews of different sizes. This data should all be fully derived from the File content alone even if its a costly operation. The nature of this derived data means we can use a simpler set of rules when operating on this data.

Associated Data

These are effectively lightweight versions of files that live outside of the standard filesystem hierarchy, instead being referenced as part of File metadata.

Decoders MAY entirely ignore associated data entries but MUST preserve and reproduce byte exact copies when re-encoding the data for other format consumers. Encoder MAY omit entries of associated data only when they recognize the associated data type and intend to explicitly filter out that type of data.

TODO: Low priority

Directory

TODO: high priority

Internal Link

TODO: low priority

Native Mount

TODO: low priority

Drive

The ultimate wrapper around a filesystem. A Drive is the fundamental base for the Realized View in the filesystem. The core filesystem node at the root of a drive is always a directory, and it includes all of the basic operations covered by a folder as well as access control, and key management responsibilities.

The Drive also becomes the root of the Merkle tree, always pointing at the previous Realized View that had been fully integrated by all clients.

TODO

CRDT Scratch Pad

  • NewFile
  • NewFolder
  • NewLink
  • NewMount
  • DeleteNode
  • MoveNode
  • RekeyNode (Either File or Folder which will recursively cover all child entries)
  • SetAttribute
  • AddAccessKey
  • RemoveAccessKey
  • ModifyAccessKeySettings
  • ForkOwnership

Error Recovery Generation and Recovery

  • Reed-Solomon Erasure
  • Hamming Code integrity check and simple recovery
  • Secondary digest integrity check

TODO

Cookbook

Decoding and Verification

  • Read first 9 bytes
    • Validate magic header (upon failure attempt error recovery)
    • Validate version header (upon failure attempt error recovery)
    • Extract header length
  • Read header length bytes into buffer
  • Validate error correcting checksum, correcting any bytes in the
  • Parse Public Settings field to determine if the filesystem is public or private
  • Validate the extracted header length is less than or equal to the maximum valid size for the filesystem type (TODO: Define this)
  • When public
    • Extract the public keys and their key settings, ignore any key present that has this historical key set.
    • Identify the owner key based on the key settings
    • Use the public key to validate the signature on

TODO

Forking a Filesystem

  • Generate an new Filesystem ID
  • Add ForkedOwnership event in the journal
    • This marks all existing keys as historical and grants one new key ownership
  • For a private filesystem this flags all files for key rotation upon their next write
  • May optionally add additional journal entries removing historical keys
  • May optionally add additional journal entries granting access to new keys

Making a Filesystem Public

Anyone with access the a realized copy of the filesystem is able to make a view public by simply re-encrypting the view they have as an unencrypted filesystem. Some access control measures, and limited view help mitigate this. Intentionally doing this by an owner however is quite easy.

TODO

Merging Two Related Filesystems

Filesystems are related if they share a common origin. The Filesystem ID is considered to be a shortcut for testing whether two non-overlapping histories are part of the same filesystem. While this is not necessarily true, merging journal entries from two distinct filesystems will still result in a valid filesystem but the final state may not be wha

TODO

Publishing an Encrypted File

Sufficient for publishing to the world if the data blocks are publicly available:

  • Extract the file version's encryption key
  • Extract the addresses of the file data

If performing some kind of long term archive:

  • Retrieve the data blocks associated with the address
  • Validate the each block matches its CID

TODO

Test Vectors

TODO

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment