Design

The new eDFS architecture is designed around the observation that a single-master system has a single point of failure: if the master node fails, running jobs are terminated and access to the entire system is lost. This does not happen in a cluster where the master node is dynamically elected.

Therefore, each node in the cluster will be a self-contained, full copy of eDFS. On startup, the system reads the cluster configuration and sets up the node cluster. All nodes start as followers, and one node is elected leader at startup. During normal operation, the leader sends a heartbeat that allows the followers to detect when the leader fails or becomes partitioned.
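As a rough illustration, the cluster configuration might look like the sys.config sketch below. The application and key names here are assumptions for illustration only, not settings from an existing eDFS release.

    %% Hypothetical cluster configuration (sys.config style); the application
    %% and key names are illustrative only.
    [
     {edfs, [
        {cluster_nodes, ['edfs1@host1', 'edfs2@host2', 'edfs3@host3']},
        {heartbeat_interval_ms, 150},  %% how often the leader sends heartbeats
        {election_timeout_ms,   300}   %% silence after which a follower starts an election
     ]}
    ].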

You can specify the number of nodes that must be involved in a write or a read. Below is a list of the default settings used by some popular systems (a minimal quorum check is sketched after the definitions):

  1. Basho's Riak (N = 3, R = 2, W = 2)
  2. LinkedIn's Voldemort (N = 2 or 3, R = 1, W = 1)
  3. Apache's Cassandra (N = 3, R = 1, W = 1)

where

  • N = number of replicas of each item (the replication factor)
  • R = number of nodes required for a read operation
  • W = number of nodes required for a write operation
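
As a minimal sketch (module and function names are hypothetical), the quorum checks reduce to simple comparisons; choosing R + W > N guarantees that every read quorum overlaps every write quorum:

    %% Minimal quorum-check sketch; names are hypothetical.
    -module(edfs_quorum).
    -export([write_ok/2, read_ok/2, overlapping/3]).

    %% A write succeeds once W replicas have acknowledged it.
    write_ok(Acks, W) -> Acks >= W.

    %% A read succeeds once R replicas have responded.
    read_ok(Replies, R) -> Replies >= R.

    %% R + W > N means every read quorum intersects every write quorum,
    %% so a read always sees the latest acknowledged write.
    overlapping(N, R, W) -> R + W > N.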

Ideally, a file system should always be consistent, but that requirement also increases system complexity. Luckily, we can start off with either Rafter, an Erlang implementation of the Raft consensus protocol, or Riak Core as the building blocks of the system. This should save us some time and help us focus on other features such as the plugin/module system, event notifications, the web interface, and querying the system with JavaScript as MongoDB and Riak do.

Proposed data structure for file storage

I propose we model the file storage data structure after the Linux file system concept of inodes. At a very broad level, the file data structure has the following fields:

  • file name
  • metadata structure: permission/mode, user id, group id, size, last accessed time, created time, modified time, link_count (how many files reference the file), read_count (number of open references, i.e. how many clients are currently accessing this file), filetype (regular | directory)
  • chunkmap (hash that maps ids to chunks).
  • permission: we could implement access control similar to Linux file system permissions, for example (0755/drwxr-xr-x). This sets the permissions for the file owner, the group that owns the file (if any), and how others can access the file.
  • use Riak-style links to express relationships between files and to walk the file tree, for example:
    {
        links: [
                 {
                     tag: 'hardlink',
                     key: '41399579391950848',
                     bucket: '27db4424b76e27fa37944ee1d742b4a6'  # perhaps, a hash of the directory name?
                 },
                 {
                     tag: 'symlink',
                     key: '5e183910b6d9684370',
                     bucket: '96e3f88e2002e9845e84370183910b6d'
                 },
                 {
                     tag: 'child',
                     key: '41399579391950848',
                     bucket: '16af6f8b3f566dfb1f8d652d01f4ff3a'
                 }
               ]
    }

Please note that directories are files too: they contain links to their child nodes. This also means that the entire file system forms a tree, which enables directory traversal. A record-level sketch of this structure follows.
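
A minimal Erlang record sketch of the inode-like structure, assuming the field names listed above (none of this is final):

    %% Sketch of the proposed inode-like records; field names follow the
    %% list above and are not final.
    -record(metadata, {mode,             %% e.g. 8#0755
                       uid,
                       gid,
                       size = 0,
                       atime,
                       ctime,
                       mtime,
                       link_count = 1,   %% how many files reference this file
                       read_count = 0,   %% currently open references
                       filetype          %% regular | directory
                      }).

    -record(file, {name,
                   metadata,             %% #metadata{}
                   chunkmap = #{},       %% maps chunk id => chunk
                   links    = []         %% [{Tag, Bucket, Key}], as in the example above
                  }).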

Storage engines

By default, eDFS will ship with a Mnesia backend, but the system can easily be extended to support other backends such as LevelDB, BerkeleyDB, RocksDB, LMDB, and InnoDB.
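
One way to keep the backends pluggable is an Erlang behaviour along these lines; the callback names below are assumptions, not an existing eDFS API:

    %% Hypothetical storage-backend behaviour. Any module implementing these
    %% callbacks could be configured as the storage engine.
    -module(edfs_backend).

    -callback start(Opts :: list()) -> {ok, State :: term()} | {error, term()}.
    -callback put(Key :: binary(), Value :: binary(), State :: term()) ->
        ok | {error, term()}.
    -callback get(Key :: binary(), State :: term()) ->
        {ok, binary()} | not_found | {error, term()}.
    -callback delete(Key :: binary(), State :: term()) -> ok | {error, term()}.
    -callback stop(State :: term()) -> ok.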

Caching

Caching can be done either on the client side or on the servers when clients request data. In the latter case, file reads can be cached on the nodes so that subsequent reads do not need to be redirected.

A cache that tracks both recently accessed and frequently accessed blocks is known as an Adaptive Replacement Cache (ARC).
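
A full ARC is more involved, but even a simple read-through cache captures the idea of serving repeated reads locally. The sketch below uses ETS; the module and function names are illustrative:

    %% Simple read-through block cache (far simpler than a real ARC).
    -module(edfs_block_cache).
    -export([init/0, read/2]).

    init() ->
        ets:new(?MODULE, [named_table, public, set]).

    %% FetchFun fetches the block from the node that owns it.
    read(BlockId, FetchFun) ->
        case ets:lookup(?MODULE, BlockId) of
            [{_, Block}] ->
                Block;                                  %% cache hit: no redirect
            [] ->
                Block = FetchFun(BlockId),
                ets:insert(?MODULE, {BlockId, Block}),  %% cache for later reads
                Block
        end.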

Data integrity

Data can be corrupted as it travels over the network, whether through packet loss or through interference by malicious programs.

One way of checking the integrity of the data is to compute a checksum for each block and send both the block and its checksum to clients. Another way is to keep a hash table of block ids and checksums, and map the file id to a combined hash of all the block checksums.

This helps detect corruption introduced by the clients, the nodes, or the network.
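
A sketch of both approaches, using sha256 from the crypto application (the wrapper module and function names are assumptions):

    %% Block-level and file-level checksums.
    -module(edfs_checksum).
    -export([block_checksum/1, file_checksum/1]).

    %% Sent alongside the block so the client can verify it on arrival.
    block_checksum(Block) when is_binary(Block) ->
        crypto:hash(sha256, Block).

    %% Combined hash of all block checksums, mapped to the file id; lets a
    %% client verify a whole file from a single digest.
    file_checksum(Blocks) when is_list(Blocks) ->
        crypto:hash(sha256, [block_checksum(B) || B <- Blocks]).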

Garbage collection

When a file is deleted, either:

  • Mark the file for deletion by setting a delete bit on the file record,
  • OR, move the file to a trash bin

A garbage collector then periodically and permanently removes the deleted files.
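
With the delete-bit approach, deletion and collection might look roughly like this (files are shown as maps with a deleted flag purely for illustration):

    %% Delete-bit sketch: deleting only flags the file; the garbage collector
    %% sweeps flagged entries periodically.
    -module(edfs_gc).
    -export([mark_deleted/1, sweep/1]).

    mark_deleted(File) when is_map(File) ->
        File#{deleted => true}.

    %% Called periodically by the garbage collector; keeps only live files.
    sweep(Files) ->
        [F || F <- Files, not maps:get(deleted, F, false)].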

Plugins

Some proposed plugins:

  • Backup plugin for exporting data from and importing data into eDFS
  • Event notification plugins for hooks such as on_write, on_update and on_delete (a hook behaviour is sketched after this list)
  • Authentication plugin for implementing custom authentication modes
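
The event-notification hooks could be expressed as a behaviour that plugins implement; the callback names below mirror the hooks mentioned above and are otherwise an assumption:

    %% Hypothetical plugin behaviour for event hooks.
    -module(edfs_plugin).

    -callback on_write(Path :: binary(), Data :: binary()) -> ok.
    -callback on_update(Path :: binary(), Data :: binary()) -> ok.
    -callback on_delete(Path :: binary()) -> ok.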

Summary of Amazon's Dynamo

  • consistent hashing to determine key placement (a toy sketch follows this list)
  • partial quorums for reading and writing
  • conflict detection and read repair via vector clocks
  • gossip for replica synchronization
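
For reference, consistent hashing in its simplest form (no virtual nodes) picks the first node clockwise from a key's position on the ring:

    %% Toy consistent-hashing sketch (no virtual nodes, purely illustrative).
    -module(edfs_ring).
    -export([owner/2]).

    %% Position of a term on the ring, derived from its sha256 hash.
    position(Term) ->
        binary:decode_unsigned(crypto:hash(sha256, term_to_binary(Term))).

    %% The key is owned by the first node at or after its position,
    %% wrapping around to the start of the ring if necessary.
    owner(Key, Nodes) ->
        KeyPos = position(Key),
        Ring = lists:sort([{position(N), N} || N <- Nodes]),
        case [N || {Pos, N} <- Ring, Pos >= KeyPos] of
            [First | _] -> First;
            []          -> element(2, hd(Ring))
        end.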