Design

The new eDFS architecture is designed around the observation that a single-master system has a single point of failure: if the master node fails, running jobs are terminated and access to the entire system is lost. This does not happen in a cluster where the master node is dynamically elected.

Therefore, each node in the cluster will be a self-contained, full copy of eDFS. On startup, the system reads the cluster configuration and sets up the node cluster. All nodes start as followers, and one node is elected leader at startup. During normal operation, the leader sends a heartbeat that allows the followers to detect when the leader fails or becomes partitioned.
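As a rough illustration, the cluster configuration might look like the sys.config sketch below. The application and key names here are assumptions for illustration only, not settings from an existing eDFS release.

    %% Hypothetical cluster configuration (sys.config style); the application
    %% and key names are illustrative only.
    [
     {edfs, [
        {cluster_nodes, ['edfs1@host1', 'edfs2@host2', 'edfs3@host3']},
        {heartbeat_interval_ms, 150},  %% how often the leader sends heartbeats
        {election_timeout_ms,   300}   %% silence after which a follower starts an election
     ]}
    ].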

You can specify the number of nodes that must be involved in a write or a read. Below is a list of the default settings used by some popular systems (a minimal quorum check is sketched after the definitions):

  1. Basho's Riak (N = 3, R = 2, W = 2)
  2. LinkedIn's Voldemort (N = 2 or 3, R = 1, W = 1)
  3. Apache's Cassandra (N = 3, R = 1, W = 1)

where

  • N = number of replicas of each item (the replication factor)
  • R = number of nodes required for a read operation
  • W = number of nodes required for a write operation
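
As a minimal sketch (module and function names are hypothetical), the quorum checks reduce to simple comparisons; choosing R + W > N guarantees that every read quorum overlaps every write quorum:

    %% Minimal quorum-check sketch; names are hypothetical.
    -module(edfs_quorum).
    -export([write_ok/2, read_ok/2, overlapping/3]).

    %% A write succeeds once W replicas have acknowledged it.
    write_ok(Acks, W) -> Acks >= W.

    %% A read succeeds once R replicas have responded.
    read_ok(Replies, R) -> Replies >= R.

    %% R + W > N means every read quorum intersects every write quorum,
    %% so a read always sees the latest acknowledged write.
    overlapping(N, R, W) -> R + W > N.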

Ideally, a file system should always be consistent, but that requirement also increases system complexity. Luckily, we can start off with either Rafter, an Erlang implementation of the Raft consensus protocol, or Riak Core as the building blocks of the system. This should save us some time and help us focus on other features such as the plugin/module system, event notifications, the web interface, and querying the system with JavaScript as MongoDB and Riak do.

Proposed data structure for file storage

I propose we model the file storage data structure after the Linux file system concept of inodes. At a very broad level, the file data structure has the following fields:

  • file name
  • metadata structure: permission/mode, user id, group id, size, last accessed time, created time, modified time, link_count (how many files reference the file), read_count (number of open references, i.e. how many clients are currently accessing this file), filetype (regular | directory)
  • chunkmap (hash that maps ids to chunks).
  • permission: we could implement access control similar to Linux file system permissions, for example (0755/drwxr-xr-x). This sets the permissions for the file owner, the group that owns the file (if any), and how others can access the file.
  • use Riak-style links to express relationships between files and to walk the file tree, for example:
    {
        links: [
                 {
                     tag: 'hardlink',
                     key: '41399579391950848',
                     bucket: '27db4424b76e27fa37944ee1d742b4a6'  # perhaps, a hash of the directory name?
                 },
                 {
                     tag: 'symlink',
                     key: '5e183910b6d9684370',
                     bucket: '96e3f88e2002e9845e84370183910b6d'
                 },
                 {
                     tag: 'child',
                     key: '41399579391950848',
                     bucket: '16af6f8b3f566dfb1f8d652d01f4ff3a'
                 }
               ]
    }

Please note that directories are files too: they contain links to their child nodes. This also means that the entire file system forms a tree, which enables directory traversal. A record-level sketch of this structure follows.
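
A minimal Erlang record sketch of the inode-like structure, assuming the field names listed above (none of this is final):

    %% Sketch of the proposed inode-like records; field names follow the
    %% list above and are not final.
    -record(metadata, {mode,             %% e.g. 8#0755
                       uid,
                       gid,
                       size = 0,
                       atime,
                       ctime,
                       mtime,
                       link_count = 1,   %% how many files reference this file
                       read_count = 0,   %% currently open references
                       filetype          %% regular | directory
                      }).

    -record(file, {name,
                   metadata,             %% #metadata{}
                   chunkmap = #{},       %% maps chunk id => chunk
                   links    = []         %% [{Tag, Bucket, Key}], as in the example above
                  }).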

Storage engines

By default, eDFS will ship with a Mnesia backend, but the system can easily be extended to support other backends such as LevelDB, BerkeleyDB, RocksDB, LMDB, and InnoDB.
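
One way to keep the backends pluggable is an Erlang behaviour along these lines; the callback names below are assumptions, not an existing eDFS API:

    %% Hypothetical storage-backend behaviour. Any module implementing these
    %% callbacks could be configured as the storage engine.
    -module(edfs_backend).

    -callback start(Opts :: list()) -> {ok, State :: term()} | {error, term()}.
    -callback put(Key :: binary(), Value :: binary(), State :: term()) ->
        ok | {error, term()}.
    -callback get(Key :: binary(), State :: term()) ->
        {ok, binary()} | not_found | {error, term()}.
    -callback delete(Key :: binary(), State :: term()) -> ok | {error, term()}.
    -callback stop(State :: term()) -> ok.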

Caching

Caching can be done either on the client side or on the servers when clients request data. In the latter case, file reads can be cached on the nodes so that subsequent reads do not need to be redirected.

A cache that tracks both recently accessed and frequently accessed blocks is known as an Adaptive Replacement Cache (ARC).
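
A full ARC is more involved, but even a simple read-through cache captures the idea of serving repeated reads locally. The sketch below uses ETS; the module and function names are illustrative:

    %% Simple read-through block cache (far simpler than a real ARC).
    -module(edfs_block_cache).
    -export([init/0, read/2]).

    init() ->
        ets:new(?MODULE, [named_table, public, set]).

    %% FetchFun fetches the block from the node that owns it.
    read(BlockId, FetchFun) ->
        case ets:lookup(?MODULE, BlockId) of
            [{_, Block}] ->
                Block;                                  %% cache hit: no redirect
            [] ->
                Block = FetchFun(BlockId),
                ets:insert(?MODULE, {BlockId, Block}),  %% cache for later reads
                Block
        end.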

Data integrity

Data can be corrupted as it travels over the network, whether through packet loss or through interference by malicious programs.

One way of checking the integrity of the data is to compute a checksum for each block and send both the block and its checksum to clients. Another way is to keep a hash table of block ids and checksums, and map the file id to a combined hash of all the block checksums.

This helps detect corruption introduced by the clients, the nodes, or the network.
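
A sketch of both approaches, using sha256 from the crypto application (the wrapper module and function names are assumptions):

    %% Block-level and file-level checksums.
    -module(edfs_checksum).
    -export([block_checksum/1, file_checksum/1]).

    %% Sent alongside the block so the client can verify it on arrival.
    block_checksum(Block) when is_binary(Block) ->
        crypto:hash(sha256, Block).

    %% Combined hash of all block checksums, mapped to the file id; lets a
    %% client verify a whole file from a single digest.
    file_checksum(Blocks) when is_list(Blocks) ->
        crypto:hash(sha256, [block_checksum(B) || B <- Blocks]).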

Garbage collection

When a file is deleted, either:

  • Mark the file for deletion by setting a delete bit on the file record,
  • OR, move the file to a trash bin

A garbage collector then periodically and permanently removes the deleted files.
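
With the delete-bit approach, deletion and collection might look roughly like this (files are shown as maps with a deleted flag purely for illustration):

    %% Delete-bit sketch: deleting only flags the file; the garbage collector
    %% sweeps flagged entries periodically.
    -module(edfs_gc).
    -export([mark_deleted/1, sweep/1]).

    mark_deleted(File) when is_map(File) ->
        File#{deleted => true}.

    %% Called periodically by the garbage collector; keeps only live files.
    sweep(Files) ->
        [F || F <- Files, not maps:get(deleted, F, false)].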

Plugins

Some proposed plugins:

  • Backup plugin for exporting data from and importing data into eDFS
  • Event notification plugins for hooks such as on_write, on_update and on_delete (a hook behaviour is sketched after this list)
  • Authentication plugin for implementing custom authentication modes
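
The event-notification hooks could be expressed as a behaviour that plugins implement; the callback names below mirror the hooks mentioned above and are otherwise an assumption:

    %% Hypothetical plugin behaviour for event hooks.
    -module(edfs_plugin).

    -callback on_write(Path :: binary(), Data :: binary()) -> ok.
    -callback on_update(Path :: binary(), Data :: binary()) -> ok.
    -callback on_delete(Path :: binary()) -> ok.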

Summary of Amazon's Dynamo

  • consistent hashing to determine key placement (a toy sketch follows this list)
  • partial quorums for reading and writing
  • conflict detection and read repair via vector clocks
  • gossip for replica synchronization
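
For reference, consistent hashing in its simplest form (no virtual nodes) picks the first node clockwise from a key's position on the ring:

    %% Toy consistent-hashing sketch (no virtual nodes, purely illustrative).
    -module(edfs_ring).
    -export([owner/2]).

    %% Position of a term on the ring, derived from its sha256 hash.
    position(Term) ->
        binary:decode_unsigned(crypto:hash(sha256, term_to_binary(Term))).

    %% The key is owned by the first node at or after its position,
    %% wrapping around to the start of the ring if necessary.
    owner(Key, Nodes) ->
        KeyPos = position(Key),
        Ring = lists:sort([{position(N), N} || N <- Nodes]),
        case [N || {Pos, N} <- Ring, Pos >= KeyPos] of
            [First | _] -> First;
            []          -> element(2, hd(Ring))
        end.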