GlusterFS 4.0 Plan

This document is a draft of the GlusterFS 4.0 plan. It is primarily a specification document focusing on external behavior and look-and-feel, with internal details kept sparse (provided only where necessary). The document covers only those things which need architectural change and have cross-functional impact. Some areas are intentionally vague to encourage further discussion.

Goals

GlusterFS 4.0 is designed to be a significant improvement over the 3.x series, particularly in scalability (10k nodes) and operations (script- and integration-friendly). Architecture and internals are revisited (in both the management and data layers), and commands and interfaces are redesigned for scalability with a focus on the operator's ease of use. In short, we aim to be the devops' "favorite" filesystem.

The rest of the document is organized by component; for each, we will look at the proposed changes for both scalability and operational ease.

Management layer

Resource pool management

GlusterFS, at its heart, is a Lego kit of translators. User-consumable storage space (volumes) is constructed by arranging translators in a specific topology. This construction is represented in a textual format known as a volfile.

With GlusterFS 3.0, we introduced a CLI and a management daemon to ease administration and the creation/management of volfiles. The 3.x series has taken a volume-centric view of the internals and presents the same view outwards (as seen in the CLI commands). The volume is pretty much the only first-class object that is created, deleted, reconfigured and managed. The concept of a brick exists only as an inevitable dependency of the volume.

This is felt at its worst when creating a volume spanning 100s of servers. The bricks, being nothing more than parameters of a volume, have to be specified as a list, in a very specific and careful order, at volume creation time. Inevitably, this results in the need for an extra layer of logic and intelligence to be built as tools around the gluster CLI.

There is also redundancy in the inputs provided by the operator: the list of servers is already given at peer-probe time, yet it has to be repeated every time a volume is created, exposing each volume-creation command to errors of wrong ordering (or missing servers).

The inconvenience does not end here. If a new set of servers is added to the pool, the list of new servers must be specified once for each existing volume. This clearly becomes a bottleneck if we want to scale to 100s or 1000s of volumes. The same inconvenience is felt when a server has to be decommissioned: each of the 100s or 1000s of volumes has to be remove-brick'ed separately, tracked separately, and committed separately before the server can be peer-detached.

Clearly our management layer can do better than this, and that is one of the goals of 4.0. The rationale for the new set of commands is to take a functional-programming-style "goal based"/declarative approach instead of a "step based"/imperative approach. For example, it should not be the operator's concern which specific bricks are replicating each other's data, nor their responsibility to move a specific brick from one server to another (while tracking its effects on redundancy and global balancing in their head).

The new commands can be categorized into one of two classes:

  • Declaring resources into the pool along with characteristics. E.g:
    • "Add SERVER.1 residing on rack 4 into the server pool"
    • "Use /dev/sdb on SERVER.2 for storing data"
    • "/dev/sdc on SERVER.3 is SSD/fast storage"
  • Defining goals meaningful to the user. E.g:
    • "I need a new volume/namespace called my-music"
    • "I need all data in /my-docs/tax-returns to be replicated 3 times"
    • "I want to decommission SERVER.4"

Almost everything else must be touch-free internal automation. This means the management layer needs to be more of an autonomous storage intelligence engine and less of a server which reacts to step-by-step low-level commands (like "start remove-brick for volume-1", "commit remove-brick on volume-1", "start moving contents of server-1:/dir-a to server-3:/dir-b"). This also means we need a richer ecosystem/hierarchy of first-class resource objects (a rough sketch follows the list below):

  • "Fault zone" (which maps to racks in physical world or Availability zone in clouds/AWS) which takes down all servers in it when failing
  • "Server" (a trusted system) which takes down all bricks in it when failing
  • "Brick" (either export directory or block device) which takes down all files in it when failing

In this new scheme, a hypothetical from-scratch deployment would look something like this:

srv1# gluster init cluster_name
// set up a new cluster
// generate a new security CA.

srv1# gluster peer probe srv2 fault-zone rack1
// assign srv2 a UUID
// generate and give it a signed cert for that UUID identity
// share public key of CA for srv2 to verify other servers
// create "rack1" fault-zone object on the fly

srv1# gluster peer probe srv2 fault-zone rack2
// re-assign fault zone

srv1# gluster brick add srv2:/mnt/ephemeral
// use a directory based brick
// note the brick is added into the global pool, not associated with a particular volume

srv1# gluster brick add srv2:/dev/sdc
// format and use a block device
// possibly set up dm-thin for automated BD support or snapshots, or use btrfs

srv1# gluster brick add srv2:/dev/sdd type SSD
// possibly qualify it with storage type for storing logs/journals

Easy volume management

Once the resources are declared in a hierarchy representing the real world, all further operations revolve around creating and managing virtual objects (like volumes) and setting properties on them. At no time must the operator be made to associate virtual objects with physical resources (that is the primary job of the management layer.)

Volumes are syntax-sugared as "top level directories" in the abstract volume namespace, and that namespace smoothly merges with the file/directory namespace within the volumes below it.

srv1# gluster volume create my-music
// notice that no bricks are specified!

srv1# gluster volume create my-docs
// or gluster volume create /my-docs

srv1# gluster volume set /my-docs replica 2
// set replication level 2 for all future data going into volume my-docs

srv1# gluster volume set /my-docs/tax-returns replica 3
// override "tax-returns" directory (inside volume "my-docs") to replica 3
// implies support for different replica levels within a volume (see data path changes)

srv1# gluster volume set / replica 2
// set default replica 2 for all future volumes

"Flexible" replication

3.x has a scaling-factor rigidity when using stripe and replicate (scaling up can be done only in multiples of stripe-count x replica-count servers). This can be an annoying inconvenience when dealing with a volume at scale. There are many models already used in the community to address this rigidity; for example, the puppet-gluster project has useful patterns for automating the addition/removal of individual servers independent of the replica count. GlusterFS 4.0 will incorporate some of that community work into the internals of the management layer. More details on how striping is made more "flexible" are in the data path section.
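
One well-known community pattern here is "chained" (ring) replication, where replica sets overlap rather than being disjoint, so bricks can be added one at a time. A hedged sketch of the idea (not necessarily puppet-gluster's exact brick-ordering algorithm):

# Sketch of "chained"/ring replication: replica sets overlap, so bricks can
# be added one at a time instead of in multiples of the replica count.
# Illustrates the general community pattern, not a committed 4.0 algorithm.

def ring_replica_sets(bricks, replica=2):
    """Each brick starts one replica set with its next (replica-1) neighbors."""
    n = len(bricks)
    return [tuple(bricks[(i + j) % n] for j in range(replica)) for i in range(n)]

print(ring_replica_sets(["b1", "b2", "b3"], replica=2))
# [('b1', 'b2'), ('b2', 'b3'), ('b3', 'b1')]  -- 3 bricks work with replica 2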

Volume internals (process model)

When a volume is created, a subdirectory named after the volume is created in each of the brick directories and used as the container of that volume's data. There is one brick process per physical brick, which exports all the subdirectories (volume-specific directories) from that physical brick. This means the number of glusterfsd processes (and hence the number of TCP ports used) per server is a function of the number of physical bricks, not of the number of volumes created. Creating a volume thus only results in the creation of a sub-directory, which is dynamically exported through the already-running brick process.
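
For illustration, the on-disk layout of a single physical brick under this model might look like this (paths are hypothetical):

/mnt/brick1        <- one glusterfsd process, one TCP port, exports all subdirs
 |
 |-my-music/       <- created by "gluster volume create my-music"
 |-my-docs/        <- created by "gluster volume create my-docs"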

Easy operations (server add/remove)

When servers are added to scale storage:

srv1# gluster peer probe SERVER.5 fault-zone rack1
srv1# gluster brick add SERVER.5:/dev/sdb

srv1# gluster peer probe SERVER.6 fault-zone rack2
srv1# gluster brick add SERVER.6:/dev/sdb

// That's it. No more per-volume commands.

To decommission a server:

srv1# gluster peer detach SERVER.3
// will start draining all data on SERVER.3 (and autocommit)
// All volumes reconfigure to new topology automatically

To remove a dead server from config:

srv1# gluster peer detach SERVER.4 force
// no data drained as server is likely already dead. just updates configuration.
// All volumes reconfigure/self-heal to maintain replica count automatically

etcd for config data (management scalability)

One of the most significant changes in the management layer to address scalability (up to 10k servers) is the replacement of the glusterd management network with etcd. etcd will be the config store and the quorum arbiter network for the GlusterFS management layer.

As servers are peer-probed, GlusterFS will automatically pick representative nodes from each fault-zone to form an etcd quorum and store ALL config data in that etcd network. Absolutely no config or state will be stored directly on the filesystem (except logs). The management daemon itself will be stateless (at least with no durable state.) Management daemons will communicate with their local (in fault-zone) etcd to get configuration and other notifications.
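
A hedged sketch of what per-fault-zone representative selection could look like (the exact policy below, first server per zone capped at five members, is an illustrative assumption, not a settled design):

# Sketch: pick one etcd member per fault-zone, keeping the quorum size odd.
# The policy (first server per zone, at most 5 members) is an assumption.

def pick_etcd_members(servers_by_zone, max_members=5):
    members = [servers[0] for servers in servers_by_zone.values() if servers]
    members = members[:max_members]
    if len(members) % 2 == 0:        # majority quorums want odd member counts
        members = members[:-1] or members
    return members

zones = {"rack1": ["srv1", "srv3"], "rack2": ["srv2"], "rack3": ["srv4"]}
print(pick_etcd_members(zones))      # ['srv1', 'srv2', 'srv4']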

The management daemons will no longer handle the clustering aspects of the system. Their primary role will instead be to implement the "storage intelligence" (which servers form replica pairs, how to add a server, when to rebalance, etc.), monitoring (serving/aggregating "volume status" and "volume profile" commands) and managing per-server brick processes.

ReST management APIs

The management daemon in GlusterFS 4.0 will be a native ReST server. Installation of the management daemon will involve zero (or close to zero) config-file changes (except things like the HTTP listen port) and EVERYTHING after that will be driven through HTTP/ReST. The gluster CLI tool will be rewritten to be just another ReST client. All communications (between CLI and management, and among management daemons) are authenticated using the certificates assigned at provisioning time (gluster peer probe).

There will be multiple roles and identities for management. At the very least (maybe by default) there will be a read-only role (to get status and info about the cluster, for monitoring purposes) and an admin role (to add resources, create/configure volumes, etc.)
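
For example, a monitoring script holding the read-only role might query the cluster like this (the endpoint, port, certificate paths and response fields are all hypothetical; only the mutual-TLS pattern with provisioning-time certificates comes from the plan above):

# Hypothetical ReST query against the 4.0 management daemon. The URL, port
# and endpoint are illustrative assumptions; the client cert/CA mirror the
# certificates issued at peer-probe / "gluster init" time.
import requests

resp = requests.get(
    "https://srv1:8080/v1/volumes",          # hypothetical endpoint
    cert=("/etc/glusterfs/client.crt",       # cert issued at provisioning time
          "/etc/glusterfs/client.key"),
    verify="/etc/glusterfs/ca.crt",          # cluster CA from "gluster init"
)
for volume in resp.json():
    print(volume["name"], volume["replica"])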

Unified management

GlusterFS 3.x supports unified file and object access to the same data (using the gluster-for-swift project). While that enabled unified data access, GlusterFS 4.0 will aim for unified data and management. With 3.x, creating a gluster volume automatically created Swift accounts. With 4.0 it will be possible to create a Swift account and automatically create a gluster volume to back it (made possible only by the new model of truly virtualized volumes). With this we have true bidirectional unification of data and management between Object (Swift) and File.

Transparent drive migration

One of the use cases GlusterFS 4.0 aims to cover better is running in the cloud (like AWS, OpenStack) on top of attached block devices (like EBS, Cinder). For this use case, it must be made extremely easy for block devices to be detached from one server and attached to another. This is the same "pattern" as using gluster in a JBOD configuration and migrating drives from one shelf to another. Supporting this involves doing the right thing at multiple layers. The most notable is the management layer, where we need to detect that a brick has migrated, make appropriate changes in the configuration, notify all clients of the new topology transparently, and possibly make other rebalance/self-heal decisions as well.

A similar but distinct problem is graceful handling of changes in IP addresses (which can happen when instances in a cloud reboot). An internal requirement/dependency is to store all configuration keyed by UUID and map UUIDs to IP addresses at the most superficial layer, to minimize disruption.
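
A minimal sketch of this UUID-keyed indirection (all names and values are illustrative):

# Sketch: all config references bricks/servers by UUID; the UUID -> address
# map is consulted only at connection time, so a rebooted instance (new IP)
# or a migrated drive needs just one superficial record update.

config = {
    "volumes": {"my-docs": {"replica": 2, "bricks": ["uuid-a", "uuid-b"]}},
    "addresses": {"uuid-a": "10.0.0.11",   # only this map changes when an
                  "uuid-b": "10.0.0.12"},  # instance reboots with a new IP
}

def address_of(brick_uuid):
    return config["addresses"][brick_uuid]  # resolved at the outermost layer

print(address_of("uuid-a"))                  # 10.0.0.11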

Data layer

EHT-2 (Elastic Hashing 2.0)

One of the biggest factors limiting data-path scaling to 10k nodes is the fact that the current DHT translator maintains the full skeleton directory structure of the namespace on all bricks. Solutions like a "directory-span" option, which limits a given directory's scale-out factor, are neither effective nor ideal, because such a limit restricts the sub-directories' spanning and also imposes on the parent directory's spanning.

A new approach is proposed here which addresses this fundamental limitation. A subset of the bricks (possibly just one, since it is not critical for HA) is chosen to store the skeleton directory structure. This stores only directories, with their assigned GFIDs, and no files; this gives "structure" to the overall namespace. Files in a given directory are then spread/hashed across many servers. For a given directory in the namespace, a server creates a manifestation of that directory in a flat namespace, named by the ASCII-encoded GFID in UUID format.

SKELETON NODE
/ (gfid= 00000001)
 |
 |-home/ (gfid= 1c45d8af)
 |     |
 |     |-user1/ (gfid= 4dfa37e3)
 |
 |-etc/ (gfid= 1c3234b1)
 
DATA NODE
/
|-00/
|   |-00000001/
|             |-<files in dir />
|-1c/
|   |-1c45d8af/
|   |         |-<files in dir /home/>
|   |
|   |-1c3234b1/
|             |-<files in dir /etc/>
|
|-4d/
|   |-4dfa37e3/
|             |-<files in dir /home/user1/>

When listing entries in a directory, all entries on the skeleton node are listed first (which covers all the sub-directories) and all non-directories are aggregated from the data nodes to complete the full set. The flat dirs on the data nodes also store a dummy (empty) entry for each sub-directory along with the files. This way, if the skeleton node is down, the sub-directory entries on the data nodes are used.
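
A toy sketch of the two mechanisms just described, mapping a directory GFID to its flat manifestation and aggregating a listing (the in-memory dicts below stand in for real skeleton/data bricks):

# Illustrative sketch of EHT-2's flat layout and directory listing.
import posixpath

def data_node_path(gfid_hex):
    """Flat manifestation of a directory on a data node:
    gfid 1c45d8af -> 1c/1c45d8af (matches the diagram above)."""
    return posixpath.join(gfid_hex[:2], gfid_hex)

# Toy stand-ins: skeleton holds only sub-directory names per parent GFID;
# each data node holds files keyed by the flat GFID path.
skeleton = {"1c45d8af": ["user1"]}                       # /home
data_nodes = [{"1c/1c45d8af": ["a.txt"]},                # data node 1
              {"1c/1c45d8af": ["b.txt"]}]                # data node 2

def list_directory(gfid_hex):
    entries = set(skeleton.get(gfid_hex, []))            # sub-directories
    for node in data_nodes:                              # aggregate files
        entries.update(node.get(data_node_path(gfid_hex), []))
    return sorted(entries)

print(list_directory("1c45d8af"))   # ['a.txt', 'b.txt', 'user1']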

This model of spreading directories allows each directory to scale independently of its parent or sub-directories. We can thereby scale to server counts far greater than what is possible today (aimed to be in the region of 10s of thousands).

The set of servers storing a parent directory and a sub-directory can even be disjoint. The skeleton structure node should not be seen as a new overhead; it is something that already exists today and is, in fact, replicated everywhere. In this way, we are reducing this "overhead" from N nodes to 1 node.

Further optimizations are possible in this model:

  • Since each directory is now completely "independent" (of its parent/sub-dirs), decisions like replication can be made on a per-directory level with much greater ease.
  • Flat directories in the data nodes can be created only on demand, when the first filename hashing to that server is created.
  • Skeleton structure can be stored on SSD or cached heavily.
  • Skeleton structure can store (cache) hash ranges of all component servers to avoid a fan-out lookup entirely.
  • Per-directory accounting (like xtime, file size, file counts) happens across data nodes on a per-directory-GFID basis, whereas the "upward recursive aggregation" part happens on the skeleton node, thereby avoiding lots of duplicated work.

Thinnified native client

The current clustering model of GlusterFS 3.x places the intelligence on the clients. A typical graph would look like this:

GlusterFS 3.x graph

While this has many advantages, there are certain disadvantages as well. For example:

  • After performing a rolling upgrade of all storage servers, the real challenge of tracking down and upgrading all clients begins.
  • While we took strenuous steps to move something as trivial (relatively speaking) as quota enforcement from client to server, we are trusting the same client with the integrity of the filesystem (like creating a file or directory on all servers with the same GFID, and updating replicate and DHT metadata in a responsible way without intentionally causing split-brains or hash collisions).
  • There is no way for the management daemon to know that all clients have successfully performed a graph switch involving a topology change (and to forcefully disable clients which have not switched over), because the clustering xlators live on the client side and the management daemon has no control over them.

To address these concerns, GlusterFS 4.0's clustering model will be to place the cluster translators on the server side instead. A typical graph would now instead look like this:

GlusterFS 4.0 graph

With this model we have a clearer view of which functions are server-side (clustered) and which are server-specific.

Note some (important) changes (a rough sketch of the resulting stack follows the list):

  • Client graph now has a new xlator called "multipath" which is cluster-aware and makes use of pathinfo to reach the right server for local data access.
  • Access control is "upgraded" to sit on top of the server (cluster) graph. It makes little sense to start a cluster-wide transaction only to later realize the user did not have access rights.
  • While marker remains in the brick graph, quota is upgraded to sit on top of the server graph as well. This is a more natural place for quota (where it should have been all along). The old quota (in 3.3.x) also sat on top of the cluster xlators, but had the issue of being on the client side. This model seems ideal in all aspects.
  • Changelog (used for geo-rep) too is upgraded to sit on top of cluster xlators. This addresses some of the issues changelog today faces, where it tries to decode the "high level operation" (like RENAME) by intercepting "low level operations" in the brick (like LINK).
  • Caching xlators still sit in the client graph. Caching is best done as close to the client as possible, and caching xlators do not impact the integrity of the filesystem the way the cluster xlators do.
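
Piecing these notes together, the resulting stack would roughly be as follows (inferred from the list above, not an authoritative volfile):

CLIENT GRAPH (thin)
  caching xlators (e.g. io-cache, write-behind)
  multipath (cluster-aware routing via pathinfo)
       |
SERVER GRAPH (clustered)
  access control
  quota
  changelog
  cluster xlators (EHT, replication)
       |
BRICK GRAPH (per physical brick)
  marker
  posix (on-disk storage)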

Sharding xlator (Stripe 2.0)

GlusterFS's answer to very large files (those which can grow beyond a single brick) has never been clear. There is a stripe xlator which allows this, but it comes at a cost of flexibility: you can add servers only in multiples of stripe-count x replica-count, and mixing striped and unstriped files is not possible in an "elegant" way. This also happens to be a big limiting factor for the big-data/Hadoop use case, where super-large files are the norm (and where you want to split a file even if it could fit within a single server.)

The proposed solution is to replace the current stripe xlator with a new Shard xlator. Unlike the stripe xlator, Shard will not be a cluster xlator. It will be placed on TOP of EHT (previously DHT). Initially, all files will be created as normal files, up to a certain (configurable) size. Those files which are hand-picked (either through a configured filename pattern or selected individually through virtual xattrs) will "grow" in a different way. The first block (default 64MB) will be stored like a normal file. Further blocks will be stored in files named by the GFID and block index in a separate namespace (like /.stripe_files/GFID1.001, /.stripe_files/GFID1.002 ... /.stripe_files/GFID1.NNN). File IO happening at a particular offset will write to the appropriate "piece file", creating it if necessary. The highest block index among all piece files will be stored in an xattr of the original (first block) file.
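
A sketch of the offset-to-piece-file mapping described above (following the naming example in this paragraph; the function name is illustrative):

# Sketch of the Shard xlator's offset -> piece-file mapping, following the
# naming example above (/.stripe_files/GFID.NNN). Block 0 is the normal file.
BLOCK_SIZE = 64 * 1024 * 1024          # default 64MB, configurable

def piece_file(gfid, offset):
    index = offset // BLOCK_SIZE
    if index == 0:
        return gfid                    # first block lives in the normal file
    return "/.stripe_files/%s.%03d" % (gfid, index)

print(piece_file("GFID1", 10 * 1024 * 1024))    # GFID1 (block 0)
print(piece_file("GFID1", 130 * 1024 * 1024))   # /.stripe_files/GFID1.002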

The advantage of such a model:

  • Data blocks are distributed by EHT in a "normal way".
  • Adding servers can happen in any number (even one at a time) and EHT's rebalance will spread out the "piece files" evenly.
  • Self-healing of a large file is now more distributed into smaller files across more servers.
  • The piece-file naming scheme is immune to renames and hardlinks.

EOF
