
@krisis
Last active August 29, 2015 14:09
Glusterd Management Volume proposal

##Abstract

Glusterd, the management daemon for GlusterFS, maintains the volume and cluster configuration store using a home-grown replication algorithm. Some of its shortcomings are as follows.

  • Involves O(N^2) (in number of nodes) network messages to replicate configuration changes for every command

  • Doesn't rely on quorum and is not resilient to network partitions

  • Recovery of nodes that come back online can choke the network at scale

The thousand node glusterd proposal[1], one of the more mature proposals addressing the above problems, recommends using a consistent distributed store like consul/etcd for maintaining the volume and cluster configuration. While the technical merits of this approach make it compelling, the operational challenges, like coordinating between the two communities for releases and bug-fixes, could get out of hand. An alternate approach[2] is to use a replicated GlusterFS volume as the distributed store instead. The remainder of this email explains how a GlusterFS volume could be used to store configuration information.

##Technical details

We will refer to the replicated GlusterFS volume used for storing configuration as the Management volume (MV). The following section describes how MV would be managed.

###MV management

To begin with, we can restrict the MV to a pure replicated volume with a maximum of 3 bricks on 3 different nodes[3]. The brick path can be stored in glusterd.vol, which is packaged. MV will come into existence only after the first peer probe or first volume create operation.

The following example of setting up a GlusterFS storage cluster highlights how things work in the proposed scheme.

  • Install glusterfs server packages on a storage node.

  • Start glusterd service.

  • Create a volume. --> Now, the MV is created with one brick and mounted under /var/lib/glusterd

  • Add a peer to the cluster --> Now, MV is expanded to a 2-way replicated volume with the second brick in the new peer. MV is mounted in the new peer under /var/lib/glusterd.

  • Create more volumes.

  • Add the third peer to the cluster --> MV is expanded to a 3-way replicated volume with the third brick in the new peer. MV is mounted under /var/lib/glusterd in the new peer. This is the last time MV is expanded.

  • Any further peers added to the cluster would only mount the MV under /var/lib/glusterd.

The above restrictions placed on MV allow us to escape the need for a robust distributed store for MV's volume information and volume files.
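The growth of MV in the example above can be sketched as a toy model. This is only an illustration of the rules described here, not glusterd code: the `Cluster` class, its method names and the node names are hypothetical, and the cap of 3 bricks is the one proposed in this section.

```python
MAX_MV_BRICKS = 3  # replica cap proposed in the text


class Cluster:
    """Toy model: which peers host an MV brick and which only mount MV."""

    def __init__(self, first_node):
        self.peers = [first_node]  # every peer, in join order
        self.mv_bricks = []        # peers hosting an MV brick
        self.mv_mounts = []        # peers with MV mounted on /var/lib/glusterd

    def _ensure_mv(self):
        # MV comes into existence on the first volume create or peer probe.
        if not self.mv_bricks:
            self.mv_bricks.append(self.peers[0])
            self.mv_mounts.append(self.peers[0])

    def create_volume(self):
        self._ensure_mv()

    def add_peer(self, name):
        self._ensure_mv()
        self.peers.append(name)
        if len(self.mv_bricks) < MAX_MV_BRICKS:
            self.mv_bricks.append(name)  # expand the replica set
        self.mv_mounts.append(name)      # every peer mounts MV


cluster = Cluster("node1")
cluster.create_volume()
for name in ("node2", "node3", "node4"):
    cluster.add_peer(name)
print(cluster.mv_bricks)  # bricks stay on the first three peers
```

The fourth peer ends up in `mv_mounts` but not in `mv_bricks`, which matches the last step of the example: later peers only mount MV.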

###Configuration details of MV

  • Peers that host bricks for MV would have a boolean option in glusterd.vol, e.g. something like option mv_host on

  • The brick path for MV would have a default from the packaged glusterd.vol, e.g. option mv_brick /mv/brick

  • Replica count. This could be stored as part of glusterd.vol too, e.g. option mv_replica 3

  • The ports for MV bricks could be reserved by glusterd's port mapper. For example, 49152 could be reserved for the MV brick on each node, given that we would have only one MV brick per peer.

  • Options to be set on the volume: client-quorum and, optionally, proactive self-heal.

  • MV would benefit from client-side quorum, server-side quorum and other options. These could be preset (packaged) as part of glusterd.vol too.

  • With the brick path, ports and volume options present in glusterd.vol or preset, we can build the in-memory volume info representation when glusterd initializes. This means we can generate MV's volume file dynamically on each MV-hosting peer when needed and store it in a 'known' location on local disk.
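As a rough sketch of the last point, the options above could be read from glusterd.vol and turned into an in-memory description of the local MV brick. The mv_host, mv_brick and mv_replica options are the ones proposed in this section and do not exist in glusterd today; the helper names, the dictionary layout and the fallback replica count of 3 are assumptions made for illustration.

```python
SAMPLE_GLUSTERD_VOL = """\
volume management
    type mgmt/glusterd
    option working-directory /var/lib/glusterd
    option mv_host on
    option mv_brick /mv/brick
    option mv_replica 3
end-volume
"""

MV_PORT = 49152  # port the text suggests reserving for the MV brick


def parse_options(text):
    """Collect 'option <key> <value>' lines into a dict."""
    options = {}
    for line in text.splitlines():
        parts = line.split(None, 2)
        if len(parts) == 3 and parts[0] == "option":
            options[parts[1]] = parts[2].strip()
    return options


def build_mv_info(options, hostname):
    """Describe this peer's MV brick, or return None if it hosts none."""
    if options.get("mv_host", "off") != "on":
        return None
    return {
        # Falling back to 3 is an assumption, not part of the proposal.
        "replica": int(options.get("mv_replica", "3")),
        "brick": f"{hostname}:{options.get('mv_brick', '/mv/brick')}",
        "port": MV_PORT,
    }


options = parse_options(SAMPLE_GLUSTERD_VOL)
mv_info = build_mv_info(options, "node1")
```

Because everything needed is local to the peer, no other node has to be contacted to work out where the MV brick lives.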

###Changes in glusterd command execution

Currently, each peer modifies its configuration in /var/lib/glusterd in the commit phase of every command execution. With the introduction of MV, only the peer on which the command is executed will modify the configuration in /var/lib/glusterd, and it does so after the commit phase has completed on the remaining available peers. Note that the other nodes don't perform any updates to MV.
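A minimal sketch of this single-writer flow follows. It is a simplification under stated assumptions: the `Peer` class and `run_command` are hypothetical names, the commit phase is reduced to an in-memory append, and a plain list stands in for the MV-backed /var/lib/glusterd.

```python
class Peer:
    """Hypothetical stand-in for a glusterd instance."""

    def __init__(self, name):
        self.name = name
        self.committed = []  # commands this peer has committed in memory

    def commit(self, command):
        self.committed.append(command)
        return True


def run_command(originator, peers, shared_store, command):
    """Run the commit phase on the other peers, then let only the
    originator write to the shared (MV-backed) store."""
    others = [p for p in peers if p is not originator]
    if not all(p.commit(command) for p in others):
        return False  # commit failed somewhere, so nothing is written
    originator.commit(command)
    # Single writer: no other peer touches the shared store.
    shared_store.append((originator.name, command))
    return True


a, b, c = Peer("a"), Peer("b"), Peer("c")
store = []  # stands in for /var/lib/glusterd on MV
ok = run_command(a, [a, b, c], store, "volume create v1")
```

With one writer per command, replicating the change to the other MV bricks is left entirely to the volume's own replication.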

###How to replace a 'dead' server/peer?

At the moment, I haven't thought of an automatic (or near semi-automatic) way of replacing a 'dead' peer. The manual steps should be as follows:

  • If the 'dead' peer doesn't host MV bricks, then the procedure is the same as in previous versions; this approach doesn't change anything. The remaining steps apply when it does host an MV brick.

  • Provision a new server. Install glusterfs packages.

  • Modify glusterd.vol to have option mv_host on and option mv_replica 3 (the replica count as the case may be).

  • Probe the peer to the cluster. glusterd on initialization would replace its MV brick in MV and replication's healing should replicate the configuration.

N.B. This procedure assumes default MV config parameters. For a non-default configuration, the brick path should also be updated in glusterd.vol on the new peer.

###How to upgrade from current version?

The steps would be as follows:

  • Stop all gluster{d,fs,fsd} processes by stopping the corresponding services.

  • Upgrade to this version of glusterfs packages.

  • Choose at most 3 servers/peers to build MV. On these nodes, create the default brick directories and modify the (new) glusterd.vol to have option mv_host on. Set the replica count in each peer's glusterd.vol, say option mv_replica 3.

  • Move the contents of /var/lib/glusterd on each peer to a temporary directory, say /var/lib/glusterd.bkp.

  • Start the glusterd service on one of the nodes, in 'upgrade' mode. In this mode, glusterd would start the MV bricks and mount MV on /var/lib/glusterd. It will not serve CLI or mount requests.

  • Copy the contents of /var/lib/glusterd.bkp on to (the mounted) /var/lib/glusterd.

  • Repeat this on all nodes in the cluster.

  • Stop glusterd on all nodes. Start glusterd service on all nodes (in 'normal' mode).

  • Now the storage cluster should be ready for improved operations.
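The file-handling part of these steps (moving the old contents aside, then copying them onto the mounted MV) could look roughly like the sketch below. The function names are hypothetical, and the demo runs in a scratch directory with a placeholder file rather than touching /var/lib/glusterd. Stopping services and starting glusterd in 'upgrade' mode are not modelled.

```python
import shutil
import tempfile
from pathlib import Path


def back_up_config(config_dir, backup_dir):
    """Move the contents of config_dir into backup_dir, emptying config_dir."""
    config_dir, backup_dir = Path(config_dir), Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    for entry in list(config_dir.iterdir()):
        shutil.move(str(entry), str(backup_dir / entry.name))


def restore_config(backup_dir, config_dir):
    """Copy the backed-up contents onto the (now MV-mounted) config_dir."""
    # dirs_exist_ok requires Python 3.8 or newer.
    shutil.copytree(backup_dir, config_dir, dirs_exist_ok=True)


def demo():
    """Exercise both steps in a scratch directory."""
    with tempfile.TemporaryDirectory() as root:
        config = Path(root) / "glusterd"
        backup = Path(root) / "glusterd.bkp"
        (config / "vols" / "v1").mkdir(parents=True)
        (config / "vols" / "v1" / "info").write_text("placeholder\n")
        back_up_config(config, backup)
        emptied = not any(config.iterdir())
        # In the real procedure, MV would be mounted on config here.
        restore_config(backup, config)
        restored = (config / "vols" / "v1" / "info").read_text()
        return emptied, restored
```

The backup is copied rather than moved back, so /var/lib/glusterd.bkp remains available if the upgrade has to be retried.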

###How to upgrade from this version to future versions?

This is trickier than it should be, given that we are holding MV's configuration in glusterd.vol, which is packaged. I would like to hear suggestions from the community on this.

###References

[1] - http://www.gluster.org/community/documentation/index.php/Features/thousand-node-glusterd

[2] - This approach was initially recommended by Jeff Darcy, who is also the author of [1].

[3] - It shouldn't be hard to allow expanding MV beyond 3 bricks but most distributed configuration stores recommend 3 or 5 way replication. At the least this could be made configurable via glusterd.vol.

@atinmu

atinmu commented Nov 17, 2014

  1. In MV management you have mentioned that the meta volume will be created on the first peer probe/volume command. We probably also need to think about handling system get_uuid, which RHSC triggers?
  2. "Changes in glusterd command execution" - I think we should also mention that it has to be synchronous replication to reach consensus.
  3. "How to replace a 'dead' server/peer?" - Shouldn't we mention a downtime risk here?

@krisis
Author

krisis commented Nov 17, 2014

  1. I am keeping it out to stop the discussion from going off on a tangent. We could also include system::uuid_get as another trigger for creating MV.
  2. The kind of replication doesn't concern glusterd as long as it is strongly consistent and partition tolerant.
  3. The down time risk is not unique to the replication mechanism used. So, I chose not to call it out.

@atinmu

atinmu commented Nov 18, 2014

Just one more comment on "How to upgrade from current version? - Start glusterd service on all nodes (in 'normal' mode)" - the glusterd service needs to be stopped first, which is missing here.
