
Geo-rep Recovery plan

Use Case

In the event of the geo-rep master suffering a partial failure (one of the GlusterFS brick processes not functioning) or a full failure (the master node shuts down), this document covers the steps involved in recovering the master from the slave.

Terminology used in this document

  • failover - Switching from geo-rep master to slave
  • failback - Populating the master with slave data and switching back to master

Switching implies transferring control to the slave, thereby allowing write operations on it.

Recovery Mechanism

The mechanism is twofold and requires user intervention.

Step 0 - Phase 1 (Failover)

The interface provided is the Gluster CLI, executed from either the master or the slave. Executing from the slave requires the slave's SSH keys to be copied to the master.

gluster> volume geo-replication <master> <slave> recover failover start

If the command is executed on the slave, the semantics of master <--> slave are reversed. In either case, any existing geo-rep session between the master and slave is terminated. At this point the user can switch their application over to the slave and continue as usual.
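For illustration, a failover invocation with hypothetical names (vol0 as the master volume and backup.example.com::vol0-slave as the slave) would look like:

gluster> volume geo-replication vol0 backup.example.com::vol0-slave recover failover start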

Do we need a status command for failover?

Step 1 - Phase 2 (Failback)

As in Step 0, the interface provided is the Gluster CLI. This phase populates the real master with the data present on the slave (which is now acting as the master).

gluster> volume geo-replication <master> <slave> recover failback start

start initiates the data transfer without the end user facing downtime. The preferred utility and mechanism are discussed further ahead in this document. We can do a one-shot invocation of the sync utility and allow it to sync as much as it can.
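As a rough sketch of what the one-shot initial sync could look like with Rsync, assuming /mnt/slave and /mnt/master are client mounts of the slave volume and the recovered master volume (both mount points are illustrative):

# Live one-shot pass: copy data, hardlinks, ACLs and xattrs from the slave
# mount to the master mount; deletions are left to the final sync at commit.
rsync -aHAX --numeric-ids /mnt/slave/ /mnt/master/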

The status of the sync can be observed with:

gluster> volume geo-replication <master> <slave> recover failback status

Sync in Progress

Once the initial sync is done:

gluster> volume geo-replication <master> <slave> recover failback status

Sync Completed

We then move on to the next step, the final sync:

gluster> volume geo-replication <master> <slave> recover failback commit

commit requires the user to take a downtime while it performs the final sync. Again, this can be done using Rsync or Gsync.
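A possible shape for the final sync during the downtime window, again with Rsync and the same hypothetical mount points; --delete propagates files removed on the slave since the failover:

# Run only after writes to the slave have been stopped (downtime window).
rsync -aHAX --numeric-ids --delete /mnt/slave/ /mnt/master/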

To prevent Rsync from crawling and check-summing each and every file, the index translator can be used to keep track of which files were updated. The index translator needs to be loaded before start; it keeps track of modified files by creating a hardlink to each of them in a configured directory. The list of changed files can then be obtained via readdir(2) and fed to Rsync. Additionally, we would need to modify the index translator to create negative entries in order to track delete operations.
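A minimal sketch of feeding the index-based list to Rsync, assuming the index translator exposes one entry per modified file under a configured directory and that entries are named by their path relative to the volume root (both the directory location and the naming scheme are assumptions here):

# Hypothetical index directory maintained by the index translator on the slave.
INDEX_DIR=/bricks/brick0/.index

# Collect the changed paths (relative to the volume root) via readdir/find
# and hand them to Rsync so it only visits the modified files.
(cd "$INDEX_DIR" && find . -type f | sed 's|^\./||') > /tmp/changed-files.list
rsync -aHAX --files-from=/tmp/changed-files.list /mnt/slave/ /mnt/master/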

Initial Sync Mechanism

Utilities for the initial sync, i.e. failback (start mode):

  • Rsync: Allow Rsync to determine which files need to be synced (its usual rolling-checksum algorithm).
  • Gsync: Gsync can efficiently determine which files to sync using its xtime-checking approach (provided that the real master is not completely empty). Hence, this would be significantly faster than the Rsync method in determining which files to sync (see the sketch after this list).
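For reference, the xtime markers that Gsync compares are stored by the marker translator as trusted.glusterfs.<volume-uuid>.xtime extended attributes on the brick backend; they can be inspected with getfattr (the brick path below is hypothetical):

# Dump all xattrs, including the geo-rep xtime marker, for a directory
# on the brick backend (run as root to see trusted.* attributes).
getfattr -d -m . -e hex /bricks/brick0/data/some-directory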