---
title: Manually Unfederating a Nomad Cluster using a Network Partition
date: 2017-12-07 12:49:20 -0500
draft: false
tags:
  - nomad
menu:
  main:
    parent: Nomad
---

Manually Unfederating a Nomad Cluster using a Network Partition

This is an advanced process that can be used to unfederate a Nomad cluster with minimal impact on running client jobs. For clusters where the current job state is easily recreated, it is simpler to stop the jobs in the cluster, wipe the servers' state, and resubmit the jobs.
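For reference, a minimal sketch of that simpler approach is below. The job name, job file, and systemd unit name are placeholders, and the Consul auto-join configuration change described in step 1 is still needed so the servers do not simply rediscover each other after the restart:

# Stop and purge the running jobs (job name is a placeholder)
nomad stop -purge example-job

# On each server: stop Nomad, wipe the server state, and restart
systemctl stop nomad
rm -rf «data_dir»/server
systemctl start nomad

# Resubmit the jobs to the now-independent cluster
nomad run example-job.nomad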

Scenario Cluster

A nine-node cluster federated via Consul, with Nomad configured to automatically discover other servers from Consul data.

  • Cluster A
    • mr-a-1 - 10.0.0.214
    • mr-a-2 - 10.0.0.218
    • mr-a-3 - 10.0.0.28
  • Cluster B
    • mr-b-1 - 10.0.0.70
    • mr-b-2 - 10.0.0.179
    • mr-b-3 - 10.0.0.55
  • Cluster C
    • mr-c-1 - 10.0.0.87
    • mr-c-2 - 10.0.0.18
    • mr-c-3 - 10.0.0.187

  • CentOS 7
  • Nomad 0.7.0 Enterprise
  • firewalld running locally on each node

Initial state

The cluster topology starts in a federated state based on the Consul information:

[root@mr-a-1 ~]# nomad server-members
Name                       Address     Port  Status  Leader  Protocol  Build      Datacenter  Region
mr-a-1.node.consul.global  10.0.0.214  4648  alive   false   2         0.7.0+ent  dc1         global
mr-a-2.node.consul.global  10.0.0.218  4648  alive   false   2         0.7.0+ent  dc1         global
mr-a-3.node.consul.global  10.0.0.28   4648  alive   false   2         0.7.0+ent  dc1         global
mr-b-1.node.consul.global  10.0.0.70   4648  alive   false   2         0.7.0+ent  dc2         global
mr-b-2.node.consul.global  10.0.0.179  4648  alive   false   2         0.7.0+ent  dc2         global
mr-b-3.node.consul.global  10.0.0.55   4648  alive   true    2         0.7.0+ent  dc2         global
mr-c-1.node.consul.global  10.0.0.87   4648  alive   false   2         0.7.0+ent  dc3         global
mr-c-2.node.consul.global  10.0.0.18   4648  alive   false   2         0.7.0+ent  dc3         global
mr-c-3.node.consul.global  10.0.0.187  4648  alive   false   2         0.7.0+ent  dc3         global

Desired State

Three discrete clusters of three nodes, each unaware of the other nodes.

[root@mr-a-1 ~]# nomad server-members
Name                       Address     Port  Status  Leader  Protocol  Build      Datacenter  Region
mr-a-1.node.consul.global  10.0.0.214  4648  alive   false   2         0.7.0+ent  dc1         global
mr-a-2.node.consul.global  10.0.0.218  4648  alive   false   2         0.7.0+ent  dc1         global
mr-a-3.node.consul.global  10.0.0.28   4648  alive   true    2         0.7.0+ent  dc1         global
[root@mr-b-1 ~]# nomad server-members
Name                       Address     Port  Status  Leader  Protocol  Build      Datacenter  Region
mr-b-1.node.consul.global  10.0.0.70   4648  alive   false   2         0.7.0+ent  dc2         global
mr-b-2.node.consul.global  10.0.0.179  4648  alive   true    2         0.7.0+ent  dc2         global
mr-b-3.node.consul.global  10.0.0.55   4648  alive   false   2         0.7.0+ent  dc2         global
[root@mr-c-1 ~]# nomad server-members
Name                       Address     Port  Status  Leader  Protocol  Build      Datacenter  Region
mr-c-1.node.consul.global  10.0.0.87   4648  alive   false   2         0.7.0+ent  dc3         global
mr-c-2.node.consul.global  10.0.0.18   4648  alive   true    2         0.7.0+ent  dc3         global
mr-c-3.node.consul.global  10.0.0.187  4648  alive   false   2         0.7.0+ent  dc3         global

Procedure

1. Reconfigure nodes to prevent Consul Auto-Joins

On the Nomad nodes, add a consul stanza at the top level of the configuration. This stanza must include server_auto_join = false and client_auto_join = false. For example, in HCL:

...
consul {
  server_auto_join = false
  client_auto_join = false
} 
...

Add the addresses of the servers that will remain in each cluster to the server stanza. Using retry_join is the preferred method:

...
server {
  retry_join = ["10.0.0.87","10.0.0.18","10.0.0.187"]  # for a cluster C node as an example
  ... other server options...
} 
...

Do not restart the Nomad agents at this time; that will happen in a later step.
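Putting the two changes together, a sketch of the relevant configuration fragments for a cluster C server node might look like the following; the bootstrap_expect value is an assumption for a three-server cluster, and the rest of the agent configuration is omitted:

consul {
  server_auto_join = false
  client_auto_join = false
}

server {
  enabled          = true
  bootstrap_expect = 3
  retry_join       = ["10.0.0.87", "10.0.0.18", "10.0.0.187"]
}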

1.1. (Optional) Pre-build peers.json

You can also pre-build the peers.json file now, as described in step 5 below.

NOTE: This file must NOT be placed into the raft folder while the Nomad process is running.
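For example, for a cluster C node you could stage the file somewhere outside the data directory (the /root/peers.json path is just an arbitrary staging location) and move it into «data_dir»/server/raft only after Nomad has been stopped in step 3. Port 4647 is the Nomad server RPC port:

cat > /root/peers.json <<'EOF'
["10.0.0.87:4647","10.0.0.18:4647","10.0.0.187:4647"]
EOF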

2. Use a firewall to partition the clusters

Create firewall rules that prevent communication between the clusters that you want to separate. For my sample cluster, I will partition it into three separate clusters. On cluster A, I want to deny all traffic from clusters B and C; for cluster B, I want to deny A and C; for C, deny A and B.

Cluster A

# Cluster B
firewall-cmd --zone=public --add-rich-rule='rule family="ipv4" source address="10.0.0.70" reject'
firewall-cmd --zone=public --add-rich-rule='rule family="ipv4" source address="10.0.0.179" reject'
firewall-cmd --zone=public --add-rich-rule='rule family="ipv4" source address="10.0.0.55" reject'
# Cluster C
firewall-cmd --zone=public --add-rich-rule='rule family="ipv4" source address="10.0.0.87" reject'
firewall-cmd --zone=public --add-rich-rule='rule family="ipv4" source address="10.0.0.18" reject'
firewall-cmd --zone=public --add-rich-rule='rule family="ipv4" source address="10.0.0.187" reject'

Cluster B

# Cluster A
firewall-cmd --zone=public --add-rich-rule='rule family="ipv4" source address="10.0.0.214" reject'
firewall-cmd --zone=public --add-rich-rule='rule family="ipv4" source address="10.0.0.218" reject'
firewall-cmd --zone=public --add-rich-rule='rule family="ipv4" source address="10.0.0.28" reject'
# Cluster C
firewall-cmd --zone=public --add-rich-rule='rule family="ipv4" source address="10.0.0.87" reject'
firewall-cmd --zone=public --add-rich-rule='rule family="ipv4" source address="10.0.0.18" reject'
firewall-cmd --zone=public --add-rich-rule='rule family="ipv4" source address="10.0.0.187" reject'

Cluster C

# Cluster A
firewall-cmd --zone=public --add-rich-rule='rule family="ipv4" source address="10.0.0.214" reject'
firewall-cmd --zone=public --add-rich-rule='rule family="ipv4" source address="10.0.0.218" reject'
firewall-cmd --zone=public --add-rich-rule='rule family="ipv4" source address="10.0.0.28" reject'
# Cluster B
firewall-cmd --zone=public --add-rich-rule='rule family="ipv4" source address="10.0.0.70" reject'
firewall-cmd --zone=public --add-rich-rule='rule family="ipv4" source address="10.0.0.179" reject'
firewall-cmd --zone=public --add-rich-rule='rule family="ipv4" source address="10.0.0.55" reject'

3. Stop the Nomad process on the server nodes

After the Nomad server processes are stopped, it won't be possible to submit new jobs to the cluster; existing jobs will continue running without issue. Complete steps 3 through 5 on the servers in all clusters before starting Nomad again.
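A minimal sketch, assuming Nomad runs under systemd with a unit named nomad (adjust for your init system):

# On each server node
systemctl stop nomad
systemctl status nomad   # confirm the process is no longer running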

4. Delete the Serf snapshot file

Deleting the Serf snapshot is required to prevent the nodes in the clusters from reconnecting. The Serf snapshot file can be found in the «data_dir»/server/serf folder:

rm -f «data_dir»/server/serf/snapshot

5. Create a peers.json file

If the change in cluster membership will leave the new configuration unable to reach quorum (which is typical in this scenario), update the membership information using a peers.json file.

In the «data_dir»/server/raft folder, there is a peers.info file with additional information about the process.

Create «data_dir»/server/raft/peers.json containing a JSON array of the remaining cluster members' RPC addresses (port 4647). For example, in my cluster C, the peers.json file would contain:

["10.0.0.87:4647","10.0.0.18:4647","10.0.0.187:4647"]

(Optional) Create peers.json using Consul query

curl http://127.0.0.1:8500/v1/catalog/nodes | jq --compact-output '[.[] | .Address+":4647"]' > peers.json

Verify that the generated peers.json contains only the Nomad servers that should remain in the cluster (the Consul catalog query returns every node in the datacenter, including Nomad clients) before moving it into place.
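If you generated the file with the Consul query above, a sketch of moving it into place follows. The nomad:nomad ownership is an assumption that applies only if Nomad runs as a dedicated user rather than root:

# Review the generated file, then move it into the raft directory
jq . peers.json
mv peers.json «data_dir»/server/raft/peers.json
chown nomad:nomad «data_dir»/server/raft/peers.json   # only if Nomad does not run as root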

6. Start Nomad on the servers
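Again assuming a systemd unit named nomad, start the servers and watch the logs to confirm each node rejoins its local peers and a leader is elected:

# On each server node
systemctl start nomad
journalctl -u nomad -f   # watch for leader election within the local cluster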

7. Verify that the clusters are now separate

Use nomad server-members to verify that the clusters are now separate.

8. Tear down the firewall rules used to partition the cluster

The firewall rules created in step 2 are no longer necessary, so you can remove them. Because my example cluster is using firewalld, I would run firewall-cmd --reload to remove the temporary rules.
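Because the rich rules in step 2 were added to the runtime configuration only, a reload restores the permanent configuration and drops them:

firewall-cmd --reload
firewall-cmd --zone=public --list-rich-rules   # should no longer list the reject rules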
