@AoJ
Created June 17, 2020 09:30
High Availability (HA) and Clustering: 2-Node (Master / Slave) PostgreSQL Cluster


  • /var/lib/pgsql/11/data/pg_hba.conf (in the $HOME of user postgres)

  • /usr/pgsql-11/bin contains the binaries



  • Three core principles of HA:

    • Elimination of single points of failure.

    • Reliable crossover.

    • Detection of failures as they occur.

How is High Availability implemented on RHEL 7?

  • CLUSTERING!

Definitions

  • I/O Fencing is, in short, the act of preventing a node from issuing I/Os (usually to shared storage). It is also known as STOMITH or STONITH.

  • Fencing devices are hardware components used to prevent a node from issuing I/Os.

  • Fencing agents are software components used to communicate with fencing devices in order to perform I/O Fencing. In the context of this project, they are standalone applications which are spawned by the cluster software (compared to, for example, Linux-HA which uses dynamically-loaded modules).

  • Power fencing is when a node is power-cycled, reset, or turned off to prevent it from issuing I/Os

  • Fabric fencing is when a node's access to the shared data is cut off at the device-level. Disabling a port on a fibre channel switch (zoning), revoking a SCSI3 group reservation, or disabling access at the SAN itself from a given initiator's GUID are all examples of fabric fencing.

  • Manual fencing or meatware is when an administrator must manually power-cycle a machine (or unplug its storage cables) and follow up with the cluster, notifying the cluster that the machine has been fenced. This is never recommended.

Nomenclature

  • RA
    • Resource agent
  • Master,Slave
    • the state of Master/Slave resource of Pacemaker
  • PRI
    • PostgreSQL works as Primary (Master). It can process read/write requests like a normal PostgreSQL instance, and it can send replication data. The Master state in Pacemaker basically corresponds to this state. With synchronous replication, transactions stop when the response from the Standby PostgreSQL (HS) is lost.
  • HS
    • PostgreSQL works as Hot Standby. Only read requests are available. Replication data can be received from PRI. Because PostgreSQL cannot change state from PRI to HS directly, the PostgreSQL state may not be consistent with the Slave state in Pacemaker, even though Pacemaker moves the resource from Master to Slave.
  • Asynchronous mode
    • The RA runs the PostgreSQL HA cluster with asynchronous replication. Failover of the Master is possible only when HS is working normally under asynchronous replication.
  • Synchronous mode
    • The RA runs the PostgreSQL HA cluster with synchronous replication. HS normally works in synchronous mode. However, if HS breaks down or its LAN is disconnected, replication switches to asynchronous mode (automatic switch). Failover of the Master is possible only when HS is working in synchronous mode.
  • Replication mode
    • The asynchronous mode and the synchronous mode are generically called a replication mode.
  • Replication mode switch
    • In synchronous mode, the replication mode switches from synchronous to asynchronous replication.
  • D-LAN
    • LAN that carries data replication packets. Please use bonding for fault tolerance.
  • S-LAN
    • LAN that provides the service. In the current Active-Standby configuration, it is the LAN that holds the virtual IP. Please use bonding for fault tolerance.
  • IC-LAN
    • LAN (interconnect LAN) that carries Pacemaker heartbeat packets. Two or more are recommended.
  • STONITH-LAN
    • STONITH is not used in this document, but I recommend using it.

Limitations

  • If you want to connect to the new Master (PRI) after an automatic failover has occurred, the WAL archive of PRI must be shared with HS. For instance, in the "archive_command" setting of postgresql.conf, you can use scp or rsync to send the WAL archive to HS. Other methods include a shared disk, NFS, and so on. The RA does not care which method is used. I recommend not sharing WAL archives because it is difficult to keep them consistent.
  • Because of the PostgreSQL 9.1 specification, switching over the Master fails: the shutdown of PRI cannot send all WAL to HS. I hope someone will improve it.
  • Without sharing the WAL archive, HS cannot reconnect to PRI when PostgreSQL has been stopped by demote and stop operations. In this case you must copy the WAL archive of PRI to HS manually.
  • The virtual IP must be added not only to S-LAN but also to D-LAN on the Master.
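The "archive_command" idea above can be sketched as a small shell helper. This is only an illustration under stated assumptions: the function name archive_wal is hypothetical, and it copies to a local directory with cp, whereas sending the segment to HS would mean replacing the copy step with scp or rsync as the text suggests. It refuses to overwrite a segment that is already archived and returns non-zero on failure, so that PostgreSQL keeps the segment and retries.

```shell
#!/bin/sh
# archive_wal SRC_PATH FILE_NAME ARCHIVE_DIR
# Hypothetical helper (not part of PostgreSQL or the RA).
# SRC_PATH and FILE_NAME correspond to %p and %f in archive_command.
archive_wal() {
    src=$1
    name=$2
    dest_dir=$3

    # Never overwrite a segment that was already archived.
    if [ -e "$dest_dir/$name" ]; then
        echo "archive_wal: $name is already archived" >&2
        return 1
    fi

    # Copy under a temporary name, then rename, so that a partially
    # written file never appears under the final segment name.
    cp "$src" "$dest_dir/$name.tmp" && mv "$dest_dir/$name.tmp" "$dest_dir/$name"
}
```

A wrapper script that calls archive_wal "$1" "$2" /var/lib/pgsql/9.1/data/pg_archive could then be referenced from archive_command with %p and %f as its two arguments; the wrapper itself is an assumption, not something the RA provides.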

Clustering

A cluster is a set of computers working together on a single task. Which task is performed, and how that task is performed, differs from cluster to cluster.

  • High-availability clusters: Also known as HA clusters or failover clusters, their function is to keep running services as available as possible. They come in two main configurations:

    • Active-Active (where a service runs on multiple nodes).

    • Active-Passive (where a service only runs on one node at a time).

  • Load-balancing clusters: All nodes run the same software and perform the same task at the same time, and client requests are distributed among all the nodes.

  • Compute clusters: Also known as high-performance computing (HPC) clusters. In these clusters, tasks are divided into smaller chunks, which are then computed on different nodes.

  • Storage clusters: All nodes provide a single cluster file system that will be used by clients to read and write data simultaneously.

Software

The cluster infrastructure software is provided by Pacemaker and performs the following functions: cluster management, lock management, fencing, and cluster configuration management.

  • pacemaker

    It's responsible for all cluster-related activities, such as monitoring cluster membership, managing the services and resources, and fencing cluster members. The RPM contains several important components, including:

    • Cluster Information Base (CIB).
    • Cluster Resource Management Daemon (CRMd).
  • corosync: This is the framework used by Pacemaker for handling communication between the cluster nodes.

  • pcs: Provides a command-line interface to create, configure, and control every aspect of a Pacemaker/corosync cluster.

Here are some requirements and limits for Pacemaker:

  • Up to 16 nodes per cluster. Minimum number of nodes: 3.

  • A 2-node cluster can be configured but is not recommended.
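One way to see why three nodes is the usual minimum is to look at majority quorum. The sketch below assumes one vote per node and a simple majority rule (more than half of the votes); the function names are hypothetical and the arithmetic is only an illustration, not Pacemaker's or corosync's code.

```shell
#!/bin/sh
# quorum_votes N: votes needed for a majority among N single-vote nodes.
quorum_votes() {
    echo $(( $1 / 2 + 1 ))
}

# tolerated_failures N: nodes that can fail while the rest keep a majority.
tolerated_failures() {
    echo $(( $1 - ($1 / 2 + 1) ))
}

for n in 2 3 16; do
    echo "nodes=$n quorum=$(quorum_votes "$n") tolerated_failures=$(tolerated_failures "$n")"
done
# nodes=2 quorum=2 tolerated_failures=0
# nodes=3 quorum=2 tolerated_failures=1
# nodes=16 quorum=9 tolerated_failures=7
```

With two nodes, losing either one loses the majority, so a plain majority rule gives no failure tolerance at all. This is the reason a 2-node cluster needs special quorum handling and depends so heavily on working fencing.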

Fence agents

Fencing, generally, is a way to prevent an ill-behaved cluster member from accessing shared data in a way which would cause data or file system corruption. The canonical case where fencing is required is something like this: Node1 live-hangs with a lock on a GFS file system. Node2 thinks node1 is dead, takes a lock, and begins accessing the same data. Node1 wakes up and continues what it was doing. Because we can not predict when node1 would wake up or prevent it from issuing I/Os immediately after waking up, we need a way to prevent its I/Os from completing even if it does wake up.

Fence agents were developed as device "drivers" which are able to prevent computers from destroying data on shared storage. Their aim is to isolate a corrupted computer, using one of three methods:

  • Power - A computer that is switched off cannot corrupt data, but it is important not to rely on a "soft reboot", since we cannot know whether the node is still able to perform one. This also works for virtual machines when the fence device is a hypervisor.

  • Network - Switches can prevent routing to a given computer, so even if a computer is powered on it won't be able to harm the data.

  • Configuration - Fibre-channel switches or SCSI devices allow us to limit who can write to managed disks.

Features of HA

  • Failover of Master

    • If the Master breaks down, the RA detects the fault, stops the Master, and promotes the Slave to be the new Master (promote).
  • Switching between asynchronous and synchronous

    • If the Slave breaks down or the LAN has trouble while synchronous replication is configured, transactions that include write operations are blocked. This means the service stops. Therefore, the RA dynamically switches from synchronous to asynchronous replication to prevent the service from stopping.

  • Automated discrimination of old and new data at initial start

    • When Pacemaker is started on two or more nodes at the same time at initial start, the RA compares the data of each node using the last xlog replay location to determine which node has the newest data. The node with the newest data becomes the Master. Of course, a node also becomes the Master when Pacemaker is started on only one node or when it starts for the first time. The RA judges this based on the state of the data when it was last stopped.
  • Load-balancing of read

    • Because the Slave can process read-only transactions, read load-balancing is possible by assigning another virtual IP for read operations.
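The comparison of last xlog replay locations described above can be illustrated with a small shell function. This is a sketch, not the RA's actual code: it assumes locations in the usual textual form of two hexadecimal numbers separated by a slash (for example 0/3000060), and the name lsn_cmp is hypothetical.

```shell
#!/bin/sh
# lsn_cmp A B
# Prints 1 if location A is newer than B, -1 if it is older, 0 if equal.
# Each location is "HI/LO" in hexadecimal; HI is compared first, then LO.
lsn_cmp() {
    a_hi=${1%%/*}; a_lo=${1##*/}
    b_hi=${2%%/*}; b_lo=${2##*/}
    a_hi=$(( 0x$a_hi )); a_lo=$(( 0x$a_lo ))
    b_hi=$(( 0x$b_hi )); b_lo=$(( 0x$b_lo ))

    if [ "$a_hi" -gt "$b_hi" ]; then
        printf '%s\n' 1
    elif [ "$a_hi" -lt "$b_hi" ]; then
        printf '%s\n' -1
    elif [ "$a_lo" -gt "$b_lo" ]; then
        printf '%s\n' 1
    elif [ "$a_lo" -lt "$b_lo" ]; then
        printf '%s\n' -1
    else
        printf '%s\n' 0
    fi
}

lsn_cmp 0/3000060 0/3000000   # prints 1
lsn_cmp 0/FF000000 1/0        # prints -1
lsn_cmp 2/A0 2/A0             # prints 0
```

The node whose location compares as newest against every other node would be the candidate for Master. Comparing the two halves separately avoids having to know how a given PostgreSQL version maps a location to a single byte offset.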

Parameter

The following parameters are added to those of the original pgsql RA. The "monitor_sql" parameter of the original pgsql RA cannot be used in replication mode.

  • rep_mode
    • Choose from none/async/sync. "none" is the default and behaves the same as the original pgsql RA. "async" is asynchronous mode, and "sync" is synchronous mode. The parameters node_list, master_ip, and restore_command below are required in async or sync mode (*).
  • node_list(*)
    • The list of PRI and all HS nodes. Specify a space-separated list of all node names (the result of the uname -n command).
  • master_ip(*)
    • The virtual IP used for D-LAN. This virtual IP is added to the Master.
  • restore_command(*)
    • The restore_command written to the recovery.conf file when starting as HS.
  • repuser
    • The replication user that HS uses to connect to PRI. Default is "postgres".
  • primary_conninfo_opt
    • The RA generates the recovery.conf file for HS. The host, port, user, and application name of primary_conninfo are set automatically by the RA. If you want to set additional parameters, you can specify them here.
      • e.g. SSL settings
  • tmpdir
    • The rep_mode_conf, xlog_note.*, and PGSQL.lock files are created in this directory. Default is /var/lib/pgsql/tmp. If the directory doesn't exist, the RA creates it automatically.
  • xlog_check_count
    • The number of times last_xlog_replay_location is checked. Default is 3 (times). It is counted at each monitor interval. The last_xlog_replay_location is used to decide which node has the latest data at the initial start of PostgreSQL. If you set a small value, the wrong node may be chosen as PRI because Pacemaker on the other nodes has not started yet.
  • stop_escalate_in_slave
    • Number of shutdown retries (using -m fast) before resorting to -m immediate in the Slave state. In the Master state, you can use "stop_escalate".
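The escalation described for stop_escalate_in_slave can be pictured as a retry loop. The following is a minimal sketch and not the RA's implementation: stop_with_escalation and pg_stop are hypothetical names, and the data directory default and the 10-second wait passed to pg_ctl are assumptions chosen for illustration.

```shell
#!/bin/sh
# stop_with_escalation RETRIES STOP_CMD
# STOP_CMD is a callback invoked as "STOP_CMD <mode>" that returns 0 once
# the server is down. It is tried RETRIES times with mode "fast", then
# once with mode "immediate".
stop_with_escalation() {
    retries=$1
    stop_cmd=$2
    attempt=1
    while [ "$attempt" -le "$retries" ]; do
        if "$stop_cmd" fast; then
            return 0
        fi
        attempt=$(( attempt + 1 ))
    done
    # Every fast shutdown attempt failed: escalate to immediate.
    "$stop_cmd" immediate
}

# Example callback wrapping pg_ctl (assumed paths and timeout).
pg_stop() {
    pg_ctl -D "${PGDATA:-/var/lib/pgsql/9.1/data}" stop -m "$1" -w -t 10
}
```

Calling stop_with_escalation 3 pg_stop would try three fast shutdowns before the immediate one. Passing the stop command as a callback keeps the escalation logic separate from pg_ctl, which also makes it easy to exercise with a stand-in command.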

Installation

This section mainly describes the settings specific to this RA. Please refer to other documents for the installation and basic operation of PostgreSQL and Pacemaker.

This document assumes the following configuration.

  • The replication mode is assumed to be a synchronous mode

  • The WAL archive is not shared.

    • You need to copy WAL archive from new Master to Slave.
  • The node name is assumed to be pm01 and pm02

  • S-LAN IP are 192.168.0.1 and 192.168.0.2.

  • Virtual IP(Master) is 192.168.0.201.

  • Virtual IP(Slave) is 192.168.0.202.

  • IC-LAN are 192.168.1.1, 192.168.1.2, 192.168.2.1, and 192.168.2.2.

  • D-LAN are 192.168.3.1 and 192.168.3.2.

  • Virtual (Master of D-LAN) IP is 192.168.3.200.

  • /usr/local/pgsql/ is the installation destination of PostgreSQL.

  • /var/lib/pgsql/9.1/data is the PostgreSQL database cluster, and the archive storage is /var/lib/pgsql/9.1/data/pg_archive.

  • If you use this RA for business use, a STONITH setting is strongly recommended. The explanation is omitted here.

The main settings are as follows. Please refer to the PostgreSQL manual for other parameters. Check that PostgreSQL starts on its own and that replication works.

Install Log

postgres=# \l List of databases Name

bash-4.2$ whoami
postgres
bash-4.2$ $HOME
bash: /var/lib/pgsql: Is a directory
bash-4.2$

bash-4.2$ ls /usr/pgsql-11/
bin  lib  share
[root@db1 ~]# netstat -lt
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 0.0.0.0:sunrpc          0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:ssh             0.0.0.0:*               LISTEN
tcp        0      0 localhost:postgres      0.0.0.0:*               LISTEN
tcp6       0      0 [::]:sunrpc             [::]:*                  LISTEN
tcp6       0      0 [::]:ssh                [::]:*                  LISTEN
tcp6       0      0 localhost:postgres      [::]:*                  LISTEN

[root@db2 ~]# netstat -lt
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 0.0.0.0:sunrpc          0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:ssh             0.0.0.0:*               LISTEN
tcp        0      0 localhost:postgres      0.0.0.0:*               LISTEN
tcp6       0      0 [::]:sunrpc             [::]:*                  LISTEN
tcp6       0      0 [::]:ssh                [::]:*                  LISTEN
tcp6       0      0 localhost:postgres      [::]:*                  LISTEN

High Availability

Warning: DO NOT BUILD A CLUSTER WITHOUT PROPER, WORKING AND TESTED FENCING. Fencing is an absolutely critical part of clustering. Without fully working fence devices, your cluster will fail.

Fencing is one of the mandatory pieces you need when building a highly available cluster for your database.

It’s the ability to isolate a node from the cluster.

Should the master stop answering the cluster, successful fencing is the only way to be sure of its status: shut down, or unable to accept new work or touch data. It avoids countless situations where you would end up with split-brain scenarios or data corruption.

The documentation provides best practices and examples: http://clusterlabs.github.io/PAF/fencing.html

root@db1: pcs stonith list

fence_amt_ws - Fence agent for AMT (WS)
fence_apc - Fence agent for APC over telnet/ssh
fence_apc_snmp - Fence agent for APC, Tripplite PDU over SNMP
fence_bladecenter - Fence agent for IBM BladeCenter
fence_brocade - Fence agent for HP Brocade over telnet/ssh
fence_cisco_mds - Fence agent for Cisco MDS
fence_cisco_ucs - Fence agent for Cisco UCS
fence_compute - Fence agent for the automatic resurrection of OpenStack compute instances
fence_drac5 - Fence agent for Dell DRAC CMC/5
fence_eaton_snmp - Fence agent for Eaton over SNMP
fence_emerson - Fence agent for Emerson over SNMP
fence_eps - Fence agent for ePowerSwitch
fence_evacuate - Fence agent for the automatic resurrection of OpenStack compute instances
fence_heuristics_ping - Fence agent for ping-heuristic based fencing
fence_hpblade - Fence agent for HP BladeSystem

postgresql.conf

listen_addresses = '*'
wal_level = hot_standby            # named "replica" from PostgreSQL 9.6 on
synchronous_commit = on
archive_mode = on
archive_command = 'cp %p /var/lib/pgsql/9.1/data/pg_archive/%f'
max_wal_senders = 5
wal_keep_segments = 32
hot_standby = on
restart_after_crash = off
replication_timeout = 5000         # milliseconds; renamed wal_sender_timeout in PostgreSQL 9.3
wal_receiver_status_interval = 2   # seconds
max_standby_streaming_delay = -1
max_standby_archive_delay = -1
hot_standby_feedback = on

Network Cluster

[root@db2 /] systemctl start pcsd
[root@db2 /] systemctl start pcsd
[root@db2 /] pcs cluster auth db1 db2 -u hacluster
Password:
db1: Authorized
db2: Authorized
[root@db1 /] pcs cluster auth  db1 db2 -u hacluster
[root@db1 /] pcs cluster setup --name db_cl-azr db1 db2 --token 30000

Destroying cluster on nodes: db1, db2...

db2: Stopping Cluster (pacemaker)...
db1: Stopping Cluster (pacemaker)...
db2: Successfully destroyed cluster
db1: Successfully destroyed cluster

Sending 'pacemaker_remote authkey' to 'db1', 'db2'
db1: successful distribution of the file 'pacemaker_remote authkey'
db2: successful distribution of the file 'pacemaker_remote authkey'
Sending cluster config files to the nodes...
db1: Succeeded
db2: Succeeded

Synchronizing pcsd certificates on nodes db1, db2...
db1: Success
db2: Success
Restarting pcsd on the nodes in order to reload the certificates...
db1: Success
db2: Success

[root@db1 ~] pcs cluster start --all

db1: Starting Cluster (corosync)...
db2: Starting Cluster (corosync)...
db2: Starting Cluster (pacemaker)...
db1: Starting Cluster (pacemaker)...

[root@db2 /] pcs cluster start --all
db1: Starting Cluster (corosync)...
db2: Starting Cluster (corosync)...
db2: Starting Cluster (pacemaker)...
db1: Starting Cluster (pacemaker)...

[root@db1 ~] pcs status
Cluster name: db_cl-azr

WARNINGS:
No stonith devices and stonith-enabled is not false

Stack: corosync
Current DC: db2 (version 1.1.19-8.el7-c3c624ea3d) - partition with quorum
Last updated: Wed Mar 20 16:58:33 2019
Last change: Wed Mar 20 16:56:53 2019 by hacluster via crmd on db2

2 nodes configured
0 resources configured

Online: [ db1 db2 ]

No resources


Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled

[root@db2 /] pcs status
Cluster name: db_cl-azr

WARNINGS:
No stonith devices and stonith-enabled is not false

Stack: corosync
Current DC: db2 (version 1.1.19-8.el7-c3c624ea3d) - partition with quorum
Last updated: Wed Mar 20 16:59:39 2019
Last change: Wed Mar 20 16:56:53 2019 by hacluster via crmd on db2

2 nodes configured
0 resources configured

Online: [ db1 db2 ]

No resources


Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled

Pacemaker cluster preparation

The quick start shows how to implement network redundancy in Corosync, but redundancy fits best at the operating system layer. Documentation on how to set up network bonding or teaming is widely available on the internet. Have a look at our basic administration cookbooks for CentOS 7 using pcs.

$ yum install -y pacemaker resource-agents pcs fence-agents-all fence-agents-virsh

We’ll later create one fencing resource per node to fence. They are called fence_vm_xxx and use the fencing agent fence_virsh, which powers a virtual machine on or off using the virsh command through an ssh connection to the hypervisor. You’ll need to make sure your VMs are able to connect as root (it is possible to use a normal user with some more setup though) to your hypervisor.

Install the latest PAF version, directly from the PGDG repository:

$ yum install -y resource-agents-paf

It is advised to keep Pacemaker off on server boot. It helps the administrator to investigate after a node fencing before Pacemaker starts and potentially enters a death match with the other nodes. Make sure to disable Corosync as well to avoid unexpected behaviors.

Run this on all nodes:

systemctl disable corosync
systemctl disable pacemaker

Let’s use the cluster management tool pcsd, provided by RHEL, to ease the creation and setup of a cluster.

It lets you create the cluster from the command line, without editing configuration files or XML by hand.

pcsd uses the hacluster system user to work and communicate with other members of the cluster.

passwd hacluster
systemctl enable pcsd
systemctl start pcsd

Now, authenticate each node to the other ones using the following command:

$ pcs cluster auth server1 server2 -u hacluster
Password:
server1: Authorized
server2: Authorized

Create and start the cluster:

$ pcs cluster setup --name cluster_pgsql server1 server2

Destroying cluster on nodes: server1, server2...
server1: Stopping Cluster (pacemaker)...
server2: Stopping Cluster (pacemaker)...
server1: Successfully destroyed cluster
server2: Successfully destroyed cluster

Sending 'pacemaker_remote authkey' to 'server1', 'server2'
server1: successful distribution of the file 'pacemaker_remote authkey'
server2: successful distribution of the file 'pacemaker_remote authkey'
Sending cluster config files to the nodes...
server1: Succeeded
server2: Succeeded

Synchronizing pcsd certificates on nodes server1, server2...
server1: Success
server2: Success
Restarting pcsd on the nodes in order to reload the certificates...
server1: Success
server2: Success

$ pcs cluster start --all
server2: Starting Cluster...
server1: Starting Cluster...

Check the cluster status:

$ pcs status
Cluster name: cluster_pgsql
WARNING: no stonith devices and stonith-enabled is not false
Stack: corosync
Current DC: server1 (version 1.1.16-12.el7_4.5-94ff4df) - partition with quorum
Last updated: ...
Last change: ... by hacluster via crmd on server2

2 nodes configured
0 resources configured

Online: [ server1 server2 ]

No resources

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled

Now that the cluster is running, let’s start with some basic setup of the cluster.

Run the following command from one node only (the cluster takes care of broadcasting the configuration to all nodes):

$ pcs resource defaults migration-threshold=3
$ pcs resource defaults resource-stickiness=10

This sets two default values for resources we’ll create in the next chapter:

migration-threshold: this controls how many times the cluster tries to recover a resource on the same node before moving it to another one.

resource-stickiness: adds a sticky score for the resource on its current node. It helps prevent a resource from moving back and forth between nodes where it has the same score.
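To make the effect of migration-threshold concrete, here is a toy model of the decision. It is only a sketch under the assumption that a node stops being eligible once the resource's fail count on it reaches the threshold; the function name is hypothetical and this is not how Pacemaker is implemented internally.

```shell
#!/bin/sh
# placement_after_failures FAILCOUNT THRESHOLD
# Prints "stay" while the current node may still host the resource and
# "move" once the fail count has reached the threshold.
placement_after_failures() {
    if [ "$1" -ge "$2" ]; then
        echo move
    else
        echo stay
    fi
}

# With migration-threshold=3 as set above:
for failures in 1 2 3; do
    echo "failures=$failures -> $(placement_after_failures "$failures" 3)"
done
# failures=1 -> stay
# failures=2 -> stay
# failures=3 -> move
```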

Node level fencing devices

Before we get into the configuration details, you need to pick a fencing device for the node level fencing. There are quite a few to choose from. If you want to see the list of stonith devices which are supported just run:

stonith -L

Stonith devices may be classified into five categories:

  • UPS (Uninterruptible Power Supply)

  • PDU (Power Distribution Unit)

  • Blade power control devices

  • Lights-out devices

  • Testing devices

The choice depends mainly on your budget and the kind of hardware. For instance, if you’re running a cluster on a set of blades, then the power control device in the blade enclosure is the only candidate for fencing. Of course, this device must be capable of managing single blade computers.

The lights-out devices (IBM RSA, HP iLO, Dell DRAC) are becoming increasingly popular, and in the future they may even become standard equipment of off-the-shelf computers. They are, however, inferior to UPS devices, because they share a power supply with their host (a cluster node). If a node stays without power, the device supposed to control it would be just as useless. Even though this is obvious to us, the cluster manager is not in the know and will try to fence the node in vain. This will continue forever because all other resource operations would wait for the fencing/stonith operation to succeed.

The testing devices are used exclusively for testing purposes. They are usually more gentle on the hardware. Once the cluster goes into production, they must be replaced with real fencing devices.

STONITH (Shoot The Other Node In The Head)

Stonith is our fencing implementation. It provides the node level fencing.

NB

The stonith and fencing terms are often used interchangeably here as well as in other texts.

The stonith subsystem consists of two components:

  • pacemaker-fenced

  • stonith plugins

pacemaker-fenced

pacemaker-fenced is a daemon which may be accessed by the local processes or over the network. It accepts commands which correspond to fencing operations: reset, power-off, and power-on. It may also check the status of the fencing device.

pacemaker-fenced runs on every node in the CRM HA cluster. The pacemaker-fenced instance running on the DC node receives a fencing request from the CRM. It is up to this and other pacemaker-fenced programs to carry out the desired fencing operation.

Stonith plugins

For every supported fencing device there is a stonith plugin which is capable of controlling that device. A stonith plugin is the interface to the fencing device. All stonith plugins look the same to pacemaker-fenced, but are quite different on the other side reflecting the nature of the fencing device.

Some plugins support more than one device. A typical example is ipmilan (or external/ipmi) which implements the IPMI protocol and can control any device which supports this protocol.

CRM stonith configuration

The fencing configuration consists of one or more stonith resources.

A stonith resource is a resource of class stonith and it is configured just like any other resource. The list of parameters (attributes) depends on and is specific to a stonith type. Use the stonith(1) program to see the list:

$ stonith -t ibmhmc -n
ipaddr
$ stonith -t ipmilan -n
hostname  ipaddr  port  auth  priv  login  password reset_method

NB

It is easy to guess the class of a fencing device from the set of attribute names.

A short help text is also available:

$ stonith -t ibmhmc -h
STONITH Device: ibmhmc - IBM Hardware Management Console (HMC)
Use for IBM i5, p5, pSeries and OpenPower systems managed by HMC
  Optional parameter name managedsyspat is white-space delimited
list of patterns used to match managed system names; if last
character is '*', all names that begin with the pattern are matched
  Optional parameter name password is password for hscroot if
passwordless ssh access to HMC has NOT been setup (to do so,
it is necessary to create a public/private key pair with
empty passphrase - see "Configure the OpenSSH client" in the
redbook for more details)
For more information see
http://publib-b.boulder.ibm.com/redbooks.nsf/RedbookAbstracts/SG247038.html

You just said that there is pacemaker-fenced and stonith plugins. What’s with these resources now?

Resources of class stonith are just a representation of stonith plugins in the CIB. Well, a bit more: apart from the fencing operations, the stonith resources, just like any other, may be started and stopped and monitored. The start and stop operations are a bit of a misnomer: enable and disable would serve better, but it’s too late to change that. So, these two are actually administrative operations and do not translate to any operation on the fencing device itself. Monitor, however, does translate to device status.

A dummy stonith resource configuration, which may be used in some testing scenarios is very simple:

configure
primitive st-null stonith:null \
        params hostlist="node1 node2"
clone fencing st-null
commit

NB

All configuration examples are in the crm configuration tool syntax. To apply them, put the sample in a text file, say sample.txt and run:

crm < sample.txt

The configure and commit lines are omitted from further examples.

An alternative configuration:

primitive st-node1 stonith:null \
        params hostlist="node1"
primitive st-node2 stonith:null \
        params hostlist="node2"
location l-st-node1 st-node1 -inf: node1
location l-st-node2 st-node2 -inf: node2

This configuration is perfectly alright as far as the cluster software is concerned. The only difference to a real world configuration is that no fencing operation takes place.

A more realistic configuration, though still only for testing, is the following external/ssh configuration:

primitive st-ssh stonith:external/ssh \
        params hostlist="node1 node2"
clone fencing st-ssh

This one can also reset nodes. As you can see, this configuration is remarkably similar to the first one which features the null stonith device.

What is this clone thing?

Clones are a CRM/Pacemaker feature. A clone is basically a shortcut: instead of defining n identical, yet differently named resources, a single cloned resource suffices. By far the most common use of clones is with stonith resources if the stonith device is accessible from all nodes.

The real device configuration is not much different, though some devices may require more attributes. For instance, an IBM RSA lights-out device might be configured like this:

primitive st-ibmrsa-1 stonith:external/ibmrsa-telnet \
        params nodename=node1 ipaddr=192.168.0.101 \
        userid=USERID passwd=PASSW0RD
primitive st-ibmrsa-2 stonith:external/ibmrsa-telnet \
        params nodename=node2 ipaddr=192.168.0.102 \
        userid=USERID passwd=PASSW0RD
# st-ibmrsa-1 can run anywhere but on node1
location l-st-node1 st-ibmrsa-1 -inf: node1
# st-ibmrsa-2 can run anywhere but on node2
location l-st-node2 st-ibmrsa-2 -inf: node2

Why those strange location constraints?

There is always a certain probability that the stonith operation is going to fail. Hence, a stonith operation on the node which is also the executioner is not reliable. If the node is reset, then it cannot send the notification about the fencing operation outcome.
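For clusters with more than a couple of nodes, writing one primitive and one location constraint per node by hand gets repetitive. The sketch below generates the per-node pattern in crm syntax for a list of node names. It is an assumption-laden illustration: the function name is hypothetical, and it emits the stonith:null testing device, so the output is only suitable for test setups until the device type and its parameters are replaced with real ones.

```shell
#!/bin/sh
# gen_per_node_stonith NODE...
# Prints, for each node, a stonith primitive that targets the node and a
# location constraint that keeps the primitive off that same node.
gen_per_node_stonith() {
    for node in "$@"; do
        printf 'primitive st-%s stonith:null \\\n' "$node"
        printf '        params hostlist="%s"\n' "$node"
        printf 'location l-st-%s st-%s -inf: %s\n' "$node" "$node" "$node"
    done
}

gen_per_node_stonith node1 node2
```

The generated text can be reviewed and then placed between configure and commit lines in a file fed to crm, in the same way as the other samples in this section.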

If you haven’t already guessed, configuration of a UPS kind of fencing device is remarkably similar to all we have already shown.

All UPS devices employ the same mechanics for fencing. What is, however, different is how the device itself is accessed. Old UPS devices, those that were considered professional, used to have just a serial port, typically connected at 1200baud using a special serial cable. Many new ones still come equipped with a serial port, but often they also sport a USB interface or an Ethernet interface. The kind of connection we may make use of depends on what the plugin supports. Let’s see a few examples for the APC UPS equipment:

$ stonith -t apcmaster -h

STONITH Device: apcmaster - APC MasterSwitch (via telnet)
NOTE: The APC MasterSwitch accepts only one (telnet)
connection/session a time. When one session is active,
subsequent attempts to connect to the MasterSwitch will fail.
For more information see http://www.apc.com/
List of valid parameter names for apcmaster STONITH device:
        ipaddr
                login
                password

$ stonith -t apcsmart -h

STONITH Device: apcsmart - APC Smart UPS
 (via serial port - NOT USB!).
 Works with higher-end APC UPSes, like
 Back-UPS Pro, Smart-UPS, Matrix-UPS, etc.
 (Smart-UPS may have to be >= Smart-UPS 700?).
 See http://www.networkupstools.org/protocols/apcsmart.html
 for protocol compatibility details.
For more information see http://www.apc.com/
List of valid parameter names for apcsmart STONITH device:
                ttydev
                hostlist

The former plugin supports APC UPS with a network port and telnet protocol. The latter plugin uses the APC SMART protocol over the serial line which is supported by many different APC UPS product lines.

So, what do I use: clones, constraints, both?

It depends. It depends on the nature of the fencing device. For example, if the device cannot serve more than one connection at a time, then clones won’t do. It depends on how many hosts the device can manage. If it’s only one, and that is always the case with lights-out devices, then again clones are right out. It also depends on the number of nodes in your cluster: the more nodes, the more desirable it is to use clones. Finally, it is also a matter of personal preference.

In short: if clones are safe to use with your configuration and if they reduce the configuration, then make cloned stonith resources.

The CRM configuration is left as an exercise to the reader.

Monitoring the fencing devices

Just like any other resource, the stonith class agents also support the monitor operation. Given that we have often seen monitor either not configured or configured in a wrong way, we have decided to devote a section to the matter.

Monitoring stonith resources, which is actually checking status of the corresponding fencing devices, is strongly recommended. So strongly, that we should consider a configuration without it invalid.

On the one hand, though an indispensable part of an HA cluster, a fencing device, being the last line of defense, is seldom used. Very seldom and preferably never. On the other, for whatever reason, the power management equipment is known to be rather fragile on the communication side. Some devices were known to give up if there was too much broadcast traffic on the wire. Some cannot handle more than ten or so connections per minute. Some get confused or depressed if two clients try to connect at the same time. Most cannot handle more than one session at a time. The bottom line: try not to exercise your fencing device too often. It may not like it. Use monitoring regularly, yet sparingly, say once every couple of hours. The probability that within those few hours there will be a need for a fencing operation and that the power switch would fail is usually low.

Odd plugins

Apart from plugins which handle real devices, some stonith plugins are a bit out of line and deserve special attention.

external/kdumpcheck

Sometimes, it may be important to get a kernel core dump. This plugin may be used to check if the dump is in progress. If that is the case, then it will return true, as if the node has been fenced, which is actually true given that it cannot run any resources at the time. kdumpcheck is typically used in concert with another, real, fencing device. See README_kdumpcheck.txt for more details.

external/sbd

This is a self-fencing device. It reacts to a so-called "poison pill" which may be inserted into a shared disk. On shared storage connection loss, it also makes the node commit suicide. See http://www.linux-ha.org/wiki/SBD_Fencing for more details.

meatware

Strange name and a simple concept. meatware requires help from a human to operate. Whenever invoked, meatware logs a CRIT severity message which should show up on the node’s console. The operator should then make sure that the node is down and issue a meatclient(8) command to tell meatware that it’s OK to tell the cluster that it may consider the node dead. See README.meatware for more information.

null

This one is probably not of much importance to the general public. It is used in various testing scenarios. null is an imaginary device which always behaves and always claims that it has shot a node, but never does anything. Sort of a happy-go-lucky. Do not use it unless you know what you are doing.

suicide

suicide is a software-only device, which can reboot a node it is running on. It depends on the operating system, so it should be avoided whenever possible. But it is OK on one-node clusters. suicide and null are the only exceptions to the "don’t shoot my host" rule.

What about that pacemaker-fenced? You forgot about it, eh?

The pacemaker-fenced daemon, though it is really the master of ceremony, requires no configuration itself. All configuration is stored in the CIB.

Resources

http://www.linux-ha.org/wiki/STONITH

http://www.clusterlabs.org/doc/crm_fencing.html

http://www.clusterlabs.org/doc/en-US/Pacemaker/1.0/html/Pacemaker_Explained

http://techthoughts.typepad.com/managing_computers/2007/10/split-brain-quo.html

