niedbalski/recover-rabbit.sh

## recover-rabbit.sh
Some notes from engineering.

1) Identify the partition

Mnesia('rabbit@juju-machine-30-lxd-11'): ** ERROR ** mnesia_event got {inconsistent_database, running_partitioned_network, 'rabbit@juju-machine-29-lxd-9'}

$ sudo rabbitmqctl cluster_status
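
To list which peers Mnesia reported as partitioned, the error line above can be filtered out of the broker logs. This is a minimal sketch: the function name `partitioned_peers` is hypothetical, and the log location in the usage comment is an assumption that may differ on your install.

```shell
#!/bin/sh
# Hypothetical helper: read log text on stdin and print each peer node name
# that appears in a running_partitioned_network event, one per line, deduplicated.
partitioned_peers() {
    sed -n "s/.*running_partitioned_network, *'\([^']*\)'.*/\1/p" | sort -u
}

# Example usage (the log path is an assumption):
#   sudo cat /var/log/rabbitmq/*.log | partitioned_peers
```

Any node printed here is one that the local node lost contact with while both were running, so it is a candidate for the reset in the later steps.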

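2) Pick the most reliable node as the master, for example the one with the most connections, the latest mnesia modification, or the most queued messages.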

# connections.
$ juju run --service rabbitmq-server "sudo ss -t state established -nt '( sport = :amqp )' | wc -l"


# latest mnesia modification.
$ juju run --service rabbitmq-server 'sudo find /var/lib/rabbitmq/mnesia -type f | xargs ls -ltr | tail -n 1 | cut -d " " -f13 | xargs -I {} stat -c "%y" {}'

# most messages in the openstack queues.
$ juju run --service rabbitmq-server "sudo rabbitmqctl list_queues -p openstack messages | awk '{s+=\$1}END{print s}'"
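
Once the numbers from the commands above are collected, comparing them can be scripted. The sketch below assumes you have already reduced the `juju run` output to lines of the form `<unit> <value>`; that input format and the name `pick_max` are hypothetical, not something `juju` produces directly.

```shell
#!/bin/sh
# Hypothetical helper: read "<unit> <value>" lines on stdin and print the
# unit with the highest numeric value. On a tie, the first unit seen wins.
pick_max() {
    awk 'NF >= 2 && (best == "" || $2 + 0 > max) { max = $2 + 0; best = $1 }
         END { if (best != "") print best }'
}

# Example usage with hand-collected connection counts:
#   printf '%s\n' 'rabbitmq-server/0 12' 'rabbitmq-server/1 40' | pick_max
```

No single metric is decisive; it is probably worth running this once per metric and checking whether the answers agree before choosing the master.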


3) Stop all the epmd/erl processes on the non-master nodes

$ sudo /etc/init.d/rabbitmq-server stop
$ sudo killall epmd

Check that no rabbitmq related process remains alive.

$ sudo ps -U rabbitmq -o pid --no-heading
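
The check above can be wrapped in a short polling loop so that the next step only runs once everything has exited. This is a sketch under assumptions: the function names `count_procs` and `wait_for_no_procs` are hypothetical, and `-o pid=` is used here instead of `--no-heading` to suppress the header.

```shell
#!/bin/sh
# Hypothetical helper: print how many processes are owned by user $1.
# Note: a user that does not exist also yields 0, because ps errors are ignored.
count_procs() {
    ps -U "$1" -o pid= 2>/dev/null | wc -l | tr -d ' '
}

# Hypothetical helper: poll once per second, up to $2 times (default 10),
# until user $1 owns no processes. Returns 0 on success, 1 on timeout.
wait_for_no_procs() {
    user=$1
    tries=${2:-10}
    while [ "$tries" -gt 0 ]; do
        [ "$(count_procs "$user")" -eq 0 ] && return 0
        sleep 1
        tries=$((tries - 1))
    done
    return 1
}

# Example usage:
#   wait_for_no_procs rabbitmq 30 || echo "rabbitmq processes still running" >&2
```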

4) Move the mnesia directory aside, start the service, and stop the app

$ sudo mv /var/lib/rabbitmq/mnesia /var/lib/rabbitmq/mnesia-back
$ sudo service rabbitmq-server start
$ sudo rabbitmqctl stop_app

Check that the node starts unclustered.

$ sudo rabbitmqctl cluster_status

5) (Not required, but desirable) On the master, forget the trashed cluster nodes

$ sudo rabbitmqctl stop_app
$ sudo rabbitmqctl forget_cluster_node rabbit@trashed-slave
$ sudo rabbitmqctl start_app

6) Join the master from the slave units.

$ sudo rabbitmqctl join_cluster rabbit@master
$ sudo rabbitmqctl start_app
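
The per-node commands from the reset and join steps can be chained so that a failure stops the sequence. This is only a sketch: `rejoin_node`, `run`, and the `DRY_RUN` variable are hypothetical names introduced here, and it assumes the broker processes on the node have already been stopped.

```shell
#!/bin/sh
# Hypothetical wrapper: with DRY_RUN set, print the command instead of running it.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical helper: reset the local node and join it to the master named in $1.
# Each command runs only if the previous one succeeded.
rejoin_node() {
    master=${1:-}
    [ -n "$master" ] || { echo "usage: rejoin_node rabbit@<master>" >&2; return 2; }
    run sudo mv /var/lib/rabbitmq/mnesia /var/lib/rabbitmq/mnesia-back &&
    run sudo service rabbitmq-server start &&
    run sudo rabbitmqctl stop_app &&
    run sudo rabbitmqctl join_cluster "$master" &&
    run sudo rabbitmqctl start_app
}

# Example usage, previewing the commands first:
#   DRY_RUN=1; rejoin_node rabbit@master
```

Previewing with `DRY_RUN` set is a cheap way to confirm the master name before the mnesia directory is moved.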

7) Check that the cluster status is healthy.

https://pastebin.canonical.com/185576/

8) (Suggested) Switch the config option cluster-partition-handling to autoheal.
https://github.com/openstack/charm-rabbitmq-server/blob/master/config.yaml#L95
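
Changing the option could look like the sketch below. It assumes the option is spelled `cluster-partition-handling` in the charm's config.yaml and that the application is deployed as `rabbitmq-server`, as in the `juju run --service` calls earlier; the function name `enable_autoheal` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: ask the charm to configure RabbitMQ with autoheal.
enable_autoheal() {
    # Juju 1.x syntax, matching the `juju run --service` style used in these notes.
    juju set rabbitmq-server cluster-partition-handling=autoheal
    # On Juju 2.x and later the equivalent would be:
    #   juju config rabbitmq-server cluster-partition-handling=autoheal
}

# Example usage:
#   enable_autoheal
```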