Created July 5, 2017 16:04
Recover a rabbitmq cluster after partitioning
Some notes from engineering.
1) Identify the partition
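# the rabbit log on a partitioned node shows an error like this one.
Mnesia('rabbit@juju-machine-30-lxd-11'): ** ERROR ** mnesia_event got {inconsistent_database, running_partitioned_network, 'rabbit@juju-machine-29-lxd-9'}
# cluster_status lists the affected nodes under "partitions".
$ sudo rabbitmqctl cluster_status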
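As a rough sketch, the check in step 1 could be scripted. The helper below (has_partitions is a hypothetical name, not a rabbitmqctl feature) assumes the Erlang-term output printed by older 3.x releases, where a healthy cluster shows {partitions,[]}. Later releases print a different layout, so the pattern would need adjusting there.

```shell
# has_partitions: hypothetical helper, reads `rabbitmqctl cluster_status`
# output on stdin (assumed to be the older Erlang-term format).
# Returns 0 if a non-empty partitions list is found, 1 if the list is
# empty, 2 if no partitions field is present at all.
has_partitions() {
    # Drop whitespace so a term wrapped across lines still matches.
    out=$(tr -d ' \n\t')
    case "$out" in
        *'{partitions,[]}'*) return 1 ;;  # empty list: no partition
        *'{partitions,['*)   return 0 ;;  # non-empty list: partitioned
        *)                   return 2 ;;  # unknown output format
    esac
}
```

Possible usage on a node: sudo rabbitmqctl cluster_status | has_partitions && echo "partition detected"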
2) Pick the most reliable node as the master, using one or more of the criteria below.
# most established AMQP connections.
$ juju run --service rabbitmq-server "sudo ss -t state established -nt '( sport = :amqp )' | wc -l"
# latest mnesia modification.
$ juju run --service rabbitmq-server 'sudo find /var/lib/rabbitmq/mnesia -type f -printf "%T+ %p\n" | sort | tail -n 1'
# most messages in the openstack queues.
$ juju run --service rabbitmq-server "sudo rabbitmqctl list_queues -p openstack messages | awk '{s+=\$1}END{print s}'"
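Once a number has been collected per unit (connections or queued messages), picking the winner can be sketched as below. The "name value" line format and the pick_max name are assumptions made for illustration; the juju run output has to be reduced to that shape by hand or by another script first.

```shell
# pick_max: hypothetical helper. Reads lines of the form "<name> <number>"
# on stdin and prints the name with the largest number.
# On a tie, the first name seen wins. Lines with fewer than two fields
# are ignored.
pick_max() {
    awk 'NF >= 2 && (best == "" || $2 + 0 > max) { max = $2 + 0; best = $1 }
         END { if (best != "") print best }'
}
```

For example: printf 'rabbitmq-server/0 12\nrabbitmq-server/1 40\n' | pick_max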
3) Stop all the epmd/erl processes on the non-master nodes
$ sudo service rabbitmq-server stop
$ sudo killall epmd
Check that no rabbitmq-related process remains alive (the command below should print nothing).
$ sudo ps -U rabbitmq -o pid --no-heading
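If the processes take a moment to exit, the check can be repeated in a loop. This is only a sketch: wait_until_gone is a hypothetical helper, it relies on the procps ps options used above, and it treats empty ps output (including a ps error) as "nothing left running".

```shell
# wait_until_gone: hypothetical helper.
#   $1 = user name whose processes should disappear
#   $2 = number of one-second attempts (default 10)
# Returns 0 once ps prints no PID for the user, 1 if attempts run out.
wait_until_gone() {
    user=$1
    tries=${2:-10}
    while [ "$tries" -gt 0 ]; do
        # Empty output means ps listed no process owned by $user.
        if [ -z "$(ps -U "$user" -o pid --no-heading 2>/dev/null)" ]; then
            return 0
        fi
        tries=$((tries - 1))
        sleep 1
    done
    return 1
}
```

Possible usage: wait_until_gone rabbitmq 30 || echo "rabbitmq processes still running"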
4) On the same non-master nodes, move the mnesia directory aside, start the service and stop the app
$ sudo mv /var/lib/rabbitmq/mnesia /var/lib/rabbitmq/mnesia-back
$ sudo service rabbitmq-server start
$ sudo rabbitmqctl stop_app
Check that the node starts unclustered ($ sudo rabbitmqctl cluster_status)
5) (Not required, but desirable) Forget the trashed cluster nodes from the master. The master must be running; the node being forgotten must be offline.
$ sudo rabbitmqctl forget_cluster_node rabbit@trashed-slave
6) On each slave unit (app still stopped from step 4), join the master.
$ sudo rabbitmqctl join_cluster rabbit@master
$ sudo rabbitmqctl start_app
7) Check that the cluster status is healthy ($ sudo rabbitmqctl cluster_status).
8) (Suggested) Switch the config option cluster-partition-handling to autoheal.