Created July 5, 2017 16:04
Recover a rabbitmq cluster after partitioning
Some notes from engineering.
1) Identify the partition. A partitioned node logs an error like the one below, and the cluster status lists the affected nodes under partitions.
Mnesia('rabbit@juju-machine-30-lxd-11'): ** ERROR ** mnesia_event got {inconsistent_database, running_partitioned_network, 'rabbit@juju-machine-29-lxd-9'}
$ sudo rabbitmqctl cluster_status
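A minimal sketch of scripting the partition check. It assumes the Erlang-term output that rabbitmqctl cluster_status prints on RabbitMQ 3.x releases of this era, where a healthy cluster shows {partitions,[]}. The function name has_partitions is hypothetical, not part of any RabbitMQ tooling.

```shell
# Reads `rabbitmqctl cluster_status` output on stdin.
# Returns 0 if the partitions list is non-empty, non-zero otherwise.
# Assumption: the status is printed as Erlang terms, e.g. {partitions,[]}.
has_partitions() {
    # Drop whitespace so "{partitions, []}" and "{partitions,[]}" look alike,
    # then match a partitions list whose first character is not "]".
    tr -d ' \t\n' | grep -q '{partitions,\[[^]]'
}
```

Usage on a node: sudo rabbitmqctl cluster_status | has_partitions && echo "partition detected"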
2) Pick the most reliable node as the master by comparing the units on the criteria below.
# Most established AMQP connections.
$ juju run --service rabbitmq-server "sudo ss -t state established -nt '( sport = :amqp )' | wc -l"
# Latest mnesia modification.
$ juju run --service rabbitmq-server 'sudo find /var/lib/rabbitmq/mnesia -type f | xargs ls -ltr | tail -n 1 | cut -d " " -f13 | xargs -I {} stat -c "%y" {}'
# Most messages in the openstack queues.
$ juju run --service rabbitmq-server "sudo rabbitmqctl list_queues -p openstack messages | awk '{s+=\$1}END{print s}'"
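Once the per-unit numbers are collected, picking the unit with the highest value can be sketched as below. The input format (one "unit value" pair per line) is an assumption: it is not what juju run prints, so the pairs would have to be assembled by hand or by a wrapper. The name pick_max is hypothetical.

```shell
# Reads "unit value" pairs on stdin and prints the unit with the largest value.
# Assumption: one pair per line, separated by whitespace; ties keep the first unit seen.
pick_max() {
    awk 'NF >= 2 && (!seen || $2 + 0 > max) { max = $2 + 0; name = $1; seen = 1 }
         END { print name }'
}
```

For example, feeding it the connection counts per unit prints the unit holding the most client connections, which is one input to the choice of master and not the whole decision.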
3) Stop all the epmd/erl processes on the non-master nodes.
$ sudo service rabbitmq-server stop
$ sudo killall epmd
Check that no rabbitmq-related process remains alive.
$ sudo ps -U rabbitmq -o pid --no-heading
4) Move the mnesia directory aside, start the service, and stop the app.
$ sudo mv /var/lib/rabbitmq/mnesia /var/lib/rabbitmq/mnesia-back
$ sudo service rabbitmq-server start
$ sudo rabbitmqctl stop_app
Check that the node starts unclustered ($ sudo rabbitmqctl cluster_status).
5) (Not required, but desirable) Forget the cluster nodes from the master.
$ sudo rabbitmqctl stop_app
$ sudo rabbitmqctl forget_cluster_node rabbit@trashed-slave
$ sudo rabbitmqctl start_app
6) Join the master from the slave units.
$ sudo rabbitmqctl join_cluster rabbit@master
$ sudo rabbitmqctl start_app
7) Check that the cluster status is healthy.
https://pastebin.canonical.com/185576/
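The health check above can be sketched as a small function: every expected node is running and the partitions list is empty. It assumes the Erlang-term output format of rabbitmqctl cluster_status on RabbitMQ 3.x releases of this era. The name cluster_healthy and its argument are hypothetical.

```shell
# Reads `rabbitmqctl cluster_status` output on stdin.
# Usage: cluster_healthy EXPECTED_NODE_COUNT
# Returns 0 only if {partitions,[]} is present and the running_nodes list
# has exactly EXPECTED_NODE_COUNT entries.
cluster_healthy() {
    expected=$1
    # Flatten the Erlang terms so matching does not depend on line breaks.
    flat=$(tr -d ' \t\n')
    # Extract the contents of the running_nodes list and count its entries.
    running=$(printf '%s\n' "$flat" |
        sed -n 's/.*{running_nodes,\[\([^]]*\)].*/\1/p' |
        awk -F, '{ print NF }')
    case $flat in
        *'{partitions,[]}'*) ;;   # no partitions reported
        *) return 1 ;;
    esac
    [ "${running:-0}" -eq "$expected" ]
}
```

Usage on a three-unit deployment: sudo rabbitmqctl cluster_status | cluster_healthy 3 && echo "cluster healthy"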
8) (Suggested) Switch the charm config option cluster-partition-handling to autoheal: https://github.com/openstack/charm-rabbitmq-server/blob/master/config.yaml#L95