A WIP doc page on Ceph Mon recovery when running Rook.

Rook <= 0.7.0 mon hostNetwork: true node IP issue

NOTE If you need assistance with steps 3 through 5, let us know on the Rook Slack and we are happy to help you.

WARNING You should not need this section if hostNetwork is false or was never set.

WARNING If you have or had multiple Filesystems created, this guide may not work for you because of a bug in Ceph that causes the mons to crash on the "FS Map" assertion.

  1. Scale rook-operator down (e.g. replicas: 0).
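  2. For each mon, copy the existing command and args values of the mon container somewhere safe, then edit its rook-ceph-mon ReplicaSet so that the mon container runs command: ['sleep', '3600'] instead.
    • Editing a ReplicaSet does not change pods that are already running, so delete each mon pod afterwards to have it recreated with the sleep command.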
  3. Exec into the first mon and run: monmaptool --print /var/lib/rook/rook-ceph-mon$MON_ID/monmap.
    • MON_ID is the ID of the mon you exec'd into.
  4. Check if the IPs listed for the mons are correct.
    • For hostNetwork: false, they need to be the Service IPs.
    • For hostNetwork: true, they need to be the node IPs.
  5. If a mon has a wrong IP, run monmaptool --rm MON_NAME MONMAP_PATH to remove it, where MON_NAME is the name of the mon (e.g. rook-ceph-mon3) and MONMAP_PATH is the monmap file from step 3.
    • This step needs to be done for every mon.
  6. Re-add every mon that had a wrong IP or is missing by running monmaptool --add MON_NAME MON_IP:6790 MONMAP_PATH with the correct IP, where MON_NAME is the name of the mon (e.g. rook-ceph-mon4).
    • This step needs to be done for every mon.
  7. Now that the monmap has been corrected, inject it by running export MON_NAME="rook-ceph-mon$ID" && ceph-mon --name=mon.$MON_NAME --inject-monmap /var/lib/rook/$MON_NAME/monmap --mon-data=/var/lib/rook/$MON_NAME/data --conf=/var/lib/rook/$MON_NAME/rook.config --keyring=/var/lib/rook/$MON_NAME/keyring. Here $ID is the number at the end of the mon's name, e.g. the ID of rook-ceph-mon3 is 3.
    • This step needs to be done for every mon.
  8. After the monmap has been corrected and injected into each mon, edit the rook-ceph-mon ReplicaSets again and replace command: ['sleep', '3600'] with the original command and args that you saved in step 2.
  9. Your cluster should now return to a healthy state.
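
Steps 5 to 7 are easy to mistype, so it can help to generate the commands first and review them before running anything inside a mon pod. The following is only a dry-run sketch: it prints commands and executes nothing. It assumes that the "monmap" command in the steps refers to Ceph's monmaptool, and that each mon keeps its files under /var/lib/rook/MON_NAME as in step 7. The function names print_monmap_fix and print_inject and the IP 192.0.2.11 are made up for illustration.

```shell
#!/bin/sh
# Dry-run helper: prints the commands for steps 5 to 7, it does not run them.
# Assumption: "monmap" in the steps means Ceph's monmaptool.

MON_PORT=6790   # mon port used in step 6

# print_monmap_fix MONMAP_PATH MON_NAME MON_IP
# Prints the rm/add pair that replaces one mon entry in a monmap file.
print_monmap_fix() {
  printf 'monmaptool --rm %s %s\n' "$2" "$1"
  printf 'monmaptool --add %s %s:%s %s\n' "$2" "$3" "$MON_PORT" "$1"
}

# print_inject MON_NAME
# Prints the ceph-mon call from step 7 for the mon you are exec'd into.
print_inject() {
  printf 'ceph-mon --name=mon.%s' "$1"
  printf ' --inject-monmap /var/lib/rook/%s/monmap' "$1"
  printf ' --mon-data=/var/lib/rook/%s/data' "$1"
  printf ' --conf=/var/lib/rook/%s/rook.config' "$1"
  printf ' --keyring=/var/lib/rook/%s/keyring\n' "$1"
}

# Example: inside rook-ceph-mon0, fix the entry of rook-ceph-mon1.
# 192.0.2.11 is a placeholder, use the real node IP or Service IP.
MONMAP=/var/lib/rook/rook-ceph-mon0/monmap
print_monmap_fix "$MONMAP" rook-ceph-mon1 192.0.2.11
print_inject rook-ceph-mon0
```

Printing instead of executing keeps the sketch safe to run anywhere. Once the output looks right, the lines can be pasted into the mon pod one at a time.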

Rook Mon Quorum broken

First of all: yes, Ceph mons need a quorum to function.
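
As a rough illustration of what quorum means in numbers: the mons can form a quorum only while a strict majority of the mons listed in the monmap are up and can reach each other. The sketch below computes that majority. quorum_size is a made-up helper name, not a Rook or Ceph command.

```shell
# quorum_size TOTAL_MONS
# Prints the smallest number of mons that is a strict majority.
quorum_size() {
  echo $(( $1 / 2 + 1 ))
}

quorum_size 3   # prints 2
quorum_size 5   # prints 3
```

With three mons, losing one still leaves a quorum, while losing two does not.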

  1. TODO