Revive a rook cluster

What this manual is trying to resolve

  1. You had a running rook/ceph cluster, your Kubernetes environment suddenly exploded, and you have to start a new Kubernetes environment and put your existing rook/ceph cluster back.
  2. You are migrating your existing rook/ceph cluster to a new Kubernetes environment, and downtime can be tolerated.

In the author's situation, the etcd data of the running Kubernetes cluster was nuked with no backup, and all OSDs are using the bluestore backend.

Prerequisites

  1. A working Kubernetes cluster without rook.
  2. The previous rook/ceph cluster is intact, meaning the data of at least one ceph-mon is intact, and all your OSD data is intact.
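
A quick way to sanity-check the second prerequisite before going any further — paths assume the default dataDirHostPath of /var/lib/rook:

node# ls /var/lib/rook/mon-a/data                          # a surviving mon store; should be non-empty
node# grep fsid /var/lib/rook/rook-ceph/rook-ceph.config   # the old cluster fsid, needed later
fsid = <your-old-cluster-fsid>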

TL;DR for Steps below

  1. Start a new and clean rook cluster, with the old CephCluster & friends.
  2. Shut it down when it seems to be working (as a brand-new cluster).
  3. Replace the ceph-mon data with the old one. Fix the fsid in rook. Fix the monmap. Disable auth.
  4. Fire it up, watch it resurrect. Fix the admin auth key.
  5. Shut it down again. Enable auth. Fire it up.

HOORAY!

Steps

  1. Assuming your old Kubernetes cluster is completely torn down, and your new Kubernetes cluster is up and running, without rook.
  2. Back up /var/lib/rook on all your rook nodes. The backups will be used later.
  3. Pick a /var/lib/rook/rook-ceph/rook-ceph.config from any node and get your old cluster fsid from its content.
  4. Remove /var/lib/rook on all your rook nodes.
  5. Install rook in your new Kubernetes cluster.
  6. Prepare identical CephCluster descriptors, especially identical spec.storage.config and spec.storage.nodes, except mon.count, which should be set to 1. Post them to your new Kubernetes cluster.
  7. Prepare identical CephFilesystem etc. descriptors (if any). Post them to your new Kubernetes cluster too.
  8. Run kubectl logs -f rook-ceph-operator-xxxxxxxxxx and wait until everything settles.
  9. Run kubectl get cm/rook-crush-config -o yaml and ensure initialCrushMapCreated is set to 1. If not, go back to step 7, set it manually, or stop here and ask for further help.
  10. STATE: Now you should have rook-ceph-mon-a, rook-ceph-mgr-a, and all the auxiliary pods up and running, and (hopefully) zero rook-ceph-osd-X pods running. Rook should not start any OSD daemon, since all devices belong to your old cluster (they have a different fsid).
  11. Run kubectl exec -it rook-ceph-mon-a-XXXXXX bash to enter your ceph-mon pod:
mon-a# cat /etc/ceph/keyring-store/keyring  # save this keyring content, for later use
mon-a# exit
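
If you would rather not keep an interactive shell open for step 11, the same keyring can be captured in one shot; a sketch assuming the default rook-ceph namespace (the pod name suffix is illustrative):

local$ kubectl -n rook-ceph exec rook-ceph-mon-a-XXXXXX -- cat /etc/ceph/keyring-store/keyring > saved-keyring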
  12. Run kubectl edit deploy/rook-ceph-operator and set replicas to 0.
  13. Run kubectl delete deploy/X, where X is every deployment in namespace rook-ceph except rook-ceph-operator and rook-ceph-tools.
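
Steps 12 and 13 can also be scripted; a minimal sketch, assuming the default rook-ceph namespace:

local$ kubectl -n rook-ceph scale deploy/rook-ceph-operator --replicas=0
local$ kubectl -n rook-ceph get deploy -o name \
    | grep -v -e rook-ceph-operator -e rook-ceph-tools \
    | xargs kubectl -n rook-ceph delete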

SSH to the host where rook-ceph-mon-a resides in your new Kubernetes cluster.

  14. Pick the latest ceph-mon directory (/var/lib/rook/mon-?) from your previous backup, and replace /var/lib/rook/mon-a with it.
  15. Replace /var/lib/rook/mon-a/keyring with the saved keyring, preserving only the [mon.] section; remove the [client.admin] section.
  16. Get your rook-ceph-mon-a address by running kubectl get cm/rook-ceph-mon-endpoints -o yaml in your new Kubernetes cluster (see the sketch after the container session below).
  17. Run docker run -it --rm -v /var/lib/rook:/var/lib/rook ceph/ceph:v14.2.1-20190430 bash (note the docker image version; it should match your deployment):
container# cd /var/lib/rook
container# ceph-mon --extract-monmap m --mon-data ./mon-a/data
container# monmaptool --print m
container# monmaptool --rm a m  # repeat this until all the old ceph-mons are removed
container# monmaptool --add a 10.77.2.216:6789 m   # Replace with your own rook-ceph-mon address!
container# ceph-mon --inject-monmap m --mon-data ./mon-a/data
container# rm m
container# exit
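
For step 16, the address can usually be pulled straight out of the configmap; the data key layout below matches Rook releases of this era, so fall back to -o yaml if yours differs:

local$ kubectl -n rook-ceph get cm/rook-ceph-mon-endpoints -o jsonpath='{.data.data}'
a=10.77.2.216:6789

Before exiting the container, it is also worth re-running monmaptool --print m after the --add, to confirm that exactly one mon (a, at the new address) remains.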

Now back to your local machine.

  18. Run kubectl edit secret/rook-ceph-mon and modify fsid to your original fsid. Note that secret values are base64-encoded (see the patch sketch below).
  19. Run kubectl edit cm/rook-config-override and add the content below:
data:
  config: |
    [global]
    auth cluster required = none
    auth service required = none
    auth client required = none
    auth supported = none
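
For step 18, a non-interactive alternative to kubectl edit; a sketch assuming OLD_FSID holds the fsid recovered in step 3:

local$ kubectl -n rook-ceph patch secret/rook-ceph-mon --type merge \
    -p "{\"data\":{\"fsid\":\"$(echo -n "$OLD_FSID" | base64 -w0)\"}}"   # secret values must be base64-encoded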
  20. Run kubectl edit deploy/rook-ceph-operator and set replicas to 1.
  21. Run kubectl logs -f rook-ceph-operator-xxxxxxxxxx and wait until everything settles.
  22. STATE: Now your rook/ceph cluster should be up and running, with authentication disabled.
  23. Run kubectl exec -it rook-ceph-tools-XXXXXXX bash to enter the tools pod:
tools# vi key
[paste the keyring content saved before, preserving only the `[client.admin]` section]
tools# ceph auth import -i key
tools# rm key
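
At this point, a quick sanity check from the same tools pod is worthwhile (auth is still disabled, so these run without a keyring):

tools# ceph -s        # mons, mgr and OSDs should be up
tools# ceph auth ls   # confirm client.admin is back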
  24. Run kubectl edit cm/rook-config-override and remove the previously added configuration.
  25. Run kubectl edit deploy/rook-ceph-operator and set replicas to 0.
  26. Run kubectl delete deploy/X, where X is every deployment in namespace rook-ceph except rook-ceph-operator and rook-ceph-tools, again. This time the OSD daemons are present and included.
  27. Run kubectl edit deploy/rook-ceph-operator and set replicas to 1.
  28. Run kubectl logs -f rook-ceph-operator-xxxxxxxxxx and wait until everything settles.
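
Once the operator settles, confirm everything survived re-enabling auth; the label selector assumes the stock toolbox manifest:

local$ kubectl -n rook-ceph get pods   # mons, mgr, OSDs and friends should all be Running
local$ TOOLS=$(kubectl -n rook-ceph get pod -l app=rook-ceph-tools -o jsonpath='{.items[0].metadata.name}')
local$ kubectl -n rook-ceph exec -it $TOOLS -- ceph -s   # expect HEALTH_OK, or HEALTH_WARN at worst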

HOORAY!
