@andy108369
Last active May 4, 2023 11:20

Scaling OSD from 3 to 1 in rook-ceph

This procedure removes all OSDs from a selected storage node, in an environment with more than one storage node in the cluster and enough free disk space.
If you have only a single storage node, you will have to remove disk by disk.
If you have only a single disk, you will have to remove OSD by OSD, reclaiming the freed disk space in the VG (if that is your case) using the hints in the Reclaiming freed space in VG section below, or wait until rook-ceph supports this natively.

  • Scale osdsPerDevice from 3 to 1 and apply the rook-ceph-cluster helm chart;

  • Make sure the cluster is healthy:

kubectl -n rook-ceph get cephclusters
kubectl -n rook-ceph exec -i $(kubectl -n rook-ceph get pod -l "app=rook-ceph-tools" -o jsonpath='{.items[0].metadata.name}') -- ceph status
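Every step below execs into the toolbox pod via the same long kubectl invocation. As a convenience, that pattern can be wrapped in a small shell function (a sketch; the `ceph_tool` name is ours, not part of rook-ceph):

```shell
# Hypothetical helper: run a command inside the rook-ceph toolbox pod.
# Wraps the repeated "find the tools pod, then exec into it" pattern.
ceph_tool() {
  local pod
  pod=$(kubectl -n rook-ceph get pod -l "app=rook-ceph-tools" \
        -o jsonpath='{.items[0].metadata.name}')
  kubectl -n rook-ceph exec -i "$pod" -- "$@"
}

# Example: ceph_tool ceph status
```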

1. Stop Rook Ceph Operator

kubectl -n rook-ceph scale deployment rook-ceph-operator --replicas=0

2. Find the OSD you want to remove

If you already know which disks you want to remove from Ceph, this command maps them to OSD ids: lsblk -J -n -o NAME /dev/nvme11n1 /dev/nvme10n1 /dev/nvme9n1 /dev/nvme8n1 /dev/nvme7n1 /dev/nvme6n1 | jq -r '.blockdevices[].children[].name' | sed 's/.*block--//g' | sed 's/--/-/g' | xargs -I@ grep ^ /var/lib/rook/rook-ceph/3184888a-5060-4684-a6c2-32509397948a_@/whoami | xargs
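The sed part of that pipeline undoes LVM's name escaping (a literal `-` inside a VG/LV name is stored as `--` in device-mapper names) and strips everything up to the `block--` prefix. A quick illustration on a made-up LV name:

```shell
# LVM doubles hyphens in device-mapper names; the part after "block--" is the
# escaped OSD id. The sample name below is made up for illustration.
echo 'ceph--f00d-osd--block--abcd--ef01--2345' \
  | sed 's/.*block--//g' | sed 's/--/-/g'
# prints: abcd-ef01-2345
```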

kubectl -n rook-ceph exec -i $(kubectl -n rook-ceph get pod -l "app=rook-ceph-tools" -o jsonpath='{.items[0].metadata.name}') -- ceph osd tree

Say you want to remove all OSDs from the specific node:

$ kubectl -n rook-ceph exec -i $(kubectl -n rook-ceph get pod -l "app=rook-ceph-tools" -o jsonpath='{.items[0].metadata.name}') -- ceph osd tree
ID   CLASS  WEIGHT    TYPE NAME                     STATUS  REWEIGHT  PRI-AFF
-13         27.94556  host <YOUR-NODE>                           
  8    ssd   2.32880      osd.8                         up   1.00000  1.00000
 18    ssd   2.32880      osd.18                        up   1.00000  1.00000
 28    ssd   2.32880      osd.28                        up   1.00000  1.00000
 38    ssd   2.32880      osd.38                        up   1.00000  1.00000
 48    ssd   2.32880      osd.48                        up   1.00000  1.00000
 58    ssd   2.32880      osd.58                        up   1.00000  1.00000
 68    ssd   2.32880      osd.68                        up   1.00000  1.00000
 77    ssd   2.32880      osd.77                        up   1.00000  1.00000
 87    ssd   2.32880      osd.87                        up   1.00000  1.00000
 97    ssd   2.32880      osd.97                        up   1.00000  1.00000
107    ssd   2.32880      osd.107                       up   1.00000  1.00000
117    ssd   2.32880      osd.117                       up   1.00000  1.00000

Export the list of OSD ids you are going to remove:

export OSDS="117 107 97 87 77 68 58 48 38 28 18 8"
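Picking the ids out of the tree by hand is error-prone; as a sketch, an awk filter can collect every OSD id under a given host (assuming the host's OSD rows run until the next host/root row, as in the output above; `osds_for_host` is a name chosen here):

```shell
# Sketch: print the OSD ids listed under one host in `ceph osd tree` output.
# Reads the tree on stdin; a host's block ends at the next host/root row.
osds_for_host() {
  awk -v host="$1" '
    $3 == "host"             { in_host = ($4 == host); next }
    $3 == "root"             { in_host = 0 }
    in_host && $4 ~ /^osd\./ { sub(/^osd\./, "", $4); print $4 }
  ' | xargs
}

# Example:
#   ceph osd tree | osds_for_host <YOUR-NODE>
```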

3. Tell Ceph that you are going to remove these OSDs

kubectl -n rook-ceph exec -i $(kubectl -n rook-ceph get pod -l "app=rook-ceph-tools" -o jsonpath='{.items[0].metadata.name}') -- ceph osd out $OSDS

Wait until Ceph settles:

kubectl -n rook-ceph exec -i $(kubectl -n rook-ceph get pod -l "app=rook-ceph-tools" -o jsonpath='{.items[0].metadata.name}') -- ceph status

4. Stop the OSDs

Verify the OSDs are safe to destroy first.

kubectl -n rook-ceph exec -i $(kubectl -n rook-ceph get pod -l "app=rook-ceph-tools" -o jsonpath='{.items[0].metadata.name}') -- ceph osd safe-to-destroy $OSDS

You should see this message: OSD(s) 8,18,... are safe to destroy without reducing data durability.
Stop here if you do not see that message; it might take some time to appear. If it still does not appear after waiting, roll back the change by marking the OSDs back in (e.g. ceph osd in $OSDS). You might then need to go disk by disk or OSD by OSD instead.

Verify the OSDs you are going to remove are on the right node:

for i in $OSDS; do kubectl -n rook-ceph get pods -l "app=rook-ceph-osd,osd=$i" -o wide; done

Remove the OSD deployments; this will bring these OSDs down:

for i in $OSDS; do kubectl -n rook-ceph delete deployment -l "app=rook-ceph-osd,osd=$i"; done

Check that the OSDs are down now:

kubectl -n rook-ceph exec -i $(kubectl -n rook-ceph get pod -l "app=rook-ceph-tools" -o jsonpath='{.items[0].metadata.name}') -- ceph osd tree

If for whatever reason the OSDs are still up even after you have removed the deployments (and the Ceph Operator is not running), you can manually mark them down:

kubectl -n rook-ceph exec -i $(kubectl -n rook-ceph get pod -l "app=rook-ceph-tools" -o jsonpath='{.items[0].metadata.name}') -- ceph osd down $OSDS

5. Purge the OSDs

kubectl -n rook-ceph exec -i $(kubectl -n rook-ceph get pod -l "app=rook-ceph-tools" -o jsonpath='{.items[0].metadata.name}') -- sh -c "for i in $OSDS; do ceph osd purge \$i --force; done"
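Note the quoting in that one-liner: `$OSDS` is expanded by your local shell before the string reaches the pod, while the escaped `\$i` survives to be expanded by the remote sh at loop time. A local illustration of the same pattern:

```shell
# $OSDS expands locally (inside the double quotes); \$i is passed through
# literally and expanded by the inner sh on each loop iteration.
OSDS="8 18 28"
sh -c "for i in $OSDS; do echo purge osd.\$i; done"
# prints:
#   purge osd.8
#   purge osd.18
#   purge osd.28
```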

If these were the last OSDs and you aren't planning on using that host ever again, you can remove it from the crush map:

kubectl -n rook-ceph exec -i $(kubectl -n rook-ceph get pod -l "app=rook-ceph-tools" -o jsonpath='{.items[0].metadata.name}') -- ceph osd crush rm <old-host>

6. Remove the underlying data

Look for the disks with ceph_bluestore FSTYPE:

lsblk -f

If LVM was used, close the LVs that are inactive by now (since the OSDs have been stopped and purged) and remove their VG / PV:

pvs
vgs
lvs
pvdisplay --noheadings -C -o pv_name,vg_name,pv_attr,vg_attr,lv_attr
lvdisplay --noheadings -C -o lv_name,vg_name,lv_attr
lvdisplay --noheadings -C -o lv_name,vg_name,lv_attr | grep ceph | grep -v -- '-ao-' | while read lv vg attr; do lvremove -y $vg/$lv; done

vgs
vgremove <empty vg name>

pvdisplay --noheadings -C -o pv_name,vg_name,pv_attr

pvremove /dev/<DISK>
lsblk -f

Wipe the filesystem signatures off the disk so Ceph can reuse it later, if you wish:

wipefs -a /dev/<DISK>
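An alternative to wipefs is zeroing the start of the disk with dd (the first 50 MiB is usually enough, as noted in step 7). A sketch, demonstrated on a scratch file rather than a real device:

```shell
# Demonstrated on a scratch file; substitute of=/dev/<DISK> on the real node.
truncate -s 100M disk.img
printf 'fake bluestore label' | dd of=disk.img conv=notrunc 2>/dev/null
dd if=/dev/zero of=disk.img bs=1M count=50 conv=notrunc 2>/dev/null
# The first 50 MiB are now zeros; any old signature there is gone.
```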

7. Start Rook Ceph Operator

kubectl -n rook-ceph scale deployment rook-ceph-operator --replicas=1

After a few moments Ceph will automatically create new OSDs on the previously purged disks (provided you have wiped them with wipefs or zeroed them with dd; zeroing the first 50 MiB is usually enough).

  • Wait for the Ceph Operator to settle down:
kubectl -n rook-ceph logs $(kubectl -n rook-ceph get pod -l "app=rook-ceph-operator" -o jsonpath='{.items[0].metadata.name}') --tail=10 -f
  • Check the OSDs and cluster health using the commands above.

Troubleshooting

If rook-ceph fails to create a new OSD with an entity osd.<id> exists but key does not match error, it is likely because you scaled the Rook Operator down too early, leaving ceph auth leftovers. Compare ceph auth ls against ceph osd tree and ceph auth rm osd.<id> the stale entries. Then clean the related disks again and bounce the Rook Ceph Operator. If you still have the list of OSDs to remove exported in the OSDS variable, you can automate the stale ceph auth removal with this command:

kubectl -n rook-ceph exec -i $(kubectl -n rook-ceph get pod -l "app=rook-ceph-tools" -o jsonpath='{.items[0].metadata.name}') -- sh -c "for i in $OSDS; do ceph auth rm osd.\$i; done"
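The comparison of ceph auth ls against ceph osd tree can also be scripted. A sketch (`stale_osd_auth` is a name chosen here; auth.txt / tree.txt stand in for the two captured outputs):

```shell
# Sketch: print osd.N auth entries that no longer appear in the OSD tree.
# $1 = captured `ceph auth ls` output, $2 = captured `ceph osd tree` output.
stale_osd_auth() {
  grep -o '^osd\.[0-9]*' "$1" | sort -u > /tmp/auth.$$
  grep -o 'osd\.[0-9]*'  "$2" | sort -u > /tmp/tree.$$
  comm -23 /tmp/auth.$$ /tmp/tree.$$   # entries only in the auth list
  rm -f /tmp/auth.$$ /tmp/tree.$$
}

# Each printed entry is a candidate for `ceph auth rm`.
```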

Ref. rook/rook#10872

Reclaiming freed space in VG

For those wondering how to run the ceph-bluestore-tool in a Kubernetes rook-ceph environment: you can modify the deployment of your OSDs to add it as an initContainer:

- name: bluefs-bdev-expand
  image: quay.io/ceph/ceph:v17.2.3
  command:
    - ceph-bluestore-tool
  args:
    - 'bluefs-bdev-expand'
    - '--path'
    - /var/lib/ceph/osd/ceph-<ID>
  resources: {}
  volumeMounts:
    - name: rook-data
      mountPath: /var/lib/rook
    - name: rook-config-override
      readOnly: true
      mountPath: /etc/ceph
    - name: rook-ceph-log
      mountPath: /var/log/ceph
    - name: rook-ceph-crash
      mountPath: /var/lib/ceph/crash
    - name: devices
      mountPath: /dev
    - name: run-udev
      mountPath: /run/udev
    - name: activate-osd
      mountPath: /var/lib/ceph/osd/ceph-<ID>
  terminationMessagePath: /dev/termination-log
  terminationMessagePolicy: File
  imagePullPolicy: IfNotPresent
  securityContext:
    privileged: true
    runAsUser: 0
    readOnlyRootFilesystem: false

