This article describes how to back up and restore etcd in an OpenShift environment, as a safeguard against the cluster becoming unhealthy or object resources being damaged by a user.
etcd is the key/value database in which all of the state used by Kubernetes is stored.
An etcd backup can be performed in two main ways.
The first is the "cluster-backup.sh" script that OpenShift provides only on the control plane (master) nodes. The script is located at "/usr/local/bin/cluster-backup.sh" on each master node, and its only argument is the directory in which the backup files will be saved.
[root@bastion ~]# for masters in {master01,master02,master03}; do
ssh core@$masters.ocp4.local "sudo /usr/local/bin/cluster-backup.sh /home/core/backup";
done
found latest kube-apiserver: /etc/kubernetes/static-pod-resources/kube-apiserver-pod-89
found latest kube-controller-manager: /etc/kubernetes/static-pod-resources/kube-controller-manager-pod-26
found latest kube-scheduler: /etc/kubernetes/static-pod-resources/kube-scheduler-pod-25
found latest etcd: /etc/kubernetes/static-pod-resources/etcd-pod-12
etcdctl is already installed
{"level":"info","ts":1654562767.9457173,"caller":"snapshot/v3_snapshot.go:68","msg":"created temporary db file","path":"/home/core/backup/snapshot_2022-06-07_004606.db.part"}
{"level":"info","ts":1654562767.9566555,"logger":"client","caller":"v3/maintenance.go:211","msg":"opened snapshot stream; downloading"}
{"level":"info","ts":1654562767.956733,"caller":"snapshot/v3_snapshot.go:76","msg":"fetching snapshot","endpoint":"https://10.65.40.182:2379"}
{"level":"info","ts":1654562771.5703478,"logger":"client","caller":"v3/maintenance.go:219","msg":"completed snapshot read; closing"}
{"level":"info","ts":1654562771.7935908,"caller":"snapshot/v3_snapshot.go:91","msg":"fetched snapshot","endpoint":"https://10.65.40.182:2379","size":"326 MB","took":"3 seconds ago"}
{"level":"info","ts":1654562771.7943306,"caller":"snapshot/v3_snapshot.go:100","msg":"saved","path":"/home/core/backup/snapshot_2022-06-07_004606.db"}
Snapshot saved at /home/core/backup/snapshot_2022-06-07_004606.db
Deprecated: Use `etcdutl snapshot status` instead.
{"hash":1517890880,"revision":259217763,"totalKey":24763,"totalSize":325615616}
snapshot db and kube resources are successfully saved to /home/core/backup
[root@master01 ~]# ls -al /home/core/backup/
-rw-------. 1 root root 325615648 Jun 7 00:46 snapshot_2022-06-07_004606.db
-rw-------. 1 root root 75010 Jun 7 00:46 static_kuberesources_2022-06-07_004606.tar.gz
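At this point the snapshot and the kube resources archive exist only on the master node's local disk, so it is worth copying them off the node as well. A minimal sketch using scp from the bastion: the backup files are created root-owned with mode 600, so they are chowned to core first (the same thing the CronJob and restore steps below do), and /backup/master01 on the bastion is just an example destination path.
[root@bastion ~]# ssh core@master01.ocp4.local "sudo chown -R core:core /home/core/backup"
[root@bastion ~]# mkdir -p /backup/master01
[root@bastion ~]# scp "core@master01.ocp4.local:/home/core/backup/*" /backup/master01/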
The second method runs the script backup from "1.1. Script Backup" above automatically, using the Kubernetes CronJob feature. It is configured by creating a total of four object resources: a ServiceAccount, a ClusterRole, a ClusterRoleBinding, and a CronJob, with the namespaced resources placed in the openshift-etcd namespace.
[root@bastion ~]# vi 00_service-account.yaml
kind: ServiceAccount
apiVersion: v1
metadata:
  name: cluster-backup
  namespace: openshift-etcd
  labels:
    cluster-backup: "true"
[root@bastion ~]# oc create -f 00_service-account.yaml
[root@bastion ~]# vi 01_cluster-role.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: cluster-backup
rules:
- apiGroups:
  - '*'
  resources:
  - '*'
  verbs:
  - '*'
- nonResourceURLs:
  - '*'
  verbs:
  - '*'
[root@bastion ~]# oc create -f 01_cluster-role.yaml
[root@bastion ~]# vi 02_cluster-role-binding.yaml
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: cluster-backup
  labels:
    cluster-backup: "true"
subjects:
- kind: ServiceAccount
  name: cluster-backup
  namespace: openshift-etcd
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-backup
[root@bastion ~]# oc create -f 02_cluster-role-binding.yaml
Every Sunday at 00:30, the CronJob creates a directory named after the execution date, runs the etcd backup into it, and keeps only the most recent 7 days of backup directories.
[root@bastion ~]# vi 03_cronjobs-etcd-backup.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: etcd-backup
  namespace: openshift-etcd
spec:
  # Sunday, 00:30
  schedule: "30 0 * * 0"
  successfulJobsHistoryLimit: 5
  failedJobsHistoryLimit: 5
  concurrencyPolicy: Forbid
  suspend: false
  jobTemplate:
    metadata:
      creationTimestamp: null
      labels:
        etcd-backup: "true"
    spec:
      backoffLimit: 0
      template:
        metadata:
          creationTimestamp: null
          labels:
            etcd-backup: "true"
        spec:
          containers:
          - name: etcd-backup
            args:
            - "-c"
            - oc get no -l node-role.kubernetes.io/master --no-headers -o name | xargs -I {} -- oc debug {} -- bash -c 'chroot /host sudo -E /usr/local/bin/cluster-backup.sh /home/core/backup/$(date "+%Y%m%d") && chroot /host sudo -E chown -R core:core /home/core/backup/ && chroot /host sudo -E find /home/core/backup/ -type d -ctime +"7" -delete'
            command:
            - "/bin/bash"
            image: "registry.redhat.io/openshift4/ose-cli"
            imagePullPolicy: IfNotPresent
            resources:
              requests:
                cpu: 100m
                memory: 256Mi
            terminationMessagePath: /dev/termination-log
            terminationMessagePolicy: FallbackToLogsOnError
            securityContext:
              privileged: true
              runAsUser: 0
          tolerations:
          - operator: Exists
          nodeSelector:
            node-role.kubernetes.io/master: ''
          dnsPolicy: ClusterFirst
          restartPolicy: Never
          schedulerName: default-scheduler
          serviceAccount: cluster-backup
          serviceAccountName: cluster-backup
          terminationGracePeriodSeconds: 30
          activeDeadlineSeconds: 500
[root@bastion ~]# oc create -f 03_cronjobs-etcd-backup.yaml
[root@bastion ~]# oc get cronjobs -n openshift-etcd
NAME          SCHEDULE     SUSPEND   ACTIVE   LAST SCHEDULE   AGE
etcd-backup   30 0 * * 0   False     0        2d2h            119d
[root@bastion ~]# oc get jobs -n openshift-etcd
NAME                   COMPLETIONS   DURATION   AGE
etcd-backup-27573150   1/1           9m24s      20h
[root@bastion ~]# oc get pod -l etcd-backup
NAME                         READY   STATUS      RESTARTS   AGE
etcd-backup-27576165-grnx5   0/1     Completed   0          2m40s
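If you do not want to wait for the next scheduled run, a Job can also be created from the CronJob to trigger a backup immediately; a minimal sketch (the job name etcd-backup-manual is arbitrary):
[root@bastion ~]# oc create job etcd-backup-manual --from=cronjob/etcd-backup -n openshift-etcd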
Through the ose-cli pod, each master node is accessed with the oc debug node command and the "/usr/local/bin/cluster-backup.sh" script is executed on it.
[root@bastion ~]# oc logs pod/etcd-backup-27576165-grnx5 -n openshift-etcd
Starting pod/master01ocp4local-debug ...
To use host binaries, run `chroot /host`
found latest kube-apiserver: /etc/kubernetes/static-pod-resources/kube-apiserver-pod-89
found latest kube-controller-manager: /etc/kubernetes/static-pod-resources/kube-controller-manager-pod-26
found latest kube-scheduler: /etc/kubernetes/static-pod-resources/kube-scheduler-pod-25
found latest etcd: /etc/kubernetes/static-pod-resources/etcd-pod-12
etcdctl is already installed
{"level":"info","ts":1654569976.7008715,"caller":"snapshot/v3_snapshot.go:68","msg":"created temporary db file","path":"/home/core/backup/20220607/snapshot_2022-06-07_024615.db.part"}
{"level":"info","ts":1654569976.710989,"logger":"client","caller":"v3/maintenance.go:211","msg":"opened snapshot stream; downloading"}
{"level":"info","ts":1654569976.7110608,"caller":"snapshot/v3_snapshot.go:76","msg":"fetching snapshot","endpoint":"https://10.65.40.182:2379"}
{"level":"info","ts":1654569980.0306778,"logger":"client","caller":"v3/maintenance.go:219","msg":"completed snapshot read; closing"}
{"level":"info","ts":1654569980.0825686,"caller":"snapshot/v3_snapshot.go:91","msg":"fetched snapshot","endpoint":"https://10.65.40.182:2379","size":"328 MB","took":"3 seconds ago"}
{"level":"info","ts":1654569980.0826762,"caller":"snapshot/v3_snapshot.go:100","msg":"saved","path":"/home/core/backup/20220607/snapshot_2022-06-07_024615.db"}
Snapshot saved at /home/core/backup/20220607/snapshot_2022-06-07_024615.db
Deprecated: Use `etcdutl snapshot status` instead.
{"hash":255873794,"revision":259409339,"totalKey":24999,"totalSize":327585792}
snapshot db and kube resources are successfully saved to /home/core/backup/20220607
Removing debug pod ...
Starting pod/master02ocp4local-debug ...
To use host binaries, run `chroot /host`
found latest kube-apiserver: /etc/kubernetes/static-pod-resources/kube-apiserver-pod-89
found latest kube-controller-manager: /etc/kubernetes/static-pod-resources/kube-controller-manager-pod-26
found latest kube-scheduler: /etc/kubernetes/static-pod-resources/kube-scheduler-pod-25
found latest etcd: /etc/kubernetes/static-pod-resources/etcd-pod-12
etcdctl is already installed
{"level":"info","ts":1654569999.9411652,"caller":"snapshot/v3_snapshot.go:68","msg":"created temporary db file","path":"/home/core/backup/20220607/snapshot_2022-06-07_024638.db.part"}
{"level":"info","ts":1654569999.9541829,"logger":"client","caller":"v3/maintenance.go:211","msg":"opened snapshot stream; downloading"}
{"level":"info","ts":1654569999.9542665,"caller":"snapshot/v3_snapshot.go:76","msg":"fetching snapshot","endpoint":"https://10.65.40.183:2379"}
{"level":"info","ts":1654570004.369387,"logger":"client","caller":"v3/maintenance.go:219","msg":"completed snapshot read; closing"}
Snapshot saved at /home/core/backup/20220607/snapshot_2022-06-07_024638.db
{"level":"info","ts":1654570004.567939,"caller":"snapshot/v3_snapshot.go:91","msg":"fetched snapshot","endpoint":"https://10.65.40.183:2379","size":"328 MB","took":"4 seconds ago"}
{"level":"info","ts":1654570004.568075,"caller":"snapshot/v3_snapshot.go:100","msg":"saved","path":"/home/core/backup/20220607/snapshot_2022-06-07_024638.db"}
Deprecated: Use `etcdutl snapshot status` instead.
{"hash":3364346622,"revision":259409934,"totalKey":25591,"totalSize":328011776}
snapshot db and kube resources are successfully saved to /home/core/backup/20220607
Removing debug pod ...
Starting pod/master03ocp4local-debug ...
To use host binaries, run `chroot /host`
found latest kube-apiserver: /etc/kubernetes/static-pod-resources/kube-apiserver-pod-89
found latest kube-controller-manager: /etc/kubernetes/static-pod-resources/kube-controller-manager-pod-26
found latest kube-scheduler: /etc/kubernetes/static-pod-resources/kube-scheduler-pod-25
found latest etcd: /etc/kubernetes/static-pod-resources/etcd-pod-12
etcdctl is already installed
{"level":"info","ts":1654570027.5533462,"caller":"snapshot/v3_snapshot.go:68","msg":"created temporary db file","path":"/home/core/backup/20220607/snapshot_2022-06-07_024705.db.part"}
{"level":"info","ts":1654570027.570648,"logger":"client","caller":"v3/maintenance.go:211","msg":"opened snapshot stream; downloading"}
{"level":"info","ts":1654570027.5709696,"caller":"snapshot/v3_snapshot.go:76","msg":"fetching snapshot","endpoint":"https://10.65.40.184:2379"}
{"level":"info","ts":1654570031.5820053,"logger":"client","caller":"v3/maintenance.go:219","msg":"completed snapshot read; closing"}
{"level":"info","ts":1654570031.6293285,"caller":"snapshot/v3_snapshot.go:91","msg":"fetched snapshot","endpoint":"https://10.65.40.184:2379","size":"314 MB","took":"4 seconds ago"}
{"level":"info","ts":1654570031.6294756,"caller":"snapshot/v3_snapshot.go:100","msg":"saved","path":"/home/core/backup/20220607/snapshot_2022-06-07_024705.db"}
Snapshot saved at /home/core/backup/20220607/snapshot_2022-06-07_024705.db
Deprecated: Use `etcdutl snapshot status` instead.
{"hash":2676820444,"revision":259410654,"totalKey":26313,"totalSize":314064896}
snapshot db and kube resources are successfully saved to /home/core/backup/20220607
Removing debug pod ...
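As the log notes, the status check inside the script uses a deprecated etcdctl subcommand, but a snapshot can also be verified by hand after the fact. A minimal sketch run on a master node, using one of the snapshot paths from the log above; it assumes etcdctl is on the node's PATH, which the script's "etcdctl is already installed" message suggests:
[root@master01 ~]# etcdctl snapshot status /home/core/backup/20220607/snapshot_2022-06-07_024615.db -w table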
The restore is driven by the snapshot files created during the backup: one of the master nodes is designated as the recovery host and the procedure is run against it. In this article, the master01 node is used as the recovery host.
Delete the etcd and kube-apiserver YAML manifests that run as static pods on the master nodes.
[root@bastion ~]# for masters in {master01,master02,master03}; do
ssh core@$masters.ocp4.local "sudo rm -f /etc/kubernetes/manifests/{etcd-pod.yaml,kube-apiserver-pod.yaml}";
done
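Before removing the etcd data in the next step, it can be helpful to confirm that the etcd and kube-apiserver static pods have actually stopped on every master. A minimal sketch; ideally nothing is listed (containers belonging to guard or operator pods may still match the grep):
[root@bastion ~]# for masters in {master01,master02,master03}; do
ssh core@$masters.ocp4.local "sudo crictl ps | grep -E 'etcd|kube-apiserver'";
done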
Delete the existing etcd data so that the restore is based solely on the etcd snapshot data on the recovery host.
(If the existing data is not removed, the restore will fail because the data will not be consistent with the snapshot.)
[root@bastion ~]# for masters in {master01,master02,master03}; do
ssh core@$masters.ocp4.local "sudo rm -rf /var/lib/etcd";
done
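Deleting /var/lib/etcd is irreversible. If disk space allows, a slightly safer variation is to move the old member data aside instead of deleting it; a sketch, where /var/lib/etcd-old is an arbitrary path:
[root@bastion ~]# for masters in {master01,master02,master03}; do
ssh core@$masters.ocp4.local "sudo mv /var/lib/etcd /var/lib/etcd-old";
done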
Designate one of the master nodes as the recovery host and run the etcd restore there.
[root@bastion ~]# ssh core@master01 "sudo chown -R core:core /home/core/backup/20220607"
[root@bastion ~]# ssh core@master01 "sudo /usr/local/bin/cluster-restore.sh /home/core/backup/20220607"
Restart the kubelet on the master nodes.
[root@bastion ~]# for masters in {master01,master02,master03}; do
ssh core@$masters.ocp4.local "sudo systemctl restart kubelet.service";
done
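A quick check that the kubelet came back up on every master; a minimal sketch:
[root@bastion ~]# for masters in {master01,master02,master03}; do
ssh core@$masters.ocp4.local "sudo systemctl is-active kubelet.service";
done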
Verify that a single etcd container is running on the recovery host.
[root@bastion ~]# ssh core@master01.ocp4.local "sudo crictl ps | grep etcd"
[root@bastion ~]# oc get pod -o wide -n openshift-etcd | grep etcd
Using the single etcd instance running on the recovery host as the baseline, redeploy etcd to the master02 and master03 nodes.
[root@bastion ~]# oc patch etcd cluster -p='{"spec": {"forceRedeploymentReason": "recovery-'"$( date --rfc-3339=ns )"'"}}' --type=merge
Redeploy the API server and scheduler to the master02 and master03 nodes as well.
[root@bastion ~]# oc patch kubeapiserver cluster -p='{"spec": {"forceRedeploymentReason": "recovery-'"$( date --rfc-3339=ns )"'"}}' --type=merge
[root@bastion ~]# oc patch kubescheduler cluster -p='{"spec": {"forceRedeploymentReason": "recovery-'"$( date --rfc-3339=ns )"'"}}' --type=merge
[root@bastion ~]# oc get pod -o wide -n openshift-etcd | grep etcd
[root@bastion ~]# oc get node
[root@bastion ~]# oc get pod -o wide --all-namespaces
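Beyond pods and nodes, it is also worth waiting until every cluster operator has settled, i.e. reports AVAILABLE=True, PROGRESSING=False and DEGRADED=False:
[root@bastion ~]# oc get clusteroperators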
Reboot all OpenShift cluster nodes.
[root@bastion ~]# for node in $(oc get node -o name | cut -d '/' -f '2'); do
ssh core@$node "sudo systemctl reboot";
done
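If some nodes stay NotReady after the reboot because of pending certificate signing requests, inspect and approve them; a sketch, where <csr_name> is one of the pending CSRs:
[root@bastion ~]# oc get csr
[root@bastion ~]# oc adm certificate approve <csr_name>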