Kubernetes Master Nodes Backup for Kops on AWS - A step-by-step Guide
This upgrade is disruptive to the control plane (master nodes). Although the disruption is brief, it's still something we take very seriously, because nearly all of Buffer's production services run on this single cluster. We felt we needed a more thorough backup process than the currently implemented Heptio Velero.
To my surprise, my Google searches didn't yield any useful results on how to carry out the backup steps. To be fair, there are a few articles specifically about backing up master nodes created by kubeadm, but nothing concrete for kops. We knew we had to try things out on our own.
We would very much love to share our experiences with the community and potentially hear what everyone needed to do with this upgrade. Now, let's jump in!
Locate the master nodes and note the devices attached
Yeah, let's do some backups. But where? We have found the easiest way to back up master nodes is to back up their EBS volumes. This should be easy, right? But like everything in tech, there are always smaller bits and pieces to watch out for, and something as complex as Kubernetes + kops + AWS is unsurprisingly no exception. To locate the right EBS volumes, let's look at the screenshot below.
It's important to note there are 2 block devices attached to each master node, and they are both important: one is for
etcd-main while the other is for
etcd-events.
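If you prefer the CLI over the console, the same lookup can be sketched with the AWS CLI. This assumes the kops-applied KubernetesCluster tag, and the cluster name below is the hypothetical one used throughout this post; swap in your own. The aws call is guarded so the snippet degrades gracefully if the CLI isn't installed.

```shell
# Hypothetical cluster name from this walkthrough; substitute your own.
CLUSTER="steven.buffer-k8s.com"

# kops tags every EBS volume it manages with the cluster name, so one
# filter finds both the etcd-main and etcd-events volume of each master.
FILTER="Name=tag:KubernetesCluster,Values=${CLUSTER}"

if command -v aws >/dev/null 2>&1; then
  aws ec2 describe-volumes \
    --filters "${FILTER}" \
    --query 'Volumes[].[VolumeId,AvailabilityZone,State]' \
    --output table
fi
```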
Creating a snapshot from each volume (3 masters x 2 devices each = 6 snapshots)
Now, let's create a snapshot for each volume. Since our Kubernetes cluster runs on 3 master nodes for High Availability, we will need to do this 6 times! From the screenshot you should see all the tags assigned to each volume. They are important because kops relies on them to attach volumes back to the master nodes. For now, let's just acknowledge this; I will provide more details very soon.
Rinse and repeat six times, and we should have 6 snapshots ready.
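The rinse-and-repeat loop can be sketched like this with the AWS CLI. The six volume IDs are hypothetical placeholders for the IDs found in the previous step, and the aws call is guarded so the snippet is safe to paste even without credentials configured.

```shell
# The six etcd volume IDs located earlier (hypothetical IDs).
VOLUMES="vol-aaaa1111 vol-bbbb2222 vol-cccc3333 vol-dddd4444 vol-eeee5555 vol-ffff6666"

COUNT=0
for VOL in ${VOLUMES}; do
  COUNT=$((COUNT + 1))
  if command -v aws >/dev/null 2>&1; then
    # One snapshot per volume; the description makes them easy to find later.
    aws ec2 create-snapshot \
      --volume-id "${VOL}" \
      --description "etcd backup of ${VOL} before the 1.12 upgrade"
  fi
done
echo "requested ${COUNT} snapshots"
```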
Now, let's take a pause and talk about the tags I mentioned earlier. It's important that each volume has the right tags. Here is a table that will come in handy when creating the volumes. Yes, that means 30 tags in total (6 volumes with 5 tags each) for a 3-master setup.
As the screenshot shows, it's important to make sure each volume is created in the same Availability Zone as its intended master node; otherwise they won't be able to find each other. For now, we will leave one tag value as
stub, since the existing volumes are still attached.
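Creating a replacement volume from a snapshot can be sketched as below. All concrete values (snapshot ID, AZ, the Name tag) are hypothetical; read the real ones off the original volume in the EC2 console, and copy all five tags exactly as they appear there. Only two tags are shown here; note the KubernetesCluster tag is the one left as stub for now.

```shell
# Hypothetical values; take the real ones from the original volume.
SNAPSHOT="snap-0123456789abcdef0"
AZ="us-east-1a"

# The stubbed KubernetesCluster tag keeps kops from matching this volume
# while the old one is still attached. Add the remaining tags exactly as
# they appear on the original volume (five per volume in total).
TAGS='ResourceType=volume,Tags=[{Key=KubernetesCluster,Value=stub},{Key=Name,Value=a.etcd-main.steven.buffer-k8s.com}]'

if command -v aws >/dev/null 2>&1; then
  # The volume MUST land in the same AZ as its intended master.
  aws ec2 create-volume \
    --snapshot-id "${SNAPSHOT}" \
    --availability-zone "${AZ}" \
    --tag-specifications "${TAGS}"
fi
```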
After this is done, we should have six backup volumes ready to go as soon as we swap the
stub value out for
steven.buffer-k8s.com. This concludes our backups.
[Optional] Upgrading the cluster
This step is totally optional. Its only purpose is to demonstrate how to revert a bad cluster upgrade (1.11 to 1.12) using the backup/restore strategy described in this article.
Note the master nodes are now on 1.12, and our intention is to roll everything back to 1.11. Let's see if we can do that.
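For completeness, here is a sketch of the optional upgrade using kops' standard workflow, assuming kops and kubectl are already configured for this cluster; the cluster name is the hypothetical one from this post.

```shell
# Hypothetical cluster name; substitute your own.
CLUSTER="steven.buffer-k8s.com"

if command -v kops >/dev/null 2>&1; then
  kops upgrade cluster --name "${CLUSTER}" --yes        # bump the version in the cluster spec
  kops update cluster --name "${CLUSTER}" --yes         # push the new spec to AWS
  kops rolling-update cluster --name "${CLUSTER}" --yes # replace instances, masters included
fi

if command -v kubectl >/dev/null 2>&1; then
  kubectl get nodes   # the VERSION column should now read v1.12.x
fi
```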
Restoring from backups
Detach existing volumes and delete them. This will break all master nodes, for now
It should be obvious by now that, in order to restore from backups, we will first need to remove all the existing attached volumes. The screenshots below show where this is done.
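The console clicks above can be sketched with the AWS CLI as follows. This is destructive: the control plane is down from this point until the backup volumes are attached, so have the restore steps ready. The volume IDs are hypothetical placeholders for the six live etcd volumes.

```shell
# Hypothetical IDs of the six live etcd volumes; take the real ones
# from the EC2 console before running anything.
OLD_VOLUMES="vol-aaaa1111 vol-bbbb2222 vol-cccc3333 vol-dddd4444 vol-eeee5555 vol-ffff6666"

COUNT=0
for VOL in ${OLD_VOLUMES}; do
  COUNT=$((COUNT + 1))
  if command -v aws >/dev/null 2>&1; then
    aws ec2 detach-volume --volume-id "${VOL}" --force
    # Wait until the detach completes before deleting.
    aws ec2 wait volume-available --volume-ids "${VOL}"
    aws ec2 delete-volume --volume-id "${VOL}"
  fi
done
echo "removed ${COUNT} volumes"
```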
Add the missing tag value to the backup volumes
With the old volumes detached and deleted, the backup volumes created earlier are ready to be attached to the master nodes. But first, we will need to add back the right tag value, as the screenshot shows.
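Swapping the stubbed tag value for the real cluster name can be done in one CLI call; the volume IDs and cluster name below are hypothetical.

```shell
# Hypothetical cluster name and backup volume IDs; substitute your own.
CLUSTER="steven.buffer-k8s.com"
BACKUP_VOLUMES="vol-1111aaaa vol-2222bbbb vol-3333cccc vol-4444dddd vol-5555eeee vol-6666ffff"

if command -v aws >/dev/null 2>&1; then
  # create-tags overwrites an existing tag that has the same key, so
  # this single call replaces "stub" on all six volumes at once.
  aws ec2 create-tags \
    --resources ${BACKUP_VOLUMES} \
    --tags "Key=KubernetesCluster,Value=${CLUSTER}"
fi
```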
We are now on our final step. Just hang in there! For the master nodes to pick up the backup volumes, we will need to recreate all of them. This step is as simple as terminating the nodes, because
kops will automatically spin up new nodes and attach the volumes we created.
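The terminate-and-verify step can be sketched as below; the instance IDs are hypothetical placeholders, and the expectation is that the Auto Scaling groups kops manages will launch replacements that pick up the tagged backup volumes.

```shell
# Hypothetical master instance IDs; find the real ones in the EC2 console.
MASTERS="i-0aaa11111 i-0bbb22222 i-0ccc33333"

if command -v aws >/dev/null 2>&1; then
  # The Auto Scaling groups bring up replacement masters automatically.
  aws ec2 terminate-instances --instance-ids ${MASTERS}
fi

if command -v kubectl >/dev/null 2>&1; then
  # Once the new masters join, every node should report v1.11.x again.
  kubectl get nodes
fi
```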
Profit! All nodes back to 1.11
Thanks for bearing with me on this long post with many steps. I believe we are at a very interesting stage of Kubernetes adoption. While it has made amazing progress in the last few years, the ecosystem is still catching up. For the longest time, CI/CD on Kubernetes was a challenge; then we faced the issue of observability. Fortunately, vendors like Datadog et al. are continuously rolling out new offerings to address these challenges. Buffer, as an early adopter of Kubernetes, is truly blessed to be in a position to witness these transitions and contribute the best we can to the community.
If you have any thoughts or questions, feel free to hit me up on Twitter. Until then, I hope you have fun with Kubernetes!