Kubernetes Master Nodes Backup for Kops on AWS - A step-by-step Guide

Those who have been using kops for a while should know that the upgrade from 1.11 to 1.12 poses a greater risk than usual, as it upgrades etcd2 to etcd3.

This upgrade is disruptive to the control plane (master nodes). The disruption is brief, but we still take it very seriously because nearly all of Buffer's production services run on this single cluster. We felt we needed a more thorough backup process than our existing Heptio Velero setup.

To my surprise, my Google searches didn't yield any useful results on how to carry out the backup steps. To be fair, there are a few articles about backing up master nodes created by kubeadm, but nothing concrete for kops specifically. We knew we had to try things out on our own.

We would very much love to share our experiences with the community and potentially hear what everyone needed to do with this upgrade. Now, let's jump in!

Creating backups

Locate the master nodes and note the attached devices

Yeah, let's do some backups. But where? We have found that the easiest way to back up master nodes is to back up their EBS volumes. This should be easy, right? But like everything in tech, there are always smaller bits and pieces to watch out for, and something as complex as Kubernetes + kops + AWS is no exception. To locate the right EBS volumes, let's look at the screenshot below.

It's important to note there are 2 block devices for each master node and they are both important. One is for etcd-main while the other one is for etcd-events.
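If you prefer a script over clicking through the console, here is a minimal sketch of the same lookup. It assumes the input has the shape of the `Volumes` list in an EC2 describe-volumes response, and it assumes the etcd volumes can be recognized by `etcd-main` or `etcd-events` in their `Name` tag. The volume IDs, instance ID, device names, and cluster name in the sample are made up.

```python
from collections import defaultdict

# Assumption: kops etcd volumes carry one of these markers in their Name tag.
ETCD_MARKERS = ("etcd-main", "etcd-events")


def etcd_volumes_by_instance(volumes):
    """Group attached etcd volumes by the instance they are attached to."""
    grouped = defaultdict(list)
    for vol in volumes:
        tags = {t["Key"]: t["Value"] for t in vol.get("Tags", [])}
        name = tags.get("Name", "")
        if not any(marker in name for marker in ETCD_MARKERS):
            continue  # skip root disks and unrelated volumes
        for att in vol.get("Attachments", []):
            grouped[att["InstanceId"]].append({
                "volume_id": vol["VolumeId"],
                "name": name,
                "zone": vol["AvailabilityZone"],
                "device": att["Device"],
            })
    return dict(grouped)


# Made-up sample for a single master; real data would come from describe-volumes.
sample = [
    {"VolumeId": "vol-0aaa", "AvailabilityZone": "us-east-1b",
     "Tags": [{"Key": "Name", "Value": "b.etcd-main.cluster.example.com"}],
     "Attachments": [{"InstanceId": "i-0master", "Device": "/dev/xvdu"}]},
    {"VolumeId": "vol-0bbb", "AvailabilityZone": "us-east-1b",
     "Tags": [{"Key": "Name", "Value": "b.etcd-events.cluster.example.com"}],
     "Attachments": [{"InstanceId": "i-0master", "Device": "/dev/xvdv"}]},
    {"VolumeId": "vol-0ccc", "AvailabilityZone": "us-east-1b",
     "Tags": [],
     "Attachments": [{"InstanceId": "i-0master", "Device": "/dev/xvda"}]},
]
found = etcd_volumes_by_instance(sample)
```

Each master should end up with exactly two entries. If one shows more or fewer, stop and check the tags before going any further.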

Creating a snapshot from each volume (3 masters x 2 devices each = 6 snapshots)

Now, let's create a snapshot of each volume. Since our Kubernetes cluster runs on 3 master nodes for High Availability, we will need to do this 6 times! From the screenshot you should see all the tags assigned to each volume. They are important because kops relies on them to attach volumes back to master nodes. For now, let's just acknowledge this. I will provide more details very soon.

Rinse and repeat 6 times, and we should have 6 snapshots ready.
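The six repetitions can also be generated rather than clicked. The sketch below only builds the `aws ec2 create-snapshot` command lines so they can be reviewed before anything runs. The volume IDs are placeholders, and the zone letters b, c, d are an assumption based on the tag values shown later in this post.

```python
def snapshot_commands(volumes):
    """Return one `aws ec2 create-snapshot` argv list per (volume_id, name) pair."""
    commands = []
    for volume_id, name in volumes:
        commands.append([
            "aws", "ec2", "create-snapshot",
            "--volume-id", volume_id,
            # The description records which etcd volume the snapshot came from.
            "--description", f"pre-upgrade backup of {name}",
        ])
    return commands


zones = ["b", "c", "d"]      # assumption: one master per zone letter
kinds = ["main", "events"]   # two etcd volumes per master
volumes = [(f"vol-{zone}-{kind}", f"{zone}.etcd-{kind}")  # placeholder IDs
           for zone in zones for kind in kinds]
commands = snapshot_commands(volumes)
```

Each list can be passed to `subprocess.run(cmd, check=True)` once you are happy with it. Putting the volume name in the description makes it much easier to match snapshots to masters later.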

Now, let's pause and talk about the tags I mentioned earlier. It's important that each volume has the right tags. Here is a table that will come in handy when creating the volumes. Yeah, this means 30 tags (6 volumes with 5 tags each) are needed for a 3 master node setup.

Key     Value
Name    b.etcd-(main/events)
…       b/b,c,d
…       1
…       owned
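To make the 30-tag arithmetic concrete, here is a small sketch that generates the tag set for every volume. The values follow the table above. The tag keys, and the cluster name appended to `Name`, are my assumption of the usual kops convention and are not taken from the table, so copy the exact keys from your existing volumes rather than trusting this code. `cluster.example.com` is a placeholder.

```python
def etcd_volume_tags(cluster, zone, kind, members):
    """Build the 5 tags for one etcd volume.

    Assumption: the tag keys below follow the usual kops naming; verify them
    against the tags on your existing volumes before using them.
    """
    if kind not in ("main", "events"):
        raise ValueError(f"unknown etcd kind: {kind}")
    return {
        "KubernetesCluster": cluster,
        "Name": f"{zone}.etcd-{kind}.{cluster}",
        f"k8s.io/etcd/{kind}": f"{zone}/{','.join(members)}",
        "k8s.io/role/master": "1",
        f"kubernetes.io/cluster/{cluster}": "owned",
    }


members = ["b", "c", "d"]  # zone letters of the three masters
all_tags = {
    (zone, kind): etcd_volume_tags("cluster.example.com", zone, kind, members)
    for zone in members
    for kind in ("main", "events")
}
total = sum(len(tags) for tags in all_tags.values())
```

Here `total` is the number of tags to set by hand, which is why a typo is so easy to make and why generating them is worth the effort.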

Creating volumes

As the screenshot shows, it's important to make sure each volume is created in the same Availability Zone as its intended master node. Otherwise they won't be able to find each other. For now, we will leave one tag value as a stub since the existing volumes are still attached.

After this is done, we should have six backup volumes ready to go as soon as we swap out the stub value for the real one. This concludes our backups.
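Here is a hedged sketch of the volume creation step as an `aws ec2 create-volume` command line. It passes the tags as JSON, because a value such as `b/b,c,d` contains commas that the CLI shorthand syntax would split. Which tag gets stubbed is up to you: `stub_key`, the tag names, and the IDs below are hypothetical placeholders.

```python
import json


def create_volume_command(snapshot_id, zone, tags, stub_key, stub_value="stub"):
    """Build an `aws ec2 create-volume` argv list with one tag value stubbed out."""
    if stub_key not in tags:
        raise KeyError(f"{stub_key!r} is not one of the volume's tags")
    staged = {**tags, stub_key: stub_value}  # copy, so the real tags stay intact
    spec = [{
        "ResourceType": "volume",
        "Tags": [{"Key": k, "Value": v} for k, v in staged.items()],
    }]
    return [
        "aws", "ec2", "create-volume",
        "--snapshot-id", snapshot_id,
        "--availability-zone", zone,  # must match the master's zone
        "--tag-specifications", json.dumps(spec),
    ]


# Hypothetical example: two of the five tags, with placeholder key names.
real_tags = {"Name": "b.etcd-main.cluster.example.com", "tag-to-stub": "b/b,c,d"}
cmd = create_volume_command("snap-0aaa", "us-east-1b", real_tags, "tag-to-stub")
```

Keeping the real tags in a separate dictionary means the correct value is still at hand when it is time to swap the stub out.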

[Optional] Upgrading the cluster

This step is totally optional. Its only purpose is to demonstrate how to revert a bad cluster upgrade (1.11 to 1.12) using the backup/restore strategy described in this article.

Note the master nodes are now in 1.12, and our intention is to roll everything back to 1.11. Let's see if we can do that.

Restoring from backups

Detach existing volumes and delete them. This will break all master nodes, for now

It should be obvious by now that in order to restore from backups, we first need to remove all the existing, attached volumes. The screenshots below show where this is done.
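Since this step is destructive, a small guard is worth having. The sketch below builds the detach, wait, and delete commands for each old volume, but refuses to produce anything unless every old volume has a backup volume recorded against it. The mapping and the volume IDs are hypothetical.

```python
def teardown_commands(old_to_backup):
    """Build detach/wait/delete argv lists, one triple per old volume.

    `old_to_backup` maps each attached volume ID to its backup volume ID.
    Raises ValueError if any old volume has no backup recorded.
    """
    missing = [old for old, backup in old_to_backup.items() if not backup]
    if missing:
        raise ValueError(
            "no backup volume recorded for: " + ", ".join(sorted(missing))
        )
    commands = []
    for old in old_to_backup:
        commands.append(["aws", "ec2", "detach-volume", "--volume-id", old])
        # A volume can only be deleted once it is back in the "available" state.
        commands.append(["aws", "ec2", "wait", "volume-available",
                         "--volume-ids", old])
        commands.append(["aws", "ec2", "delete-volume", "--volume-id", old])
    return commands


plan = teardown_commands({"vol-old-1": "vol-backup-1", "vol-old-2": "vol-backup-2"})
```

Printing `plan` and reading it through once before running anything is cheap insurance.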

Add the missing tag value to the backup volumes

With the old volumes detached and deleted, the backup volumes created earlier are ready to be attached to the master nodes. But first, we will need to add back the right tag value as the screenshot shows.
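The same change can be made with `aws ec2 create-tags`, which overwrites the value when the key already exists on the resource. A minimal sketch follows, with a placeholder volume ID and a placeholder tag key, since the real key depends on which tag you stubbed.

```python
import json


def retag_command(volume_id, key, value):
    """Build an `aws ec2 create-tags` argv list that sets one tag on one volume."""
    # JSON instead of shorthand, because the value may contain commas.
    tags = json.dumps([{"Key": key, "Value": value}])
    return ["aws", "ec2", "create-tags", "--resources", volume_id, "--tags", tags]


# Hypothetical key name; use whichever tag you left as a stub.
fix = retag_command("vol-backup-1", "tag-to-restore", "b/b,c,d")
```

Run it once per backup volume, six times in total for a 3 master setup.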

We are now on our final step. Just hang in there! For the master nodes to pick up the backup volumes, we will need to recreate all of them. This step is as simple as terminating the nodes, because kops will automatically spin up new nodes and attach the volumes we created.
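Before terminating anything, it helps to confirm that every master's zone has both backup volumes ready. The sketch below performs that check and only then builds the `aws ec2 terminate-instances` command. The instance IDs and zones are placeholders, and the check itself is my own addition rather than part of the procedure above.

```python
def terminate_command(masters, backups):
    """Build the terminate command once every master has both etcd backups.

    `masters` maps instance ID to Availability Zone.
    `backups` is a list of (zone, kind) pairs for the retagged backup volumes.
    """
    ready = set(backups)
    for instance_id, zone in masters.items():
        for kind in ("main", "events"):
            if (zone, kind) not in ready:
                raise ValueError(
                    f"{instance_id}: no etcd-{kind} backup volume in {zone}"
                )
    return ["aws", "ec2", "terminate-instances",
            "--instance-ids", *sorted(masters)]


masters = {"i-0bbb": "us-east-1b", "i-0ccc": "us-east-1c", "i-0ddd": "us-east-1d"}
backups = [(zone, kind) for zone in masters.values() for kind in ("main", "events")]
cmd = terminate_command(masters, backups)
```

If the function raises, a new master in that zone would come up with no matching volume to attach, so fix the backup volumes first.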

Profit! All nodes back to 1.11

Closing words

Thanks for bearing with me through this long post and its many steps. I believe we are at a very interesting stage of Kubernetes adoption. While it has made amazing progress in the last few years, the ecosystem is still catching up. For the longest time, CI/CD on Kubernetes was a challenge, and then we faced the issue of observability. Fortunately, vendors like Datadog are continuously rolling out new offerings to address all kinds of challenges. Buffer, being an early adopter of Kubernetes, is truly blessed to be in a position to witness all these transitions and to contribute the best we can to the community.

If you have any thoughts or questions, feel free to hit me up on Twitter. Until then, I hope you have fun with Kubernetes!
