2021-08-26 Banzai Cloud Helm Chart repository incident postmortem
Table of Contents
- Incident summary
- Root cause
- Corrective actions
At 17:48 (UTC) on 2021-08-26, users encountered expired-certificate errors when trying to access Banzai Cloud's public Helm Chart repository at https://kubernetes-charts.banzaicloud.com.
The incident was reported by multiple users on the Banzai Cloud Community Slack workspace.
After acknowledging the issue, the SRE team resolved the problem in about five minutes.
All times are UTC.
- 17:48 - Incident first reported on the Banzai Cloud Community Slack workspace.
- 18:14 - Incident was acknowledged by an employee, who notified the SRE team.
- 18:30 - Incident was acknowledged by the SRE team. They began fixing the issue.
- 18:35 - Users reported that the issue was fixed.
This particular Helm Chart repository runs on a previous generation of our infrastructure, where certificates are installed manually and, as a result, renewal is also a manual process.
The usual alert for expiring certificates did not notify the SRE team that the certificates were about to expire (we don't yet know why), which led to this incident.
During the incident, the service itself was running, but users couldn't access it because of the expired certificate. The incident also affected a number of other services, but we received no user complaints about them, and their users most likely never noticed.
Although the service is not seen as mission critical (Helm maintains a local cache of Charts), users reported failing CI jobs.
The incident was reported by users. No automatic alert went off notifying the SRE team.
@pregnor first acknowledged the incident roughly 30 minutes after the first report and notified the SRE team. @sagikazarmark acknowledged the issue about 15 minutes later and fixed the problem within a few more minutes.
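The response times quoted above follow from the timeline. As a small sketch, the intervals can be computed directly; the timestamps are copied from the timeline, while the event names and helper functions are hypothetical and chosen here only for illustration.

```python
from datetime import datetime

# Timestamps (UTC) copied from the incident timeline.
# The event names are illustrative labels, not official terminology.
TIMELINE = {
    "first_report": "17:48",
    "employee_ack": "18:14",
    "sre_ack": "18:30",
    "fix_confirmed": "18:35",
}


def minutes_between(start, end):
    """Return the whole minutes between two same-day HH:MM timestamps."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


def intervals(timeline):
    """Return the minutes elapsed between each pair of consecutive events."""
    events = list(timeline.items())
    return {
        f"{name_a} -> {name_b}": minutes_between(time_a, time_b)
        for (name_a, time_a), (name_b, time_b) in zip(events, events[1:])
    }
```

Calling `intervals(TIMELINE)` gives 26, 16, and 5 minutes for the three steps, or 47 minutes from the first report to the confirmed fix.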
The service was recovered by executing the runbook for renewing certificates on our previous generation infrastructure. The process completed in a couple of minutes and restored the service.
The incident was caused by an expired certificate that was supposed to be renewed manually 10 days before its expiration date. The alert that was supposed to notify the responsible personnel didn't go off, leading to the incident.
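To illustrate the kind of check the missing alert stands for, here is a minimal sketch that reads the expiration date of the certificate a host serves and compares it against the 10-day renewal window. It is not the alerting setup used by the SRE team (which is not described here); the function names and the default port and timeout are assumptions made for this example.

```python
import socket
import ssl
from datetime import datetime, timedelta, timezone

# From the postmortem: renewal is due 10 days before expiration.
RENEWAL_WINDOW = timedelta(days=10)


def needs_renewal(not_after, now, window=RENEWAL_WINDOW):
    """Return True once the certificate is inside the renewal window or expired."""
    return not_after - now <= window


def fetch_not_after(host, port=443, timeout=5.0):
    """Return the expiration time (aware, UTC) of the certificate served by host.

    The default context verifies the certificate, so an already expired
    certificate raises ssl.SSLCertVerificationError instead of returning.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    expires = ssl.cert_time_to_seconds(cert["notAfter"])
    return datetime.fromtimestamp(expires, tz=timezone.utc)


def certificate_status(host):
    """Return 'ok', 'renew', or 'invalid' for the certificate served by host."""
    try:
        not_after = fetch_not_after(host)
    except ssl.SSLCertVerificationError:
        # Expired or otherwise untrusted certificate.
        return "invalid"
    now = datetime.now(timezone.utc)
    return "renew" if needs_renewal(not_after, now) else "ok"
```

Running something like `certificate_status("kubernetes-charts.banzaicloud.com")` on a schedule, independently of the primary alert, would provide a second signal if that alert fails silently again.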
In order to prevent this incident from happening in the future, the following actions will be taken:
- Relocation of the service to our current generation infrastructure. This will involve some data migration, but we don't anticipate any downtime during the process.
- Automated certificate renewal using cert-manager and Let's Encrypt (already provided by our current generation infrastructure).
- Better monitoring that checks service availability from the user's point of view.
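As a rough sketch of what an availability check could look like, the probe below requests a URL over HTTPS with certificate verification enabled and reports TLS failures separately from other errors. This is only an illustration under stated assumptions, not the monitoring that will actually be deployed: the `probe` and `classify` helpers and the status names are hypothetical.

```python
import ssl
import urllib.error
import urllib.request


def classify(exc):
    """Map an exception raised by urlopen to a coarse probe status."""
    if isinstance(exc, urllib.error.HTTPError):
        # The server answered, but with a non-2xx status code.
        return "http_error"
    # URLError wraps the underlying error in its 'reason' attribute.
    reason = getattr(exc, "reason", exc)
    if isinstance(reason, ssl.SSLError):
        # Includes expired or otherwise unverifiable certificates.
        return "tls_error"
    return "unreachable"


def probe(url, timeout=5.0):
    """Return 'ok' if url is reachable over verified HTTPS, else a failure status."""
    try:
        # urlopen verifies certificates by default and raises HTTPError
        # for non-2xx responses, so reaching the body means success.
        with urllib.request.urlopen(url, timeout=timeout):
            return "ok"
    except (urllib.error.URLError, OSError) as exc:
        return classify(exc)
```

Helm chart repositories serve an `index.yaml` file at their root, so a call such as `probe("https://kubernetes-charts.banzaicloud.com/index.yaml")` would exercise the same path that Helm clients use. During this incident such a probe would have returned `tls_error` even though the service itself was running.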