sagikazarmark/2021-08-26-banzai-cloud-helm-chart-repository-incident-postmortem.md

2021-08-26 Banzai Cloud Helm Chart repository incident postmortem

Table of Contents


Incident summary
Timeline
Context
Fault
Impact
Detection
Response
Recovery
Root cause
Corrective actions

Incident summary

At 17:48 (UTC) on 2021-08-26, users encountered expired-certificate errors when trying to
access Banzai Cloud's public Helm Chart repository at https://kubernetes-charts.banzaicloud.com.
The incident was reported by multiple users on the Banzai Cloud Community Slack workspace.
After acknowledging the issue, the SRE team resolved it within about five minutes.

Timeline

All times are UTC.

17:48 - Incident first reported on the Banzai Cloud Community Slack workspace.
18:14 - Incident was acknowledged by an employee, who notified the SRE team.
18:30 - Incident was acknowledged by the SRE team. They began fixing the issue.
18:35 - Users reported that the issue was fixed.

Context

This particular Helm Chart repository runs on a previous generation of our infrastructure, where certificates are installed manually.
As a result, renewal is also a manual process.

Fault

The usual alert for expiring certificates did not notify the SRE team that the certificate was about to expire (we don't yet know why),
which led to this incident.

Impact

During the incident the service itself was running, but users could not access it because of the expired certificate.
The incident also affected a number of other services, but we received no user complaints about them, and their users most likely never noticed.
Although the service is not considered mission critical (Helm maintains a local cache of Charts), users reported failing CI jobs.

Detection

The incident was reported by users. No automatic alert fired to notify the SRE team.

Response

The incident was first acknowledged by @pregnor roughly 30 minutes after the first report. He then notified the SRE team.
@sagikazarmark acknowledged the issue about 15 minutes later and fixed the problem within a few more minutes.

Recovery

The service was restored by executing the runbook for renewing certificates on our previous generation infrastructure.
The process completed within a few minutes, after which the service was accessible again.

Root cause

The incident was caused by an expired certificate that was supposed to be renewed manually 10 days before its expiration date.
The alert that was supposed to notify the responsible personnel did not fire, so the renewal was never performed.
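
The renewal rule described above (renew 10 days before expiration) can be expressed as a small check that a scheduled alerting job might run. This is only a minimal Python sketch, not the alert that was actually in place: the function names, the default port, and the `RENEWAL_WINDOW` constant are assumptions made for illustration.

```python
import socket
import ssl
from datetime import datetime, timedelta, timezone

# Assumption: the alert window mirrors the 10-day manual renewal lead time.
RENEWAL_WINDOW = timedelta(days=10)


def fetch_not_after(host: str, port: int = 443, timeout: float = 10.0) -> str:
    """Return the served certificate's 'notAfter' field as a string.

    The default context verifies the certificate, so an already expired
    certificate raises ssl.SSLCertVerificationError instead of returning.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def expires_at(not_after: str) -> datetime:
    """Convert a 'notAfter' string (e.g. 'Jan  5 09:34:43 2018 GMT') to a UTC datetime."""
    return datetime.fromtimestamp(ssl.cert_time_to_seconds(not_after), tz=timezone.utc)


def needs_renewal(not_after: str, now: datetime,
                  window: timedelta = RENEWAL_WINDOW) -> bool:
    """True if the certificate expires within `window` of `now` (timezone-aware)."""
    return expires_at(not_after) - now <= window
```

A job could call `needs_renewal(fetch_not_after("kubernetes-charts.banzaicloud.com"), datetime.now(timezone.utc))` on a schedule and notify the on-call engineer when it returns `True`. Because the check depends only on the certificate itself, it would also act as a second signal if the primary alert stays silent.
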
Corrective actions

In order to prevent this incident from happening in the future, the following actions will be taken:

Relocation of the service to our current generation infrastructure.
This will involve some data migration, but we don't anticipate any downtime during the process.
Automated certificate renewal using cert-manager and Let's Encrypt (already provided by our current generation infrastructure).
Improved monitoring that checks service availability.
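
Since the service was running during the incident but unreachable for clients that verify certificates, an availability check is most useful when it connects the way a client does. The following is a hedged Python sketch of such a probe, not our actual monitoring setup: the function names and status labels are hypothetical, and probing `index.yaml` assumes the standard Helm repository layout.

```python
import ssl
import urllib.error
import urllib.request


def classify(error):
    """Map a probe outcome to a status label; None means the request succeeded."""
    if error is None:
        return "ok"
    if isinstance(error, urllib.error.HTTPError):
        return "http_error"  # server answered, but with a 4xx/5xx status
    # URLError wraps the underlying cause (e.g. a TLS failure) in `.reason`.
    reason = getattr(error, "reason", error)
    if isinstance(reason, ssl.SSLCertVerificationError):
        return "certificate_invalid"  # expired, untrusted, or mismatched certificate
    return "unreachable"


def probe(url, timeout=10.0):
    """Fetch `url` with certificate verification enabled and classify the result."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return classify(None)
    except (urllib.error.URLError, OSError) as exc:
        return classify(exc)
```

For example, `probe("https://kubernetes-charts.banzaicloud.com/index.yaml")` would be expected to report `"certificate_invalid"` rather than `"ok"` while an expired certificate is being served, even though the server process is running.
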