@jujaga
Last active September 14, 2020 20:57
CSST Incident Report - COMFORT - September 10, 2020

Incident Report

  • Ministry: Natural Resources - IITD Division
  • Team(s): Common Services Showcase Team (CSST)
  • Affected System(s): Common Forms Toolkit (COMFORT)
  • Environments: Production
  • Incident Type: Outage

Summary

On September 10, 2020 at 9:37 AM PDT, the production instance of COMFORT became unavailable due to a compound database failure. The COMFORT application remained unavailable until September 10, 2020 at 1:42 PM PDT.

Affected Users

While we do not expect any data loss from this incident, we suggest that any users who attempted to reach COMFORT (https://comfort.pathfinder.gov.bc.ca/app/) between September 10, 2020 at 1:00 AM PDT and September 10, 2020 at 1:42 PM PDT double-check their actions and submissions within that window.

Root Cause

The Patroni High Availability database cluster suffered a catastrophic failure: all three replicas went down over the course of a few days, and the failures went undetected. More specifically:

  • patroni-0 was in a crash loop for more than a week (the start date could not be determined).
  • patroni-1 was up and operational until September 10, 2020 at 9:36 AM PDT. This pod was killed cleanly as a part of regular platform node maintenance.
  • patroni-2 was up and operational until September 7, 2020 at 2:30 PM PDT. This pod was killed cleanly as a part of regular platform node maintenance.

From September 7, 2020 at 2:30 PM PDT until September 10, 2020 at 9:33 AM PDT, patroni-1 was the only pod running and was therefore acting as master. When patroni-1 went down, there were no more database replicas available to fail over to, and the database became unavailable.
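
The loss of redundancy described above can be illustrated with a minimal sketch. It replays the pod timeline from this report and classifies the cluster at a given moment. The names DOWN_SINCE, healthy_members, and redundancy_status are hypothetical and exist only for this illustration; they are not part of Patroni or of the COMFORT deployment. Because the start of the patroni-0 crash loop is unknown, the sketch assumes it was down for the entire window, and it uses the 9:36 AM time given for patroni-1 in the list above.

```python
from datetime import datetime

# Times each pod stopped serving, taken from the report (PDT, naive datetimes).
# patroni-0's crash-loop start is unknown; None means "down for the whole
# window", which is an assumption made for illustration only.
DOWN_SINCE = {
    "patroni-0": None,
    "patroni-1": datetime(2020, 9, 10, 9, 36),
    "patroni-2": datetime(2020, 9, 7, 14, 30),
}


def healthy_members(at):
    """Return the pods still up at time `at`, sorted by name."""
    return sorted(
        pod
        for pod, down in DOWN_SINCE.items()
        if down is not None and at < down
    )


def redundancy_status(at):
    """Classify the cluster as 'ok', 'degraded', 'no-failover', or 'down'."""
    count = len(healthy_members(at))
    if count == 0:
        return "down"
    if count == 1:
        return "no-failover"  # a single member left: nothing to fail over to
    if count < len(DOWN_SINCE):
        return "degraded"
    return "ok"


if __name__ == "__main__":
    for moment in (
        datetime(2020, 9, 7, 12, 0),
        datetime(2020, 9, 8, 0, 0),
        datetime(2020, 9, 10, 9, 37),
    ):
        print(moment, healthy_members(moment), redundancy_status(moment))
```

Under these assumptions, the cluster had no failover target for the entire period between the loss of patroni-2 and the loss of patroni-1. A status like "no-failover" is the kind of condition that an alert could have surfaced before the outage.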
