- Ministry: Natural Resources - IITD Division
- Team(s): Common Services Showcase Team (CSST)
- Affected System(s): Common Forms Toolkit (COMFORT)
- Environments: Production
- Incident Type: Outage
On September 10, 2020 at 9:37 AM PDT, the production instance of COMFORT became unavailable due to a compound database failure. The COMFORT application remained unavailable until September 10, 2020 at 1:42 PM PDT.
While we do not expect any data loss from this incident, we suggest that any users who attempted to reach COMFORT (https://comfort.pathfinder.gov.bc.ca/app/) between September 10, 2020 at 1:00 AM PDT and September 10, 2020 at 1:42 PM PDT double-check their actions and submissions within that window.
The Patroni High Availability database cluster suffered a catastrophic failure: all three replicas went down over the course of a few days, and the degradation went undetected. More specifically:
- patroni-0 was in a crash loop for more than a week. We were unable to determine when the crash loop started.
- patroni-1 was up and operational until September 10, 2020 at 9:36 AM PDT. This pod was killed cleanly as a part of regular platform node maintenance.
- patroni-2 was up and operational until September 7, 2020 at 2:30 PM PDT. This pod was killed cleanly as a part of regular platform node maintenance.
From September 7, 2020 at 2:30 PM PDT until September 10, 2020 at 9:33 AM PDT, patroni-1 was the only pod running and was therefore acting as master. When patroni-1 went down, there were no database replicas left to fail over to, and the database became unavailable.
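Because the cluster ran degraded for days without anyone noticing, a periodic health check on cluster membership is one way this could have been caught earlier. The following is a minimal sketch only, not what CSST runs. It assumes the Patroni REST API `/cluster` endpoint is reachable and returns a `members` list with `name` and `state` fields. The helper names (`summarize_cluster`, `fetch_cluster`), the expected member count of 3, and the set of states treated as healthy are illustrative assumptions.

```python
import json
from urllib.request import urlopen

# Assumption: member states that count as healthy. Adjust for your Patroni version.
HEALTHY_STATES = {"running", "streaming"}


def summarize_cluster(cluster, expected_members=3):
    """Summarize member health from a parsed Patroni /cluster response."""
    members = cluster.get("members", [])
    healthy = sorted(m["name"] for m in members if m.get("state") in HEALTHY_STATES)
    unhealthy = sorted(m["name"] for m in members if m.get("state") not in HEALTHY_STATES)
    return {
        "healthy": healthy,
        "unhealthy": unhealthy,
        # Members that are not registered at all (for example, a pod that was never rescheduled).
        "missing": max(expected_members - len(members), 0),
        # Degraded means fewer healthy members than expected, even if the leader is still up.
        "degraded": len(healthy) < expected_members,
    }


def fetch_cluster(base_url):
    """Fetch cluster state from a Patroni REST endpoint (hypothetical base_url)."""
    with urlopen(f"{base_url}/cluster", timeout=5) as resp:
        return json.load(resp)


# Hypothetical payload resembling the state between September 7 and September 10:
# one member crash looping, one gone, one still serving as leader.
sample = {
    "members": [
        {"name": "patroni-0", "role": "replica", "state": "start failed"},
        {"name": "patroni-1", "role": "leader", "state": "running"},
    ]
}
report = summarize_cluster(sample)
if report["degraded"]:
    print(f"ALERT: cluster degraded: {report}")
```

Alerting on `degraded` rather than on a full outage is the point of the sketch: the cluster in this incident kept serving traffic from a single pod, so an availability-only check would have stayed green until the last pod was removed.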