@elof
Last active August 24, 2021 20:08
Incident Summary 19-08-2021

Incident Summary:

Incident Description: Start Date/Time: 07:00, 2021-08-19. End Date/Time: 13:56, 2021-08-19. (Time format is 24h, in the UTC time zone.)

Issue Description: On 19 August, a maintenance window was initiated for the ap-west region between 07:00 and 12:30 UTC. Maintenance windows are designed to cause no disruption for users with global endpoints. In this case, intermittent connectivity issues caused the maintenance window to extend 1 hour beyond its initial scope.

Customer Impact: Users with data stored in a local collection (i.e. not globally replicated) were unable to access their data for approximately 7 hours.

Resolution: After identifying the connectivity issue at 12:42 UTC, we determined the root cause, completed the deployment and region upgrade, and verified availability by 13:56 UTC.
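
For reference, the durations implied by the timestamps above can be checked with a few lines of Python. This is only an arithmetic sketch using the times stated in this summary; the variable names are illustrative and are not taken from any Macrometa tooling.

```python
from datetime import datetime, timezone

# Timestamps as stated in this summary (all UTC, 2021-08-19).
start = datetime(2021, 8, 19, 7, 0, tzinfo=timezone.utc)
issue_identified = datetime(2021, 8, 19, 12, 42, tzinfo=timezone.utc)
availability_verified = datetime(2021, 8, 19, 13, 56, tzinfo=timezone.utc)

# Time from the start of the window until availability was verified.
total = availability_verified - start
# Time from identifying the connectivity issue until availability was verified.
recovery = availability_verified - issue_identified

print(total)     # 6:56:00
print(recovery)  # 1:14:00
```

The total of 6 hours 56 minutes matches the "approximately 7 hours" of customer impact reported above.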

Next Steps: Real-time, actionable insight from data requires real-time access to that data. In this scenario, we did not live up to our, or your, standards.

This incident highlighted multiple issues that will require individual attention. As we work through this backlog, we will postpone maintenance windows for future regions as we implement additional controls.

  1. Networking & connectivity issues during maintenance window

One of our engineers assigned to the maintenance and upgrade work lost connectivity. Handing the tasks off to another global region took longer than expected and extended the maintenance window. We will provide critical engineers with backup internet access through alternative providers.

  2. The maintenance window notification did not provide accurate information for customers with local collections

In this incident, users with local-only collections had no access to their local data in the ap-west region.

The potential impact of maintenance windows will be communicated more clearly in the future and we will advise customers about the impact of local collections during scheduled maintenance windows. This is a process improvement that will be addressed with additional training and SRE runbooks.

  3. Status page updates were lacking or incomplete

Upon identification of the issues, we began addressing them quickly. This, however, was not sufficiently communicated via the status page. We will address this issue with a combination of tooling for automated status updates and better internal definitions for external communications.

In addition, maintenance window notifications (which currently disappear from the status page upon completion) will remain in the Macrometa status page history. Customers will also be proactively notified of regional updates, with notifications sent 72 hours, 48 hours, 24 hours, and 30 minutes before the start of maintenance.
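
As an illustration of the notification schedule described above, the following minimal Python sketch computes when each reminder would be sent for a given window start. The function and constant names are hypothetical and do not reflect Macrometa's actual tooling; only the lead times (72, 48, and 24 hours, and 30 minutes) come from this summary.

```python
from datetime import datetime, timedelta, timezone
from typing import List

# Lead times from the text: 72, 48, and 24 hours, and 30 minutes before start.
REMINDER_OFFSETS = [
    timedelta(hours=72),
    timedelta(hours=48),
    timedelta(hours=24),
    timedelta(minutes=30),
]


def reminder_times(window_start: datetime) -> List[datetime]:
    """Return the UTC send times for each reminder, earliest first."""
    if window_start.tzinfo is None:
        raise ValueError("window_start must be timezone-aware")
    start_utc = window_start.astimezone(timezone.utc)
    return [start_utc - offset for offset in REMINDER_OFFSETS]


# Illustrative input: a maintenance window starting at 07:00 UTC on 2021-08-19.
example_start = datetime(2021, 8, 19, 7, 0, tzinfo=timezone.utc)
for when in reminder_times(example_start):
    print(when.isoformat())
```

Requiring a timezone-aware start time is a design choice made for this sketch, so that every reminder is computed in UTC, the time zone used throughout this summary.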

  4. Macrometa internal escalation of issues

We have a fixed set of escalation steps that we follow. This process is being reviewed, reworked, and improved to ensure that user-impacting issues are escalated in a timely fashion and given appropriate priority immediately.

  5. Local vs. Global collection documentation

Macrometa documentation does not, at present, describe the availability trade-offs associated with implementing local vs. global collections. This will be addressed with additional documentation.
