DO Post Mortem - Mark Imbriaco
Hi, I would like to take a moment to apologize for the problems you may have experienced accessing your droplets in the NYC2 region on July 21st, starting around 6 PM Eastern time. Providing a stable infrastructure for all customers is our number one priority, and whenever we fall short, we work to understand the problem and take steps to reduce the chance of it happening again.
In this case, we've determined that a few related events contributed to the outage:
First, we had a problematic optical module in one of our switches that was sending malformed packets to one of the core switches in our network. Under normal circumstances, losing connectivity to a single core switch should not be problematic since each cabinet in our datacenter is connected to multiple upstream switches. In this case, however, the invalid data caused problems with the upstream core switch.
When the core switch received the invalid packets, they triggered a bug in its software that caused some internal processes related to learning new network addresses to crash. Some of the downstream switches interpreted this condition in a way that caused them to stop forwarding traffic until the link to the affected core switch was manually disabled.
Once traffic forwarding was restored to the core switches, they were flooded with a large volume of MAC address information. Our network is built to handle a complete failure of half of its core switches; however, the volume of address updates generated as a number of cabinets simultaneously cycled up and down triggered built-in denial of service protection features. This protection left the core switches unable to correctly learn new address information, ultimately leading to connectivity problems for some servers.
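As a rough, purely hypothetical illustration of how a protection feature like this can backfire under a flood of updates, consider the simplified Python sketch below. The names, thresholds, and behavior are illustrative assumptions only and do not describe our vendor's actual implementation.

from collections import deque

# Illustrative toy model: a MAC table with a per-window cap on newly
# learned addresses, loosely analogous to the kind of denial of service
# protection described above. All names and thresholds are assumptions.
class ToyMacTable:
    def __init__(self, max_learns_per_window=1000, window_ticks=10):
        self.table = {}                  # MAC address -> port
        self.max_learns = max_learns_per_window
        self.window = window_ticks
        self.recent_learns = deque()     # ticks at which new addresses were learned

    def learn(self, mac, port, tick):
        # Expire learn events that fall outside the current window.
        while self.recent_learns and tick - self.recent_learns[0] >= self.window:
            self.recent_learns.popleft()
        if mac in self.table:
            self.table[mac] = port       # refreshing a known address is always allowed
            return True
        if len(self.recent_learns) >= self.max_learns:
            return False                 # protection engaged: the new address is NOT learned
        self.table[mac] = port
        self.recent_learns.append(tick)
        return True

    def knows(self, mac):
        # A real switch floods (or drops) traffic for unknown destinations.
        return mac in self.table

# A sudden flood of address updates, as when many cabinets flap at once:
switch = ToyMacTable(max_learns_per_window=1000, window_ticks=10)
learned = sum(switch.learn(f"02:00:00:00:{i >> 8:02x}:{i & 0xff:02x}", port=i % 48, tick=0)
              for i in range(5000))
print(f"{learned} of 5000 addresses learned; the rest cannot be forwarded correctly")

In this toy model, once the per-window cap is reached, additional new addresses are simply not learned, so traffic destined to them cannot be forwarded correctly until the flood subsides, which is roughly the symptom we observed.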
Our network vendor has been engaged, and we've been working together to fully understand the scope of the problem and the steps we can take to address it. Concretely, we've begun evaluating software updates that we believe may improve the situation. If we determine, as we hope, that these changes will improve stability in this type of situation, we will build a plan to upgrade our core network to that version as soon as possible. In addition, we continue to look for configuration changes that we can make in the meantime to help prevent this type of problem.
DigitalOcean's top priority is to ensure your droplets are running 24 hours a day, 7 days a week, 365 days a year. We’ve taken the first steps to fully understand this outage and have begun making changes to greatly reduce the likelihood of a similar event in the future. This work is ongoing and we will continue to make changes and validate our infrastructure to ensure that it behaves as expected in adverse conditions.
We will issue an SLA credit for the downtime you have experienced. We realize this doesn't make up for the interruption, but we want to uphold our promise to our users when we fall short.
Thank you for your patience throughout this process. We look forward to continuing to provide you with the highest possible level of service.
Mark Imbriaco
VP, Technical Operations
DigitalOcean