paddybyers/post-mortem-20181203.md

## post-mortem-20181203.md

      
    Raw
  

              post-mortem-20181203.md
            
          
    Summary of incident affecting the Ably production cluster 3 December 2018 - preliminary investigation and conclusions

Overview

See incident on our status site at https://status.ably.io/incidents/572
There was an incident affecting the Ably production cluster on 3 December that caused a significant level of disruption for multiple accounts, primarily in the us-east-1 region, but with some adverse impact on other regions.
The incident lasted for a period of 65 minutes between 1520 and 1625 (all times in UTC) in which there were elevated error rates across all regions, with the greatest impact in us-east-1. During that time, channel creation error rates averaged 23%, and message publish rates averaged around 46%. The us-east-1 region was taken out of service, and traffic diverted to other regions, during that time. The us-east region was reinstated at 1839, without any further disruption.
We are sorry for the disruption that this has caused our customers, and we are determined to do everything we can to learn lessons from the incident to ensure that we avoid similar situations in the future. Our commercial team will be liaising with our customers in the coming days to ensure we take the necessary remedial steps in line with our SLAs.
The specific issues that led to the problems initially, and resulted in them causing extended disruption, are still being investigated. The incident started as a system update was being deployed to the us-west-1 region - that region remained healthy but it triggered an increase in load in us-east-1 and, to a lesser degree, in ap-southeast-1. Although the update that triggered the problem was reverted quickly, those regions did not recover, and us-east problems persisted until load was migrated to other regions.
We will continue to investigate the root cause of the problem and implement bugfixes, and any other improvements to processes, tools or monitoring resulting from an analysis of the events.
Background

We operate a main production cluster of our realtime message platform in multiple AWS regions globally. We have been in the process of rolling out a significant update to the platform which provides a new service discovery layer for communicating the cluster structure between nodes, which aims to provide a much more scalable mechanism for coordinating our system as the global load expands. A core objective of this architectural change is to ensure that the discovery layer remains stable when other components of the system are under stress, and the decision to migrate to this architecture was a principal conclusion of the analysis of an earlier incident.
We have successfully concluded load testing of the new discovery layer, and deployment to other clusters, without issues, and on that basis the new layer was in the process of being deployed across the production cluster. The state on 3 December was that all but 3 regions - us-east-1, eu-central-1 and ap-southeast-1 - had been updated to the new architecture. We were about to proceed with deployment to those remaining regions, but were deploying an update to the already-deployed regions, which included a number of bugfixes and performance improvements.
Several regions had been updated earlier in the day, and the update was being deployed to us-west-1 at the time of the incident.
As it happened

In the description below all times are in UTC.
At 1519 we initiated the deployment of an update to a service discovery node in us-west-1. The deployment process on each instance involves starting a new process and then terminating the superseded process on that instance. As soon as the new process joined the cluster there was a rapid increase in CPU and drop in memory in two core (message processing and routing) nodes in us-east-1. Those nodes became unresponsive and error rates started to rise for the channels located on those instances. Those unresponsive instances were then not visible consistently to other nodes in the region, and a error condition became widespread in which the discovery system could not reach consensus on the cluster state. Within a few minutes, this caused a significant rise in error rates for channels across the region.
Channel errors experienced by clients generally trigger retries by those clients, and this itself generates further load on the system. As part of our standard response to incidents of this type - where a systematic issue is causing errors across multiple instances and services - we disabled the automated responses of our monitoring and health systems to avoid them simply making continuous restart attempts for individual services.
Our first recovery action was at 1539 to revert the 1519 deploy, on the assumption that it had introduced a regression. This operation completed at 1540 and there was no immediate evidence that it had improved the health of the us-east-1 region. At 1540 we determined that there was no immediate sign of an improvement, and decided to move traffic away from both the us-east-1 and ap-southeast-1 in order to give them the capacity margin needed to allow them to recover. Before load could be diverted it was necessary to provision additional capacity in us-west-1 and ap-southeast-2 since this would be the region most likely to take the traffic diverted from us-east. This scale-up was initiated at 1540 and completed at 1605. DNS for the us-east-1 and ap-southeast-1 regions was redirected elsewhere at 1607, which resulted (after DNS expiry) in new client connections being instead established in healthy regions, and new HTTP requests also being directed there. However, existing client connections into those regions were not forcibly closed, so these remained connected. us-west-1 and ap-southeast-2 remained stable with the transfer of load. Error rates on client requests dropped immediately but connections remaining in us-east-1, and some channels, continued to be affected by the us-east-1 instability.
At 1622, with us-east continuing to be unstable, we were faced with a decision of moving forwards or backwards with the rollout of the architectural change. We decided to roll forwards, on the basis that the already-upgraded regions were largely error-free, and the un-upgraded regions were the ones experiencing issues. Therefore we decided to terminate us-east completely, which caused all residual traffic to be transferred, so that we could restart all instances completely with the new archtecture. Very quickly after terminating all old-architecture instances, the system became globally consistent and stable, and client errors dropped away; by 1625 error levels had returned to normal.
Subsequently us-east-1 and the other remaining regions were upgraded to the new architecture and service was reenabled there, without incident.
Investigation

The deploy at 1519 was the specific event that triggered the incident. The cause of that is still being investigated; a working theory is that that update generated large cluster state updates, and a bug in the gossip protocol used by the discovery layer meant that that update imposed excessive memory demands on the old-architecture processes. Two instances in particular were affected by this bug initially and these experienced memory exhaustion, leading them to become unresponsive. A disappointing feature of this aspect, in particular, is that this bug was known, and fixed, and was part of the update that was being rolled out.
Notwithstanding the initial trigger, we do not know yet why the system did not recover without further intervention. At that time of day there was growing load at the start of the US work day, but there was a large capacity margin at the time, and the system was not under stress due to load. It is possible that the mixed configuration at the time - that is, there were some old-architecture and some new-architecture regions - was responsible for that. We did conduct extensive mixed-configuration testing before rolling out the updates, and we had been running in a mixed configuration for around 2 weeks, but something about this specific update might have caused the ongoing instability.
We will continue to try to understand the specific sequence of events we saw but, as things stand, we do believe that this specific vulnerability is not present in the new discovery architecture - and the cascade of load-induced instability and errors is definitely a known failure mode of the old architecture which we were seeking to eliminate. The fact that the system regained consensus and stability as soon as the old-architecture instances were taken out of service has reinforced our belief that the new system will be significantly more resilient in this type of situation.
Aside from the critical triggering event, our investigation has identified a number of bugs and other issues for which we have remedial actions.
Bugs/issues resolved

So far we have not identified any specific bugs that had not already been fixed in the update that is now deployed.
Incident handling procedures

We are looking at the actions we took in handling the incident to understand whether or not we should have acted differently. It is not possible to have a pre-prepared playbook for every eventuality, but there need to be clearly defined strategies for handling situations of different severities, and clear criteria, based on experience, of when a given strategy should be triggered.
Irrespective of the specific cause, our strategy for handling incidents such as this in which there is widespread failure within a region is to divert load away from the region to allow to recover, or be explicitly re-provisioned. This is what we did in this instance - and this was eventually successful - but this procedure relies on the destination regions being healthy and able to accept that traffic. Scaling those other regions in order to do that causes a delay; and that delay contributed to the duration of this incident.
In fact, delays in handling the incident contributed significantly to the impact on our customers, and addressing this is one of the principal areas that improvement is needed. The steps in progression of the incident that delayed resolution were:
from errors first seen at 1519, to the attempt to revert the deploy at 1539. This took too long - partly because our expectation was that the system would recover by itself, and partly resulting from indecision about the appropriate corrective action. The fact that a deploy in one region had triggered instability in another region contributed to our initial hesitation to blame the newly deployed code;
after having decided to divert traffic, the time taken to scale the other regions to accept that traffic. The action was initiated at 1540 but we did not feel confident to divert the traffic until 1607, which is also too long. Our scaling rates have been determined historically by features of the old architecture, and we will be revisiting the rates at which we permit the cluster to scale with the new architecture;
the fact that certain errors continued even after diverting DNS until the eventual step of taking us-east-1 out of action entirely. This contributed to ongoing errors from 1607 to 1622. Again the delay was partly the result of indecision as to the appropriate corrective action - specifically, whether to roll forwards or backwards.
So, of the 65 minutes from start to end, more than half (33 minutes) was time taken to react to circumstances - most of this was waiting for events to unfold rather than indecision - but nonetheless we could arguably have reduced these delays significantly if with better instrumentation we could have known the likely eventual outcome sooner than we did. The other significant delay was the scaling delay, and this is an area we have identified improvement actions.
Architecture changes proposed

The leaking of load, and any cluster errors generally, between regions, is an issue, in that it defeats the strategy of offloading problematic regions in the event of unrecoverable errors in one region. This has been a known issue and we have specific architectural updates underway to address this.
Conclusion

The interim conclusion is that the disruption resulted from excessive system load as a result of a bug in the discovery layer that was in the process of being replaced, followed by an (as yet unexplained) failure to recover autonomously from that situation, probably a bug specific to the mixed-architecture configuration at the time. Resolution of the incident was delayed in part by decision-making, complicated the procedural steps needed to prepare to divert traffic away from the busiest global region.
The fact that the system subsequently recovered almost instantly as soon as we had taken the step of terminating the old-architecture instances validates the decision made at the time, and makes us more confident that the new architecture is a step forward in stability under stress.
Statistics

For the period of the incident (1520-1635)
realtime message failure rate: 43.4%
connection failure rate: 1.6%
channel failure rate: 23.2%
router http response failure rate: 46.0%
Support

Please get in touch if you have any questions in regards to this incident.
https://www.ably.io