Summary of incident affecting the Ably production cluster 30 October 2018 - preliminary investigation and conclusions
See incident on our status site at https://status.ably.io/incidents/567/
There was an incident affecting the Ably production cluster on 30 October that caused a significant level of disruption for multiple accounts, across multiple regions.
The incident was most severe for a 2½-hour period between 1400 and 1630 (all times in UTC) in which there were elevated error rates across all regions, with the greatest impact in us-east-1. During that time, channel creation error rates averaged nearly 20%, and message publish error rates averaged around 9%. For a further period of 3 hours, as the system recovered, channel creation error rates averaged around 9% and message publish error rates averaged nearly 5%.
This was by far the most serious incident affecting service since the inception of the company; we are sorry for the disruption that this has caused our customers, and we are determined to do everything we can to learn lessons from the incident to ensure that we avoid similar situations in the future. Our commercial team will be liaising with our customers in the coming days to ensure we take the necessary remedial steps in line with our SLAs.
The specific issues that led to the problems initially, and resulted in them causing extended disruption, are still being investigated. Broadly, the incident started with a failure to cope with rapidly escalating load in us-east; in response to the resulting disruption in that region we initiated routing of traffic to other regions, but these in turn became overloaded, albeit to a lesser degree.
Some bugfixes and enhancements have been implemented as a result of the investigations so far, and others are anticipated as we develop a fuller understanding of what happened. There are also wide-ranging improvements to monitoring, alerting and visualisation, and operational processes, being planned based on lessons learned.
We operate a main production cluster of our realtime message platform in multiple AWS regions globally. During normal operations that cluster scales autonomously in response to daily load variations. However, in certain regions (notably us-east-1 and us-west-1) there is a predictable and significant surge in load coinciding with the start of the work day in those timezones, and we explicitly scale the cluster ahead of time to more readily handle that morning surge. As the global production load has increased over recent months we have seen a steady increase in the total cluster size in these regions, with reactive load-driven scaling continuing to occur during the day even after the preemptive morning daily scaling. As load subsides, the cluster autoscales down again from early evening.
Each region is scaled independently, but the regions operate together as a single cluster (for example to satisfy requirements for data to be persisted in more than one region); this means that load variations in one region always leak to some degree into other regions, and these will in turn scale reactively based on load metrics.
When the system scales, certain roles (eg the processing in a region for a given channel) relocate, in order to maintain a balance of load across the resized system. Consistent hashing is used as the basis for placement and relocation of roles, and this depends on there being a consensus across the cluster (or, at least, a group consensus among all participants in a given channel) as to the cluster state. A consensus formation process ensures that a consistent view of cluster state is reestablished after any change. Scaling operations, and the resultant relocation, themselves generate load, and so the system operates with a certain margin of capacity so that scaling operations can be performed when required.
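The placement scheme described above can be illustrated with a minimal consistent-hashing sketch (illustrative only; the names, virtual-node count and structure below are assumptions, not Ably's actual implementation). Channels hash to points on a ring and are owned by the nearest node point clockwise, so resizing the cluster relocates only a fraction of the channels rather than reshuffling everything:

```python
import hashlib
from bisect import bisect

def _hash(key: str) -> int:
    # Stable 64-bit integer hash of a string key
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

class HashRing:
    """Minimal consistent-hash ring: each node contributes several
    virtual points; a channel is placed on the first node point at or
    after the channel's hash, wrapping around the ring."""

    def __init__(self, nodes, vnodes=64):
        self.ring = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self.keys = [h for h, _ in self.ring]

    def owner(self, channel: str) -> str:
        idx = bisect(self.keys, _hash(channel)) % len(self.ring)
        return self.ring[idx][1]

ring = HashRing(["node-a", "node-b", "node-c"])
bigger = HashRing(["node-a", "node-b", "node-c", "node-d"])
channels = [f"channel-{i}" for i in range(1000)]
moved = sum(ring.owner(c) != bigger.owner(c) for c in channels)
# Only a minority of channels relocate when a node is added
print(f"{moved} of {len(channels)} channels relocated")
```

This also shows why a consistent view of cluster membership matters: two nodes that disagree on the node list compute different owners for the same channel, which is exactly the kind of inconsistency the consensus formation process exists to resolve.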
There is always the possibility of extreme situations in which load changes so rapidly that the autoscaling system is unable to respond in time to handle the load, and there is insufficient capacity margin for scaling to happen; in such situations certain monitored load parameters can trigger any instance to enter a defensive "siege mode"; in this state, in addition to enforcing the usual rate-limits for API requests etc, the system can elect to reject certain operations (which under normal circumstances would be legitimate) in order that the load does not cause the system to become overloaded, or unable to react. Siege mode is triggered rarely, and is usually a result of extreme load spikes, or unexpected system failures affecting a large number of nodes.
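A simplified sketch of how such a defensive mode might gate incoming operations (purely illustrative; the thresholds, operation names and structure are assumptions, not Ably's real implementation). Under siege, load-creating operations are rejected while operations on already-active resources are still served:

```python
from dataclasses import dataclass

@dataclass
class InstanceLoad:
    cpu_pct: float        # recent CPU utilisation, 0-100
    free_mem_pct: float   # free memory as a percentage of total

# Hypothetical thresholds at which an instance enters siege mode
SIEGE_CPU_PCT = 90.0
SIEGE_FREE_MEM_PCT = 10.0

def in_siege_mode(load: InstanceLoad) -> bool:
    return load.cpu_pct >= SIEGE_CPU_PCT or load.free_mem_pct <= SIEGE_FREE_MEM_PCT

def admit(operation: str, load: InstanceLoad) -> bool:
    """Reject operations that would create new load (eg channel
    creation) while under siege; admit everything else."""
    if in_siege_mode(load) and operation == "create_channel":
        return False
    return True

busy = InstanceLoad(cpu_pct=97.0, free_mem_pct=25.0)
print(admit("create_channel", busy))   # new load rejected under siege
print(admit("publish_message", busy))  # existing channels still served
```

Note that, as the sketch makes clear, this style of gating only refuses *new* load; it does nothing to reduce load the instance is already carrying, which becomes relevant later in this report.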
Also, we anticipate the possibility that entire AWS regions can stop functioning effectively, because of widespread networking issues, or widespread and persistently high error rates. In such cases we can divert all user traffic away from the affected regions in order to maintain continuity of service.
As it happened
In the description below all times are in UTC.
On 30 October the scheduled scale-up had completed in us-east-1 by 1130 but as load increased there were multiple reactive scaling events from 1301 through to 1340. Throughout this period the system load metrics remained relatively high; the reactive scaling was keeping up as total load increased, but per-instance load metrics did not fall as a result.
At 1345 a scaling event triggered a load spike simultaneously in multiple instances, very rapidly increasing CPU. Often such situations are transient and after a perturbation the system reestablishes normal operation quickly. So we didn't take any specific corrective action for a couple of minutes but watched the various metrics. Error rates in the region started to increase; by 1350 multiple instances were at 100% CPU, and this metric triggered further scaling.
Over the next few minutes the health of the cluster in the region continued to deteriorate, with more instances becoming overloaded, and rising error rates. There was insufficient capacity margin for further scaling in the region and siege mode was triggered widely across the system.
As part of our standard response to incidents of this type, we disabled the automated responses of our monitoring and health systems to avoid them simply making continuous restart attempts for individual services.
Load was starting to increase in us-west-1 (which is an expected result of client retries if requests at the primary endpoint fail). In order to respond to that we started to scale us-west-1 at 1355.
With no sign of recovery evident, we decided at 1405 to divert traffic away from us-east-1 in order to give it the capacity margin it needed to allow it to recover. Before load could be diverted it was necessary to provision additional capacity in us-west-1 and eu-west-1, since these would be the regions most likely to take the traffic diverted from us-east.
Scaling of those regions was initiated at 1406 and at 1415 we decided that they were capable of accepting the us-east traffic. DNS for the us-east region was redirected elsewhere, which resulted (after DNS expiry) in new client connections being instead established in us-west or eu-west, and new HTTP requests also being directed there. However, existing client connections into us-east were not forcibly closed, so these remained connected to us-east.
Once load was diverted from us-east we started a process of manually restarting instances that had become unhealthy (since automated system recovery was disabled). This was completed by 1428 and the health of systems in us-east started to make a decisive recovery.
Load increased in us-west particularly, and there was a very gradual drop in load in us-east. By 1436 us-west was also becoming overloaded and error rates rose there. In order to relieve that we reenabled traffic to us-east at 1450. us-west total load recovered to sustainable levels, but there were still high error rates in both of these regions and, to a lesser extent, elsewhere.
At this point the elevated error rates continued, due to a widespread lack of consensus on cluster state. We were unable to reenable automated recovery because this would have triggered a significant number of simultaneous restarts, which would have had a net adverse effect on the cluster state. As a result we were forced to perform a systematic manual recovery of individual processes across multiple regions in order to restore service to normal.
Error rates fell gradually as individual processes were repaired, but this was a slow process; they had fallen substantially by 1543, at which point only a small number of accounts were experiencing errors. Reenabling automated recovery was then possible.
We continued to see above normal error levels as residual problems were evident in various systems that had not been detected and repaired automatically. After a number of investigations and manual interventions the service was fully restored by 1900.
The addition of a node to the cluster at 1345, and the reaction at 1347, was the specific event that caused the situation to escalate from simply being busy to an incident with disruption of service. The cause of that is still being investigated; a working theory is that the node was brought into the cluster before it was ready, and the resulting cluster state change caused multiple clients of that faulty node to experience errors nearly simultaneously; their responses to those errors included logging a significant volume of information via blocking I/O, which caused certain other event processing to stall. We will continue to try to understand this critical sequence of events.
The temporary lack of consensus on cluster state caused certain operations to be retried, and the pattern of retries itself escalated the load on the affected nodes. This resulted in a backlog of processing of messages and other internal requests, which in turn caused free memory to drop sharply.
The overloading of instances resulted in them becoming unresponsive to health checks - causing them to be taken out of service - and also triggered further scaling. That scaling, whilst overloaded, further compounded the problem: it generated load that the system was unable to handle, and it caused further cluster inconsistency, which led to errors and retries, further increasing load.
Siege mode is intended to prevent the system from entering such a deadlocked state, but at present it is essentially a strategy of rejecting operations that would increase load; for example, rejecting a channel creation attempt on an instance whose memory usage is above a critical threshold. However, siege mode as currently implemented is unable to shed existing load - it will not shut down an already-active channel, for example. Therefore it did not help once the state of the system had gone beyond a certain point.
Ultimately, our strategy for handling incidents such as this, in which there is widespread failure in a region, is to divert load away from the region to allow it to recover, or be explicitly reprovisioned. This is what we did in this instance - and it was eventually successful - but this procedure relies on the destination regions being healthy and able to accept that traffic. Scaling those other regions in order to do that causes a delay, and that scaling itself induces further problems in an overloaded cluster.
Aside from the critical triggering event, our investigation has identified a number of bugs and other issues for which we have remedial actions.
Operations that are retried as a result of temporary cluster inconsistency can result in increased load; a backoff mechanism has been implemented to avoid this.
Some sequences of events with instances being simultaneously added to the cluster, as others are removed, can cause temporary corruption of internal data structures used for role placement; this has been resolved.
Some channels could end up in a persistently failing state in the case of a specific sequence of errors in their startup and subsequent retries; this has been resolved.
We were unable to reenable our automated recovery services selectively (eg regionally); this is now fixed. This was a significant impairment of our ability to reinstate full operation after the initial crisis was largely resolved.
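The retry backoff mechanism mentioned above can be sketched as follows (a minimal illustration; the base delay, cap and jitter strategy are assumptions, not the values used in production). Each retry waits exponentially longer, with full jitter, so that a burst of simultaneous failures does not translate into a synchronised burst of retries hitting an already-struggling node:

```python
import random

def backoff_delays(base=0.1, cap=10.0, attempts=6):
    """Exponential backoff with full jitter: retry attempt n waits a
    random time in [0, min(cap, base * 2**n)] seconds."""
    return [random.uniform(0, min(cap, base * 2 ** n)) for n in range(attempts)]

delays = backoff_delays()
print([round(d, 3) for d in delays])  # jittered, growing upper bound
```

The jitter is the important part for the failure mode seen in this incident: without it, operations that all failed at the same moment would all retry at the same moment, recreating the load spike on each cycle.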
Architecture changes proposed
We were already in the process of migrating to a new system discovery layer which avoids the problems we encountered in which overloaded nodes were unable to reach consensus on cluster state. It will also hugely accelerate scaling operations for loaded clusters, which was an issue in this case. In fact, the excessive logging of state when specific errors occurred was part of the preparatory work for this change. This will be progressively rolled out from next week.
The leaking of load, and of cluster errors generally, between regions is an issue, in that it defeats the strategy of offloading a problematic region in the event of unrecoverable errors there.
Siege mode proved to be ineffective once a certain threshold of load had been crossed; refusing new load, by itself, does not enable instances to continue if they are already past the point of being able to function. A more aggressive set of siege mode actions, probably including shedding of existing connections or channels where necessary, would enable us to continue to provide a lower level of service during an incident without the sort of widespread disruption we saw. We are working on siege mode enhancements now.
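One possible shape of the more aggressive siege mode is sketched below (purely illustrative; the selection policy, cost estimates and names are hypothetical, not a description of what will ship). When an instance remains past its limits even after rejecting new load, it sheds existing channels, heaviest first, until its estimated usage is back below a target:

```python
def shed_channels(channels, mem_used_pct, target_pct=80.0):
    """channels: list of (channel_id, est_mem_pct) pairs.
    Shed the heaviest channels first until estimated memory usage
    falls to target_pct or below; return (survivors, shed)."""
    survivors, shed = [], []
    for cid, cost in sorted(channels, key=lambda c: c[1], reverse=True):
        if mem_used_pct > target_pct:
            shed.append(cid)       # close this channel to reclaim memory
            mem_used_pct -= cost
        else:
            survivors.append((cid, cost))
    return survivors, shed

active = [("chat", 6.0), ("telemetry", 15.0), ("presence", 2.0), ("firehose", 9.0)]
survivors, shed = shed_channels(active, mem_used_pct=95.0)
print("shed:", shed)  # the single heaviest channel is enough here
```

The trade-off is deliberate: degrading service for a few heavy channels is preferable to the whole instance becoming unresponsive and taking every channel down with it.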
Monitoring, alerting and visualisation
When the cluster was continuing to experience errors due to lack of consensus on cluster state, we knew this was happening, but did not have sufficiently detailed information in real time to know the real extent of the problem. Improved metrics and visualisation would have enabled us to understand what the cause was at the time, and to take specific corrective action, instead of having to wait for load to subside before the system could reach consensus again. This has partly been addressed already via improved metrics.
When instances were overloaded (either high CPU or low memory), we did not have sufficient real-time visibility into the causes. That's not to say that there would have been specific corrective actions we could have taken, but without that information it was harder to understand at the time the causes of the behaviour we were seeing.
Incident handling procedures
We are looking at the actions we took in handling the incident to understand whether or not we should have acted differently. It is not possible to have a pre-prepared playbook for every eventuality, but there need to be clearly defined strategies for handling situations of different severities, and clear criteria, based on experience, of when a given strategy should be triggered. The strategy of migrating traffic away from us-east is something we have undertaken before, and with hindsight this was the correct action; however we lacked tooling to do this efficiently, so it took longer than necessary. This tooling shortfall is being addressed.
Enterprise customers have custom CNAMEs set up for Active Traffic Management, which enables us to migrate their traffic to completely dedicated temporary infrastructure if there is an incident that would cause significant disruption. This procedure was followed in this incident, but some configuration shortcomings in some clients prevented it from working. We also lacked tooling to perform this transfer efficiently; we are in the process of addressing both the process and tooling issues highlighted in this incident.
In the short term, while we complete the actions arising from this incident, we are scheduling scaling explicitly, earlier, and to a higher scale, for us-east and us-west, to minimise the chances of any recurrence of the conditions that triggered the error.
The outage resulted from an incorrect system response to an error occurring whilst scaling to meet unusually high load. The nature of the response - whose specific cause is still being analysed - was that multiple instances were affected at the same time, which led to significant disruption in the region; this in turn affected other instances and, ultimately, other regions. From a situation of widespread disruption, resolution took a long time due to the inherent complexity of dealing with a large cluster, some bugs, and some operational inefficiencies.
Although our response was broadly the correct approach, it created far greater disruption than is acceptable, and it took too long to restore service fully. We have identified a range of bugs and other issues highlighted by this experience.
Ultimately this incident resulted from an obscure error condition and a bug which was present because the specific circumstances required to trigger it were not part of our integration testing. We're not going to pretend that we will ever be completely bug-free; we have to acknowledge that issues arise at scale and under load that are not detected at an earlier stage, so we are fundamentally reliant on load testing to avoid incidents of this type in service. In this case, the problem was compounded by our inability to respond effectively enough, as a result of the issues identified above.
For the period of peak disruption (1400-1630):
- message failure rate: 9.0%
- connection failure rate: 0.2%
- channel failure rate: 19.4%
- router http response failure rate: 3.1%
For the incident overall (1345-1900):
- message failure rate: 4.8%
- connection failure rate: 0.09%
- channel failure rate: 9.3%
- router http response failure rate: 1.7%
Please get in touch if you have any questions regarding this incident.