paddybyers/incident_report_2021-0210.md Secret

## incident_report_2021-0210.md

      
Summary of incident affecting the Ably production service 10 February 2021 - investigation and conclusions

Overview

There was an incident affecting Ably production services, both on the main production cluster and multiple dedicated clusters, on 10 February 2021, which caused all connections in the us-east-1 region to be terminated and forced to reconnect elsewhere. The resulting load spike in other regions, particularly us-west-1, caused longer than usual reconnection latencies and elevated message processing latencies.
The incident started when AWS Network Load Balancers (NLBs) in the us-east-1 region terminated all connections simultaneously [1]. The failure affected multiple Availability Zones (AZs) and multiple Ably clusters in that region. The subsequent flood of reconnections, primarily in us-west-1, temporarily exceeded the capacity of the backend Ably system, which then required clients to connect to other regions to maintain service.
Access to Ably Reactor Queues explicitly provisioned in the us-east-1 region was also affected.
At the present state of the investigation we have a good understanding of the factors that initially triggered the incident, and have evaluated the effectiveness of both the automated responses and our explicit interventions to manage the incident. Several remedial actions have been identified, and implementation of the most critical actions is underway.
Background

Ably operates a number of production clusters for its service around the globe. There is a main production cluster that services the majority of customer accounts, plus a small number of dedicated clusters for specific accounts. Each of the clusters has a presence in multiple regions in a globally federated system.
Access network, client connections and routing

In each region, the frontend layer of the messaging system, which terminates subscriber connections and handles REST requests, consists of a group of endpoint instances, behind a mesh of routers/reverse proxies which are responsible for distributing connections and requests among available endpoint instances, and implement endpoint affinity for certain classes of requests. These routers are served by AWS NLB instances in each region and these, in turn, are served by our CDN / edge caching layer.
A Latency Based DNS service is used to route the generic API hostnames (ie rest.ably.io and realtime.ably.io) to the nearest available and healthy service endpoint (which will usually be the nearest datacenter geographically). Route53 health checking is able to route requests to more remote service endpoints in the case that the nearest is detected to be unhealthy.
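As an illustration only, the routing decision described above can be sketched in a few lines of Python. This is not Route53's algorithm or Ably's configuration; the `RegionEndpoint` type, the `resolve` function and the latency figures are hypothetical and exist only to show the idea of "lowest latency among healthy endpoints".

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class RegionEndpoint:
    name: str
    latency_ms: float  # hypothetical measured latency from the client's resolver
    healthy: bool      # result of the most recent health check


def resolve(endpoints: List[RegionEndpoint]) -> Optional[RegionEndpoint]:
    """Return the lowest-latency endpoint that is passing health checks."""
    candidates = [e for e in endpoints if e.healthy]
    if not candidates:
        return None
    return min(candidates, key=lambda e: e.latency_ms)


# Hypothetical view from a client near the US east coast while us-east-1 is unhealthy.
endpoints = [
    RegionEndpoint("us-east-1", 20.0, healthy=False),
    RegionEndpoint("us-west-1", 70.0, healthy=True),
    RegionEndpoint("eu-west-1", 90.0, healthy=True),
]
chosen = resolve(endpoints)
print(chosen.name)  # us-west-1
```

The sketch also shows why the loss of one region concentrates load: every client whose nearest region is unhealthy is sent to the same next-nearest region.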
Ably client libraries are configured with a primary endpoint (strictly, an environment option which allows the library to construct a canonical endpoint hostname) and fallback endpoints (the hostnames of endpoints in non-primary regions). The libraries use the fallback hosts if an error indicates that there was a 5xx error from (or failure to respond by) the primary host, and an independent check confirms that internet connectivity generally from the client is not the issue. This functionality aims to ensure that, if the Ably cluster in one region becomes unreachable or is unhealthy, clients can connect to another region and continue to be serviced with minimal disruption.
Therefore there are two independent mechanisms that are capable of ensuring continued access to service via an alternate region if the nearest region is unhealthy.
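The client-side fallback behaviour described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the code of any Ably SDK: `HostUnavailable`, `request_with_fallback`, the `send` and `internet_ok` callbacks and the `example.test` hostnames are all hypothetical.

```python
from typing import Callable, Sequence


class HostUnavailable(Exception):
    """Hypothetical error type: a 5xx response or a failure to respond."""


def request_with_fallback(
    primary: str,
    fallbacks: Sequence[str],
    send: Callable[[str], str],
    internet_ok: Callable[[], bool],
) -> str:
    """Try the primary host, then each fallback host in turn.

    Fallbacks are only tried if the primary failed with HostUnavailable and an
    independent check says the client's own connectivity is fine.
    """
    try:
        return send(primary)
    except HostUnavailable as exc:
        last_error = exc
    if not internet_ok():
        # The problem is local to the client; other regions will not help.
        raise last_error
    for host in fallbacks:
        try:
            return send(host)
        except HostUnavailable as exc:
            last_error = exc
    raise last_error


def fake_send(host: str) -> str:
    # Simulate an outage affecting only the primary region.
    if host.startswith("primary"):
        raise HostUnavailable(host)
    return "ok from " + host


result = request_with_fallback(
    "primary.example.test",
    ["fallback-a.example.test", "fallback-b.example.test"],
    fake_send,
    lambda: True,
)
print(result)  # ok from fallback-a.example.test
```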
Push messaging

Ably provides a push notifications service that delivers events to mobile and other device targets via the Apple Push Notification Service (APNS) and Firebase Cloud Messaging (FCM) transports. Push notifications can be published directly to a specified recipient, or can be triggered as a result of messages being published to certain Ably channels, if devices have registered a prior interest in push events associated with those channels.
The push service operates via two subsystems that are an adjunct to the main Ably messaging system. First, there is a RabbitMQ cluster that is used to collect push message jobs (either direct publishes, or channel messages for channels that can have push subscriptions). Second, there is a pool of worker processes that consume from push queues and process those jobs, retrieving registered device details as required, and submitting the resulting requests to APNS and FCM endpoints.
Both services are provisioned in us-east-1 and eu-west-1 and are hosted on EC2 instances in AWS, and deployment and orchestration is controlled by Ably's infrastructure software. All processes deployed by Ably to EC2 run under the control of a manager process that is responsible for establishing the launch configuration, managing continuous deploys, and monitoring the ongoing health of the process. Push messages are processed by default in the us-east-1 cluster, except for accounts configured to process messages only in the EU. In the case of failure of the us-east cluster, failover is to the eu-west cluster, but this failover is manually triggered.
Reactor Queues

Ably provides Reactor Queues in a multi-tenanted RabbitMQ cluster that clients can access via a dedicated endpoint, and interact with via AMQP or STOMP. Reactor Queues are designed for non-critical, low to medium volume use-cases and have a guideline limit of no more than 200 messages per second per account.
These queues are provisioned explicitly in a specific region - currently, as for push messages, there are clusters in us-east-1 and eu-west-1. Access by consumers to the consumer endpoint is via a cluster-specific NLB, with a region-specific endpoint. Since Reactor Queues are provisioned in a specific region, and consumed from that region, there is no failover mechanism to provide continuity of service in the case of problems in a region. For this reason, we do not provide the same SLA for Reactor Queues as we do for the core Ably messaging service.
As it happened

In the description below all times are in UTC, on 2021-02-10 unless otherwise stated.
The triggering event was an abrupt failure of AWS NLBs in the us-east-1 region [1] at 2021-02-10T19:06:00Z. (The AWS status site did not initially report the incident; it was eventually acknowledged to be a live incident at 2021-02-10T20:19:00Z, more than 1 hour later.)
At 2021-02-10T19:06:00Z we were alerted to a significant load spike in the us-west-1 region of both the main production cluster and of a separate dedicated cluster. All connections in us-east-1 had been abruptly closed at 19:06:30 and as a result a flood of new connection attempts started to arrive in us-west, doubling the established connection count within a few seconds. Memory and CPU metrics initiated autoscaling in the region, and the load on the instances increased all processing latencies, to the point that even existing connections could not be sustained and were dropped. Manually-initiated scaling, to triple the capacity in the region, was initiated at 19:09. The ongoing stream of connection attempts continued to fail in us-west until the additional capacity came online and was able to service traffic from 19:14. A similar process was followed in the dedicated cluster.
At 19:20 all other clusters that had a us-east-1 presence suffered a similar sudden loss of all connections. Most of the traffic was diverted to the next-nearest US endpoint by latency-based routing. To varying degrees, service in the US in those dedicated clusters was impaired by the loss of us-east, and overloading of other available regions. Again, significant scaling was manually triggered in all cases to supplement the scaling already initiated automatically. Except in those dedicated clusters that only had 2 regions, service remained available via endpoints in the EU or elsewhere. In all cases, service in the overloaded US regions returned over the course of the next 10-15 minutes. By 19:20 service was fully restored in all available regions (ie excluding us-east-1) in the main production cluster, and for all other clusters by 19:33.
In all cases, connectivity in us-east-1 remained unavailable for several hours. Although DNS was already routing traffic away from us-east-1 on the basis of health checks, a change was made to the Ably DNS configuration to route traffic away from the region unconditionally until further notice.
AWS confirmed the existence of a problem with NLBs in us-east-1 at 20:19. By 2021-02-11T00:03:00Z we started to see service restored to some NLBs, and confirmed all remaining NLBs had service restored by 2021-02-11T00:38, more than 5½ hours after the onset of problems. We progressively restored Ably service across all clusters by re-provisioning capacity and then redirecting traffic back. By 2021-02-11T02:00:00Z service in us-east-1 was resumed in all clusters.
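The durations quoted in this timeline can be checked directly from the timestamps given above. The short Python sketch below does only that arithmetic; the helper name `utc` is illustrative, and the 00:38 timestamp is assumed to mean 00:38:00Z.

```python
from datetime import datetime, timezone


def utc(ts: str) -> datetime:
    """Parse a timestamp of the form used in this report, e.g. 2021-02-10T19:06:00Z."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


onset = utc("2021-02-10T19:06:00Z")            # NLB failure in us-east-1
aws_ack = utc("2021-02-10T20:19:00Z")          # AWS acknowledges the incident
nlbs_restored = utc("2021-02-11T00:38:00Z")    # all NLBs confirmed restored
service_resumed = utc("2021-02-11T02:00:00Z")  # us-east-1 resumed in all clusters

print(aws_ack - onset)          # 1:13:00
print(nlbs_restored - onset)    # 5:32:00
print(service_resumed - onset)  # 6:54:00
```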
Push messaging

Initially all push messages through the us-east-1 cluster continued as normal, because the ability of Ably's internal processes to publish to, and consume from, the RabbitMQ cluster in the region was not affected by the NLB problems. However, since the RabbitMQ cluster is shared with the Reactor Queues service, external consumers' ability to consume from the cluster via those queues was impaired by the NLB problem and, as a result, memory usage in the cluster started to increase, and instances first hit a high water mark (HWM) at 20:23 for a period of 10 minutes until 20:33. During that time, up to 25% of push messages were discarded.
Push traffic was migrated to the eu-west cluster by manual intervention at 20:37, and remained there with no further disruption until it was finally restored to us-east by 2021-02-11T00:56:00Z.
Reactor Queues

User queues provisioned in us-east-1 are unavoidably pinned to the us-east-1 region and NLBs mediating access for consumers are a Single Point of Failure (SPOF) of that service. Reactor Queues in us-east-1 were partially unreachable by consumers for the period from the initial NLB problem at 19:20 until service was restored at 23:37. During this time, queues were intermittently unavailable due to connectivity problems, and recurring problems with the RabbitMQ cluster rejecting messages due to repeatedly hitting the HWM. A more detailed analysis of the impact of the incident on the Reactor Queues service is underway.
Investigation

The investigation has identified a number of issues relating to how the service responded to the NLB problem and the resulting redistribution of load, as well as the way the incident was managed and our interventions. These factors are discussed below.
Issues identified

Load spike to alternate regions exceeded capacity in those regions

A large number of clients with dropped connections attempted simultaneous reconnection in alternate regions. This load spike led to unsustainable load in those regions, with the result that the migrating connections could not be established and, to a degree, existing connections in those regions were compromised.
Complete loss of connectivity to a single endpoint should not ordinarily cause any problems for clients, other than a dropped connection that immediately reconnects to an alternate endpoint. For those clients that suffered disconnection from us-east, the subsequent problems arose predominantly from this issue, in that the principal alternate region was not initially able to handle the load spike. There are several aspects to this problem.
The simultaneity of reconnection attempts itself created an unsustainable spike. Often described as the "thundering herd" problem, the problem is that all connected clients were disconnected almost simultaneously, and their reconnection attempts were therefore also near-simultaneous. This created an instantaneous rate of connection attempts that exceeded the available capacity.
Clients will by default attempt reconnection immediately after a dropped connection. This is because the majority of disconnections in ordinary operation are a result of either:

server-initiated disconnections resulting from scaling events, server process redeployments, or active shedding to rebalance load;
general client-to-server connectivity issues.

In both cases an immediate first reconnection attempt is appropriate. Backoff will take place if the first connection attempt fails.
However, this immediate reconnection attempt can give rise to the simultaneous reconnection spike we saw in this case. It would be possible to introduce a reconnection delay, and jitter, to these attempts if the client is able to distinguish between legitimate server-initiated disconnections, and unexpected disconnections. However, this would require a behaviour change in all SDKs, and would not improve the situation for clients that do not use a library (eg those using Server-Sent Events or MQTT).
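To make the idea of a jittered reconnection delay concrete, here is a minimal Python sketch. It is not the behaviour of any current Ably SDK: the function name `reconnect_delay`, the `server_initiated` flag, and the `base` and `cap` values are assumptions chosen for illustration, and the jitter scheme shown (a uniformly random delay up to an exponentially growing ceiling) is one common choice among several.

```python
import random


def reconnect_delay(
    attempt: int,
    server_initiated: bool,
    base: float = 1.0,
    cap: float = 30.0,
    rng=random,
) -> float:
    """Seconds to wait before reconnection attempt number `attempt` (0-based).

    A first attempt after a server-initiated disconnection is immediate.
    Every other attempt waits a random time in [0, min(cap, base * 2**attempt)],
    so clients that were disconnected together do not retry together.
    """
    if attempt == 0 and server_initiated:
        return 0.0
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


# Hypothetical usage: an unexpected disconnection, followed by repeated failures.
rng = random.Random(2021)
for attempt in range(4):
    delay = reconnect_delay(attempt, server_initiated=False, rng=rng)
    print(f"attempt {attempt}: wait {delay:.2f}s")
```

With this scheme, clients that lose their connections at the same instant spread their first retry over up to `base` seconds instead of arriving in the same moment, at the cost of a slightly slower reconnection for each individual client.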
The load from migrated connections, even without the simultaneous reconnection events, would have exceeded capacity available in the alternate region. Notwithstanding the problems arising from the reconnection spike, a smoothed profile of reconnection attempts would still have given rise to a volume of connections that was unsupportable given the capacity available in the alternate regions at the time. Scaling adds capacity over time, but at any given instant it is necessary to operate within the bounds of the capacity that is available.
The regions were vulnerable to denial of existing service. This is related to both of the above points. If new load demand is not efficiently rejected - whether that is an instantaneous spike or just excessive in aggregate - then it risks continuity of service for existing load. This is in fact what happened in this case: in the alternate US regions, the arriving load caused CPU to spike to 100% for an extended time, which caused existing traffic in the region to be dropped for several minutes until capacity increased. There is an existing mechanism ("siege mode") that frontend instances use to reject incoming requests when various load metrics indicate that the instance is at capacity, but in this case the cost of processing requests up to the point of rejection was sufficient to overload some instances.
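The principle at stake here, that a rejection must cost far less than an accepted request, can be illustrated with a small Python sketch. It does not describe the actual "siege mode" implementation: the `LoadSnapshot` and `SiegeThresholds` types, the threshold values and the status codes are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class LoadSnapshot:
    cpu: float        # fraction of CPU in use, 0..1
    memory: float     # fraction of memory in use, 0..1
    connections: int  # currently established connections


@dataclass
class SiegeThresholds:
    # Hypothetical limits; real values would be tuned per instance type.
    cpu: float = 0.85
    memory: float = 0.90
    connections: int = 10_000


def in_siege_mode(load: LoadSnapshot, limits: SiegeThresholds) -> bool:
    """True if any load metric indicates the instance is at capacity."""
    return (
        load.cpu >= limits.cpu
        or load.memory >= limits.memory
        or load.connections >= limits.connections
    )


def handle_connect(
    load: LoadSnapshot,
    limits: SiegeThresholds,
    expensive_setup: Callable[[], None],
) -> int:
    """Reject before doing any expensive work (authentication, state setup, ...)."""
    if in_siege_mode(load, limits):
        return 503  # cheap rejection: no per-request work has been done yet
    expensive_setup()
    return 200


calls = []
status_busy = handle_connect(LoadSnapshot(0.97, 0.50, 100), SiegeThresholds(), lambda: calls.append("setup"))
status_ok = handle_connect(LoadSnapshot(0.30, 0.50, 100), SiegeThresholds(), lambda: calls.append("setup"))
print(status_busy, status_ok, calls)  # 503 200 ['setup']
```

The check only protects existing traffic if everything that runs before it (connection acceptance, TLS, request parsing) is itself cheap, which is the gap identified in this incident.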
In response to this incident we have decided to accelerate a plan to move the first stage of request processing to an independently scalable layer, which will mean we can operate that specific functionality with a much greater capacity margin. This request processing element will also enforce rate limits on connection establishment generally on a per-account basis, and connection attempts that are rejected for rate-limiting reasons can be designated explicitly as such and handled by clients appropriately, by introducing jittered backoff before retries.
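One common way to enforce a per-account limit on connection establishment is a token bucket per account. The Python sketch below is an illustration of that general technique only; it is not the design of the planned request-processing layer, and the class name `AccountConnectionLimiter`, the rate and burst values and the account identifiers are hypothetical. The current time is passed in explicitly to keep the example deterministic; a real limiter would typically read a monotonic clock.

```python
from typing import Dict, Tuple


class AccountConnectionLimiter:
    """Token bucket per account: `rate_per_sec` sustained, `burst` at most at once."""

    def __init__(self, rate_per_sec: float, burst: float) -> None:
        self.rate = rate_per_sec
        self.burst = burst
        # account id -> (tokens remaining, time of last update)
        self._state: Dict[str, Tuple[float, float]] = {}

    def allow(self, account: str, now: float) -> bool:
        """Return True if a new connection attempt for `account` is admitted."""
        tokens, last = self._state.get(account, (self.burst, now))
        # Refill in proportion to elapsed time, never beyond the burst size.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens >= 1.0:
            self._state[account] = (tokens - 1.0, now)
            return True
        self._state[account] = (tokens, now)
        return False


limiter = AccountConnectionLimiter(rate_per_sec=2.0, burst=3.0)
spike = [limiter.allow("account-a", now=0.0) for _ in range(5)]
print(spike)                                # [True, True, True, False, False]
print(limiter.allow("account-a", now=1.0))  # True: two tokens refilled after 1s
print(limiter.allow("account-b", now=0.0))  # True: other accounts are unaffected
```

An attempt that gets `False` here corresponds to a connection that would be rejected explicitly for rate-limiting reasons, which is the signal a client would use to apply a jittered backoff before retrying.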
US traffic is too dependent on a single region

In the main Ably production cluster, a significant majority of all US traffic is in us-east-1. We also have datacenters in the us-west-1 region, but latency-based routing takes clients to us-east-1 for most of the large US population centres. We have decided to expand our presence also to include us-east-2; this change will be rolled out with immediate effect.
Dedicated clusters with only two regions are too vulnerable to capacity shortfall and denial of service in the case that there is a complete outage in a region

The main production cluster, and all dedicated clusters with 3 or more regions, were able to provide continuous service - at most, one region was temporarily unavailable - but those clusters with only 2 regions were unable to provide service for some period. It is strongly preferred that all dedicated clusters have at least 3 regions and we will be contacting customers in the coming days to address that.
The requirement for manual failover for push message queue processing led to avoidable disruption

As noted, manual intervention was necessary in order to migrate push message processing to an alternate queue cluster in eu-west, and this caused push message processing to be suspended for around 10 minutes.
The health of the RabbitMQ cluster in us-east-1 was deteriorating during the incident. Alerts, indicating failure of end-to-end delivery, were generated as soon as push message delivery latencies exceeded 90s, but these alerts were missed in among all of the other alerts arising from the incident. Furthermore, these alerts relating to end-to-end service did nothing to highlight the specific reason for the failures (which in this case was the HWM problem in some instances in the RabbitMQ cluster).
We are implementing automated failover in order to deal with this situation in the future. In addition, further metrics and associated alerting have been added to improve monitoring of the RabbitMQ cluster.
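As a rough illustration of what automated failover involves, the Python sketch below switches the active push cluster after a number of consecutive failed health checks. It is a sketch under stated assumptions rather than the mechanism being implemented: the class name `PushFailoverController`, the threshold of three failures, and the choice to leave failback as a manual step are all hypothetical. What counts as a failed check (for example, end-to-end delivery latency above 90s, or an instance at its HWM) is left to the caller.

```python
class PushFailoverController:
    """Switch push processing to a standby region after repeated failed checks."""

    def __init__(
        self,
        primary: str = "us-east-1",
        standby: str = "eu-west-1",
        failure_threshold: int = 3,
    ) -> None:
        self.primary = primary
        self.standby = standby
        self.failure_threshold = failure_threshold
        self.active = primary
        self._consecutive_failures = 0

    def record_health_check(self, primary_healthy: bool) -> str:
        """Record one health check of the primary; return the active region."""
        if primary_healthy:
            self._consecutive_failures = 0
        else:
            self._consecutive_failures += 1
            if self._consecutive_failures >= self.failure_threshold:
                # Fail over. Failing back is deliberately not automatic here.
                self.active = self.standby
        return self.active


controller = PushFailoverController()
history = [controller.record_health_check(ok) for ok in (True, False, False, False, True)]
print(history)
# ['us-east-1', 'us-east-1', 'us-east-1', 'eu-west-1', 'eu-west-1']
```

Requiring several consecutive failures avoids failing over on a single transient error, at the cost of a short detection delay that is still far smaller than the time taken to notice and act on the alerts manually in this incident.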
Reactor Queues are inherently vulnerable as queues are pinned to a specific region

As noted above, user queues provisioned in us-east-1 are unavoidably pinned to the us-east-1 region and NLBs mediating access for consumers are a Single Point of Failure (SPOF) of that service. For this reason, we do not provide the same SLA for Reactor Queues as we do for the main Ably messaging service.
We are currently undertaking a review of the Reactor Queues element of the Ably service, to ensure that we can provide a service that meets its primary use-case - as a simple and accessible queue service usable in conjunction with Ably for non-critical workloads - but do so with a level of availability and reliability that meets user expectations.
Incident response

Alert overload clouded incident response. It is rare that a single cause triggers failures for a large proportion of traffic, or failures across multiple distinct clusters and systems. On this occasion, the fact that problems were triggered across multiple environments and services meant that there was a significant volume of alerts, and no single dashboard to assist in visualising the overall status of the service.
At the onset of the incident we were forewarned of the issue because a small number of NLBs failed at 19:06, and so manually-triggered scaling had been initiated across all affected clusters by 19:19, and this mitigated effects for those clusters that failed around 19:20. But this was fortuitous - had those responses depended on an analysis of individual alerts from all services and clusters, the response would have been slower.
Several remedial actions are underway, and further actions are planned, to assist with this issue in the future:

a new system of alert management, which was already nearing completion, is now being rolled out, and this will help to classify and prioritise ongoing alerts;
additional dashboards with certain critical system-wide metrics are being implemented, covering all services and environments;
an improved protocol has been defined for incident response, with specific attention being paid to ensuring that visibility is maintained of the overall status of systems and services and all actions are prioritised in that context.

Conclusion

The AWS NLB outage in us-east-1 was unparalleled in our experience in multiple ways: the abruptness of the failure, the extent of the failure (ie affecting all service in 3 AZs), and the time to resolution by AWS. The incident directly impacted connectivity for a significant fraction of all connections, across virtually all customer accounts. Nonetheless, the affected clients experienced only transient disconnection and the core Ably service was continuously available for all but a small handful of accounts; our ability to maintain a global service that is resilient to these events is a key element of the Ably service proposition, and the technical and operational provisions that exist to accommodate this kind of failure were effective.
That said, our analysis of the incident shows that there are areas in which we can improve. For the core Ably service, these relate to ensuring adequate provision of capacity, in terms of the available regions and capacity margin, and to more graceful handling of the load spikes that arose. Further issues have been identified that will improve the resilience of the push messaging element of the service. We will also use this experience to improve how we prepare for, and handle, incidents of this type.
The Reactor Queues service was the area that was adversely impacted in a significant way; to an extent that service is always going to be vulnerable to single-region failures, but there are nonetheless improvements we can make to monitoring and alerting that would have helped in this specific instance. We are separately working with those customers that were affected to rectify the impact of the incident, and help architect a solution that reduces the exposure that is inherent in that service.
We take service continuity very seriously, and this incident represents another opportunity to learn and improve the level of service we are committed to providing for our customers. We are sorry for the disruption that was experienced by some customers, and are committed to identifying and implementing remedial actions so that we do not have a recurrence of any of the issues that impacted our response.
References