Summary of incidents affecting the Ably production systems 27-28 February 2020 - investigation and conclusions
There was a series of incidents affecting multiple Ably production clusters, starting on 27 February and ending on 29 February. The disruption was in the Cassandra database layer that is common to all production environments, and the effects were seen across multiple production clusters, including the main production system and some dedicated clusters.
The incidents all manifested as a sustained overload condition in multiple Cassandra instances in multiple regions; this meant that a fraction of Cassandra queries failed, and these failures led to failed requests and connections for the main Ably service endpoints. In the majority of cases the incidents were short-lived and recovered without intervention, but nonetheless there was appreciable disruption for multiple customers during those times.
At the present state of the investigation we now have a good understanding of the factors that led to the incidents, and also of how the characteristics of the system, and the way we managed the incidents, affected the severity of the disruption experienced by end-users.
Ably operates a number of production clusters for its service around the globe. There is a main production cluster that services the majority of customer accounts, plus a small number of dedicated clusters for specific accounts. Each of the clusters has a presence in multiple regions in a globally federated system. These clusters in turn all share a globally federated Cassandra database for persistence of operationally critical data.
The Cassandra cluster operates in three regions globally, and each region has multiple instances. Data is replicated with a strategy that places multiple copies of data within every region, so it can continue to operate with minimal latency when there is minor disruption (affecting a minority of instances in a single region), or significant disruption (affecting most or all instances in one region). Loss of multiple regions would permit read operations to continue, but write operations would not be possible if using a quorum consistency level. However, the operating assumption is that regions operate, and fail, independently, and so the likelihood of simultaneous failures in multiple regions is low.
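As an illustration of the quorum arithmetic involved, the following sketch uses assumed replication factors (three replicas per region; the actual settings are not stated in this report) to show why losing one region still permits quorum writes, but losing two does not:

```python
# Illustrative quorum arithmetic for a 3-region Cassandra cluster.
# Region names and replica counts are assumptions for the example only.

def quorum(replication_factor: int) -> int:
    # Cassandra's QUORUM consistency requires a majority of all replicas.
    return replication_factor // 2 + 1

regions = {"us-east": 3, "eu-west": 3, "ap-south": 3}  # replicas per region (assumed)
total_rf = sum(regions.values())                        # 9 replicas in total

q = quorum(total_rf)
print(f"QUORUM needs {q} of {total_rf} replicas")       # prints: QUORUM needs 5 of 9 replicas

# Losing one whole region leaves 6 live replicas: quorum is still reachable.
assert total_rf - regions["eu-west"] >= q

# Losing two regions leaves only 3 live replicas: quorum writes fail,
# though reads at a lower consistency level (e.g. ONE) can still be served.
assert total_rf - regions["eu-west"] - regions["ap-south"] < q
```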
The Cassandra cluster is designed for continuous operation, and any number of instances can fail, either transiently or with irrecoverable local data loss, without impacting the availability of the database service, provided sufficiently many replicas of each partition remain. However, snapshots are also taken continuously so that instances can be restored: these accelerate the recreation of instances that have suffered a local loss of data, and can restore service, with the loss of some recent updates, in the ultimate catastrophic case of a global loss of data.
The Cassandra system is elastic to some degree - changes in cluster size can be performed dynamically by administrative intervention, but the redistribution of persisted data takes some time after a cluster has been resized, and during this time the system is more vulnerable from an availability standpoint. There are also constraints on the degree to which multiple instances can be added or removed in parallel.
The main Ably system establishes connections across the Cassandra cluster and tracks the responsiveness of each instance individually, and each region in aggregate, in order to decide how best to service queries. Regions that are unable to maintain a minimum number of responsive connections are not used, so the healthiest and nearest Cassandra region is used at all times. If the number of viable connections falls below another threshold, then the system falls back to using all available connections, in order to try to preserve as much service as possible with a degraded database service.
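The selection logic described above can be sketched as follows; the threshold values and the shape of the health data are assumptions for illustration, not the actual values or metrics used by the service:

```python
# Sketch of healthiest-nearest region selection with a degraded-mode fallback.
# MIN_HEALTHY is an assumed threshold, not the real production value.

MIN_HEALTHY = 3  # responsive connections needed for a region to be viable (assumed)

def pick_connections(regions: dict, nearest_order: list) -> list:
    """regions maps region name -> number of responsive connections;
    nearest_order lists regions from nearest to farthest."""
    viable = [r for r in nearest_order if regions.get(r, 0) >= MIN_HEALTHY]
    if viable:
        return [viable[0]]  # use the healthiest, nearest viable region
    # Degraded mode: too few viable connections anywhere, so fall back to
    # every connection that still responds at all, to preserve some service.
    return [r for r, n in regions.items() if n > 0]

# Nearest region healthy: it is used exclusively.
assert pick_connections({"us-east": 5, "eu-west": 4}, ["us-east", "eu-west"]) == ["us-east"]
# Nearest region degraded: fail over to the next viable region.
assert pick_connections({"us-east": 1, "eu-west": 4}, ["us-east", "eu-west"]) == ["eu-west"]
# No viable region: use whatever connections remain.
assert pick_connections({"us-east": 1, "eu-west": 0}, ["us-east", "eu-west"]) == ["us-east"]
```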
As it happened
In the description below all times are in UTC.
There were several episodes of disrupted operation:
- 2020-02-27 11:01-12:56
- 2020-02-27 21:22-21:32
- 2020-02-28 06:43-07:12
- 2020-02-28 10:44-11:12
- 2020-02-28 21:22-21:32
- 2020-02-29 21:22-21:32
This incident started with multiple Cassandra instances across all regions experiencing a rapid load increase, hitting 100% CPU and remaining at that level. Progressively these instances failed to respond to queries, which meant that fewer connections were available to service queries from the main Ably system. A proportion of queries were failing from 11:01, and this led to multiple operations failing in the Ably system, with connection attempts and REST requests failing with a 5XX response.
Our usual response to a sustained overload condition in a Cassandra instance is to suspend scheduling of periodic repairs and to progressively restart instances that appear unable to recover without intervention. However, restarted instances then either resumed but continued to be overloaded, or exited abruptly during startup. Nonetheless, sufficiently many instances were reinstated that operation of the cluster resumed, and by 11:24 error rates were returning to normal. However, load across the cluster was still at maximum for a number of instances.
With no sign of the cluster returning to normal load levels, a decision was made at 11:34 to rebuild the problematic nodes. In the course of doing this, the number of instances out of service meant that queries were unable to reach quorum on a small fraction of partitions in the cluster. This led to some update queries returning errors for an extended period; all instances did not finally reenter service until 12:56, at which point error rates returned to zero.
At 21:22, CPU load increased to 100% for two instances in each Cassandra region. This led to increases in query latency, to the extent that multiple connections from the Ably system to Cassandra were dropped, and queries on the remaining connections were frequently timing out with query coordinator timeouts. This resulted in a higher rate of errors being returned from Ably API endpoints.
From 21:30 the load started to drop, and by 21:32 all connections were restored and query latency returned to normal. CPU load dropped back to normal shortly afterwards.
There was no intervention during the incident.
2020-02-28 Database maintenance
Although there was no clear indication of the cause of the overload, it was decided that the capacity of the cluster would be increased by scaling horizontally, increasing the number of instances in all regions. This process was started by adding a single instance to each of the regions early on the morning of 2020-02-28.
From 06:43 error rates started to climb, and it was evident that several of the instances were again overloaded. Again, with no clear evidence of the cause of the problem, existing rebuilds were allowed to continue, and overloaded instances were restarted. Multiple instances failed by exiting during the restart process; those that would not restart were this time restored from snapshots. However, with the reduction in capacity from both the rebuilding instances and the instances being restored, there were elevated rates of query errors in the cluster, and corresponding errors in the Ably service API, until 07:12.
The cause of the crash during instance restart was investigated and found to be a version incompatibility between the Java runtime version and the Cassandra version. This incompatibility had been introduced during a maintenance upgrade in December 2019, and manifested during restart of an instance when replaying a commit log in a particular state. All other operations were seemingly unaffected by the problem, and the regular testing of snapshot restoration had not detected it.
Although at this time there was no concrete evidence that the incompatibility was the cause of the other disruption, we started to look for evidence of this.
Further steps were taken to complete the introduction of the planned capacity increase.
Again at 10:44 we saw a rapid rise in load on multiple instances across all regions - specifically, two instances in each region. Because of the ongoing capacity changes, with several instances still rebuilding, the redundancy margin was depleted, and the Ably system started to experience higher latencies, with some queries timing out. Instances that did not recover autonomously were restarted. Error rates returned to normal at 11:12.
Investigation of the logs from the 10:44 disruption established a correlation between some specific rebuild operations (which involved transferring data from existing nodes to new nodes) and the onset of the disruption. Some very large partitions were being copied, which correlated with CPU and disk activity spikes leading up to, and at the time of, the onset of the disruption.
The cause of the unexpectedly large partitions was investigated, and it was determined that the large partitions did not, for the most part, contain live data, but tombstones corresponding to time-expired entries. Further investigation is ongoing as to whether the routinely scheduled compaction and repair operations were as effective as expected.
At this point it was evident that the maintenance operations we were implementing, to increase capacity, were themselves having a detrimental effect, so the capacity increases were suspended.
Just as had happened the day before, at the same time (21:22) there was a spike in activity and load, and this caused some disruption for 10 minutes as a result of elevated query latencies. Again, it recovered without intervention.
However, it could now be established that there were multiple distinct causes of the load: some occurrences had been triggered by maintenance operations (or, at least, there was some body of evidence to suggest that), but the 21:22 incidents had not been caused by that.
In preparation for the same time the following day, the following measures were put in place:
- the instances that had been affected by the load were upgraded to larger instance types, and their associated EBS volumes were increased in capacity;
- additional system event logging and query event logging were enabled.
Again there was a load spike at this time, and the additional capacity did not prevent query latencies increasing to the point that some operations failed with timeouts.
Finally, after having followed a few blind alleys, we had the information necessary to determine the underlying cause of the incidents. In fact the cause was very simple: a single application had scheduled operations to occur on a large fleet of devices simultaneously, and the specific API being called was not rate-limited, which resulted in an unbounded query rate being presented to the Cassandra layer. When those requests timed out, they were retried by the application, magnifying the load. During the incidents, the combination of a lack of rate-limiting and the retries was causing a 200x magnification of the request rate.
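A back-of-envelope model shows how a fixed-interval retry loop produces this kind of magnification; the retry interval and outage duration below are illustrative, chosen only to show how a figure of roughly 200x can arise:

```python
# Simple model of retry magnification: during an outage, each failed request
# is retried on a fixed interval without backoff, so every client keeps its
# original request "alive" as a stream of retries. Numbers are illustrative.

def magnification(retry_interval_s: float, outage_s: float) -> float:
    # One original request, plus one retry per interval for the outage duration.
    return 1 + outage_s / retry_interval_s

# e.g. a 3-second retry interval sustained over a 10-minute overload:
print(magnification(3, 600))  # prints 201.0, i.e. ~200x the intended rate
```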
Root causes
Instance recovery crash. As had been identified in an earlier investigation, the restart crash was the result of an incompatibility between the Java runtime version and the Cassandra version. This is now resolved.
Rate limit checks omitted on certain APIs. Some APIs related to the push notification system had instance-wide rate limits implemented, but not global (account-wide) instantaneous rate limits. We had a system of global aggregation of event rates for each account, and when limits are met this gives rise to suppression of events; however, this suppression was not happening for the push APIs in question. The reason it had not been implemented was a development flow problem: rate-limiting changes had been made on one branch while work was also ongoing on the push API in another branch, and although both branches merged cleanly, the fact that some of the required changes had been omitted as a result was not noticed.
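A common way to implement an account-wide instantaneous limit of this kind is a token bucket; the sketch below is a generic illustration, not Ably's implementation, and the parameter values are assumed:

```python
import time

# Minimal token-bucket rate limiter of the kind that was missing from the
# push APIs. Rate and burst values here are illustrative, not real limits.

class TokenBucket:
    def __init__(self, rate_per_s: float, burst: float):
        self.rate = rate_per_s      # tokens replenished per second
        self.capacity = burst       # maximum burst allowance
        self.tokens = burst
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # caller would return HTTP 429 to the client

bucket = TokenBucket(rate_per_s=50, burst=10)
results = [bucket.allow() for _ in range(100)]
# A sudden burst of 100 requests: roughly the burst allowance is admitted,
# and the remainder are rejected until tokens refill.
assert results.count(True) >= 10 and results.count(False) > 0
```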
Use of a counter column in one Cassandra table. Although we had been monitoring query rates at the CQL protocol level, and we had been capturing internal Cassandra stats, we had not seen that rates for some specific operations involving a table that contained a counter column were very high, and correlated with the load spike. We have not yet quantified the specific impact, but we believe there is a very significant disparity in load between counter column operations and regular upserts. The affected features are now being rearchitected to avoid the use of the counter column.
Inappropriate retry strategy. In this case, retries that occur as a result of a 5xx error response (eg when there is a timeout) are simple periodic retries in the Ably libraries, without any backoff. This is clearly a source of request magnification in situations like this. The library specification will be updated so that a retry backoff strategy is required.
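One widely used scheme that satisfies such a requirement is exponential backoff with full jitter; this sketch is illustrative only, and the constants are not taken from the Ably library specification:

```python
import random

# Exponential backoff with full jitter: the delay envelope doubles on each
# attempt up to a cap, and the actual delay is drawn uniformly from that
# envelope so that retries from many clients do not synchronise.
# Base and cap values are assumed for illustration.

def backoff_delay(attempt: int, base: float = 0.25, cap: float = 30.0) -> float:
    """Delay in seconds before retry number `attempt` (0-based)."""
    return random.uniform(0, min(cap, base * 2 ** attempt))

delays = [backoff_delay(n) for n in range(8)]
# Every delay is bounded by the growing, capped envelope.
assert all(0 <= d <= min(30.0, 0.25 * 2 ** n) for n, d in enumerate(delays))
```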
It is also the case that after the rate limit was imposed, and a 429 response was returned to the client, there were still aggressive retries without backoff. With a 4xx response the library itself won't retry, but it is evident that the application was retrying without a backoff strategy. Explicit advice should be given to apps to avoid doing this in the future.
Global vulnerability. Having Cassandra in multiple regions, with all data replicated to all regions, did not prevent a single trigger from causing disruption for clients in every region. This is precisely because a single partition, if updated at a high rate, will require updates to be made in all regions, on the specific instances on which that partition is placed. It is unfortunate that having read/write replicas means that a sufficiently intensive write operation will affect all copies; increasing the level of redundancy (ie increasing the number of replicas of any given partition) would in fact make the situation worse. The way to address this phenomenon in Cassandra is to scale horizontally, so that affected instances represent a smaller fraction of the overall partition/hash space.
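The effect of horizontal scaling on this kind of hot-partition problem can be seen with simple arithmetic; the replica counts below are illustrative:

```python
# A hot partition lands on a fixed number of replica instances, so the
# fraction of the fleet it can saturate shrinks as the cluster grows.
# Replica and instance counts are assumptions for the example.

def affected_fraction(replicas_per_partition: int, instances: int) -> float:
    return replicas_per_partition / instances

# 9 replicas of a hot partition in an 18-instance cluster: half the fleet
# can be saturated by writes to that one partition.
assert affected_fraction(9, 18) == 0.5
# Doubling the cluster halves the blast radius of the same hot partition.
assert affected_fraction(9, 36) == 0.25
```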
Fallback consistency demands by clients. During one of the incidents, clients were experiencing failed operations because the cluster was unable to achieve quorum for writes for a fraction of the partition/hash space. However, there were still multiple copies of the data available, and updates were still successfully being made to half of the replicas, but the number of writes achieved fell just short of the requirement for quorum.
A change has been implemented that will permit the system to reattempt an operation with a lower consistency requirement, so that it is possible to continue with a degraded level of service when the Cassandra service is compromised in this way. This change will be rolled out shortly.
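The behaviour of such a fallback can be sketched as follows; `execute`, `UnavailableError`, and the consistency names are hypothetical stand-ins for a driver interface, not the real implementation:

```python
# Consistency-fallback sketch: attempt the write at QUORUM and, if quorum is
# unachievable, retry once at a lower consistency level so that some service
# continues with reduced durability guarantees.

QUORUM, ONE = "QUORUM", "ONE"

class UnavailableError(Exception):
    """Raised when too few replicas are alive to satisfy the consistency level."""

def write_with_fallback(execute, statement):
    try:
        return execute(statement, consistency=QUORUM)
    except UnavailableError:
        # Not enough replicas for quorum: accept a weaker guarantee rather
        # than failing the operation outright.
        return execute(statement, consistency=ONE)

# Stub simulating a cluster that cannot reach quorum but can write to one replica.
def flaky_execute(statement, consistency):
    if consistency == QUORUM:
        raise UnavailableError("cannot achieve quorum")
    return f"written at {consistency}"

assert write_with_fallback(flaky_execute, "UPDATE ...") == "written at ONE"
```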
Monitoring, operations and incident handling
Lack of visibility into the level of activity in Cassandra. We had been capturing a wide range of internal metrics but these stats were not sufficiently visible, so we did not see the underlying causes of the load as early as we might have done. This has been rectified by adding the requisite additional charts and alerts to our monitoring.
Lack of visibility into the source of particular traffic. We log all API requests along with basic information about the requestor, in order that we can investigate anomalies and get visibility in real time of the origin of certain traffic patterns. We are implementing changes so that we are able to identify specific apps as well as requestors.
Untested maintenance/reinstatement operations. We ought to have known at an earlier stage that there was an incompatibility in the Cassandra/JVM combination. The issue only manifested during replays of certain commit logs, but routine testing of that kind of operation (eg using Chaos Monkey or similar) could have surfaced the issue before it caused disruption.
Compaction and repairs were regularly run, but effectiveness was not monitored. There were some very large partitions in the system that predominantly contained expired data; if routine maintenance operations had been effective, this should not have been the case. We have monitoring to ensure repairs are running, but no way to verify the effectiveness of those repairs. We are continuing to look into how this might be improved.
Conclusion
This was an unfortunate series of incidents that exposed a number of shortcomings in the Cassandra layer, and in our operations associated with that layer. Ultimately the principal cause of disruption was very straightforward and entirely typical - an unsustainable load, unrestricted by rate-limiting, and then magnified by retries. However, some operational limitations that reduced our visibility, together with some blind alleys that came about from our attempts to react, meant that it took us much longer to recognise what was happening.
We are actively addressing the underlying technical limitations that have been identified in the investigation. Multiple changes have been deployed already, and further changes are being made.
We will continue to address the operational issues that have been highlighted by our experience managing these incidents; we are sorry for the disruption they caused, and we will work on the issues we have identified to avoid the same kind of disruption in the future.
Please get in touch if you have any questions regarding these incidents.