@paddybyers
Created January 8, 2019 21:28
Summary of incident affecting the Ably production cluster 3 January 2019 - preliminary investigation and conclusions

Overview

See the incident on our status site at https://status.ably.io/incidents/574

There was an incident affecting the Ably production cluster on 3 January that caused a significant level of disruption for multiple accounts, primarily in the us-east-1 region.

The incident was most severe for a 2-hour period between 1520 and 1720 (all times in UTC) in which there were elevated error rates across all regions, with the greatest impact in us-east-1. During that time, channel attach error rates averaged nearly 50%, and message publish error rates averaged around 42%. For a further period of 2½ hours, as the system recovered, error rates remained elevated; over the entire period of the incident, channel attach error rates averaged around 33% and message publish error rates averaged 22%.

This was a serious incident affecting service for many accounts in the us-east region; we are sorry for the disruption that this has caused our customers, and we are determined to do everything we can to learn lessons from the incident to ensure that we avoid similar situations in the future. Our commercial team will be liaising with our customers in the coming days to ensure we take the necessary remedial steps in line with our SLAs.

The specific event that triggered the incident was the release of Bundler v2.0.0 at 0600. As a result of an unpinned dependency (more details below), this caused all launches of new instances into the Ably cluster to fail, and that went unnoticed until the us-east load grew to a level that triggered failures in the unexpanded cluster at 1520. We rectified the issue and were able to bring on new instances from 1537, but secondary issues arising from the overloaded state meant that the cluster failed to stabilise with the arrival of new capacity. A prolonged fault diagnosis was required to determine the necessary corrective action but, once this was decided, manual repair was carried out and the system regained stability.

The specific issues that caused the extended disruption are still being investigated. Some bugfixes and enhancements have been implemented as a result of the investigations so far, and others are anticipated as we develop a fuller understanding of what happened. Improvements to monitoring, alerting and visualisation, and to operational processes, are also being planned based on lessons learned.

Background

We operate a main production cluster of our realtime message platform in multiple AWS regions globally. Instances, when brought into service, are bootstrapped by installing and starting the relevant services, and the code itself is encapsulated in binary slugs (derived from Heroku buildpacks), downloaded onto vanilla AMIs based on a configuration of the versions/shas of every component. This arrangement is intended to ensure that every part of the instance setup is (a) based on a frozen configuration and (b) obtainable entirely from services whose configuration and availability are under our control. The services we run are broadly all encapsulated and configured using the same mechanism, across multiple systems and languages including Ruby, Go, nodejs, Elixir.
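
To illustrate the "frozen configuration" idea only (this is not Ably's actual tooling; the manifest format and function names are hypothetical), a bootstrap step of this kind might verify each downloaded slug against the pinned configuration before installing it:

```python
# Hypothetical sketch: verify that each downloaded slug matches the sha256
# recorded in a frozen bootstrap manifest before it is installed.
import hashlib
import json

def verify_slug(slug_path, expected_sha256):
    """Return True if the file at slug_path has the expected sha256 digest."""
    digest = hashlib.sha256()
    with open(slug_path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_sha256

def verify_manifest(manifest_path):
    """manifest.json maps component name -> {"path": ..., "sha256": ...} (illustrative format)."""
    with open(manifest_path) as f:
        manifest = json.load(f)
    failures = [name for name, entry in manifest.items()
                if not verify_slug(entry["path"], entry["sha256"])]
    if failures:
        raise RuntimeError("slug verification failed for: " + ", ".join(failures))
```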

In the messaging system, there are two principal layers: the frontend layer is responsible for terminating client connections, and processing requests against the REST API; the core layer is responsible for channel message processing, persistence and routing. A discovery layer also exists, responsible for compiling a composite picture of the entire cluster state.

We monitor the health of instances by two primary mechanisms. There is a local health server process, which monitors various instance metrics, queries health endpoints of other locally-running processes, and generates health-related metrics and various alerts and notifications when instances (or any processes on them) are unhealthy. There are also external monitors, including Ably-managed services and monitoring services operated by an external service provider, that monitor critical system health indicators through a combination of explicit health checks against public endpoints and monitoring of metrics, both for trends and against specific thresholds.
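
As a rough illustration only (the endpoint paths, ports and alerting hook below are assumptions, not the real health server), the local health-check loop behaves along these lines:

```python
# Illustrative sketch of a local health-check loop: poll health endpoints of
# locally-running processes and raise an alert when any is unreachable or unhealthy.
import time
import urllib.request
import urllib.error

LOCAL_HEALTH_ENDPOINTS = {
    "frontend": "http://127.0.0.1:8081/health",   # hypothetical port/path
    "core":     "http://127.0.0.1:8082/health",   # hypothetical port/path
}

def check_once(timeout=2.0):
    unhealthy = []
    for name, url in LOCAL_HEALTH_ENDPOINTS.items():
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                if resp.status != 200:
                    unhealthy.append((name, f"HTTP {resp.status}"))
        except (urllib.error.URLError, OSError) as exc:
            unhealthy.append((name, str(exc)))
    return unhealthy

def run(interval=10.0):
    while True:
        for name, reason in check_once():
            # In the real system this would generate an alert/notification.
            print(f"ALERT: {name} unhealthy: {reason}")
        time.sleep(interval)
```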

During normal operations the cluster scales autonomously in response to daily load variations, based on cluster load metrics (CPU, memory). However, in certain regions (notably us-east-1 and us-west-1) there is a predictable and significant surge in load coinciding with the start of the work day in those timezones, and we explicitly scale the cluster ahead of time to more readily handle that morning surge.

When the system scales, certain roles (eg the processing in a region for a given channel) relocate, in order to maintain a balance of load across the resized system. Consistent hashing is used as the basis for placement and relocation of roles, and this depends on there being a consensus across the cluster (or, at least, a group consensus among all participants in a given channel) as to the cluster state. A consensus formation process ensures that a consistent view of cluster state is reestablished after any change. Scaling operations, and the resultant relocation, themselves generate load, and so the system operates with a certain margin of capacity so that scaling operations can be performed when required.
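
For readers unfamiliar with the technique, a minimal consistent-hashing sketch (not Ably's implementation) shows why role placement depends on a shared view of membership: two nodes compute the same owner for a channel only if they agree on the ring.

```python
# Minimal consistent-hashing sketch: place each instance at several points on a
# hash ring and assign a channel to the first instance clockwise of its hash.
import bisect
import hashlib

def _hash(key):
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, instances, vnodes=64):
        # Several virtual nodes per instance smooth the load balance.
        self._points = sorted(
            (_hash(f"{inst}:{i}"), inst)
            for inst in instances
            for i in range(vnodes)
        )
        self._keys = [p for p, _ in self._points]

    def owner(self, channel):
        """Instance responsible for processing this channel."""
        idx = bisect.bisect(self._keys, _hash(channel)) % len(self._points)
        return self._points[idx][1]

# Two nodes that disagree about membership may compute different owners:
ring_a = Ring(["i-01", "i-02", "i-03"])
ring_b = Ring(["i-01", "i-02", "i-03", "i-04"])   # e.g. sees a newly joined core instance
print(ring_a.owner("channel:orders"), ring_b.owner("channel:orders"))
```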

As it happened

In the description below all times are in UTC.

The pre-scheduled scale-up in the us-east and us-west regions resulted in a number of instances being created but failing to bootstrap; the bundler change meant that no Ruby-based services, including the health server, became operational. As a result, none of the failing instances were able to generate any instance-based health notifications. External health monitors did not indicate an issue because the cluster was itself still healthy, and we did not have any health metrics that indicated a disparity between the number of instances in the scaling group and the number of instances operational in the cluster.

By 1200, as load increased in the us regions, further autoscaling was triggered, which spawned yet more non-functional instances. Again, this went unnoticed at the time.

At 1518 we had the first alert arising from adverse load conditions in us-east-1. By 1520 there were multiple additional alerts indicating increased latencies. Investigation of these alerts showed that the cluster by this time was seriously under-resourced, and demand in the region was still rising sharply.

By 1532 we had determined the cause of the problem and decided that a new AMI was required to work around it; we made the necessary changes, built the AMI and updated the cluster configuration. Newly scaled instances adopted that configuration and were successfully bootstrapping and joining the cluster by 1537.

In the meantime, in the period from 1520, external error rates were climbing, and by 1523 the load on the frontend layer was causing a high number of errors across all instances in the region.

In the period from 1539 to 1600 frontend instances progressively came online, and load reduced dramatically. We expected that the incident would effectively be resolved at this point, as the system recovered. As the frontend layer recovered, however, the capacity deficit hit the core layer, and multiple core instances came online from 1600, triggered as usual by autoscaling metrics.

This scaling, however, failed to reduce the error rate appreciably. The underlying reasons for this are still being investigated in detail, but the cause appears to have been a persistent failure to reach consensus on the cluster state as those new core instances joined. The resulting inconsistency caused ongoing errors.

We investigated the cause of the problem, which was primarily an erroneous state in us-east but also had some ongoing adverse impact on the other regions. Our tooling was able to tell us that there were inconsistencies, but we did not know the extent of the problem, or the specific inconsistencies causing that condition. Because we could not predict how long the errors would persist in us-east, we prepared to transfer us-east traffic to us-west, so we also started scaling that region. In parallel, hesitant to perform a widespread untargeted redeployment of multiple instances, we worked to understand where the cluster inconsistency was.

While this was happening we selectively redeployed individual instances that were exhibiting issues from the earlier overload event, and as a result of these actions the error rate had dropped substantially by 1713, with most accounts then operating normally. The residual cluster inconsistency, however, continued to affect some accounts (depending on the specific placement in the cluster of roles associated with those apps and accounts).

At 1800 we had conclusively determined the source of the cluster inconsistency, and we systematically redeployed all instances in the service discovery layer to address the problem. This process was ongoing until 1835, by which point most services were back to normal. We continued to see above-normal error levels as residual problems were evident in various instances that had not been detected and repaired automatically. After a number of investigations and manual interventions the service was fully restored by 1902.

Investigation

The causes of the incident were in two distinct parts: the factors that led to the unpinned bundler dependency in production, and the failure of the realtime system to recover autonomously from the overload situation that resulted.

Bundler dependency

As stated above, the intent of our architecture is that there is no software running in any production system whose configuration, or availability, is outside our control. However, a few months ago, whilst we were rolling out new AMIs and performing a number of upgrades across the board, we ran into an incompatibility between Heroku buildpacks and Ruby 2.5. At the time, we decided to implement a short-term workaround whereby bundler was installed from the public repository at bootstrap time, and an issue was raised to resolve the problem properly down the line.

Creating this dependency on the third-party repository was a very poor decision for multiple reasons:

  • we mistakenly did not pin the bundler installed by the hotfix to a particular version (a version-check sketch follows this list);

  • we compounded the problem by later building AMIs that incorporated this version of the manager, which could not start up unless the bundler gem command executed. This meant that the bundler problem not only broke our manager and health server, but effectively bricked the AMIs as well;

  • this is a huge security risk as every booted instance could be compromised by a compromised Rubygems service;

  • this specific solution was avoidable; there were many alternative approaches we could have adopted, such as simply not upgrading, or forking the buildpacks and fixing the underlying issue there.
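
As a purely illustrative sketch of the kind of guard that would have caught the unpinned dependency (the pinned version and the idea of running such a check from a bootstrap script are assumptions, not our actual setup):

```python
# Illustrative bootstrap-time guard: refuse to proceed unless the installed
# bundler matches the pinned version.
import subprocess

PINNED_BUNDLER_VERSION = "1.17.3"   # hypothetical pinned version

def assert_bundler_pinned():
    out = subprocess.run(
        ["bundle", "--version"], capture_output=True, text=True, check=True
    ).stdout.strip()
    # `bundle --version` prints e.g. "Bundler version 1.17.3"
    installed = out.split()[-1]
    if installed != PINNED_BUNDLER_VERSION:
        raise RuntimeError(
            f"bundler {installed} installed, expected pinned {PINNED_BUNDLER_VERSION}"
        )
```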

This was a failure to exercise the required discipline in this aspect of infrastructure management, plain and simple. We knew at the time that we were violating our own principles with the change, but were implicitly operating under two invalid assumptions:

  • the issue was only very short-term, because we would fix the issue properly very soon;

  • the risk of a failure in the meantime was negligible.

In the event, circumstances led to both assumptions being incorrect with the results we have seen. All instances across all clusters have now been updated to eliminate this issue.

The other infrastructure shortfall, immediately linked to this, was the failure to detect the disparity between the number of instances in the scaling group(s) and in the cluster. We were aware of that potential issue, but had believed that the level of control we had in the environment meant that the risk of instances failing to launch and bootstrap, undetected by the various health monitors, was very low. That assumption turned out to be wrong, in that the level of control was not what we had expected; but nonetheless there should be a way of detecting the failure to launch, irrespective of cause. We have now deployed additional monitoring to detect and alert us to any sustained disparity in instance counts.
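
A minimal sketch of that kind of disparity check, assuming boto3 and a hypothetical cluster_instance_count() source for cluster membership (the real monitoring is not shown here):

```python
# Illustrative sketch: alert when the number of in-service instances in the
# scaling group exceeds the number of instances that have joined the cluster.
import boto3

def scaling_group_instance_count(asg_name):
    client = boto3.client("autoscaling")
    resp = client.describe_auto_scaling_groups(AutoScalingGroupNames=[asg_name])
    group = resp["AutoScalingGroups"][0]
    return len([i for i in group["Instances"]
                if i["LifecycleState"] == "InService"])

def check_disparity(asg_name, cluster_instance_count, tolerance=1):
    """cluster_instance_count: callable returning the number of instances that
    have actually joined the cluster (hypothetical; e.g. from the discovery layer)."""
    asg_count = scaling_group_instance_count(asg_name)
    cluster_count = cluster_instance_count()
    if asg_count - cluster_count > tolerance:
        raise RuntimeError(
            f"{asg_name}: {asg_count} instances in service but only "
            f"{cluster_count} joined the cluster"
        )
```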

Cluster instability/inconsistency

The cluster inconsistency was determined to be in the discovery layer, and this has been confirmed to have been caused, at least in part, by an exception at 1547 in one instance in us-east when a scaling event added a new core instance to the cluster. A sequence of occurrences of the same exception, across other nodes in the discovery layer, resulted in an approximately bimodal condition, where part of the cluster was self-consistent but not in agreement with the remainder, which was also self-consistent. The root cause of this exception is the loss of internal consistency in a specific data structure; the cause of that has not yet been resolved, but bugfixes have been made so that this condition will no longer cause an exception, and so that such an exception would no longer have the same wider impact. Investigation of this is ongoing.
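
One way to surface such a bimodal split, shown here only as an illustrative sketch (the views_by_node input and names are hypothetical), is to fingerprint each node's view of cluster membership and group nodes by that fingerprint:

```python
# Illustrative sketch: two large groups with different fingerprints indicate two
# self-consistent but disagreeing partitions of the cluster.
import hashlib
from collections import defaultdict

def view_fingerprint(members):
    """Stable fingerprint of one node's membership view (a list of instance ids)."""
    return hashlib.sha256("\n".join(sorted(members)).encode()).hexdigest()[:12]

def group_by_view(views_by_node):
    """views_by_node: dict of node id -> list of member ids that node believes are in the cluster."""
    groups = defaultdict(list)
    for node, members in views_by_node.items():
        groups[view_fingerprint(members)].append(node)
    return dict(groups)

# Example: nodes a/b agree with each other but not with c/d.
print(group_by_view({
    "a": ["a", "b", "c"], "b": ["a", "b", "c"],
    "c": ["a", "b", "c", "d"], "d": ["a", "b", "c", "d"],
}))
```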

Aside from these principal issues, our investigation has identified a number of bugs and other issues for which we have remedial actions.

Bugs/issues resolved

  • the health server had a race condition which could result in multiple instances of the same process being restarted when recovering from an error; this affected only a handful of nodes during the incident, and did not have any significant adverse effect, but added to the difficulty of getting a clear picture of the cluster state during the incident. This is one case of a broader issue with the recovery process relating to the handling of locks and the correct sequencing of recovery steps. This has been resolved;

  • there was a bug in a health monitor that determines the health of the cluster based on the consistency between an instance's own view of the cluster and that of other instances. This resulted in an exception and meant that we did not always have access to a reliable picture of the cluster consistency. This has been resolved;

  • some realtime nodes, if they failed to discover and connect to a discovery layer endpoint, would eventually give up rather than retrying until they succeeded. This prevented some fraction of the instances, post-recovery, from joining the cluster. These zombie instances had no effect on the error rates experienced by clients, but they did have an impact on cluster capacity, and generated significant noise in our logs and metrics systems, which made it harder to get a good picture of error rates and overall cluster health during the incident. This has been resolved (a retry/backoff sketch follows this list).
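
The retry sketch referenced above, illustrating the intended behaviour with capped, jittered exponential backoff (the names are assumptions, not the realtime node's actual code):

```python
# Illustrative sketch: keep retrying the discovery-layer connection with capped,
# jittered exponential backoff rather than giving up after a fixed number of attempts.
import random
import time

def connect_with_retry(connect, base=0.5, cap=30.0):
    """connect: callable that attempts the connection and raises OSError on failure."""
    attempt = 0
    while True:
        try:
            return connect()
        except OSError as exc:
            delay = min(cap, base * (2 ** attempt)) * random.uniform(0.5, 1.5)
            print(f"discovery connect failed ({exc}); retrying in {delay:.1f}s")
            time.sleep(delay)
            attempt += 1
```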

Monitoring, alerting and visualisation

  • When the cluster was continuing to experience errors due to lack of consensus on cluster state, we knew this was happening, but did not have sufficiently detailed information in real time to know the real extent of the problem. We had experienced this issue before, and had already deployed some additional metrics to help analyse it as a result of an earlier incident. However, those additional metrics were of limited value in this incident; the bimodal nature of the discrepancy meant that all instances were generating error metrics at similar levels, and those did not help to isolate specific problematic instances. Additional visualisation tools, already nearing readiness, will also help further in this kind of situation.

Incident handling procedures

  • We are looking at the actions we took in handling the incident to understand whether or not we should have acted differently. It is not possible to have a pre-prepared playbook for every eventuality, but there need to be clearly defined strategies for handling situations of different severities, and clear criteria, based on experience, of when a given strategy should be triggered.

  • The strategy of migrating traffic away from a problematic region is something we would normally adopt when it is evident that issues in a region are widespread and unlikely to resolve very quickly. However, in this incident that was not possible because the root issue (inability to bring instances online) prevented us from scaling the us-west region. Once capacity in us-east was restored and we were able to scale us-west, we did so, but by that time we had determined that directly recovering us-east was preferable.

  • Customers whose accounts have Active Traffic Management (ATM) use a custom domain name which enables us to migrate the traffic to an entirely independent cluster in the event of a significant incident. In this case this was also not possible, because we were initially unable to scale an alternate cluster. Once we had resolved the initial issue, we assumed that adding capacity to the main production cluster would resolve the issue quickly. When it became clear that the issues would not resolve quickly we should have revisited the decision as to whether or not to activate an alternate cluster.

Conclusion

The outage resulted from an unpinned dependency on bundler, which is a dependency of the Ruby-based services running on our instances. This was a serious shortcoming and a departure from the principles we know are important for reliable infrastructure operations. For as long as the bundler workaround was deployed we were exposed to the risk of an incident of this type, and that risk materialised. We're very frustrated that this action resulted in this incident, and we are determined not to allow this type of indiscipline to occur in the future.

Inability to scale regions, or an alternate cluster, impacted our ability to enact the usual responses to incidents.

We did respond quickly once the problem was evident, and rectified the root issue, but the system then experienced secondary issues as a result of being overloaded. These secondary issues are themselves the result of a bug that is not fully resolved, but is now understood and will not recur. From a situation of widespread disruption, resolution took a long time due to the inherent complexity of dealing with a large cluster, some bugs, and some operational inefficiencies.

Although our response was broadly the correct approach, it took too long to restore service fully. We have identified a range of bugs and other issues highlighted by this experience.

Statistics

For the period of peak disruption (1520-1720)

  • message failure rate: 42.7%
  • connection failure rate: 1.6%
  • channel failure rate: 49.8%
  • router http response failure rate: 37.7%

For the incident overall (1520-2000)

  • message failure rate: 21.7%
  • connection failure rate: 0.7%
  • channel failure rate: 33.1%
  • router http response failure rate: 26.6%

Support

Please get in touch if you have any questions regarding this incident.

https://www.ably.io
