Summary of the incident causing alert emails and other website notifications to stall, 23 April 2020 - investigation and conclusions
There was an incident affecting alert emails and other website notifications, impacting the service from 23 April to 27 April. The incident was due to background worker processes not running, caused by a version incompatibility issue following a dependency upgrade.
The Ably website acts as the marketing website for the company and also provides business logic, interactive dashboards and other functionality as part of the Ably product.
The website uses Sidekiq to perform various background tasks, including delivering emails, processing notifications from the realtime infrastructure, and notifying clients. Sidekiq also performs many tasks supporting the business functionality of the website. Sidekiq uses Redis as its data store.
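For readers unfamiliar with the pattern, a Sidekiq job is a plain Ruby class whose work is described by a perform method; calling perform_async enqueues the job in Redis for a worker process to pick up later. The sketch below illustrates the contract with a hypothetical job class (the names and a minimal inline stand-in for the Sidekiq module are ours, not Ably's actual code):

```ruby
# Minimal stand-in so this sketch runs without the sidekiq gem; in the
# real application this module comes from `require "sidekiq"`.
module Sidekiq
  module Worker
    def self.included(base)
      base.extend(ClassMethods)
    end

    module ClassMethods
      # In production, perform_async serialises its arguments to Redis
      # and a separate worker process runs the job later; here the job
      # runs inline purely to illustrate the contract.
      def perform_async(*args)
        new.perform(*args)
      end
    end
  end
end

# Hypothetical job class, illustrative only.
class AlertEmailJob
  include Sidekiq::Worker

  def perform(user_email, alert_type)
    # A real job would render and deliver an email here.
    "alert '#{alert_type}' emailed to #{user_email}"
  end
end

AlertEmailJob.perform_async("ops@example.com", "rate_limit")
```

The key property is that enqueueing is decoupled from execution: if no worker processes are running, perform_async still succeeds and the jobs simply accumulate in Redis, which is exactly what happened in this incident.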
As it happened
As part of our continuous efforts to keep our systems up to date, at 09:22 on 2020-04-23 (all times in UTC) we updated Sidekiq along with a few other dependencies. On this occasion the new version of Sidekiq, 6.0.0, required a newer version of Redis than the one supported by the cloud Redis service in use, Redis To Go: Redis To Go only offers Redis 3.2.12, whereas Sidekiq 6 requires 4.0.0 or later.
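The version check that failed here can be sketched as a simple boot-time guard. This is an illustrative reconstruction, not Ably's code: in a real application the reported version would come from the server itself (e.g. the redis_version field of the Redis INFO command), and the minimum below is the one Sidekiq 6 enforces.

```ruby
require "rubygems" # Gem::Version ships with Ruby's stdlib

# Minimum Redis version required by Sidekiq 6.
MIN_REDIS_VERSION = Gem::Version.new("4.0.0")

# Compare the server's reported version against the required minimum.
# Gem::Version handles multi-part versions correctly ("3.2.12" < "4.0.0").
def redis_compatible?(reported_version)
  Gem::Version.new(reported_version) >= MIN_REDIS_VERSION
end

redis_compatible?("3.2.12") # Redis To Go's version: incompatible
redis_compatible?("5.0.7")  # a Redis 4+ server: compatible
```

Sidekiq 6 performs essentially this check at startup and refuses to boot against an older server, which is why the workers never came up after the deploy.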
The compatibility issue prevented our background workers from starting up. Due to a lack of sufficient monitoring this went unnoticed by the team.
At 09:27 on 2020-04-27 we were alerted by our support teams to an issue, when it became evident that back office reports and synchronization jobs were not running. By 10:46 the problem had been identified and confirmed; the root cause was pinned down at 10:59 and a patch prepared to downgrade Sidekiq. After review, the fix was deployed to production at 12:04.
Once the workers were running again there was a backlog of thousands of jobs. Most of the backlogged notifications were rate limit and usage notifications from the core Ably realtime system; as these were processed, the majority gave rise to further jobs to send email alerts to service users. Many of the notifications processed during this period, and the resulting emails, contained stale information. The load induced by processing the backlog put the system under stress; at 14:59 a configuration change was made to reduce the level of concurrency in processing the backlog, to avoid complications from excessive load during the recovery.
The backlog of messages was eventually cleared, and system operation returned to normal by 18:00.
2020-04-23 09:22 - Deployed new version of the Ably website with the incompatible Sidekiq version
2020-04-27 09:27 - Errors reported by internal users of downstream systems
2020-04-27 10:46 - Problem identified and confirmed by checking Sidekiq dashboard and Heroku dyno metrics
2020-04-27 10:59 - Root cause identified as Sidekiq/Redis version incompatibility, patch prepared to downgrade the Sidekiq gem
2020-04-27 12:04 - After the patch was peer-reviewed, unit tested, integration tested and confirmed working in staging, the new version was promoted to production
2020-04-27 12:13 - Monitoring shows the Sidekiq workers are processing jobs again, but a new issue emerges: exception monitoring reveals that the database connection pools are being exhausted
2020-04-27 14:59 - Concurrency configuration on the Sidekiq worker reduced to decrease contention on the database connection pool. Increasing the database connection pool size was ruled out, as it would have required a code change that would take longer to test and deliver into production.
2020-04-27 18:00 - Monitoring shows all backlogged work has been cleared from the Sidekiq queues and operations are back to normal
We have completed the investigation into the technical issues that led to the incident, and the events of the subsequent incident response. We have a good understanding of factors that led to the incident, and those that contributed to its duration and impact.
System test, monitoring and operations
Failure to detect the version incompatibility prior to deployment. The changes to Sidekiq and its dependencies underwent the usual verification steps, including peer review and testing in CI. However, there were three gaps. First, the CI environment ran its own Redis internally, and that was Redis 4 rather than the Redis 3.2.12 in production, so there was no opportunity in CI to detect the incompatibility. Secondly, the majority of the CI tests for the queue processing functionality are unit tests of job-handling functionality which do not themselves depend on Sidekiq. Finally, although certain manual interactive checks were made before approving the change for release to production, these did not include any end-to-end testing of functionality that depended on the queue service. The remedial steps taken to address these issues are as follows.
CI infrastructure updated to ensure the same versions of all dependencies are used as in production. This addressed the Redis version disparity, but a similar issue was also identified with PostgreSQL.
CI test coverage expanded to include sufficient integration tests to exercise end-to-end job processing via Sidekiq.
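One way of expressing the version-parity remediation is to pin the backing services in the CI environment to the exact production versions, rather than whatever a default image provides. The fragment below is a hypothetical sketch (the filename, service names and PostgreSQL version are illustrative; the actual CI setup is not described in this report):

```yaml
# docker-compose.ci.yml (hypothetical): pin backing services to the
# exact versions running in production, not just a major version.
services:
  redis:
    image: redis:3.2.12    # match the Redis To Go version in production
  postgres:
    image: postgres:9.6    # illustrative: pin the production PostgreSQL version too
```

Pinning exact versions means any future dependency upgrade that drops support for a production service version fails loudly in CI instead of in production.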
Failure to detect that the Sidekiq workers were not running. Monitoring of the health of the Sidekiq workers has been added, covering both worker liveness and the latency of job processing. Alerts arising from those monitors are handled via the escalation and paging system used for alerts arising from the Ably realtime message processing system. Sidekiq’s built-in health endpoint does not expose the latency of the job queue, so a health endpoint has been added that includes latency monitoring.
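The logic behind such a latency check can be sketched as follows. In a real application the figure would come from Sidekiq's own API (Sidekiq::Queue#latency reports the seconds since the oldest job in the queue was enqueued); here the calculation is reproduced in plain Ruby, and the threshold is an assumed illustrative value:

```ruby
# Assumed alerting threshold: five minutes of queue latency.
LATENCY_THRESHOLD_SECONDS = 300

# Seconds since the oldest queued job was enqueued; an empty queue
# (nil) has zero latency. Mirrors Sidekiq::Queue#latency.
def queue_latency(oldest_enqueued_at, now = Time.now)
  oldest_enqueued_at ? now - oldest_enqueued_at : 0.0
end

# A queue is healthy if its latency is below the threshold. Note this
# also catches the "workers not running" case: jobs keep arriving, the
# oldest job gets older, and latency grows without bound.
def queue_healthy?(oldest_enqueued_at, now = Time.now)
  queue_latency(oldest_enqueued_at, now) < LATENCY_THRESHOLD_SECONDS
end

now = Time.now
queue_healthy?(now - 30, now)   # workers keeping up  => true
queue_healthy?(now - 3600, now) # hour-old backlog    => false
queue_healthy?(nil, now)        # empty queue         => true
```

Monitoring latency rather than queue depth alone is the important design choice: a stalled worker pool shows up even if job arrival is slow.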
Failure to detect that jobs were not being processed. There was no end-to-end monitoring of the event processing pipeline in production. This end-to-end monitoring is being implemented.
Failure to handle surge/backlog load. Changes have been made to the database connection pools so that larger concurrent workloads in the Sidekiq workers can be handled if necessary. Changes have also been made to the way certain configuration parameters are handled, so that they are explicitly controlled with a revision history instead of being implicit in the configuration of the cloud service.
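Making these parameters explicit and versioned might look something like the following sketch (filenames follow Rails/Sidekiq conventions; the values and environment variable names are illustrative assumptions, not Ably's actual configuration):

```yaml
# config/sidekiq.yml (illustrative): concurrency checked into the
# repository and overridable via an environment variable, rather than
# left as an implicit platform default.
:concurrency: <%= ENV.fetch("SIDEKIQ_CONCURRENCY", 10) %>

# config/database.yml (illustrative): the connection pool must be at
# least as large as Sidekiq's concurrency, otherwise worker threads
# block waiting for a database connection - the contention seen at 12:13.
production:
  pool: <%= ENV.fetch("RAILS_MAX_THREADS", 10) %>
```

Keeping both values in the repository makes the coupling between worker concurrency and pool size visible in review, and lets a recovery-time change (such as the 14:59 concurrency reduction) be made without a code change.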
Incident response procedures and playbook. Part of the delay in resolving the issue stemmed from the immaturity, and lack of rehearsal, of incident response operations in the website team. Operations surrounding the principal Ably service are more mature - in terms of incident response procedures, root cause investigation, and playbooks for remediation - but the website team had not previously established such procedures and did not have the same level of preparedness. All aspects of the incident response are being reviewed to learn lessons and improve our ability to respond to any future incidents.
This incident was triggered by a broken software update that was not detected in CI and reached the production system. However, the magnitude of its impact was primarily a result of monitoring failing to detect the problem in production; the problem persisted for several days and, as a result, alerts and other notifications were not delivered to service users.
We take service continuity very seriously, and this incident represents a significant shortfall in the level of service we are committed to providing for our customers. The website has not previously been treated as such a critical element of the Ably service offering, but it is clear that the same approach to operational integrity must be taken as for the primary Ably service. The investigation has been wide-ranging, and we have taken steps to address not only the root causes but also the issues that contributed to the impact of the incident.
We are sorry for the disruption caused by this incident, and are committed to learning from it so that we do not have a recurrence of any of the issues identified in the investigation.