@dwbutler
Created May 18, 2018
Downtime Post Mortem
  • Downtime window: 4:41 PM EST - 4:55 PM EST
  • Total downtime: 14 minutes

"Bob" (backend engineer) was the developer on call during this incident. All names have been changed to protect the innocent. :)

Summary

Bad code that prevented server startup was deployed to production. It had not been manually tested in a dev environment, but the unit and integration test suites had passed.

During the same deploy, a database migration removed a column from a table, making the old code incompatible. When the code was rolled back, the old column had to be added back manually because the Rails migration was irreversible.

Incident response was delayed because Bob was out shopping and nobody was available as a backup. He should have monitored the deploy to completion, but got impatient when it took too long.

Timeline

Bad code not caught during testing

This pull request (redacted) contains the bad code and the migration that caused the downtime. Since it was a backend change without any frontend component (yet), there was no way to manually test the change through the UI, so unit and integration tests were used to verify parts of the functionality.

Some of the functionality was not in use yet, so it was considered safe to defer writing tests for it. As a result, the bad code was not covered by any automated tests. Furthermore, since eager loading was turned off in test mode, the bad code was never even loaded, let alone run.
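For context: Rails loads application classes lazily unless eager loading is enabled, so a class that fails at load time is never touched by a test suite that doesn't reference it. A minimal sketch of the fix that was later applied (see Lessons Learned), assuming a standard Rails configuration layout:

# config/environments/test.rb
Rails.application.configure do
  # Load all application code at boot, as production does, so that
  # load-time errors fail the test suite instead of the deploy.
  config.eager_load = true
end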

The code was never deployed to a Heroku dev environment because there seemed to be no need for manual testing. When the pull request was merged, the deploy to staging completed, but the staging app server crashed. This went unnoticed until later.

A deploy that took too long

Total deployment time is normally between 20 and 25 minutes: roughly 8 minutes to run tests, 6 minutes for the actual deployment, and the rest Circle CI setup and housekeeping.

On this occasion, Bob was babysitting the deploy as usual and getting impatient. He noticed that a step called "Releasing parallel containers that are no longer needed" was taking unusually long: minutes, when it normally takes seconds. He gave up waiting and went grocery shopping.

An incompatible migration

The same pull request added a database migration which removed the topic_id column from the topic_follows table and replaced it with a tag column. This was in fact not a "zero downtime" migration: the existing code running on the app servers was not compatible with the change. This wasn't realized until after the code had been deployed to production.
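The migration itself was redacted along with the pull request, but based on the description it was plausibly something like the following sketch (the class name and Rails version are assumptions):

class ReplaceTopicIdWithTagOnTopicFollows < ActiveRecord::Migration[5.1]
  def change
    # Calling remove_column in a change method without the column type
    # is what makes this migration irreversible: rolling it back raises
    # ActiveRecord::IrreversibleMigration.
    remove_column :topic_follows, :topic_id
    add_column :topic_follows, :tag, :string
  end
end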

Initially, Airbrake reported 6 instances of the following error:

PG::UndefinedColumn: ERROR: column topic_follows.topic_id does not exist

This alerted Bob to the fact that something was wrong. Two minutes later, he received a Pingdom alert that the entire site was down. Initially, he thought that the site was down due to this error, and that somehow the new code that was pushed was still using the old column somewhere. So he believed that a simple code rollback would bring the site back up.

In fact, as he learned later, this error was caused by the existing code breaking once the column was gone, before the app had restarted onto the new code. Such problems are quite difficult to predict and reason about, which is why the best practice is not to remove a column in the same deploy, but to first deploy code that no longer uses the column and remove it afterwards.
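A minimal sketch of that two-phase approach, assuming the model name implied by the table and a Rails version recent enough to have ignored_columns (Rails 5+):

Phase 1, deployed first, stops the application from referencing the column:

class TopicFollow < ApplicationRecord
  # Rails drops the column from cached column metadata, so model-generated
  # queries no longer touch it.
  self.ignored_columns = ["topic_id"]
end

Phase 2, a separate deploy once phase 1 is live everywhere:

class RemoveTopicIdFromTopicFollows < ActiveRecord::Migration[5.1]
  def change
    remove_column :topic_follows, :topic_id, :integer
  end
end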

No one else available to help

Unfortunately, Bob was at Whole Foods.

Bob tossed the groceries and headed straight for his car. He knew it would take at least 10 minutes to get home and roll back the deploy. He checked Slack to see if anyone else could help. "Mary" (backend developer) was on vacation and "Mark" (frontend engineer based in Poland) was offline. "Matt" (product manager) was also offline but he would probably respond to a phone call, so Bob called him. He responded, but was unable to help in time because he didn't have the Heroku Toolbelt set up. By the time he had it set up, Bob was already back home.

An irreversible migration

When Bob got back, he rolled back the deploy by running:

heroku rollback -a prod

To be safe, he also added back the topic_id column. He already knew the migration was irreversible because he had seen the error come up while switching branches. Reversible migrations are good in some cases, but are often unrealistic to run in practice, so he hadn't felt it necessary to enforce a practice of writing reversible migrations.
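For reference, a sketch of how the same migration could have been written reversibly, by splitting change into up and down and supplying the column type (names and types carried over from the assumed sketch above):

class ReplaceTopicIdWithTagOnTopicFollows < ActiveRecord::Migration[5.1]
  def up
    remove_column :topic_follows, :topic_id
    add_column :topic_follows, :tag, :string
  end

  def down
    remove_column :topic_follows, :tag
    # Supplying the type explicitly is what makes the rollback runnable;
    # any data that was in the dropped column is still lost.
    add_column :topic_follows, :topic_id, :integer
  end
end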

He ran heroku pg:psql -a prod, googled the syntax for adding a column since he almost never does it manually, and entered:

alter table topic_follows add column topic_id int;

Three minutes after the rollback, the new app servers were up, and Pingdom reported that the site was back up.

Aftermath

After this incident, we were initially confused about what exactly had gone wrong. At first it seemed like the database migration had caused the downtime, but this was a red herring.

During his investigation, Bob discovered that he had never actually deployed his code anywhere. (He thought he had deployed it to dev1.) He checked staging and discovered that it too had a 500 error on the homepage. Checking the logs, he saw the following error message:

NoMethodError: private method `new' called for Procto:Class
  airwolf/app/services/is_following_topic.rb:3:in `<class:IsFollowingTopic>'

This led him to fix the bad code, and to investigate why this bad code wasn't caught during testing.
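Based on that backtrace, the load-time failure was plausibly a misuse of the Procto gem, whose constructor is marked private. A hedged reconstruction of the service and the fix (the initializer and method bodies are assumptions):

# app/services/is_following_topic.rb
class IsFollowingTopic
  # The broken code plausibly called Procto.new, which the gem keeps
  # private, raising NoMethodError the moment this file is loaded:
  #
  #   include Procto.new(:call)
  #
  # Procto's public API is Procto.call, which builds a module exposing
  # IsFollowingTopic.call(...) as a class method:
  include Procto.call(:call)

  def initialize(user, topic)
    @user = user
    @topic = topic
  end

  def call
    # ...actual follow lookup elided...
  end
end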

After fixing the issue and deploying to staging, he verified that the problem was fixed, and deployed the change to production.

Later, he ran the following SQL to remove the unneeded column:

alter table topic_follows drop column topic_id;

Lessons Learned

  • A Heroku dev environment needs to be the canary in the coal mine. Normally this is staging, but there was no monitoring on staging to verify that a deploy resulted in a functional app server. This is now in place via a Pingdom alert.
  • Eager loading has been turned on in test mode so that bad code will cause immediate failure. This should allow this kind of problem to be caught much sooner, either on the developer's machine or on Circle CI while the feature is in progress.
  • The deploy process will be sped up. Specifically, Mark is working on speeding up asset compilation. Bob will review the deploy script, remove any unnecessary steps, and contact Circle CI support about any slow steps that aren't under our control.
  • A developer will always be available in a 20-minute window around a deploy, especially if that developer was the one who kicked off the deploy!
  • Matt will be available as a backup to run any commands needed in case a developer isn't able to get to a computer.
  • With any database migration, it is always good to have a rollback plan in mind. The riskier the change, the more fleshed out the rollback plan should be.
  • Removing a column is a risky change and should be treated as such. We have already identified tools that can automatically detect database migrations that are unsafe or require downtime; they just need to be put in place. These tools include https://github.com/gocardless/activerecord-safer_migrations and https://github.com/ankane/strong_migrations (see the sketch after this list).
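To illustrate the last point, strong_migrations halts unsafe operations such as dropping a column and prints instructions for the safe multi-deploy approach; proceeding anyway requires an explicit safety_assured wrapper. A sketch, with names assumed as above:

class RemoveTopicIdFromTopicFollows < ActiveRecord::Migration[5.1]
  def change
    # Without safety_assured, strong_migrations raises and explains that
    # the column should first be ignored in the model and removed in a
    # later deploy; the wrapper is an explicit acknowledgment of the risk.
    safety_assured { remove_column :topic_follows, :topic_id, :integer }
  end
end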