Postgres Downtime Post Mortem

Written by David. Many thanks to Lukasz for technical and moral support, and for helping me grep the Papertrail logs.

All times are in EST.

  • Downtime window: 8:00 AM - 8:25 AM
  • Total downtime: 25 minutes

TL;DR

Heroku PGBackups was running during a database migration and was unexpectedly holding a lock. The migration's ALTER TABLE queued behind it, every other query queued behind the ALTER TABLE, and Postgres stopped serving queries until the backup was canceled.

Timeline

PGBackups started at around 07:28:47 on Sep 22 as process 3092. It turns out that backups are scheduled every day at 6:00 AM EST.

heroku pg:backups schedules -a prod

=== Backup Schedules
DATABASE_URL: daily at 10:00 (UTC)

At 7:36, the articles table was copied using a COPY command.

Sep 22 07:36:50 prod app/postgres.3092:  [DATABASE] duration: 10591.841 ms  statement: COPY public.articles (id, wp_id, slug, author_id, title, title_plain, content, content_excerpt, category_title, category_slug, thumbnail, published_at, meta_desc, meta_keywords, related_ids, created_at, updated_at, featured_image_id, vertical_layout, is_private_to_community, external_url, picture_file_name, picture_content_type, picture_file_size, picture_updated_at, publish_to_community_mentorship, meta_title, view_count) TO stdout;

At 8:00, a routine database migration was run and connected to Postgres as process 4598. The intention was to add a NOT NULL constraint and default value of 0.

Sep 22 08:00:15 prod app/postgres.4598:  [DATABASE] statement: ALTER TABLE "articles" ALTER COLUMN "view_count" TYPE integer 
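
For context, the complete change the migration was after would have taken several ALTER TABLE statements, each needing its own Access Exclusive lock on articles. This is a hedged sketch of roughly what Rails emits for such a change (only the TYPE change above made it into the logs before everything wedged):

-- sketch: the full intended change (integer type, default 0, NOT NULL);
-- the exact statements Rails generates may differ slightly
ALTER TABLE "articles" ALTER COLUMN "view_count" TYPE integer;
ALTER TABLE "articles" ALTER COLUMN "view_count" SET DEFAULT 0;
ALTER TABLE "articles" ALTER COLUMN "view_count" SET NOT NULL;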

It turns out that ALTER TABLE statements require an Access Exclusive lock. Per the Postgres documentation: "This mode guarantees that the holder is the only transaction accessing the table in any way."

But PGBackups (process 3092) was holding a lock on the articles table.

Sep 22 08:00:16 prod app/postgres.4598:  [DATABASE] process 4598 still waiting for AccessExclusiveLock on relation 16512 of database 16417 after 1000.114 ms Detail: Process holding the lock: 3092. Wait queue: 4598. Query: ALTER TABLE "articles" ALTER COLUMN "view_count" TYPE integer 
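
With a psql session open at this point, the blocking relationship would have been visible directly. A minimal sketch, assuming Postgres 9.6 or newer for pg_blocking_pids (older versions require joining pg_locks against itself):

-- show every backend that is waiting on a lock, and who is blocking it
SELECT pid, pg_blocking_pids(pid) AS blocked_by, state, query
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0;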

As HTTP requests continued flowing into the web app to view articles, SELECT queries requesting an Access Share lock on the articles table queued up behind the request for an Access Exclusive lock. (Access Share conflicts only with Access Exclusive.)

Sep 22 08:00:19 prod app/postgres.2650:  [DATABASE] process 2650 still waiting for AccessShareLock on relation 16512 of database 16417 after 1000.123 ms Detail: Process holding the lock: 3092. Wait queue: 4598, 2650, 2602. Query: SELECT  "articles".* FROM "articles" WHERE "articles"."slug" = 'things-you-need-for-an-apartment'  ORDER BY "articles"."id" ASC LIMIT 1

The site started timing out requests and returning 500 errors.

At 8:04 I stopped the migration process and rolled back the deploy.

heroku ps:stop run.3640 -a prod
heroku rollback -a prod

By 8:06, 450 lock requests were in the queue. After that point, no more connections were available.

Sep 22 08:06:01 prod app/postgres.5407:  [DATABASE] remaining connection slots are reserved for non-replication superuser connections
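
For reference, connection pressure is easy to check from psql while you still have a connection, which is exactly what we no longer had. A sketch:

-- how many backends are connected, versus the configured limit
SELECT count(*) AS open_connections,
       current_setting('max_connections') AS max_connections
FROM pg_stat_activity;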

All requests to the site started erroring out.

Connections were being left open despite the connecting processes being killed. I tried scaling the workers down to zero and putting the app in and out of maintenance mode, but neither helped. I attempted to run heroku pg:psql in order to kill the ALTER TABLE statement, but couldn't connect because the remaining connection slots were reserved for superusers and I wasn't one; every pg:psql attempt failed. It's not possible to restart Heroku Postgres. We did not have a follower database to fail over to. (I had not added one because the new Heroku Postgres plans come with automatic failover.) Restoring from a backup wouldn't work because nothing could connect to Postgres.
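
For the record, with a working connection under a sufficiently privileged role, the stuck statements could have been killed from psql. A sketch using the pids from the logs above:

-- cancel the waiting ALTER TABLE so the queued SELECTs can drain
SELECT pg_cancel_backend(4598);
-- or cancel the backup's COPY, releasing its shared lock on articles
SELECT pg_cancel_backend(3092);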

After much research, googling, discussion with Lukasz, and general panicking, I checked around the Heroku Postgres panel one more time. I noticed that a PGBackups job was running. I suspected this might be the cause, but there was no way to cancel it from the panel.

At 8:25, I issued the cancel command manually:

heroku pg:backups cancel -a prod

Connections started coming alive again. The one remaining web dyno was borked and had to be restarted. I scaled the infrastructure back up.

Why was PGBackups holding a lock?

Lukasz believes that PGBackups uses the pg_dump utility with the -j option to perform the dump in parallel worker processes. According to the pg_dump documentation:

"Requesting exclusive locks on database objects while running a parallel dump could cause the dump to fail. The reason is that the pg_dump master process requests shared locks on the objects that the worker processes are going to dump later in order to make sure that nobody deletes them and makes them go away while the dump is running. If another client then requests an exclusive lock on a table, that lock will not be granted but will be queued waiting for the shared lock of the master process to be released. Consequently any other access to the table will not be granted either and will queue after the exclusive lock request."

Why did the Access Share locks wait in a queue behind Access Exclusive locks?

Access Exclusive conflicts with every other lock mode, and Access Share conflicts only with Access Exclusive. The key point is that Postgres grants lock requests in queue order: a new request that conflicts with one already waiting in the queue waits behind it, even though that earlier request itself hasn't been granted yet. So once the migration's Access Exclusive request was queued behind the backup's Access Share lock, every new Access Share request from the web app had to queue behind the migration.
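
This queueing behavior is easy to reproduce with three psql sessions against a scratch database (a sketch, not something to run in production):

-- session 1: stand in for the backup; hold an Access Share lock open
BEGIN;
LOCK TABLE articles IN ACCESS SHARE MODE;

-- session 2: the "migration"; queues behind session 1
ALTER TABLE articles ALTER COLUMN view_count TYPE integer;

-- session 3: an ordinary read; queues behind session 2's exclusive request
SELECT * FROM articles LIMIT 1;

-- back in session 1: committing releases the lock and the whole queue drains
COMMIT;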

Lessons Learned / Action Items

  • Don't let PGBackups run during a database migration. (A sketch of a pre-migration check follows this list.)
  • Inspect the Heroku Postgres logs during migrations. Papertrail works well.
  • Look at the Papertrail logs more carefully during the incident, instead of after.
  • Have a pg:psql connection already open in advance in case intervention is needed.
  • Don't perform migrations during sensitive time windows. (In this case, an important event was scheduled to start at 9:00 sharp!)
  • Schedule PGBackups to run off-hours. (Done - heroku pg:backups schedule DATABASE_URL --at "04:00 America/New_York" -a prod)
  • A follower database would have helped us fail over more quickly. (Added)
  • An Amazon RDS instance would give us more control. For example, we could have restarted, or logged in as superuser to kill the original query.
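
As referenced in the first item above, a pre-migration check is cheap. A hedged sketch (neither of these was in place during the incident): look for an in-progress backup COPY before deploying, and set a lock_timeout in the migration session so an ALTER TABLE gives up quickly instead of wedging the lock queue:

-- is a backup (or any long-running COPY) still in flight?
SELECT pid, state, now() - query_start AS running_for, query
FROM pg_stat_activity
WHERE query ILIKE 'COPY%';

-- in the migration session: fail fast rather than block everyone behind us
SET lock_timeout = '5s';
ALTER TABLE "articles" ALTER COLUMN "view_count" TYPE integer;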