ashfurrow/May 12 2018 Outage Retrospective.md

## May 12 2018 Outage Retrospective.md

      
    Raw
  

              May 12 2018 Outage Retrospective.md
            
          
    May 12 2018 Intermittent Outage Retrospective

On May 12, for about five hours in the afternoon EDT, mastodon.technology experienced intermittent outages, lasting from twenty seconds to five minutes. The outages were caused in the course of an investigation into a failure to backup the database. The database is fine, and turned out that the backup method itself wasn't working properly.
Timeline

March 10, 2018: I upgrade the instance to Mastodon 2.3.0, from 2.1.x. After the upgrade, I notice Sidekiq keeps segfaulting and dropping jobs. I investigate and open an issue, which is later resolved. However, in the process of investigation, I upgrade my host's Docker version.
May 12, 2018, Around Noon EDT: I notice that local backups are not nearly as large as they should be. The offsite backups on S3 indicate the problem started on the March 11. I reproduce the failure to run pg_dump, which hangs every time. It gets stuck around the same table each time, but it varies slightly. After a thorough investigation, I determine that the database is not corrupt, and that the invocation of pg_dump itself was failing. The investigation stopped around 6:00pm.
I tried a few different options for pg_dump but I couldn't get anything to work... until I used -f. I had been letting the output from docker exec ... pg_dump go to stdout, which I redirect to a file on the host machine. But it was stalling, I believe due to a change in how stdout in docker exec commands works. I'm not sure, I'm still investigating that part.
Immediate Remedy

I changed the backup script to put the backup on the container's filesystem, then made a manual backup.
# Previously worked, but stopped working.
docker exec -t mastodon_db_1 pg_dump > $filename

# New method that works.
docker exec -t mastodon_db_1 pg_dump -f /dump.bak
docker cp mastodon_db_1:/dump.bak ./$filename
(Note: this is a simplified version of the command that's actually run.)
I have verified that the daily cron task to backup the database and move it to offsite storage is working as expected.
Next Steps

Something about the Docker upgrade changed how stdout redirects from docker exec calls work. I'm not sure what yet, I'm still investigating that part. Hopefully there will be an existing bug report or discussion somewhere I can contribute to.
This incident highlights an important lesson: backups, including automated backups, are not a "set it and forget it" part of administering a Mastodon instance. I'm scheduling recurring reminders to remind me to check on the validity of the backups. I should have discovered the problem sooner. If something had happened to the production database, I'd have no recent backups to restore from. I'm reminded of the GitLab outage where precisely this happened, and this incident only further highlights the importance of checking backup validity.
Additionally, I learned that I need to provide upfront notices about maintenance windows. While I'm glad no one seemed too annoyed by the outages, the fact is that "the database backups aren't working" can be investigated and fixed without rushing.