(Everything in Pacific time.)
From 5:51pm on May 18, 2019 to 11:28am on May 22 (89.6 hours), we returned HTTP 502 errors to all dice.camp clients.
- Docker logs weren't getting rotated.
- These logs eventually expanded to fill all available disk space, causing the initial failure (see the shell sketch after this list).
- We bounced the server (routine procedure when we see a dice.camp failure), but with the disk full, postgres couldn't create its lock file and failed to start.
- Without postgres, the local Mastodon API (which serves most of our traffic) died in the middle of every request.
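For future debugging reference, the whole chain is visible from a shell. These are generic commands, not ones captured during the incident, and the path assumes Docker's default json-file log driver:

```
# Is the disk actually full? Use% hits 100% when it is.
df -h /

# Which container logs are eating the space? The json-file driver
# writes to /var/lib/docker/containers/<id>/<id>-json.log by default.
sudo find /var/lib/docker/containers -name '*-json.log' -exec du -h {} + \
  | sort -h | tail -n 5
```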
Things that went well
- Automated monitoring notified us of the outage.
- SSH access and general *nix knowledge made the actual fix relatively simple.
Things that went poorly
- Outage started on the weekend, when we try to disconnect. Debugging didn't start in earnest 'til Monday.
- Docker admin knowledge is lacking.
Where we got lucky
- We have great community members willing to help debug a random server.
Action items
- Set up logrotate for Docker logs.
- Double-check that it's actually rotating.
- Set up monitoring for disk use (database growth will hit the same limit eventually); a sketch follows this list.
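That monitoring could be as simple as a cron job that yells when usage crosses a threshold. A minimal sketch; the 80% threshold and the Slack webhook variable are placeholders, not our actual config:

```
#!/bin/sh
# Cron this hourly: post to Slack when root-fs usage crosses THRESHOLD.
THRESHOLD=80
USAGE=$(df --output=pcent / | tail -n 1 | tr -dc '0-9')
if [ "$USAGE" -gt "$THRESHOLD" ]; then
  curl -s -X POST -H 'Content-type: application/json' \
    --data "{\"text\": \"dice.camp disk at ${USAGE}%\"}" \
    "$SLACK_WEBHOOK_URL"
fi
```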
Timeline

| Time | Event |
|---|---|
| Sat 2019-05-18 17:51 | Uptime Robot notified us on Slack of the outage. |
| Mon 2019-05-20 06:55 | Rebooted the machine, but it didn't recover. |
| Mon 2019-05-20 07:34 | Restored from backup; still didn't recover. |
| Mon 2019-05-20 09:28 | Investigation started, blocked by unfamiliarity with Docker. |
| Tue 2019-05-21 06:28 | Community member brought in for background knowledge. |
| Wed 2019-05-22 10:59 | "Or can you attach to the running Docker container and read what's in […]" |
| Wed 2019-05-22 11:13 | Root cause (disk full) identified. |
| Wed 2019-05-22 11:28 | Docker logs cleared, logrotate set up for future maintenance; outage ends. |
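A note on that last fix: with the disk at 100%, the safe way to clear Docker's logs is to truncate them in place. Docker keeps the files open, so deleting them wouldn't free the space until the daemon restarts. A sketch, assuming the default json-file driver:

```
# Zero out every container's log file without deleting it.
sudo sh -c 'truncate -s 0 /var/lib/docker/containers/*/*-json.log'
```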
- Docker logs live at
- `docker ps` to learn what jobs are running.
- `docker logs` by itself will stream everything; better is:
  `docker logs --since 30m --follow mastodon_db_1`
- Configuration for log rotation is (now) at
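A logrotate stanza for Docker's container logs might look like the following; the filename and values here are illustrative, not a copy of our config. `copytruncate` is the important part, since Docker keeps the log files open:

```
# /etc/logrotate.d/docker-containers (hypothetical filename)
/var/lib/docker/containers/*/*.log {
  daily
  rotate 7
  missingok
  compress
  delaycompress
  copytruncate
}
```

An alternative is Docker's built-in rotation: set `max-size` and `max-file` under `log-opts` in /etc/docker/daemon.json and skip logrotate entirely.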