(Everything in Pacific time.)
From 5:51pm on May 18, 2019 to 11:28am on May 22 (89.6 hours), we returned HTTP 502 errors to all dice.camp clients.
- Docker logs weren't getting rotated.
- These logs eventually expanded to fill all available disk space, causing the initial failure (see the shell sketch after this list).
- We bounced the server (routine procedure when we see a dice.camp failure), but with the disk full, postgres couldn't create its lock file and failed to start.
- Without postgres, the local Mastodon API (which serves most of our traffic) died in the middle of every request.
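For future debugging reference, the whole chain is visible from a shell. These are generic commands, not ones captured during the incident, and the path assumes Docker's default json-file log driver:

```
# Is the disk actually full? Use% hits 100% when it is.
df -h /

# Which container logs are eating the space? The json-file driver
# writes to /var/lib/docker/containers/<id>/<id>-json.log by default.
sudo find /var/lib/docker/containers -name '*-json.log' -exec du -h {} + \
  | sort -h | tail -n 5
```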
Things that went well
- Automated monitoring notified us of the outage.
- SSH access and general *nix knowledge made the actual fix relatively simple.
Things that went poorly
- Outage started on the weekend, when we try to disconnect. Debugging didn't start in earnest 'til Monday.
- Docker admin knowledge is lacking.
Where we got lucky
- We have great community members willing to help debug a random server.
Action items
- Set up logrotate for Docker logs.
- Double-check that it's actually rotating.
- Set up monitoring for disk use (database growth will hit the same limit eventually); a sketch follows this list.
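That monitoring could be as simple as a cron job that yells when usage crosses a threshold. A minimal sketch; the 80% threshold and the Slack webhook variable are placeholders, not our actual config:

```
#!/bin/sh
# Cron this hourly: post to Slack when root-fs usage crosses THRESHOLD.
THRESHOLD=80
USAGE=$(df --output=pcent / | tail -n 1 | tr -dc '0-9')
if [ "$USAGE" -gt "$THRESHOLD" ]; then
  curl -s -X POST -H 'Content-type: application/json' \
    --data "{\"text\": \"dice.camp disk at ${USAGE}%\"}" \
    "$SLACK_WEBHOOK_URL"
fi
```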
Timeline

| Time | Event |
|---|---|
| Sat 2019-05-18 17:51 | Uptime Robot notified us on Slack of the outage. |
| Mon 2019-05-20 06:55 | Rebooted the machine, but it didn't recover. |
| Mon 2019-05-20 07:34 | Restored from backup; still didn't recover. |
| Mon 2019-05-20 09:28 | Investigation started, blocked by unfamiliarity with Docker. |
| Tue 2019-05-21 06:28 | Community member brought in for background knowledge. |
| Wed 2019-05-22 10:59 | "Or can you attach to the running Docker container and read what's in […]" |
| Wed 2019-05-22 11:13 | Root cause (disk full) identified. |
| Wed 2019-05-22 11:28 | Docker logs cleared, logrotate set up for future maintenance; outage ends. |
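A note on that last fix: with the disk at 100%, the safe way to clear Docker's logs is to truncate them in place. Docker keeps the files open, so deleting them wouldn't free the space until the daemon restarts. A sketch, assuming the default json-file driver:

```
# Zero out every container's log file without deleting it.
sudo sh -c 'truncate -s 0 /var/lib/docker/containers/*/*-json.log'
```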
- Docker logs live at
- `docker ps` to learn what jobs are running.
- `docker logs` by itself will stream everything; better is:
  `docker logs --since 30m --follow mastodon_db_1`
- Configuration for log rotation is (now) at
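A logrotate stanza for Docker's container logs might look like the following; the filename and values here are illustrative, not a copy of our config. `copytruncate` is the important part, since Docker keeps the log files open:

```
# /etc/logrotate.d/docker-containers (hypothetical filename)
/var/lib/docker/containers/*/*.log {
  daily
  rotate 7
  missingok
  compress
  delaycompress
  copytruncate
}
```

An alternative is Docker's built-in rotation: set `max-size` and `max-file` under `log-opts` in /etc/docker/daemon.json and skip logrotate entirely.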