@blinks
Last active February 28, 2020 15:48
dice.camp postmortem, May 2019

dice.camp postmortem

(Everything in Pacific time.)

Impact

From 5:51pm on May 18, 2019 to 11:28am on May 22, 2019 (about 89.6 hours), dice.camp returned HTTP 502 to all clients.

Root Cause

  • Docker logs weren't getting rotated.
  • These logs eventually expanded to fill all available disk space, causing the initial failure.
  • We bounced the server (routine procedure when we see a dice.camp failure), but with the disk full, postgres couldn't create its lock file, and failed to start.
  • Without postgres, the local mastodon API (which the site uses to serve most traffic) died in the middle of any request.
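
The failure chain above can be checked from a shell on the host. The following is a minimal sketch, not the commands actually run during the incident: it reports how full the root filesystem is and lists the largest Docker JSON log files. The log directory comes from the Supporting Materials section; the function name `largest_logs` and the `LOG_DIR` variable are hypothetical, and reading the Docker directory usually requires root.

```shell
#!/bin/sh
# Directory holding Docker's per-container JSON logs (override for testing).
LOG_DIR="${LOG_DIR:-/var/lib/docker/containers}"

# List the N largest *-json.log files under a directory, biggest first.
# Sizes are reported by du in KiB.
largest_logs() {
    dir="$1"
    count="${2:-5}"
    find "$dir" -type f -name '*-json.log' -exec du -k {} + 2>/dev/null \
        | sort -rn | head -n "$count"
}

# How full is the root filesystem?
df -h /

# Which container logs are taking the most space?
largest_logs "$LOG_DIR" 5
```

If `df` shows the filesystem at or near 100% and one log file dominates the list, the disk-full diagnosis fits.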

Lessons Learned

Things that went well

  • Automated monitoring notified us of the outage.
  • SSH access and general *nix knowledge made the actual fix relatively simple.

Things that went poorly

  • Outage started on the weekend, when we try to disconnect. Debugging didn't start in earnest 'til Monday.
  • Our Docker admin knowledge was lacking, which stalled the investigation.

Where we got lucky

  • We have great community members willing to help debug a random server.

Action Items

  • Set up logrotate for Docker logs.
  • Double-check that it's actually rotating.
  • Set up monitoring for disk use (database growth will eventually fill the disk, too).
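
As one possible shape for the disk-use monitoring item, here is a minimal sketch of a check that could run from cron. It is an illustration under assumptions, not the monitoring that was actually deployed: the 80% threshold, the `/` mount point, and the function names are all hypothetical.

```shell
#!/bin/sh
# Hypothetical defaults; adjust for the real host.
THRESHOLD="${THRESHOLD:-80}"
MOUNT="${MOUNT:-/}"

# Print the used-space percentage (digits only) for the filesystem holding $1.
# df -P gives the POSIX one-line-per-filesystem format; field 5 is capacity.
disk_use_percent() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Succeed (exit 0) when usage on $1 is at or above the threshold $2.
disk_over_threshold() {
    [ "$(disk_use_percent "$1")" -ge "$2" ]
}

if disk_over_threshold "$MOUNT" "$THRESHOLD"; then
    echo "WARNING: $MOUNT is at $(disk_use_percent "$MOUNT")% (threshold ${THRESHOLD}%)" >&2
fi
```

The warning goes to stderr so that cron, or whatever wrapper runs the script, can forward it to a notification channel. A threshold well below 100% leaves time to react before postgres can no longer write to disk.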

Timeline

| Time | Event |
| --- | --- |
| Sat 2019-05-18 17:51 | Uptime Robot notified us on Slack of the outage. |
| Mon 2019-05-20 06:55 | Rebooted the machine, but it didn't recover. |
| Mon 2019-05-20 07:34 | Restored from backup, still didn't recover. |
| Mon 2019-05-20 09:28 | Investigation started, blocked by unfamiliarity with Docker. |
| Tue 2019-05-21 06:28 | Community member brought in for background knowledge. |
| Wed 2019-05-22 10:59 | "Or can you attach to the running Docker container and read what's in /var/log?" |
| Wed 2019-05-22 11:13 | Root cause (disk full) identified. |
| Wed 2019-05-22 11:28 | Docker logs cleared, logrotate set up for future maintenance, outage ends. |

Supporting Materials

  • Docker logs live at /var/lib/docker/containers/*/*-json.log
  • Run docker ps to see which containers are running.
  • docker logs on its own prints the entire log; a narrower form is better: docker logs --since 30m --follow mastodon_db_1
  • Configuration for log rotation is (now) at /etc/logrotate.d/docker
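
The postmortem does not include the contents of /etc/logrotate.d/docker, so the following is only a sketch of what such a rule might look like. The schedule and retention values (`daily`, `rotate 7`) are assumptions. `copytruncate` is used because Docker keeps each log file open, so the file is copied and then truncated in place rather than renamed. The script writes to a local file, named by the hypothetical `DEST` variable, so the rule can be reviewed before it is installed.

```shell
#!/bin/sh
# Write an illustrative logrotate rule for Docker's JSON logs.
# DEST defaults to a local file for review; the values below are assumptions,
# not the configuration actually installed on the server.
DEST="${DEST:-./docker.logrotate}"

cat > "$DEST" <<'EOF'
/var/lib/docker/containers/*/*-json.log {
    daily
    rotate 7
    compress
    delaycompress
    missingok
    notifempty
    copytruncate
}
EOF

echo "Wrote $DEST"
```

To double-check that rotation actually happens, `logrotate -d /etc/logrotate.d/docker` performs a dry run that prints what would be done without changing anything, and `logrotate -f /etc/logrotate.d/docker` forces an immediate rotation.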