blinks/2019-05-18-postmortem.markdown Secret

## 2019-05-18-postmortem.markdown

      
# dice.camp postmortem

(Everything in Pacific time.)

## Impact

From 5:51pm on May 18, 2019 to 11:28am on May 22 (about 89.6 hours), we returned HTTP 502 to all dice.camp clients.
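
The length of the outage window can be checked from the two timestamps above. This is a minimal sketch, assuming a `date` that accepts `-d 'YYYY-MM-DD HH:MM'` (GNU coreutils does); the variable names are ours.

```shell
#!/bin/sh
# Outage duration from the two timestamps in the Impact statement.
# Assumption: `date -d` is available (GNU coreutils).
# Both timestamps are Pacific, and no DST change falls between them.
start=$(TZ=America/Los_Angeles date -d '2019-05-18 17:51' +%s)
end=$(TZ=America/Los_Angeles date -d '2019-05-22 11:28' +%s)

minutes=$(( (end - start) / 60 ))
echo "$(( minutes / 60 ))h $(( minutes % 60 ))m"   # prints "89h 37m"
```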

## Root Cause

1. Docker logs weren't getting rotated.
2. These logs eventually expanded to fill all available disk space, causing the initial failure.
3. We bounced the server (routine procedure when we see a dice.camp failure), but with the disk full, Postgres couldn't create its lock file and failed to start.
4. Without Postgres, the local Mastodon API (which the site uses to serve most traffic) died in the middle of every request.
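
For next time, a sketch of how this kind of failure could be spotted from a shell on the host. The helper name `largest_logs` is hypothetical; the default path is the Docker log location recorded in this postmortem. It is not a record of the commands we actually ran.

```shell
#!/bin/sh
# largest_logs [DIR] [N]: print the N largest *-json.log files under DIR
# (size in KiB, then path), biggest first.
# DIR defaults to the Docker container log directory; N defaults to 5.
largest_logs() {
    dir=${1:-/var/lib/docker/containers}
    n=${2:-5}
    find "$dir" -type f -name '*-json.log' -exec du -k {} + 2>/dev/null |
        sort -rn |
        head -n "$n"
}

# Typical use on the host (likely needs root to read Docker's directory):
#   df -h /          # is the filesystem actually full?
#   largest_logs     # which container logs are eating the space?
```

If `df` reports the filesystem at or near 100% and one of these logs accounts for most of it, the chain described above is the likely culprit.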

## Lessons Learned

### Things that went well

- Automated monitoring notified us of the outage.
- SSH access and general *nix knowledge made the actual fix relatively simple.

### Things that went poorly

- Outage started on the weekend, when we try to disconnect. Debugging didn't start in earnest 'til Monday.
- Docker admin knowledge is lacking.

### Where we got lucky

- We have great community members willing to help debug a random server.

## Action Items

- Set up logrotate for Docker logs.
- Double-check that it's actually rotating.
- Set up monitoring for disk use (database growth will eventually fill the disk too).
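
The disk-use monitoring item could start as small as a cron-driven check. This is a sketch under assumptions: the function names `disk_pct` and `check_disk` and the 80% threshold are illustrative, and where the warning gets delivered (Slack, email) is left open.

```shell
#!/bin/sh
# disk_pct [PATH]: print the percentage used on the filesystem holding PATH.
# `df -P` forces the one-line-per-filesystem POSIX layout, so field 5 is "NN%".
disk_pct() {
    df -P "${1:-/}" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# check_disk PATH THRESHOLD: warn on stderr and return 1 when usage on PATH
# is at or above THRESHOLD percent; return 0 otherwise.
check_disk() {
    used=$(disk_pct "$1")
    if [ "$used" -ge "$2" ]; then
        echo "WARNING: $1 is ${used}% full (threshold $2%)" >&2
        return 1
    fi
}

# Example cron usage (threshold is an assumption, tune to taste):
#   check_disk / 80
```

A percentage threshold catches both causes named above: runaway logs and slow database growth.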

## Timeline

| Time | Event |
| --- | --- |
| Sat 2019-05-18 17:51 | Uptime Robot notified us on Slack of the outage. |
| Mon 2019-05-20 06:55 | Rebooted the machine, but it didn't recover. |
| Mon 2019-05-20 07:34 | Restored from backup, still didn't recover. |
| Mon 2019-05-20 09:28 | Investigation started, blocked by unfamiliarity with Docker. |
| Tue 2019-05-21 06:28 | Community member brought in for background knowledge. |
| Wed 2019-05-22 10:59 | "Or can you attach to the running Docker container and read what's in `/var/log`?" |
| Wed 2019-05-22 11:13 | Root cause (disk full) identified. |
| Wed 2019-05-22 11:28 | Docker logs cleared, logrotate set up for future maintenance, outage ends. |


## Supporting Materials

- Docker logs live at `/var/lib/docker/containers/*/*-json.log`.
- `docker ps` shows which containers are running.
- `docker logs` on its own streams everything; better is `docker logs --since 30m --follow mastodon_db_1`.
- Configuration for log rotation is (now) at `/etc/logrotate.d/docker`.
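
For reference, a logrotate stanza for these logs might look roughly like the one below. This is a sketch, not a copy of what is on the server: the schedule (`daily`, `rotate 7`) is an assumption, and the helper `write_docker_logrotate` is hypothetical. `copytruncate` matters here because Docker keeps the log file open, so the file has to be truncated in place rather than moved.

```shell
#!/bin/sh
# write_docker_logrotate FILE: write a candidate logrotate stanza to FILE.
# The retention values are illustrative assumptions, not the deployed config.
write_docker_logrotate() {
    cat > "$1" <<'EOF'
/var/lib/docker/containers/*/*-json.log {
    daily
    rotate 7
    compress
    missingok
    notifempty
    copytruncate
}
EOF
}

# Typical use (as root):
#   write_docker_logrotate /etc/logrotate.d/docker
#   logrotate --debug /etc/logrotate.d/docker   # dry run: reports what would rotate
#   logrotate --force /etc/logrotate.d/docker   # rotate now, then check the logs shrank
```

The `--debug` dry run followed by a forced rotation is one way to cover the "double-check that it's actually rotating" action item.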