This is an attempt to document the downtime that occurred on mastodon.technology on August 18, 2018 from roughly 10am to 3pm, Eastern Daylight Time. It's not hyper-accurate. Please contact @ashfurrow for any clarifications.
Ash sincerely apologizes for the downtime.
- August 17, 02:44 UTC (10:44PM EDT): the @announcements account posts notice of a maintenance window for the following day. The maintenance window was described as two hours long, beginning at 14:00 UTC the next day.
- August 18, beginning at roughly 14:10 UTC, mastodon.technology experienced several periods of extended downtime. Site monitoring results are attached in a text file, as well as a screenshot of the site response time graph.
- August 18, roughly 15:45 UTC, mastodon.technology came back online but with severe latency issues. See attached graph screenshot of site latency times. User experience was severely degraded.
- August 18, roughly 19:00 UTC, mastodon.technology came back online with normal performance. The @announcements account posted an explanation.
Memory problems with the mastodon.technology Rails application server container have recently required reboots of that container. In an attempt to mitigate this, Ash decided to increase the RAM of the DigitalOcean droplet that mastodon.technology runs on.
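For context, the reboot workaround might look something like the following. This is a sketch, not the exact commands used on mastodon.technology; the service name `web` assumes Mastodon's standard docker-compose setup.

```shell
# Check per-container memory use to confirm the Rails application
# server is the one running out of memory.
docker stats --no-stream

# Restart just the Rails application server service.
# (`web` is the service name in Mastodon's standard docker-compose.yml;
# the actual name on this server is an assumption.)
docker-compose restart web
```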
Updates to the VM's operating system were done as well, just prior to the droplet resize. By accident, these updates were interrupted, which left the site inaccessible. Ash recovered the droplet from a backup from August 16, 2018; the database, Redis store, and Elasticsearch data are all stored on separate drives from the droplet, and were not affected. There was no data loss. After roughly an hour of downtime, the site came back online, but with severely degraded performance.
Performance issues were caused by a failure to use the correct precompiled web assets. Without these assets, requests took a very long time. Attempts to precompile the assets again were unsuccessful: the Rake task stalled and CPU use skyrocketed for the node process. Attempts to precompile the assets on another machine and copy them over were also unsuccessful, as were attempts to rebuild from scratch the Docker images that run the site.
After taking a walk, Ash resized the droplet back to its original parameters. He precompiled the assets successfully and restarted the web server container. Response times returned to normal.
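The successful recovery steps might have looked roughly like this. Again, a sketch under assumptions: the service name `web` comes from Mastodon's standard docker-compose setup, and the exact invocation on mastodon.technology may have differed.

```shell
# Precompile the web assets in a one-off container
# (hypothetical; assumes Mastodon's standard docker-compose layout).
docker-compose run --rm web bundle exec rails assets:precompile

# Restart the Rails application server so it serves the fresh assets.
docker-compose restart web
```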
It is unclear if the cause of the failed asset precompile is:
- the resize and subsequent backup restore (the backup image was made on the old size), or
- the resize itself.
The former seems more likely.
The original memory issues are still present, so the droplet will need to be resized again eventually. An announcement will be posted to the @announcements account at least 48 hours in advance, describing possible downtime of up to an hour (the time it takes to resize a droplet twice, with testing in between).
A few lessons were learned:
- Droplet resizing should be done in isolation from any update to the software running on the droplet.
- Updates to the VM software need to be done in isolation from any other changes to the VM's environment.
- Extreme care must be taken whenever applying updates to the software on the VM; it is critical that they are not interrupted.
- Droplet resizing is not necessarily as easy as DigitalOcean describes in its marketing materials.