Timeline

Overnight from Thursday (May 2, 2019) to Friday, mastodon.technology experienced degraded service, including but not limited to:

  • Media uploads failing.
  • Emoji picker intermittently not working.
  • Some API endpoints were returning 500 errors.

Friday morning, around 9:30am EDT, I saw several messages from users alerting me to the problem and began investigating. By 9:49am, I had diagnosed the problem (the server disk was full) and restored normal service, announcing the outage on the @announcements account.

Cause & Resolution

DigitalOcean's dashboard showed disk use at 100%. Disk use had been constant until the previous evening when it started increasing linearly. This was related to an ssh terminal I'd left open overnight that was tailing Sidekiq logs (to investigate an unrelated issue that a user had reported) with something like the following command:

docker-compose logs -f sidekiq | grep -i "..."

Something about that session had steadily consumed disk space, though I'm not sure exactly what. I ssh'd into the server and found that the command line was sluggish and many actions failed (because the disk was full). I terminated the lingering ssh connection remotely and found an old database backup to delete, which freed enough space for me to shut down (and restart) the Docker containers. At that point, disk use dropped back to its previous levels.

Elasticsearch enters a read- and delete-only mode when its disk fills up and must be manually switched back to normal operation. I've documented the subsequent steps I took here.
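For readers hitting the same read-only state, here is a minimal sketch of how the block is commonly cleared through the index settings API. It is not necessarily the exact procedure I followed. The cluster URL in the usage note is an assumption, and `unblock_body` and `clear_read_only_block` are hypothetical helper names.

```shell
#!/bin/sh
# Sketch only: clears the read-only/allow-delete block on all indices.
# Free up disk space first, or Elasticsearch may re-apply the block.

# Print the JSON body that resets the block setting to its default.
unblock_body() {
    printf '%s' '{"index.blocks.read_only_allow_delete": null}'
}

# Send the settings update to every index on the cluster at URL $1.
# The URL is supplied by the caller; nothing here is specific to my server.
clear_read_only_block() {
    curl -sS -X PUT "$1/_all/_settings" \
        -H 'Content-Type: application/json' \
        -d "$(unblock_body)"
}
```

Usage, assuming Elasticsearch listens on its default local port: `clear_read_only_block "http://localhost:9200"`. Newer Elasticsearch releases may lift the block on their own once disk use drops, so check the documentation for your version.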

For Next Time

The hourly cron task to check disk use, and alert me if it exceeds a certain level, did not work. I'd recently changed the structure of the server's disks, but made a mistake updating the regex used in this script.
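As an illustration of a check that is harder to break with a bad pattern, here is a minimal sketch that avoids regex matching on device names and treats an unparseable reading as an alert. It is not my actual script. The function names, the mount point, and the 90% threshold are all assumptions.

```shell
#!/bin/sh
# Hypothetical hourly disk check; mount point and threshold are assumptions.

# Print the used-space percentage (digits only) for the filesystem holding $1.
disk_usage_percent() {
    # df -P prints one POSIX-format line per filesystem; field 5 is e.g. "42%".
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Succeed (exit 0) if usage $1 is at or above threshold $2.
# An empty or non-numeric reading also counts as over the threshold,
# so a parsing mistake raises an alert instead of passing quietly.
over_threshold() {
    case "$1" in
        ''|*[!0-9]*) return 0 ;;
    esac
    [ "$1" -ge "$2" ]
}

# Check mount point $1 against threshold $2; print a message and
# return non-zero when an alert should be sent.
check_disk() {
    usage="$(disk_usage_percent "$1")"
    if over_threshold "$usage" "$2"; then
        echo "ALERT: $1 usage is '${usage}'% (threshold $2%)"
        return 1
    fi
    echo "OK: $1 usage is ${usage}% (threshold $2%)"
}
```

A cron wrapper could call `check_disk / 90` and send a notification whenever it returns non-zero. Because the path is passed to `df` directly, restructuring the server's disks would not require updating a pattern.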

I learned that long-running log tailing can bloat disk use (approximately 30GB overnight, see attached screenshot – disk size is 100GB), though I don't know the underlying cause. ncdu was behaving very slowly and I wanted to fix the problem instead of satisfying my curiosity (but let me know in the gist comments if you have ideas).
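If this happens again, a quick first look could use plain `du` on one directory level at a time rather than a full interactive scan. The sketch below is a diagnostic idea, not something I ran during the incident. `largest_entries` is a hypothetical helper, and the starting directory is an assumption.

```shell
#!/bin/sh
# Sketch: list the largest immediate children of a directory, sizes in KiB.

# $1 = directory to inspect, $2 = number of lines to show (default 10).
# The first line of output is the directory's own total.
largest_entries() {
    dir="$1"
    count="${2:-10}"
    # -x stays on one filesystem, -k reports KiB, -d 1 limits depth to one level.
    du -xk -d 1 "$dir" 2>/dev/null | sort -rn | head -n "$count"
}
```

For example, `largest_entries /var/lib/docker 5` would be one place to start, assuming Docker uses its default data directory. Whether that is where the 30GB went is a guess on my part.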

I apologize for the downtime and appreciate your patience. Thanks!
