@codl
Last active August 11, 2018 06:33

On 2018-08-11, from 01:36 to 04:13 UTC, chitter.xyz suffered an outage.

Timeline

all times UTC

  • 01:36 - the network goes down for an indeterminate length of time
  • 01:XX - ImageMagick convert processes start hoarding memory
  • 01:XX - system goes under memory pressure
  • 01:51 - mastodon app server serves its last request
  • 01:54 - OOM-killer starts killing netdata, and only netdata, over and over, since netdata sets its own OOM score very high. it gets restarted 10 minutes later every time and killed again soon after
  • 02:33 - OOM-killer finally kills a convert process
  • 02:33 - journald crashes from SIGABRT, restarts
  • 02:35 - systemd restarts mastodon app server
  • 03:5X - codl wakes up and finds caddy is not accepting connections. nothing notable in caddy's logs
  • 04:13 - caddy is restarted. everything comes back up immediately

Notes

It's not clear whether the network outage caused convert to go wild, but it seems likely that it did.

Restarting netdata over and over is absurd. What's more, netdata proved invaluable in investigating this after the fact, although one hour of history was not nearly enough. Will look into making it not raise its OOM score, and increasing how much history it keeps around. It's ok if some of it gets swapped out.
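
To see which process the OOM killer would pick first, the kernel's own numbers can be read out of /proc. Below is a minimal sketch in Python, assuming a Linux box where each /proc/<pid>/ directory has oom_score, oom_score_adj and comm files. The function names and the proc_root parameter are made up for illustration, and nothing here reflects how netdata actually sets its score.

```python
from pathlib import Path


def read_oom_info(pid, proc_root="/proc"):
    """Return (oom_score, oom_score_adj) for a pid, or None if unreadable."""
    base = Path(proc_root) / str(pid)
    try:
        score = int((base / "oom_score").read_text().strip())
        adj = int((base / "oom_score_adj").read_text().strip())
    except (OSError, ValueError):
        # The process may have exited, or we may lack permission.
        return None
    return score, adj


def rank_by_oom_score(proc_root="/proc"):
    """List (score, adj, pid, name) tuples, most likely OOM victim first."""
    rows = []
    for entry in Path(proc_root).iterdir():
        if not entry.name.isdigit():
            continue  # skip non-process entries such as /proc/sys
        info = read_oom_info(entry.name, proc_root)
        if info is None:
            continue
        try:
            name = (entry / "comm").read_text().strip()
        except OSError:
            name = "?"
        rows.append((info[0], info[1], int(entry.name), name))
    # Highest oom_score first: that is the process the kernel prefers to kill.
    rows.sort(key=lambda row: row[0], reverse=True)
    return rows
```

Printing rank_by_oom_score()[:10] on the server should show whether netdata sits at the top of the list. A large positive oom_score_adj (the valid range is -1000 to 1000) would explain why it kept being picked ahead of the convert processes.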

It seems caddy broke under memory pressure, possibly at 01:51. It stopped listening but didn't crash, so it was not restarted. Will look into systemd's watchdog facilities.
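
A "running but not accepting connections" failure is invisible to a supervisor that only watches for process exit, so some kind of external probe is needed. Here is a minimal sketch of one in Python, assuming a plain TCP connect is a good enough signal. The function name is hypothetical, and how the result would be fed to systemd or any other supervisor is left out.

```python
import socket


def accepts_connections(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection raises OSError on refusal, timeout or unreachable host.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

This only catches refused or timed-out connections. A server that still completes the TCP handshake but never answers would pass, so a fuller check would send an actual HTTP request and look at the response.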

Not sure why journald crashed, but it didn't lose any logs in the process, and that's very impressive. It even logged itself crashing :o

Glossary

  • ImageMagick: image processing library. mastodon uses its convert tool to scale images down and make thumbnails
  • Mastodon app server: the bit of mastodon that replies to users' requests, as opposed to the bits that run in the background
  • OOM Killer: the bit of linux that, when the system is completely out of free memory, picks a process to kill to hopefully get back to a working system
  • OOM Score: a score given to every process, based mostly on how much memory it is using (older kernels also weighed things like how long it has been running and which user is running it). when the OOM Killer runs, the process with the highest OOM Score is killed. the OOM score of a process can be manually adjusted up or down for things that are less or more critical
  • Netdata: real-time monitoring software. it is very good and very thorough but it does use a lot of memory
  • systemd: supervisor software. it launches and monitors services and does a heap of useful things, and less useful things. people love to hate it
  • journald: systemd component that keeps track of logs generated by services, as well as system logs
  • codl: that's me!
  • caddy: https server. in our setup, it proxies requests to the mastodon app server