@wlonkly · Last active July 8, 2021
Steps I took to troubleshoot a full disk

I wrote this down after I responded to a page today (a holiday) because it would've been a decent pairing opportunity for a couple of new people on my team. Second best is that people can read what I did afterwards and ask me any questions. And then I realized that there's nothing PagerDuty-specific or confidential in here, so I may as well share it wider. It's hardly an epic incident, but it's a good example of "doing the work", I think. I borrowed the "write down what you learned" approach from Julia "b0rk" Evans. It's a fantastic practice.

The PagerDuty incident: "Disk will be full in 12 hours. device:/dev/nvme0n1p1, host:stg-nomadusw2-client-..."

(Note for non-PD readers: We run Nomad where others might run Kubernetes.)

Here's the process I went through.

  • Tried the usual fix, docker system prune -a -f, and it cleared up 0B; the alert didn't resolve (see the sketch below).

Learned: It's not stale docker image layers.
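
(Hindsight: docker system df would have shown this up front; it reports how much space each resource type is using and how much of it is reclaimable. A minimal sketch, stock Docker CLI only:)

docker system df        # per-type totals and RECLAIMABLE: images, containers, local volumes, build cache
docker system df -v     # verbose variant: per-image and per-container breakdown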

  • Looked at df -h | grep dev and saw /mnt was 77% full. (We bind-mount various filesystems under /mnt. I don't love it.) (More on df vs du below.)
  • Figured it's probably /var from previous experience
  • Did cd /var; du -sh * | sort -h (both "h"s mean "human format", e.g. "1.02GB"):
0   lock
0   run
4.0K    crash
4.0K    local
4.0K    opt
16K tmp
20K vault
36K snap
244K    mail
264K    consul
796K    spool
1.9M    backups
75M cache
84M awslogs
318M    chef
4.9G    log
128G    lib

Learned: It's /var/lib
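
(Aside, since the df/du distinction matters in this hunt: df asks each filesystem how full it is, while du walks a directory tree and adds up file sizes. A minimal sketch of both, as used above:)

df -h | grep dev        # one line per filesystem: Size, Used, Avail, Use%, mount point; "dev" narrows to real /dev/... devices
du -sh /var/* | sort -h # one line per directory, recursively summed, biggest last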

  • Did du -sh * | sort -h in /var/lib and so forth, until I narrowed it down to /var/lib/docker/overlay2

Learned: It's docker overlay2 layers, but previously learned it's not stale image layers.

  • Is it active images? No, docker images shows nothing near 70+ GB (sketch below).

Learned: It's not a docker image.
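
(A quick way to sanity-check that, assuming a Docker CLI new enough for Go-template formatting:)

docker images --format '{{.Size}}\t{{.Repository}}:{{.Tag}}' | sort -h
# sort -h copes with Docker's MB/GB suffixes well enough to put the biggest images last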

  • Kept descending into the filesystem with du -sh * | sort -h.
  • Wished I had ncdu, which makes this much easier (sketch below).
  • apt install ncdu because hey why not? We can throw this host away afterwards, I'm responding to a (minor) incident, and I have the freedom to install diagnostic tools by hand.
  • Tracked down to: /var/lib/docker/overlay2/face4015.../merged/opt/kafka_2.13-2.8.0/logs

Learned: It's some Kafka-related logs in the layer face4015...
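
(The ncdu detour as a minimal sketch; -x is the flag worth knowing, since it keeps ncdu from crossing filesystem boundaries:)

apt install ncdu
ncdu -x /var/lib        # interactive du: descend with arrow keys, entries sorted by size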

Problem: How do I find out what container owns that layer?

  • Stack Overflow had nothing going in that direction, only container-to-layer (though that turns out to be enough; see the sketch after this step).
  • Idea: Explore the rest of the filesystem under overlay2/face4015.../merged.
  • Discovered /run.sh under that directory
  • Ran docker ps | grep run.sh, which output (roughly):
454f4d73bc17  nomadic-mirrormaker-datadog:5b17  "/run.sh"  27 hours ago  Up 27 hours (healthy)  8125/udp, 8126/tcp  datadog-cec12ae6-9fd9-bfed-7e39-2aeacc448b81
c90e6050ec7c  nomadic-mirrormaker:5b17          "/run.sh"  27 hours ago  Up 27 hours            26946->1099/tcp, 26946->1099/udp

Learned: It's one of those two "nomadic-mirrormaker" containers, whatever that is.
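
(In hindsight, the container-to-layer direction is enough: ask every running container for its layer path and grep for the one du found. A sketch, assuming the overlay2 storage driver:)

docker ps -q | xargs docker inspect --format '{{.Name}}: {{.GraphDriver.Data.MergedDir}}' | grep face4015
# GraphDriver.Data.MergedDir is each container's overlay2 "merged" path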

  • Guess: It's probably not the datadog sidecar.
  • Noted for later: Why is there a datadog sidecar? Containers can reach the datadog agent on the host.
  • Looked up nomadic-mirrormaker in the Nomad UI
  • Noticed that the "owner" tag on the job is "dbre"

Learned: It's owned by the DBRE team.

Resolution: Left a note in that team's Slack channel, and will reschedule the job if things fill up, which will force it to restart with a brand-new container and thus no logs yet. Also added a note about using datadog sidecars.
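
(The reschedule hammer, for reference, assuming a recent-ish Nomad CLI; the alloc ID is a placeholder:)

nomad job status nomadic-mirrormaker    # lists the job's allocations; find the one on the affected host
nomad alloc stop <alloc-id>             # stops that alloc; the scheduler replaces it with a fresh container, empty logs dir and all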

Update: I have just learned about docker ps --size, about 30 minutes too late! But I know it for next time now.
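
(For next time, then; the numbers here are illustrative:)

docker ps --size
# adds a SIZE column like "71.2GB (virtual 72.8GB)"; the first number is the container's
# writable layer, which is exactly where those in-container Kafka logs were piling up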

-- By the way, docker system prune -a -f is pronounced "docker system prune as fuck". You're welcome.

@rugwirobaker

A very nice writeup, thank you. I would gobble up a writeup about your team/company's experience running Nomad (there are not many of those in the public domain).

@timurakhmadeev commented Jul 2, 2021

You can do du -m /var | sort -n | tail -30 instead of manually descending.

@catherio commented Jul 8, 2021

Following along for learning purposes! If I do du -sh then I only see one entry for the whole directory, and the man page says -s is equivalent to -d 0, so it wouldn't show you sub-entries. Is that a typo in the command you'd run to get something worth piping into grep or sort? Thanks in advance!

@wlonkly (Author) commented Jul 8, 2021

@catherio Oh, I see what I mistranscribed -- what I actually did to debug was cd /var; du -sh * and repeat. I didn't even know du had a -d option, -sh * is just hardcoded into my brain at this point! du -h -d 1 /var etc. would give similar results.
