@mharsch
Last active May 13, 2022 18:09
user watchdog script for flooding system log errors
If you have a failure mode that can be diagnosed by a high rate of messages flooding the system log,
you can use the script below in conjunction with the watchdog service to detect the condition
and reboot if it persists. This is a pretty big hammer, but potentially better than just rebooting on
loss of network or some other arbitrary liveness test.
Add a 'test-binary' line to /etc/watchdog.conf pointing to this script (it will be treated as a V0 test script
with no corresponding repair script). Also set 'interval' to something like 20 seconds (it should at least be greater than
the sample period in the script, SAMPLE_SEC).
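As a rough illustration, the two settings might look like the fragment below. The script path /usr/local/bin/logflood.sh is an assumption made for this example; use wherever you actually installed the script, and make sure the file is executable.

```
# /etc/watchdog.conf (fragment)
# Hypothetical install path; point this at your copy of logflood.sh.
test-binary = /usr/local/bin/logflood.sh
# Seconds between checks; keep this larger than SAMPLE_SEC in the script.
interval = 20
```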
You can convince yourself it's working (once the watchdog service has been restarted with the above config)
by monitoring /var/log/watchdog/test-bin.stdout as well as the watchdog.service log while manually flooding
the log yourself:
while true; do
    logger 'log storm'
done
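If you would rather not leave an infinite loop running, a bounded variant of the same idea could look like the sketch below. The function name flood_log and the LOGGER_CMD override are hypothetical, introduced only for this example; LOGGER_CMD lets you do a dry run with a harmless command such as echo instead of logger.

```shell
# flood_log COUNT
# Send COUNT 'log storm' messages to the system log, then stop.
# LOGGER_CMD is a hypothetical override for dry runs; it defaults to logger.
flood_log() {
    count=$1
    i=0
    while [ "$i" -lt "$count" ]; do
        "${LOGGER_CMD:-logger}" 'log storm'
        i=$((i + 1))
    done
}
```

For example, `flood_log 5000` sends five thousand messages and returns, while `LOGGER_CMD=echo flood_log 3` just prints three lines to the terminal without touching the log.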
logflood.sh:
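#!/usr/bin/bash
# Exit status returned to watchdog when the log is flooding.
readonly EUSERVALUE=246
# Length of the sampling window in seconds; keep it below watchdog's 'interval'.
SAMPLE_SEC=3
# Count lines appended to syslog during the window. -n 0 skips lines already in
# the file; '|| true' absorbs the non-zero status timeout returns when it stops tail.
NUM_MSGS=$( (timeout "$SAMPLE_SEC" tail -f -n 0 /var/log/syslog || true) | wc -l)
# Integer division: messages per second, rounded down.
MEASURED_RATE=$((NUM_MSGS / SAMPLE_SEC))
THRESHOLD_RATE=200
if [[ $MEASURED_RATE -gt $THRESHOLD_RATE ]]; then
    echo "log is getting flooded at a rate of $MEASURED_RATE messages per second"
    exit $EUSERVALUE
fi
#echo "log looks fine; carry on"
exit 0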
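Because the rate is computed with integer division and compared with -gt, the check only trips once a sample counts at least (THRESHOLD_RATE + 1) * SAMPLE_SEC lines; with the defaults above that is 603 lines in 3 seconds. The sketch below isolates that arithmetic so it can be tried without touching the log. The function name is_flooding is hypothetical and not part of the script.

```shell
# is_flooding NUM_MSGS SAMPLE_SEC THRESHOLD_RATE
# Returns 0 (true) when the rounded-down rate exceeds the threshold,
# mirroring the comparison made in logflood.sh.
is_flooding() {
    num_msgs=$1
    sample_sec=$2
    threshold=$3
    rate=$((num_msgs / sample_sec))
    [ "$rate" -gt "$threshold" ]
}
```

With these arguments, `is_flooding 602 3 200` is false (602 / 3 rounds down to 200, which is not greater than 200) and `is_flooding 603 3 200` is true. If you want the check to fire at exactly the threshold, one option is to compare with -ge instead.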