mharsch/gist:fc04ef0d236ebb66965e1a0d0232ece9

## gistfile1.txt
If you have a failure mode that can be diagnosed by a high rate of messages flooding the system log,
you can use the script below in conjunction with the watchdog service to detect the condition
and reboot if it persists.  This is a pretty big hammer, but potentially better than just rebooting on
loss of network or some other arbitrary liveness test.

Add a 'test-binary' line to /etc/watchdog.conf pointing to this script (it will be treated as a V0 test script
with no corresponding repair script).  Also, set 'interval' to something like 20 seconds (it must at least exceed
the sample period in the script).
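
As a rough sketch, the two settings described above might look like this in /etc/watchdog.conf. The script path is an assumption (the note does not say where logflood.sh is installed), so substitute your own location:

```conf
# /etc/watchdog.conf (fragment)
# Hypothetical install location; point this at wherever logflood.sh actually lives.
test-binary = /usr/local/sbin/logflood.sh
# Seconds between watchdog checks; must exceed the script's 3-second sample window.
interval = 20
```

The script must be executable by the watchdog daemon, and the change only takes effect once the watchdog service has been restarted.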

You can convince yourself it's working (once the watchdog service has been restarted with the above config)
by monitoring /var/log/watchdog/test-bin.stdout as well as the watchdog.service log while manually flooding
the log yourself:

while true;do
logger 'log storm'
done
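
Because a flood that persists is exactly what makes the watchdog reboot the machine, a variant of the loop that stops by itself can be convenient while testing. This is only a sketch and not part of the original script: `flood_for` is a hypothetical helper name, and it assumes `logger` and `date +%s` are available.

```shell
# flood_for DURATION [MESSAGE]  (hypothetical helper, not part of the original gist)
# Calls logger in a tight loop for roughly DURATION seconds, then stops.
flood_for() {
    local duration="$1"
    local msg="${2:-log storm}"
    local end
    end=$(( $(date +%s) + duration ))
    while [ "$(date +%s)" -lt "$end" ]; do
        logger "$msg"
    done
}

# Example: flood the log for about 30 seconds, then stop.
# flood_for 30
```

Pick a duration long enough to span at least one watchdog interval, so that the test script gets a chance to sample the log while the flood is running.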


logflood.sh:

#!/usr/bin/bash

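# Exit code reported to watchdog when the log is flooding.
readonly EUSERVALUE=246

# Length of the sampling window in seconds.
SAMPLE_SEC=3
# Count the lines appended to syslog during the window. -n 0 skips lines that
# already exist; timeout exits non-zero when it kills tail, hence the || true.
NUM_MSGS=$( (timeout "$SAMPLE_SEC" tail -f -n 0 /var/log/syslog || true) | wc -l )
MEASURED_RATE=$((NUM_MSGS / SAMPLE_SEC))
# Messages per second above which the log counts as flooded.
THRESHOLD_RATE=200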

if [[ $MEASURED_RATE -gt $THRESHOLD_RATE ]]; then
    echo "log is getting flooded at a rate of $MEASURED_RATE messages per second"
    exit $EUSERVALUE
fi
#echo "log looks fine; carry on"
exit 0
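
To make the threshold arithmetic in logflood.sh easier to reason about, here is the same decision pulled out into a standalone function that can be tried with made-up numbers, without touching syslog. It is a sketch only: `check_rate` and its parameter names are hypothetical, and it returns the same 246 code the script exits with.

```shell
# check_rate NUM_MSGS SAMPLE_SEC THRESHOLD_RATE  (hypothetical helper)
# Mirrors the script's logic: integer messages-per-second compared with -gt.
check_rate() {
    local num_msgs="$1" sample_sec="$2" threshold="$3"
    local rate="$(( num_msgs / sample_sec ))"
    if [ "$rate" -gt "$threshold" ]; then
        echo "log is getting flooded at a rate of $rate messages per second"
        return 246
    fi
    return 0
}

# 900 messages in 3 s is 300/s, above 200: prints the message and returns 246.
# check_rate 900 3 200
# 602 messages in 3 s truncates to 200/s, which is not greater than 200: returns 0.
# check_rate 602 3 200
```

Since the division is integer division and the comparison is strict, a 3-second sample has to contain at least 603 messages (201 per second) before the check trips at a threshold of 200.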