@mharsch
Last active June 16, 2023 21:37
setup linux to reboot if the host loses access to hostname (not IP)

Deprecated

I no longer recommend the below method. For my updated recommendation, see this gist

Setup Linux to reboot if it loses access to a certain host by hostname (not IP)

The Linux watchdog service can be configured to run certain 'liveness' tests periodically, and then take some action (such as a reboot) if a test fails (and doesn't recover) for some period of time.

There is a built-in test for pinging an IP address (the 'ping' directive), but you may instead need to test access to a hostname (e.g. api.importantserver.com), and there is no guarantee that the name will always resolve to the same IP. Here we describe a way to specify a custom script for testing ping connectivity to a hostname, and how to plug this script into the Linux watchdog(8) service.

Make sure watchdog service is installed and running

sudo apt install watchdog
systemctl status watchdog

Create the default directory for watchdog test scripts (the -p flag avoids an error if it already exists)

sudo mkdir -p /etc/watchdog.d

Create a new script in the test scripts directory:

sudo nano /etc/watchdog.d/pinghost.sh

Now paste the following contents into the pinghost.sh file:

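#!/bin/bash
# watchdog test script: exit 0 if TARGET answers ping, non-zero otherwise.

# Exit code that watchdog logs as 'user-reserved code'
readonly EUSERVALUE=246
readonly TARGET=io.adafruit.com

# Send 3 pings quietly; the hostname is resolved again on every run
if ping -c3 -q "$TARGET" > /dev/null; then
    exit 0
else
    echo "failed to ping $TARGET"
    exit "$EUSERVALUE"
fi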

Change the value of 'TARGET' to match your desired ping destination. Now make the script executable by root:

sudo chmod u+x /etc/watchdog.d/pinghost.sh
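
Before handing the script to watchdog, it can help to run it once by hand and look at the exit status, since the exit status is what watchdog reports in its log. The helper below is a minimal sketch; report_status is a name invented for this example and is not part of the watchdog package.

```shell
# Hypothetical helper: run any command, discard its output, and print the
# exit status (0 = check passed, anything else = check failed).
report_status() {
    status=0
    "$@" > /dev/null 2>&1 || status=$?
    echo "exit status: $status"
}

# Usage, from a root shell (only root may execute the script):
#   report_status /etc/watchdog.d/pinghost.sh
```

For a one-off check without the helper, `sudo /etc/watchdog.d/pinghost.sh; echo $?` prints the same status.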

Lastly, edit the watchdog config file and tune the values of 'interval' and 'retry-timeout' if desired. The 'interval' setting is how often (in seconds) to run the test script (suggestion: 120 or higher). The 'retry-timeout' setting is how long (in seconds) to wait in a test-failing state before rebooting (suggestion: 1800).

sudo nano /etc/watchdog.conf
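
To double-check what ended up in the file, a small helper like the one below prints only the active (uncommented) timing lines. This is a sketch: show_watchdog_timing is a made-up name, and the values in the comment are simply the suggestions given above.

```shell
# With the suggested values, /etc/watchdog.conf would contain lines like:
#   interval      = 120
#   retry-timeout = 1800

# Hypothetical helper: print the uncommented interval and retry-timeout
# settings from a watchdog config file (default: /etc/watchdog.conf).
show_watchdog_timing() {
    conf=${1:-/etc/watchdog.conf}
    grep -E '^[[:space:]]*(interval|retry-timeout)[[:space:]]*=' "$conf"
}

# Usage:
#   show_watchdog_timing
```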

Once you're happy with the config settings, stop and start the service to activate the changes.

sudo systemctl stop watchdog
sudo systemctl start watchdog

Now the watchdog is running. You can check the status of the watchdog service with this command:

systemctl status watchdog

If you see that the service fails with the following error:

This interval length (59) might reboot the system while the process sleeps! Try 59 or less

then 'interval' is likely set at or above the timeout of the watchdog device. To keep the longer interval, you can add '--force' to the command line in the service file (/lib/systemd/system/watchdog.service) like this:

[Unit]
Description=watchdog daemon
Conflicts=wd_keepalive.service
After=multi-user.target
OnFailure=wd_keepalive.service

[Service]
Type=forking
EnvironmentFile=/etc/default/watchdog
ExecStartPre=/bin/sh -c '[ -z "${watchdog_module}" ] || [ "${watchdog_module}" = "none" ] || /sbin/modprobe $watchdog_module'
ExecStart=/bin/sh -c '[ $run_watchdog != 1 ] || exec /usr/sbin/watchdog $watchdog_options --force'
ExecStopPost=/bin/sh -c '[ $run_wd_keepalive != 1 ] || false'

[Install]
WantedBy=default.target

After making the above change, you'll need to run:

sudo systemctl stop watchdog
sudo systemctl daemon-reload
sudo systemctl start watchdog
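
Because the ExecStart line in the unit file expands $watchdog_options from /etc/default/watchdog, a possible alternative is to set watchdog_options="--force" in that file and leave the packaged unit file untouched. Treat this as a sketch: it assumes your /etc/default/watchdog has a watchdog_options= line, and it was not part of the steps tested above. The helper force_enabled is a made-up name that only checks whether the option is in place.

```shell
# Assumed line in /etc/default/watchdog (instead of editing the unit file):
#   watchdog_options="--force"

# Hypothetical helper: succeed if --force appears in watchdog_options.
force_enabled() {
    defaults=${1:-/etc/default/watchdog}
    grep -q '^watchdog_options=.*--force' "$defaults"
}

# Usage:
#   if force_enabled; then echo "--force is set"; else echo "--force is not set"; fi
```

After changing /etc/default/watchdog, a stop and start of the service should be enough, since the unit file itself has not changed.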

You can get the whole log history from the watchdog service with:

journalctl -u watchdog

or you can just watch the end of the log as it grows by adding the -f (follow) parameter:

journalctl -u watchdog -f

You should test that the watchdog script is working by disconnecting a network cable (or otherwise disabling network connectivity). In the log file, you should see output similar to this to indicate a watchdog-triggered reboot:

Dec 29 11:17:04 miketinkerpi watchdog[633]: test binary /etc/watchdog.d/pinghost.sh returned 246 = 'user-reserved code'
Dec 29 11:17:24 miketinkerpi watchdog[633]: test binary /etc/watchdog.d/pinghost.sh returned 246 = 'user-reserved code'
Dec 29 11:17:44 miketinkerpi watchdog[633]: test binary /etc/watchdog.d/pinghost.sh returned 246 = 'user-reserved code'
Dec 29 11:18:04 miketinkerpi watchdog[633]: test binary /etc/watchdog.d/pinghost.sh returned 246 = 'user-reserved code'
Dec 29 11:18:24 miketinkerpi watchdog[633]: test binary /etc/watchdog.d/pinghost.sh returned 246 = 'user-reserved code'
Dec 29 11:18:44 miketinkerpi watchdog[633]: Retry timed-out at 80 seconds for /etc/watchdog.d/pinghost.sh
Dec 29 11:18:44 miketinkerpi watchdog[633]: repair binary /etc/watchdog.d/pinghost.sh returned 246 = 'user-reserved code'
Dec 29 11:18:44 miketinkerpi watchdog[633]: shutting down the system because of error 246 = 'user-reserved code'
Dec 29 11:18:44 miketinkerpi watchdog[975]: /usr/lib/sendmail does not exist or is not executable (errno = 2)
-- Boot 9ebb8dd9c90b4219a3784acf8e4361e5 --
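
When reviewing a test run, it can be handy to count how many failed checks were logged before the reboot. The sketch below reads log text on standard input; count_failed_checks is a hypothetical name, and the pattern assumes the 'test binary ... returned 246' wording shown in the log above.

```shell
# Hypothetical helper: count log lines where a watchdog *test* run of
# pinghost.sh returned the failure code 246. Reads log text on stdin.
count_failed_checks() {
    # grep -c exits non-zero when the count is 0; '|| true' hides that
    grep -c 'test binary /etc/watchdog.d/pinghost.sh returned 246' || true
}

# Usage:
#   journalctl -u watchdog | count_failed_checks
```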