NGINX Plus Health Check - mark node as unhealthy if bandwidth utilisation exceeds threshold

Overview

The requirement is for NGINX Plus to back off and stop sending new connections to an upstream node if the network utilisation for that node exceeds a given threshold.

Strategy

Create a simple HTTP-accessible script that runs on each upstream node. The script returns an HTTP 200 OK status if the node is not overloaded, and 503 Too Busy if it is.

Use the script as the target for an NGINX Plus health check.
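
By default an NGINX Plus health check passes when the response status is 2xx or 3xx and fails otherwise, so the 503 alone is enough to take the node out of rotation. If you prefer to be explicit, a match block can require exactly 200; a minimal sketch (the name bandwidth_ok is arbitrary):

# in the http{} context
match bandwidth_ok {
    status 200;
}

# referenced from the health check, e.g.:
# health_check port=8099 match=bandwidth_ok;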

Implementation

Running scripts from NGINX alone is not possible, as NGINX does not provide CGI or a similar application platform. We don't want the complexity of installing PHP, Python or any other app platform on our upstream servers, so we'll use a simple HTTP responder (loadtest.sh) written in bash and run from systemd. You can of course adapt or port loadtest.sh to PHP, Python etc., according to what can be run on the upstream server.

The script listens on port 8099 (for example) and returns a status accordingly:

curl -D - http://dev0:8099/

HTTP/1.0 200 OK
Content-Type: text/plain
Connection: close

HTTP Status: 200 OK

Transfer counter 13088764457 to 13088764457 bytes
Timer 1610462202357 to 1610462204365 ms

Bytes transferred = 0 bytes over time 2008 milliseconds

Bandwidth = 0 Mbits

"Failure" output, when current bandwidth exceeds limit:

curl -D - http://dev0:8099/

HTTP/1.0 503 Too Busy
Content-Type: text/plain
Connection: close

HTTP Status: 503 Too Busy

Transfer counter 13558896787 to 13814984187 bytes
Timer 1610462222625 to 1610462224639 ms

Bytes transferred = 256087400 bytes over time 2014 milliseconds

Bandwidth = 970 Mbits

Steps

Put loadtest.sh somewhere appropriate, such as /usr/local/bin, and make it executable.
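
For example, assuming loadtest.sh is in the current directory:

sudo install -m 755 loadtest.sh /usr/local/bin/loadtest.sh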

Edit loadtest.sh to define the correct network interface to monitor and the bandwidth threshold.
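
Both settings are variables near the top of the script; the values below are the ones used in this example:

IF=enp0s3   # network interface to monitor in /proc/net/dev
BW=500      # threshold bandwidth, in Mbits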

Test loadtest.sh by writing an HTTP request to STDIN:

printf "GET /\r\n\r\n" | /usr/local/bin/loadtest.sh

HTTP/1.0 200 OK
Content-Type: text/plain
Connection: close

HTTP Status: 200 OK

Transfer counter 14170650531 to 14170650531 bytes
Timer 1610462694117 to 1610462696125 ms

Bytes transferred = 0 bytes over time 2008 milliseconds

Bandwidth = 0 Mbits

Configure systemd to run this script in response to a connection to port 8099.

File /etc/systemd/system/loadtest.socket:

[Unit]
Description=HTTP service for load testing health check

[Socket]
ListenStream=8099
Accept=yes

[Install]
WantedBy=sockets.target

File /etc/systemd/system/loadtest@.service:

[Unit]
Description=Load Test HTTP health check script

[Service]
ExecStart=-/usr/local/bin/loadtest.sh
StandardInput=socket
User=nginx
Group=nginx
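
After creating both unit files, tell systemd to pick them up and (optionally) enable the socket so it is started at boot:

sudo systemctl daemon-reload
sudo systemctl enable loadtest.socket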

Start the new socket unit and check its status:

systemctl start loadtest.socket
systemctl status loadtest.socket

● loadtest.socket - HTTP service for load testing health check
   Loaded: loaded (/etc/systemd/system/loadtest.socket; enabled; vendor preset: enabled)
   Active: active (listening) since Tue 2021-01-12 14:50:42 UTC; 4s ago
   Listen: [::]:8099 (Stream)
 Accepted: 1; Connected: 0;
    Tasks: 0 (limit: 4620)
   Memory: 52.0K
   CGroup: /system.slice/loadtest.socket

Jan 12 14:50:42 dev0 systemd[1]: Listening on HTTP service for load testing health check.

Test it with a web client:

curl -D - localhost:8099

Check /var/log/syslog for errors; for example, you may need to ensure that the nginx:nginx user and group can read and execute the script.
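
A quick way to check this is to run the script directly as the nginx user (assuming sudo is available on the node):

printf "GET /\r\n\r\n" | sudo -u nginx /usr/local/bin/loadtest.sh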

Testing

If you need to simulate high traffic, one approach is to scp a large file to /dev/null:

dd if=/dev/zero of=/tmp/1G bs=1M count=1024
scp /tmp/1G user@localhost:/dev/null

In this case, ensure that you monitor the loopback interface by setting IF=lo in loadtest.sh.
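
While the transfer is running, you can poll the health-check endpoint from a second shell and watch the status code flip from 200 to 503:

watch -n 2 'curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8099/'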

Use the NGINX Plus dashboard to view the real-time status of the health checks you've configured.
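
The same health state can also be read from the NGINX Plus API that the configuration below exposes at /api; for example (the API version prefix, 8 here, depends on your NGINX Plus release, and "upstreams" is the name of the upstream group):

curl -s http://localhost/api/8/http/upstreams/upstreams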

loadtest.sh:

#!/bin/bash
# Read and discard the HTTP request line and headers from STDIN
read -r request
while read -r header ; do
    [ "$header" == $'\r' ] && break
done

IF=enp0s3   # interface to monitor in /proc/net/dev
BW=500      # threshold bandwidth, in Mbits; if transmit bandwidth on $IF exceeds this value, return 503
# IF=lo     # uncomment to monitor the loopback interface for testing

# Take a look at /proc/net/dev to see how this works...
# Get bytes transmitted (field 10) and the current time (milliseconds), wait, then sample again
B1=$( grep "$IF:" /proc/net/dev | awk '{print $10}' )
T1=$(( $(date +%s%N) / 1000000 ))
sleep 2     # wait 2 seconds; sleep is not always accurate, so we time the interval as well
B2=$( grep "$IF:" /proc/net/dev | awk '{print $10}' )
T2=$(( $(date +%s%N) / 1000000 ))

BYTES_T=$(( B2 - B1 ))
TIME_MS=$(( T2 - T1 ))
BW_MBITS=$(( ( BYTES_T * 1000 * 8 ) / ( TIME_MS * 1024 * 1024 ) ))   # note: integer arithmetic

STATUS="200 OK"
[[ $BW_MBITS -gt $BW ]] && STATUS="503 Too Busy"

printf "HTTP/1.0 $STATUS\r\n"
printf "Content-Type: text/plain\r\n"
printf "Connection: close\r\n"
printf "\r\n"
cat << EOM
HTTP Status: $STATUS
Transfer counter $B1 to $B2 bytes
Timer $T1 to $T2 ms
Bytes transferred = $BYTES_T bytes over time $TIME_MS milliseconds
Bandwidth = $BW_MBITS Mbits
EOM
NGINX Plus configuration:

# primary virtual server, listening on port 80 and load-balancing to the upstreams group
server {
    listen 80;

    location / {
        proxy_pass http://upstreams;
        status_zone status_page;
        # We'll probe the health-check script on :8099
        health_check port=8099;
    }

    # expose the NGINX Plus API and dashboard (be aware of the security implications)
    location /api {
        api;
    }

    location = /dashboard.html {
        root /usr/share/nginx/html;
    }
}

upstream upstreams {
    zone backend 64k;
    server dev0:8080;   # test server; we just forward to :8080
}

# test server on :8080
server {
    listen 8080;

    location / {
        root /usr/share/nginx/html;
    }
}