Docker Swarm Keepalived Configuration

Part of collection: Hyper-converged Homelab with Proxmox

Keepalived is a load balancer that adds high availability to Linux systems. See the Keepalived documentation for more background information.

This setup builds on High Available Pi-hole failover cluster using Keepalived and Orbital Sync.

Setup Keepalived

This setup uses a virtual IP address, 192.168.1.4, which is the only address needed to reach applications on the Docker Swarm: http://192.168.1.4:<port-number>
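
As a quick sanity check once everything is running, any published Swarm service should answer on the VIP regardless of which node currently holds it. A minimal sketch, assuming a hypothetical service published on port 8080 (substitute a port your stack actually publishes):

# Hypothetical example: 8080 stands in for a port published by one of your Swarm services
curl -I http://192.168.1.4:8080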

All nodes

sudo apt-get install keepalived -y

Docker Manager Nodes

Add the Script and the Master Node configuration to the Docker Manager servers:

# node_active_ready_check.sh
sudo curl https://gist.github.com/Drallas/4b965da52d259f0125f18bca39ffc8a3/raw/1774c4fad8783c02d0803b58ad1e6f250a432533/script-node_active_ready_check.sh -o /etc/scripts/node_active_ready_check.sh
sudo chmod +x /etc/scripts/node_active_ready_check.sh

# keepalived.conf
sudo curl https://gist.github.com/Drallas/4b965da52d259f0125f18bca39ffc8a3/raw/9368ec523fda134b68e66ce857b607c93b3678e7/script-keepalived-master.conf -o /etc/keepalived/keepalived.conf
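
The downloads above write into /etc/scripts, which may not exist on a fresh install; create the directory first. A small sketch:

# Create the scripts directory before downloading into it
sudo mkdir -p /etc/scripts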

Docker Worker Nodes

Add the Keepalived configuration to the server:

# keepalived.conf
sudo curl https://gist.github.com/Drallas/4b965da52d259f0125f18bca39ffc8a3/raw/9368ec523fda134b68e66ce857b607c93b3678e7/script-keepalived-slave.conf -o /etc/keepalived/keepalived.conf

On the worker nodes there can't be a script that checks docker node ls (a manager-only command), so keepalived only monitors the status of the Docker service there. Each slave has its own priority value and its own unicast_src_ip & unicast_peer configuration.

Edit /etc/keepalived/keepalived.conf with nano on each slave node and change the priority and unicast blocks (a unicast sketch follows the priority list below):

Docker Manager 1

priority 165

Docker Manager 2

priority 155

Docker Manager 3

priority 155

Docker Worker

priority 145
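
The configuration files embedded further down no longer contain unicast blocks (unicast was dropped later, see the comments at the bottom), but if you keep them, each node needs its own unicast_src_ip and the other nodes listed under unicast_peer. A sketch for the node at 192.168.1.111, using only the addresses mentioned in this gist (adjust to your own network and add the remaining node):

! Inside the vrrp_instance docker_swarm block of the node with IP 192.168.1.111
unicast_src_ip 192.168.1.111
unicast_peer {
    192.168.1.112
    192.168.1.113
}

The other nodes get the same block with the addresses swapped around.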

If the node with IP 192.168.1.111 fails, 192.168.1.112 becomes MASTER; if that one fails too, 192.168.1.113 takes over, and so on.
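
At any moment the MASTER is simply the node that has the virtual IP bound to its interface. A quick check, assuming interface eth0 as in the configurations below:

# Prints the VIP only on the node that currently holds it
ip -4 addr show dev eth0 | grep 192.168.1.4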

Start keepalived

sudo systemctl enable --now keepalived && sudo systemctl status keepalived
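
Besides systemctl status, the keepalived journal shows the VRRP state transitions, which makes it easy to confirm which node entered the MASTER state. A minimal sketch:

# Follow keepalived's log and watch for state transitions
sudo journalctl -u keepalived -f

# Or only show the recorded state changes
sudo journalctl -u keepalived | grep -i state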

DNS

Set a DNS A record for <appname>.<domain>.<countrycode> pointing to 192.168.1.4.
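
How the A record is set depends on your DNS server. Since this collection uses Pi-hole, one option is a Local DNS record; a sketch assuming Pi-hole v5 and a hypothetical hostname app.example.com (use your own <appname>.<domain>.<countrycode>):

# Add a local DNS record (or use the web UI: Local DNS > DNS Records)
echo "192.168.1.4 app.example.com" | sudo tee -a /etc/pihole/custom.list
sudo pihole restartdns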

Testing

Stop the Docker service on the MASTER node: sudo systemctl stop docker.socket && sudo systemctl status docker.service

See the test section of High Available Pi-hole failover cluster using Keepalived and Orbital Sync for more details on how to test this.

When done, start Docker again with sudo systemctl start docker.service and monitor sudo systemctl status keepalived to see the node assuming the MASTER state again.
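
One way to watch the failover as it happens is to keep an eye on the VIP from a second terminal while stopping Docker on the MASTER. A small sketch of the full test cycle:

# Terminal 1 (any machine on the LAN): the VIP should keep answering during the failover
ping 192.168.1.4

# Terminal 2 (current MASTER node): trigger the failover
sudo systemctl stop docker.socket

# Terminal 3 (a BACKUP node): watch it claim the VIP
watch -n1 'ip -4 addr show dev eth0 | grep 192.168.1.4'

# When done, restore Docker on the original MASTER
sudo systemctl start docker.service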

script-keepalived-master.conf

! Configuration File for keepalived
global_defs {
    vrrp_startup_delay 5
    enable_script_security
    max_auto_priority
    script_user root
}

vrrp_track_process track_docker {
    process dockerd
    weight 10
}

vrrp_script node_active_ready_check {
    script "/etc/scripts/node_active_ready_check.sh"
    interval 5
}

vrrp_instance docker_swarm {
    state MASTER
    interface eth0
    virtual_router_id 10
    priority 160
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass 8kgEDPp3
    }
    virtual_ipaddress {
        192.168.1.4/24
    }
    track_process {
        track_docker
    }
    track_script {
        node_active_ready_check
    }
}
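
Before (re)starting keepalived it can be useful to check the configuration for syntax errors; newer keepalived builds support a config-test mode (this assumes your packaged version is recent enough to have it):

# Syntax-check the configuration, then apply it
sudo keepalived -t -f /etc/keepalived/keepalived.conf
sudo systemctl restart keepalived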

script-keepalived-slave.conf

! Configuration File for keepalived
global_defs {
    vrrp_startup_delay 5
    enable_script_security
    max_auto_priority
    script_user root
}

vrrp_track_process track_docker {
    process dockerd
    weight 10
}

vrrp_instance docker_swarm {
    state BACKUP
    interface eth0
    virtual_router_id 10
    priority 145
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass 8kgEDPp3
    }
    virtual_ipaddress {
        192.168.1.4/24
    }
    track_process {
        track_docker
    }
}
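
The track_process block above matches on the process name dockerd, so it is worth confirming that this is the name actually running on the node:

# Should print the PID of the Docker daemon; if it prints nothing,
# track_process will treat Docker as down
pgrep -x dockerd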

script-node_active_ready_check.sh

#!/bin/bash
# Health check for keepalived: exits 0 when the Swarm reports Ready/Active,
# non-zero otherwise.
status=$(docker node ls --format "{{.Status}} {{.Availability}}")
if [[ "$status" == *"Ready"* && "$status" == *"Active"* ]]; then
    echo "Node is active and ready."
    exit 0
else
    echo "Node is not active or not ready."
    # Log the reason to a file
    exit 1
fi
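
keepalived only looks at the exit code of a vrrp_script (0 means healthy, anything else means failed), so the check can be exercised by hand on a manager node before relying on it:

# Run the check manually; a healthy manager prints "Node is active and ready." and exits with 0
sudo /etc/scripts/node_active_ready_check.sh
echo "exit code: $?"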

scyto commented Oct 8, 2023

Why use unicast - it seems to add complexity?
Also, why have priority differences if all the nodes are the same and services can be on any node?


Drallas commented Oct 8, 2023

Unicast is more explicit, perhaps overkill, but first build it before I strip away settings. The priorities are for predictability; I might also remove that when I evolve the setup.


scyto commented Oct 8, 2023

Ok. Think about your 10 on your process check… You have a bigger than 10 weight difference in prio across the nodes, so node 1 with a process fail will have a prio equal to node 3, and presumably 5 higher than node 4, so keepalived might run on node 1 if node 2 and node 3 fail the process check… Luckily the script check on 1 will cause a keepalived fatal, so it should be good, but given you don't have the script check on node 2, you may find node 2 can die and still be equal to node 4 in weight… You may want to increase the process weight to 20.


Drallas commented Oct 8, 2023

@scyto Thanks for your feedback!

I revised my config and the priority values.

I also promoted 3 out of the 4 Docker Swarm nodes to manager. In case Node 1 fails, either Node 2 or Node 3 takes over the manager leader role and the VIP.

Only if Nodes 1, 2 and 3 fail does the fourth get the VIP!


Drallas commented Oct 12, 2023

Removed unicast; the network stack seemed to hang because of it. Now two days without it and no more connection loss.


scyto commented Oct 12, 2023

Removed Unicast

Weird, I wouldn't have thought it would harm anything.... What was the logic for going unicast in the first place?


Drallas commented Oct 12, 2023

Just trying stuff, but I didn't think that one through. Perhaps I'll revisit it some day; for now I'm happy it's stable. Migrating data now to CephFS, so far 350 GB via rsync via VirtioFS without any issue!


scyto commented Oct 12, 2023

Just trying stuff

Got it, and that's the best way to learn. I only used what I did originally as that's what all the examples had :-)

for now I'm happy it's stable.

that's all that matters

350 GB via rsync via VirtioFS without any issue

Sweet - not sure when I will get to VirtioFS or the Docker plugins for Ceph... my folks are coming to visit from the UK for 3 weeks and I have a week to prep the house (it needs a lot of work), so I may not get to look at either until mid to late Nov.


Drallas commented Oct 12, 2023

Let me know how it goes; I played a lot with it over the past weeks. Use the updated script from my Gist, it fixes some stuff, see the Proxmox Forum.
