
@Apsu

Apsu/VIP.md

Last active Dec 24, 2015
Quick description of VIP failover + local service routing issue
vrrp_instance foo {
    interface bar
    virtual_router_id 1
    state BACKUP
    priority 100
    virtual_ipaddress {
        x.x.x.y
    }
    notify_master "/etc/keepalived/update_route add x.x.x.y"
    notify_backup "/etc/keepalived/update_route del x.x.x.y"
}
#!/usr/bin/env bash
# update_route: set or remove the local-table route for a VIP so that its
# source hint is the interface's primary (non-VIP) address.

# Helper functions
# _log MESSAGE [LOGGER_FLAG]: send MESSAGE to syslog, tagged update_route
_log() { if [[ -n "$1" ]]; then logger -t update_route ${2:+"$2"} "$1"; fi; }
# _exit MESSAGE [TO_STDERR]: log MESSAGE, then exit with status 1
_exit() {
    if [[ -n "$1" ]]; then
        if [[ $2 ]]; then
            _log "$1" "-s"    # -s: also print the message to stderr
        else
            _log "$1"
        fi
    fi
    exit 1
}
_errexit() { _exit "$1" 1; }
_help() { _errexit "Error updating route. Usage: update_route add|del VIP"; }

# Grab args
action=$1
vip=$2

# Check args
if [[ $# -lt 2 ]] || [[ -z $action ]] || [[ -z $vip ]]; then _help; fi

# Try to snag VIP interface (4th field of the local-table route line)
iface=$(ip route show table local "$vip" 2>/dev/null | cut -d' ' -f4)

# Didn't find it?
if [[ -z $iface ]]; then _errexit "Invalid VIP $vip! VIP must exist on an interface."; fi

# Grab primary IP on interface
src=$(ip -o -4 addr show "$iface" primary | sed -nr '1 s/^.*inet ([^/]*).*$/\1/p')

# Check it
if [[ -z $src ]]; then _errexit "No IP found on $iface. Expected at least $vip."; fi
if [[ $src == "$vip" ]]; then _errexit "Primary IP $src on $iface is VIP. A non-VIP primary IP must exist."; fi

# Do the things
case $action in
    "add")
        _log "Merging local route for $vip with source $src."
        ip route replace table local local "$vip" dev "$iface" src "$src" # Replace
        ;;
    "del")
        _log "Deleting local route for $vip with source $src."
        ip route del table local local "$vip" dev "$iface" src "$src" # Delete
        ;;
    *)
        _help
        ;;
esac

exit 0 # Success!

In Linux, when you add an IP to an interface, the kernel creates two routes for you:

table local: local x.x.x.y dev foo proto kernel scope host src x.x.x.y
table main: x.x.x.a/bb dev foo proto kernel scope link src x.x.x.y
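
To check which source hint one of these routes carries on a real box, one option is to pull the `src` field out of the `ip route` output by keyword rather than by position. This is only a minimal sketch: `route_field` is a hypothetical helper name, and the sample line uses made-up documentation addresses.

```bash
# route_field KEY: read `ip route` output on stdin and print the word that
# follows KEY (e.g. "dev" or "src") on each line that contains it.
route_field() {
    awk -v key="$1" '{
        for (i = 1; i < NF; i++)
            if ($i == key) { print $(i + 1); break }
    }'
}

# Intended use on a live system (needs iproute2; VIP is a placeholder):
#   ip route show table local "$VIP" | route_field src
# Self-contained demonstration on an illustrative line:
sample='local 192.0.2.10 dev eth0 proto kernel scope host src 192.0.2.10'
printf '%s\n' "$sample" | route_field src
printf '%s\n' "$sample" | route_field dev
```

Matching on the keyword keeps the lookup working even if extra fields show up earlier in the line.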

Now, if you are setting up an HA pair or cluster, you will often have a VIP -- a "virtual" or "floating" IP -- which is moved between boxes during failovers. And if you also happen to run clients on these nodes that connect to that VIP, something very odd happens when you move the IP.

So... Linux has routing rules, tables, and a cache. When a connection is made, the cache is consulted for a matching route tuple (src, dst, tos, fwmark, iif), and if one exists, the connection stores a pointer to it so each packet can be routed rapidly. If the cache entry expires or otherwise goes away, a new route is cloned by following the policy rules to look in the tables.

Now... when an IP you're connected to/from goes away... something very odd happens. The stack realizes that it can't route the packet -- at all, not just unreach/prohibit/etc. -- and it "responds" to each side of the connection with 0 window ACKs (for TCP). This puts the sender into the persist timer state: it waits x seconds, then sends a probe to solicit an ACK and see whether the window is > 0 yet. A packet with window = 0 means "I can't accept data yet", which is very true in this case. The IP is gone.
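
One way to spot connections stuck in this state is the timer column of `ss -o`, which reports `persist` for sockets that are probing a zero window. A rough sketch, assuming iproute2's `ss` is installed and IPv4 addresses; `stuck_on` is a hypothetical helper, and the input lines are illustrative rather than captured from a real system.

```bash
# stuck_on ADDR: read `ss -tno` output on stdin and print the sockets that
# involve ADDR (as local or peer address) and are running the persist timer.
stuck_on() {
    awk -v addr="$1" '
        index($0, "timer:(persist") == 0 { next }   # keep persist timers only
        {
            for (i = 1; i <= NF; i++) {
                a = $i
                sub(/:[0-9]+$/, "", a)              # strip trailing :port
                if (a == addr) { print; next }
            }
        }'
}

# Live use (VIP is a placeholder):  ss -tno | stuck_on "$VIP"
# Illustrative input only:
printf '%s\n' \
    'ESTAB 0 512 192.0.2.10:44312 192.0.2.10:3306 timer:(persist,1.2sec,3)' \
    'ESTAB 0 0 192.0.2.11:50514 192.0.2.20:5672 timer:(keepalive,42sec,0)' |
    stuck_on 192.0.2.10
```

Run on the old VIP node right after a failover, a non-empty result would suggest local clients are still waiting on the address that moved.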

But see, you'd think "Wait, we have the interface route. The VIP moved to another box, we can just take that route". Turns out, after carefully reading the source and thinking about it: if your source IP on the connection was also the VIP (the default for the local route, i.e. a local->local connection), the routing stack simply can't clone a new route for you. Part of the information it stores/caches/matches on is the source IP, and that IP no longer exists on the box, so it will never match any route. The code just bails hardcore.

Best case scenario, the previous-VIP-node has services timing out for a while. Worst case, you have some non-transactional data which was incompletely transmitted/received, while you wait x minutes (sometimes hours) for the client's connection to time out.

So! This led me to the solution.

After adding the VIP via whatever failover software, replace the local route's source hint with the static IP on that interface. So you get:

table local: local x.x.x.y dev foo proto kernel scope host src x.x.x.z

Now services connect from .z to .y, and when .y goes away, you clean up the route. Since the source IP still exists, a new (unicast) route is cloned from the main table, ARP happens on dev foo, the packet gets successfully forwarded to the new-VIP-node, and its daemon sends a RST because it doesn't know the connection. This lets the service realize it needs to reconnect, so its state doesn't sit around incomplete for some random length of time.
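
Both the replace and the cleanup come down to a single `ip route` command against the local table. The sketch below only prints the command so it can be reviewed before being run as root; `vip_route_cmd` is a hypothetical helper, and the addresses and device name are placeholders.

```bash
# vip_route_cmd ACTION VIP DEV SRC: print (do not run) the ip command that
# sets or removes the local route for VIP with SRC as its source hint.
vip_route_cmd() {
    local action=$1 vip=$2 dev=$3 src=$4 verb
    case $action in
        add) verb=replace ;;   # swap the source hint on the kernel-created route
        del) verb=del ;;       # clean up once the VIP has moved away
        *)   echo "usage: vip_route_cmd add|del VIP DEV SRC" >&2; return 1 ;;
    esac
    echo "ip route $verb table local local $vip dev $dev src $src"
}

vip_route_cmd add 192.0.2.10 eth0 192.0.2.2
vip_route_cmd del 192.0.2.10 eth0 192.0.2.2
```

Printing instead of executing makes it easy to eyeball the result first; piping the output to `sh` as root would apply it.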

And BOOM. Problem solved :D Going to write a nice blog post on this I think, since it's becoming an extremely common pattern, particularly in cloud infrastructures, to have HA controllers with services that inter-communicate.
