In Linux, when you add an IP to an interface, the kernel creates two routes for you:
table local: local x.x.x.y dev foo proto kernel scope host src x.x.x.y
table main:  x.x.x.a/bb dev foo proto kernel scope link src x.x.x.y
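As a concrete sketch (the 10.0.0.2/24 address and eth0 device are illustrative, not from any real box), you can watch the kernel install both routes the moment you add the address:

```shell
# Adding an address makes the kernel create both routes automatically
ip addr add 10.0.0.2/24 dev eth0

# The host route in the local table:
ip route show table local | grep 10.0.0.2
#   local 10.0.0.2 dev eth0 proto kernel scope host src 10.0.0.2

# The connected-subnet route in the main table:
ip route show table main
#   10.0.0.0/24 dev eth0 proto kernel scope link src 10.0.0.2
```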
Now, if you are setting up an HA pair or cluster, you will often have a VIP -- a "virtual" or "floating" IP -- which is moved between boxes during failovers. And if you also happen to run clients on these nodes that connect to that VIP, something very odd happens when you move the IP.
So... Linux has routing rules, tables, and a cache. When a connection is made, the cache is consulted for a matching route tuple (src, dst, tos, fwmark, iif), and if an entry exists, the connection stores a pointer to it so each packet can be routed rapidly. If the cache entry expires or otherwise goes away, a new route is cloned by following the policy rules to look in the tables.
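The rule/table lookup is easy to inspect from userspace (addresses here are illustrative):

```shell
# The policy rules, in priority order; each points at a table
ip rule show
#   0:      from all lookup local
#   32766:  from all lookup main
#   32767:  from all lookup default

# Ask the kernel what route a given (src, dst) tuple would resolve
# to -- the same lookup the stack does when cloning a cache entry
ip route get 10.0.0.3 from 10.0.0.2
```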
Now... when an IP you're connected to/from goes away... something very odd happens. The stack realizes that it can't route the packet -- at all, not just unreach/prohibit/etc -- and it "responds" to each side of the connection with zero-window ACKs (for TCP). This causes the sender to enter the persist timer state: it waits x seconds, then sends a probe to solicit an ACK and see whether the window is > 0 yet. A packet with window = 0 means "I can't accept data yet", which is very true in this case. The IP is gone.
But see, you'd think "Wait, we have the interface route. The VIP moved to another box, we can just take that route". Turns out... after carefully reading the source and thinking about it... if your source IP on the connection was also the VIP (the default for the local route/local->local connection), the routing stack simply can't clone a new route for you, because part of the information it stores/caches/matches on is the source IP, which no longer exists on the box, so the lookup will never match any route. The code just bails hardcore.
Best case scenario, the previous-VIP node has services timing out for a while. Worst case, you have some non-transactional data which was incompletely transmitted/received, while you wait x minutes (sometimes hours) for the client's connection to time out.
So! This led me to the solution.
After adding the VIP via whatever failover software... Replace the local route's source hint with the static IP on that interface. So you get:
table local: local x.x.x.y dev foo proto kernel scope host src x.x.x.z
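In iproute2 terms, this amounts to overwriting the kernel-generated local route right after the failover tooling brings the VIP up. A minimal sketch, assuming 10.0.0.200 is the VIP (.y) and 10.0.0.2 the interface's static address (.z) -- both made up for illustration:

```shell
# Failover software (or you) adds the VIP; the kernel auto-creates
# a local route with src = the VIP itself
ip addr add 10.0.0.200/24 dev eth0

# Replace that local route so its source hint is the static address
# instead of the VIP
ip route replace table local \
    local 10.0.0.200 dev eth0 proto kernel scope host src 10.0.0.2
```

After this, the local table reads `local 10.0.0.200 dev eth0 proto kernel scope host src 10.0.0.2`, so local clients connecting to the VIP pick the static address as their source.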
Now services connect from .z to .y, and when .y goes away, you clean up the route. The source IP always exists, so a new (unicast) route is cloned from the main table, ARP happens on dev foo, the packet gets successfully forwarded to the new-VIP node, and its daemon sends an RST because it doesn't know the connection. This lets the service realize it needs to reconnect, instead of its state sitting around incomplete for some random length of time.
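The failover-away side of that can be sketched the same way (same illustrative addresses as before):

```shell
# Deleting the VIP address removes its kernel-generated routes,
# including the local route we rewrote
ip addr del 10.0.0.200/24 dev eth0

# The surviving connection's tuple (src = static IP, dst = old VIP)
# now resolves via the main table as an ordinary unicast route, out
# eth0 toward the node that took over the VIP
ip route get 10.0.0.200 from 10.0.0.2
```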
And BOOM. Problem solved :D Going to write a nice blog post on this I think, since it's becoming an extremely common pattern, particularly in cloud infrastructures, to have HA controllers with services that inter-communicate.