dnsdist buggy firstAvailable
To Whom It May Concern,
While testing dnsdist (version 1.2.0) on a new system configured with setServerPolicy(firstAvailable), we noticed what looks like a pretty big bug.

We've got a lot of nodes servicing anycast addresses, and we are converting them from having named listen on those addresses to having named listen only on the local addresses while dnsdist handles listening on the anycast addresses. In this case, we've got a group of 24 servers in geographically diverse areas configured as backends to dnsdist in an ordered config, serving DNS requests from localhost, local-cluster, and remote systems. My test, run on a local node, was a "dig +short @anycastaddr google.com" in a loop.

What we end up seeing is that when we kill named on the local system, queries jump to the last system in the ordered list. It does not matter which system is there, how high its latency is (we tried changing the configuration to put different systems last), or which order number is configured (we tested 100, 90, and now 9, just to make sure it wasn't an error in sorting the numbers).

If we set the last system in the list to administratively DOWN, the ordering works as expected. When that final server is put back in service, queries jump back to it and stay there until the local named instance is brought back up, at which point queries return to localhost. Some data below demonstrating this:
# Queries going to localhost, first host in the ordered list.
> showServers()
#    Name  Address          State   Qps  Qlim  Ord  Wt  Queries  Drops  Drate    Lat  Outstanding  Pools
0          127.0.0.1:53     up      1.0     0    0   1       68      0    0.0    0.6            0
1          10.3.5.13:53     up      0.0     0    5   1        0      0    0.0    0.0            0
2          10.3.5.14:53     up      0.0     0    5   1        0      0    0.0    0.0            0
3          10.6.3.1:53      up      0.0     0    9   1        0      0    0.0    0.0            0
4          10.6.3.2:53      up      0.0     0    9   1        0      0    0.0    0.0            0
5          10.6.3.3:53      up      0.0     0    9   1        0      0    0.0    0.0            0
6          10.6.3.65:53     up      0.0     0    9   1        0      0    0.0    0.0            0
7          10.6.3.66:53     up      0.0     0    9   1        0      0    0.0    0.0            0
8          10.6.3.67:53     up      0.0     0    9   1        0      0    0.0    0.0            0
9          10.3.8.27:53     up      0.0     0    9   1        0      0    0.0    0.0            0
10         10.3.8.47:53     up      0.0     0    9   1        0      0    0.0    0.0            0
11         10.2.7.15:53     down    0.0     0    9   1        0      0    0.0    0.0            0
12         10.2.7.16:53     down    0.0     0    9   1        0      0    0.0    0.0            0
13         10.2.7.17:53     down    0.0     0    9   1        0      0    0.0    0.0            0
14         10.2.7.18:53     down    0.0     0    9   1        0      0    0.0    0.0            0
15         10.2.7.19:53     down    0.0     0    9   1        0      0    0.0    0.0            0
16         10.2.7.20:53     down    0.0     0    9   1        0      0    0.0    0.0            0
17         10.8.3.2:53      up      0.0     0    9   1        0      0    0.0    0.0            0
18         10.8.3.65:53     up      0.0     0    9   1        0      0    0.0    0.0            0
19         10.8.3.66:53     up      0.0     0    9   1        0      0    0.0    0.0            0
20         10.4.3.1:53      up      0.0     0    9   1        0      0    0.0    0.0            0
21         10.4.3.2:53      up      0.0     0    9   1        0      0    0.0    0.0            0
22         10.4.3.66:53     up      0.0     0    9   1        0      0    0.0    0.0            0
23         10.4.3.65:53     up      0.0     0    9   1        0      0    0.0    0.0            0
24         10.8.3.1:53      up      0.0     0    9   1        0      0    0.0    0.0            0
All                                 0.0                      68      0
# Dropping local named instance
~]# service named stop ; dnsdist -c
Redirecting to /bin/systemctl stop named.service
> showServers()
#    Name  Address          State   Qps  Qlim  Ord  Wt  Queries  Drops  Drate    Lat  Outstanding  Pools
0          127.0.0.1:53     up      1.1     0    0   1      106      0    0.0    0.5            1
1          10.3.5.13:53     up      0.0     0    5   1        0      0    0.0    0.0            0
2          10.3.5.14:53     up      0.0     0    5   1        0      0    0.0    0.0            0
3          10.6.3.1:53      up      0.0     0    9   1        0      0    0.0    0.0            0
<snip>
22         10.4.3.66:53     up      0.0     0    9   1        0      0    0.0    0.0            0
23         10.4.3.65:53     up      0.0     0    9   1        0      0    0.0    0.0            0
24         10.8.3.1:53      up      0.0     0    9   1        0      0    0.0    0.0            0
All                                 1.0                     106      0
# dnsdist marks localhost and this system's local IP as down, taking both out of rotation.
# NOTE: queries are now diverted to the last node in the list (#24), which has a higher order value than node #1.
# Node #1 is still up and ready to take requests, and it has lower latency since it's only one hop away. Yet we're
# crossing an ocean here for resolution.
> showServers()
#    Name  Address          State   Qps  Qlim  Ord  Wt  Queries  Drops  Drate    Lat  Outstanding  Pools
0          127.0.0.1:53     down    0.0     0    0   1      107      2    0.0    0.5            0
1          10.3.5.13:53     up      0.0     0    5   1        0      0    0.0    0.0            0
2          10.3.5.14:53     down    0.0     0    5   1        0      0    0.0    0.0            0
3          10.6.3.1:53      up      0.0     0    9   1        0      0    0.0    0.0            0
<snip>
22         10.4.3.66:53     up      0.0     0    9   1        0      0    0.0    0.0            0
23         10.4.3.65:53     up      0.0     0    9   1        0      0    0.0    0.0            0
24         10.8.3.1:53      up      0.8     0    9   1       20      0    0.0   24.7            0
All                                 0.0                     127      2
# Forcing down the last server in the list
> getServer(24):setDown()
> showServers()
#    Name  Address          State   Qps  Qlim  Ord  Wt  Queries  Drops  Drate    Lat  Outstanding  Pools
0          127.0.0.1:53     down    0.0     0    0   1      107      2    0.0    0.5            0
1          10.3.5.13:53     up      0.0     0    5   1        2      0    0.0    0.0            0
2          10.3.5.14:53     down    0.0     0    5   1        0      0    0.0    0.0            0
3          10.6.3.1:53      up      0.0     0    9   1        0      0    0.0    0.0            0
<snip>
22         10.4.3.66:53     up      0.0     0    9   1        0      0    0.0    0.0            0
23         10.4.3.65:53     up      0.0     0    9   1        0      0    0.0    0.0            0
24         10.8.3.1:53      DOWN    0.8     0    9   1       34      0    0.0   39.8            0
All                                 0.0                     143      2
# Traffic shifts to the next lowest ordered system (#1), as it should.
> showServers()
#    Name  Address          State   Qps  Qlim  Ord  Wt  Queries  Drops  Drate    Lat  Outstanding  Pools
0          127.0.0.1:53     down    0.0     0    0   1      107      2    0.0    0.5            0
1          10.3.5.13:53     up      1.0     0    5   1       71      0    0.0    0.6            0
2          10.3.5.14:53     down    0.0     0    5   1        0      0    0.0    0.0            0
3          10.6.3.1:53      up      0.0     0    9   1        0      0    0.0    0.0            0
<snip>
22         10.4.3.66:53     up      0.0     0    9   1        0      0    0.0    0.0            0
23         10.4.3.65:53     up      0.0     0    9   1        0      0    0.0    0.0            0
24         10.8.3.1:53      DOWN    0.0     0    9   1       34      0    0.0   39.8            0
All                                 0.0                     212      2
# Putting last system in list (#24) back in active state, and dnsdist starts sending traffic to it again?!
> getServer(24):setAuto()
> showServers()
#    Name  Address          State   Qps  Qlim  Ord  Wt  Queries  Drops  Drate    Lat  Outstanding  Pools
0          127.0.0.1:53     down    0.0     0    0   1      107      2    0.0    0.5            0
1          10.3.5.13:53     up      1.1     0    5   1       86      0    0.0    0.5            0
2          10.3.5.14:53     down    0.0     0    5   1        0      0    0.0    0.0            0
3          10.6.3.1:53      up      0.0     0    9   1        0      0    0.0    0.0            0
<snip>
22         10.4.3.66:53     up      0.0     0    9   1        0      0    0.0    0.0            0
23         10.4.3.65:53     up      0.0     0    9   1        0      0    0.0    0.0            0
24         10.8.3.1:53      up      0.0     0    9   1       36      0    0.0   41.8            0
All                                 1.0                     229      2
# ...and traffic keeps getting sent to it, despite high latency and higher numerical order in the active systems list.
> showServers()
#    Name  Address          State   Qps  Qlim  Ord  Wt  Queries  Drops  Drate    Lat  Outstanding  Pools
0          127.0.0.1:53     down    0.0     0    0   1      107      2    0.0    0.5            0
1          10.3.5.13:53     up      0.0     0    5   1       86      0    0.0    0.5            0
2          10.3.5.14:53     down    0.0     0    5   1        0      0    0.0    0.0            0
3          10.6.3.1:53      up      0.0     0    9   1        0      0    0.0    0.0            0
<snip>
24         10.8.3.1:53      up      0.8     0    9   1      172      0    0.0  125.9            0
All                                 0.0                     365      2
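A quick way to confirm the jump is to diff the Queries column between the two most recent snapshots. The Python below is only a sketch: parse_queries and query_delta are made-up helper names, and the parsing assumes unnamed backends with no pools, so that every data row splits into exactly 12 whitespace-separated fields.

```python
from typing import Dict


def parse_queries(snapshot: str) -> Dict[str, int]:
    """Map backend address -> Queries counter from showServers() text.

    Assumption: the Name and Pools columns are empty, so a data row has
    exactly 12 fields and Queries is the field at index 7. The header,
    the "All" summary row, and <snip> lines do not match and are skipped.
    """
    counters = {}
    for line in snapshot.splitlines():
        fields = line.split()
        if len(fields) == 12 and fields[0].isdigit():
            counters[fields[1]] = int(fields[7])
    return counters


def query_delta(before: str, after: str) -> Dict[str, int]:
    """Return only the backends whose Queries counter grew."""
    old, new = parse_queries(before), parse_queries(after)
    return {
        addr: new[addr] - old.get(addr, 0)
        for addr in new
        if new[addr] > old.get(addr, 0)
    }


# Rows for backends #1 and #24, copied from the two snapshots above.
before = """\
1 10.3.5.13:53 up 1.1 0 5 1 86 0 0.0 0.5 0
24 10.8.3.1:53 up 0.0 0 9 1 36 0 0.0 41.8 0
"""
after = """\
1 10.3.5.13:53 up 0.0 0 5 1 86 0 0.0 0.5 0
24 10.8.3.1:53 up 0.8 0 9 1 172 0 0.0 125.9 0
"""
print(query_delta(before, after))  # prints {'10.8.3.1:53': 136}
```

Only 10.8.3.1:53 (#24) gained queries, 136 of them, which matches the growth of the All row from 229 to 365, while #1 stayed at 86.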
# If I add a dummy entry at the end of the list with a higher order value (99), things work as they're supposed to
# (although I'm not completely convinced it takes latency into consideration when it fails over among the equally
# weighted systems; it seems to jump towards the end of that list regardless of latency).
> showServers()
#    Name  Address          State   Qps  Qlim  Ord  Wt  Queries  Drops  Drate    Lat  Outstanding  Pools
0          127.0.0.1:53     down    0.0     0    0   1       35      1    0.0    1.4            0
1          10.3.5.13:53     down    0.0     0    5   1       31      2    0.0    0.6            0
2          10.3.5.14:53     down    0.0     0    5   1        0      0    0.0    0.0            0
3          10.6.3.1:53      DOWN    0.0     0    6   1       32      0    0.0    0.4            0
4          10.3.8.27:53     up      1.0     0    7   1      239      0    0.0    0.4            0
<snip>
21         10.4.3.2:53      up      0.0     0    9   1        0      0    0.0    0.0            0
22         10.4.3.66:53     up      0.0     0    9   1        0      0    0.0    0.0            0
23         10.4.3.65:53     up      0.0     0    9   1      431      0    0.0  148.6            0
24         10.8.3.1:53      up      0.0     0   19   1        0      0    0.0    0.0            0
25         127.0.0.255:53   down    0.0     0   99   1        0      0    0.0    0.0            0
All                                 0.0                     768      3
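For reference, the selection we expected from firstAvailable in every snapshot above can be written down as a short Python sketch: take the lowest-order backend that is up. This is only a model of our expectation, not dnsdist's actual implementation. Backend and expected_pick are made-up names, and the model assumes no QPS limit is in play (Qlim is 0 on every backend above).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Backend:
    """Hypothetical stand-in for one row of showServers() output."""
    address: str
    order: int
    up: bool


def expected_pick(backends: List[Backend]) -> Optional[Backend]:
    """Return the lowest-order backend that is up, or None if all are down.

    sorted() is stable, so backends sharing an order value keep their
    configured position. Assumption: no QPS limit is set (Qlim 0).
    """
    for backend in sorted(backends, key=lambda b: b.order):
        if backend.up:
            return backend
    return None


# Three rows from the third snapshot above, after named was stopped locally.
backends = [
    Backend("127.0.0.1:53", order=0, up=False),
    Backend("10.3.5.13:53", order=5, up=True),
    Backend("10.8.3.1:53", order=9, up=True),
]
pick = expected_pick(backends)
print(pick.address if pick else "no backend up")  # prints 10.3.5.13:53
```

Under this model the queries should have gone to 10.3.5.13:53 (#1, order 5), whereas the snapshot shows them landing on 10.8.3.1:53 (#24, order 9).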