@csfrancis
Created July 18, 2016 14:23

Debugging DNS

This is a story about how I debug things. We're big users of NGINX and Lua at Shopify. Lately, we've been looking at implementing support for adding and removing upstream peers in lua-upstream-nginx-module. Step 1 is to get the module's tests passing to make sure the world is in a sane state. I made a development build of NGINX and ran the tests, and that's when the trouble started.
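
For context, Test::Nginx-based suites like this one are usually driven with prove, with the nginx binary under test on PATH. Roughly like this (the paths here are illustrative, not my actual setup):

vagrant ~/lua-upstream-nginx-module $ export PATH=/opt/nginx-dev/sbin:$PATH
vagrant ~/lua-upstream-nginx-module $ prove -r t/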

A few of the tests were failing with similar errors:

#   Failed test 'TEST 4: sample in README - response_body - response is expected (req 0)'
#   at /usr/local/share/perl/5.18.2/Test/Nginx/Socket.pm line 1277.
# @@ -1,5 +1,5 @@
#  upstream foo.com:
#      addr = 127.0.0.1:80, weight = 4, fail_timeout = 53, name = 127.0.0.1, max_fails = 100
# -    addr = 106.184.1.99:81, weight = 1, fail_timeout = 10, name = agentzh.org:81, max_fails = 1
# +    addr = {106.184.1.99:81, 106.184.1.99:81}, weight = 1, fail_timeout = 10, name = agentzh.org:81, max_fails = 1
#  upstream bar:
#      addr = 127.0.0.2:80, weight = 1, fail_timeout = 10, name = 127.0.0.2, max_fails = 1
# Looks like you failed 1 test of 3.
t/sanity.t .. Dubious, test returned 1 (wstat 256, 0x100)
Failed 1/3 subtests

The difference here is that in the expected output, addr is a single value, but when I ran the test it came back as an array of two values (that are the same??). My first step was to run a single test that demonstrated the problem in isolation.

Scott's Debugging Rule #1: Simplify the test case

agentzh has started writing a great book on OpenResty, and he's completed a chapter on the test framework. OpenResty uses Perl for integration testing (disclaimer: I'm a total Perl noob), but I learned that you can isolate a single test by adding --- ONLY to the test spec (there's a sketch of this below). I ran the test case again, and the problem persisted. Here's the relevant upstream definition from the test:

    upstream foo.com {
        server 127.0.0.1 fail_timeout=53 weight=4 max_fails=100;
        server agentzh.org:81;
    }

Presumably, it's the agentzh.org upstream peer that is returning two addresses.
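
As an aside, isolating a block in the test spec is just a matter of adding an --- ONLY section to it. A minimal sketch of what that looks like (the section contents are elided; only the --- ONLY marker matters here):

=== TEST 4: sample in README
--- ONLY
--- http_config
    ... upstream definitions ...
--- config
    ... location under test ...
--- request
    ... request to send ...
--- response_body
    ... expected output ...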

At this point I decided to jump into the lua-upstream-nginx-module source to see how it was getting the address for an upstream peer. Nothing crazy: it simply renders whatever address(es) NGINX reports for the upstream peer. If there's a single result, the address is returned as a string; if there are multiple results, they come back in an array (a table, in Lua). So for some reason, I was seeing multiple results for the agentzh.org peer. I did a quick dig to confirm my sanity:

vagrant ~ $ dig agentzh.org

; <<>> DiG 9.9.5-3ubuntu0.8-Ubuntu <<>> agentzh.org
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 13140
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; MBZ: 0005 , udp: 4096
;; QUESTION SECTION:
;agentzh.org.                   IN      A

;; ANSWER SECTION:
agentzh.org.            5       IN      A       106.184.1.99

;; Query time: 90 msec
;; SERVER: 192.168.253.2#53(192.168.253.2)
;; WHEN: Mon Jul 18 13:08:53 UTC 2016
;; MSG SIZE  rcvd: 56

Only one record came back, and it's the same address I saw in the array.

At this point, I decided to dive into the NGINX source to see how it resolves upstream addresses. Upstream peers are parsed by NGINX as URLs. Digging deeper, name resolution is done by the ngx_inet_resolve_host function, which ultimately calls into libc's getaddrinfo. Now that I knew how NGINX was resolving hostnames, I followed debugging rule #1 and wrote a basic program to try to demonstrate the problem:

#include <stdio.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netdb.h>
#include <arpa/inet.h>
#include <string.h>

int main(int argc, char **argv) {
  struct addrinfo hints = { 0 },
                  *res, *rp;
  int i = 0;

  hints.ai_family = AF_UNSPEC;
  hints.ai_socktype = SOCK_STREAM;

  getaddrinfo("agentzh.org", NULL, &hints, &res);

  for (i = 0, rp = res; rp != NULL; rp = rp->ai_next) {
    printf("%d, %d, %d, %s, ", rp->ai_family, rp->ai_addrlen, rp->ai_flags, inet_ntoa(((struct sockaddr_in *) rp->ai_addr)->sin_addr));
    printf("%d, %d, %s, %d, %d\n", rp->ai_socktype, rp->ai_protocol, rp->ai_canonname, ((struct sockaddr_in *) rp->ai_addr)->sin_family, ((struct sockaddr_in *) rp->ai_addr)->sin_port);
    i++;
  }
  return 0;
}

Sure enough, running the example program yielded two results:

vagrant ~ $ gcc -g a.c; ./a.out
2, 16, 0, 106.184.1.99, 1, 6, (null), 2, 0
2, 16, 0, 106.184.1.99, 1, 6, (null), 2, 0

This was enough proof for me to eliminate NGINX as the root cause. I REALLY DID NOT WANT TO HAVE TO DEBUG LIBC. I find the eglibc code base very difficult to trace through. There is a TON of code, the functions are massive, and the formatting is terrible. All of my prior experiences walking through this code have been traumatic. Before diving in, I tried Googling things like "getaddrinfo Linux multiple results" and came up with nothing.

I also ran Wireshark (because for some reason tcpdump locks up my vagrant -- but that's something to debug on another day) and I observed two DNS queries being sent for agentzh.org - one for an A record and another for an AAAA record. The A record query returned one result, and the AAAA record query returned no results. My initial hypothesis was that maybe getaddrinfo was misinterpreting the result of the empty AAAA record query and somehow coercing it into another result.
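
One quick way to probe that hypothesis would be to restrict the lookup to IPv4, so that getaddrinfo never issues the AAAA query at all, and see whether the duplicate entry disappears. A minimal sketch along the lines of the program above:

#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netdb.h>
#include <arpa/inet.h>

/* Same lookup as the earlier program, but restricted to IPv4 so the AAAA
 * path never comes into play. If the empty-AAAA hypothesis were right,
 * this should print exactly one line for agentzh.org. */
int main(void) {
  struct addrinfo hints, *res, *rp;

  memset(&hints, 0, sizeof(hints));
  hints.ai_family = AF_INET;        /* IPv4 only, instead of AF_UNSPEC */
  hints.ai_socktype = SOCK_STREAM;

  if (getaddrinfo("agentzh.org", NULL, &hints, &res) != 0)
    return 1;

  for (rp = res; rp != NULL; rp = rp->ai_next)
    printf("%s\n", inet_ntoa(((struct sockaddr_in *) rp->ai_addr)->sin_addr));

  freeaddrinfo(res);
  return 0;
}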

Another thing I tried was changing my getaddrinfo call to use a different host. Instead of agentzh.org, I tried google.com. This returned a single result.

I put on my hazmat suit and ran apt-get source libc6.

Because I now had an example program, it was possible to run gdb and step through getaddrinfo to see where things go sideways. I started up gdb and ran b getaddrinfo to set a breakpoint on getaddrinfo. I ran my program and the breakpoint fired. Initially, there was no source displayed, but running dir ~/tmp/eglibc-2.19 added the glibc source to gdb's source search path.
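
Concretely, the session amounts to something like this (a sketch; the dir path is wherever apt-get source unpacked the eglibc tree, and on Ubuntu the libc6-dbg package may be needed for source-level stepping):

vagrant ~ $ gdb ./a.out
(gdb) b getaddrinfo
(gdb) run
(gdb) dir ~/tmp/eglibc-2.19
(gdb) list
(gdb) step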

The getaddrinfo function is a nightmare. It's implemented in the file sysdeps/posix/getaddrinfo.c, which is 2664 lines. The core functionality in my case lives in gaih_inet, which is just a touch over ONE THOUSAND LINES. The function is littered with #ifdefs and gotos, and for me it is very difficult to reason about. As far as I can tell there are no tests. It blows my mind that so many of our systems are built on top of this code. I'm sure this code has grown organically over years, possibly decades, and I don't fault the developers who maintain it. Refactoring getaddrinfo could be the yakshave to end all yakshaves.

Anyway, after spending some time analyzing and stepping through getaddrinfo, I discovered that the real work is done in a function called gethostbyname4_r. I'd never heard of this function before, and it's undocumented. I did some Googling to see if I could find any more information, and found a helpful article here: http://log.or.cz/?p=77 that gave me a little more context. I also found that gethostbyname4_r was an indirect cause of CVE-2015-7547. This came as no real shock, and I wouldn't be surprised if there are further exploits lingering in this code. Unfortunately, all of this biased my investigation and made me believe that the problem MUST be in glibc.
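
For the curious, the hook that getaddrinfo dispatches through looks roughly like this (paraphrased from the eglibc 2.19 source; the referenced types are glibc internals, so treat the exact signature as a sketch). The key point is that a single call resolves both A and AAAA records for a name and hands back a linked list of address tuples, which lines up with the paired A/AAAA queries Wireshark showed earlier.

/* Paraphrased from eglibc 2.19, sysdeps/posix/getaddrinfo.c: the per-backend
 * lookup that gaih_inet resolves by name (the DNS backend is
 * _nss_dns_gethostbyname4_r). One call handles A and AAAA together and
 * returns the results via *pat as a linked list of gaih_addrtuple nodes. */
typedef enum nss_status (*nss_gethostbyname4_r) (const char *name,
                                                 struct gaih_addrtuple **pat,
                                                 char *buffer, size_t buflen,
                                                 int *errnop, int *herrnop,
                                                 int32_t *ttlp);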

Scott's Debugging Rule #2: Detach biases - know when to abandon a hypothesis

I spent a bunch more time trying to understand how gethostbyname4_r works, which was pretty much fruitless. There was simply too much context for me to absorb.

At some point, I had an AHA moment. The thing about debugging is you're bombarded with a TON of information. The most gratifying part is when you finally establish the link between disparate pieces of information that explains what the fuck is going wrong. For me, sometimes this happens when I just stew on a problem for a bit. Sometimes it's after running a command for the 100th time and finally noticing something that had been happening all along.

In this case, I noticed a difference between running dig in my vagrant environment and on my OSX host: specifically, the DNS server being queried. On my host, the query was sent to my home router's DNS server (this is reported by dig):

;; SERVER: 192.168.73.1#53(192.168.73.1)

However, my vagrant instance is talking to the VMWare Fusion DNS:

;; SERVER: 192.168.253.2#53(192.168.253.2)

And herein lies the issue. When performing the AAAA query on my host, I received no results:

~ $ dig agentzh.org aaaa

; <<>> DiG 9.8.3-P1 <<>> agentzh.org aaaa
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 23936
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;agentzh.org.                   IN      AAAA

;; Query time: 4 msec
;; SERVER: 192.168.73.1#53(192.168.73.1)
;; WHEN: Sun Jul 17 22:33:44 2016
;; MSG SIZE  rcvd: 29

When performing the query in my vagrant, I received a CNAME response:

vagrant ~ $ dig agentzh.org aaaa

; <<>> DiG 9.9.5-3ubuntu0.8-Ubuntu <<>> agentzh.org aaaa
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 7358
;; flags: qr aa; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 0
;; WARNING: Message has 16 extra bytes at end

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; MBZ: 0005 , udp: 4096
;; QUESTION SECTION:
;agentzh.org.                   IN      AAAA

;; ANSWER SECTION:
agentzh.org.            5       IN      CNAME   agentzh.org.

;; Query time: 82 msec
;; SERVER: 192.168.253.2#53(192.168.253.2)
;; WHEN: Mon Jul 18 02:28:03 UTC 2016
;; MSG SIZE  rcvd: 81

So for whatever reason, my VMWare DNS server is returning a CNAME for the AAAA query, and one that points right back at agentzh.org. Why? I don't know yet. At that point I had to ragequit; it was 11PM (which to me is basically like 4AM because I have a toddler and an infant) and I was about to pass out.

So in summary:

  • I probably could have saved a ton of time by comparing the results of dig between my OSX host and vagrant.
  • This is especially annoying, because I did compare the sample program between OSX and vagrant (it worked correctly on OSX). I assumed that this must have been a bug in Linux's glibc (wrong).
  • The packet captures between OSX and vagrant showed similar outbound DNS queries. What the packet capture DID NOT reveal was the query sent from vagrant to the VMWare DNS server (but it did show the outbound query from the VMWare DNS server).
  • I still don't know why the VMWare DNS server is returning a CNAME for the agentzh.org AAAA query when the upstream returned an empty result.