@lusis
Created November 2, 2012 04:01
Follow-up to "what happened"

The story behind the graph

I promised this last week, but between work, family and sickness I was unable to actually DO it until tonight.

The Environment

Before I explain what happened, I want to explain a little bit about the environment where this happened. I'm sure many people will want to criticize thingX or thingY about the environment, but that's not the point. You work with the cards you're dealt; pragmatism and reality rule the day.

Production

The enStratus production environment is an interesting one. Everything we have runs on top of CloudStack with VMware as the hypervisor. The company that hosts this environment partially manages the stack. We do not own any of our own hardware, and we have only limited management capabilities on the CloudStack and VMware side. Everything else - servers, network, SAN - is managed outside of us. Again, this is relevant to this discussion only as it relates to the outage depicted in the graph and the scope of the graph itself.

DNS Servers

The graph itself is actually the number of tx/rx octets to our DNS servers. These are running dnsmasq. Our instances on CloudStack were provisioned with Knife and all use DHCP. This is the default CloudStack way. dnsmasq is simple to configure, supports ".d" directories and can read a simple text file of address/hostname pairs for records. This is all managed by Chef so it's largely self-maintaining.
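
For illustration, here's a minimal sketch of the kind of drop-in config I'm describing - the file names and records below are made up for the example, and the real thing is templated by Chef:

    # /etc/dnsmasq.d/internal.conf (sketch - paths and records are examples,
    # not our actual Chef-templated config)
    domain-needed                            # don't forward unqualified names upstream
    bogus-priv                               # don't forward reverse lookups for private ranges
    addn-hosts=/etc/dnsmasq.internal-hosts   # extra hosts-format file of records

    # /etc/dnsmasq.internal-hosts - plain address/hostname pairs (hypothetical records)
    10.1.1.20   db1.internal.example.com
    10.1.1.21   riak1.internal.example.com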

But since every node is assigned an IP address via DHCP (though persistent until the instance is destroyed), we had to add the following line to dhclient.conf on the nodes:

prepend domain-name-servers 10.1.1.13, 10.1.1.17;
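
With that in place, a lease renewal leaves resolv.conf looking roughly like this (the provider address here is a placeholder):

    # /etc/resolv.conf as written by dhclient after a renewal (sketch)
    nameserver 10.1.1.13      # our dnsmasq servers, prepended
    nameserver 10.1.1.17
    nameserver 203.0.113.53   # provider DNS pushed down by CloudStack's DHCP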

At this point, some of you have probably already figured out what's "wrong" about the graph.

The outage

Being that we're on CloudStack, all of our instances live on the SAN I mentioned above. Most people were correct that the "gap" in metrics was the result of an outage. What happened in our case was that the SAN had a nasty controller bug and was failing over between its HBAs (at least as I've been told - again, I don't have any control over the SAN). An HBA failure SHOULD be transparent with multipathing, but because of the amount of I/O we do and the various abstractions (CS, VMware), controller failures are immediately visible to us regardless of downstream multipathing. This throws our root volumes into read-only mode and essentially takes us offline. It's not a good thing.

Last week we had this happen three times before the SAN was fixed, and we were already handling additional volume because the majority of our customers run on AWS, which was having its own problems that we were trying to manage on their behalf.

Annotated graphs and fragments

Historical usage

Ever since I stood up our DNS servers, the graphs have looked like so:

[graph: history]

Obviously, had I bothered to actually look at this BEFORE the outage, I would have seen the imbalance. Then again, it wasn't a big deal since we had no problems with DNS - latency or otherwise. Things had been running fine. We have checks in place to verify that both servers are responding in a timely fashion to a handful of critical DNS records.
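
Those checks are roughly along these lines (a sketch using dig; the record name, addresses and thresholds are placeholders, not our actual monitoring config):

    #!/bin/sh
    # Query a critical record against each resolver with a short timeout and a
    # single try; flag any server that doesn't answer in time.
    for ns in 10.1.1.13 10.1.1.17; do
      if ! dig @"$ns" +time=2 +tries=1 +short db1.internal.example.com A >/dev/null; then
        echo "CRITICAL: $ns did not answer for db1.internal.example.com"
        exit 2
      fi
    done
    echo "OK: both DNS servers answering"
    exit 0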

The outage

Then we had the 'outage'. As I said before, the gap was indeed where the SAN outage was. What happened after was this shitshow that didn't become apparent until the next day:

[graph: outage-shitshow]

Remember, I said that we modify the dhclient config on our instances to prepend our own name servers ahead of the ones CloudStack sends down (which are our provider's public DNS servers). Well, that change was ALSO set on our DNS servers. The DNS instances hadn't been rebooted since they were brought online. The resolv.conf on each DNS server was correct until the reboot, which overwrote it.
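
Roughly, and with placeholder upstream addresses, the before/after on a DNS server itself looked like this:

    # resolv.conf on a DNS server BEFORE the reboot (sketch) - forwarding
    # straight upstream, which is what dnsmasq should be doing:
    nameserver 203.0.113.53
    nameserver 203.0.113.54

    # resolv.conf AFTER the reboot, rewritten by dhclient with the same
    # prepend we give ordinary nodes - each resolver now points at itself
    # and at its peer:
    nameserver 10.1.1.13
    nameserver 10.1.1.17
    nameserver 203.0.113.53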

Basically, we DDoS'd ourselves. It wasn't immediately apparent until we started polling cloud resources again. The part of enStratus that polls cloud resources determines the polling interval based on several factors - it's not a straight "every N minutes". Between that, the fact that we had restarted everything, and AWS starting to have issues, there was a delay before all the DNS queries started flooding in. These weren't just DNS queries for AWS instance hostname lookups but also our internal ones - our database servers, Riak nodes and other internal services.

That's when I started looking at this graph. It was immediately obvious that something was not right based on our historical traffic alone. I realized at that point that we had not really been utilizing the second DNS server, but suddenly we were, AND they were both seeing weird spikes.

So I try various tunings of dnsmasq. I turn on query logging. I poke around. I check resolv.conf on all of our instances before I finally realize that there's a loop between the two DNS servers. The increased DNS volume on one server was causing it to time out. Queries were failing over to the second node, which was set to query the first node (because dnsmasq forwards everything outside its own domains to the resolvers in resolv.conf) - and the first node was already having issues.
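
One way to break that kind of loop (a sketch of the idea, not necessarily the exact change I made; upstream addresses are placeholders) is to stop dnsmasq on the resolver hosts from trusting resolv.conf at all:

    # /etc/dnsmasq.d/upstream.conf (sketch)
    no-resolv                # ignore resolv.conf entirely
    server=203.0.113.53      # forward non-local queries straight upstream
    server=203.0.113.54      # so the two dnsmasq instances never query each other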

Once I realized that was the problem and fixed that, things started to stabilize:

[graph: stability]

The point where the traffic on the two DNS servers starts to converge is where I hit some of our high-volume query nodes and reordered the servers in resolv.conf. But that's not all.

Now you might be looking at that graph and wondering what caused that DRAMATIC drop off in traffic.

When I turned on query logging as part of the troubleshooting, I noticed that we had a SERIOUSLY high level of queries for IPv6 addresses, and these were ALL for AWS instance DNS names. I added the following to our JAVA_OPTS in our init scripts and bounced all the services:

-Djava.net.preferIPv4Stack=true
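
In the init scripts that amounts to something like this (variable name and file are generic, not our actual scripts):

    # Fragment of a service's defaults/init file (sketch). preferIPv4Stack
    # keeps the JVM on the IPv4 stack, so it stops doing AAAA lookups for
    # every EC2 hostname it touches.
    JAVA_OPTS="$JAVA_OPTS -Djava.net.preferIPv4Stack=true"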

The impact of disabling IPv6 in the JVM is more evident when you look at the sum of both servers' traffic before and after:

[graph: big picture]

The biggest takeaway for me in this was that, in an effort to have a sane base configuration for our systems, I actually HURT myself. Had I been using something other than dnsmasq, it's likely that the shitshow wouldn't have happened (but that's not dnsmasq's fault).

Where I fucked up

There are a lot of fuckups in this story. The only ones I can address are the ones I have control over.

I never bothered to look at our DNS server traffic, or I would have seen the imbalance. I didn't fully think through the impact of setting the dhclient options in conjunction with using dnsmasq. I didn't have query logging enabled (and didn't have logstash handling the log), so I scrambled to enable that in the middle of the shitshow. Basically, I didn't have a full picture of what was "normal".
