
@Spindel
Last active August 29, 2015 14:16
IPv6 may be broken on CentOS 7 and RHEL 7

So. Today I finally figured out the issue that's been plaguing our off-site backup server.

It turned out to be an already fixed bug. More about that later.

The Problem

SSH sessions to the machine would work, then break: packets simply stopped arriving. The network was IPv6-only, and the machine wasn't local to us.
Other than that, the IPv6 connectivity came through a Hurricane Electric 6in4 tunnel (via Tunnelbroker.net).

Connections to another machine, in the same place, worked. Connections from this other machine to our backup machine worked.

The problem was intermittent, coming and going: sometimes stable for hours, sometimes dropping out for long periods.

So. Debugging this was slow. Various packet loss faults made me suspect the tunnel, but that seemed inconsistent with the fact that the other machine worked.

The next suspect was routing. A difference in the routing tables? Something involving IPv6 Router Advertisements vs. DHCPv6?
Both machines were similar: one Fedora 21 and one CentOS 7. Same networking stack, different kernels, but not that far from each other.

I spent a lot of time comparing firewalls, routing tables, and watching logs. Not much came from that. Nor from rebooting and reconfiguring the firewall / tunnel host.

Clarity

In the end, I was looking at the sysctl parameters for IPv6 (sysctl -a | grep ipv6) and saw something strange. One of the network interfaces had hop_limit set to something different.

A quick watch head /proc/sys/net/ipv6/conf/*/hop_limit later, and I had something more to go on: the value was changing, bouncing between 0 and 64.
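The polling that watch does can be sketched in a few lines. This is a hedged sketch, not what I actually ran: the helper names (read_hop_limits, changed) are my own, and the /proc base path is parameterised so the logic can be exercised anywhere.

```python
import glob
import os


def read_hop_limits(base="/proc/sys/net/ipv6/conf"):
    """Return {interface: hop_limit} for every interface under base."""
    limits = {}
    for path in glob.glob(os.path.join(base, "*", "hop_limit")):
        iface = os.path.basename(os.path.dirname(path))
        with open(path) as f:
            limits[iface] = int(f.read().strip())
    return limits


def changed(prev, cur):
    """List (interface, old, new) for every value that moved between samples."""
    return [(i, prev[i], cur[i]) for i in cur if i in prev and prev[i] != cur[i]]
```

Calling read_hop_limits() in a loop and feeding consecutive samples to changed() surfaces exactly the symptom described here: one interface flapping between 0 and 64.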

After that, it was time to figure out why. Up with Wireshark and tcpdump, and log for a bit.

The bug

And while that was running, I could google for the behaviour. And find a patch for NetworkManager from September last year, along with the bugs belonging to it.

hop_limit is, for those not into network design, a count of how many "hops" a packet may take across the internet before it's discarded. Setting this to zero means it's not allowed to go anywhere.

Except that zero was supposed to be a special value, meaning "unspecified: keep your own default".

But that only explained the change, not the method of it. Or the why.

The race condition

It turns out that radvd sends out two Router Advertisements: one for the link network (fe80::) and one for the global prefix.

The link one advertises Cur hop limit: 0, meaning "default". The global one advertises Cur hop limit: 64.

The order of these two packets turned out not to be quite stable. Sometimes the link one comes first; NetworkManager sees it and sets hop_limit to 0. Then the 64 one arrives, and NetworkManager sets it back to 64.

Since hop_limit is a property of the Ethernet interface, not of the IP address, this caused a bouncing setting. But after a while, the link-local advertisement came last, and suddenly NetworkManager set it from 64 to 0 and left it there. No packets could then reach hosts on the internet (or return from there).
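The race can be modelled in a few lines. This is a sketch under assumptions, not NetworkManager's actual code: two advertisements carrying Cur Hop Limit 0 and 64 arrive in arbitrary order, a last-writer-wins handler applies whichever came last, and the fixed behaviour treats 0 as "no statement from this router".

```python
KERNEL_DEFAULT = 64  # typical default for net.ipv6.conf.*.hop_limit


def apply_ras_buggy(ras, current=KERNEL_DEFAULT):
    """Buggy handler: blindly writes every advertised Cur Hop Limit."""
    for hop_limit in ras:
        current = hop_limit
    return current


def apply_ras_fixed(ras, current=KERNEL_DEFAULT):
    """Fixed handler: Cur Hop Limit 0 means "unspecified", so don't apply it."""
    for hop_limit in ras:
        if hop_limit != 0:
            current = hop_limit
    return current
```

With the buggy handler, apply_ras_buggy([0, 64]) ends at 64 and everything works, but apply_ras_buggy([64, 0]) ends at 0 and the interface goes dead. The fixed handler lands on 64 in either order.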

These two behaviours, a network ordering race and a traditional bug in NetworkManager, together made for one of the most annoying-to-debug bugs I've seen this year.

Fortunately it's only March so far, so there will be more bugs to appear.
