
Who doesn't know this situation: sometimes something weird is going on in your (ECS) cluster, but everything is so vague that you don't know where to start. This is the story of a bug we hit in production and how we tracked down its root cause.

The problem

Imagine an ECS cluster running multiple tasks, each with an NFS-backed (EFS, in this case) persistent root volume. Launching new tasks works just fine, but as soon as you do a cluster upgrade, entire hosts seem unable to start tasks, or they start a few and then break, while other hosts work perfectly fine. Scaling the cluster up or down becomes a scary operation that can no longer be done without manual intervention. That's an unacceptable situation, but where to start?

Want to know how we handle the NFS mounts? My colleague Philipp has a blog post on that.

Symptoms

Only some hosts are affected

During the upgrades we noticed that some hosts had no issues launching tasks at all. So what was the common pattern: number of tasks, availability zone, something else? Since many of the affected instances were in eu-west-1c, the AZ theory was the first one to verify. Things that could go wrong with EFS in a single AZ are:

  • EFS mount target not there
  • Wrong security group attached to the mount target
  • Per-subnet routing in the VPC gone wrong
  • Routing / subnet config on the instances
  • iptables on the instance misconfigured
  • Docker network interfaces catching the NFS traffic

I checked all of them and everything was configured correctly, yet when I logged on to an affected host and tried to mount one of the EFS volumes manually, it just hung forever.
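For reference, the checks boiled down to commands like these. This is only a sketch: the file system and mount target IDs below are placeholders, not our actual setup.

```bash
# Is there a mount target in the AZ, and which security group is attached to it?
aws efs describe-mount-targets --file-system-id fs-12345678
aws efs describe-mount-target-security-groups --mount-target-id fsmt-12345678

# Routing and firewall state on the instance itself
ip route show
iptables -L -n -v

# Can the instance reach the NFS port (2049) of the mount target at all?
nc -zv fs-12345678.efs.eu-west-1.amazonaws.com 2049
```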

Could it be something else network-related that I hadn't thought of? UDP vs. TCP in NFS?

It's tcpdump time!

`tcpdump -v port not 443 -i any -s 65535 -w dump.pcap` excludes the traffic from the SSM agents running on the ECS instance and should really catch all traffic on all interfaces. So: start tcpdump, run the mount command manually, stop tcpdump, push the pcap file to S3, fetch it and analyse it locally with Wireshark. Digging through the whole dump I was unable to find a single packet going to the EFS mount target. Did I do something wrong? But even on the second try there was not a single NFS-related packet.
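In practice that workflow looks roughly like this; the bucket name is a placeholder, and the mount runs in a second shell since it never returns.

```bash
# 1. Start the capture, excluding the SSM agent traffic on port 443
tcpdump -v port not 443 -i any -s 65535 -w dump.pcap

# 2. In a second shell, trigger the mount manually (this is the call that hangs)
mount -t nfs4 "$EFSID.efs.eu-west-1.amazonaws.com:/" "/opt/debugmount/$EFSID"

# 3. Stop tcpdump with Ctrl-C, copy the pcap to S3 and open it locally in Wireshark
aws s3 cp dump.pcap s3://my-debug-bucket/dump.pcap
```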

The only possible conclusion here: the system does not even try to mount, but why?

Stracing things down

`ps` shows many other hanging NFS mounts triggered by ECS / Docker. But why are they hanging?

It's strace time! Let's see what is going on: `strace -f mount -t nfs4 -o nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,noresvport,tcp,intr $EFSID.efs.eu-west-1.amazonaws.com:/ /opt/debugmount/$EFSID &`

We see that everything hangs on the mount syscall, and that for well over 10 minutes, which should have exceeded every timeout. (With these options the last timeout should actually have fired after about two minutes: timeo=600 is measured in tenths of a second, so 60 seconds per attempt times retrans=2.)

As noticed before, there were many other hanging NFS mounts, so let's kill them off and try again. Now the mount runs perfectly fine! Are we hitting an AWS service limit here? Is there some race condition where the mounts block each other? According to the AWS docs there is no such service limit, so let's stick with the deadlock theory and try to build a test case.

https://gist.github.com/b327c580bff23a9a1a84e4996823c567
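The embedded gist has the actual test case; in essence, such a sequential test could look like the sketch below (file system ID, target paths and loop count are illustrative).

```bash
#!/bin/bash
# Mount and unmount the EFS volume repeatedly, one mount after the other.
EFSID=fs-12345678   # placeholder file system ID

for i in $(seq 1 20); do
  mkdir -p "/opt/debugmount/test-$i"
  mount -t nfs4 -o nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,noresvport,tcp,intr \
    "$EFSID.efs.eu-west-1.amazonaws.com:/" "/opt/debugmount/test-$i"
  umount "/opt/debugmount/test-$i"
done
```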

This works perfectly fine, so what else is Docker doing here? Is the test case not accurate? Let's have a look at what happens inside Docker: https://github.com/docker/docker-ce/blob/master/components/engine/volume/local/local_unix.go#L85 https://github.com/moby/moby/blob/master/pkg/mount/mount.go#L70

All of this looks quite sane. There is some fiddling with the addr option, but we also tested mounting by IP and had the same issue.

Stepping back to our initial problem: the issue only occurs on a cluster update, where all tasks are re-scheduled to other instances. What if that happens not sequentially but in parallel? Our simple test just runs in a loop, so one mount command completes before the next one starts.

https://gist.github.com/bc2327d8a7bbbe20063cd6de7399f5c2

Now we are backgrounding the mount command, so we no longer wait for it to complete but immediately start the next one. And here we go: after a few iterations things block and mounting stops working. We now have a test case we can use to verify whether we fixed the issue.
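A sketch of that parallelized variant: the only difference from the sequential loop above is the trailing & that backgrounds each mount (IDs and paths are again illustrative).

```bash
#!/bin/bash
# Same loop, but every mount is backgrounded so they all run in parallel.
EFSID=fs-12345678   # placeholder file system ID

for i in $(seq 1 20); do
  mkdir -p "/opt/debugmount/test-$i"
  mount -t nfs4 -o nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,noresvport,tcp,intr \
    "$EFSID.efs.eu-west-1.amazonaws.com:/" "/opt/debugmount/test-$i" &
done

# With enough parallel mounts some of the mount syscalls hang, reproducing the deadlock.
wait
```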

And now?

We now know that the issue has to be in the kernel, and we have a way to reproduce it, but I have no experience in tracing what is actually going on inside the kernel here. So Google has to help out, and searching for "nfs mount race condition deadlock rhel" turns up a bug report.

The issue described in the report looks very similar to what we experience, and one of the comments mentions that this behaviour was introduced with NFS 4.1.

So let's change our snippet to use NFS 4.0 and see if everything now works as expected:

https://gist.github.com/bf9f58e0dd8cb6a86024a754d9571889
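The essential change is pinning the NFS version in the mount options, along these lines (same illustrative options as in the sketches above, only nfsvers differs):

```bash
# Identical mount options, only nfsvers is pinned to 4.0 instead of 4.1
mount -t nfs4 -o nfsvers=4.0,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,noresvport,tcp,intr \
  "$EFSID.efs.eu-west-1.amazonaws.com:/" "/opt/debugmount/$EFSID"
```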

Awesome. Our quite complex-to-debug issue was easy to solve. We changed the mount options in our ECS cluster and everything is running smoothly now.
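How exactly the mount options are set depends on how the volumes are wired up (see Philipp's post mentioned above). As one hypothetical example, for an NFS volume defined through the Docker local volume driver the change would look like this (volume name and file system ID are made up):

```bash
# Hypothetical example: an NFS-backed Docker local volume pinned to NFS 4.0
docker volume create \
  --driver local \
  --opt type=nfs \
  --opt o=addr=fs-12345678.efs.eu-west-1.amazonaws.com,nfsvers=4.0,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,noresvport \
  --opt device=:/ \
  my-efs-volume
```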

At the time of writing, the kernel in the current AWS ECS-optimized AMI does not yet seem to contain that bug fix.
