@williammartin
Created June 7, 2017 10:54
Networking Flakiness in GATS

Description

When running nested GATS, we occasionally get containers without network connectivity.

It's unclear whether this is new flakiness. It's unclear whether this affects BOSH deployments.

The tests that seem to hit this the most are those in networking_test.go around resolving DNS, but DNS resolution is failing due to a lack of connectivity (unable to ping 8.8.8.8) rather than a problem with resolution.

Reproduction

I was able to consistently reproduce this by running GATS with a ping -c 3 8.8.8.8 || sleep 10000 in a JustBeforeEach in networking_test.go. When the ping fails, the sleep keeps the container around so that it can be inspected.

Seems possible to reproduce with just a single container creation (through GATS) that is then unable to ping 8.8.8.8.

It's strange that even though all tests in this file are performing this ping, there still seems to be a correlation with the DNS tests failing the most. This might just be a bias in what I'm seeing, or it might be that they often coincide with some other event that makes occurrence of this flake more likely (suites are randomized but specs are not).

These tests are run in containers in Concourse, so the containers under test are nested inside another container.

Healthy Container

Interfaces on host

806: whc6n3t49uas-1@if807 is the container side veth for the Concourse test container (the -1 suffix, as opposed to -0, and the absence of a bridge indicate this).

4: wheb9g66aaof-0@if3 is the host side veth for the container we are testing network connectivity for.

2: wbrdg-0afe0000 is the bridge for the veth for the container we are testing network connectivity for.

root@267bc95a-571c-4178-78cd-399da1e5765a:/tmp/build/e55deab7# ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: wbrdg-0afe0000: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1432 qdisc noqueue state UP mode DEFAULT group default
    link/ether 26:e2:4e:b0:5f:b9 brd ff:ff:ff:ff:ff:ff
4: wheb9g66aaof-0@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1432 qdisc noqueue master wbrdg-0afe0000 state UP mode DEFAULT group default qlen 1
    link/ether 9a:ff:39:a9:66:6a brd ff:ff:ff:ff:ff:ff
806: whc6n3t49uas-1@if807: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1432 qdisc noqueue state UP mode DEFAULT group default qlen 1
    link/ether 76:e5:87:16:ad:fa brd ff:ff:ff:ff:ff:ff

Interfaces in container

root@267bc95a-571c-4178-78cd-399da1e5765a:/tmp/build/e55deab7# /tmp/build/e55deab7/gr-release-develop/bin/runc exec 29a7fe24-7062-47c2-465f-b43d9515793e /bin/ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
3: wheb9g66aaof-1@if4: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 1432 qdisc noqueue qlen 1
    link/ether ae:44:d5:ae:51:8f brd ff:ff:ff:ff:ff:ff

Pinging 8.8.8.8 from within container

root@267bc95a-571c-4178-78cd-399da1e5765a:/tmp/build/e55deab7# /tmp/build/e55deab7/gr-release-develop/bin/runc exec 29a7fe24-7062-47c2-465f-b43d9515793e /bin/ping -c 3 8.8.8.8
PING 8.8.8.8 (8.8.8.8): 56 data bytes
64 bytes from 8.8.8.8: seq=0 ttl=51 time=0.421 ms
64 bytes from 8.8.8.8: seq=1 ttl=51 time=0.417 ms
64 bytes from 8.8.8.8: seq=2 ttl=51 time=0.394 ms

--- 8.8.8.8 ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max = 0.394/0.410/0.421 ms

tcpdump of host side veth during ping from within container

root@267bc95a-571c-4178-78cd-399da1e5765a:/tmp/build/e55deab7# tcpdump -i wheb9g66aaof-0
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on wheb9g66aaof-0, link-type EN10MB (Ethernet), capture size 262144 bytes
09:01:47.138833 ARP, Request who-has 10.254.0.1 tell 10.254.0.2, length 28
09:01:47.138864 ARP, Reply 10.254.0.1 is-at 26:e2:4e:b0:5f:b9 (oui Unknown), length 28
09:01:47.138869 IP 10.254.0.2 > google-public-dns-a.google.com: ICMP echo request, id 11520, seq 0, length 64
09:01:47.139617 IP google-public-dns-a.google.com > 10.254.0.2: ICMP echo reply, id 11520, seq 0, length 64
09:01:48.139034 IP 10.254.0.2 > google-public-dns-a.google.com: ICMP echo request, id 11520, seq 1, length 64
09:01:48.139420 IP google-public-dns-a.google.com > 10.254.0.2: ICMP echo reply, id 11520, seq 1, length 64
09:01:49.139264 IP 10.254.0.2 > google-public-dns-a.google.com: ICMP echo request, id 11520, seq 2, length 64
09:01:49.139607 IP google-public-dns-a.google.com > 10.254.0.2: ICMP echo reply, id 11520, seq 2, length 64
09:01:52.155571 ARP, Request who-has 10.254.0.2 tell 10.254.0.1, length 28
09:01:52.155633 ARP, Reply 10.254.0.2 is-at ae:44:d5:ae:51:8f (oui Unknown), length 28
^C
10 packets captured
10 packets received by filter
0 packets dropped by kernel
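
To make captures like this one easier to compare between healthy and unhealthy containers, the packet kinds can be tallied. This is a rough sketch that only recognises the four kinds of line appearing in this gist's tcpdump output; `classify` and `summarize` are illustrative names, not an existing tool.

```python
from collections import Counter


def classify(line):
    """Map one line of tcpdump text output to a packet kind, or None."""
    if "ARP, Request" in line:
        return "arp_request"
    if "ARP, Reply" in line:
        return "arp_reply"
    if "ICMP echo request" in line:
        return "icmp_echo_request"
    if "ICMP echo reply" in line:
        return "icmp_echo_reply"
    return None  # banner lines, "^C", capture statistics, etc.


def summarize(capture):
    """Count packet kinds across a whole tcpdump transcript."""
    kinds = (classify(line) for line in capture.splitlines())
    return Counter(kind for kind in kinds if kind is not None)
```

For the healthy capture above this would count two ARP requests, two ARP replies, and three ICMP echo requests with three replies.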

Unhealthy Container

Interfaces on host

root@8047e7ad-d4c2-460b-7c16-7d2187595852:/tmp/build/e55deab7# ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
851: whc6n3t49ubb-1@if852: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1432 qdisc noqueue state UP mode DEFAULT group default qlen 1
    link/ether d2:03:75:dc:79:b4 brd ff:ff:ff:ff:ff:ff
410: wbrdg-0afe0000: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1432 qdisc noqueue state UP mode DEFAULT group default
    link/ether 62:a4:06:cc:02:21 brd ff:ff:ff:ff:ff:ff
412: whecjqqmjdn1-0@if411: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1432 qdisc noqueue master wbrdg-0afe0000 state UP mode DEFAULT group default qlen 1
    link/ether 3a:d5:0c:df:1a:b1 brd ff:ff:ff:ff:ff:ff

Interfaces in container

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
411: whecjqqmjdn1-1@if412: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1432 qdisc noqueue state UP mode DEFAULT group default qlen 1
    link/ether b6:bf:da:a4:02:07 brd ff:ff:ff:ff:ff:ff

Pinging 8.8.8.8 from within container

root@8047e7ad-d4c2-460b-7c16-7d2187595852:/tmp/build/e55deab7# /tmp/build/e55deab7/gr-release-develop/bin/runc exec 591924f7-9211-4032-5f7a-65ecd893e346 /bin/ping -c 3 8.8.8.8
PING 8.8.8.8 (8.8.8.8): 56 data bytes
92 bytes from 591924f7-9211-4032-5f7a-65ecd893e346 (10.254.0.2): Destination Host Unreachable
92 bytes from 591924f7-9211-4032-5f7a-65ecd893e346 (10.254.0.2): Destination Host Unreachable
92 bytes from 591924f7-9211-4032-5f7a-65ecd893e346 (10.254.0.2): Destination Host Unreachable
--- 8.8.8.8 ping statistics ---
3 packets transmitted, 0 packets received, 100% packet loss
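
Note that the Destination Host Unreachable messages come from 10.254.0.2, the same address that answers when the container pings its own hostname further down, which suggests they are generated locally rather than by a remote router. A small sketch for pulling these numbers out of ping output, assuming the output format shown in this gist (`ping_summary` is a made-up helper name):

```python
import re

# Matches the statistics line, e.g.
# "3 packets transmitted, 0 packets received, 100% packet loss"
STATS_RE = re.compile(
    r"(\d+) packets transmitted, (\d+) (?:packets )?received, "
    r"(\d+(?:\.\d+)?)% packet loss"
)


def ping_summary(output):
    """Extract packet counts and unreachable replies from ping output."""
    m = STATS_RE.search(output)
    if m is None:
        return None  # no statistics line found
    return {
        "transmitted": int(m.group(1)),
        "received": int(m.group(2)),
        "loss_percent": float(m.group(3)),
        "unreachable": output.count("Destination Host Unreachable"),
    }
```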

tcpdump of host side veth during ping from within container

This was kind of interesting. Clearly there are no IP packets coming through, but the tcpdump command also hung for a while even after I pressed Ctrl+C to try to kill it.

root@8047e7ad-d4c2-460b-7c16-7d2187595852:/tmp/build/e55deab7# tcpdump -i whecjqqmjdn1-0
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on whecjqqmjdn1-0, link-type EN10MB (Ethernet), capture size 262144 bytes
^C^C09:35:04.288318 ARP, Request who-has 10.254.0.1 tell 8047e7ad-d4c2-460b-7c16-7d2187595852, length 28

1 packet captured
3 packets received by filter
0 packets dropped by kernel

Second tcpdump of host side veth during a second ping from within container

In subsequent pings the ARP requests were captured immediately, as if only the first ARP lookup had been slow. There are still no IP packets, and no ARP replies show up in the capture. 8047e7ad-d4c2-460b-7c16-7d2187595852 is the hostname of the host.

root@8047e7ad-d4c2-460b-7c16-7d2187595852:/tmp/build/e55deab7# tcpdump -i whecjqqmjdn1-0
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on whecjqqmjdn1-0, link-type EN10MB (Ethernet), capture size 262144 bytes
09:35:50.005350 ARP, Request who-has 10.254.0.1 tell 8047e7ad-d4c2-460b-7c16-7d2187595852, length 28
09:35:51.003541 ARP, Request who-has 10.254.0.1 tell 8047e7ad-d4c2-460b-7c16-7d2187595852, length 28
09:35:52.003537 ARP, Request who-has 10.254.0.1 tell 8047e7ad-d4c2-460b-7c16-7d2187595852, length 28
^C
3 packets captured
3 packets received by filter
0 packets dropped by kernel
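
The difference from the healthy capture is that the who-has requests for 10.254.0.1 are never followed by an is-at reply. The sketch below lists ARP targets that were requested but never answered within a capture; it assumes the tcpdump text format shown here, ignores ordering and timing, and `unanswered_arp_targets` is a hypothetical helper name.

```python
import re

REQUEST_RE = re.compile(r"ARP, Request who-has (\S+) tell (\S+),")
REPLY_RE = re.compile(r"ARP, Reply (\S+) is-at (\S+)")


def unanswered_arp_targets(capture):
    """Return addresses asked for via who-has with no is-at reply seen."""
    requested = []  # keeps first-seen order, without duplicates
    answered = set()
    for line in capture.splitlines():
        request = REQUEST_RE.search(line)
        if request:
            target = request.group(1)
            if target not in requested:
                requested.append(target)
            continue
        reply = REPLY_RE.search(line)
        if reply:
            answered.add(reply.group(1))
    return [target for target in requested if target not in answered]
```

On the healthy capture this would return an empty list, while on either unhealthy capture it would return 10.254.0.1, the address the container asks for before sending its pings.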

Other Random Checks

OS Info

This container is Debian but we've seen this in BusyBox too.

root@8047e7ad-d4c2-460b-7c16-7d2187595852:/tmp/build/e55deab7# /tmp/build/e55deab7/gr-release-develop/bin/runc exec 591924f7-9211-4032-5f7a-65ecd893e346 /bin/cat /etc/os-release
PRETTY_NAME="Debian GNU/Linux 8 (jessie)"
NAME="Debian GNU/Linux"
VERSION_ID="8"
VERSION="8 (jessie)"
ID=debian
HOME_URL="http://www.debian.org/"
SUPPORT_URL="http://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"

Pinging the IP address returned in ping

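I think this is essentially localhost: 10.254.0.2 is the address the Destination Host Unreachable messages came from, so it looks like the container's own address. I don't know ping very well, though.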

root@8047e7ad-d4c2-460b-7c16-7d2187595852:/tmp/build/e55deab7# /tmp/build/e55deab7/gr-release-develop/bin/runc exec 591924f7-9211-4032-5f7a-65ecd893e346 /bin/ping -c 3 10.254.0.2
PING 10.254.0.2 (10.254.0.2): 56 data bytes
64 bytes from 10.254.0.2: icmp_seq=0 ttl=64 time=0.056 ms
64 bytes from 10.254.0.2: icmp_seq=1 ttl=64 time=0.058 ms
64 bytes from 10.254.0.2: icmp_seq=2 ttl=64 time=0.057 ms
--- 10.254.0.2 ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max/stddev = 0.056/0.057/0.058/0.000 ms

Pinging own hostname

root@8047e7ad-d4c2-460b-7c16-7d2187595852:/tmp/build/e55deab7# /tmp/build/e55deab7/gr-release-develop/bin/runc exec 591924f7-9211-4032-5f7a-65ecd893e346 /bin/ping -c 3 591924f7-9211-4032-5f7a-65ecd893e346
PING 591924f7-9211-4032-5f7a-65ecd893e346 (10.254.0.2): 56 data bytes
64 bytes from 10.254.0.2: icmp_seq=0 ttl=64 time=0.054 ms
64 bytes from 10.254.0.2: icmp_seq=1 ttl=64 time=0.049 ms
64 bytes from 10.254.0.2: icmp_seq=2 ttl=64 time=0.064 ms
--- 591924f7-9211-4032-5f7a-65ecd893e346 ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max/stddev = 0.049/0.056/0.064/0.000 ms

Summary

There doesn't seem to be anything unusual about the links when comparing healthy to unhealthy containers.

It seems ARP packets are getting from the container veth to the host veth, but IP packets are not. In the unhealthy captures the ARP requests for 10.254.0.1 never receive a reply.

Questions

Can this be reproduced in Ubuntu where I have iptables?

Can this be reproduced in GATS against BOSH deployment?
