@mcastelino
Last active August 31, 2021 14:06
docker swarm networking - packet trace

How docker swarm load-balanced traffic flows (on a given host).

How the traffic gets to the host is outside the scope of this note.

Create a simple service

docker service create --name testswarm --replicas 1 --publish 8080:80 nginx /bin/bash -c "hostname > /usr/share/nginx/html/hostname; nginx -g \"daemon off;\""

Here port 8080 is published and maps to port 80 inside the container. Any traffic that hits the node on port 8080 will make its way into (one of) the nginx containers on port 80.
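
The two halves of `--publish 8080:80` can be made explicit with a small shell helper. This is only an illustrative sketch: `parse_publish` is a hypothetical name, not part of the docker CLI, and it assumes the short `published:target` form used in the command above.

```shell
# Hypothetical helper: split a short-form publish spec "PUBLISHED:TARGET".
parse_publish() {
    spec="$1"
    published="${spec%%:*}"   # text before the first colon: port on the node
    target="${spec##*:}"      # text after the last colon: port in the container
    echo "node port ${published} -> container port ${target}"
}

parse_publish 8080:80
```

On a swarm node running the service, `curl http://localhost:8080/hostname` should then return the container's hostname, since the service command writes it to that file.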

The container has two network interfaces:

  • eth0 which is the ingress interface
  • eth1 which is the egress interface

All requests originating from the load balancer come in through the ingress interface. Also note the lowered MTU on the ingress interface (as it is linked to the overlay network).

Keep in mind that when using swarm, docker creates a network namespace that actually handles the load-balanced ingress traffic. That namespace is the ingress_sbox.

Configuration:

  • Container Ingress IP: 10.255.0.4
  • Container Ingress VIP: 10.255.0.2
  • Gateway Bridge IP: 172.18.0.1
  • Ingress Sbox Ingress IP: 10.255.0.3 (talks to Container Ingress IP)
  • Ingress Sbox gateway IP: 172.18.0.2 (talks to sbox gateway network)

How are the interfaces connected

                                                      |vxlan
                                    +-----------------------------------+
                                    |                 |                 |
                                    |                 +-----------+     |
                                    |              vxlan          +---------------------------------+
                                    |                             +-------------------+             |
                                    |                             +     |             |             |
                                    |                                   |             |             |
                                    +-----------------------------------+             |             |
                                                                                      |             |
                                                                                      |             |
                                                                                      |             |
                                                                                      |             |
        172.18.0.                                                                     |             |
 +------------------------+                              +----------------------------------+       |
 |                        |                              |            ingress_sbox    |     |       |
 |      docker_gwbridge   |                              |                            +     |       |
 |                        +------------------------------+172.18.0.2             10.255.0.3 |       |
 |                        |                              |                                  |       |
 +------------------------+                              |                                  |       |
            |                                            |                                  |       |
            |                                            +----------------------------------+       |
            |                                                                                       |
            |                                                                                       |
            |                                                                                       |
            |                                                                                       |
            |                                                                                       |
            |                                                                                       |
            |                                             +----------------------------------+      |
            |                                             |          nginx     10.255.0.4 (IP)      |
            |                                             |                    10.255.0.2 (VIP)+----+
            |                                             |                                  |
            |                                             |                                  |
            +---------------------------------------------+172.18.0.3                        |
                                                          |                                  |
                                                          |                                  |
                                                          +----------------------------------+

So the traffic flow is:

localhost -> docker_gwbridge (172.18.0.1) -> ingress_sbox (172.18.0.2) -> marked traffic, ip_vs (10.255.0.2) -> container (10.255.0.4)

How the traffic gets in/out of the ingress_sbox

Host iptables that matter:

-A PREROUTING -m addrtype --dst-type LOCAL -j DOCKER-INGRESS
-A DOCKER-INGRESS -p tcp -m tcp --dport 8080 -j DNAT --to-destination 172.18.0.2:8080
-A DOCKER-INGRESS -p tcp -m state --state RELATED,ESTABLISHED -m tcp --sport 8080 -j ACCEPT
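
To check where a published port is sent on a given host, the DNAT target can be pulled out of `iptables-save` output. The sketch below is a reading aid, not part of docker: `dnat_destination` is a hypothetical helper, and the sample input is just the three rules listed above.

```shell
# Hypothetical helper: print the DNAT destination for a given destination
# port, reading iptables-save style rules from stdin.
dnat_destination() {
    port="$1"
    awk -v port="$port" '
        # For each DNAT rule, remember the values following --dport and
        # --to-destination, then print the destination if the port matches.
        /-j DNAT/ {
            dport = ""; dest = ""
            for (i = 1; i < NF; i++) {
                if ($i == "--dport") dport = $(i + 1)
                if ($i == "--to-destination") dest = $(i + 1)
            }
            if (dport == port) print dest
        }'
}

dnat_destination 8080 <<'EOF'
-A PREROUTING -m addrtype --dst-type LOCAL -j DOCKER-INGRESS
-A DOCKER-INGRESS -p tcp -m tcp --dport 8080 -j DNAT --to-destination 172.18.0.2:8080
-A DOCKER-INGRESS -p tcp -m state --state RELATED,ESTABLISHED -m tcp --sport 8080 -j ACCEPT
EOF
```

On a live host the same helper could be fed from `iptables-save -t nat | dnat_destination 8080` (run as root). The printed address is the ingress_sbox end of the gateway bridge, which is why the next stop in the trace is that namespace.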

Ingress sbox rules:

nsenter --net=/var/run/docker/netns/ingress_sbox

-A PREROUTING -p tcp -m tcp --dport 8080 -j MARK --set-xmark 0x100/0xffffffff
-A POSTROUTING -d 10.255.0.0/16 -m ipvs --ipvs -j SNAT --to-source 10.255.0.3

Ingress sbox load balancing

IPVS setup in ingress_sbox. This sends incoming traffic to 10.255.0.4.

Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
FWM  256 rr
  -> 10.255.0.4:0                 Masq    1      0          0

ipvsadm-save
-A -f 256 -s rr
-a -f 256 -r 10.255.0.4:0 -m -w 1

Note: 0x100 == 256 :)
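
iptables writes the firewall mark in hex (`--set-xmark 0x100/0xffffffff`) while ipvsadm prints it in decimal (`FWM 256`), so matching a MARK rule to its IPVS service means converting between the two. A small sketch of that conversion; `mark_to_decimal` is a hypothetical helper name.

```shell
# Hypothetical helper: convert an iptables mark such as "0x100/0xffffffff"
# to the decimal value that ipvsadm shows for the fwmark service.
mark_to_decimal() {
    mark="${1%%/*}"          # drop the optional "/mask" suffix
    printf '%d\n' "$mark"    # printf parses the 0x prefix as hexadecimal
}

mark_to_decimal 0x100/0xffffffff
```

For the mark used here this prints 256, the value after `-f` in the ipvsadm-save output.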

Port remapping actually happens inside the container namespace

Connecting to the container from the ingress_sbox (both 80 and 8080 work)

curl http://10.255.0.4:80/hostname
1f99d9cf0236
curl http://10.255.0.4:8080/hostname
1f99d9cf0236

nsenter --net=/var/run/docker/netns/02d51fa13d84

-A PREROUTING -d 10.255.0.4/32 -p tcp -m tcp --dport 8080 -j REDIRECT --to-ports 80
-A OUTPUT -d 127.0.0.11/32 -j DOCKER_OUTPUT
-A POSTROUTING -d 127.0.0.11/32 -j DOCKER_POSTROUTING
-A DOCKER_OUTPUT -d 127.0.0.11/32 -p tcp -m tcp --dport 53 -j DNAT --to-destination 127.0.0.11:41343
-A DOCKER_OUTPUT -d 127.0.0.11/32 -p udp -m udp --dport 53 -j DNAT --to-destination 127.0.0.11:43411
-A DOCKER_POSTROUTING -s 127.0.0.11/32 -p tcp -m tcp --sport 41343 -j SNAT --to-source :53
-A DOCKER_POSTROUTING -s 127.0.0.11/32 -p udp -m udp --sport 43411 -j SNAT --to-source :53
COMMIT
# Completed on Thu Feb  2 21:57:42 2017
# Generated by iptables-save v1.6.0 on Thu Feb  2 21:57:42 2017
*filter
:INPUT ACCEPT [67:5781]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [75:5765]
-A INPUT -d 10.255.0.4/32 -p tcp -m tcp --dport 80 -m conntrack --ctstate NEW,ESTABLISHED -j ACCEPT
-A INPUT -d 10.255.0.4/32 -p udp -j DROP
-A INPUT -d 10.255.0.4/32 -p tcp -j DROP
-A OUTPUT -s 10.255.0.4/32 -p tcp -m tcp --sport 80 -m conntrack --ctstate ESTABLISHED -j ACCEPT
-A OUTPUT -s 10.255.0.4/32 -p udp -j DROP
-A OUTPUT -s 10.255.0.4/32 -p tcp -j DROP

DNS Resolution

Docker swarm has an internal DNS-based load balancer that round-robins the DNS requests to spread load. It runs on localhost on the host, bound to a port specific to the container. See https://github.com/docker/libnetwork/blob/5ac04367ae7b0b12c33bed5f5b395bd4c104fff9/sandbox.go#L815

Rules in the container namespace implement the redirection to the docker DNS load balancer/resolver: 127.0.0.11:53 is mapped to the specific port on which the corresponding resolver is listening.

	-A DOCKER_OUTPUT -d 127.0.0.11/32 -p tcp -m tcp --dport 53 -j DNAT --to-destination 127.0.0.11:41343
	-A DOCKER_OUTPUT -d 127.0.0.11/32 -p udp -m udp --dport 53 -j DNAT --to-destination 127.0.0.11:43411
	-A DOCKER_POSTROUTING -s 127.0.0.11/32 -p tcp -m tcp --sport 41343 -j SNAT --to-source :53
	-A DOCKER_POSTROUTING -s 127.0.0.11/32 -p udp -m udp --sport 43411 -j SNAT --to-source :53

The resolver is docker itself (dockerd), as netstat shows:

 netstat -plunt
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 127.0.0.11:41343        0.0.0.0:*               LISTEN      14447/dockerd
udp        0      0 127.0.0.11:43411        0.0.0.0:*                           14447/dockerd
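
The DNAT rules and the netstat listing can be cross-checked: the ports that 127.0.0.11:53 is rewritten to should be the ones dockerd is listening on. The sketch below extracts those ports from `iptables-save` style text; `resolver_ports` is a hypothetical helper, and the sample input is the two DNAT rules shown above.

```shell
# Hypothetical helper: list protocol and rewritten address:port for the
# embedded DNS DNAT rules, reading iptables-save style rules from stdin.
resolver_ports() {
    awk '
        # Keep only DNAT rules for DNS traffic addressed to 127.0.0.11.
        /-d 127\.0\.0\.11\/32/ && /--dport 53/ && /-j DNAT/ {
            proto = ""; dest = ""
            for (i = 1; i < NF; i++) {
                if ($i == "-p") proto = $(i + 1)
                if ($i == "--to-destination") dest = $(i + 1)
            }
            print proto, dest
        }'
}

resolver_ports <<'EOF'
-A DOCKER_OUTPUT -d 127.0.0.11/32 -p tcp -m tcp --dport 53 -j DNAT --to-destination 127.0.0.11:41343
-A DOCKER_OUTPUT -d 127.0.0.11/32 -p udp -m udp --dport 53 -j DNAT --to-destination 127.0.0.11:43411
EOF
```

The two lines printed (tcp on 41343, udp on 43411) line up with the dockerd sockets in the netstat output. The port numbers are assigned per container, so they will differ on another system.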

How did it work on 1.12.1?

  • The port forwarding was in the ingress sbox
  • The DNS remap was in the network ns

ingress-sbox

# Generated by iptables-save v1.6.0 on Thu Feb  9 21:24:56 2017
*mangle
:PREROUTING ACCEPT [36:4125]
:INPUT ACCEPT [3:180]
:FORWARD ACCEPT [33:3945]
:OUTPUT ACCEPT [3:180]
:POSTROUTING ACCEPT [36:4125]
-A PREROUTING -p tcp -m tcp --dport 8080 -j MARK --set-xmark 0x100/0xffffffff
-A OUTPUT -d 10.255.0.4/32 -j MARK --set-xmark 0x100/0xffffffff
COMMIT
# Completed on Thu Feb  9 21:24:56 2017
# Generated by iptables-save v1.6.0 on Thu Feb  9 21:24:56 2017
*nat
:PREROUTING ACCEPT [0:0]
:INPUT ACCEPT [0:0]
:OUTPUT ACCEPT [0:0]
:POSTROUTING ACCEPT [0:0]
:DOCKER_OUTPUT - [0:0]
:DOCKER_POSTROUTING - [0:0]
-A PREROUTING -p tcp -m tcp --dport 8080 -j REDIRECT --to-ports 80
-A OUTPUT -d 127.0.0.11/32 -j DOCKER_OUTPUT
-A POSTROUTING -d 127.0.0.11/32 -j DOCKER_POSTROUTING
-A POSTROUTING -d 10.255.0.0/16 -m ipvs --ipvs -j SNAT --to-source 10.255.0.3
-A DOCKER_OUTPUT -d 127.0.0.11/32 -p tcp -m tcp --dport 53 -j DNAT --to-destination 127.0.0.11:45190
-A DOCKER_OUTPUT -d 127.0.0.11/32 -p udp -m udp --dport 53 -j DNAT --to-destination 127.0.0.11:40332
-A DOCKER_POSTROUTING -s 127.0.0.11/32 -p tcp -m tcp --sport 45190 -j SNAT --to-source :53
-A DOCKER_POSTROUTING -s 127.0.0.11/32 -p udp -m udp --sport 40332 -j SNAT --to-source :53
COMMIT
# Completed on Thu Feb  9 21:24:56 2017
# Generated by iptables-save v1.6.0 on Thu Feb  9 21:24:56 2017
*filter
:INPUT ACCEPT [3:180]
:FORWARD ACCEPT [33:3945]
:OUTPUT ACCEPT [3:180]
COMMIT

In container

# Generated by iptables-save v1.6.0 on Thu Feb  9 21:32:37 2017
*mangle
:PREROUTING ACCEPT [21:1358]
:INPUT ACCEPT [21:1358]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [15:2767]
:POSTROUTING ACCEPT [15:2767]
COMMIT
# Completed on Thu Feb  9 21:32:37 2017
# Generated by iptables-save v1.6.0 on Thu Feb  9 21:32:37 2017
*nat
:PREROUTING ACCEPT [3:180]
:INPUT ACCEPT [3:180]
:OUTPUT ACCEPT [0:0]
:POSTROUTING ACCEPT [0:0]
:DOCKER_OUTPUT - [0:0]
:DOCKER_POSTROUTING - [0:0]
-A OUTPUT -d 127.0.0.11/32 -j DOCKER_OUTPUT
-A POSTROUTING -d 127.0.0.11/32 -j DOCKER_POSTROUTING
-A DOCKER_OUTPUT -d 127.0.0.11/32 -p tcp -m tcp --dport 53 -j DNAT --to-destination 127.0.0.11:39295
-A DOCKER_OUTPUT -d 127.0.0.11/32 -p udp -m udp --dport 53 -j DNAT --to-destination 127.0.0.11:44854
-A DOCKER_POSTROUTING -s 127.0.0.11/32 -p tcp -m tcp --sport 39295 -j SNAT --to-source :53
-A DOCKER_POSTROUTING -s 127.0.0.11/32 -p udp -m udp --sport 44854 -j SNAT --to-source :53
COMMIT
# Completed on Thu Feb  9 21:32:37 2017
# Generated by iptables-save v1.6.0 on Thu Feb  9 21:32:37 2017
*filter
:INPUT ACCEPT [21:1358]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [15:2767]
COMMIT
# Completed on Thu Feb  9 21:32:37 2017
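
The difference between the two setups comes down to which namespace holds the REDIRECT rule for the published port: the ingress sbox on 1.12.1, the container namespace in the trace earlier in this note. A rough way to test a given dump for it is sketched below; `has_redirect` is a hypothetical helper, and the sample lines are copied from the 1.12.1 dumps above.

```shell
# Hypothetical helper: succeed if the iptables-save text on stdin contains
# a REDIRECT rule for the given destination port.
has_redirect() {
    grep -q -e "--dport $1 -j REDIRECT"
}

# Rule from the 1.12.1 ingress-sbox dump: the remap is present.
if printf '%s\n' '-A PREROUTING -p tcp -m tcp --dport 8080 -j REDIRECT --to-ports 80' | has_redirect 8080; then
    echo "ingress-sbox: 8080 is remapped here"
fi

# Rule from the 1.12.1 container dump: only DNS rules, so nothing is printed.
if printf '%s\n' '-A OUTPUT -d 127.0.0.11/32 -j DOCKER_OUTPUT' | has_redirect 8080; then
    echo "container: 8080 is remapped here"
fi
```

Against a live namespace, the input would come from `nsenter --net=<namespace path> iptables-save -t nat`, run as root.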

With Clear Containers


+---------------------------------+                +--------------------------------+
|   ingress sbox                  |                |                                |
|                    +            |                |                       +        |
|                    +-----------------------------------------------------+        |
|            I IP    |            |                |                       +--------------+
|                    +----+       |                |           +-----------+        |
|                    +    |       |                |       over|ay box     |        |
|                         |       |                |           |                    |
+---------------------------------+                +--------------------------------+
                          |                                    |
                          |                                    |
                          |                                    |            host container ns
                          |                   +--------------------------------------------+
                          |                   |        +-+     |                           |
                          |                   |        | +-----+     +-----------------+   |
                          |                   |        | |           |    IP           |   |
                          |                   |        | +--------------+ VIP          |   |
                          |         Resolver-----+     +-+           |                 |   |
     docker_gw_bridge     |       127.0.0.11  |                      |                 |   |
               +          |                   |       +-+R IP        |                 |   |
               +----------+                   |       | +---------------+ HIP          |   |
       H GW IP +--------------------------------------+ |            |                 |   |
               |                              |       +-+            +-----------------+   |
               +       default gw             |                    /etc/resolv.conf (127..)|
                                              +--------------------------------------------+

/etc/resolv.conf
@amshinde

@mcastelino Can you update the intermediate solution that you used for the dns-proxy issue?