how libvirt/docker interact with firewalld

docker just does not get along with libvirt, iptables, nftables, firewalld, etc.

Why does my container isolation come and go? iptables, nftables, firewalld: who did what?!

On a modern OS (RHEL 8, Ubuntu 22, etc.) the iptables command is almost never the legacy variant anymore, but there can still be a legacy ip_tables kernel module loaded:

lsb_release -d; lsmod | grep tables; iptables -V
Description:    Fedora release 38 (Thirty Eight)
nf_tables             368640  1454 nft_compat,nft_chain_nat
nfnetlink              20480  8 nft_compat,nfnetlink_acct,nf_conntrack_netlink,nf_tables,ip_set
Description:    Red Hat Enterprise Linux release 8.8 (Ootpa)
nf_tables_set          49152  21
nf_tables             184320  447 nft_ct,nft_compat,nft_reject_inet,nft_fib_ipv6,nft_fib_ipv4,nft_counter,nft_chain_nat,nf_tables_set,nft_reject,nft_fib,nft_fib_inet
nfnetlink              16384  6 nft_compat,nf_conntrack_netlink,nf_tables,ip_set
libcrc32c              16384  4 nf_conntrack,nf_nat,nf_tables,xfs
iptables v1.8.4 (nf_tables)
lsb_release -d; lsmod | grep tables; iptables -V
Description:    AlmaLinux release 8.8 (Sapphire Caracal)
ip6_tables             32768  6
ip_tables              28672  0
nf_tables             180224  589 nft_compat,nft_counter,nft_chain_nat
nfnetlink              16384  6 nft_compat,nf_conntrack_netlink,nf_tables,ip_set
libcrc32c              16384  4 nf_conntrack,nf_nat,nf_tables,xfs
iptables v1.8.4 (nf_tables)

$ nft --version
nftables v0.9.3 (Topsy)
firewall-cmd --version
0.9.11

The reason we see ip_tables and ip6_tables loaded here is that this firewalld is configured with FirewallBackend=iptables, which means the legacy ip_tables kernel backend.

However, both ip_tables and nf_tables handle xtables rules through the xtables modules[1]. We can see that iptables shows the rules created by itself and by ip_tables, but not native nft rules, while nft can see all rules, including the rules created by iptables and by ip_tables (firewalld with the legacy backend).
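
A quick way to see this on one host (a throwaway demo chain and table; the names and port numbers are mine):

# added via the iptables command (nft-backed): visible to both iptables and nft
sudo iptables -N DEMO
sudo iptables -A DEMO -p tcp --dport 9999 -j DROP
sudo iptables -S DEMO
sudo nft list ruleset | grep 9999

# added natively via nft in its own table: invisible to iptables
sudo nft add table inet demo
sudo nft add chain inet demo input '{ type filter hook input priority 0; }'
sudo nft add rule inet demo input tcp dport 9998 drop
sudo iptables-save | grep 9998 || echo "not visible via iptables"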

flowchart TD
    libvirt[libvirt bridge] --> |"iptables cmd?"|ipt
    docker_alma --> |"firewall-cmd call"|fw_legacy
    docker_alma["docker with\nfirewalld-ipt"] --> |"iptables call"|ipt
    docker_rhel["docker with\nfirewalld-npt"] --> |"iptables call"|ipt
    docker_rhel --> |"firewall-cmd call"|fwnpt
    fw_legacy["firewalld-iptables\n(v0.9)"] --> ipt_k(ip_tables)
    fwipt["firewalld-iptables\n(>v1.3)"] --> nft_k
    fwnpt[firewalld-npt] --> nft_k
    ipt_k --> X{xtables match}
    ipt[iptables-nft] -->nft_k[nf_tables]
    nft[nft nftables] -->nft_k
    nft_k -->X
    nft_k -->nft_m[nftables match]
    subgraph application
         docker_alma
         libvirt
         docker_rhel
    end
    subgraph "ip_tables/nf_tables frontends"
        fw_legacy
        fwipt
        ipt
        nft
        fwnpt
    end
    subgraph "kernel API"
        ipt_k
        nft_k
        X
        nft_m
    end

However, a firewalld reload removes all iptables rules created on the fly. Strangely, you cannot see any difference in nft list ruleset or iptables -vL output, which makes the behavior very hard to debug. But I made sure by reproducing the behavior with both the firewalld iptables and nftables backends, both at v0.9; maybe newer versions behave differently. The test is:

  1. start the default docker.service
  2. add bridge networks br1 and br2
  3. start netcat-busybox1 and netcat-busybox2
  4. try to ping each other: the packets do not get through
  5. try to nc each other: the port will not open or no data is sent
  6. firewall-cmd --reload
  7. try again: everything passes now
  8. check iptables -vL and nft list ruleset: there is no difference before and after the reload
  9. restart the containers: the isolation is still broken
  10. set the firewalld option FlushAllOnReload=no: nothing changes
  11. restart docker.service: the isolation comes back, which indicates the iptables rules get re-applied
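
A rough sketch of the commands behind steps 1-6 (network names follow the text; the netcat payload is my own choice):

sudo systemctl start docker.service
sudo docker network create -o com.docker.network.bridge.enable_icc=false br1
sudo docker network create -o com.docker.network.bridge.enable_icc=false br2
sudo docker run -d --rm --name netcat-busybox1 --network br1 busybox sh -c 'nc -lkp 12345 -e /bin/cat'
sudo docker run -d --rm --name netcat-busybox2 --network br2 busybox sh -c 'nc -lkp 12345 -e /bin/cat'
# ping/nc between the two containers fails here (isolated)
sudo firewall-cmd --reload
# ping/nc between the two containers succeeds now (isolation lost)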

Here br1 and br2 are created with icc=false. When properly isolated:

===========summary
======isolated:
busybox1_1-->busybox1_2
busybox1_1-->busybox2_1
busybox1_1-->busybox2_2
busybox1_1-->busybox3_1
busybox1_1-->busybox3_2
busybox1_2-->busybox1_1
busybox1_2-->busybox2_1
busybox1_2-->busybox2_2
busybox1_2-->busybox3_1
busybox1_2-->busybox3_2
busybox2_1-->busybox1_1
busybox2_1-->busybox1_2
busybox2_1-->busybox2_2
busybox2_1-->busybox3_1
busybox2_1-->busybox3_2
busybox2_2-->busybox1_1
busybox2_2-->busybox1_2
busybox2_2-->busybox2_1
busybox2_2-->busybox3_1
busybox2_2-->busybox3_2
busybox3_1-->busybox1_1
busybox3_1-->busybox1_2
busybox3_1-->busybox2_1
busybox3_1-->busybox2_2
busybox3_2-->busybox1_1
busybox3_2-->busybox1_2
busybox3_2-->busybox2_1
busybox3_2-->busybox2_2
======connected:
busybox3_1-->busybox3_2
busybox3_2-->busybox3_1

After reloading firewalld:

===========summary
======isolated:
busybox1_1-->busybox1_2
busybox1_2-->busybox1_1
busybox2_1-->busybox2_2
busybox2_2-->busybox2_1
======connected:
busybox1_1-->busybox2_1
busybox1_1-->busybox2_2
busybox1_1-->busybox3_1
busybox1_1-->busybox3_2
busybox1_2-->busybox2_1
busybox1_2-->busybox2_2
busybox1_2-->busybox3_1
busybox1_2-->busybox3_2
busybox2_1-->busybox1_1
busybox2_1-->busybox1_2
busybox2_1-->busybox3_1
busybox2_1-->busybox3_2
busybox2_2-->busybox1_1
busybox2_2-->busybox1_2
busybox2_2-->busybox3_1
busybox2_2-->busybox3_2
busybox3_1-->busybox1_1
busybox3_1-->busybox1_2
busybox3_1-->busybox2_1
busybox3_1-->busybox2_2
busybox3_1-->busybox3_2
busybox3_2-->busybox1_1
busybox3_2-->busybox1_2
busybox3_2-->busybox2_1
busybox3_2-->busybox2_2
busybox3_2-->busybox3_1

Almost every system that depends on injecting rules on the fly has this exact same problem, not just docker. The real fix is to use nftables instead of the deprecated iptables, where every tool can create its own table to hold all the rules it needs[8]. Then a firewalld reload only affects the firewalld table and does not interfere with the rules in other tables. However, nftables integration is still just a plan in almost all container tools. They deal with this issue in different ways while they stay on iptables:

  • libvirt has handled this problem directly in the driver since 2013[2]: it captures the firewalld reload signal via D-Bus and then reapplies its iptables rules. This is the most stable way to cope with firewalld.
  • podman provides a command to simply reapply its firewall rules[5].
  • podman's old network backend CNI and the new backend netavark[6] currently both need a manual workaround[7] for firewalld reload. It may be fixed in 4.8, where they plan to do the same thing as libvirt, or it may be fixed by switching to nftables.

docker also has a daemon setting for live restore[9], which allows restarting the docker service quickly without interrupting the containers. However, there is neither an official workaround to automate this nor even any recognition of this issue as a bug, although the workaround used for podman may also work for docker when combined with live-restore. In the docker community, manual intervention seems to be an accepted routine.
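
Live restore itself is a daemon option; a minimal sketch of /etc/docker/daemon.json:

{
  "live-restore": true
}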

other frequent quirks

ERROR: INVALID_ZONE: docker

docker.service tries to put all its bridges into one firewalld zone: docker. However, the zone target is set to ACCEPT, which means there is effectively no firewall at all between container and host; the isolation purely depends on the injected iptables rules. Even worse, the docker daemon uses its own Go code to call firewalld to create the docker zone: if one manually configures a docker zone before the daemon starts, it fails with ERROR: INVALID_ZONE: docker. So it is impossible to configure the docker zone in a stable way. One may succeed by configuring the zone with firewalld after the docker zone has been created, but the error comes back when the docker system package is updated.

ERROR: ZONE_CONFLICT: 'docker0' already bound to a zone

Even if you did not manually configure the docker zone, docker.service can still fail with ERROR: ZONE_CONFLICT: 'docker0' already bound to a zone from time to time when you reboot the server or update the docker packages. The reason is unknown so far; I guess there may be some race condition inside the moby Go code, because moby does not use D-Bus to communicate with firewalld as suggested at all.

can we use iptables: false in docker.service?

In the previous section we saw that docker/podman and other moby-related projects depend on injecting global iptables rules to create their bridge network environment. This often interferes with other systems that also depend on injected rules. Their cooperation with firewalld is poor, and the default setup is also rather insane from a security point of view. So in many cases, when sysops try to stabilize the system or harden its security, they do not want the docker daemon to touch the firewall at all.

docker does come with a flag to stop this: iptables: false[12]. However, there is a big warning in the docker docs:

It is not possible to completely prevent Docker from creating iptables rules, and creating them after-the-fact is extremely involved and beyond the scope of these instructions. Setting iptables to false will more than likely break container networking for the Docker engine.

But I am not going to believe this without trying, so I did my own research and experiments.
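
One common way to disable it is via /etc/docker/daemon.json (a minimal sketch; restart docker.service afterwards):

{
  "iptables": false
}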

The answer is a big YES: we can do this, as long as we do not use the fancier docker network features:

  • overlay, swarm mode, ipvlan, macvlan, etc. If one is using these features, one is probably already behind some fancier firewall rather than relying on the host firewalld anyway.
  • on-the-fly networks in docker compose: basically the default behavior of networks: <name>: external: false. If one starts considering firewalld to harden security, one normally would not want docker compose to spawn its own networks on the fly.

The tricky parts are the following:

  • how to isolate the containers: intra-bridge, inter-bridge and their free combination on demand.
  • how to give the container internet access
  • how to publish (map) the ports.

In the following sections we discuss them in detail. In short, all of this can be done with simple firewalld settings.

firewalld for container isolation

The first observation is: if we purely depend on firewalld rules, we do not get any container isolation unless we load br_netfilter at boot time. So the reality actually contradicts the docker documentation quoted above: with just iptables: false and nothing else, networking is not broken but wide open, and all containers can connect to each other freely. Only the outbound internet access may be blocked, if the internet-facing interface is not in the external zone.

This is a result of the nature of bridge networking and the lack of a bridge concept in firewalld.

  1. bridge traffic is usually layer 2, while firewalls (iptables/nftables; not entirely true for the latter) usually work on layer 3, so layer-2 traffic never matches anything.
  2. the br_netfilter kernel module promotes the layer-2 traffic to layer 3, which gives the firewall a chance to act on it.
  3. nftables in theory has bridge-family tables which can look at layer-2 traffic without br_netfilter, but such rules normally accept everything by default.
  4. after more experiments with br_netfilter, it looks like that, if we load br_netfilter very early at boot, the isolation already works: by default the containers can access the internet and the host can reach services in the containers by IP, even if the bridges are not in any zone. This is unlike a real NIC connected to another machine, which needs a zone-to-zone policy to allow traffic[10]. firewalld can then be used to grant access once the bridges are assigned to zones.

Most online tutorials completely ignore container isolation, which defeats the whole point of trying to secure docker with firewalld. So we have to do our own research with br_netfilter.

Add br_netfilter to /etc/modules-load.d/br_netfilter.conf, then reboot:

sudo sh -c 'echo br_netfilter > /etc/modules-load.d/br_netfilter.conf'

After reboot, we can see the default behavior for non-zoned bridges:

  • peer <--> peer is blocked
  • container --> host is blocked
  • container --> internet is allowed # if the outward-facing interface is in the external zone

Thus the default behavior is already in our favor: without special rules the containers are mostly isolated from each other and from the host.

with iptables: false and br_netfilter loaded:

external
  interfaces: enp1s0
public
  interfaces: enp2s0

===========summary
======connected:
host-->busybox1_1
host-->busybox1_2
host-->busybox2_1
host-->busybox2_2
host-->busybox3_1
host-->busybox3_2
host-->busybox4_1
host-->busybox4_2
busybox1_1-->google.com
busybox1_2-->google.com
busybox2_1-->google.com
busybox2_2-->google.com
busybox3_1-->google.com
busybox3_2-->google.com
busybox4_1-->google.com
busybox4_2-->google.com

Since the interfaces are not in any zone, iptables has no rules about them, so I supposed there would be no connectivity at all. However, this is not the case: the host can access all containers and all containers can access the internet.

Now I add them to some zones:

docker_product
  interfaces: br2
external
  interfaces: enp1s0
internal
  interfaces: br3
public
  interfaces: enp2s0
trusted
  interfaces: br1 br4

docker_product (active)                              
  target: default                                   
  icmp-block-inversion: no                       
  interfaces: br2                                
  sources:                                        
  services:                                        
  ports:                                          
  protocols:                           
  forward: no                          
  masquerade: no                        
  forward-ports:                         
  source-ports:                         
  icmp-blocks:                       
  rich rules: 
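
The zone and interface assignment above was done with plain firewall-cmd; roughly like this (a sketch, not a verbatim transcript):

sudo firewall-cmd --permanent --new-zone docker_product
sudo firewall-cmd --permanent --zone=docker_product --add-interface=br2
sudo firewall-cmd --permanent --zone=internal --add-interface=br3
sudo firewall-cmd --permanent --zone=trusted --add-interface=br1
sudo firewall-cmd --permanent --zone=trusted --add-interface=br4
sudo firewall-cmd --reload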

now we enable:

  • br1<-->br1
  • br4<-->br4
  • br1<-->br4
  • br2-->br1/br4
  • br3-->br1/br4
======isolated:
busybox2_1--xexternal
busybox2_1--xpublic
busybox2_2--xexternal
busybox2_2--xpublic
busybox3_1--xexternal
busybox3_1--xpublic
busybox3_2--xexternal
busybox3_2--xpublic
busybox2_1--xbusybox2_2
busybox2_1--xbusybox3_1
busybox2_1--xbusybox3_2
busybox2_2--xbusybox2_1
busybox2_2--xbusybox3_1
busybox2_2--xbusybox3_2
busybox3_1--xbusybox2_1
busybox3_1--xbusybox2_2
busybox3_1--xbusybox3_2
busybox3_2--xbusybox2_1
busybox3_2--xbusybox2_2
busybox3_2--xbusybox3_1
======connected:
host-->busybox1_1
host-->busybox1_2
host-->busybox2_1
host-->busybox2_2
host-->busybox3_1
host-->busybox3_2
host-->busybox4_1
host-->busybox4_2
busybox1_1-->external
busybox1_1-->public
busybox1_2-->external
busybox1_2-->public
busybox4_1-->external
busybox4_1-->public
busybox4_2-->external
busybox4_2-->public
busybox1_1-->google.com
busybox1_2-->google.com
busybox2_1-->google.com
busybox2_2-->google.com
busybox3_1-->google.com
busybox3_2-->google.com
busybox4_1-->google.com
busybox4_2-->google.com
busybox1_1-->busybox1_2
busybox1_1-->busybox2_1
busybox1_1-->busybox2_2
busybox1_1-->busybox3_1
busybox1_1-->busybox3_2
busybox1_1-->busybox4_1
busybox1_1-->busybox4_2
busybox1_2-->busybox1_1
busybox1_2-->busybox2_1
busybox1_2-->busybox2_2
busybox1_2-->busybox3_1
busybox1_2-->busybox3_2
busybox1_2-->busybox4_1
busybox1_2-->busybox4_2
busybox2_1-->busybox1_1
busybox2_1-->busybox1_2
busybox2_1-->busybox4_1
busybox2_1-->busybox4_2
busybox2_2-->busybox1_1
busybox2_2-->busybox1_2
busybox2_2-->busybox4_1
busybox2_2-->busybox4_2
busybox3_1-->busybox1_1
busybox3_1-->busybox1_2
busybox3_1-->busybox4_1
busybox3_1-->busybox4_2
busybox3_2-->busybox1_1
busybox3_2-->busybox1_2
busybox3_2-->busybox4_1
busybox3_2-->busybox4_2
busybox4_1-->busybox1_1
busybox4_1-->busybox1_2
busybox4_1-->busybox2_1
busybox4_1-->busybox2_2
busybox4_1-->busybox3_1
busybox4_1-->busybox3_2
busybox4_1-->busybox4_2
busybox4_2-->busybox1_1
busybox4_2-->busybox1_2
busybox4_2-->busybox2_1
busybox4_2-->busybox2_2
busybox4_2-->busybox3_1
busybox4_2-->busybox3_2
busybox4_2-->busybox4_1

we add a port to docker_product:
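
For example (a sketch; add --permanent and a reload to make it persistent):

sudo firewall-cmd --zone=docker_product --add-port=12345/tcp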

docker_product (active)
  target: default
  icmp-block-inversion: no
  interfaces: br2
  sources: 
  services: 
  ports: 12345/tcp
  protocols: 
  forward: no
  masquerade: no
  forward-ports: 
  source-ports: 
  icmp-blocks: 
  rich rules:

br2 can talk to host services

======isolated:
busybox3_1--xexternal
busybox3_1--xpublic
busybox3_2--xexternal
busybox3_2--xpublic
busybox2_1--xbusybox2_2
busybox2_1--xbusybox3_1
busybox2_1--xbusybox3_2
busybox2_2--xbusybox2_1
busybox2_2--xbusybox3_1
busybox2_2--xbusybox3_2
busybox3_1--xbusybox2_1
busybox3_1--xbusybox2_2
busybox3_1--xbusybox3_2
busybox3_2--xbusybox2_1
busybox3_2--xbusybox2_2
busybox3_2--xbusybox3_1

Then I turn on masquerade on docker_product (which is kind of nonsense for a bridge): br2-to-br2 communication becomes possible, and br3 can now reach br2, but br3 still cannot talk to br3, and moreover br2 still cannot access br3. My theory is that NAT is not set up for the br2-to-br3 path, so busybox2_* --> |br2|br3| busybox3_* does not work, while for busybox2_* --> |br2|br2| busybox2_* there is a kind of SNAT. In theory this should not be required: since br2 acts like a bridge, it should only need some rules changed to allow peer communication.
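
The toggle itself is just the zone option (a sketch):

sudo firewall-cmd --zone=docker_product --add-masquerade
sudo firewall-cmd --zone=docker_product --remove-masquerade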

Now I remove the masquerade and create a docker_p2internal policy:

sudo firewall-cmd --permanent --new-policy docker_p2internal
sudo firewall-cmd --permanent --policy docker_p2internal --add-egress-zone internal
sudo firewall-cmd --permanent --policy docker_p2internal --add-ingress-zone docker_product
sudo firewall-cmd --permanent --policy docker_p2internal --set-target ACCEPT
sudo firewall-cmd --reload

docker_p2internal (active)
  priority: -1
  target: ACCEPT
  ingress-zones: docker_product
  egress-zones: internal
  services: 
  ports: 
  protocols: 
  masquerade: no
  forward-ports: 
  source-ports: 
  icmp-blocks: 
  rich rules:

Then br2 --> br3 is allowed:

======isolated:
busybox3_1--xexternal
busybox3_1--xpublic
busybox3_2--xexternal
busybox3_2--xpublic
busybox2_1--xbusybox2_2
busybox2_2--xbusybox2_1
busybox3_1--xbusybox2_1
busybox3_1--xbusybox2_2
busybox3_1--xbusybox3_2
busybox3_2--xbusybox2_1
busybox3_2--xbusybox2_2
busybox3_2--xbusybox3_1

If we want the boxes on the same bridge/zone to communicate with each other, we can simply enable intra-zone forwarding on the zone. Now br2<-->br2 works as well:
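
For example (a sketch, using the forward zone option shown in the listings above):

sudo firewall-cmd --permanent --zone=docker_product --add-forward
sudo firewall-cmd --reload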

======isolated:
busybox3_1--xexternal
busybox3_1--xpublic
busybox3_2--xexternal
busybox3_2--xpublic
busybox3_1--xbusybox2_1
busybox3_1--xbusybox2_2
busybox3_1--xbusybox3_2
busybox3_2--xbusybox2_1
busybox3_2--xbusybox2_2
busybox3_2--xbusybox3_1

publish (map) ports

docker uses two different techniques in parallel to forward the traffic:

  • iptables DNAT rules
  • a proxy process that binds the published ports in the host network namespace, like some kind of socat proxy.

In reality either one alone will work, so the proxy process is enough to publish a port without the iptables rules; we do not need anything special to forward the traffic. We can verify the port binding with netstat:

sudo docker run --rm --name busybox1_1 --network br1 -p 12345:12345 busybox sh -c 'nc -v -lkp 12345 -e /bin/cat' &
$ netstat -ltnp | grep 12345
(Not all processes could be identified, non-owned process info
 will not be shown, you would have to be root to see it all.)
tcp        0      0 0.0.0.0:12345           0.0.0.0:*               LISTEN      -                   
tcp6       0      0 :::12345                :::*                    LISTEN      -

we can use nc -v localhost 12345 to communicate with the echo server.

dns of containers

By default, docker containers run in bridge network mode. Within a user-defined bridge network, containers can resolve each other by container name (the name acts as a hostname) through docker's embedded DNS.

This behavior does not depend on iptables, so it works the same with iptables: false. However, outside the bridge the DNS does not work, for example from the host or from containers on another bridge. It is not recommended to communicate with a container by IP or container name directly; one should rely on published ports instead.
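
A quick check from another container on the same bridge (a sketch, reusing the busybox1_1 echo server from earlier; the name is resolved by docker's embedded DNS):

sudo docker run --rm --network br1 busybox nslookup busybox1_1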

docker compose

If one wants to be able to reach the host or other zones, one has to use an existing network (external: true). Otherwise, the network is created on the fly and ends up as a non-zoned interface, which is mostly blocked:

version: '3'
services:
  echo-server1:
    image: busybox
    command: ["sh", "-c", "nc -v -l -p 12345 -e /bin/cat"]
    networks:
      - backend
  echo-server2:
    image: busybox
    command: ["sh", "-c", "nc -v -l -p 12345 -e /bin/cat"]
    networks:
      - frontend

networks:
  backend:
    name: br1
    external: true
  frontend:
    name: br4
    external: true

fragile bridge creation

From time to time there can be broken bridges, and removing and recreating them is problematic. In my case docker kept complaining that there was an existing network using br1, but docker network ls showed nothing. Deleting the bridge via brctl did not help; after restarting the host and starting docker it still appeared in ip a. Then I did docker network prune and restarted, which did not help either. Finally I removed /var/lib/docker/network/files/local-kv.db and br1 disappeared at last. After recreating br1 everything works as expected.
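
The recovery that finally worked, as a sketch (warning: this wipes docker's local network state, so all user-defined networks have to be recreated):

sudo systemctl stop docker.service
sudo rm /var/lib/docker/network/files/local-kv.db
sudo systemctl start docker.service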

references

  1. iptables: The two variants and their relationship with nftables
  2. libvirt handle firewalld reload signal inside the driver
  3. podman Netavark and Aardvark network manager
  4. Docker networking fails after iptables service is restarted
  5. podman-network-reload
  6. netavark and CNI lost rules after firewalld reload
  7. podman workaround for firewalld
  8. firewalld the future is nftables
  9. docker live restore
  10. internet sharing with firewalld
  11. how docker publishes ports
  12. Prevent Docker from manipulating iptables