raggi/systemd-networkd-dhcp-loss-under-load.md Secret

## systemd-networkd-dhcp-loss-under-load.md

      
### systemd-networkd Could not set DHCPv4 address: Connection timed out

I have reproduced this bug on systemd 245 (245.4-4ubuntu3.20). From reading the
source code, the bug appears to still be present
(https://github.com/systemd/systemd/blob/45a6a2aace8315137b648193a8265997b3c267fb/src/network/networkd-dhcp4.c#L781):
a timeout in the netlink reconfiguration stage of a DHCPv4 refresh does not yet
appear to be handled.

#### Steps to reproduce


1. Configure a machine with a DHCPv4 lease on a network with a DHCPv4 server.
2. Place the machine under unusual load, sufficient to cause netlink requests to time out.
3. Observe the interface failing with the following logs:

systemd-networkd[139370]: eth0: Could not set DHCPv4 address: Connection timed out
systemd-networkd[139370]: eth0: Failed

This situation appears to be much easier to produce, and more common, under
unusually high load in a credit-based virtualized compute environment. Other
users have discussed instances of this problem on both AWS and GCP:

- https://repost.aws/questions/QU-IJlEVo2Q0enTASXvagUCw/t3-micro-ec2-instance-dropped-off-network
- https://serverfault.com/questions/1126222/gcp-vm-using-cloud-nat-loses-internet-connection
- https://serverfault.com/questions/1125634/linux-server-loses-network-connectivity-after-an-oom-event
- https://www.reddit.com/r/linux4noobs/comments/pmigxp/issues_with_systemdnetworkd/
- https://askubuntu.com/questions/1148980/run-directory-goes-out-of-space-and-server-goes-non-responsive
- coreos/bugs#2020
- https://repost.aws/questions/QUi1GW7UkqQDWrP8Sx90kiEA/ubuntu-instance-lost-connection-and-crash

Many of these issues are reported as coincident with OOM events, full storage,
and so on. Those coincidences are a distraction: at most they demonstrate that
the systems in question are under load and may well be running out of compute
credits. The timeout in question should not be affected by storage pressure,
nor by OOM, unless the OOM killer terminates systemd-networkd, which would
result in different symptoms.

#### Expected behavior

The DHCPv4 client should retry the lease refresh when the issue is a timeout,
eventually succeeding in these scenarios.
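
As a rough illustration of the retry behavior described above, here is a minimal C sketch. It is not systemd code: `fake_set_address` and `set_address_with_retry` are hypothetical names, the fake request simply fails with `-ETIMEDOUT` a fixed number of times, and the backoff values are arbitrary assumptions. Real code would arm an event-loop timer rather than loop inline.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/* Hypothetical stand-in for the netlink "set address" request; this is NOT a
 * systemd function. It fails with -ETIMEDOUT for the first *fail_count calls
 * and then succeeds, imitating a machine that is briefly overloaded. */
static int fake_set_address(int *fail_count) {
    if (*fail_count > 0) {
        (*fail_count)--;
        return -ETIMEDOUT;
    }
    return 0;
}

/* Retry only when the request timed out, up to max_attempts tries. Any other
 * result (success or a different error) is returned immediately.
 * Returns 0 on success or the last negative errno-style value. */
static int set_address_with_retry(int *fail_count, unsigned max_attempts,
                                  unsigned *attempts_used) {
    int r = -ETIMEDOUT;
    unsigned delay_ms = 100; /* assumed initial backoff, purely illustrative */

    assert(fail_count != NULL);
    assert(attempts_used != NULL);

    *attempts_used = 0;
    for (unsigned attempt = 1; attempt <= max_attempts; attempt++) {
        *attempts_used = attempt;
        r = fake_set_address(fail_count);
        if (r != -ETIMEDOUT)
            return r;

        /* A real implementation would schedule the retry on a timer here;
         * this sketch only reports the delay it would have used. */
        printf("attempt %u timed out, would retry in %u ms\n", attempt, delay_ms);
        if (delay_ms < 3200)
            delay_ms *= 2; /* capped exponential backoff */
    }
    return r;
}
```

Calling `set_address_with_retry` with a counter of two simulated timeouts and a limit of five attempts succeeds on the third attempt; with more simulated timeouts than allowed attempts it gives up and returns `-ETIMEDOUT`, which is the point at which failing the link would be reasonable.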

#### Actual behavior

Permanent loss of connectivity on the affected interface.

#### Additional information

In the reproduction environment, the DHCPv4 configuration is also set to
preserve addresses on other failure modes. The code path for the address-set
timeout passes through `link_enter_failed`, which unconditionally passes
`may_keep_dhcp:false`. In most of these cases it is likely desirable for this
kind of potentially transient condition to retain the addresses throughout the
retry process.
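
One way to picture the suggested change is to derive the keep-addresses decision from the error instead of hard-coding `false`. The sketch below is an assumption-laden illustration, not a patch against systemd: `error_is_transient` and `may_keep_dhcp_on_failure` are hypothetical helpers, and the set of errors treated as transient is my own choice.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical helper, not part of systemd: classify negative errno-style
 * results that are plausibly transient. The exact list is an assumption. */
static bool error_is_transient(int r) {
    switch (r) {
    case -ETIMEDOUT: /* the netlink request timed out, as in this report */
    case -EAGAIN:
    case -EINTR:
        return true;
    default:
        return false;
    }
}

/* Hypothetical policy: keep the DHCP-provided addresses only when the
 * administrator asked for them to be preserved (keep_configured) AND the
 * failure looks transient, so a later retry can still succeed. */
static bool may_keep_dhcp_on_failure(bool keep_configured, int r) {
    return keep_configured && error_is_transient(r);
}
```

Under this policy a timeout on a link configured to preserve addresses would leave the existing lease in place during retries, while a non-transient error such as `-EINVAL` would still drop the configuration as it does today.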