This bug reproduces for me on systemd 245 (245.4-4ubuntu3.20). Looking at
the source code, the bug appears to still be present
(https://github.com/systemd/systemd/blob/45a6a2aace8315137b648193a8265997b3c267fb/src/network/networkd-dhcp4.c#L781).
Handling of a timeout during the netlink reconfiguration stage of a DHCPv4
refresh does not yet appear to be covered.
- Configure a machine with a DHCPv4 lease on a network with a DHCPv4 server.
- Place the machine under unusually heavy load, sufficient to cause netlink requests to time out.
- Observe the interface failing with the following logs:
systemd-networkd[139370]: eth0: Could not set DHCPv4 address: Connection timed out
systemd-networkd[139370]: eth0: Failed
This situation appears to be much easier to produce, and more common, under unusually high load in a credit-based virtualized compute environment. Other users have discussed instances of this problem on both AWS and GCP:
- https://repost.aws/questions/QU-IJlEVo2Q0enTASXvagUCw/t3-micro-ec2-instance-dropped-off-network
- https://serverfault.com/questions/1126222/gcp-vm-using-cloud-nat-loses-internet-connection
- https://serverfault.com/questions/1125634/linux-server-loses-network-connectivity-after-an-oom-event
- https://www.reddit.com/r/linux4noobs/comments/pmigxp/issues_with_systemdnetworkd/
- https://askubuntu.com/questions/1148980/run-directory-goes-out-of-space-and-server-goes-non-responsive
- coreos/bugs#2020
- https://repost.aws/questions/QUi1GW7UkqQDWrP8Sx90kiEA/ubuntu-instance-lost-connection-and-crash
Many of these issues are reported as coinciding with OOM events, full storage, and so on. Those coincidences are a distraction from the root cause; at most they demonstrate that the systems in question were under load and may well have been running out of compute credits. The timeout in question should not be affected by storage pressure, nor by OOM unless the OOM killer terminates systemd-networkd, which would result in different symptoms.
The DHCPv4 client should retry the lease refresh when the issue is a timeout, eventually succeeding in these scenarios.
Permanent loss of connectivity on the affected interface.
In the reproduction condition, the DHCPv4 configuration is also set to preserve
addresses on other failure modes. The code path for the address-set timeout
passes through link_enter_failed, which unconditionally passes
may_keep_dhcp:false. For most of these cases it is likely desirable for this
kind of potentially transient condition to retain the addresses through the
retry process.
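
As a rough illustration of the behaviour requested above, the decision could be sketched as follows. This is not systemd code: `RetryState`, `AddressSetAction`, `classify_address_set_result` and the `max_attempts` retry budget are all hypothetical names invented for this sketch. It only assumes the usual convention that a failed netlink request is reported as a negative errno, with `-ETIMEDOUT` corresponding to the "Connection timed out" log line.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical: what the caller should do after an address-set attempt. */
typedef enum {
        ACTION_DONE,   /* address was configured */
        ACTION_RETRY,  /* transient failure: keep addresses and try again */
        ACTION_FAIL,   /* give up and enter the failed state */
} AddressSetAction;

/* Hypothetical per-link retry bookkeeping. */
typedef struct {
        unsigned attempts;      /* attempts made so far */
        unsigned max_attempts;  /* assumed retry budget, not a real setting */
        bool keep_addresses;    /* whether existing DHCP addresses are retained */
} RetryState;

/* r is 0 or positive on success, a negative errno on failure. */
static AddressSetAction classify_address_set_result(RetryState *s, int r) {
        s->attempts++;

        if (r >= 0)
                return ACTION_DONE;

        /* A timeout is treated as transient while the budget lasts:
         * retain the current addresses instead of tearing the link down. */
        if (r == -ETIMEDOUT && s->attempts < s->max_attempts) {
                s->keep_addresses = true;
                return ACTION_RETRY;
        }

        /* Any other error, or an exhausted budget, falls back to the
         * current behaviour of failing without keeping addresses. */
        s->keep_addresses = false;
        return ACTION_FAIL;
}
```

With `max_attempts` set to 3, two consecutive timeouts would yield `ACTION_RETRY` with the addresses retained, and a third would fall through to `ACTION_FAIL`. Whether a bounded budget or unbounded retry with backoff is more appropriate is left open here.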