https://github.com/greenplum-db/gpdb/pull/8884
I dug into this a bit and found a blog post about the case I care about: https://gavv.github.io/articles/ephemeral-port-reuse/ It says:
Hence, when an ephemeral port is allocated, SO_REUSEADDR enables the kernel to reuse any other non-listening ephemeral port.
The important point here is that the kernel doesn’t check whether there is an opened socket for an ephemeral port, it only checks whether there is a socket in the listening state for that port.
This means that the kernel is free to reuse an ephemeral port of any opened UDP socket (because listen is not used for datagram sockets) and any opened TCP socket for which listen was not called yet.
Please note the word "non-listening". The blog includes a program to verify these claims. I tried it with small modifications (adding getchar() at the end so the fds are not closed when the program exits, and running the program multiple times concurrently to exhaust TCP ports) and verified the "tcp nolisten reuseaddr" case.
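For illustration, here is a minimal sketch of that case. It is not the blog's program: instead of exhausting ephemeral ports, it binds a second SO_REUSEADDR socket explicitly to the port the kernel gave the first one, which exercises the same "non-listening" rule deterministically. The names `reuse_demo`, `bound_reuseaddr_socket` and `run_reuse_demo` are made up for this example, and the behavior described is what I would expect on Linux; other systems may already reject the second bind().

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <unistd.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Hypothetical result record for this sketch; 0 means the call succeeded. */
struct reuse_demo {
    int bind2_errno;   /* errno of the second bind() to the same port */
    int listen1_errno; /* errno of listen() on the first socket */
    int listen2_errno; /* errno of listen() on the second socket */
};

/* Create a TCP socket with SO_REUSEADDR set and bind it to the given port
 * on INADDR_ANY (port 0 lets the kernel pick an ephemeral port).
 * Returns the fd, or -1 with errno set. */
int bound_reuseaddr_socket(unsigned short port)
{
    struct sockaddr_in addr;
    int on = 1;
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    if (fd < 0)
        return -1;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(port);
    if (setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(on)) < 0 ||
        bind(fd, (struct sockaddr *) &addr, sizeof(addr)) < 0) {
        int saved = errno;
        close(fd);
        errno = saved;
        return -1;
    }
    return fd;
}

/* Bind two SO_REUSEADDR sockets to the same port, then listen() on both.
 * Returns 0 if the demo ran, -1 if the first socket could not be set up. */
int run_reuse_demo(struct reuse_demo *out)
{
    struct sockaddr_in addr;
    socklen_t len = sizeof(addr);
    int fd1, fd2;

    assert(out != NULL);
    memset(out, 0, sizeof(*out));

    fd1 = bound_reuseaddr_socket(0);
    if (fd1 < 0)
        return -1;
    if (getsockname(fd1, (struct sockaddr *) &addr, &len) < 0) {
        close(fd1);
        return -1;
    }

    /* Same port, first socket bound but not yet listening. */
    fd2 = bound_reuseaddr_socket(ntohs(addr.sin_port));
    if (fd2 < 0)
        out->bind2_errno = errno;

    if (listen(fd1, 8) < 0)
        out->listen1_errno = errno;
    if (fd2 >= 0) {
        if (listen(fd2, 8) < 0)
            out->listen2_errno = errno;
        close(fd2);
    }
    close(fd1);
    return 0;
}
```

On Linux I would expect `bind2_errno` to be 0 (both binds succeed) and `listen2_errno` to be EADDRINUSE, i.e. the collision only shows up at listen() time.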
I also roughly checked the latest Linux kernel (inet). Here are some tentative conclusions.
Without SO_REUSEADDR, for our case, if bind() fails with EADDRINUSE, it should normally mean that the kernel cannot find an available port for bind(). EADDRINUSE is a bit misleading in this case, though. See the code below. https://github.com/torvalds/linux/blob/1c4e395cf7ded47f33084865cbe2357cdbe4fd07/net/ipv4/af_inet.c#L526
So reusing TIME_WAIT ports seems to be useful.
With SO_REUSEADDR, multiple concurrent bind() calls could bind to the same non-listening port. This does not seem to be a problem in the non-SO_REUSEADDR case. That means that in our code, with SO_REUSEADDR, a subsequent listen() could fail (errno should be EADDRINUSE) even when enough TCP ports are available. Check the below code and its callers for related logic:
So, questions: if you can easily reproduce this, could you please apply your patch and check whether listen() returns an error in your environment? If the above theory is correct, the right fix seems to be:
- Enable the SO_REUSEADDR option.
- Retry when listen() fails with EADDRINUSE, and error out after some tries.

The SO_REUSEADDR/bind behavior is not friendly to programmers, but it follows the socket(7) man page, which unfortunately probably aligns with the related standard.
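A rough sketch of the proposed fix, to make the retry idea concrete. This is not the actual gpdb code: the function name `listen_on_ephemeral_port`, the `max_tries` parameter and the backlog value are assumptions made for this example. Each attempt opens a fresh socket so that bind() picks a new ephemeral port after a collision.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <unistd.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Hypothetical helper: bind a SO_REUSEADDR TCP socket to an ephemeral port
 * and listen on it, retrying with a new socket when bind() or listen()
 * reports EADDRINUSE. Returns the listening fd and stores the port in
 * *port_out, or returns -1 with errno set after max_tries collisions or on
 * any other error. */
int listen_on_ephemeral_port(int max_tries, unsigned short *port_out)
{
    int attempt;

    assert(max_tries > 0 && port_out != NULL);

    for (attempt = 0; attempt < max_tries; attempt++) {
        struct sockaddr_in addr;
        socklen_t len = sizeof(addr);
        int on = 1;
        int saved;
        int fd = socket(AF_INET, SOCK_STREAM, 0);

        if (fd < 0)
            return -1;

        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port = htons(0); /* let the kernel choose the port */

        if (setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(on)) == 0 &&
            bind(fd, (struct sockaddr *) &addr, sizeof(addr)) == 0 &&
            listen(fd, 128) == 0 &&
            getsockname(fd, (struct sockaddr *) &addr, &len) == 0) {
            *port_out = ntohs(addr.sin_port);
            return fd;
        }

        saved = errno;
        close(fd);
        errno = saved;
        if (saved != EADDRINUSE)
            return -1; /* not a port collision, so retrying will not help */
    }

    errno = EADDRINUSE; /* every attempt collided */
    return -1;
}
```

A caller would use it roughly as `fd = listen_on_ephemeral_port(5, &port);` and report an error if it returns -1. The retry count is arbitrary here and would need tuning against how often the collision actually occurs.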