The idea here is that we have some request that never resolves the socket with a response (i.e., never calls `response.end()`). When that happens, the request socket remains open indefinitely. If the request comes from a browser, there is a hard timeout limit after (if I recall correctly) 60 seconds.
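As a minimal illustration of the bug class (hypothetical route and `db.findUser` helper, not one of ours), a handler that bails out of an error branch without ending the response is all it takes:

```js
const express = require('express');
const app = express();

app.get('/users/:id', (req, res) => {
  db.findUser(req.params.id, (err, user) => {
    if (err) {
      console.error(err);
      return; // BUG: no res.status(500).end() -- this socket is now a zombie
    }
    res.json(user);
  });
});
```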
The problem is that with server-to-server API calls, the requesting client will never time out the connection (unless we do so manually). This means that if we accumulate `ulimit -n` worth of zombie sockets, the server can no longer assign sockets to new client connections. The behavior we see is that a request comes in and just sits in the waiting pool for an available socket, which will never free up because we have a ton of open sockets and no handles to them.
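If we want to confirm we're actually creeping toward the `ulimit -n` ceiling, Node's `server.getConnections()` reports the live socket count; something like the sketch below, polled periodically, would show the leak (the interval here is arbitrary, and `server` is assumed to be whatever `http.Server` instance the app listens on):

```js
// Sketch: periodically log the number of open sockets on the HTTP server.
setInterval(() => {
  server.getConnections((err, count) => {
    if (err) return;
    console.log('open sockets: %d', count);
    // If this climbs steadily and never drops, we're leaking handles.
  });
}, 10000);
```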
We don't yet know which route is causing the problem. It's very likely an API call, since browser-initiated calls will time out.
Chris and I discussed (and Isaac confirmed) that the best bet is to stick some middleware in front of all the routes with a hard timeout and logging. We can define a threshold (say, 45 seconds; according to JP, our longest valid request is 22 seconds, which is itself gross and something we need to fix) and treat anything above that threshold as a zombie'd request. When a request crosses it, we can 500 and send something to loggins with a high-alert level indicating which route is trying to hold sockets open. We can also manually close the socket so the server doesn't get knocked out of rotation.
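A sketch of that middleware, assuming an Express app (`loggins.alert` is a placeholder for whatever our logging client actually exposes):

```js
// Hard-timeout middleware: flag any request older than 45s as a zombie,
// 500 it, alert, and destroy the socket so the fd is actually released.
const ZOMBIE_THRESHOLD_MS = 45 * 1000;

function zombieKiller(req, res, next) {
  const timer = setTimeout(() => {
    // Placeholder for our real logging client -- high-alert level so it pages.
    loggins.alert('zombie request', { method: req.method, url: req.url });

    if (!res.headersSent) {
      res.statusCode = 500;
      res.end();
    }
    // Make sure the underlying socket is released even if end() was swallowed.
    req.socket.destroy();
  }, ZOMBIE_THRESHOLD_MS);

  // Clear the timer on normal completion so healthy requests are untouched.
  res.on('finish', () => clearTimeout(timer));
  res.on('close', () => clearTimeout(timer));

  next();
}

app.use(zombieKiller); // must be registered before all the routes
```

Clearing the timer on both `finish` and `close` covers responses that complete normally as well as clients that hang up early, so only genuinely stuck requests ever hit the alert path.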