Skip to content

Instantly share code, notes, and snippets.

@murvinlai
Created June 17, 2011 21:38
Show Gist options
  • Star 3 You must be signed in to star a gist
  • Fork 2 You must be signed in to fork a gist
  • Save murvinlai/1032413 to your computer and use it in GitHub Desktop.
Save murvinlai/1032413 to your computer and use it in GitHub Desktop.
Socket Hang up problem - A sample to generate the problem.
/*
* This is a simple HTTP data generator for testing.
*
*/
var http = require('http');
var counter = 0;
http.createServer(function (req, res) {
var start = new Date();
var myCounter = counter++;
var timeout = 50; // default timeout. Mimic how much time it takes to run a process.
// Controlled long response time causing timeout in client side.
if ( (myCounter % 20000) == 0) {
console.log('-------- reach 20000 ------------');
timeout = 600000; // 10 minutes
}
// give it some timeout
setTimeout(function() {
var output = '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>\n';
output += '<myData>ABCDE'+myCounter+'</myData>\n';
console.log("output: " + myCounter);
res.writeHead(200, {'Content-Type': 'application/xml'});
res.write(output);
res.end();
}, timeout);
}).listen(3015);
console.log('Server running at port 3015');
var http = require('http');
// option1 is the from Google. GoogleAPI usually has less timeout and causing less issue.
//https://ajax.googleapis.com/ajax/services/search/web?v=1.0&q=Paris%20Hilton&callback=foo&context=bar
var agent1 = http.getAgent('ajax.googleapis.com',80);
agent1.maxSockets = 2000;
var options1 = {
agent:agent1,
host: 'ajax.googleapis.com',
port: 80,
path: '/ajax/services/search/web?v=1.0&q=Paris%20Hilton&callback=foo&context=bar',
method: 'GET',
headers:{
"Connection":"keep-alive"
}
};
// option2 is your own server that feeds data.
var agent2 = http.getAgent('localhost', 80);
agent2.maxSockets = 2000;
var options2 = {
agent:agent2,
host:'localhost',
port:80,
path:'/',
method:'GET',
headers:{
"Connection":"keep-alive"
}
};
// option3 is my AWS micro instance. It has controlled timeout for ever 20000 request.
var agent3 = http.getAgent('50.19.237.122', 3015);
agent3.maxSockets = 2000;
var options3 = {
agent:agent3,
host:'50.19.237.122',
port:3015,
path:'/',
method:'GET',
headers:{
"Connection":"keep-alive"
}
};
// set which
var workingOption = options3;
var counter = 0;
var server = http.createServer(function (req, res) {
var myCounter = counter ++;
console.log("Start: counter: " + myCounter);
var start = new Date(); // just for timing.
var clientRequest = http.request(workingOption, function(response) {
var result = '';
response.setEncoding('utf8');
res.writeHead(response.statusCode, {'Content-Type': 'application/xml'});
response.on('data', function (chunk) {
result += chunk;
if (!res.write(chunk)) { // if write failed, the stream is choking
response.pause(); // tell the incoming stream to wait until output stream drained
}
}).on('end', function () {
var end = new Date();
console.log("End: counter: " + myCounter + ' Time: ' + (end-start) + " chunk: " + result.substring(0, 20));
res.end();
});
});
clientRequest.on('error', function(e) {
console.error("ERROR: " + JSON.stringify(e));
console.log("ERROR: " + JSON.stringify(e));
});
clientRequest.end();
});
server.listen(3204);
console.log('Server running at port 3204');
@murvinlai
Copy link
Author

test-simple-http-data.js - a node.js server to generate response data.
test-simple-http-server.js - a node.js server that will use http.request to connect remote server to get data (e.g. googleapi, or the one runnign test-simple-http-data.js).

The Socket hang up problem and other unpleasent behaviour can be reproduced with this test.

To run the test, you need 3 machines / servers.

Machine #1). your company (mac or linux) that can run ab benchmark or other load testing tool. (ab benchamrk is limited to 1024 concurrent thread sent out from Mac OS, even you increase the ulimit ).

Machine #2). If you want to run test-simple-http-data.js for data feed, then you need a server. In my test, I use ubuntu on
Amazon AWS EC2 micro instance. I set the ulimit to 90000 soft / 130000 hard. So, theoritically, I can have 90000 sockets running on this server. Under Load testing from machine #1. Machine #2 has a through put of ~2400 request / second.
Macine #2 has a controlled timeout for every 20000 http request.

Machine #3). The node.js server. When it runs, it can connect to either googleapi, machine #2, or your own server. It is configurable inside the code. In my test, I also run on Ubuntu, Rackspace VM. The unlimit is also 90000 soft / 130000 hard.
Machine #1 will send 80000 request with 1000 concurrent level. e.g. ab -n 80000 -c 1000 http://machine#3IP:3204
In addition, http.agent.maxSockets is set to 2000.

  • Http.request / Socket Behavior *
  1. When you start Machine #3 and begin load testing, regardless you are doing http.request to googleapi, or machine#2, or your own server, you will observe that the time require to get the response from remote server will rise from few hundreds ms to few seconds. Sometimes will go over 10 seconds. However, at certain point (depending your concurrent level), the response time will drop back to few hundreds ms or even below 100 ms. You can see the response time from the Machine#3 log.

  2. http.request should never expect remote server be very responsive as googleAPI does. That's why machine#2 controlled timeout kick in. If you configure Machine #3 to call machine #2, you will see the following:

a) For every 20000 request, Machine #2 will issue a timeout. Once there is a timeout of one request, all other subsequent requests will slow down.
b) Once it is timeout, a "Socket hang up" error (pointing to http.js:1285) will be thrown. You can see from the log.
c) If there are few numbers of timeout, most of the http.request response will be slow. If you let the load testing run long enough time, you may see all responses are as slow as 10 seconds or more.
d) If you run a long load testing, after a while and after a number of socket hang up problems, it will either:
i) all subsequent http.request will thrown "socket hang up" problem, or
ii) the server in Machine#3 will not response. It is not die but not responsive to any request from Machine#1, or you manually request from web browser.

  1. What I have tried before is that:
    a) Have my own timeout routine for http.request. If timeout, then it will do a http.clientRequest.abort().
    b) if there is a timeout or socket hang up, then retry the http.request by sending a separate http.request again. It works in the begining of the load testing.
    c) Both methods only works when in the beginning of the load testing when there are only a few socket hang up or timeout problem. However, once there are more socket hang up, it won't work anymore. and all the request will get a socket-hang up until restarting the server.

  2. Two ways that I do to go around the problem:
    a) If there are socket problem, then restart the server manually.
    b) I use Pedro's "fugue" to auto restart server. If there are more than 100 socket hang up or timeout, then I will kill that process, and fugue will restart for me. Then the server will be working and able to take more socket hang up errors.

  3. My thought: It has something to do with node handling connection pool and socket, and how to recover socket. I don't think it is linux OS problem as I have run it on different VM/OS but with reproducible error. I've also tried running two different node.js. Even one node has socket problem, the other can still run.

@murvinlai
Copy link
Author

Lastest Discovery:

There is a leak when there is socket timeout. Also, when there is a timeout (by default in http.js it is 2 minutes), it supposes to throw error and caught by http.clientRequest.on('error' ... ). But it doesn't happen every time.

So, here is my test. I run: ab -n 100000 -c1000 myurl for multiple times.
http.agent.maxSockets is 2000
OS ulimit for open file is 13000 hard, 90000 soft.

while it is running & testing, I run lsof -i 4 -p PID | wc -l
first run: start at 2 sockets, then goes up to 2700, after done (without socket hang up), it backs to 734 and stay there unchange.
2nd run: start at 734 sockets, then goes up to 3000, down to 1366
3rd run: start at 1366, then goes up to 3000+, down to 1999 .. And at the last 20000 requests, it just keep throwing socket hang up error. (Because it reaches the http.agent.maxSockets maximum).
4th run: cannot run anymore.

@murvinlai
Copy link
Author

More Discovery:

I have to revert what I said. It is complicated. socket timeout may cause slow down. But, it isn't the cause for the node.js crash. It is the socket & connection pool problem. I will explain later.

For my test, I do the following:

http.agent.maxSocket = 2000
ulimit -a, open files is 90000

run: ab -n1000 -c10 myurl

during that, I run: lsof -i 4 -p PID | wc -l

and I repeat this couple times. sometimes run ab-n10000 -c500 myurl

what I observer is that the # of socket used by the node process increase and doesn't come down.
e.g. for the first run, it started with 2, then up to 120.. then back down to 70 and stay there forever after all the test request are sent.
During first call, there was only one socket timeout. Before the socket timeout error was thrown, the # of socket was 71, then once it threw, then it was down to 70.

I run the 2nd time, the # of socket went from 70 to 160, at the end back down to 145 and stay there forever after all the test request are sent.
When socket error is throw (socket hang up), the number of sockets will go down by the # of socket error was thrown. For example, before the socket error was thrown, the # of socket was 150. If there were 5 socket hang up error, then the # of socket will decrease to 145.

For subsequent test, the # of socket keeps increasing.

If the # of socket opened is 500, I send test 1 request. (ab -n1 -c1 myurl). after the request is done, the # of socket opened for node.js process may stay at 500. However, it may also increase to 501. This is expected because of connection pool.

There were cases that during the test run, the # of socket exceed the http.agent.maxSockets (2000). However, when that happen, a lot of socket hang up will be thrown. That is expected.

However, the # of socket opened stay at 2000 or above, then I cannot send any http.request anymore. Everything will be timeout. Even I send one request. This 2000 socket is still availabe but not accessible.

I think somewhere in the http.js or net.js code check the current opened socket. If it is more than maxSocket, then it will not take any more request, nor Garage collect any old request.

@murvinlai
Copy link
Author

I look into http.js and here I think I got the bug for why socket broken / or cannot establish socket when run for long time:

Agent.prototype._establishNewConnection = function() {
var self = this;
assert(this.sockets.length < this.maxSockets); <------

So, what happen is that, when I run the node.js for long time. The # of socket in the connection pool will increase. However, overtime, the # of sockets in the pool will not drop back to what the initial number is.

Therefore, if there are lot of requests, and it reach the maximum, new request cannot establish the connection. In fact, this is normal.

HOWEVER, if there are no more request (silence) for a while, and the # of socket stay at the maxium socket allowed or even above, then there is no way to establish new connection even the sockets in the pool are not doing anything. so, the node.js server will stuck unless restarting server.

May be need to run self._cycle(); before the assert?

@murvinlai
Copy link
Author

I dig deeper.. and the code doesn't stop at assert(this.sockets.length < this.maxSockets); <------
when #of sockets >= this.maxSockets;

it stop even before reaching that code.
in _cycle = function() {

...

if (!haveConnectingSocket && this.sockets.length < this.maxSockets) { <----
this._establishNewConnection();
}

}

That's where prohibit the code from establish new connection.

I think the bug is in the socket checking code :

for (var i = 0; i < this.sockets.length; i++) {
var socket = this.sockets[i];
// If the socket doesn't already have a message it's sending out
// and the socket is available for writing or it's connecting.
// In particular this rules out sockets that are closing.
if (!socket._httpMessage &&
((socket.writable && socket.readable) || socket._httpConnecting)) {
debug('Agent found socket, shift');
// We found an available connection!
this.queue.shift(); // remove first from queue.
assert(first._queue === this.queue);
first._queue = null;

  first.assignSocket(socket);
  httpSocketSetup(socket);
  self._cycle(); // try to dispatch another
  return;
}

if (socket._httpConnecting) haveConnectingSocket = true;

}

I think

if (!socket._httpMessage &&
((socket.writable && socket.readable) || socket._httpConnecting)) {

doesn't catch the case that if the socket is in the connection pool, but actually doing nothing.

I believe some sockets do not have the proper stage set after the request is done. and those sockets will not deplete forever.

@murvinlai
Copy link
Author

so.. there are two issues:

  1. some sockets cannot free up. ( i called it ghost)
  2. if over time, the # of ghost socket reaches the maxSockets, then no more new socket can be created, nor any ghost socket will be clean up. ( there is a check in http.js that do socket clean up when maxSocket is reach, but not for the ghost).

@murvinlai
Copy link
Author

var http = require('http');

var query = 'v=12345';

var agent = http.getAgent('myremote.com', 80);
agent.maxSockets = 10;

var options = {
agent:agent,
host:'myremote.com',
port:80,
path:'/getcall?'+query,
method:'POST',
headers:{
"host":'myremote.com',
"user-agent": 'node.js',
"Connection":"keep-alive",
"Keep-Alive": "timeout=10, max=10"
}
};

var callMe = function(counter) {
var client = http.request(options,
function(res) {
console.log(res.statusCode + " : " + counter);
console.log("sock: "+ counter + " :lenght:" + agent.sockets.length);

            if (agent.sockets.length>(agent.maxSockets/10*9)) {
                console.log("keep: "+ counter + " :keepalive is false:");
                client.shouldKeepAlive = false;
            }
            res.on('data', function(chunk) {
                console.log("data: " + counter + " " + chunk);
                //client.destroy();
            });
            res.on('end', function() {
                console.log("done: " + counter + ":socket:" + agent.sockets.length + " :queue:" + agent.queue.length 
                            + ' :finished:' + client.finished
                            + ' :_last:' + client._last
                            + ' :writable:' + client.writable);
                console.log(client.socket);

            });
          }
);
client.end();

}
var index = 0;
setInterval(function() {
console.log("Call Set Interval");
for (var i=0; i<100;index++, i++ ) {
if (index <= 300) {
callMe(index);
}
}
}, 10000);

var cappedArray = [];
var cappedMax = 10;

var allSocketsCleanup = function () {
for (var i=0; i< agent.sockets.length; i++) {
console.log("destroying");
if (!agent.sockets[i]._httpMessage) {
agent.sockets[i].destroy();
}
}
};

setInterval(function() {
console.log("check agent.sockets.length" + agent.sockets.length);
var socketLength = agent.sockets.length;

if (cappedArray.length <= cappedMax) {
    cappedArray.push(socketLength);
    if (socketLength >= agent.maxSockets) {
        allSocketsCleanup();
    }
} else {
    cappedArray.push(socketLength);
    var prevSocketLength = cappedArray.shift();

    if (prevSocketLength == socketLength) {
        allSocketsCleanup();
        console.log('current length ' + agent.sockets.length);
    }
}

},500);

@murvinlai
Copy link
Author

for the above code, it works on the original 0.4.9 http.js not the fix one. (not sure about the http-fix.js has the same socket problem yet).

key point:

  1. for http connection that close to the 90% of the maxium socket, I switch it to "Connection":"close" in order to avoid connection being max out & keep-alive.
  2. I have a routine to constantly check the # of sockets. If the # of sockets do not go down, or max out, then I will start force destroying socket (that have no httpMessage).

I have run some test and haven't seen any error thrown. and I can see the # of socket go back down.

@murvinlai
Copy link
Author

Use this http.js and https.js will fix the problem.

  • copy them to your nodejs/lib folder. replace the old ones.
  • rebuild node again.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment