Instantly share code, notes, and snippets.

Embed
What would you like to do?
Problem solving high memory usage with node.js, websockets, socket.io, and redis.

Problem: My Node.js processes (drones on Nodejitsu) are repeatedly growing in memory until they crash and get restarted. They are multiple processes responding to websockets connections over socket.io, using socket.io's Redis store to broker messages across connections and broadcast messages.

Conclusion: that socket.io's Redis store's memory usage likely either grows indefinitely, or is just way too high. We also isolated semi-frequent reconnects to the redis client on the part of each node.js process, though this looks unrelated to the overall memory problem.

Possibly related: https://github.com/LearnBoost/socket.io/issues/1040 https://github.com/LearnBoost/socket.io/issues/463

=== edited chat log of #nodejitsu on 25/11/2012 follows ==

(08:16:46 PM) konklone: huh - could someone check if my drone is dead or hanging? username: konklone, app: sockets
(08:16:50 PM) konklone: getting an 8s response time again
(08:17:14 PM) coderarity: alright
(08:17:26 PM) konklone: I have 2 drones behind it
(08:18:38 PM) coderarity: yeah, they're slow, that's for sure :P
(08:18:54 PM) konklone: both of them?
(08:19:04 PM) coderarity: yeah, let me see if i can figure out why
(08:19:07 PM) konklone: thank you 
(08:36:50 PM) coderarity: konklone: it seems to be using a ton of memory, and then crashing and restarting
(08:37:21 PM) coderarity: you might have a ton of traffic, or maybe you're right and it's a memory leak
(08:39:37 PM) coderarity: konklone: this is happening on both of your drones

(08:41:15 PM) konklone: coderarity: the restarting is a cycle? can you see how long it takes for them to build up memory and crash and restart?
(08:42:20 PM) coderarity: no, but maybe you can with `jitsu logs`
(08:47:26 PM) konklone: wow
(08:47:29 PM) konklone: looks like every 30 minutes
(08:47:51 PM) konklone: or even as fast as every 18 minutes
(08:47:53 PM) konklone: holy crap
(08:49:35 PM) konklone: or even as fast every 6 minutes in some cases.

(08:54:27 PM) coderarity: konklone: i think that your app was using lke 80 MB of memory after it restarted one time, seems like a lot to me
(08:55:30 PM) konklone: seems like a lot to me too, though I'm not doing anything special
(08:55:40 PM) konklone: socket.io + redis-store + websockets
(08:55:52 PM) coderarity: it was at like 345 MB when I checked before it crashed, and i saw about 50 connections - that puts it at 265/50 which is about 5 MB per connection, seems like a lot too
(08:56:32 PM) konklone: just auths a redis server, hands it off to socket.io, and handles new connections and broadcasts a bunch of events
(08:56:40 PM) konklone: the only events though, are "I'm here" and "my mouse position"
(08:56:58 PM) konklone: there are, however, a large number of memory leak threads on the socket.io ticket tracker
(08:57:04 PM) konklone: and a lot of built up frustration about it not being dealt with
(08:57:30 PM) konklone: so I suspect it's not 5MB/connection
(08:57:39 PM) konklone: but that old connections are leaving memory behind after they leave
(08:57:46 PM) coderarity: alright
(08:58:29 PM) AvianFlu: it's quite possible
(08:58:51 PM) coderarity: that must be a lot of memory left behind for 6 minutes :P
(08:59:15 PM) konklone: ...yeah
(08:59:32 PM) coderarity: konklone: if you want to spawn a test app I can watch the memory while you add and remove connections, and we can see if there's a leak
(08:59:55 PM) konklone: okay - what's the easiest way to do a test app for it?
(09:00:07 PM) coderarity: just change the subdomain/appname and `jitsu deploy`
(09:00:28 PM) konklone: oh right, server-side test app, sure
(09:03:26 PM) konklone: okay
(09:03:35 PM) konklone: I have an app called sockets-test
(09:03:38 PM) konklone: username konklone
(09:03:50 PM) konklone: I have two clients connected, both from my own IP (separate tabs in my browser)
(09:04:01 PM) coderarity: okay, lemme ssh in
(09:04:58 PM) coderarity: 26 MB right now
(09:05:41 PM) konklone: that's from 2 connections current, 4 connections total (I re-connected them both)
(09:05:45 PM) AvianFlu: how about now
(09:06:08 PM) coderarity: 25.8 MB
(09:06:17 PM) konklone: okay, I'm gonna reconnect 10 times
(09:06:39 PM) konklone: okay that was 10 re-connections
(09:07:10 PM) coderarity: 25,948KB
(09:07:45 PM) konklone: okay now I'm going to move my mouse a lot
(09:07:58 PM) konklone: this generates a lot of throughput
(09:08:08 PM) AvianFlu: likewise
(09:08:10 PM) AvianFlu: check it now, coderarity
(09:08:31 PM) coderarity: i mean, 25372KB
(09:08:49 PM) coderarity: it was at 26224KV for a second
(09:08:53 PM) konklone: and that's even with several more concurrents, since others joined in
(09:09:14 PM) coderarity: hmm, just saw it at 27480KB
(09:09:20 PM) AvianFlu: strange
(09:09:37 PM) konklone: that's remarkably constant
(09:09:39 PM) coderarity: now it's back at 2536 KB
(09:09:47 PM) coderarity: 25136*
(09:10:22 PM) coderarity: maybe we should get a ton of concurrent connections going, and then disconnect them all at once
(09:10:42 PM) coderarity: just to see how much memory it gets up to with ~50 connections
(09:10:50 PM) konklone: okay - I'll open up, let's say 60 tabs with that URL
(09:10:59 PM) konklone: and then close the whole window at aonce

(09:15:03 PM) coderarity: do you have them up right now?
(09:15:06 PM) konklone: yeah
(09:15:12 PM) coderarity: -_-
(09:15:16 PM) konklone: they're not generating a lot of activity besides the occasional heartbeat
(09:15:18 PM) coderarity: 28524 MB
(09:15:33 PM) coderarity: we must be missing something :O
(09:15:41 PM) konklone: hmm
(09:15:57 PM) konklone: it's using the same Redis URL as the production server was (though the production server is off, so it's got the Redis client to itself)
(09:15:58 PM) coderarity: let me see if the connections are open with netstat -a
(09:16:09 PM) konklone: maybe it's the Redis activity
(09:16:15 PM) coderarity: oh yeah, they are :P
(09:17:20 PM) coderarity: konklone: i am going to add more drones to the main one, i made it so i can go past the limit temporarily
(09:17:47 PM) konklone: okay
(09:17:57 PM) coderarity: we'll see what happens
(09:18:03 PM) konklone: btw - I'm gonna do the mass shut off of the 60 connections now
(09:18:27 PM) konklone: shut the tabs
(09:18:28 PM) coderarity: okay
(09:18:41 PM) coderarity: ah, still at 28832 KB
(09:19:45 PM) coderarity: dropped to 23680 KB, must have been garbage collector taking time :P
(09:19:53 PM) konklone: huh, wow
(09:20:19 PM) konklone: I wonder if the garbage collector is prevented from running so long as there's at least one connection...?
(09:21:37 PM) coderarity: like, if evey connection held a reference to every other connection, so when only 1 was left there was still those references?
(09:21:50 PM) konklone: yeah
(09:22:03 PM) coderarity: could try it
(09:22:04 PM) konklone: that doesn't sound terribly implausible, when you say it like that
(09:22:12 PM) coderarity: yeah :P
(09:22:55 PM) konklone: okay, I have 60 connections open again
(09:23:10 PM) konklone: how's memory?
(09:23:11 PM) coderarity: 30.8 MB
(09:23:13 PM) konklone: okay
(09:23:18 PM) konklone: I'm gonna shut off all but 1
(09:24:41 PM) konklone: okay
(09:24:43 PM) konklone: closed all but 1
(09:24:55 PM) coderarity: okay, let's give the GC a few seconds
(09:25:13 PM) coderarity: dropped to 25504 KB
(09:25:58 PM) konklone: which is basically what it started at
(09:26:08 PM) konklone: okay, just closed the last one
(09:27:47 PM) coderarity: it was still pretty steady
(09:27:53 PM) coderarity: 25.6MB or something
(09:28:04 PM) konklone: yeah, I'm inclined to say we did not find evidence of a memory leak in these tests
(09:28:17 PM) konklone: though neither did we find evidence of single connections taking up a lot of memory
(09:28:40 PM) coderarity: yeah
(09:29:05 PM) konklone: and I really don't think I was taking up more than 120 connections across the 2 drones I had when I first brought this up
(09:29:14 PM) coderarity: indeed
(09:29:22 PM) coderarity: i saw like 90-something
(09:29:39 PM) konklone: shall I start the production one back up and we'll take another look at it?
(09:30:00 PM) coderarity: yeah, put it at 5 drones
(09:30:29 PM) konklone: okay, starting it
(09:31:39 PM) konklone: I suppose one thing that's different about production is I do have Internet Explorer 8/9 clients attempt to start a flashsockets connection to the server
(09:31:54 PM) konklone: they don't get through because nodejitsu's node-http-proxy is not configured to pass on those kind of connections
(09:32:34 PM) konklone: though I suppose they could be dragging the proxy server in some way
(09:32:40 PM) konklone: anyway the production server is started
(09:32:59 PM) konklone: looks like ~20 connections
(09:33:28 PM) coderarity: okay, lemme check it out
(09:34:00 PM) konklone: in the dev console, running "i=0; for (other in others) {i++}; console.log(i)" will print out the number of connections that are broadcasting to each other
(09:34:27 PM) konklone: 23 now
(09:35:48 PM) coderarity: around 52 MB on each drone
(09:36:50 PM) konklone: wow
(09:36:58 PM) konklone: and only 15 concurrent right now
(09:37:01 PM) konklone: the churn is high though
(09:37:06 PM) konklone: lots of people in and out
(09:37:27 PM) konklone: wow on each of 5 drones
(09:37:58 PM) coderarity: seems to be more memory usage than the test, but nothing crazy
(09:38:09 PM) coderarity: no crashes yet, but i'll keep an eye on them
(09:39:34 PM) konklone: thank you
(09:41:44 PM) coderarity: wow, one of these drones is already at 114 MB
(09:43:15 PM) coderarity: konklone: ooops, got them mixed up, all the drones have been staying around 60 MB
(09:43:29 PM) konklone: hah
(09:43:32 PM) konklone: okay
(09:43:42 PM) coderarity: the first one did go up to 114 MB the first time, though
(09:43:50 PM) konklone: hmm, ok
(09:45:50 PM) Sly: konklone: Sorry to see you're still having problems with this. Hopefully it gets worked out, though. :)
(09:46:19 PM) konklone: yeah, I
(09:46:29 PM) konklone: 'm just glad I have enough time to test it out and figure out how to best do this
(09:46:38 PM) konklone: and very glad to have such a responsive support team/channel
(09:48:25 PM) coderarity: okay, so the second drone has stayed at around 100-120 MB consistently
(09:49:45 PM) coderarity: third one was at 60, then down to about 40, now at about 110 MB
(09:50:13 PM) konklone: that's wild
(09:50:50 PM) konklone: stayed up steady for 18 minutes so far
(09:51:34 PM) konklone: traffic down to 234 people on the prod site at any one time (obviously not all of them are establishing socket connections)
(09:51:47 PM) konklone: weird, I'd expect more than 27 out of 234
(09:51:56 PM) konklone: (27 is the concurrent connections I see in my tab)
(09:57:39 PM) konklone: here's something interesting: the test drone, and the 5 production drones, they share the same redis store
(09:58:04 PM) konklone: so the test drone will experience a lot of the load that the system at large is experiencing
(10:00:45 PM) konklone: huh, the test drone is not streaming through the connections from the production drones as I would expect
(10:01:19 PM) coderarity: yeah, and only at 24 MB
(10:02:03 PM) konklone: well fuck
(10:02:12 PM) konklone: apparently I was *not* having any of them run through the redis store
(10:02:24 PM) konklone: I had turned it off in config.js, they're all doing memory store
(10:02:30 PM) konklone: so there's no redis brokering at all
(10:02:31 PM) coderarity: well, that might isolate the location of the problem :P
(10:02:43 PM) konklone: all right, I'll turn on redis brokering and restart the production drones
(10:06:59 PM) konklone: this might make a big difference
(10:07:05 PM) coderarity: i hope it does :P

(10:14:24 PM) konklone: coderarity: production drones are re-deployed with redis brokering
(10:15:16 PM) coderarity: okay
(10:15:27 PM) coderarity: test drone is still at 25 MB
(10:15:32 PM) coderarity: let me go look at production drones
(10:15:52 PM) konklone: I am now redeploying the test drone with redis brokering
(10:16:05 PM) konklone: it will be interesting because the *only* load the test drone will experience is messages sent from redis
(10:16:21 PM) konklone: it won't have any connections of its own (assuming you all closed your connections to it as well)
(10:16:23 PM) AvianFlu: a good isolation test
(10:16:32 PM) AvianFlu: this is what we'll have to do here
(10:16:42 PM) AvianFlu: keep thinking of ways to isolate different pieces of it until one stands out
(10:17:50 PM) konklone: ok, test drone is deployed with redis brokering - will briefly connect just to verify that it's getting the events passing through the production site
(10:17:59 PM) konklone: yep, it is
(10:18:04 PM) konklone: disconnected from it
(10:23:49 PM) coderarity: okay, let me see all the memory stuff
(10:24:42 PM) coderarity: 1st drone - 41 MB, second drone - 40 MB 3rd drone - 65 MB test drone - 37 MB
(10:24:51 PM) coderarity: those were not all done at exactly the same time, btw
(10:25:42 PM) coderarity: 1 - 39 MB 2 - 39 MB 3 - 68 MB test - 32 MB
(10:26:00 PM) konklone: is the production app only using 3/5 drones?
(10:26:13 PM) coderarity: no, that's all i've got open right now
(10:26:26 PM) coderarity: i've cleared up a few terminal windows so i'll go ahead and look at the last two
(10:29:45 PM) konklone: ok
(10:30:35 PM) coderarity: 4 is at 80 MB and 5 is at 58 MB
(10:30:51 PM) coderarity: 1st is at 80 MB too
(10:32:03 PM) konklone: this seems about the same as before
(10:32:06 PM) coderarity: test drone is at about 50 MB
(10:32:13 PM) coderarity: which is almost twice as much as it was before
(10:32:16 PM) coderarity: yeah
(10:32:23 PM) konklone: yeah, but the test drone wasn't seeing any broadcasts at all
(10:32:33 PM) konklone: it's actually handling connections from redis now
(10:32:37 PM) konklone: or messages, rather
(10:33:17 PM) konklone: socket.io is going to make a new connection object when it gets a new connection message from redis, so that's expected
(10:33:27 PM) coderarity: 56, 72, 70, 91, 57 in order, test at 56
(10:33:29 PM) konklone: the raw (actual) connection doesn't take up much memory, it's socket.io's bookkeeping
(10:33:30 PM) coderarity: all in MB
(10:33:31 PM) konklone: hmm
(10:33:44 PM) konklone: yeah, so there's like a baseline load caused by redis
(10:33:51 PM) konklone: though there's no evidence that redis causes a memory leak
(10:34:22 PM) konklone: just that the brokering costs memory, which is expected
(10:34:38 PM) konklone: splitting the load across drones only helps with the front half of the load
(10:34:48 PM) konklone: connections on drone 4 still cause drone 1 to do work
(10:34:52 PM) konklone: just, less work
(10:35:52 PM) coderarity: i see
(10:36:21 PM) konklone: I'm contemplating writing my own brokering using sockjs
(10:36:30 PM) konklone: which is a vastly thinner layer over websockets than is socket.io
(10:36:44 PM) coderarity: i mean, the test drone is now at 67 MB, up like 10 MB, and afaik nothing is affecting that other than redis stuff
(10:36:49 PM) konklone: huh...
(10:37:05 PM) konklone: yeah, only 36 concurrent showing to my in-browser client
(10:37:48 PM) konklone: really shouldn't be generating this much memory usage
(10:38:20 PM) coderarity: 108, 93, 70, 106, 92 in order, test at 79
(10:38:26 PM) konklone: wow
(10:38:30 PM) konklone: that is a direct upward trajectory
(10:38:44 PM) coderarity: yeah
(10:38:51 PM) coderarity: and it only happens with redis working between all the drones
(10:39:20 PM) konklone: this makes a ton of sense though - the most obvious memory leak is one where clients show up and then leave, but their leaving doesn't get reflected across all drones
(10:39:26 PM) konklone: I actually observed this behavior on my dev machine
(10:39:31 PM) konklone: with just two processes
(10:39:45 PM) konklone: have 2 people show up to each one, for 4 total
(10:39:50 PM) konklone: both processes see 4 clients
(10:39:55 PM) konklone: then disconnect the 2 from one process
(10:39:58 PM) coderarity: i see
(10:40:01 PM) konklone: that process sees just 2, but the other one sees 4 still
(10:40:31 PM) AvianFlu: not propagating disconnect events properly, perhaps?
(10:41:20 PM) konklone: this looks about right: https://github.com/LearnBoost/socket.io/issues/1040
(10:41:47 PM) konklone: and the OP says he solved it by switching to SockJS, with the same node version
(10:42:06 PM) coderarity: WOAHHH
(10:42:17 PM) coderarity: 5th drone has a TON of TIME_WAIT connections
(10:42:29 PM) konklone: what does that mean?
(10:42:32 PM) coderarity: hmmm, hold up
(10:43:05 PM) AvianFlu: a socket goes through a number of different states
(10:43:11 PM) AvianFlu: there are at least 8 or 10 in total
(10:43:35 PM) AvianFlu: TIME_WAIT is the state they're in after they're closed, and the kernel keeps them dead to cool down and make sure both ends are really disconnected
(10:43:44 PM) AvianFlu: too many of them, and some kind of socket pooling should be introduced
(10:43:48 PM) AvianFlu: apparently not happening here
(10:44:21 PM) coderarity: konklone: do you use aws at all?
(10:44:39 PM) konklone: not that I know of
(10:44:49 PM) konklone: the Redis server is on redistogo
(10:44:50 PM) AvianFlu: how about redistogo
(10:44:53 PM) konklone: which actually, yeah, is on AWS
(10:44:55 PM) AvianFlu: which is on AWS
(10:44:57 PM) AvianFlu: :D
(10:45:03 PM) konklone: though there's a finite number of connections to redis, though - 3 per drone
(10:45:26 PM) coderarity: i don't see any AWS connections on any other drone, only the last one
(10:45:34 PM) coderarity: and 47 TIME_WAIT connections to AWS on that last drone
(10:46:01 PM) AvianFlu: it's apparently not reusing sockets
(10:46:07 PM) AvianFlu: but 47 isn't that many TIME_WAIT
(10:46:20 PM) AvianFlu: the connection churn on the redis instance, though, could add up
(10:46:36 PM) konklone: I don't think the node drones are re-connecting to redis, though...
(10:46:50 PM) konklone: or at least, it's not like every user connection causes a redis connection
(10:47:08 PM) AvianFlu: try to add some logging around it maybe
(10:58:51 PM) konklone: ok
(11:09:14 PM) coderarity: konklone: 1st drone is now at 217 MB, about to crash
(11:09:19 PM) konklone: wow
(11:09:34 PM) coderarity: second is at 227 MB
(11:10:04 PM) konklone: actually coderarity: the logs show a whole bunch of crashes already
(11:10:06 PM) coderarity: 21 MB for third, only 140 MB for 4th
(11:10:14 PM) coderarity: oh, maybe I missed them :O
(11:10:42 PM) coderarity: test drone is similar, 212 MB
(11:11:31 PM) AvianFlu: coderarity: there you go!
(11:11:44 PM) AvianFlu: it's definitely the pubsub somehow, and dangling references in the JS
(11:11:49 PM) coderarity: yes
(11:14:39 PM) konklone: I guess if we see a lot of excess logging, then the TIME_WAITs from AWS are explained
(11:14:44 PM) konklone: I don't know if that explains all the memory usage though
(11:14:48 PM) konklone: we may be seeing two independent phenomena
(11:14:55 PM) konklone: the TIME_WAITs being less impactful than the memory leaks
(11:14:57 PM) konklone: this is my suspicion anyway
(11:15:06 PM) coderarity: i think it's a JS thing
(11:15:37 PM) konklone: I would certainly like to isolate it because even if I do switch to SockJS, I'll be using a Redis store
(11:16:08 PM) AvianFlu: this could just be a leak in socket.io's redis store
(11:16:13 PM) AvianFlu: the TIME_WAIT sockets wouldn't cause OOM
(11:16:17 PM) AvianFlu: they'd cause other slowdowns
(11:16:30 PM) konklone: yeah I suspect it is a leak in socket.io's redis store, or in some effect of using *any* non-memory store
(11:16:42 PM) konklone: where processes are keeping track of clients that did not originally connect to that client
(11:18:14 PM) konklone: okay, I deployed to the production drones with redis logging
(11:18:16 PM) konklone: deploying to test drone now
(11:19:40 PM) konklone: what's memory usage like?
(11:20:12 PM) konklone: I am seeing some redis reconnects though not with a frenzy or anything
(11:20:18 PM) konklone: once every 40 seconds or so
(11:20:37 PM) AvianFlu: that sounds about right
(11:20:42 PM) AvianFlu: that number of sockets he saw wasn't insane
(11:20:51 PM) AvianFlu: and they only stay in that state for 2 or 3 minutes
(11:21:13 PM) konklone: yeah
(11:21:25 PM) konklone: okay now it's a little faster
(11:21:27 PM) konklone: every 6-8s
(11:21:42 PM) AvianFlu: any idea what they're coming from?
(11:22:30 PM) konklone: interestingly, they're always from the "pub" and "client" clients
(11:22:34 PM) konklone: not the "sub" client (that receives data)
(11:22:43 PM) konklone: "pub" sends data to redis, and "client" is there to issue other commands
(11:26:29 PM) konklone: yeah, lots of reconnects every 6-15s
(11:26:32 PM) konklone: redis reconnects
(11:26:36 PM) konklone: I think that explains the TIME_WAITs to me
(11:26:49 PM) konklone: and by "lots", I mean "about one reconnect every 6-15s"
(11:26:53 PM) konklone: which is unexpected, sure
(11:26:59 PM) coderarity: over 5 drones?
(11:27:03 PM) konklone: right
(11:27:17 PM) konklone: so yeah for each one, probably once every 30s-75s
(11:27:22 PM) konklone: or it's probably staggered there
(11:27:33 PM) konklone: every 40s or so, like I first thought (while the other drones were still starting, maybe)
(11:28:09 PM) konklone: so, okay
(11:28:10 PM) konklone: I think
(11:28:14 PM) konklone: I am going to switch to SockJS
(11:28:18 PM) konklone: with my own very light Redis brokering
(11:28:45 PM) konklone: and probably holding on to nothing extra
(11:29:04 PM) konklone: each drone only manages the connections connected right to it
(11:29:10 PM) konklone: and ignores all irrelevant messages from redis
(11:29:44 PM) coderarity: cool
(11:33:22 PM) AvianFlu: also - this might not necessarily be a leak
(11:33:38 PM) AvianFlu: if all 5 drones are all keeping the total aggregate state of the system in-memory inadvertently 
(11:33:46 PM) AvianFlu: the total pile of memory might just add up that high
(11:34:01 PM) AvianFlu: and they'd be out-of-sync a bit based purely on load and coincidence
(11:34:17 PM) AvianFlu: and leaks here and there would compound it
(11:34:53 PM) AvianFlu: and all our efforts to add more drones would just increase the amount of communication happening, lolz
(11:37:01 PM) coderarity: i mean, adding more drones didn't really help
(11:37:49 PM) konklone: yeah, this is also true
(11:38:04 PM) konklone: but whether it's keeping the state of the system, or an actual leak - both can be fixed by ditching socket.io's redis store
(11:38:16 PM) konklone: and making it so processes do not keep the aggregate state of the system
(11:38:19 PM) AvianFlu: yes.  I support your decision.
(11:38:30 PM) konklone: :)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment