This is sample code to help people understand why Signal Handling works the way it does.
We'll be using Resque in the new signal handling mode to better work with POSIX signal handling. Let's set this up on Heroku.
$ git clone https://gist.github.com/54bc80701a83d2cb2913.git
$ heroku create
$ heroku addons:add heroku addons:add rediscloud:20
$ git push heroku master
When a dyno gets cycled due to a restart or having to be killed off Heroku will send a SIGTERM
and then a SIGKILL
10 seconds later if the process hasn't been gracefully shutdown. You'll see a "R12 (Exit timeout)" in your logs if this happens. Now let's set Resque's SIGTERM
timeout to demonstrate this.
$ heroku config:add RESQUE_TERM_TIMEOUT=10
We need to enqueue a job and cause a SIGTERM
to be sent.
$ heroku run "bundle exec ruby enqueue.rb"
$ heroku restart
Now when running heroku logs -t -p worker
you'll see something like this:
013-09-16T23:05:36.422618+00:00 heroku[worker.1]: Starting process with command `env TERM_CHILD=1 QUEUES=* bundle exec rake resque:work`
2013-09-16T23:05:38.249677+00:00 app[worker.1]: PID: 2
2013-09-16T23:05:38.432792+00:00 heroku[worker.1]: State changed from starting to up
2013-09-16T23:06:53.682989+00:00 app[worker.1]: Performing work: 2013-09-16 23:06:53 +0000
2013-09-16T23:06:53.682989+00:00 app[worker.1]: sleep 1
2013-09-16T23:06:54.686517+00:00 app[worker.1]: sleep 2
2013-09-16T23:06:56.687011+00:00 app[worker.1]: sleep 3
2013-09-16T23:06:59.688428+00:00 app[worker.1]: sleep 4
2013-09-16T23:07:03.688640+00:00 app[worker.1]: sleep 5
2013-09-16T23:07:08.689879+00:00 app[worker.1]: sleep 6
2013-09-16T23:07:11.180422+00:00 heroku[worker.1]: Stopping all processes with SIGTERM
2013-09-16T23:07:11.978975+00:00 app[worker.1]: ensure started 2013-09-16 23:07:11 +0000
2013-09-16T23:07:11.978975+00:00 app[worker.1]: sleep 1
2013-09-16T23:07:12.985010+00:00 app[worker.1]: sleep 2
2013-09-16T23:07:14.989352+00:00 app[worker.1]: sleep 3
2013-09-16T23:07:17.993877+00:00 app[worker.1]: sleep 4
2013-09-16T23:07:19.433624+00:00 heroku[worker.1]: Error R12 (Exit timeout) -> At least one process failed to exit within 10 seconds of SIGTERM
2013-09-16T23:07:19.433934+00:00 heroku[worker.1]: Stopping remaining processes with SIGKILL
2013-09-16T23:07:21.716823+00:00 heroku[worker.1]: Process exited with status 137
Notice the "R12" that happens. This is because the SIGTERM
timeout takes longer than the 10 second allotted time by Heroku. "But we set it to 10!" That's a good point, but remember this is the time from SIGTERM
to SIGKILL
and in this time Resque and other code must run. Iwould recommend setting this to some value less than 10, like 5. This gives Resque enough time to clean up after itself as well.
Resque needs time to clean up when a SIGTERM
happens and if it gets SIGKILL
happens, it denies Resque the chance to do this. You'll notice stale workers in your Resque UI. If you deploy this code. And this is due to the fact that Resque can't clean up workers if the worker dies on a distributed system. If you try this code locally, you won't notice stale workers b/c prune_dead_workers
works on a local box but not in a distributed environment like Heroku. Resque can't tell if a worker is alive or dead.
Now let's set the timeout to something more reasonable like 5.
$ heroku config:add RESQUE_TERM_TIMEOUT=5
Re-enquing the job and restarting:
$ heroku run "bundle exec ruby enqueue.rb"
$ heroku restart
Let's check our logs:
2013-09-16T23:53:54.348137+00:00 heroku[worker.1]: Starting process with command `env TERM_CHILD=1 QUEUES=* bundle exec rake resque:work`
2013-09-16T23:53:55.836592+00:00 heroku[worker.1]: State changed from starting to up
2013-09-16T23:53:55.897159+00:00 app[worker.1]: PID: 2
2013-09-16T23:54:36.031549+00:00 app[worker.1]: Performing work: 2013-09-16 23:54:36 +0000
2013-09-16T23:54:36.031549+00:00 app[worker.1]: sleep 1
2013-09-16T23:54:37.031835+00:00 app[worker.1]: sleep 2
2013-09-16T23:54:39.032235+00:00 app[worker.1]: sleep 3
2013-09-16T23:54:39.729718+00:00 heroku[worker.1]: Stopping all processes with SIGTERM
2013-09-16T23:54:40.527182+00:00 app[worker.1]: ensure started 2013-09-16 23:54:40 +0000
2013-09-16T23:54:40.527182+00:00 app[worker.1]: sleep 1
2013-09-16T23:54:41.527583+00:00 app[worker.1]: sleep 2
2013-09-16T23:54:43.528268+00:00 app[worker.1]: sleep 3
2013-09-16T23:54:44.404581+00:00 app[worker.1]: PID: 2
2013-09-16T23:54:41.985160+00:00 heroku[worker.1]: Starting process with command `env TERM_CHILD=1 QUEUES=* bundle exec rake resque:work`
Notice the "R12" error is gone and it doesn't add more stale workers. :)