@stevenharman
Last active November 9, 2018 11:12
Sending user signals to Heroku workers/process...

me:

Is it possible to send a signal to a worker/process? I realize the platform sends SIGTERM and SIGKILL when restarting a dyno, but I need to send a USR1 to one of my workers to tell it to stop picking up new jobs. Normally this is achieved via kill -USR1 <pid>, but on the Heroku platform not only do we not know the pid, we also can't run one-off commands on the same dyno.
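(For reference, outside of Heroku this is trivial to script; the pidfile path below is just an example of wherever your worker writes its pid.)

```ruby
# Read the worker's pidfile and send it USR1, e.g. from a deploy or ops script.
# The pidfile path is illustrative -- use wherever your worker writes its pid.
pid = File.read("tmp/pids/sidekiq.pid").to_i
Process.kill("USR1", pid)
```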

Caio (heroku support):

We had this feature experimentally at some point but it was never productized. I recommend you find other ways to signal your processes, like setting a database flag.

me:

Thanks for the quick reply. It is unfortunate that feature has not been further investigated and productized. The suggested workaround is far from ideal, and not very realistic. The whole idea of a USR1 signal is to signal the current process to take some action - in this case, stop taking action. But a new process, which will be spun up after the app is restarted, will have no knowledge of that signal and will get on with doing its thing - working jobs. Using a database flag introduces a whole host of complexity around managing that state across app restarts and needing to customize (read: hack or monkey patch) existing tools to be aware of that flag.
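To make that concrete, the flag approach means something roughly like this on the app side (the model and the patched class/method names are hypothetical, just to illustrate the kind of hack involved):

```ruby
# Hypothetical sketch of the database-flag workaround -- not code I'd want to ship.
# WorkerSetting is an imaginary ActiveRecord model with a boolean "paused" flag.
module PausableFetch
  def retrieve_work
    return nil if WorkerSetting.paused?  # skip dequeueing while the flag is set
    super
  end
end

# Monkey patch the queuing library's fetch strategy to consult the flag.
# (Class and method names are illustrative, not necessarily Sidekiq's actual internals.)
Sidekiq::BasicFetch.prepend(PausableFetch)
```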

Caio (heroku support):

That's very reasonable feedback. I'm routing this to our platform engineers.

@stevenharman
Author

Six weeks later the ticket was closed without comment.

@wuputah

wuputah commented Oct 31, 2012

Thanks for the feedback. We've looked at this in the past and there wasn't really a use case. For your use case, it seems like heroku scale worker=0 would have the desired effect.

Your ticket was read and closed by our platform product team; they won't necessarily respond to each piece of feedback they receive.

@stevenharman
Author

@wuputah,
Thanks for getting back to me, and sorry for the late reply - I'm not getting notifications when folks reply to my Gists.

To the question at hand, my understanding of how heroku scale works is that it would send a SIGTERM to the excess workers (in my case, the 1 running dyno). If the worker has not shut itself down after 10 seconds, heroku sends a SIGKILL to forcefully kill it.

What I need is the ability to send a USR1 signal, which tells this particular worker (Sidekiq) to stop taking on new jobs, finish any in progress, and then gracefully shut down. In my case, the majority of jobs run in 1-2 seconds, but due to network connectivity they may occasionally take 10+ seconds. And I have a few jobs which run 20-30 seconds.
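To be clear about the semantics I'm after, the two signals would do something like this inside the worker (a rough sketch, not Sidekiq's actual implementation; dequeue_job and process are stand-ins):

```ruby
# Rough sketch of the desired signal semantics -- not Sidekiq's real code.
quiet    = false
shutdown = false

Signal.trap("USR1") { quiet = true }    # stop picking up new jobs
Signal.trap("TERM") { shutdown = true } # finish the current job, then exit

until shutdown
  if quiet
    sleep 1                 # drain mode: no new work is dequeued
  else
    job = dequeue_job       # stand-in for the library's fetch
    process(job) if job     # stand-in for running the job
  end
end
```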

Imagine something like the following deploy script (a rough sketch in code follows the list):

  1. Send USR1 to worker processes
  2. Capture DB backup
  3. Push new code
  4. Run migrations
  5. Scale workers back up

This gives the workers as much time as possible to finish up any work they are doing. You could even imagine a step 2.5 that ensures the workers are stopped before deploying any new code.
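Sketched as a script (shelling out to the Heroku CLI; the app name is made up, and since there's no way to send the signal today, step 1 falls back to scaling the workers down):

```ruby
#!/usr/bin/env ruby
# Rough sketch of the deploy flow above. The app name and exact CLI command
# names are illustrative and may differ between Heroku toolbelt versions.

def run!(cmd)
  puts "==> #{cmd}"
  system(cmd) or abort("failed: #{cmd}")
end

run! "heroku ps:scale worker=0 --app my-app"   # 1. (would be: send USR1 to workers)
run! "heroku pg:backups:capture --app my-app"  # 2. capture DB backup
run! "git push heroku master"                  # 3. push new code
run! "heroku run rake db:migrate --app my-app" # 4. run migrations
run! "heroku ps:scale worker=1 --app my-app"   # 5. scale workers back up
```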

Does that all make sense? Any ideas or suggestions?

@wuputah

wuputah commented Jan 2, 2013

Hi Steven - in general terms, there are a number of different ways to accomplish this goal:

  1. Halt generation of new jobs into the queue, and allow current queues to empty.
  2. Halt processing of new jobs, allowing current jobs to finish (but queues are unaffected).
  3. Terminate job processing; workers with jobs in progress should gracefully handle this case and terminate and re-queue as appropriate.

I understand it would be convenient in your use case to send a signal to your library of choice to cause #2 to occur. Obviously that's not currently possible on Heroku. However, since workers must handle #3 every 24 hours during dyno cycling, we don't think this is a particularly viable solution.

In cases of large software changes or migrations, where data in the queue is tied to the software implementation handling that data, #1 may actually be the most viable. For some applications, you would enact this by scaling your web workers to zero and putting your site into maintenance mode.

More flexibility by queuing libraries could also help. For instance, why is a signal the only way to enact this change in Sidekiq's processing? Perhaps there should be another way to enact these sorts of maintenance modes.
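For example, a library could offer a "maintenance mode" driven by a flag in Redis rather than by a Unix signal (purely hypothetical sketch; work_one_job is a stand-in for the library's normal fetch-and-execute step):

```ruby
# Purely hypothetical sketch of a library-provided pause hook -- no signal
# (and therefore no pid) required to stop picking up new jobs.
require "redis"  # redis-rb gem

redis = Redis.new  # connection details omitted

loop do
  if redis.get("workers:paused") == "1"
    sleep 1          # paused: stop picking up new jobs, let in-flight work finish
  else
    work_one_job     # stand-in for the library's fetch-and-execute step
  end
end
```

The application (or a one-off dyno) could then flip the flag with redis.set("workers:paused", "1") to enact the pause.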

@wuputah

wuputah commented Jan 2, 2013

Also, there is some philosophy behind these choices.
http://www.12factor.net/disposability

As noted above, there is an even more aggressive philosophy called crash-only design (particularly in database system or file system design) that notes that abrupt shutdown (e.g. a SIGKILL without chance for cleanup) plus the necessary recovery time at startup is often faster than a graceful shutdown plus graceful startup. This acknowledges that all software, at some point in its life, will be abruptly terminated, whether that is by SIGKILL or by power loss. Ideally, software should be able to recover from this state.

This, too, will inevitably happen to your workers at some point on Heroku (or anywhere for that matter), as the underlying hardware that happens to be running your workers will (eventually) abruptly fail.
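In job-queue terms, that mostly means writing jobs that are safe to run again after being killed mid-flight, e.g.:

```ruby
# A job written to survive abrupt termination: if it's killed mid-run and
# re-queued, running it again is harmless (idempotent).
class ChargeInvoiceJob
  include Sidekiq::Worker   # the pattern applies to any queuing library

  def perform(invoice_id)
    invoice = Invoice.find(invoice_id)  # Invoice is a hypothetical model
    return if invoice.charged?          # already completed on a prior attempt
    invoice.charge!                     # guarded so a retry can't double-charge
  end
end
```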

@vladmiller

Hi wuputah,

Sometimes you might have to issue signals for other reasons - say, to enable node debugging on a running app in order to spot leaks or issues which cannot be found on a local machine.

@jchatel

jchatel commented Dec 30, 2015

Just because Heroku likes to send kill signals all the time doesn't mean we shouldn't be allowed to try to shut down gracefully with #2 (and no, I can't just clear my queue, as I have job sets scheduled in the future).

http://eng.joingrouper.com/blog/2014/06/27/too-many-signals-resque-on-heroku/

@chrisplusplus

I accidentally created an infinite loop with a Messenger bot on Heroku, so my phone was blowing up with about 5 messages per second. I could not kill that process using the Heroku command line. After several thousand messages I ended up pushing a die; (PHP) statement in the offending function. It worked, obviously - but I'm not so sure that was the proper way to handle it.

@stevenharman
Author

For anyone coming back to this, years later... it seems Heroku has, quietly, increased the SIGTERM timeout from 10 to 30 seconds. I don't recall seeing any announcement to that effect, but it's mentioned in two different spots in the Heroku Dev Center docs:

  1. https://devcenter.heroku.com/articles/dynos#shutdown
  2. https://devcenter.heroku.com/articles/limits#exit-timeout

Still not ideal, but better, I guess.
