me:
Is it possible to send a signal to a worker/process?
I realize the platform sends `SIGTERM` and `SIGKILL` when restarting a dyno, but I need to send a `USR1` to one of my workers to tell it to stop picking up new jobs. Normally this is achieved via `kill -USR1 <pid>`, but on the Heroku platform not only do we not know the pid, we also don't run one-off commands on the same dyno.
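For readers unfamiliar with the pattern, here is a minimal sketch of what "stop picking up new jobs on `USR1`" means inside a worker. It is a toy illustration in Python, not how Sidekiq itself is implemented: the `Worker` class and the three job functions are hypothetical, and the example assumes a Unix-like system where `SIGUSR1` exists. The second job sends the signal to its own process to stand in for an operator running `kill -USR1 <pid>`.

```python
import os
import signal
from collections import deque


class Worker:
    """Toy job worker that stops fetching new jobs once SIGUSR1 arrives."""

    def __init__(self, jobs):
        self.queue = deque(jobs)
        self.quiet = False
        self.done = []
        # Register the handler; this must happen in the main thread.
        signal.signal(signal.SIGUSR1, self._on_usr1)

    def _on_usr1(self, signum, frame):
        # Only flip a flag; the job that is currently running is not interrupted.
        self.quiet = True

    def run(self):
        # The flag is checked between jobs, so in-progress work always finishes.
        while self.queue and not self.quiet:
            job = self.queue.popleft()
            job()
            self.done.append(job.__name__)


def send_report():
    pass


def resize_image():
    # Stand-in for an operator running `kill -USR1 <pid>` mid-job.
    os.kill(os.getpid(), signal.SIGUSR1)


def sync_inventory():
    pass


worker = Worker([send_report, resize_image, sync_inventory])
worker.run()
print("finished:", worker.done)
print("left in queue:", len(worker.queue))
```

The job that was running when the signal arrived still completes; only the jobs behind it stay in the queue for another process to pick up.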
Caio (heroku support):
We had this feature experimentally at some point but it was never productized. I recommend you find other ways to signal your processes, like setting a database flag.
me:
Thanks for the quick reply.
It is unfortunate that feature has not been further investigated and productized. The suggested workaround is far from ideal, and not very realistic.
The whole idea of a `USR1` signal is to signal the current process to take some action - in this case, stop taking action. But a new process, which will be spun up after the app is restarted, will have no knowledge of that signal and will get on with doing its thing - working jobs. Using a database flag introduces a whole host of complexity around managing that state across app restarts and needing to customize (read: hack or monkey patch) existing tools to be aware of that flag.
Caio (heroku support):
That's very reasonable feedback. I'm routing this to our platform engineers.
@wuputah,
Thanks for getting back to me, and sorry for the late reply - I'm not getting notifications when folks reply to my Gists.
To the question at hand, my understanding of how `heroku scale` works is that it would send a `SIGTERM` to the excess workers (in my case, the 1 running dyno). If the worker has not shut itself down after 10 seconds, Heroku sends a `SIGKILL` to forcefully kill it.

What I need is the ability to send a `USR1` signal, which tells this particular worker (Sidekiq) to stop taking on new jobs, finish any in progress, and then gracefully shut down. In my case, the majority of jobs run in 1-2 seconds, but due to network connectivity they may occasionally take 10+ seconds. And I have a few jobs which run 20-30 seconds.

Imagine something like the following deploy script:
`USR1` to worker processes

This gives the workers as much time as possible to finish up any work they are doing. You could even imagine a step 2.5 that ensures the workers are stopped before deploying any new code.
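The "quiet first, then make sure the worker has stopped" idea can be sketched locally as follows. This is only an illustration under stated assumptions: it drives a local child process on a Unix-like system, not a Heroku dyno (which, per the thread above, offers no way to deliver `USR1`). The `CHILD` script and the `drain` helper are hypothetical names, and the 30-second grace period is an arbitrary choice; only the 10-second `SIGTERM`-to-`SIGKILL` window comes from the description above.

```python
import signal
import subprocess
import sys

# Hypothetical stand-in worker: on USR1 it finishes its current short job and exits.
CHILD = r"""
import signal, sys, time

quiet = False

def on_usr1(signum, frame):
    global quiet
    quiet = True

signal.signal(signal.SIGUSR1, on_usr1)
print("ready", flush=True)   # tell the parent the handler is installed
while not quiet:
    time.sleep(0.05)          # pretend to work one short job at a time
sys.exit(0)
"""


def drain(proc, grace_seconds):
    """Send USR1, wait for a clean exit, and only then fall back to TERM and KILL."""
    proc.send_signal(signal.SIGUSR1)
    try:
        proc.wait(timeout=grace_seconds)
        return "drained"
    except subprocess.TimeoutExpired:
        proc.terminate()              # SIGTERM
        try:
            proc.wait(timeout=10)     # same 10-second window described above
            return "terminated"
        except subprocess.TimeoutExpired:
            proc.kill()               # SIGKILL
            proc.wait()
            return "killed"


proc = subprocess.Popen(
    [sys.executable, "-c", CHILD], stdout=subprocess.PIPE, text=True
)
proc.stdout.readline()                # block until the child reports "ready"
outcome = drain(proc, grace_seconds=30)
print(outcome, proc.returncode)
```

A deploy script built this way would only proceed to pushing new code once `drain` has returned, which is the role the "step 2.5" check would play.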
Does that all make sense? Any ideas or suggestions?