When you cancel a Jenkins job
Unfinished draft; do not use until this notice is removed.
We were seeing some unexpected behavior in the processes that Jenkins launches when the Jenkins user clicks "cancel" on their job. Unexpected behaviors like:
- apparently stale lockfiles and pidfiles
- overlapping processes
- jobs apparently ending without performing cleanup tasks
- jobs continuing to run after being reported "aborted"
This is an investigation into what exactly happens when Jenkins cancels a job, and a set of best practices for writing Jenkins jobs that behave the way you expect.
First, recall the name and purpose of some Unix process signals:
- HUP, "hangup": the controlling terminal disconnected, or the controlling process died.
- INT, "interrupt": the user typed CTRL+C.
- KILL, "kill": terminate immediately, no cleanup. Can't be trapped.
- TERM, "terminate": terminate, but clean up first. May be trapped. The default signal sent by the kill command.
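These defaults are easy to check from a terminal. Here is a minimal standalone demo (hypothetical, not from the draft) showing that TERM can be trapped while KILL cannot:

```shell
#!/bin/bash

# TERM can be trapped: the handler runs and the script keeps going.
trap 'echo "caught TERM"' TERM
kill -TERM $$            # signal ourselves; the trap runs next
echo "survived TERM"

# KILL cannot be trapped or delayed: the subshell dies on the spot,
# and the parent reports exit status 137 (128 + 9, the signal number).
bash -c 'kill -KILL $$'
echo "status after KILL: $?"
```

Run directly, this prints caught TERM, survived TERM, then status after KILL: 137.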
When a Jenkins job is cancelled, Jenkins sends TERM to the process group of the process it spawned, then immediately disconnects and reports "Finished: ABORTED," regardless of the state of the job. This causes the spawned process and all its subprocesses (unless they were spawned in new process groups) to receive TERM.
This is the same effect as running the job in your terminal and pressing CTRL+C, except that in the latter case INT is sent, not TERM.
Since Jenkins disconnects immediately from the child process and reports no further output:
- it can misleadingly appear that signal handlers are not being invoked (which has misled many to think that Jenkins has KILLed the process);
- it can misleadingly appear that spawned processes have completed or exited, while they continue to run.
When a bash script has trapped a signal, it waits for the currently running foreground command to complete before handling the signal. If a subprocess is stuck or ignoring signals, the controlling script may take a very long time to exit, despite having a signal handler. If one of your subprocesses does not exit promptly when signaled, it is not enough to install a handler in the calling shell script that more forcefully KILLs it, because bash will wait for the subprocess to complete before running that handler.
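This deferral is easy to observe. In this hypothetical sketch, TERM is delivered after about one second, but the trap does not run until the foreground sleep finishes at the three-second mark:

```shell
#!/bin/bash
start=$(date +%s)

# Child script: traps TERM, then runs a foreground sleep.
bash -c 'trap "echo trap ran" TERM; sleep 3' &
pid=$!

sleep 1
kill -TERM "$pid"   # delivered now, at ~1s...
wait "$pid"         # ...but the trap runs only once sleep 3 exits
echo "elapsed: $(( $(date +%s) - start ))s"   # ~3s, not ~1s
```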
If no hashbang is given on the first line of Jenkins' shell command specification, Jenkins defaults to /bin/sh -xe to interpret the script. When -e is active, any spawned process that exits with a nonzero status causes the script to abort. A process killed by a signal exits with status 128 plus the signal number, which is nonzero, so it can misleadingly appear that sh signals its subprocesses and automatically exits when signaled even if a trap is set. The same job may behave differently when written in another language, or when someone pastes in a #!/bin/sh hashbang with no flags.
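A standalone sketch of this illusion (hypothetical; -x is dropped here to keep the output readable): the background child is killed by TERM, wait reports status 143 (128 + 15), and -e aborts the script at that point, without sh propagating or handling any signal itself:

```shell
#!/bin/bash

# The inner script runs under -e, as Jenkins' default /bin/sh -xe would.
status=0
sh -ec '
  sleep 30 &
  kill -TERM $!      # the background sleep dies with status 143
  wait $!            # wait returns 143, and -e aborts right here
  echo "never reached"
' || status=$?
echo "outer saw status: $status"   # 143
```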
In some Jenkins jobs, we have a process (script) that invokes another on a different machine through ssh.
OpenSSH, unlike RSH, does not pass received signals to the remote process, nor does it provide a mechanism to manually send a signal to the remote process, even though this capability is specified in RFC4254. It will pass a CTRL+C character, but only when a pseudoterminal is allocated (i.e. only when invoked with -tt). It will send HUP to the remote process group when it disconnects, but again only when a pseudoterminal is allocated.
Cleanup actions in Bash vs. Python
Recall that scripts may receive TERM (from Jenkins), HUP (from ssh), or INT (from you typing CTRL+C while testing).
The default action for shell scripts upon receiving TERM is to abort, which triggers the EXIT handler just before execution ends. So for shell scripts, the EXIT trap is a good place for cleanup code. (EXIT is not really a signal; it is a POSIX shell mechanism that triggers a handler just before execution ends, regardless of how it ends.)
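A minimal sketch of this pattern (file names and job steps hypothetical):

```shell
#!/bin/bash
set -e

# Create scratch state, then arrange for it to be removed no matter how
# the script ends: normal completion, a -e failure, or an aborting TERM.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"; echo "cleaned up"' EXIT

echo "working in $workdir"
# ... job steps would go here ...
```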
In contrast, the default action for Python upon receiving INT is to raise a KeyboardInterrupt exception, which, if unhandled, causes Python to abort, which in turn triggers any handlers registered with atexit.register() just before execution ends. However, those handlers are by default not invoked when Python receives TERM or HUP; instead, Python aborts immediately. So, for your atexit handler to fire, you must also explicitly trap those signals with signal.signal() and cause execution to end.
If you need to write a wrapper for a process that does not exit when signalled, write it in Python, or use a shell trick (run the subprocess as a background task and wait for it; unlike a foreground command, wait is interrupted as soon as a trapped signal arrives).
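Here is a sketch of that shell trick (the stubborn child and timings are hypothetical). Because the child runs in the background, the wrapper's wait is interrupted as soon as the trapped TERM arrives, and the handler can escalate to KILL instead of blocking until the child gives up:

```shell
#!/bin/bash

# Stand-in for a subprocess that ignores TERM (hypothetical).
stubborn() { trap '' TERM; sleep 5; }

# On TERM: forcibly KILL the child, report, and end execution ourselves.
trap 'kill -KILL "$child" 2>/dev/null; echo "escalated to KILL"; exit 1' TERM

stubborn &           # background, so wait (not the child) is in the foreground
child=$!
wait "$child"
```

Sending this wrapper TERM makes it print escalated to KILL and exit almost immediately; with stubborn run in the foreground instead, the trap would not run until the full sleep had elapsed.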
If you're doing remote orchestration through ssh, always invoke it with -tt so that the remote processes will receive HUP if the ssh connection goes away. If a pseudoterminal causes problems, there are other workarounds, like an 'EOF to SIGHUP' wrapper.
For shell scripts, trap EXIT and place cleanup code there. For Python, use the atexit module to register a cleanup handler, and also use the signal module to either raise an exception or call sys.exit() upon HUP or TERM, like this:

    for s in (signal.SIGHUP, signal.SIGTERM):
        signal.signal(s, lambda signum, _: sys.exit("Received signal %d" % signum))
- Open OpenSSH bug #396: sshd orphans processes when no pty allocated (2002)
- Open OpenSSH bug #1424: Cannot signal a process over a channel (rfc 4254, section 6.9) (2008)
- Overview of standard signals: man 7 signal
- Jenkins bug #JENKINS-17116 "gracefull job termination" (incorrectly states that Jenkins uses KILL)
- Jenkins wiki "Aborting a build"
- RFC4254 SSH Connection Protocol (Signals specified in section 6.9.)
- Greg's Wiki: Sending and Trapping Signals: When is the signal handled?
- GNU Bash Manual: Signals
- Verify that Jenkins signals the process group. It doesn't appear so from Jenkins source code. Maybe the shell is reissuing the signal to its own process group?
- Identify Bash-isms; will a system with a different /bin/sh (like dash) behave differently than I have described here?
- Clean up and include my small scripts which demonstrate each of the assertions above.
- What negative side effects could forcing ssh to perform pty allocation have on a non-interactive script?
- Will signals propagate through