Skip to content

Instantly share code, notes, and snippets.

@aronisstav
Last active September 10, 2018 08:41
Show Gist options
  • Star 2 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save aronisstav/1a0e2d5a6ed49cd2f61611bb3e1640f6 to your computer and use it in GitHub Desktop.
Save aronisstav/1a0e2d5a6ed49cd2f61611bb3e1640f6 to your computer and use it in GitHub Desktop.
Workers of a bottom supervisor may be left running when a top supervisor is shutdown

Workers of a bottom supervisor may be left running when a top supervisor is shutdown

This is an unexpected behaviour discovered by Concuerror.

Overview

Four modules are used:

  1. test.erl contains the configuration and entry point of the test.
  2. sup1.erl is the code of a top-level supervisor
  3. sup2.erl is the code of a bottom-level supervisor
  4. worker.erl is the code of a gen_server worker

In test.erl, a supervisor tree is started from a top process, with the top-level supervisor having a shutdown timeout and the bottom-level supervisor not having one. The top process then asks the supervisor to exit, by sending it a shutdown signal.

The finding is that the worker may be left running, even though no suspicious code is executed in the test.

To reproduce:

  1. Get Concuerror
  2. Compile all modules
  3. In the same directory as the four .beam files run: concuerror -m test -t test_hanging_worker -k (option names are shorthands for --module, --test and --keep_going)
  4. In the generated concuerror_report.txt you can see a deadlock reported.

Explanation

A copy of the concuerror report is part of this Gist (zz-concuerror_report.txt). The "action" starts after step 29, after the shutdown exit signal is delivered to the top supervisor.

  1. The top supervisor begins exiting.
  2. Preparing for a gentle shutdown, the top supervisor sets a monitor and unlinks from the bottom one.
  3. The top supervisor sends a shutdown exit signal to the bottom one and waits for a DOWN message from the monitor.
  4. The bottom supervisor traps the signal and also begins exiting.
  5. Preparing for a gentle shutdown, the bottom supervisor sets a monitor and unlinks from the worker.
  6. The top supervisor's gentle shutdown timeout expires, so it sends a kill signal to the bottom one.
  7. The bottom supervisor receives the kill signal and is killed, while unlinked from the worker, leaving the worker in limbo.

The important context switch is the one between steps 5 and 6.

-module(sup1).
-behaviour(supervisor).
%% API
-export([start_link/1]).
%% Supervisor callbacks
-export([init/1]).
-define(SERVER, ?MODULE).
%%%===================================================================
%%% API functions
%%%===================================================================
start_link(Args) ->
supervisor:start_link({local, ?SERVER}, ?MODULE, Args).
%%%===================================================================
%%% Supervisor callbacks
%%%===================================================================
init(Args) ->
#{sup1_shutdown := Sup1Shutdown} = Args,
SupFlags = #{strategy => one_for_one,
intensity => 1,
period => 5},
AChild = #{id => sup2,
start => {sup2, start_link, [Args]},
restart => permanent,
shutdown => Sup1Shutdown,
type => supervisor,
modules => [sup2]},
{ok, {SupFlags, [AChild]}}.
-module(sup2).
-behaviour(supervisor).
%% API
-export([start_link/1]).
%% Supervisor callbacks
-export([init/1]).
-define(SERVER, ?MODULE).
%%%===================================================================
%%% API functions
%%%===================================================================
start_link(Args) ->
supervisor:start_link({local, ?SERVER}, ?MODULE, Args).
%%%===================================================================
%%% Supervisor callbacks
%%%===================================================================
init(Args) ->
#{sup2_shutdown := Sup2Shutdown} = Args,
SupFlags = #{strategy => one_for_one,
intensity => 1,
period => 5},
AChild = #{id => worker,
start => {worker, start_link, [Args]},
restart => permanent,
shutdown => Sup2Shutdown,
type => worker,
modules => [worker]},
{ok, {SupFlags, [AChild]}}.
-module(test).
-compile(export_all).
%% In this test we are only interested in deadlocks. Processes may
%% exit abnormally as they are killed by the supervisors.
-concuerror_options_forced([{ignore_error, crash}]).
scenarios() ->
[ test_hanging_worker
].
test_hanging_worker() ->
Args =
#{ sup1_shutdown => 5000
, sup2_shutdown => infinity
, worker_trap_exit => false
},
{ok, TopSup} = sup1:start_link(Args),
%% Unlink from the top supervisor
unlink(TopSup),
%% And ask it to exit, gently
exit(TopSup, shutdown).
-module(worker).
-behaviour(gen_server).
%% API
-export([start_link/1]).
%% gen_server callbacks
-export([init/1, handle_call/3, handle_cast/2, handle_info/2,
terminate/2, code_change/3]).
-define(SERVER, ?MODULE).
-record(state, {}).
%%%===================================================================
%%% API
%%%===================================================================
start_link(Args) ->
gen_server:start_link({local, ?SERVER}, ?MODULE, Args, []).
%%%===================================================================
%%% gen_server callbacks
%%%===================================================================
init(Args) ->
#{worker_trap_exit := TrapExit} = Args,
process_flag(trap_exit, TrapExit),
{ok, #state{}}.
handle_call(_Request, _From, State) ->
Reply = ok,
{reply, Reply, State}.
handle_cast(_Msg, State) ->
{noreply, State}.
handle_info(_Info, State) ->
{noreply, State}.
terminate(_Reason, _State) ->
ok.
code_change(_OldVsn, State, _Extra) ->
{ok, State}.
Concuerror [...] started at 30 Aug 2018 17:29:08
Options:
[{after_timeout,infinity},
{assertions_only,false},
{assume_racing,true},
{depth_bound,500},
{disable_sleep_sets,false},
{dpor,optimal},
{entry_point,{test,test_hanging_worker,[]}},
{exclude_module,[]},
{first_process_errors_only,false},
{ignore_error,[abnormal_exit]},
{instant_delivery,true},
{interleaving_bound,infinity},
{keep_going,true},
{non_racing_system,[]},
{print_depth,20},
{scheduling,round_robin},
{scheduling_bound_type,none},
{show_races,false},
{strict_scheduling,false},
{symbolic_names,true},
{timeout,5000},
{treat_as_normal,[]},
{use_receive_patterns,true}]
################################################################################
Interleaving #6
--------------------------------------------------------------------------------
Errors found:
* Blocked at a 'receive' ("deadlocked"; other processes have exited):
<P.1.1.1/worker> in gen_server.erl line 381
Mailbox contents: []
--------------------------------------------------------------------------------
Event trace:
1: <P>: undefined = erlang:whereis(sup1)
in gen.erl line 299
2: <P>: [] = erlang:process_info(<P>, registered_name)
in proc_lib.erl line 731
3: <P>: <P.1/sup1> = erlang:spawn_opt({proc_lib,init_p,[<P>,[],gen,init_it,[gen_server,<P>,<P>,{local,sup1},supervisor,{{local,sup1},sup1,#{sup1_shutdown=>5000,sup2_shutdown=>infinity,...=>...}},[]]],[link]})
in erlang.erl line 2868
4: <P.1/sup1>: true = erlang:register(sup1, <P.1/sup1>)
in gen.erl line 302
5: <P.1/sup1>: false = erlang:process_flag(trap_exit, true)
in supervisor.erl line 293
6: <P.1/sup1>: undefined = erlang:whereis(sup2)
in gen.erl line 299
7: <P.1/sup1>: {registered_name,sup1} = erlang:process_info(<P.1/sup1>, registered_name)
in proc_lib.erl line 731
8: <P.1/sup1>: <P.1.1/sup2> = erlang:spawn_opt({proc_lib,init_p,[sup1,[<P>],gen,init_it,[gen_server,<P.1/sup1>,<P.1/sup1>,{local,sup2},supervisor,{{local,sup2},sup2,#{sup1_shutdown=>5000,sup2_shutdown=>infinity,...=>...}},[]]],[link]})
in erlang.erl line 2868
9: <P.1.1/sup2>: true = erlang:register(sup2, <P.1.1/sup2>)
in gen.erl line 302
10: <P.1.1/sup2>: false = erlang:process_flag(trap_exit, true)
in supervisor.erl line 293
11: <P.1.1/sup2>: undefined = erlang:whereis(worker)
in gen.erl line 299
12: <P.1.1/sup2>: {registered_name,sup2} = erlang:process_info(<P.1.1/sup2>, registered_name)
in proc_lib.erl line 731
13: <P.1.1/sup2>: <P.1.1.1/worker> = erlang:spawn_opt({proc_lib,init_p,[sup2,[sup1,<P>],gen,init_it,[gen_server,<P.1.1/sup2>,<P.1.1/sup2>,{local,worker},worker,#{sup1_shutdown=>5000,sup2_shutdown=>infinity,worker_trap_exit=>false},[]]],[link]})
in erlang.erl line 2868
14: <P.1.1.1/worker>: true = erlang:register(worker, <P.1.1.1/worker>)
in gen.erl line 302
15: <P.1.1.1/worker>: false = erlang:process_flag(trap_exit, false)
in worker.erl line 29
16: <P.1.1.1/worker>: {ack,<P.1.1.1/worker>,{ok,<P.1.1.1/worker>}} = <P.1.1/sup2> ! {ack,<P.1.1.1/worker>,{ok,<P.1.1.1/worker>}}
in proc_lib.erl line 392
17: <P.1.1/sup2>: receives message ({ack,<P.1.1.1/worker>,{ok,<P.1.1.1/worker>}})
in proc_lib.erl line 348
18: <P.1.1/sup2>: <P.user/user> = erlang:group_leader()
in error_logger.erl line 213
19: <P.1.1/sup2>: {notify,{info_report,<P.user/user>,{<P.1.1/sup2>,progress,[{supervisor,{local,sup2}},{started,[{pid,<P.1.1.1/worker>},{id,worker},{mfargs,{worker,start_link,...}},{restart_type,permanent},{shutdown,infinity},{child_type,...}]}]}}} = error_logger ! {notify,{info_report,<P.user/user>,{<P.1.1/sup2>,progress,[{supervisor,{local,sup2}},{started,[{pid,<P.1.1.1/worker>},{id,worker},{mfargs,{worker,start_link,...}},{restart_type,permanent},{shutdown,infinity},{child_type,...}]}]}}}
in gen_event.erl line 266
20: <P.1.1/sup2>: {ack,<P.1.1/sup2>,{ok,<P.1.1/sup2>}} = <P.1/sup1> ! {ack,<P.1.1/sup2>,{ok,<P.1.1/sup2>}}
in proc_lib.erl line 392
21: <P.1/sup1>: receives message ({ack,<P.1.1/sup2>,{ok,<P.1.1/sup2>}})
in proc_lib.erl line 348
22: <P.1/sup1>: <P.user/user> = erlang:group_leader()
in error_logger.erl line 213
23: <P.1/sup1>: {notify,{info_report,<P.user/user>,{<P.1/sup1>,progress,[{supervisor,{local,sup1}},{started,[{pid,<P.1.1/sup2>},{id,sup2},{mfargs,{sup2,start_link,...}},{restart_type,permanent},{shutdown,5000},{child_type,...}]}]}}} = error_logger ! {notify,{info_report,<P.user/user>,{<P.1/sup1>,progress,[{supervisor,{local,sup1}},{started,[{pid,<P.1.1/sup2>},{id,sup2},{mfargs,{sup2,start_link,...}},{restart_type,permanent},{shutdown,5000},{child_type,...}]}]}}}
in gen_event.erl line 266
24: <P.1/sup1>: {ack,<P.1/sup1>,{ok,<P.1/sup1>}} = <P> ! {ack,<P.1/sup1>,{ok,<P.1/sup1>}}
in proc_lib.erl line 392
25: <P>: receives message ({ack,<P.1/sup1>,{ok,<P.1/sup1>}})
in proc_lib.erl line 348
26: <P>: true = erlang:unlink(<P.1/sup1>)
in test.erl line 21
27: <P>: true = erlang:exit(<P.1/sup1>, shutdown)
in test.erl line 23
28: <P>: exits normally
29: <P.1/sup1>: receives message ({'EXIT',<P>,shutdown})
in gen_server.erl line 381
30: <P.1/sup1>: #Ref<0.1935575814.2529427457.35973> = erlang:monitor(process, <P.1.1/sup2>)
in supervisor.erl line 978
31: <P.1/sup1>: true = erlang:unlink(<P.1.1/sup2>)
in supervisor.erl line 979
32: <P.1/sup1>: receive timeout expires after 0 ms
in supervisor.erl line 981
33: <P.1/sup1>: true = erlang:exit(<P.1.1/sup2>, shutdown)
in supervisor.erl line 954
34: <P.1/sup1>: receive timeout expires after 5000 ms
in supervisor.erl line 955
35: <P.1.1/sup2>: receives message ({'EXIT',<P.1/sup1>,shutdown})
in gen_server.erl line 381
36: <P.1.1/sup2>: #Ref<0.1935575814.2529427457.36059> = erlang:monitor(process, <P.1.1.1/worker>)
in supervisor.erl line 978
37: <P.1.1/sup2>: true = erlang:unlink(<P.1.1.1/worker>)
in supervisor.erl line 979
38: <P.1.1/sup2>: receive timeout expires after 0 ms
in supervisor.erl line 981
39: <P.1/sup1>: true = erlang:exit(<P.1.1/sup2>, kill)
in supervisor.erl line 961
40: <P.1.1/sup2>: exits abnormally (killed)
41: <P.1.1/sup2>: {'DOWN',#Ref<0.1935575814.2529427457.35973>,process,<P.1.1/sup2>,killed} = erlang:send(<P.1/sup1>, {'DOWN',#Ref<0.1935575814.2529427457.35973>,process,<P.1.1/sup2>,killed})
(while exiting)
42: <P.1/sup1>: receives message ({'DOWN',#Ref<0.1935575814.2529427457.35973>,process,<P.1.1/sup2>,killed})
in supervisor.erl line 962
43: <P.1/sup1>: <P.user/user> = erlang:group_leader()
in error_logger.erl line 127
44: <P.1/sup1>: {notify,{error_report,<P.user/user>,{<P.1/sup1>,supervisor_report,[{supervisor,{local,sup1}},{errorContext,shutdown_error},{reason,killed},{offender,[{pid,<P.1.1/sup2>},{id,sup2},{mfargs,{...}},{restart_type,...},{...}|...]}]}}} = error_logger ! {notify,{error_report,<P.user/user>,{<P.1/sup1>,supervisor_report,[{supervisor,{local,sup1}},{errorContext,shutdown_error},{reason,killed},{offender,[{pid,<P.1.1/sup2>},{id,sup2},{mfargs,{...}},{restart_type,...},{...}|...]}]}}}
in gen_event.erl line 266
45: <P.1/sup1>: exits abnormally (shutdown)
################################################################################
Exploration completed!
################################################################################
Warnings:
--------------------------------------------------------------------------------
* Some errors were ignored ('--ignore_error').
################################################################################
Tips:
--------------------------------------------------------------------------------
* Running without a scheduling_bound corresponds to verification and may take a long time.
* An abnormal exit signal killed a process. This is probably the worst thing that can happen race-wise, as any other side-effecting operation races with the arrival of the signal. If the test produces too many interleavings consider refactoring your code.
* A process exited with reason 'shutdown'. This may happen when a supervisor is terminating its children. You can use '--treat_as_normal shutdown' if this is expected behaviour.
* Increase '--print_depth' if output/graph contains "...".
################################################################################
Info:
--------------------------------------------------------------------------------
* Showing progress (-h progress, for details)
* Writing results in concuerror_report.txt
* Automatically instrumented module io_lib
* Showing PIDs as "<symbolic name(/last registered name)>" ('-h symbolic_names').
* "--ignore_error crash" converted to "--ignore_error abnormal_exit"
* Automatically instrumented module error_handler
* Automatically instrumented module test
* Automatically instrumented module sup1
* Automatically instrumented module supervisor
* Automatically instrumented module gen_server
* Automatically instrumented module gen
* Automatically instrumented module proc_lib
* Automatically instrumented module erlang
* Automatically instrumented module lists
* Automatically instrumented module sup2
* Automatically instrumented module worker
* Automatically instrumented module error_logger
* Automatically instrumented module gen_event
* You can see pairs of racing instructions (in the report and '--graph') with '--show_races true'
* Continuing after error (-k)
################################################################################
Done at 31 Aug 2018 16:54:42 (Exit status: error)
Summary: 1 errors, 9/9 interleavings explored
@aronisstav
Copy link
Author

This has now been submitted as https://bugs.erlang.org/browse/ERL-724.

@aronisstav
Copy link
Author

Resulting patch: erlang/otp@3b50bb6

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment