[erlang-bugs] Supervisor terminate_child race
Mon Apr 29 03:52:32 CEST 2013
Hi. I've been digging into an issue filed against Riak Pipe for the
last couple of weeks (https://github.com/basho/riak_pipe/issues/49),
and I've finally tracked it all the way to supervisor.erl.
The issue manifests itself as a supervisor complaining about its child
exiting "with reason noproc in context shutdown_error". Comments in
supervisor:monitor_child/1 warn that this might happen if a "naughty"
child unlinks from its parent. But the child I'm working with doesn't
unlink.
What is happening is that the child is choosing to exit on its own,
while some other process is asking the supervisor to terminate it. The
sequence of monitoring, unlinking, and receiving with zero timeout in
supervisor:monitor_child/1 is insufficient to guarantee catching the
child's EXIT signal. After the supervisor misses the EXIT signal, it
receives the DOWN instead, which has reason noproc.
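
The sequence described above can be sketched roughly as follows. This is
a minimal illustration written for this message, not the actual
supervisor.erl source: the module name exitrace_sketch and the demo/0
helper are hypothetical, and monitor_child/1 here only mirrors the
monitor / unlink / receive-with-zero-timeout steps named in the text.

```erlang
-module(exitrace_sketch).
-export([monitor_child/1, demo/0]).

%% Sketch of the pattern: monitor first, then unlink, then poll the
%% mailbox once (zero timeout) for an 'EXIT' that already arrived.
monitor_child(Pid) ->
    erlang:monitor(process, Pid),
    unlink(Pid),
    receive
        {'EXIT', Pid, Reason} ->
            %% The 'EXIT' was already in the mailbox; also consume the
            %% matching 'DOWN' and report the real exit reason.
            receive
                {'DOWN', _Ref, process, Pid, _} ->
                    {error, Reason}
            end
    after 0 ->
        %% No 'EXIT' seen yet. If the child is exiting right now, its
        %% reason can only reach us through a later 'DOWN' message.
        ok
    end.

%% Hypothetical driver: the child exits on its own while the parent
%% runs the sequence above, so the outcome depends on scheduling.
demo() ->
    Old = process_flag(trap_exit, true),
    Child = spawn_link(fun() -> receive stop -> ok end end),
    Child ! stop,
    Result =
        case monitor_child(Child) of
            {error, Reason} ->
                {caught_exit, Reason};
            ok ->
                receive
                    {'DOWN', _Ref, process, Child, DownReason} ->
                        {down_only, DownReason}
                end
        end,
    process_flag(trap_exit, Old),
    Result.
```

Calling exitrace_sketch:demo() may return {caught_exit, normal},
{down_only, normal}, or {down_only, noproc}, depending on timing. The
last outcome corresponds to the lost exit reason discussed here.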
This is not limited to 'normal' child exits. Any exit reason might be
missed, so this is worse than just log spam, and can inhibit reporting
I have written two tests to demonstrate the behavior:
They use EQC PULSE to make the race happen more often (and
deterministically for repeated runs). The exitrace_sup.erl test uses
the supervisor module (with PULSE disabled) or the pulse_supervisor
module (with PULSE enabled) to show the code behaving in-place (these
modules share the same monitor_child/1 function). The exitrace.erl
test extracts the relevant code to show its behavior specifically. To
demonstrate how a non-normal child exit's reason can be lost, change
exitrace_sup:start_fake_child_link/0 or exitrace:child/0 to include a
call to exit(foobar).
I have not yet attempted to patch the behavior. It wasn't obvious to
me why the code bothers with monitors instead of just relying on the
existing link, so I thought I'd ask for clarification first.