[erlang-bugs] Supervisor terminate_child race

Tim Watson watson.timothy@REDACTED
Mon Apr 29 12:08:14 CEST 2013


Would it be sufficient to add a clause explicitly handling `{'DOWN', MRef, process, Child, noproc}' to the receive block? As I understand it, monitor/2 works synchronously and the 'DOWN' message will be enqueued straight away if the target process is unknown. Or is that assumption unreliable (I'd like to know if it is), and is there more to it?
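
To make the assumption concrete, here is a minimal sketch. It is not taken from supervisor.erl; the module name noproc_demo and its function names are made up for illustration. It only shows that monitoring an already-dead local process produces a 'DOWN' message with reason noproc. It uses a generous timeout, so it does not prove the "straight away" part.

```erlang
-module(noproc_demo).
-export([run/0]).

%% Hypothetical demo, not supervisor code: monitor a process that is
%% already dead and return the reason carried by the 'DOWN' message.
run() ->
    %% Spawn a child that exits immediately and wait for its death
    %% via a first monitor, so we know it is gone.
    {Pid, MRef0} = spawn_monitor(fun() -> ok end),
    receive
        {'DOWN', MRef0, process, Pid, _FirstReason} -> ok
    end,
    %% Monitoring the dead pid again should yield reason noproc.
    MRef = erlang:monitor(process, Pid),
    receive
        {'DOWN', MRef, process, Pid, Reason} -> Reason
    after 1000 ->
        timeout
    end.
```

Calling noproc_demo:run() from the shell should return the atom noproc if the assumption holds on your release.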

Unfortunately I don't have an EQC/PULSE license to be able to test this. It'd be really nice if PULSE were available in the free/mini QC for selected open source projects.

Cheers,
Tim

29 Apr 2013, at 02:52, Bryan Fink wrote:

> Hi. I've been digging into an issue filed against Riak Pipe for the
> last couple of weeks (https://github.com/basho/riak_pipe/issues/49),
> and I've finally tracked it all the way to supervisor.erl.
> 
> The issue manifests itself as a supervisor complaining about its child
> exiting "with reason noproc in context shutdown_error". Comments in
> supervisor:monitor_child/1 warn that this might happen if a "naughty"
> child unlinks from its parent. But, the child I'm working with doesn't
> do that.
> 
> What is happening is that the child is choosing to exit on its own,
> while some other process is asking the supervisor to terminate it. The
> sequence of monitoring, unlinking, and receiving with zero timeout in
> supervisor:monitor_child/1 is insufficient to guarantee catching the
> child's EXIT signal. After the supervisor misses the EXIT signal, it
> receives the DOWN instead, which has reason noproc.
> 
> This is not limited to 'normal' child exits. Any exit reason might be
> missed, so this is worse than just log spam, and can inhibit reporting
> and debugging.
> 
> I have written two tests to demonstrate the behavior:
> 
> https://gist.github.com/beerriot/28258f2a44fc482016b1
> 
> They use EQC PULSE to make the race happen more often (and
> deterministically for repeated runs). The exitrace_sup.erl test uses
> the supervisor module (with PULSE disabled) or the pulse_supervisor
> module (with PULSE enabled) to show the code behaving in-place (these
> modules share the same monitor_child/1 function). The exitrace.erl
> test extracts the relevant code to show its behavior specifically. To
> demonstrate how a non-normal child exit's reason can be lost, change
> exitrace_sup:start_fake_child_link/0 or exitrace:child/0 to include a
> call to exit(foobar).
> 
> I have not yet attempted to patch the behavior. It wasn't obvious to
> me why the code hassled with monitors instead of just relying on the
> existing link, so I thought I'd ask for clarification first.
> 
> Cheers,
> Bryan



