Hi Robin!<div><br></div><div>I have started looking at this, and it does indeed look like a problem we need to investigate further. However, I think your patch is a bit too simple :(  The main problem is that the supervisor does not know where the shutdown message comes from, and I believe it may cause unexpected behavior if the shutdown is received from a process other than the supervisor's parent. If you want to continue with this idea, maybe you could look for a way to hand control back to the gen_server (which the supervisor is built on top of), since it knows who its parent is. This is only an idea, and I do not know whether it can be done in a good way.</div>

<div><br></div><div>First of all, I think I will look more into your idea about a shutdown timer in the application_controller. I'll get back to you when I have some more thoughts around this...</div><div><br></div><div>

Regards</div><div>/siri</div><div><br></div><div><br></div><div><div><br><div class="gmail_quote">2011/9/12 Robin Haberkorn <span dir="ltr">&lt;<a href="mailto:rh@travelping.com">rh@travelping.com</a>&gt;</span><br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">

Hi!<br>

<br>

I've just observed a very peculiar behaviour of the<br>

OTP supervisor and application controller, on one of our<br>

embedded Erlang nodes:<br>

<br>

When a supervisor gets stuck in an infinite process<br>

restart loop, it does not (and cannot) respond to shutdown<br>

signals. More specifically this happens if the supervised<br>

process crashes in its start function (and it's not the<br>

initial process start, of course).<br>

I know that a supervisor shouldn't get stuck in<br>

an endless loop and that you probably have good reason to<br>

handle restarts that way. I nevertheless would like to<br>

hear your opinion.<br>

<br>

Now, if the Erlang node is to shut down (either because<br>

it's told to or a permanent application terminates),<br>

the application controller will signal all application<br>

masters to shut down. However, it does so without any timeout<br>

after which a kill signal would be sent.<br>

If there exists a supervisor stuck in a restart loop like<br>

the one described above, the application controller will<br>

deadlock.<br>

<br>

One of the reasons why this may happen is the start function<br>

(e.g. init/1 in the gen_server callback module) taking<br>

too long before failing, which may be because of a gen_server<br>

call timeout (about which the supervised process does not<br>

necessarily know anything).<br>

<br>

It may even be that by unfavourable timing / race condition,<br>

the application controller terminates while a supervisor is<br>

just restarting a process doing an application module call<br>

(e.g. application:set_env/3) in its init which then has<br>

to time out, resulting in a perfect deadlock.<br>

Indeed this is exactly what has happened to me.<br>

<br>

A test case for reproducing this behaviour can be downloaded<br>

from github (it's a small OTP application):<br>

<br>

<a href="https://github.com/downloads/travelping/otp/supervisor_deadlock_testcase.tar.gz" target="_blank">https://github.com/downloads/travelping/otp/supervisor_deadlock_testcase.tar.gz</a><br>

<br>

Call deadlock_app:provoke_deadlock/0 to start it up.<br>

It does contain some comments as well.<br>

<br>

A patch to be discussed can be fetched here:<br>

<br>

git fetch git@github.com:travelping/otp.git fix_shutdown_supervisor<br>

<br>

<a href="https://github.com/travelping/otp/compare/fix_shutdown_supervisor" target="_blank">https://github.com/travelping/otp/compare/fix_shutdown_supervisor</a><br>

<a href="https://github.com/travelping/otp/compare/fix_shutdown_supervisor.patch" target="_blank">https://github.com/travelping/otp/compare/fix_shutdown_supervisor.patch</a><br>

<br>

It basically checks the message queue for shutdown messages<br>

before any attempted restart and shuts down if it finds<br>

one.<br>

This does not, of course, handle cases in which the process<br>

start function hangs indefinitely, if this is to be<br>

handled at all (!?).<br>

<br>

I also thought about the application controller termination<br>

behaviour. Wouldn't it be better if it had an application<br>

shutdown timeout - analogous to the supervisor child shutdown<br>

timeouts - after which it kills the application master?<br>

Such a timeout could be infinite by default to ensure backward<br>

compatibility and be configurable by a kernel environment<br>

variable.<br>

<br>

Thanks in advance,<br>

Robin Haberkorn<br>

<br>



</blockquote></div><br></div></div>