Hi Robin!<div><br></div><div>I have started looking at this, and indeed it seems like a problem we need to investigate further... I do think that your patch is a bit too simple :( The main problem is that the supervisor does not know where the shutdown message comes from, and I believe it may cause some unexpected behaviour if the shutdown is received from a different process than the supervisor's parent. If you were to continue with this idea, maybe you could look at a way to hand control back to the gen_server (which the supervisor is built on top of), since it knows who its parent is. It is only an idea, and I do not know whether it can be done in a good way.</div>
<div><br></div><div>Apart from that, I think I will look more into your idea about a shutdown timer in the application_controller. I'll get back to you when I have some more thoughts on this...</div><div><br></div><div>
Regards</div><div>/siri</div><div><br></div><div><br></div><div><div><br><div class="gmail_quote">2011/9/12 Robin Haberkorn <span dir="ltr"><<a href="mailto:rh@travelping.com">rh@travelping.com</a>></span><br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">
Hi!<br>
<br>
I've just observed a very peculiar behaviour of the<br>
OTP supervisor and application controller, on one of our<br>
embedded Erlang nodes:<br>
<br>
When a supervisor gets stuck in an infinite process<br>
restart loop, it does not (and cannot) respond to shutdown<br>
signals. More specifically, this happens when the supervised<br>
process crashes in its start function during a restart (not<br>
during the initial start, of course).<br>
I know that a supervisor shouldn't get stuck in<br>
an endless loop and that you probably have good reasons for<br>
handling restarts that way. Nevertheless, I would like to<br>
hear your opinion.<br>
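<br>
To make the situation concrete, here is a minimal sketch of such a<br>
child (hypothetical module, not the test case from the tarball;<br>
persistent_term is used only for brevity to remember that the first<br>
start already happened):<br>
<br>
```erlang
-module(deadlock_child).
-behaviour(gen_server).
-export([start_link/0, init/1, handle_call/3, handle_cast/2]).

start_link() ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

%% The very first start succeeds; every subsequent (re)start crashes
%% in init/1. With a permanent child spec, killing the running child
%% (exit(whereis(deadlock_child), kill)) then sends the supervisor
%% into the failing-restart loop described above, during which it
%% never returns to its message loop to handle a 'shutdown' signal.
init([]) ->
    case persistent_term:get(?MODULE, first_start) of
        first_start ->
            persistent_term:put(?MODULE, restarted),
            {ok, #{}};
        restarted ->
            exit(crash_on_restart)
    end.

handle_call(_Request, _From, State) -> {reply, ok, State}.
handle_cast(_Msg, State) -> {noreply, State}.
```
<br>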
<br>
Now, if the Erlang node is to shut down (either because<br>
it is told to or because a permanent application terminates),<br>
the application controller will signal all application<br>
masters to shut down. However, it does so without any timeout<br>
after which a kill signal would be sent.<br>
If a supervisor is stuck in a restart loop like<br>
the one described above, the application controller will<br>
deadlock.<br>
<br>
One of the reasons this may happen is the start function<br>
(e.g. init/1 in the gen_server callback module) taking<br>
too long before failing, for instance because of a gen_server<br>
call timeout (about which the supervised process does not<br>
necessarily know anything).<br>
<br>
It may even be that, due to unfavourable timing (a race<br>
condition), the application controller terminates while a<br>
supervisor is just restarting a process that performs an<br>
application module call (e.g. application:set_env/3) in its<br>
init, which then has to time out, resulting in a perfect<br>
deadlock. Indeed, this is exactly what happened to me.<br>
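<br>
The racy init boils down to something like this (hypothetical module<br>
and names; application:set_env/3 is itself a gen_server call into the<br>
application_controller process, so it blocks while the controller is<br>
busy terminating):<br>
<br>
```erlang
-module(racy_child).
-export([init/1]).

%% If the application controller is busy shutting applications down,
%% this call cannot be served: it sits in the controller's message
%% queue until the call timeout expires, and only then does init/1
%% fail - keeping the restarting supervisor (and hence the whole
%% node shutdown) waiting in the meantime.
init([]) ->
    ok = application:set_env(my_app, some_key, some_value),
    {ok, #{}}.
```
<br>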
<br>
A test case for reproducing this behaviour can be downloaded<br>
from github (it's a small OTP application):<br>
<br>
<a href="https://github.com/downloads/travelping/otp/supervisor_deadlock_testcase.tar.gz" target="_blank">https://github.com/downloads/travelping/otp/supervisor_deadlock_testcase.tar.gz</a><br>
<br>
Call deadlock_app:provoke_deadlock/0 to start it up.<br>
It contains some explanatory comments as well.<br>
<br>
A patch to be discussed can be fetched here:<br>
<br>
git fetch git@github.com:travelping/otp.git fix_shutdown_supervisor<br>
<br>
<a href="https://github.com/travelping/otp/compare/fix_shutdown_supervisor" target="_blank">https://github.com/travelping/otp/compare/fix_shutdown_supervisor</a><br>
<a href="https://github.com/travelping/otp/compare/fix_shutdown_supervisor.patch" target="_blank">https://github.com/travelping/otp/compare/fix_shutdown_supervisor.patch</a><br>
<br>
It basically checks the message queue for shutdown messages<br>
before any attempted restart and shuts down if it finds<br>
one.<br>
This does not, of course, handle cases in which the process<br>
start function hangs indefinitely - if that is to be<br>
handled at all (!?).<br>
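<br>
The idea amounts to a zero-timeout selective receive before each<br>
restart attempt, roughly along these lines (illustrative names, not<br>
the actual patch code):<br>
<br>
```erlang
-module(restart_check).
-export([check_shutdown/1]).

%% Peek at the message queue without blocking: if our parent has
%% already sent us an exit signal, give up restarting and report
%% the shutdown; otherwise carry on with the restart attempt.
%% Matching on Parent is what restricts this to the supervisor's
%% parent, addressing the "who sent the shutdown?" concern.
check_shutdown(Parent) ->
    receive
        {'EXIT', Parent, Reason} ->
            {shutdown, Reason}
    after 0 ->
        continue
    end.
```
<br>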
<br>
I also thought about the application controller termination<br>
behaviour. Wouldn't it be better if it had an application<br>
shutdown timeout - analogous to the supervisor child shutdown<br>
timeouts - after which it kills the application master?<br>
Such a timeout could be infinite by default to ensure backward<br>
compatibility and be configurable by a kernel environment<br>
variable.<br>
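<br>
Such a timeout would follow the same ask-then-kill pattern a<br>
supervisor already uses for its children, sketched here with<br>
illustrative names:<br>
<br>
```erlang
-module(timed_stop).
-export([shutdown/2]).

%% Ask Pid to shut down; if it has not exited after Timeout
%% milliseconds, kill it unconditionally. Timeout = infinity
%% preserves today's behaviour of waiting forever.
shutdown(Pid, Timeout) ->
    Ref = erlang:monitor(process, Pid),
    exit(Pid, shutdown),
    receive
        {'DOWN', Ref, process, Pid, _Reason} ->
            ok
    after Timeout ->
        exit(Pid, kill),
        receive
            {'DOWN', Ref, process, Pid, _} ->
                killed
        end
    end.
```
<br>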
<br>
Thanks in advance,<br>
Robin Haberkorn<br>
<br>
--<br>
------------------ managed broadband access ------------------<br>
<br>
Travelping GmbH phone: <a href="tel:%2B49-391-8190990" value="+493918190990">+49-391-8190990</a><br>
Roentgenstr. 13 fax: <a href="tel:%2B49-391-819099299" value="+49391819099299">+49-391-819099299</a><br>
D-39108 Magdeburg email: <a href="mailto:info@travelping.com">info@travelping.com</a><br>
GERMANY web: <a href="http://www.travelping.com" target="_blank">http://www.travelping.com</a><br>
<br>
<br>
Company Registration: Amtsgericht Stendal Reg No.: HRB 10578<br>
Geschaeftsfuehrer: Holger Winkelmann | VAT ID No.: DE236673780<br>
--------------------------------------------------------------<br>
_______________________________________________<br>
erlang-patches mailing list<br>
<a href="mailto:erlang-patches@erlang.org">erlang-patches@erlang.org</a><br>
<a href="http://erlang.org/mailman/listinfo/erlang-patches" target="_blank">http://erlang.org/mailman/listinfo/erlang-patches</a><br>
</blockquote></div><br></div></div>