[erlang-patches] Improving supervisor shutdown procedure
Tue Sep 20 12:09:06 CEST 2011
Hi again Robin!
Just a quick follow-up on this. After some thinking and discussion, the
conclusion is to go for both ideas, i.e. both the shutdown timer in
application_controller and having the supervisor hand control back to
gen_server between restart attempts. Any thoughts, objections or
contributions are of course still welcome!
2011/9/15 Siri Hansen <>
> Hi Robin!
> I have started looking at this, and indeed it seems like a problem we need
> to investigate further... I do think that your patch is a bit too simple :(
> The main problem is that the supervisor does not know where the shutdown
> message comes from, and I believe it may cause some unexpected behavior if
> the shutdown is received from a different process than the supervisor's
> parent. If you were to continue with this idea, maybe you could look at a way
> to hand control back to the gen_server (which the supervisor is built
> on top of), since it knows who its parent is. It is only an idea, and I
> do not know if it is possible to do it in a good way.
> First of all, I think I will look more into your idea about a shutdown
> timer in the application_controller. I'll get back to you when I have some
> more thoughts around this...
> 2011/9/12 Robin Haberkorn <>
>> I've just observed a very peculiar behaviour of the
>> OTP supervisor and application controller, on one of our
>> embedded Erlang nodes:
>> When a supervisor gets stuck in an infinite process
>> restart loop, it does not (and cannot) respond to shutdown
>> signals. More specifically this happens if the supervised
>> process crashes in its start function (and it's not the
>> initial process start, of course).
>> I know that a supervisor shouldn't get stuck in
>> an endless loop and that you probably have good reason to
>> handle restarts that way. I nevertheless would like to
>> hear your opinion.
>> Now, if the Erlang node is to shut down (either because
>> it is told to or because a permanent application terminates),
>> the application controller will signal all application
>> masters to shut down. However, it does so without any timeout
>> after which a kill signal would be sent.
>> If there exists a supervisor stuck in a restart loop like
>> the one described above, the application controller will
>> deadlock.
>> One of the reasons why this may happen is that the start function
>> (e.g. init/1 in the gen_server callback module) takes
>> too long before failing, for example because of a gen_server
>> call timeout (about which the supervised process does not
>> necessarily know anything).
>> It may even be that, through unfavourable timing (a race condition),
>> the application controller terminates while a supervisor is
>> just restarting a process that makes an application module call
>> (e.g. application:set_env/3) in its init, which then has
>> to time out, resulting in a perfect deadlock.
>> Indeed, this is exactly what has happened to me.
>> A test case for reproducing this behaviour can be downloaded
>> from github (it's a small OTP application):
>> Call deadlock_app:provoke_deadlock/0 to start it up.
>> It does contain some comments as well.
>> A patch to be discussed can be fetched here:
>> git fetch :travelping/otp.git fix_shutdown_supervisor
>> It basically checks the message queue for shutdown messages
>> before any attempted restart and shuts down if it finds one.
>> This does not, of course, handle cases in which the process
>> start function hangs indefinitely, if this is to be
>> handled at all (!?).
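
The mailbox check described above could look roughly like the sketch below. It is not the code from the fix_shutdown_supervisor branch: the module name restart_guard and the function shutdown_pending/1 are hypothetical, and it assumes the caller traps exits and already knows its parent's pid, which is exactly the piece of information the reply points out a supervisor does not have on its own.

```erlang
-module(restart_guard).
-export([shutdown_pending/1]).

%% Hypothetical helper, not part of OTP or of the patch under discussion.
%% Assumes the calling process traps exits, so that an exit signal from
%% Parent arrives as an ordinary {'EXIT', Parent, Reason} message.
%%
%% Returns {true, Reason} if such a message is already in the mailbox
%% (the message is consumed), or false without blocking otherwise.
shutdown_pending(Parent) ->
    receive
        {'EXIT', Parent, Reason} ->
            {true, Reason}
    after 0 ->
        false
    end.
```

A restart loop could call shutdown_pending(Parent) before each attempt and terminate instead of restarting when it returns {true, Reason}. Because the check only runs between attempts, it does not help while a start function is still hanging.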
>> I also thought about the application controller termination
>> behaviour. Wouldn't it be better if it had an application
>> shutdown timeout - analogous to the supervisor child shutdown
>> timeouts - after which it kills the application master?
>> Such a timeout could be infinite by default to ensure backward
>> compatibility and be configurable by a kernel environment variable.
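
The timeout-then-kill behaviour proposed here, modelled on how a supervisor stops a child, can be sketched as follows. This is an illustration only, not application_controller code: the module name shutdown_timer and the function stop_with_timeout/2 are hypothetical, and passing infinity as the timeout reproduces the wait-forever behaviour that the proposal keeps as the default.

```erlang
-module(shutdown_timer).
-export([stop_with_timeout/2]).

%% Hypothetical helper, not part of OTP.
%% Asks Pid to shut down and waits up to Timeout milliseconds (or forever
%% if Timeout is the atom infinity). If Pid is still alive after that,
%% it is killed unconditionally.
stop_with_timeout(Pid, Timeout) ->
    MRef = erlang:monitor(process, Pid),
    exit(Pid, shutdown),
    receive
        {'DOWN', MRef, process, Pid, Reason} ->
            {ok, Reason}
    after Timeout ->
        %% The kill signal cannot be trapped, so the 'DOWN' message
        %% is guaranteed to arrive.
        exit(Pid, kill),
        receive
            {'DOWN', MRef, process, Pid, _Reason} ->
                {killed, timeout}
        end
    end.
```

With a finite timeout, a process that traps exits and never reacts to the shutdown request is killed once the timer expires, so the caller cannot be blocked indefinitely by a stuck supervisor.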
>> Thanks in advance,
>> Robin Haberkorn
>> ------------------ managed broadband access ------------------
>> Travelping GmbH phone: +49-391-8190990
>> Roentgenstr. 13 fax: +49-391-819099299
>> D-39108 Magdeburg email:
>> GERMANY web: http://www.travelping.com
>> Company Registration: Amtsgericht Stendal Reg No.: HRB 10578
>> Geschaeftsfuehrer: Holger Winkelmann | VAT ID No.: DE236673780