[erlang-patches] Improving supervisor shutdown procedure

Tue Sep 20 12:09:06 CEST 2011

Hi again Robin!

Just a quick follow-up on this. After some thinking and discussing, the
conclusion is to go for both ideas, i.e. both the shutdown timer in
application_controller and the supervisor handing control back to
gen_server between each restart attempt. Any thoughts, objections or
contributions are of course still welcome!
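
To make the second idea concrete, here is a minimal sketch of a gen_server
that returns to its own receive loop between restart attempts, so that a
shutdown from the parent is seen after at most one more attempt. This is not
the supervisor code; the module name retry_restart_sketch, the try_restart
message and demo/0 are made up for illustration only.

```erlang
-module(retry_restart_sketch).
-behaviour(gen_server).

-export([start_link/1, demo/0]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2, terminate/2]).

%% StartFun is a hypothetical child start function returning
%% {ok, Pid} | {error, Reason}.
start_link(StartFun) ->
    gen_server:start_link(?MODULE, StartFun, []).

init(StartFun) ->
    %% Trap exits like a supervisor does, so that a shutdown from the parent
    %% arrives as a message that gen_server itself recognises.
    process_flag(trap_exit, true),
    gen_server:cast(self(), try_restart),
    {ok, {StartFun, 0}}.

handle_call(_Request, _From, State) ->
    {reply, ok, State}.

%% One restart attempt per message. On failure we queue another attempt
%% instead of looping, which lets gen_server look at its mailbox in between.
handle_cast(try_restart, {StartFun, Attempts}) ->
    case StartFun() of
        {ok, _Pid} ->
            {noreply, {StartFun, Attempts + 1}};
        {error, _Reason} ->
            gen_server:cast(self(), try_restart),
            {noreply, {StartFun, Attempts + 1}}
    end.

handle_info(_Info, State) ->
    {noreply, State}.

terminate(_Reason, _State) ->
    ok.

%% The child start function always fails, yet the server still obeys a
%% shutdown sent by its parent (the calling process).
demo() ->
    process_flag(trap_exit, true),
    {ok, Pid} = start_link(fun() -> timer:sleep(10), {error, crash} end),
    exit(Pid, shutdown),
    receive
        {'EXIT', Pid, Reason} -> Reason
    after 1000 ->
        still_looping
    end.
```

Note that demo/0 leaves trap_exit set in the calling process. Because
gen_server knows its parent, it only treats an exit message from that parent
as a request to terminate.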

Regards
/siri

2011/9/15 Siri Hansen <erlangsiri@REDACTED>

> Hi Robin!
>
> I have started looking at this, and indeed it seems like a problem we need
> to investigate further... I do think that your patch is a bit too simple :(
> The main problem is that the supervisor does not know where the shutdown
> message comes from, and I believe it may cause some unexpected behavior if
> the shutdown is received from a different process than the supervisor's
> parent. If you were to continue on this idea, maybe you could look at a way
> to hand control back to the gen_server (which the supervisor is built
> on top of), since it knows who its parent is. It is only an idea, and I
> do not know if it is possible to do it in a good way.
>
> First of all, I think I will look more into your idea about a shutdown
> timer in the application_controller. I'll get back to you when I have some
> more thoughts around this...
>
> Regards
> /siri
>
>
>
> 2011/9/12 Robin Haberkorn <rh@REDACTED>
>
>> Hi!
>>
>> I've just observed a very peculiar behaviour of the
>> OTP supervisor and application controller, on one of our
>> embedded Erlang nodes:
>>
>> When a supervisor gets stuck in an infinite process
>> restart loop, it does not (and cannot) respond to shutdown
>> signals. More specifically this happens if the supervised
>> process crashes in its start function (and it's not the
>> initial process start, of course).
>> I know that a supervisor shouldn't get stuck in
>> an endless loop and that you probably have good reason to
>> handle restarts that way. I nevertheless would like to
>> hear your opinion.
>>
>> Now, if the Erlang node is to shut down (either because
>> it is told to or because a permanent application terminates),
>> the application controller will signal all application
>> masters to shut down. However, it does so without any timeout
>> after which a kill signal would be sent.
>> If there exists a supervisor stuck in a restart loop like
>> the one described above, the application controller will
>> deadlock.
>>
>> One of the reasons why this may happen is the start function
>> (e.g. init/1 in the gen_server callback module) taking
>> too long before failing, which may be because of a gen_server
>> call timeout (about which the supervised process does not
>> necessarily know anything).
>>
>> It may even be that, through unfavourable timing or a race condition,
>> the application controller terminates while a supervisor is
>> just restarting a process that makes an application module call
>> (e.g. application:set_env/3) in its init, which then has
>> to time out, resulting in a perfect deadlock.
>> Indeed this is exactly what has happened to me.
>>
>> A test case for reproducing this behaviour can be downloaded
>> from github (it's a small OTP application):
>>
>>
>> https://github.com/downloads/travelping/otp/supervisor_deadlock_testcase.tar.gz
>>
>> Call deadlock_app:provoke_deadlock/0 to start it up.
>> It does contain some comments as well.
>>
>> A patch to be discussed can be fetched here:
>>
>> git fetch git@REDACTED:travelping/otp.git fix_shutdown_supervisor
>>
>> https://github.com/travelping/otp/compare/fix_shutdown_supervisor
>> https://github.com/travelping/otp/compare/fix_shutdown_supervisor.patch
>>
>> It basically checks the message queue for shutdown messages
>> before any attempted restart and shuts down if it finds
>> one.
>> This does not of course handle cases in which the process
>> start function hangs indefinitely, if this is to be
>> handled at all (!?).
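
As a rough illustration of the check described above (not the actual patch),
the following sketch polls the mailbox with a zero timeout before each
restart attempt. The module name, restart_loop/3 and the explicit Parent
argument are assumptions made for this example; it also assumes the process
traps exits, so that a shutdown shows up as an 'EXIT' message.

```erlang
-module(shutdown_check_sketch).
-export([restart_loop/3, demo/0]).

%% StartFun is a hypothetical child start function returning
%% {ok, Pid} | {error, Reason}. Parent is passed in explicitly here.
restart_loop(_Parent, _StartFun, 0) ->
    {error, too_many_restarts};
restart_loop(Parent, StartFun, Retries) ->
    receive
        %% Only an exit message from the parent counts as a shutdown request.
        {'EXIT', Parent, Reason} ->
            {shutdown, Reason}
    after 0 ->
        %% Nothing pending from the parent: try to start the child once more.
        case StartFun() of
            {ok, Pid} ->
                {ok, Pid};
            {error, _Reason} ->
                restart_loop(Parent, StartFun, Retries - 1)
        end
    end.

%% Simulate a shutdown message that is already queued when the loop starts.
demo() ->
    self() ! {'EXIT', self(), shutdown},
    restart_loop(self(), fun() -> {error, crash} end, 5).
```

As noted in the text, a check like this only helps between attempts. It does
nothing if the start function itself never returns.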
>>
>> I also thought about the application controller termination
>> behaviour. Wouldn't it be better if it had an application
>> shutdown timeout - analogous to the supervisor child shutdown
>> timeouts - after which it kills the application master?
>> Such a timeout could be infinite by default to ensure backward
>> compatibility and be configurable by a kernel environment
>> variable.
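
The proposed timeout could behave roughly like the sketch below: ask a
process to shut down, wait up to Timeout, then kill it. This is only an
illustration of the idea, not application_controller code. The module name
and stop_with_timeout/2 are hypothetical, and the timeout is passed as an
argument rather than read from a kernel environment variable.

```erlang
-module(shutdown_timeout_sketch).
-export([stop_with_timeout/2, demo/0]).

%% Timeout is in milliseconds, or the atom infinity to wait forever
%% (which would be the backward-compatible default).
stop_with_timeout(Pid, Timeout) ->
    Ref = erlang:monitor(process, Pid),
    exit(Pid, shutdown),
    receive
        {'DOWN', Ref, process, Pid, _Reason} ->
            ok
    after Timeout ->
        %% The process did not stop in time: kill it unconditionally.
        exit(Pid, kill),
        receive
            {'DOWN', Ref, process, Pid, _} ->
                killed
        end
    end.

%% A process that traps exits and never handles the shutdown request.
demo() ->
    Self = self(),
    Stuck = spawn(fun() ->
        process_flag(trap_exit, true),
        Self ! ready,
        receive never_sent -> ok end
    end),
    %% Wait until the process really traps exits before stopping it.
    receive ready -> ok end,
    stop_with_timeout(Stuck, 100).
```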
>>
>> Thanks in advance,
>> Robin Haberkorn
>>
>> --
>> ------------------ managed broadband access ------------------
>>
>> Travelping GmbH               phone:           +49-391-8190990
>> Roentgenstr. 13               fax:           +49-391-819099299
>> D-39108 Magdeburg             email:       info@REDACTED
>> GERMANY                       web:   http://www.travelping.com
>>
>>
>> Company Registration: Amtsgericht Stendal Reg No.:   HRB 10578
>> Geschaeftsfuehrer: Holger Winkelmann | VAT ID No.: DE236673780
>> --------------------------------------------------------------
>
>