[erlang-patches] Improving supervisor shutdown procedure

Thu Sep 15 16:36:42 CEST 2011

Hi Robin!

I have started looking at this, and indeed it seems like a problem we need
to investigate further... I do think that your patch is a bit too simple :(
The main problem is that the supervisor does not know where the shutdown
message comes from, and I believe it may cause some unexpected behavior if
the shutdown is received from a process other than the supervisor's
parent. If you were to continue with this idea, maybe you could look at a way
to hand control back to the gen_server (which the supervisor is built
on top of), since it knows who its parent is. It is only an idea, and I
do not know if it is possible to do it in a good way.
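
To make the concern concrete, here is a minimal sketch (not the patch
itself, and not how supervisor.erl is written) of a check that only reacts
to an exit message from a known parent. It assumes the process traps exits,
so that the parent's exit(Pid, shutdown) shows up as
{'EXIT', Parent, shutdown} in the mailbox. The names shutdown_check,
pending_parent_exit/1 and demo/0 are made up for illustration, and demo/0
fakes the messages by hand rather than using real exit signals.

```erlang
-module(shutdown_check).
-export([pending_parent_exit/1, demo/0]).

%% Hypothetical helper: look for an 'EXIT' message from Parent that is
%% already waiting in the mailbox, without blocking ('after 0').
%% Because Parent is bound, 'EXIT' messages from any other process do
%% not match and stay in the queue.
pending_parent_exit(Parent) ->
    receive
        {'EXIT', Parent, Reason} -> {exit, Reason}
    after 0 ->
        none
    end.

%% Illustration only: the 'EXIT' tuples are sent by hand here.
demo() ->
    FakeParent = spawn(fun() -> receive stop -> ok end end),
    Stranger = spawn(fun() -> ok end),
    %% A shutdown from some other process is not taken as the parent's.
    self() ! {'EXIT', Stranger, shutdown},
    Before = pending_parent_exit(FakeParent),
    %% A shutdown from the parent is found.
    self() ! {'EXIT', FakeParent, shutdown},
    Later = pending_parent_exit(FakeParent),
    %% Clean up the leftover message and the helper process.
    receive {'EXIT', Stranger, shutdown} -> ok after 0 -> ok end,
    FakeParent ! stop,
    {Before, Later}.
```

A restart loop could call something like pending_parent_exit/1 before each
restart attempt and stop if it returns {exit, Reason}; whether that fits
cleanly with the gen_server loop is exactly the open question.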

First of all, I think I will look more into your idea about a shutdown timer
in the application_controller. I'll get back to you when I have some more
thoughts around this...
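
For reference, the kind of shutdown timer Robin describes could look
roughly like the sketch below: ask the process to shut down, wait up to
Timeout milliseconds (or infinity), then fall back to kill. This is an
assumption about the general shape only; shutdown_timeout, stop/2 and
demo/0 are hypothetical names, and nothing here is taken from
application_controller.

```erlang
-module(shutdown_timeout).
-export([stop/2, demo/0]).

%% Hypothetical helper: send 'shutdown', wait up to Timeout
%% (milliseconds or 'infinity') for the process to die, then use 'kill'.
stop(Pid, Timeout) ->
    Ref = erlang:monitor(process, Pid),
    exit(Pid, shutdown),
    receive
        {'DOWN', Ref, process, Pid, Reason} ->
            {ok, Reason}
    after Timeout ->
        %% 'kill' cannot be trapped, so the 'DOWN' message will arrive.
        exit(Pid, kill),
        receive
            {'DOWN', Ref, process, Pid, KillReason} ->
                {killed, KillReason}
        end
    end.

%% Stand-in for a process that never reacts to shutdown requests.
ignore_all() ->
    receive _ -> ignore_all() end.

demo() ->
    Self = self(),
    %% A process that does not trap exits dies on 'shutdown' right away.
    Polite = spawn(fun() -> receive never -> ok end end),
    R1 = stop(Polite, 1000),
    %% A process that traps exits and ignores them needs the kill.
    Stuck = spawn(fun() ->
                      process_flag(trap_exit, true),
                      Self ! {ready, self()},
                      ignore_all()
                  end),
    %% Wait until trap_exit is set, to avoid a race with exit/2.
    receive {ready, Stuck} -> ok end,
    R2 = stop(Stuck, 100),
    {R1, R2}.
```

With Timeout set to infinity the kill branch is never reached, which would
keep today's behaviour as the default.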

Regards
/siri

2011/9/12 Robin Haberkorn <rh@REDACTED>

> Hi!
>
> I've just observed a very peculiar behaviour of the
> OTP supervisor and application controller, on one of our
> embedded Erlang nodes:
>
> When a supervisor gets stuck in an infinite process
> restart loop, it does not (and cannot) respond to shutdown
> signals. More specifically this happens if the supervised
> process crashes in its start function (and it's not the
> initial process start, of course).
> I know that a supervisor shouldn't get stuck in
> an endless loop and that you probably have good reason to
> handle restarts that way. I nevertheless would like to
> hear your opinion.
>
> Now, if the Erlang node is to shut down (either because
> it is told to or because a permanent application terminates),
> the application controller will signal all application
> masters to shut down. However, it does so without any timeout
> after which a kill signal would be sent.
> If there exists a supervisor stuck in a restart loop like
> the one described above, the application controller will
> deadlock.
>
> One reason this may happen is that the start function
> (e.g. init/1 in the gen_server callback module) takes
> too long before failing, for instance because of a gen_server
> call timeout (about which the supervised process does not
> necessarily know anything).
>
> It may even be that, through unfavourable timing (a race condition),
> the application controller terminates while a supervisor is
> just restarting a process that makes an application module call
> (e.g. application:set_env/3) in its init, which then has
> to time out, resulting in a perfect deadlock.
> Indeed this is exactly what has happened to me.
>
> A test case for reproducing this behaviour can be downloaded
> from github (it's a small OTP application):
>
> https://github.com/downloads/travelping/otp/supervisor_deadlock_testcase.tar.gz
>
> Call deadlock_app:provoke_deadlock/0 to start it up.
> It does contain some comments as well.
>
> A patch to be discussed can be fetched here:
>
> git fetch git@REDACTED:travelping/otp.git fix_shutdown_supervisor
>
> https://github.com/travelping/otp/compare/fix_shutdown_supervisor
> https://github.com/travelping/otp/compare/fix_shutdown_supervisor.patch
>
> It basically checks the message queue for shutdown messages
> before any attempted restart and shuts down if it finds
> one.
> This does not, of course, handle cases in which the process
> start function hangs indefinitely, if this is to be
> handled at all (!?).
>
> I also thought about the application controller termination
> behaviour. Wouldn't it be better if it had an application
> shutdown timeout - analogous to the supervisor child shutdown
> timeouts - after which it kills the application master?
> Such a timeout could be infinite by default to ensure backward
> compatibility and be configurable by a kernel environment
> variable.
>
> Thanks in advance,
> Robin Haberkorn
>
> --
> ------------------ managed broadband access ------------------
>
> Travelping GmbH               phone:           +49-391-8190990
> Roentgenstr. 13               fax:           +49-391-819099299
> D-39108 Magdeburg             email:       info@REDACTED
> GERMANY                       web:   http://www.travelping.com
>
>
> Company Registration: Amtsgericht Stendal Reg No.:   HRB 10578
> Geschaeftsfuehrer: Holger Winkelmann | VAT ID No.: DE236673780
> --------------------------------------------------------------