[erlang-patches] Improving supervisor shutdown procedure
Thu Sep 15 16:36:42 CEST 2011
I have started looking at this, and indeed it seems like a problem we need
to investigate further... I do think that your patch is a bit too simple :(
The main problem is that the supervisor does not know where the shutdown
message comes from, and I believe it may cause some unexpected behavior if
the shutdown is received from a different process than the supervisor's
parent. If you were to continue on this idea, maybe you could look at a way
to hand control back to the gen_server (which the supervisor is built
on top of), since it knows who its parent is. It is only an idea, and I
do not know if it is possible to do it in a good way.
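
To make the concern concrete, here is a minimal sketch of a queue check that
only reacts to a shutdown sent by the parent. The module and function names
are hypothetical (this is neither the proposed patch nor OTP code), and it
assumes the calling process traps exits and already knows its parent's pid:

```erlang
-module(shutdown_check).
-export([parent_shutdown_pending/1]).

%% Hypothetical helper, not part of OTP.
%% Assumes the caller has set process_flag(trap_exit, true), so exit
%% signals arrive as {'EXIT', From, Reason} messages.
parent_shutdown_pending(Parent) when is_pid(Parent) ->
    receive
        %% Parent is already bound, so only the parent's shutdown matches.
        %% A shutdown from any other process is left in the queue.
        %% Note that a matching message is consumed here.
        {'EXIT', Parent, shutdown} ->
            true
    after 0 ->
        %% Nothing matching is queued right now; do not block.
        false
    end.
```

A restart loop could call this before each restart attempt and terminate
when it returns true. Because the message is consumed, the caller has to act
on the result immediately.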
First of all, I think I will look more into your idea about a shutdown timer
in the application_controller. I'll get back to you when I have some more
thoughts around this...
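
For reference, the shutdown timer idea could look roughly like the sketch
below: ask a process to shut down, wait a bounded time, then kill it. The
module and function names are hypothetical, and this is not how the
application_controller is implemented; it only illustrates the pattern with
standard BIFs (erlang:monitor/2 and exit/2):

```erlang
-module(shutdown_timeout).
-export([stop/2]).

%% Hypothetical helper, not part of OTP.
%% Timeout is in milliseconds, or the atom infinity to wait forever.
stop(Pid, Timeout) ->
    Ref = erlang:monitor(process, Pid),
    %% Polite request first; a process trapping exits may ignore it.
    exit(Pid, shutdown),
    receive
        {'DOWN', Ref, process, Pid, _Reason} ->
            ok
    after Timeout ->
        %% The kill signal cannot be trapped, so a 'DOWN' must follow.
        exit(Pid, kill),
        receive
            {'DOWN', Ref, process, Pid, _} ->
                killed
        end
    end.
```

With Timeout set to infinity the kill branch is never taken, which matches
the backward-compatible default suggested in the quoted message.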
2011/9/12 Robin Haberkorn <>
> I've just observed a very peculiar behaviour of the
> OTP supervisor and application controller, on one of our
> embedded Erlang nodes:
> When a supervisor gets stuck in an infinite process
> restart loop, it does not (and cannot) respond to shutdown
> signals. More specifically this happens if the supervised
> process crashes in its start function (and it's not the
> initial process start, of course).
> I know that a supervisor shouldn't get stuck in
> an endless loop and that you probably have good reason to
> handle restarts that way. I nevertheless would like to
> hear your opinion.
> Now, if the Erlang node is to shut down (either because
> it's told so or a permanent application terminates),
> the application controller will signal all application
> masters to shut down. However it does so without any timeout
> after which a kill signal would be sent.
> If there exists a supervisor stuck in a restart loop like
> the one described above, the application controller will
> deadlock.
> One of the reasons why this may happen is the start function
> (e.g. init/1 in the gen_server callback module), taking
> too long before failing, which may be because of a gen_server
> call timeout (about which the supervised process does not
> necessarily know anything).
> It may even be that by unfavourable timing / race condition,
> the application controller terminates while a supervisor is
> just restarting a process doing an application module call
> (e.g. application:set_env/3) in its init which then has
> to time out, resulting in a perfect deadlock.
> Indeed this is exactly what has happened to me.
> A test case for reproducing this behaviour can be downloaded
> from github (it's a small OTP application):
> Call deadlock_app:provoke_deadlock/0 to start it up.
> It does contain some comments as well.
> A patch to be discussed can be fetched here:
> git fetch :travelping/otp.git fix_shutdown_supervisor
> It basically checks the message queue for shutdown messages
> before any attempted restart and shuts down if it finds one.
> This does not of course handle cases in which the process
> start function hangs indefinitely, if this is to be
> handled at all (!?).
> I also thought about the application controller termination
> behaviour. Wouldn't it be better if it had an application
> shutdown timeout - analogous to the supervisor child shutdown
> timeouts - after which it kills the application master?
> Such a timeout could be infinite by default to ensure backward
> compatibility and be configurable by a kernel environment variable.
> Thanks in advance,
> Robin Haberkorn
> ------------------ managed broadband access ------------------
> Travelping GmbH phone: +49-391-8190990
> Roentgenstr. 13 fax: +49-391-819099299
> D-39108 Magdeburg email:
> GERMANY web: http://www.travelping.com
> Company Registration: Amtsgericht Stendal Reg No.: HRB 10578
> Geschaeftsfuehrer: Holger Winkelmann | VAT ID No.: DE236673780