[erlang-patches] Improving supervisor shutdown procedure

Mon Sep 12 16:31:56 CEST 2011

Hi!

I've just observed a very peculiar behaviour of the
OTP supervisor and application controller, on one of our
embedded Erlang nodes:

When a supervisor gets stuck in an infinite process
restart loop, it does not (and cannot) respond to shutdown
signals. More specifically, this happens if the supervised
process crashes in its start function (and it is not the
initial process start, of course).
I know that a supervisor shouldn't get stuck in
an endless loop and that you probably have good reasons to
handle restarts that way. I would nevertheless like to
hear your opinion.

Now, if the Erlang node is to shut down (either because
it is told to or because a permanent application terminates),
the application controller will signal all application
masters to shut down. However, it does so without any timeout
after which a kill signal would be sent.
If a supervisor is stuck in a restart loop like
the one described above, the application controller will
deadlock.

One reason why this may happen is that the start function
(e.g. init/1 in the gen_server callback module) takes
too long before failing, for instance because of a gen_server
call timeout (about which the supervised process does not
necessarily know anything).
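
To illustrate such a slowly failing start function, here is a
minimal sketch. It is not part of the test case or the patch; the
module name slow_fail_server, the Timeout argument and the
unresponsive helper process are made up for illustration.

```erlang
-module(slow_fail_server).
-behaviour(gen_server).

-export([start_link/1]).
-export([init/1, handle_call/3, handle_cast/2]).

%% Timeout (in ms) is a hypothetical parameter, only there to make
%% the delay before the failure adjustable.
start_link(Timeout) ->
    gen_server:start_link(?MODULE, Timeout, []).

init(Timeout) ->
    %% Hypothetical dependency that never answers a call.
    Blocker = spawn_link(fun() -> receive stop -> ok end end),
    %% Blocks for Timeout ms, then exits with {timeout, _}, so the
    %% start function fails only after the delay.
    Reply = gen_server:call(Blocker, ping, Timeout),
    {ok, Reply}.

handle_call(_Request, _From, State) ->
    {reply, ok, State}.

handle_cast(_Msg, State) ->
    {noreply, State}.
```

A supervisor restarting such a child spends most of its time
inside the start call, which is where shutdown requests go
unnoticed.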

It may even be that, through unfavourable timing (a race condition),
the application controller terminates while a supervisor is
just restarting a process that makes an application module call
(e.g. application:set_env/3) in its init, which then has
to time out, resulting in a perfect deadlock.
Indeed, this is exactly what happened to me.

A test case for reproducing this behaviour can be downloaded
from GitHub (it's a small OTP application):

https://github.com/downloads/travelping/otp/supervisor_deadlock_testcase.tar.gz

Call deadlock_app:provoke_deadlock/0 to start it up.
It does contain some comments as well.

A patch to be discussed can be fetched here:

git fetch git@REDACTED:travelping/otp.git fix_shutdown_supervisor

https://github.com/travelping/otp/compare/fix_shutdown_supervisor
https://github.com/travelping/otp/compare/fix_shutdown_supervisor.patch

It basically checks the message queue for shutdown messages
before any attempted restart and shuts down if it finds
one.
Of course, this does not handle cases in which the process
start function hangs indefinitely, if that is to be
handled at all (!?).
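
The idea of checking the message queue can be sketched roughly as
follows. This is not the code from the patch; the module and
function names are made up, and it assumes that the shutdown
request arrives as an exit message from the parent process (a
supervisor traps exits, so the signal shows up as a message).

```erlang
-module(shutdown_peek).
-export([shutdown_pending/1]).

%% Hypothetical helper: looks for an already queued exit message
%% from Parent without blocking. Note that a matching message is
%% removed from the queue.
shutdown_pending(Parent) ->
    receive
        {'EXIT', Parent, Reason} ->
            {true, Reason}
    after 0 ->
        %% Nothing queued from the parent, carry on with the restart.
        false
    end.
```

A restart loop could call such a check before every attempted
restart and terminate with the given reason instead of restarting.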

I also thought about the application controller termination
behaviour. Wouldn't it be better if it had an application
shutdown timeout - analogous to the supervisor child shutdown
timeouts - after which it kills the application master?
Such a timeout could be infinite by default to ensure backward
compatibility and be configurable by a kernel environment
variable.
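
A minimal sketch of such a shutdown-then-kill sequence, modelled
on the way child shutdown timeouts work in the supervisor. The
module name timed_shutdown and the function stop/2 are made up;
how the timeout would be read from the kernel environment is left
open here.

```erlang
-module(timed_shutdown).
-export([stop/2]).

%% Hypothetical helper: asks Pid to shut down and kills it if it
%% has not terminated within Timeout ms. Timeout may also be the
%% atom infinity, in which case no kill is ever sent.
stop(Pid, Timeout) ->
    Ref = erlang:monitor(process, Pid),
    exit(Pid, shutdown),
    receive
        {'DOWN', Ref, process, Pid, Reason} ->
            {ok, Reason}
    after Timeout ->
        %% The kill signal cannot be trapped, so the 'DOWN'
        %% message is guaranteed to arrive.
        exit(Pid, kill),
        receive
            {'DOWN', Ref, process, Pid, KillReason} ->
                {killed, KillReason}
        end
    end.
```

With infinity as the default, the behaviour would stay as it is
today unless a finite timeout is configured.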

Thanks in advance,
Robin Haberkorn

-- 
------------------ managed broadband access ------------------

Travelping GmbH               phone:           +49-391-8190990
Roentgenstr. 13               fax:           +49-391-819099299
D-39108 Magdeburg             email:       info@REDACTED
GERMANY                       web:   http://www.travelping.com

Company Registration: Amtsgericht Stendal Reg No.:   HRB 10578
Geschaeftsfuehrer: Holger Winkelmann | VAT ID No.: DE236673780
--------------------------------------------------------------