[erlang-questions] Supervisor Death Kills Workers?

Mon Aug 24 17:27:44 CEST 2009

David Mercer wrote:
> As workers are linked to their supervisors, the behaviour of a supervisor,
> therefore, is to kill its workers if it itself dies.  I had thought that it
> was the job - and about the only job - of the supervisor to restart workers
> when they stop, not to stop the workers if not working under supervision.
> To my thinking, this introduces a single point of failure where previously
> there wasn't: if the top-level supervisor terminates, then you've lost your
> entire system.
> 
> Am I misunderstanding supervision trees and the supervisor behaviour, or is
> there a reason for introducing a single point of failure into what was a
> distributed fault-tolerant system?

There is a rule of thumb, sometimes cited by Joe Armstrong
among others (although I think it was Martin Björklund who
first formulated it):

   There are processes that can be allowed to fail, and
   processes that cannot. You have to make your mind up.

Supervisors are processes that must be assumed correct,
much like the VM must be assumed correct. Thus, if a supervisor
dies, it had better be because it was told to, either explicitly
or because its parent died.
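
To make the link behaviour David describes concrete, here is a
minimal sketch. The module name link_demo, the child id w and the
helper functions are all made up for illustration. It starts a
supervisor with one worker, kills the supervisor brutally (which
is exactly what should not happen in a real system), and reports
why the worker went down.

```erlang
-module(link_demo).
-behaviour(supervisor).
-export([run/0, start_worker/0, init/1]).

%% Supervisor callback: one permanent worker, old-style tuple child spec
%% {Id, {M, F, A}, Restart, Shutdown, Type, Modules}.
init([]) ->
    Child = {w, {?MODULE, start_worker, []},
             permanent, 1000, worker, [?MODULE]},
    {ok, {{one_for_one, 1, 5}, [Child]}}.

%% Hypothetical worker: a bare linked process that does not trap exits.
start_worker() ->
    {ok, spawn_link(fun() -> receive stop -> ok end end)}.

run() ->
    %% Trap exits so the calling process survives the supervisor's death.
    Old = process_flag(trap_exit, true),
    {ok, Sup} = supervisor:start_link(?MODULE, []),
    [{w, Worker, worker, _}] = supervisor:which_children(Sup),
    Ref = erlang:monitor(process, Worker),
    %% Untrappable kill: the supervisor gets no chance to clean up.
    exit(Sup, kill),
    Reason = receive
                 {'DOWN', Ref, process, Worker, R} -> R
             after 5000 -> timeout
             end,
    %% Drain the supervisor's exit message and restore the old flag.
    receive {'EXIT', Sup, _} -> ok after 1000 -> ok end,
    process_flag(trap_exit, Old),
    Reason.
```

Calling link_demo:run() should return killed: the exit signal
travels over the link from the dead supervisor to the worker. A
worker that traps exits would survive this and be left running
unsupervised, which is one more reason to treat an unplanned
supervisor death as a bug rather than something to design around.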

You can view this as an invariant of sorts. It does have the
nice property that terminating an OTP application can be done
simply by telling the top supervisor to shut down. It will
pass on the shutdown order to all its children, and if they do
not respond within the shutdown timeout given in their child
specification, it will kill them without mercy. This is in part
to ensure that it is indeed possible to terminate the system.
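
As a rough sketch of that shutdown sequence (the module name
shutdown_demo, the child id stubborn and the 100 ms timeout are
assumptions chosen for the example), the following starts a worker
that traps exits and ignores the shutdown order, then tells the
supervisor to shut down and reports how the worker ended.

```erlang
-module(shutdown_demo).
-behaviour(supervisor).
-export([run/0, start_stubborn/0, init/1]).

%% Child spec with Shutdown = 100: the supervisor waits 100 ms after
%% sending exit(Child, shutdown) before it resorts to exit(Child, kill).
init([]) ->
    Child = {stubborn, {?MODULE, start_stubborn, []},
             permanent, 100, worker, [?MODULE]},
    {ok, {{one_for_one, 1, 5}, [Child]}}.

%% Hypothetical worker that traps exits and never obeys a shutdown.
start_stubborn() ->
    Parent = self(),
    Pid = spawn_link(fun() ->
                         process_flag(trap_exit, true),
                         Parent ! {ready, self()},
                         stubborn_loop()
                     end),
    %% Wait until the worker is trapping exits before reporting success.
    receive {ready, Pid} -> {ok, Pid} end.

%% Swallow every message, including {'EXIT', Sup, shutdown}.
stubborn_loop() ->
    receive _ -> stubborn_loop() end.

run() ->
    Old = process_flag(trap_exit, true),
    {ok, Sup} = supervisor:start_link(?MODULE, []),
    [{stubborn, Worker, worker, _}] = supervisor:which_children(Sup),
    Ref = erlang:monitor(process, Worker),
    %% Orderly shutdown request from the supervisor's parent (this process).
    exit(Sup, shutdown),
    Reason = receive
                 {'DOWN', Ref, process, Worker, R} -> R
             after 5000 -> timeout
             end,
    %% Drain the supervisor's exit message and restore the old flag.
    receive {'EXIT', Sup, _} -> ok after 1000 -> ok end,
    process_flag(trap_exit, Old),
    Reason.
```

Here shutdown_demo:run() should return killed, since the worker
ignored the shutdown request and the supervisor fell back to an
untrappable kill. A well-behaved worker would instead exit with
reason shutdown inside the timeout. Using brutal_kill or infinity
in place of the 100 changes how patient the supervisor is.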

Joe Armstrong writes about this in his thesis (ch 5):
http://www.sics.se/~joe/thesis/armstrong_thesis_2003.pdf

BR,
Ulf W
-- 
Ulf Wiger
CTO, Erlang Training & Consulting Ltd
http://www.erlang-consulting.com