[erlang-questions] Supervisor Death Kills Workers?

Mon Aug 24 19:17:21 CEST 2009

Witold Baryluk wrote:

> But if there is any reason it will crash (like dropped messages or
> memory exhausted), the best thing anyone can do is kill the whole supervisor
> tree under it, because the previous assumption is wrong, and restart it from
> scratch (eventually terminating/restarting the whole system after multiple
> restarts in a small period of time).
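
The behaviour under discussion (workers going down with their supervisor)
can be observed with a small sketch. The module name sup_death_demo and the
idle worker are made up for illustration; only the supervisor module calls
are standard OTP.

```erlang
-module(sup_death_demo).
-behaviour(supervisor).
-export([run/0, init/1, start_worker/0]).

%% Hypothetical worker: idles until told to stop. spawn_link/1 links it
%% to the calling process, which here is the supervisor.
start_worker() ->
    Pid = spawn_link(fun() -> receive stop -> ok end end),
    {ok, Pid}.

%% One permanent worker under a one_for_one supervisor.
init([]) ->
    Child = {worker, {?MODULE, start_worker, []},
             permanent, 1000, worker, [?MODULE]},
    {ok, {{one_for_one, 3, 10}, [Child]}}.

run() ->
    %% Trap exits so that the supervisor's death does not kill the caller.
    Old = process_flag(trap_exit, true),
    {ok, Sup} = supervisor:start_link(?MODULE, []),
    [{worker, Worker, worker, _}] = supervisor:which_children(Sup),
    Ref = erlang:monitor(process, Worker),
    %% Simulate a supervisor crash with an untrappable kill.
    exit(Sup, kill),
    Result = receive
                 {'DOWN', Ref, process, Worker, Reason} ->
                     {worker_died, Reason}
             after 1000 ->
                 worker_survived
             end,
    %% Drop the supervisor's 'EXIT' message and restore the old flag.
    receive {'EXIT', Sup, _} -> ok after 1000 -> ok end,
    process_flag(trap_exit, Old),
    Result.
```

Calling sup_death_demo:run() returns {worker_died, killed}: the exit signal
travels over the link from the dead supervisor to the worker, which does
not trap exits.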

My thought is that if the supervisor crashes, the best you can hope for is that
the workers continue working until either (1) a stand-by supervisor takes
over supervision duties, or (2) the original supervisor comes back up.  When
either (1) or (2) occurs, the supervisor checks to see if any worker
terminated during the interregnum and restarts it, if necessary.
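
A minimal sketch of that idea, assuming monitors are used instead of links
so that the watcher's death does not propagate to the worker. All names
here (takeover_sketch, watcher/1, adopt/1) are hypothetical; this is not
how OTP's supervisor works, and it ignores restart limits, shutdown order
and everything else the real behaviour handles.

```erlang
-module(takeover_sketch).
-export([run/0]).

%% Hypothetical worker: idles until told to stop. It is spawned without
%% a link, so the death of whoever watches it does not reach it.
start_worker() ->
    spawn(fun() -> receive stop -> ok end end).

%% Hypothetical watcher: monitors the worker and replaces it on 'DOWN'.
watcher(Worker) ->
    Ref = erlang:monitor(process, Worker),
    receive
        {'DOWN', Ref, process, Worker, _Reason} ->
            watcher(start_worker())
    end.

%% Takeover check: keep the worker if it is still alive, otherwise start
%% a replacement. is_process_alive/1 only accepts local pids, so this
%% covers the same-node case only.
adopt(Worker) ->
    case is_process_alive(Worker) of
        true  -> Worker;
        false -> start_worker()
    end.

%% Kill a process and wait until it is really gone.
kill_and_wait(Pid) ->
    Ref = erlang:monitor(process, Pid),
    exit(Pid, kill),
    receive {'DOWN', Ref, process, Pid, _} -> ok end.

run() ->
    W1 = start_worker(),
    Watcher = spawn(fun() -> watcher(W1) end),
    kill_and_wait(Watcher),                  % the "supervisor" crashes
    Survived = is_process_alive(W1),         % the worker keeps running
    Reused = (adopt(W1) =:= W1),             % takeover finds it alive
    kill_and_wait(W1),                       % worker dies unwatched
    W2 = adopt(W1),                          % takeover restarts it
    Restarted = (W2 =/= W1) andalso is_process_alive(W2),
    W2 ! stop,
    {Survived, Reused, Restarted}.
```

Here takeover_sketch:run() returns {true, true, true}. The price of this
design is that an orphaned worker can keep running with nobody watching it.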

I'm not saying my approach is right and the current supervisory model is all
wrong.  On the contrary, I think decades of experience with Erlang have
shaped the current supervisor behaviour, and I am trying to understand why.
Hynek's comment earlier ("It is little bit uncommon have distributed
supervisor tree") may give a clue as to why OTP's supervisor is the way it
is: supervisors are commonly used only to supervise same-node workers, so it
has been adequate and workable to make certain assumptions about the
resilience of supervisors.

Am I on the right track?

Thanks,

David

> -----Original Message-----
> From: Witold Baryluk [mailto:baryluk@REDACTED]
> Sent: Monday, August 24, 2009 10:30 AM
> To: David Mercer
> Cc: 'Erlang'
> Subject: Re: [erlang-questions] Supervisor Death Kills Workers?
> 
> On Mon, 2009-08-24 at 10:06 -0500, David Mercer wrote:
> > As workers are linked to their supervisors, the behaviour of a supervisor,
> > therefore, is to kill its workers if it itself dies.  I had thought that it
> > was the job - and about the only job - of the supervisor to restart workers
> > when they stop, not to stop the workers if not working under supervision.
> > To my thinking, this introduces a single point of failure where previously
> > there wasn't: if the top-level supervisor terminates, then you've lost your
> > entire system.
> 
> This is the reason why you should not do any hackish work with
> supervisors. You will surely introduce some bugs and eventually bring
> the whole system down.
> 
> Keep it simple, keep it working.
> 
> As it is now, it is well tested and does the right job. Any input to the
> supervisor (like a child specification) is fully checked before it is
> used in any way.
> So we can trust that the supervisor code is correct and will never
> crash due to errors there.
> 
> But if there is any reason it will crash (like dropped messages or
> memory exhausted), the best thing anyone can do is kill the whole supervisor
> tree under it, because the previous assumption is wrong, and restart it from
> scratch (eventually terminating/restarting the whole system after multiple
> restarts in a small period of time).
> 
> 
> 
> --
> Witold Baryluk