[erlang-questions] Supervisor Death Kills Workers?

David Mercer dmercer@REDACTED
Mon Aug 24 19:09:29 CEST 2009


Hynek Vychodil wrote:

 

It is a little bit uncommon to have a distributed supervisor tree rather than
a distributed application. In which cases can a supervisor fail? Only on total
failure of the HW or the Erlang runtime and such. I think the best you can do
then is to give up the whole HW node.

 

Yes, that was one of the scenarios I had in mind: a supervisor on a
different node from its workers.  If the supervisor's node fails (this could
be an OS or HW failure), the workers on the other nodes would also
terminate, which seems unnecessary and suboptimal.  When I thought about
this over the weekend, my conclusion was that I would only use supervisors
to supervise workers on the same node, so as to avoid this possibility.

 

Do I take it from your comment that that is the intended use of supervisors
- that they are not intended to supervise workers on other nodes?

 

Thanks,

 

David

 

  _____  

From: hynek@REDACTED [mailto:hynek@REDACTED] On Behalf Of Hynek
Vychodil
Sent: Monday, August 24, 2009 10:27 AM
To: David Mercer
Cc: Erlang
Subject: Re: [erlang-questions] Supervisor Death Kills Workers?

 

It is a little bit uncommon to have a distributed supervisor tree rather than
a distributed application. In which cases can a supervisor fail? Only on total
failure of the HW or the Erlang runtime and such. I think the best you can do
then is to give up the whole HW node.

On Mon, Aug 24, 2009 at 5:06 PM, David Mercer <dmercer@REDACTED> wrote:

Because workers are linked to their supervisors, a supervisor that dies takes
its workers down with it.  I had thought that the job - and about the only
job - of a supervisor was to restart workers when they stop, not to stop
workers that are no longer under supervision.
To my thinking, this introduces a single point of failure where previously
there wasn't: if the top-level supervisor terminates, then you've lost your
entire system.
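
A minimal sketch of the link behaviour described above, using plain processes
rather than the real supervisor behaviour.  The module name link_demo and the
function run/0 are made up for illustration, and the "parent" here only stands
in for a supervisor; the assumption is that the worker does not trap exits.

```erlang
-module(link_demo).
-export([run/0]).

%% Spawns a parent that links to a worker, crashes the parent, and
%% reports what happens to the worker.
run() ->
    Self = self(),
    Parent = spawn(fun() ->
        %% The child is linked to the parent, as supervised workers are.
        Child = spawn_link(fun() -> receive stop -> ok end end),
        Self ! {worker, Child},
        %% Wait until told to exit abnormally.
        receive crash -> exit(crashed) end
    end),
    Worker = receive {worker, W} -> W end,
    %% Monitor (not link) the worker so this process only observes it.
    MRef = erlang:monitor(process, Worker),
    Parent ! crash,
    receive
        {'DOWN', MRef, process, Worker, Reason} ->
            %% The exit signal travelled over the link and killed the worker.
            {worker_down, Reason}
    after 1000 ->
        worker_still_alive
    end.
```

Calling link_demo:run() should return {worker_down, crashed}: the parent's
abnormal exit propagates over the link and terminates the worker with the same
reason, even though the worker itself did nothing wrong.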



Am I misunderstanding supervision trees and the supervisor behaviour, or is
there a reason for introducing a single point of failure into what was a
distributed fault-tolerant system?



Thanks for your help in understanding this.  I was wondering about this all
weekend.



Cheers,



David




-- 
--Hynek (Pichi) Vychodil
