[erlang-questions] to supervise or not to supervise

Lennart Öhman <>
Fri Mar 20 22:23:16 CET 2009


Hi, as you have discovered there are a few different restart strategies available when designing a supervisor (e.g. one_for_one, one_for_all). One can of course come up with a more or less infinite number of such strategies, each one with its own twist.
The main idea and problem at the time when the supervisor behaviour was constructed was that you have a set of more or less permanent processes that implement a subsystem. There should not be any 'illegal' terminations (those that cause the supervisor to act) amongst the children. But, as we know, no nontrivial system is completely correct, hence an occasional failure and the following restart must be allowed. If we have repeated failures it may indicate that the problem concerns more than this subsystem, hence the need to eventually escalate the restarts (or, as you have discovered, kill all children and itself).

If you really want to achieve a situation where failures never escalate above a certain supervisor, you can bump up the max-restart threshold and at the same time shorten the sliding window. A not uncommon mistake (in normal usage of supervisors :) ) is to have too high a max-restart intensity in combination with too short a sliding window at a higher-level supervisor. The time between escalation attempts by a lower-level supervisor for the same error may then be too long, so the higher-level supervisor does not consider two failures amongst its children to be the same error, and therefore never escalates to its own superior supervisor.
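As a rough illustration of the tuning described above, here is a minimal sketch of a supervisor callback module with a high restart limit and a short window. The module name tolerant_sup, the worker module my_worker (assumed to export start_link/0), and the numbers 1000 and 1 are assumptions made up for this example, not values taken from this thread.

```erlang
-module(tolerant_sup).
-behaviour(supervisor).

-export([start_link/0, init/1]).

%% Start the supervisor and register it locally under the module name.
start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

init([]) ->
    %% {Strategy, MaxR, MaxT}: the supervisor gives up only if more than
    %% MaxR restarts happen within MaxT seconds. A large MaxR combined
    %% with a small MaxT makes escalation unlikely in practice.
    SupFlags = {one_for_one, 1000, 1},
    %% my_worker is a hypothetical worker module exporting start_link/0.
    Child = {my_worker,
             {my_worker, start_link, []},
             permanent, 5000, worker, [my_worker]},
    {ok, {SupFlags, [Child]}}.
```

With flags like these a single misbehaving child is restarted over and over without taking its siblings down, at the cost that a genuine systemic fault may never be reported upwards. A child that crashes immediately on every start could still exceed the limit within the window.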

Best Regards
Lennart


-------------------------------------------------------------
Lennart Öhman                   direct  : +46 8 587 623 27
Sjöland & Thyselius Telecom AB  cellular: +46 70 552 6735
Hälsingegatan 43, 10th floor    fax     : +46 8 667 82 30
SE-113 31 STOCKHOLM, SWEDEN



From:  [mailto:] On Behalf Of steve ellis
Sent: 20 March 2009 20:42
To: 
Subject: [erlang-questions] to supervise or not to supervise

New to supervision trees and trying to figure out when to use them (and when not to)...

I have a bunch of spawned processes created through spawn_link. Want these processes to stay running indefinitely. If one exits in an error state, we want to restart it N times. After N, we want to error log it, and stop trying to restart it. Perfect job for a one_for_one supervisor, right?

Well sort of. The problem is that when the max restarts for the error process is reached, the supervisor terminates all its children and itself. Ouch! (At least in our case). We'd rather that the supervisor just keep supervising all the children that are ok and not swallow everything up.

The Design Principles appear to be saying that swallowing everything up is what supervisors are supposed to do when max restarts is reached which leaves me a little puzzled. Why would you want to kill the supervisor just because a child process is causing trouble? Seems a little harsh.

Is this a case of me thinking supervisors are good for too many things? Is it that our case is better handled by simply spawning these processes and trapping exits on them, and restarting/error logging in the trap exit?
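For comparison, the hand-rolled alternative asked about here could look roughly like the sketch below: one process traps exits, restarts each failed child up to a limit, then logs the failure and drops that child while continuing to watch the rest. The module name simple_restarter and the shape of start/2 are hypothetical. The sketch counts total restarts per child rather than restarts within a time window, so it is not equivalent to what an OTP supervisor does.

```erlang
-module(simple_restarter).
-export([start/2]).

%% Funs: list of zero-arity funs, each run in its own linked process.
%% MaxRestarts: how many times a given fun may be restarted before
%% we give up on it.
start(Funs, MaxRestarts) ->
    spawn(fun() ->
        %% Turn exit signals from linked children into messages.
        process_flag(trap_exit, true),
        Children = [{spawn_link(F), F, 0} || F <- Funs],
        loop(Children, MaxRestarts)
    end).

loop(Children, Max) ->
    receive
        {'EXIT', Pid, Reason} ->
            case lists:keytake(Pid, 1, Children) of
                {value, {Pid, _F, _N}, Rest} when Reason =:= normal ->
                    %% Normal termination: stop tracking, no restart.
                    loop(Rest, Max);
                {value, {Pid, F, N}, Rest} when N < Max ->
                    %% Abnormal exit and restarts remain: start it again.
                    loop([{spawn_link(F), F, N + 1} | Rest], Max);
                {value, {Pid, _F, N}, Rest} ->
                    %% Limit reached: log it and keep the other children.
                    error_logger:error_msg(
                      "giving up on ~p after ~p restarts: ~p~n",
                      [Pid, N, Reason]),
                    loop(Rest, Max);
                false ->
                    %% Exit signal from a process we do not track.
                    loop(Children, Max)
            end
    end.
```

A call such as simple_restarter:start([fun my_worker:run/0], 3) would run a hypothetical my_worker:run/0 and restart it at most three times. The trade-off is that shutdown ordering, restart windows and escalation all have to be written by hand.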

Thanks!

Steve
