[erlang-questions] to supervise or not to supervise
Fri Mar 20 22:23:16 CET 2009
Hi, as you have discovered, there are a few different restart strategies available when designing a supervisor (e.g. one_for_one, one_for_all). One can of course come up with a more or less infinite number of such strategies, each with its own twist.
The main idea, and the problem to be solved at the time the supervisor behaviour was constructed, was that you have a set of more or less permanent processes that implement a subsystem. There should not be any 'illegal' terminations (i.e. ones that cause the supervisor to act) amongst the children. But, as we know, no nontrivial system is completely correct, hence an occasional failure and subsequent restart must be allowed. Repeated failures may indicate that the problem concerns more than this subsystem, hence the need to eventually escalate the restarts (or, as you have discovered, for the supervisor to kill all its children and itself).
If you really want failures never to escalate above a certain supervisor, you can raise the max-restart threshold and at the same time shorten the sliding window. In normal usage of supervisors :) that same combination is a not uncommon mistake: too high a max-restart intensity together with too short a sliding window at a higher-level supervisor. The time between escalation attempts by a lower-level supervisor for the same error may then be too long, so the higher-level supervisor does not treat two failures amongst its children as the same error and therefore never escalates to its own superior supervisor.
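A minimal sketch of such a configuration, assuming a supervisor callback module called lenient_sup and a worker module called my_worker (both names are hypothetical), and using 1000 restarts within 1 second purely as illustrative values. It uses the tuple form of the supervisor flags, {Strategy, MaxR, MaxT}:

```erlang
-module(lenient_sup).
-behaviour(supervisor).
-export([start_link/0, init/1]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

init([]) ->
    %% The supervisor terminates its children and itself only if more
    %% than MaxR restarts occur within the last MaxT seconds.
    MaxR = 1000,   % assumed value: high max-restart threshold
    MaxT = 1,      % assumed value: short sliding window, in seconds
    %% my_worker is a hypothetical module exporting start_link/0.
    Worker = {my_worker,
              {my_worker, start_link, []},
              permanent, 5000, worker, [my_worker]},
    {ok, {{one_for_one, MaxR, MaxT}, [Worker]}}.
```

With these values a child would have to crash more than 1000 times within one second before the supervisor gives up, so in practice the failure stays at this level. The cost is that a child which can never start successfully is restarted over and over.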
Lennart Öhman direct : +46 8 587 623 27
Sjöland & Thyselius Telecom AB cellular: +46 70 552 6735
Hälsingegatan 43, 10 th floor fax : +46 8 667 82 30
SE-113 31 STOCKHOLM, SWEDEN email : lennart.ohman@REDACTED
From: erlang-questions-bounces@REDACTED [mailto:erlang-questions-bounces@REDACTED] On Behalf Of steve ellis
Sent: 20 March 2009 20:42
Subject: [erlang-questions] to supervise or not to supervise
New to supervision trees and trying to figure out when to use them (and when not to)...
I have a bunch of spawned processes created through spawn_link. Want these processes to stay running indefinitely. If one exits in an error state, we want to restart it N times. After N, we want to error log it, and stop trying to restart it. Perfect job for a one_for_one supervisor, right?
Well sort of. The problem is that when the max restarts for the error process is reached, the supervisor terminates all its children and itself. Ouch! (At least in our case). We'd rather that the supervisor just keep supervising all the children that are ok and not swallow everything up.
The Design Principles appear to be saying that swallowing everything up is what supervisors are supposed to do when max restarts is reached, which leaves me a little puzzled. Why would you want to kill the supervisor just because a child process is causing trouble? Seems a little harsh.
Is this a case of me thinking supervisors are good for too many things? Is it that our case is better handled by simply spawning these processes and trapping exits on them, and restarting/error logging in the trap exit?
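The trap-exit alternative asked about above could look roughly like the following. This is only a sketch under stated assumptions: the module name manual_restart and the functions start/2 and watch/2 are hypothetical, each worker gets its own watcher process so that one worker giving up does not affect the others, and logging goes through error_logger.

```erlang
-module(manual_restart).
-export([start/2, watch/2]).

%% Spawn a dedicated watcher process for one worker fun.
start(Fun, MaxRestarts) ->
    spawn(fun() -> watch(Fun, MaxRestarts) end).

%% Run Fun in a linked child from the calling process and restart it
%% at most MaxRestarts times after abnormal exits.
watch(Fun, MaxRestarts) ->
    process_flag(trap_exit, true),
    loop(Fun, spawn_link(Fun), MaxRestarts).

loop(Fun, Pid, RestartsLeft) ->
    receive
        {'EXIT', Pid, normal} ->
            %% The child finished normally: nothing to restart.
            ok;
        {'EXIT', Pid, Reason} when RestartsLeft > 0 ->
            error_logger:error_msg("child ~p exited: ~p, restarting~n",
                                   [Pid, Reason]),
            loop(Fun, spawn_link(Fun), RestartsLeft - 1);
        {'EXIT', Pid, Reason} ->
            %% Restart budget used up: log and stop watching this child.
            error_logger:error_msg("child ~p exited: ~p, giving up~n",
                                   [Pid, Reason]),
            gave_up
    end.
```

Calling manual_restart:start(Fun, N) once per worker gives each worker an independent restart budget. Unlike a supervisor, this sketch counts restarts over the whole lifetime of the watcher rather than within a sliding time window.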