Thanks Ulf and Mihai! Interesting stuff!<br><br>Mihai, I share your trust in OTP and am wary of extending it or using it in a way that's not conventional. I can see the rationale behind Ulf's tweak.<br><br>I'm hoping our needs don't require doing anything outside of OTP (let me know what you think). Let me clarify what we're doing to try and test this assumption/hope. Here is the pattern we have:<br>

<br>1. A need to spawn a series of processes that stay running unless they exit abnormally more than N times.<br><br>2. The spawned processes aren't used/relied on by other processes for the system as a whole to function properly. They are autonomous (writing results to a central data store).<br>

<br>3. If one of these processes goes down we want to know about it, but it shouldn't bring the whole system down by cascading down a supervision tree.<br><br>4. If one of these processes dies, it shouldn't bring down a sibling process.<br>

<br>5. All of these spawned processes have the same restart strategy.<br><br>To do this each process has its own supervisor (per Mihai's suggestion).<br><br>This means we write two modules, one implementing the supervisor behavior and one implementing the gen_server behavior. We start our autonomous/standalone processes by starting a link to a "standalone" child supervisor (from a gen_server parent) and add a dynamic child gen_server process to the child supervisor. The children do their work. If one fails, we try to restart it a few times per the supervision strategy. If the max restarts are reached, the autonomous/standalone supervisor automatically sends an {'EXIT', DeadSupervisorPid, reached_max_restart_intensity} message to the gen_server parent that started it (this is the default behavior of any linked process; we don't write any code for it). We get the message in the parent gen_server's handle_info function and do something with it.<br>
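<br>A minimal sketch of what such a "standalone" supervisor module might look like. The module name standalone_sup, the WorkerMFA argument and the limit of 3 restarts in 60 seconds are all hypothetical, and for brevity the worker is declared as a static child rather than added dynamically as described above.<br>

```erlang
-module(standalone_sup).
-behaviour(supervisor).

-export([start_link/1, init/1]).

%% Start one dedicated supervisor for one worker.
%% WorkerMFA is a hypothetical {Module, Function, Args} tuple; the
%% function must start and link the worker and return {ok, Pid}.
start_link(WorkerMFA) ->
    supervisor:start_link(?MODULE, WorkerMFA).

init({Mod, _Fun, _Args} = WorkerMFA) ->
    %% Give up after more than 3 restarts within 60 seconds
    %% (illustrative values, not taken from the thread).
    RestartStrategy = {one_for_one, 3, 60},
    %% permanent: the worker is always restarted until the limit is hit.
    Worker = {worker, WorkerMFA, permanent, 5000, worker, [Mod]},
    {ok, {RestartStrategy, [Worker]}}.
```

<br>Because this supervisor has exactly one child, hitting the restart limit takes down only that worker and its own supervisor, never a sibling.<br>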

<br>I'm thinking this is all standard OTP stuff and that we're not doing anything that is a "bad idea". Am I right about this? Thanks!<br><br>Steve<br><br><div class="gmail_quote"><span style="font-size: large; font-weight: bold;">Forwarded conversation</span><br>

Subject: <b class="gmail_sendername">to supervise or not to supervise</b><br>------------------------<br><br><span class="undefined"><font color="#000000">From: <b class="undefined">steve ellis</b> <span dir="ltr"><<a href="mailto:steve.e.123@gmail.com">steve.e.123@gmail.com</a>></span><br>

Date: Fri, Mar 20, 2009 at 3:42 PM<br>To: <a href="mailto:erlang-questions@erlang.org">erlang-questions@erlang.org</a><br></font><br></span><br>New to supervision trees and trying to figure out when to use them (and when not to)...<br>

<br>I have a bunch of spawned processes created through spawn_link. We want these processes to stay running indefinitely. If one exits in an error state, we want to restart it N times. After N, we want to error log it, and stop trying to restart it. Perfect job for a one_for_one supervisor, right?<br>


<br>Well sort of. The problem is that when the max restarts for the error process is reached, the supervisor terminates all its children and itself. Ouch! (At least in our case). We'd rather that the supervisor just keep supervising all the children that are ok and not swallow everything up.<br>


<br>The Design Principles appear to be saying that swallowing everything up is what supervisors are supposed to do when max restarts is reached which leaves me a little puzzled. Why would you want to kill the supervisor just because a child process is causing trouble? Seems a little harsh.<br>


<br>Is this a case of me thinking supervisors are good for too many things? Is it that our case is better handled by simply spawning these processes and trapping exits on them, and restarting/error logging in the trap exit?<br>


<br>Thanks!<br><font color="#888888"><br>Steve<br><br><br>

</font><br>----------<br><span class="undefined"><font color="#000000">From: <b class="undefined">Lennart Öhman</b> <span dir="ltr"><<a href="mailto:Lennart.Ohman@st.se">Lennart.Ohman@st.se</a>></span><br>Date: Fri, Mar 20, 2009 at 5:23 PM<br>

To: steve ellis <<a href="mailto:steve.e.123@gmail.com">steve.e.123@gmail.com</a>>, "<a href="mailto:erlang-questions@erlang.org">erlang-questions@erlang.org</a>" <<a href="mailto:erlang-questions@erlang.org">erlang-questions@erlang.org</a>><br>

</font><br></span><br>


<div link="blue" vlink="purple" lang="SV">


<div>


<p><span style="font-size: 11pt; font-family: "Courier New";" lang="EN-GB">Hi,

as you have discovered there are a few different restart strategies available

when designing a supervisor (e.g. one_for_one, one_for_all). One can of course come

up with more or less an infinite number of such strategies, each one with its

own twist.</span></p>


<p><span style="font-size: 11pt; font-family: "Courier New";" lang="EN-GB">The

main idea and problem at the time when the supervisor behaviour was constructed

was that you have a set of more or less permanent processes that implement a

subsystem. There should not be any ‘illegal’ terminations (ones

that cause the supervisor to act) amongst the children. But, as we know, no

non trivial system is completely correct, hence an occasional failure and

following restart must be allowed. If we have repeated failures it may indicate

that the problem concerns more than this subsystem, therefore the need to

eventually escalate the restarts (or as you have discovered kill all children

and itself).</span></p>


<p><span style="font-size: 11pt; font-family: "Courier New";" lang="EN-GB"> </span></p>


<p><span style="font-size: 11pt; font-family: "Courier New";" lang="EN-GB">If

you really want to achieve a situation where failures never escalate above a

certain supervisor you can bump up the max-restart threshold and at the same

time shorten the sliding window. It is a not uncommon mistake (in normal usage

of supervisors :-)</span><span style="font-size: 11pt; font-family: "Courier New";" lang="EN-GB">

) to have a too high max-restart-intensity in combination with a too short

sliding window at a higher level supervisor. It may then be too long between

escalation attempts by lower level supervisor for the same error, making a higher

level supervisor not consider two failures amongst its children being the same

error, and therefore not eventually escalate to its superior supervisor.</span></p>
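<p>To make the interplay between the max-restart threshold and the sliding window concrete, here is a rough sketch of the check, assuming the documented rule that a supervisor gives up when more than MaxR restarts occur within MaxT seconds. The module restart_window and the function exceeded/3 are hypothetical illustrations, not part of OTP.</p>

```erlang
-module(restart_window).
-export([exceeded/3]).

%% Restarts: times (in seconds) at which restarts happened, oldest first.
%% Returns true if, at the moment of the latest restart, more than MaxR
%% restarts (including that one) fall within the last MaxT seconds.
exceeded([], _MaxR, _MaxT) ->
    false;
exceeded(Restarts, MaxR, MaxT) ->
    Now = lists:last(Restarts),
    InWindow = [T || T <- Restarts, Now - T =< MaxT],
    length(InWindow) > MaxR.
```

<p>With MaxR = 3 and MaxT = 60, four restarts spaced 10 seconds apart exceed the limit, while the same four restarts spaced 30 seconds apart do not, because the oldest one has already slid out of the window.</p>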


<p><span style="font-size: 11pt; font-family: "Courier New";" lang="EN-GB"> </span></p>


<p><span style="font-size: 11pt; font-family: "Courier New";" lang="EN-GB">Best

Regards</span></p>


<p><span style="font-size: 11pt; font-family: "Courier New";" lang="EN-GB">Lennart</span></p>


<p><span style="font-size: 11pt;" lang="EN-GB"> </span></p>


<p><span lang="EN-GB">-------------------------------------------------------------</span></p>


<p><span lang="EN-GB">Lennart Öhman                   direct 

: +46 8 587 623 27</span></p>


<p><span lang="EN-GB">Sjöland & Thyselius Telecom AB 

cellular: +46 70 552 6735</span></p>


<p>Hälsingegatan 43, 10 th floor   fax     : +46 8 667 82 30</p>


<p>SE-113 31 STOCKHOLM, SWEDEN     email   :

<a href="mailto:lennart.ohman@st.se" target="_blank">lennart.ohman@st.se</a></p>


<p> </p>


<p><span style="font-size: 11pt; color: rgb(31, 73, 125);" lang="EN-GB"> </span></p>


<div style="border-style: solid none none; border-color: rgb(181, 196, 223) -moz-use-text-color -moz-use-text-color; border-width: 1pt medium medium; padding: 3pt 0cm 0cm;">


<p><b><span style="font-size: 10pt;" lang="EN-US">From:</span></b><span style="font-size: 10pt;" lang="EN-US"> <a href="mailto:erlang-questions-bounces@erlang.org" target="_blank">erlang-questions-bounces@erlang.org</a>

[mailto:<a href="mailto:erlang-questions-bounces@erlang.org" target="_blank">erlang-questions-bounces@erlang.org</a>] <b>On Behalf Of </b>steve ellis<br>

<b>Sent:</b> den 20 mars 2009 20:42<br>

<b>To:</b> <a href="mailto:erlang-questions@erlang.org" target="_blank">erlang-questions@erlang.org</a><br>

<b>Subject:</b> [erlang-questions] to supervise or not to supervise</span></p>


</div><div><div></div></div></div>


</div>


<br>----------<br><span class="undefined"><font color="#000000">From: <b class="undefined">Mihai Balea</b> <span dir="ltr"><<a href="mailto:mihai@hates.ms">mihai@hates.ms</a>></span><br>Date: Fri, Mar 20, 2009 at 5:29 PM<br>

To: steve ellis <<a href="mailto:steve.e.123@gmail.com">steve.e.123@gmail.com</a>><br></font><br></span><br><div><div></div></div>

As far as I know, the standard supervisor cannot behave the way you want it to.<br>

<br>

So, at least until this type of behavior is added to the standard supervisor, you can work around it with double layers of supervision.  Basically have one dedicated supervisor for each process you want to supervise and, in turn, each dedicated supervisor is set up as a transient child to one big supervisor.<br>
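<br>One possible shape for this two-layer arrangement is sketched below. The module name two_layer_sup, the WorkerMFA argument and all restart limits are hypothetical. It relies on the documented behaviour of transient children: a child that exits with reason normal or shutdown is not restarted, and a supervisor that gives up is expected to exit with shutdown.<br>

```erlang
-module(two_layer_sup).
-behaviour(supervisor).

-export([start_link/0, start_worker/1, start_dedicated/1, init/1]).

%% Start the one big supervisor.
start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, top).

%% Add one worker, wrapped in its own dedicated supervisor.
%% WorkerMFA is a hypothetical {Module, Function, Args} that starts
%% and links the worker and returns {ok, Pid}.
start_worker(WorkerMFA) ->
    supervisor:start_child(?MODULE, [WorkerMFA]).

%% Called by the big supervisor to start a dedicated supervisor.
start_dedicated(WorkerMFA) ->
    supervisor:start_link(?MODULE, {dedicated, WorkerMFA}).

init(top) ->
    %% Every child of the big supervisor is a dedicated supervisor.
    %% transient: one that gives up is not restarted, so its failure
    %% does not count against the big supervisor or touch siblings.
    Child = {dedicated_sup, {?MODULE, start_dedicated, []},
             transient, infinity, supervisor, [?MODULE]},
    {ok, {{simple_one_for_one, 10, 60}, [Child]}};
init({dedicated, {Mod, _Fun, _Args} = WorkerMFA}) ->
    %% Exactly one worker per dedicated supervisor.
    Worker = {worker, WorkerMFA, permanent, 5000, worker, [Mod]},
    {ok, {{one_for_one, 3, 60}, [Worker]}}.
```

<br>Both layers live in one module here only to keep the sketch self-contained; in practice they would more likely be two separate modules.<br>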

<font color="#888888">

<br>

Mihai<br>

<br>

</font><br>----------<br><span class="undefined"><font color="#000000">From: <b class="undefined">steve ellis</b> <span dir="ltr"><<a href="mailto:steve.e.123@gmail.com">steve.e.123@gmail.com</a>></span><br>Date: Sun, Mar 22, 2009 at 10:58 AM<br>

To: Mihai Balea <<a href="mailto:mihai@hates.ms">mihai@hates.ms</a>>, <a href="mailto:erlang-questions@erlang.org">erlang-questions@erlang.org</a><br></font><br></span><br>Thanks Lennart and Mihai! Very helpful information. Lennart, it's good to know about the intent behind supervisor's original design.<br>

<br>I like Mihai's suggestion of having one supervisor supervise each process. This would get us most of the way there and it would be easy to implement.<br>

<br>But is there any way in OTP to see when a supervisor reaches its max restarts? I know this is logged by the sasl error logger. But how would I trap/detect this event in my code to do something with it?<br><br>It doesn't look like supervisor has a function like gen_server's handy terminate/2.<br>


<br>Maybe it would make more sense in our case to have one gen_server process monitor a child gen_server process. The child could call a function in the parent when it terminates. This way we'd have access to the terminate function of the monitoring/supervising gen_server. The problem with this though is that we'd have to implement our own restart strategy behavior, which is what is so great about supervisor.<br>


<br>This might be related to something more general that I've been wondering about (which I should post as a question in a new thread). How to tap into the sasl error logger so my system can do stuff with those events. For example I'd like to send these events to another machine via tcp.<br>
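<br>One way to tap into these events is to register an extra gen_event handler with error_logger:add_report_handler/2. The sketch below is only an assumption about how that could look: the module name report_forwarder and the {logged, Event} message shape are hypothetical, and instead of opening a TCP connection it simply forwards each event to a process (Dest), which could itself own a socket.<br>

```erlang
-module(report_forwarder).
-behaviour(gen_event).

-export([install/1]).
-export([init/1, handle_event/2, handle_call/2, handle_info/2,
         terminate/2, code_change/3]).

%% Register this module as an additional error_logger handler.
%% Dest is the pid that should receive a copy of every event.
install(Dest) ->
    error_logger:add_report_handler(?MODULE, Dest).

init(Dest) ->
    {ok, Dest}.

%% Called for every event the error logger emits; pass it on unchanged.
handle_event(Event, Dest) ->
    Dest ! {logged, Event},
    {ok, Dest}.

handle_call(_Request, Dest) ->
    {ok, ok, Dest}.

handle_info(_Info, Dest) ->
    {ok, Dest}.

terminate(_Reason, _Dest) ->
    ok.

code_change(_OldVsn, Dest, _Extra) ->
    {ok, Dest}.
```

<br>The exact shape of the forwarded events depends on the OTP release and on whether sasl is running, so the receiving process should not assume a fixed format.<br>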


<br>Thanks!<br><font color="#888888"><br>Steve</font><div><div></div></div><br>----------<br><span class="undefined"><font color="#000000">From: <b class="undefined">steve ellis</b> <span dir="ltr"><<a href="mailto:steve.e.123@gmail.com">steve.e.123@gmail.com</a>></span><br>

Date: Sun, Mar 22, 2009 at 12:24 PM<br>To: <a href="mailto:erlang-questions@erlang.org">erlang-questions@erlang.org</a><br></font><br></span><br>I just realized that the process that spawns my standalone supervisors would be linked by default to the supervisors through its call to start_link to start the supervisors in the first place. So when a supervisor dies because it has reached its max restarts, the calling gen_server process will get an exit signal in its handle_info callback of {'EXIT', DeadSupervisorPid, reached_max_restart_intensity}. This is basic error handling stuff and it is where I would write my code to do something with the error.<br>
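<br>A minimal sketch of such a parent gen_server, with hypothetical names throughout (standalone_parent, watch/1, failed/0). Two assumptions are worth flagging: the parent has to trap exits, otherwise the exit signal is not delivered as a message to handle_info/2; and the sketch matches on any exit reason, since as far as I can tell the supervisor reports reached_max_restart_intensity to the error logger but exits with reason shutdown.<br>

```erlang
-module(standalone_parent).
-behaviour(gen_server).

-export([start_link/0, watch/1, failed/0]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

start_link() ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

%% StartFun is a hypothetical zero-arity fun that starts and links a
%% supervisor and returns {ok, Pid}. It runs inside this gen_server,
%% so the link is between the supervisor and this process.
watch(StartFun) ->
    gen_server:call(?MODULE, {watch, StartFun}).

%% List of {Pid, Reason} for every linked process that has died.
failed() ->
    gen_server:call(?MODULE, failed).

init([]) ->
    %% Turn exit signals from linked processes into 'EXIT' messages.
    process_flag(trap_exit, true),
    {ok, []}.

handle_call({watch, StartFun}, _From, Failed) ->
    {reply, StartFun(), Failed};
handle_call(failed, _From, Failed) ->
    {reply, Failed, Failed}.

handle_cast(_Msg, Failed) ->
    {noreply, Failed}.

%% A linked supervisor (or any other linked process) has died.
handle_info({'EXIT', Pid, Reason}, Failed) ->
    error_logger:error_msg("~p gave up: ~p~n", [Pid, Reason]),
    {noreply, [{Pid, Reason} | Failed]};
handle_info(_Other, Failed) ->
    {noreply, Failed}.
```

<br>Without trap_exit, an exit signal with any reason other than normal would kill the parent instead of reaching handle_info/2.<br>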


<br>And now as I read the docs on handle_info/2 I see that that is where all system messages get sent, which seems to answer my other question.<br><br>So I think I'm on the right track. Please someone let me know if I'm missing something. Thanks!<br>

<font color="#888888">

<br>Steve</font><div><div></div></div><br>----------<br><span class="undefined"><font color="#000000">From: <b class="undefined">Mihai Balea</b> <span dir="ltr"><<a href="mailto:mihai@hates.ms">mihai@hates.ms</a>></span><br>

Date: Sun, Mar 22, 2009 at 6:11 PM<br>To: steve ellis <<a href="mailto:steve.e.123@gmail.com">steve.e.123@gmail.com</a>><br>Cc: <a href="mailto:erlang-questions@erlang.org">erlang-questions@erlang.org</a><br></font><br>

</span><br>

You are on the right track.  However, keep in mind a few things:<br>

 - You need to do this only if you need some sort of special handling of that error condition.  If you don't need to do anything with it, except maybe some reporting, then a standard supervisor will do the trick.  Personally, I would attempt to use standard OTP behaviours whenever I could, simply because the code was extensively tested and can be considered error-free for all intents and purposes. However, that is not always possible.<br>


 - If you do end up writing code to supervise your own supervisors, make sure you handle all error conditions and shutdown sequences.<br>

<br>

Basically, I guess what I'm trying to say is, unless you know exactly what you're doing, it's a good idea to let OTP do its job.<br>

<br>

Cheers,<br><font color="#888888">

Mihai<br>

<br>

</font><br>----------<br><span class="undefined"><font color="#000000">From: <b class="undefined">Ulf Wiger</b> <span dir="ltr"><<a href="mailto:ulf@wiger.net">ulf@wiger.net</a>></span><br>Date: Sun, Mar 22, 2009 at 6:28 PM<br>

To: steve ellis <<a href="mailto:steve.e.123@gmail.com">steve.e.123@gmail.com</a>><br>Cc: <a href="mailto:erlang-questions@erlang.org">erlang-questions@erlang.org</a><br></font><br></span><br>So when this question comes up it is customary for me to mention<br>


my extension of the supervisor behavior to allow tracking the number<br>

of restarts... (:<br>

<br>

<a href="http://erlang.org/pipermail/erlang-questions/2003-November/010763.html" target="_blank">http://erlang.org/pipermail/erlang-questions/2003-November/010763.html</a><br>

<br>

The way restart escalation currently works, I think it's wise in most cases<br>

to escalate all the way up as soon as the nearest supervisor is unable<br>

to resolve the situation. I've rarely seen an escalated issue resolve itself<br>

in the middle management layer. You either solve it close to the problem<br>

itself, or you solve it at the top - and try to expedite the work in the<br>

middle.<br>

<br>

(We're of course only talking Erlang supervisors here.)<br>

<br>

BR,<br>

Ulf W<br>

<br>

<br>

2009/3/22 steve ellis <<a href="mailto:steve.e.123@gmail.com">steve.e.123@gmail.com</a>>:<br>


<br></div><br>