[erlang-questions] catch supervisor failure

Fri Aug 26 10:52:10 CEST 2011

On 25 Aug 2011, at 20:56, Max Lapshin wrote:

> I need to fetch url once a minute. It is ok for remote server to fail
> to reply with this url several times, so I want to use supervisor
> mechanism for it:
> 
> set 60 restarts in 5 seconds as a limit and if supervisor fails, it
> should be stopped for some time. After 10 minutes it should be
> restarted.
> 
> Question is: should I use supervisors for this mechanism or I need to
> write my own failure tracker?

This is a bit convoluted with standard supervisors, but one way to do it is to have layers of supervisors with rest_for_one supervision. The request process can die and get restarted immediately; if it hits the restart limit, its supervisor is restarted, which will also restart a 'delay' process that is started before the request process. Given that the start sequence is synchronous, the 'delay' process could simply wait during init/1, but this is considered bad form. A better solution might be for the request process to ask the delay process for permission.
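As a rough sketch of the layering described above (not a tested recipe), a single callback module could serve both supervisor levels. The names fetch_sup, fetch_inner_sup, fetch_delay and fetcher are hypothetical, the restart limits and the 10 minute backoff are placeholder values, and the map-style flags and child specs assume OTP 18 or later:

```erlang
-module(fetch_sup).
-behaviour(supervisor).

-export([start_link/0, start_inner/0, init/1]).

%% Placeholder backoff handed to the (hypothetical) delay process.
-define(BACKOFF_MS, (10 * 60 * 1000)).

%% Outer layer: its only job is to restart the inner supervisor
%% after the inner one has given up.
start_link() ->
    supervisor:start_link({local, fetch_sup}, ?MODULE, outer).

%% Inner layer: started as a child of the outer supervisor.
start_inner() ->
    supervisor:start_link({local, fetch_inner_sup}, ?MODULE, inner).

init(outer) ->
    %% Generous limits (assumed values), so that escalations from the
    %% inner layer do not take the outer supervisor down too quickly.
    SupFlags = #{strategy => one_for_one, intensity => 10, period => 3600},
    Inner = #{id => fetch_inner_sup,
              start => {?MODULE, start_inner, []},
              restart => permanent,
              shutdown => infinity,
              type => supervisor},
    {ok, {SupFlags, [Inner]}};
init(inner) ->
    %% Assumed limit: more than 5 restarts in 60 seconds terminates this
    %% supervisor, and the outer layer then restarts it from scratch.
    SupFlags = #{strategy => rest_for_one, intensity => 5, period => 60},
    %% The delay process comes first, so it is already running when the
    %% request process starts and asks it for permission.
    Delay = #{id => fetch_delay,
              start => {fetch_delay, start_link, [?BACKOFF_MS]},
              restart => permanent,
              shutdown => 5000,
              type => worker},
    Fetcher = #{id => fetcher,
                start => {fetcher, start_link, []},
                restart => permanent,
                shutdown => 5000,
                type => worker},
    {ok, {SupFlags, [Delay, Fetcher]}}.
```

With rest_for_one, a crash of the request process restarts only that process, since nothing is started after it. The delay process is restarted only when the whole inner supervisor is, which is exactly the escalated case.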

The next problem is how to avoid waiting 10 minutes the first time. Common tricks are to create an ets table on initial start (and have it owned by a higher supervisor), or to repeat the same pattern in the layer above, so that the 'delay' process can ask another process whether this is an initial start or an escalated restart.
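To make the ets trick concrete, here is a minimal sketch of such a delay process, under some assumptions of mine: the module name fetch_delay, the table name and the function names are all hypothetical, and create_table/0 is meant to be called from the init/1 of a supervisor above the restarting layer, so that the table survives escalated restarts. The request process would call await_permission/0 before its first fetch.

```erlang
-module(fetch_delay).
-behaviour(gen_server).

-export([create_table/0, start_link/1, await_permission/0]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

-define(TAB, fetch_delay_seen).

%% Call this from a long-lived process (e.g. a higher supervisor's
%% init/1); the calling process becomes the owner of the table.
create_table() ->
    ets:new(?TAB, [named_table, public, set]).

start_link(BackoffMs) ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, BackoffMs, []).

%% Blocks the caller until the delay process allows it to proceed.
await_permission() ->
    gen_server:call(?MODULE, await, infinity).

init(BackoffMs) ->
    %% ets:insert_new/2 returns true only if the key was not there yet,
    %% i.e. on the initial start; every later start is a restart.
    Wait = case ets:insert_new(?TAB, {started_once, true}) of
               true  -> 0;
               false -> BackoffMs
           end,
    %% Do not sleep in init/1; schedule a message to ourselves instead.
    erlang:send_after(Wait, self(), open),
    {ok, #{open => false, waiting => []}}.

handle_call(await, _From, #{open := true} = State) ->
    {reply, ok, State};
handle_call(await, From, #{waiting := Waiting} = State) ->
    %% Not open yet: remember the caller and reply when the timer fires.
    {noreply, State#{waiting := [From | Waiting]}}.

handle_cast(_Msg, State) ->
    {noreply, State}.

handle_info(open, #{waiting := Waiting} = State) ->
    lists:foreach(fun(From) -> gen_server:reply(From, ok) end, Waiting),
    {noreply, State#{open := true, waiting := []}};
handle_info(_Other, State) ->
    {noreply, State}.
```

Because init/1 returns immediately, the supervisor's synchronous start sequence is never blocked; only callers of await_permission/0 wait.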

I once wrote a modified supervisor behaviour which could keep track of the number of restarts, including escalated restarts. I usually refer to it once a year or so, but I wrote the code back in 2001, so I'm not sure I can find it anymore.

OTOH, writing your own custom supervisor is not *that* hard. The main complication may be that the OTP supervisors also tie into the release handler and appmon.

BR,
Ulf W

Ulf Wiger, CTO, Erlang Solutions, Ltd.
http://erlang-solutions.com