delayed child restart with incremental back-off

Michael Truog mjtruog@REDACTED
Mon May 3 03:15:44 CEST 2021

To put as many error checks as possible in the initialization phase, we 
should be able to establish connections during process initialization.  
That keeps the logic simple and reliable (the requirements for the 
runtime are established as clear constraints).  To facilitate that use 
of Erlang processes, it is advantageous to have backoff in the 
supervisor source code, with the understanding that it is meant for 
external failures (normally associated with network connections, such as 
a database that is down, when the connection is critical to the 
operation of the Erlang process).

The backoff would provide an increasing delay before each restart, while 
the Shutdown timeout value can remain constant (the termination time 
is unrelated to external failures).
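
As a rough sketch of the delay schedule such a backoff might follow, 
assuming a doubling delay with an upper cap: the module name 
restart_backoff, the function names, and the base and cap values are 
all hypothetical, chosen only for illustration, and none of this is an 
existing supervisor option.

```erlang
%% Hypothetical helper module; not part of OTP's supervisor.
-module(restart_backoff).
-export([delay/3, schedule/3]).

%% Delay in milliseconds before restart attempt Attempt (0 = first retry).
%% The delay starts at BaseMs, doubles on every attempt, and never
%% exceeds MaxMs.
-spec delay(non_neg_integer(), pos_integer(), pos_integer()) -> pos_integer().
delay(_Attempt, BaseMs, MaxMs) when BaseMs >= MaxMs ->
    %% The cap has been reached, so stop doubling.
    MaxMs;
delay(0, BaseMs, _MaxMs) ->
    BaseMs;
delay(Attempt, BaseMs, MaxMs) when is_integer(Attempt), Attempt > 0 ->
    delay(Attempt - 1, BaseMs * 2, MaxMs).

%% The delays for the first Count restart attempts, in order.
-spec schedule(non_neg_integer(), pos_integer(), pos_integer()) ->
          [pos_integer()].
schedule(Count, BaseMs, MaxMs) when is_integer(Count), Count >= 0 ->
    [delay(N, BaseMs, MaxMs) || N <- lists:seq(0, Count - 1)].
```

With a 100 ms base and a 5000 ms cap, restart_backoff:schedule(8, 100, 
5000) returns [100,200,400,800,1600,3200,5000,5000].  The Shutdown 
timeout does not appear anywhere in the calculation, which is the point: 
only the wait before a restart grows.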

On 5/2/21 5:08 PM, Tristan Sloughter wrote:
> I still think supervisors are the wrong place for this, and Fred's blog post about it from back then is still the best explanation.
> On Sun, May 2, 2021, at 13:00, Nicolas Martyanoff wrote:
>> Hi,
>> I originally posted this email on erlang-patches, but I just realized
>> most developers are on erlang-questions instead. I believe this could be
>> of interest.
>> Nine years ago, an interesting patch [1] was submitted by Richard Carlsson
>> that allows delaying the re-creation of failed children in supervisors.
>> After a quick discussion, the official answer was that the OTP team
>> would discuss it [2]. There is no further message on the mailing
>> list.
>> Was there an official response?
>> I have various supervisors whose children handle network connections.
>> When something goes wrong with the connection, children die and are
>> immediately restarted. Most of the time, errors are transient (remote
>> server restarting, temporary network issue, etc.), but retrying without
>> any delay is pretty much guaranteed to fail again. And of course, after a
>> few retries, the application dies, which is unacceptable.
>> This kind of behaviour is a huge problem: it fills logs with multiple
>> copies of identical errors and causes a system failure.
>> In general, if I could, I would use restart delays with exponential
>> backoff everywhere because in practice, restarting immediately is almost
>> never the right approach: code errors do not disappear when restarting
>> so they are going to get triggered again immediately, and external errors
>> are not magically fixed by retrying without any delay.
>> Is there still interest in this patch?
>> [1]
>> [2]
>> -- 
>> Nicolas Martyanoff
>> khaelin@REDACTED
