delayed child restart with incremental back-off

Mon May 3 09:33:00 CEST 2021

I don't disagree with the article.

* Connect (or other) outside of init: yes
* Callers getting a response consistent with the state: yes

But not everything is managing a connection with callers.

I can see restart delays in the supervisor being very useful in these cases:

* The process that is (re)started is just a worker. For example, a
process that synchronizes data between two nodes (over the distribution
or not, with or without a handshake)

* The process uses a third-party library that performs an operation that
may crash and leave this process in a bad state (so it has to restart)

I can also see restart delays being useful in the case where you just do
a file:open or similar, which can run into a resource limit. Sure,
you could do the backoff in your process, but doing a backoff in every
process that may get an emfile is a bit much.
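
For illustration, here is a minimal sketch of what such an in-process
backoff around file:open might look like. The module name open_backoff,
the function names, and the defaults (100 ms initial delay, 5 s cap,
5 attempts) are all hypothetical; nothing here comes from OTP or from
the patch under discussion.

```erlang
-module(open_backoff).
-export([open/2, open/4, delay/2]).

%% Hypothetical defaults, chosen only for illustration.
-define(INITIAL_DELAY, 100).   % milliseconds
-define(MAX_DELAY, 5000).      % milliseconds
-define(MAX_ATTEMPTS, 5).

open(Path, Modes) ->
    open(Path, Modes, ?INITIAL_DELAY, ?MAX_ATTEMPTS).

%% Last attempt: return whatever file:open/2 returns.
open(Path, Modes, _Delay, 1) ->
    file:open(Path, Modes);
open(Path, Modes, Delay, Attempts) when Attempts > 1 ->
    case file:open(Path, Modes) of
        {error, emfile} ->
            %% Out of file descriptors: wait, then retry with a longer delay.
            timer:sleep(Delay),
            open(Path, Modes, delay(Delay, ?MAX_DELAY), Attempts - 1);
        Other ->
            %% Success or any other error: no retry.
            Other
    end.

%% Double the delay, up to Max.
delay(Delay, Max) ->
    min(Delay * 2, Max).
```

Every process that opens files would have to call something like
open_backoff:open/2 instead of file:open/2, and similar wrappers would
be needed for each other call that can hit a resource limit.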

The advantage of having this option in the supervisor is that you don't
have to implement the backoff everywhere; you can implement it in the
process only where that provides value (such as HTTP/database connections).

Cheers,

On 03/05/2021 02:08, Tristan Sloughter wrote:
> I still think supervisors are the wrong place for this and Fred's blog post about it from back then is still the best explanation https://ferd.ca/it-s-about-the-guarantees.html
> 
> On Sun, May 2, 2021, at 13:00, Nicolas Martyanoff wrote:
>>
>> Hi,
>>
>> I originally posted this email on erlang-patches, but I just realized
>> most developers are on erlang-questions instead. I believe this could be
>> of interest.
>>
>>
>> Nine years ago, an interesting patch [1] was submitted by Richard Carlsson
>> that allows delaying the re-creation of failed children in supervisors.
>>
>> After a quick discussion, the official answer was that the OTP team
>> would discuss it [2]. There was no further message on the mailing
>> list.
>>
>> Was there an official response?
>>
>> I have various supervisors whose children handle network connections.
>> When something goes wrong with the connection, children die and are
>> immediately restarted. Most of the time, errors are transient (remote
>> server restarting, temporary network issue, etc.), but retrying without
>> any delay is pretty much guaranteed to fail again. And of course after a
>> few retries, the application dies which is unacceptable.
>>
>> This kind of behaviour is a huge problem: it fills logs with multiple
>> copies of identical errors and causes a system failure.
>>
>> In general, if I could, I would use restart delays with exponential
>> backoff everywhere because in practice, restarting immediately is almost
>> never the right approach: code errors do not disappear when restarting
>> so they are going to get triggered again immediately, and external errors
>> are not magically fixed by retrying without any delay.
>>
>> Is there still interest in this patch?
>>
>> [1] https://erlang.org/pipermail/erlang-patches/2012-January/002575.html
>> [2] https://erlang.org/pipermail/erlang-patches/2012-January/002597.html
>>
>> -- 
>> Nicolas Martyanoff
>> http://snowsyn.net
>> khaelin@REDACTED
>>

-- 
Loïc Hoguin
https://ninenines.eu