delayed child restart with incremental back-off

Mon May 10 15:27:18 CEST 2021

This would still be incredibly useful.

Not having it adds a whole bunch of complexity every time a database connection, or any other connection that can fail, must be opened. Even more so if this process must be up before other processes start.

The current self() ! finish_startup recommendation drops all the startup ordering and failure recovery synchronisation on the programmer, and that is hard to get right.

gen_server:init is such a natural place to put this synchronous opening logic. The supervisor can then take care of the synchronous start of the connection and managing dependent processes.
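
For anyone who has not seen it, the workaround I mean looks roughly like the sketch below. This is only an illustration, not the patch under discussion: the module name conn_worker, the ConnectFun argument, the get_conn call and the back-off constants (100 ms base, 30 s cap) are all made up for the example.

```erlang
-module(conn_worker).
-behaviour(gen_server).

-export([start_link/1, backoff/1]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

%% ConnectFun is a hypothetical zero-arity fun returning
%% {ok, Conn} | {error, Reason}.
start_link(ConnectFun) when is_function(ConnectFun, 0) ->
    gen_server:start_link(?MODULE, ConnectFun, []).

init(ConnectFun) ->
    %% Return immediately; the connection is opened later, asynchronously.
    self() ! finish_startup,
    {ok, #{connect => ConnectFun, conn => undefined, attempt => 0}}.

handle_call(get_conn, _From, State = #{conn := Conn}) ->
    {reply, Conn, State};
handle_call(_Request, _From, State) ->
    {reply, {error, unsupported}, State}.

handle_cast(_Msg, State) ->
    {noreply, State}.

handle_info(finish_startup, State = #{connect := ConnectFun, attempt := N}) ->
    case ConnectFun() of
        {ok, Conn} ->
            {noreply, State#{conn := Conn, attempt := 0}};
        {error, _Reason} ->
            %% Retry inside the process, with a growing delay.
            erlang:send_after(backoff(N), self(), finish_startup),
            {noreply, State#{attempt := N + 1}}
    end;
handle_info(_Info, State) ->
    {noreply, State}.

%% Delay in milliseconds: doubles per attempt, capped at 30 seconds.
backoff(N) when is_integer(N), N >= 0 ->
    min(100 bsl min(N, 20), 30000).
```

Note that init/1 returns before the connection exists, so the supervisor considers the child started and moves on to the next one. Every dependent process then has to cope with conn being undefined, which is exactly the ordering and recovery burden I am complaining about.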

/Sean

> On 10 May 2021, at 15:04, Richard Carlsson <carlsson.richard@REDACTED> wrote:
> 
> What happened at the time was that I met up with the OTP team and discussed it, and they eventually agreed that this was a good thing. However, it needed more work to be accepted (and I realized a couple of weaknesses in the implementation that I needed to address), but I never found time to do more work on it.
> 
>         /Richard
> 
> 
> On Sun, 2 May 2021 at 21:01, Nicolas Martyanoff <khaelin@REDACTED> wrote:
> 
> Hi,
> 
> I originally posted this email on erlang-patches, but I just realized
> most developers are on erlang-questions instead. I believe this could be
> of interest.
> 
> 
> Nine years ago, an interesting patch [1] was submitted by Richard Carlsson
> that allows delaying the re-creation of failed children in supervisors.
> 
> After a quick discussion, the official answer was that the OTP team
> would discuss it [2]. There are no further messages on the mailing
> list.
> 
> Was there an official response?
> 
> I have various supervisors whose children handle network connections.
> When something goes wrong with the connection, children die and are
> immediately restarted. Most of the time, errors are transient (remote
> server restarting, temporary network issue, etc.), but retrying without
> any delay is pretty much guaranteed to fail again. And of course, after a
> few retries the application dies, which is unacceptable.
> 
> This kind of behaviour is a huge problem: it fills logs with multiple
> copies of identical errors and causes a system failure.
> 
> In general, if I could, I would use restart delays with exponential
> backoff everywhere because in practice, restarting immediately is almost
> never the right approach: code errors do not disappear when restarting
> so they are going to get triggered again immediately, and external errors
> are not magically fixed by retrying without any delay.
> 
> Is there still interest in this patch?
> 
> [1] https://erlang.org/pipermail/erlang-patches/2012-January/002575.html
> [2] https://erlang.org/pipermail/erlang-patches/2012-January/002597.html
> 
> -- 
> Nicolas Martyanoff
> http://snowsyn.net
> khaelin@REDACTED
