delayed child restart with incremental back-off

Mon May 3 11:03:37 CEST 2021

Hi!

See answers below.

On Sun, 2 May 2021 at 21:01, Nicolas Martyanoff <khaelin@REDACTED> wrote:

>
> Hi,
>
> I originally posted this email on erlang-patches, but I just realized
> most developers are on erlang-questions instead. I believe this could be
> of interest.
>
>
>
Erlang-patches is legacy; we use GitHub instead. And yes, erlang-questions
is still a place for discussions.

> Nine years ago, an interesting patch [1] was submitted by Richard Carlsson
> that allowed delaying the re-creation of failed children in supervisors.
>
> After a quick discussion, the official answer was that the OTP team
> would discuss it [2]. There is no further message on the mailing
> list.
>
> Was there an official response?
>
>
Well, this was some time ago, so I am unsure of how it was communicated. The
conclusion was that we did see merit in the idea, but that we could not
include something that would be backwards incompatible by default. To change
defaults we need a phasing-out mechanism and a period of testing to find out
what problems the change might cause for legacy code. We also did not have an
immediate use case of our own that could motivate prioritizing it and putting
much of our own time into it, so it requires a bigger effort from the
contributor to motivate, test, and think through all scenarios.
Alas, we do not have the luxury to pursue all the ideas that we think are good
ones. One example of something that we had wanted to do for a long time, and
finally got to do, is gen_statem. The recent contribution of significant
children to supervisors is an example of a successful Open Source contribution
where we also happened to have an immediate use case.

Regards Ingela - Erlang OTP/Team - Ericsson AB

> I have various supervisors whose children handle network connections.
> When something goes wrong with the connection, children die and are
> immediately restarted. Most of the time, errors are transient (remote
> server restarting, temporary network issue, etc.), but retrying without
> any delay is pretty much guaranteed to fail again. And of course, after a
> few retries, the application dies, which is unacceptable.
>
> This kind of behaviour is a huge problem: it fills logs with multiple
> copies of identical errors and causes a system failure.
>
> In general, if I could, I would use restart delays with exponential
> backoff everywhere because in practice, restarting immediately is almost
> never the right approach: code errors do not disappear when restarting
> so they are going to get triggered again immediately, and external errors
> are not magically fixed by retrying without any delay.
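
To make the idea of a restart delay with exponential back-off concrete, here
is a minimal sketch in Python. It is not the API of the patch in [1] or of the
OTP supervisor; the function name, the parameters, and their default values
are all hypothetical and chosen only for illustration.

```python
def restart_delay_ms(attempt, base_ms=100, factor=2, max_ms=30_000):
    """Return the delay before restart number `attempt` (0-based).

    Hypothetical helper: the delay grows geometrically with each
    consecutive failure and is capped at `max_ms`.
    """
    if attempt < 0:
        raise ValueError("attempt must be non-negative")
    return min(max_ms, base_ms * factor ** attempt)


# Delays for ten consecutive failures of the same child.
delays = [restart_delay_ms(n) for n in range(10)]
```

With these assumed defaults the first restart waits 100 ms, each further
failure doubles the wait, and the cap keeps a long outage from pushing the
delay beyond 30 seconds. A real implementation would also need to decide when
a child has run long enough for the attempt counter to be reset.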
>
> Is there still interest in this patch?
>
> [1] https://erlang.org/pipermail/erlang-patches/2012-January/002575.html
> [2] https://erlang.org/pipermail/erlang-patches/2012-January/002597.html
>
> --
> Nicolas Martyanoff
> http://snowsyn.net
> khaelin@REDACTED
>