delayed child restart with incremental back-off

Nicolas Martyanoff khaelin@REDACTED
Sun May 2 21:00:57 CEST 2021


Hi,

I originally posted this email on erlang-patches, but I just realized
most developers are on erlang-questions instead. I believe this could be
of interest.


Nine years ago, an interesting patch [1] was submitted by Richard Carlsson
that made it possible to delay the restart of failed children in supervisors.

After a quick discussion, the official answer was that the OTP team
would discuss it [2]. There are no further messages on the mailing
list.

Was there ever an official response?

I have various supervisors whose children handle network connections.
When something goes wrong with the connection, children die and are
immediately restarted. Most of the time, errors are transient (remote
server restarting, temporary network issue, etc.), but retrying without
any delay is pretty much guaranteed to fail again. And of course, after a
few retries, the supervisor exceeds its maximum restart intensity and the
application dies, which is unacceptable.

This kind of behaviour is a huge problem: it fills logs with multiple
copies of identical errors and causes a system failure.

In general, if I could, I would use restart delays with exponential
backoff everywhere because in practice, restarting immediately is almost
never the right approach: code errors do not disappear when restarting
so they are going to get triggered again immediately, and external errors
are not magically fixed by retrying without any delay.
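
To make "restart delays with exponential backoff" concrete, here is a
minimal sketch of the delay computation I have in mind. It is not taken
from the patch in [1]; the module name, the function names and the
parameters (a base delay and a maximum delay, both in milliseconds) are
hypothetical and only serve as an illustration.

```erlang
%% Hypothetical helper module, for illustration only.
-module(backoff_sketch).
-export([delay/3, delays/3]).

%% Delay in milliseconds before restart attempt number Attempt (0-based):
%% the base delay doubles with each attempt and is capped at Max.
delay(Attempt, Base, Max) when is_integer(Attempt), Attempt >= 0 ->
    min(Max, Base * (1 bsl Attempt)).

%% Delays for the first N restart attempts.
delays(N, Base, Max) ->
    [delay(Attempt, Base, Max) || Attempt <- lists:seq(0, N - 1)].
```

With a base delay of 100 ms and a cap of 2000 ms,
backoff_sketch:delays(6, 100, 2000) returns
[100,200,400,800,1600,2000], so a child that keeps failing would be
retried less and less often instead of being restarted in a tight loop.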

Is there still interest in this patch?

[1] https://erlang.org/pipermail/erlang-patches/2012-January/002575.html
[2] https://erlang.org/pipermail/erlang-patches/2012-January/002597.html

-- 
Nicolas Martyanoff
http://snowsyn.net
khaelin@REDACTED

