EEP proposal - Delayed restarts of supervisor children

Thu Jun 17 14:34:21 CEST 2021

I am against this proposal, for the same reasons I have opposed similar
ones in the past. Most of my opposition has been written up before, so I'm
just going to link to it here: https://ferd.ca/it-s-about-the-guarantees.html

I dislike the distinction between running and active children because the
caller would arguably need a way to check for it, and it would be
absolutely terrible to have to ask the supervisor on every hot call path;
that semantic distinction should IMO be implemented in the worker, as per
my post above.

I'm also not sure about using the max delay of all children when a restart
applies to several of them. This makes sense as a preventive measure against
being too aggressive, but actively prevents being able to consider distinct
tasks as actually distinct. Take the following example where you have a
configuration handler starting with a very short backoff, in a one_for_all
situation with a database client that relies on the configuration handler
to start. The configuration handler may also be used by other workers for
information and you want it available. This is set as a one_for_all
configuration such that if the client to the database goes down, the whole
ensemble restarts to provide a fresh DB config (in case the config changed!).
With these retry policies, the ability to provide a fresh config is now
limited by the delay of the client wanting to retry connecting to the
database, even though they refer to fundamentally different operations with
different load profiles deserving different timers. The end result is that
you'd end up having to split your supervision tree to make sure the
timeouts in one child do not affect other ones. That's messy.
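
To make the coupling concrete, here is a tiny sketch in Python (purely for
illustration; the child names and delay values are made up and nothing here
is OTP API) of what "use the max delay of all affected children" does to the
configuration handler:

```python
# Hypothetical per-child restart delays in milliseconds. The names and the
# numbers are assumptions for illustration, not values from the proposal.
child_delays_ms = {"config_handler": 50, "db_client": 5000}


def group_restart_delay(delays):
    """Delay used when one restart affects several children: the maximum."""
    return max(delays.values())


# Under one_for_all, every child waits for the slowest policy in the group.
shared_delay_ms = group_restart_delay(child_delays_ms)
slowdown_factor = shared_delay_ms / child_delays_ms["config_handler"]
```

With these made-up numbers the configuration handler, which asked for a
very short delay, inherits the database client's much longer one.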

On Thu, Jun 17, 2021 at 7:27 AM Viktor Söderqvist <viktor@REDACTED>
wrote:

> If we add delays, then how about exponential backoff? e.g. doubling the
> delay for each failed restart attempt. Is it worth considering too? It
> has been suggested before and it's common for network re-attempts.
>

Even exponential backoffs are not necessarily good enough; a better
approach you would likely look for is exponential backoff with jitter,
which now requires a bit more configuration. And in most advanced cases, the
thing you end up wanting is not necessarily just backoffs, but also
mechanisms for circuit-breaking, which also have various implementation
approaches (solid vs. gradual cooldowns, trickle probing for repair, manual
opening/closing of the breaker, local-only or shared counts, etc.) that can
provide much better and richer semantics there, and in most cases with
libraries already available.
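
As a rough sketch of the "bit more configuration" that jitter brings, here
is one common variant ("full jitter": pick a uniformly random delay between
zero and the exponential ceiling), written in Python for illustration only.
The parameter names and defaults (base, cap) are assumptions, not something
the proposal defines:

```python
import random


def backoff_with_jitter(attempt, base=0.1, cap=30.0, rng=random):
    """Seconds to wait before retry number `attempt` (0-based).

    The ceiling doubles on every attempt and is clamped at `cap`; the
    actual delay is drawn uniformly from [0, ceiling] so that many clients
    failing at the same time do not all retry at the same instant.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


# Example: delays for the first five retries (values differ on every run).
delays = [backoff_with_jitter(n) for n in range(5)]
```

Already there are three knobs (base, cap, and the jitter strategy itself),
and other jitter strategies exist with their own trade-offs.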

If any of this proposal goes through, I would argue in favor of not
supporting incremental backoffs, because you either guarantee a basic,
subpar implementation that still needs replacing with nicer libraries in
use cases that need some refinement (which might be worth embracing for
simplicity's sake), or, if you want a more solid implementation, you end up
with supervisors that dedicate most of their logic not to being supervisors
but to doing good circuit breaking or exponential backoff with triggers
(and no coordination).
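
To give a sense of how much state and policy even a deliberately small
breaker carries, here is a minimal sketch in Python (illustration only; the
class, its parameters, and its defaults are hypothetical and do not mirror
any particular library). It implements just one of the options listed above,
a solid cooldown with a local-only failure count:

```python
import time


class CircuitBreaker:
    """Minimal breaker: opens after `max_failures` consecutive failures and
    rejects calls until `cooldown` seconds have passed (solid cooldown)."""

    def __init__(self, max_failures=3, cooldown=5.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.cooldown = cooldown
        self.clock = clock          # injectable so the policy can be tested
        self.failures = 0
        self.opened_at = None       # None means the breaker is closed

    def allow(self):
        """Return True if a call may be attempted right now."""
        if self.opened_at is None:
            return True
        # Once the cooldown has elapsed, calls are let through as probes.
        return self.clock() - self.opened_at >= self.cooldown

    def record_success(self):
        """A successful call closes the breaker and resets the count."""
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        """A failed call counts toward opening (or re-opening) the breaker."""
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = self.clock()
```

Gradual cooldowns, trickle probing, manual control, and shared counts would
each add more state and more decisions on top of this.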

The thing with the supervisors as they are right now is that they *can* and
*should* compose nicely with backoffs and circuit breakers if you use the
init call for guaranteed behaviour rather than for retries. The thing with
supervisors following this proposal is that they start overlapping and
don't compose anymore unless you make sure not to use the features.
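
A rough sketch of what "init for guaranteed behaviour rather than for
retries" can look like, in Python for illustration only (the DbClient class
and the connect callback are hypothetical, not an Erlang or OTP API):
construction only checks what can be guaranteed locally, and the unreliable
remote connection is handled later, inside the worker.

```python
class DbClient:
    """Hypothetical worker whose startup guarantees local state only."""

    def __init__(self, connect, config):
        # "init" phase: fail fast on anything that must hold for the
        # worker to exist at all. No network calls happen here.
        if "host" not in config:
            raise ValueError("missing 'host' in config")
        self._connect = connect
        self._config = config
        self._conn = None           # starts disconnected, and that is fine

    def try_connect(self):
        """Attempt to connect; a failure leaves the worker alive."""
        try:
            self._conn = self._connect(self._config["host"])
        except ConnectionError:
            self._conn = None
        return self._conn is not None

    def query(self, q):
        """Callers get an explicit 'disconnected' answer from the worker
        itself instead of having to ask a supervisor about its state."""
        if self._conn is None and not self.try_connect():
            return ("error", "disconnected")
        return ("ok", self._conn(q))
```

Any backoff or circuit-breaking policy would then sit around try_connect
inside the worker, leaving the supervisor to do nothing but supervise.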