EEP proposal - Delayed restarts of supervisor children

Fred Hebert mononcqc@REDACTED
Thu Jun 17 19:36:23 CEST 2021

On Thu, Jun 17, 2021 at 10:22 AM Maria Scott <maria-12648430@REDACTED> wrote:

> Hi Fred,
>
> > I am against this proposal, for similar reasons I have opposed similar
> > ones in the past. Most of my opposition has been written up before so I'm
> > just going to link to it here:
>
> As Loic (sorry, no idea how you put those double dots over an i ^^;;;)
> pointed out, just because a feature exists does not mean that it must be
> used, or that it is a good fit for everything, or that it has to be the one
> way by which all things have to be done. The feature takes away nothing.

My view of this is that the feature creates an expectation. It is going to
be seen as the blessed implementation and the one people default to because
the language provides it as a feature of the core component of its core
framework. When something like maps is added to the language, it's expected
to meet a level of quality, and it carries an inherent level of trust given
to it by the community.

I can't properly put into words the sort of impact saying "oh the
supervisor, the one thing that is so core to OTP? it's actually not that
good and you should just use parts of it" would have. Even if there's
absolutely no good way to measure it, OTP has always represented this solid
core I know I can definitely build on *in any stateful application I ever
worked on in Erlang over the last decade*. This is less and less true as we
add features that are not that useful to everyone and not that solid, just
because maybe sometimes it's good and convenient. This is the core of the
whole robust state management structure here.

To me, the reasonable result of shipping a feature that is inadequate to be
properly useful is not "it's alright, people won't use it", it's "we're
going to improve it at some point in time, and we have to bear the cost of
ownership in its future maintenance." When it's this sort of core, you
either do it right or don't do it, and remain explicit about what you
offer. If it's not done right, it has to either be taken out or improved,
so I'm not feeling good about hearing "well if it's not good just don't use
it" in this very specific context.

My position would be different if it were something else, not in the
supervisors. For example, the 'global' library has lots of edge cases that
are documented, with inherent limitations to its design, but that's fine
because it is far less central to the whole philosophy of the platform.

> Also, I presume that delayed restarts are really most useful and the
> primary field of use cases in the one_for_one strategies. TBH, figuring
> something out for one_for_all and rest_for_one was the biggest headache in
> all this, and we would actually have been glad if we could have ignored
> them. Nevertheless, they are there.

For me they have nothing to do with the type of supervision, but with the
type of load protection you're trying to offer. I will want to use a very
different retry strategy when I'm talking to a constrained resource from
distributed workers than when I'm talking to a local one where everything is
sharing the same underlying resources.

The delayed restart is less useful in a simple_one_for_one scenario because
a circuit breaker buys you more for less *as long as all the workers
contact the same endpoint*, but this is no longer true if each worker can
contact a different place.
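
As a rough illustration of that last point, here is a minimal sketch of a
circuit breaker that is shared per endpoint rather than per supervisor. It is
written in Python purely for brevity, not as OTP code; the names
CircuitBreaker and BreakerRegistry and the max_failures and cooldown
parameters are hypothetical and come from no existing library. When every
worker asks the registry for the same endpoint, they all share one breaker;
when workers contact different places, each place gets its own.

```python
import time


class CircuitBreaker:
    """Opens after `max_failures` consecutive failures; allows a trial
    call again once `cooldown` seconds have passed since it opened."""

    def __init__(self, max_failures=3, cooldown=5.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.cooldown = cooldown
        self.clock = clock          # injectable so the breaker can be tested
        self.failures = 0
        self.opened_at = None       # None means the breaker is closed

    def allow(self):
        """Return True if a call to the endpoint may be attempted now."""
        if self.opened_at is None:
            return True
        return self.clock() - self.opened_at >= self.cooldown

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = self.clock()


class BreakerRegistry:
    """Hands out one breaker per endpoint (hypothetical helper)."""

    def __init__(self, **breaker_options):
        self._options = breaker_options
        self._breakers = {}

    def for_endpoint(self, endpoint):
        if endpoint not in self._breakers:
            self._breakers[endpoint] = CircuitBreaker(**self._options)
        return self._breakers[endpoint]
```

The point of keying on the endpoint is that a failure of one remote place
does not throttle workers talking to another, which a single delay attached
to the supervisor cannot express.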

Some core variables here are:
a) what is the thing you are protecting by backing off and is there no
other way of protecting it or no inherent ways of doing it?
b) is there an ability to do an effective level of coordination across
c) is there an expected ability to increase scale on the receiving end or
is it uncontrollable, and if so what's the rate?
d) is there a local cost to the operation taking place?
e) what is the meaning of different faults and the way you should react to
f) how do you handle invalidating the state across failures, and how much
do you keep?

All of these variables are about the worker, the endpoint it talks to, how
it talks to it, and what it tells it. It's about the relationship between
the worker and the thing it tries to talk to. It is not (or only
superficially) related to supervision structures, and the supervision
structure is consequently not the proper place for this decision to live in.

> > I dislike the possibility of running vs. active children because
> > arguably the caller would need to have a way to check for that and it would
> > be absolutely terrible to have to ask the supervisor on every hot call
> > path; that semantic distinction should IMO be implemented in the worker, as
> > per my post above.
>
> Sorry, I don't understand that one ^^; On what kind of calls would you
> want to ask the supervisor if a child is running or not?

Well there's a distinction between "I can't talk to the remote end because
it's failing on its side" and "I can't talk to the remote state because
we're failing on our side." This distinction is represented in the
"running" vs. "active" status by defining whether I'm in the process of
retrying, or whether I have given up. If both situations are represented by
the process being gone, obtaining the answer to that question requires me
to ask the supervisor whether the child is "running" or "active", otherwise
the ambiguity is unresolvable from a caller's point of view.
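
To make the alternative concrete, here is a minimal sketch of a worker that
stays alive and owns its retry state, so a caller can ask the worker directly
instead of the supervisor. It is Python rather than Erlang, purely for
brevity, and every name in it (Status, Worker, attempt, call, and the backoff
parameters) is a hypothetical illustration, not part of the proposal or of
OTP.

```python
from enum import Enum


class Status(Enum):
    CONNECTED = "connected"
    RETRYING = "retrying"   # the remote end is failing, we are still trying
    GAVE_UP = "gave_up"     # we stopped trying on our side


class Worker:
    def __init__(self, connect, max_attempts=5, base_delay=0.1, max_delay=5.0):
        self._connect = connect          # callable, raises ConnectionError
        self.max_attempts = max_attempts
        self.base_delay = base_delay
        self.max_delay = max_delay
        self.attempts = 0
        self.status = Status.RETRYING

    def attempt(self):
        """Make one connection attempt. Returns the number of seconds to
        wait before the next attempt, or None if none should be made."""
        if self.status is Status.GAVE_UP:
            return None
        try:
            self._connect()
        except ConnectionError:
            self.attempts += 1
            if self.attempts >= self.max_attempts:
                self.status = Status.GAVE_UP
                return None
            self.status = Status.RETRYING
            # Capped exponential backoff: base, 2*base, 4*base, ...
            return min(self.max_delay,
                       self.base_delay * 2 ** (self.attempts - 1))
        self.attempts = 0
        self.status = Status.CONNECTED
        return None

    def call(self, request):
        """Callers get an explicit reason instead of a missing process."""
        if self.status is Status.CONNECTED:
            return ("ok", request)
        return ("error", self.status.value)
```

With this shape, a caller that receives ("error", "retrying") knows the
remote end is the problem, while ("error", "gave_up") means the failure is
local, and no hot call path has to consult the supervisor.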

> > I'm also not sure of using the max delay of all children to assign it
> > when it applies to many. This makes sense as a preventive measure against
> > being too aggressive, but actively prevents being able to consider distinct
> > tasks as actually distinct.
>
> Sure, but it ensures that tasks (or parts thereof) which _do_ belong
> together _stay_ (in the sense of, are started) together.

Sure. I think it's the least problematic way of doing things, but I'll
restate my point that retry policy isn't a question of workers vs. workers,
but of worker vs. the endpoints it talks to. So they're grouped together as
a convenience but I don't think it's semantically meaningful. It's a lesser
evil, and I appreciate the thoroughness of having covered this really funky
edge case in the proposal, that's very considerate and I don't think I
could propose a better approach that wouldn't require a fundamental change
to the retry mechanism.
