<div dir="ltr"><div>I am against this proposal, for the same reasons I have opposed similar ones in the past. Most of my opposition has been written up before so I'm just going to link to it here: <a href="https://ferd.ca/it-s-about-the-guarantees.html">https://ferd.ca/it-s-about-the-guarantees.html</a></div><div><br></div><div>I dislike introducing a distinction between running and active children because arguably the caller would need a way to check for it, and it would be absolutely terrible to have to ask the supervisor on every hot call path; that semantic distinction should IMO be implemented in the worker, as per my post above.<br></div><div><br></div><div>I'm also not sure about using the max delay of all children when a restart applies to many of them. This makes sense as a preventive measure against being too aggressive, but actively prevents being able to consider distinct tasks as actually distinct. Take the following example where you have a configuration handler starting with a very short backoff, in a one_for_all situation with a database client that relies on the configuration handler to start. The configuration handler may also be used by other workers for information and you want it available. This is set as a one_for_all configuration such that if the client to the database goes down, the whole ensemble restarts to provide a fresh DB config (in case the config changed!).</div><div> With these retry policies, the ability to provide a fresh config is now limited by the delay of the client wanting to retry connecting to the database, even though they refer to fundamentally different operations with different load profiles deserving different timers. The end result is that you'd end up having to split your supervision tree to make sure the timeouts in one child do not affect the others. 
That's messy.<br></div><div><br></div><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Thu, Jun 17, 2021 at 7:27 AM Viktor Söderqvist <<a href="mailto:viktor@zuiderkwast.se">viktor@zuiderkwast.se</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
If we add delays, then how about exponential backoff? e.g. doubling the <br>
delay for each failed restart attempt. Is it worth considering too? It <br>
has been suggested before and it's common for network re-attempts.<br></blockquote><div><br></div><div>Even exponential backoffs are not necessarily good enough; a better approach you would likely look for is exponential backoff with jitter, which now requires a bit more configuration. And in the most advanced cases, the thing you end up wanting is not necessarily just backoffs, but also mechanisms for circuit-breaking, which also have various implementation approaches (solid vs. gradual cooldowns, trickle probing for repair, manual open/closing of the breaker, local-only or shared counts, etc.) that can provide much better and richer semantics there, and in most cases with libraries already available.</div><div><br></div><div>If any part of this proposal goes through I would argue in favor of not supporting incremental backoffs, because this either guarantees you're gonna get a basic, subpar implementation that still needs replacing with nicer libraries in use cases that need some refinement (which might be worth embracing for simplicity's sake), or sends you down the path of having supervisors which have most of their logic dedicated not to being supervisors but to doing good circuit breaking or exponential backoff with triggers (and no coordination) if you want to provide a more solid implementation.</div><div><br></div><div>The thing with the supervisors as they are right now is that they <i>can</i> and <i>should</i> compose nicely with backoffs and circuit breakers if you use the init call for guaranteed behaviour rather than for retries. The thing with supervisors following this proposal is that they start overlapping and don't compose anymore unless you make sure not to use the features.<br></div></div></div>