<div dir="ltr"><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Thu, Jun 17, 2021 at 10:22 AM Maria Scott <<a href="mailto:maria-12648430@hnc-agency.org">maria-12648430@hnc-agency.org</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Hi Fred,<br>
>
>> I am against this proposal, for similar reasons I have opposed similar ones in the past. Most of my opposition has been written up before so I'm just going to link to it here: https://ferd.ca/it-s-about-the-guarantees.html
>
> As Loic (sorry, no idea how you put those double dots over an i ^^;;;) pointed out, just because a feature exists does not mean that it must be used, that it is a good fit for everything, or that it has to be the one way by which all things are done. The feature takes away nothing.

My view is that the feature creates an expectation. It is going to be seen as the blessed implementation, the one people default to, because the language ships it as a feature of the core component of its core framework. When something like maps is added to the language, it is expected to meet a certain level of quality, and the community grants it an inherent level of trust.

I can't properly put into words the sort of impact saying "oh, the supervisor, the one thing that is so core to OTP? it's actually not that good and you should just use parts of it" would have. Even if there's absolutely no good way to measure it, OTP has always represented this solid core I knew I could build on in every stateful application I worked on in Erlang over the last decade. That becomes less and less true as we add features that are not that useful to everyone and not that solid, just because they are sometimes good and convenient. The supervisor is the core of the whole robust state-management structure here.

To me, the reasonable result of shipping a feature that is inadequate to be properly useful is not "it's alright, people won't use it"; it's "we're going to improve it at some point, and we have to bear the cost of ownership in its future maintenance." When something is this core, you either do it right or you don't do it at all, and remain explicit about what you offer. If it's not done right, it has to be either taken out or improved, so I don't feel good hearing "well, if it's not good, just don't use it" in this very specific context.

My position would be different if this were somewhere else, outside the supervisors. For example, the 'global' library has lots of documented edge cases and inherent limitations to its design, but that's fine, because it is far less central to the whole philosophy of the platform.
> Also, I presume that delayed restarts are most useful, and find their primary use cases, in the one_for_one strategies. TBH, figuring something out for one_for_all and rest_for_one was the biggest headache in all this, and we would actually have been glad if we could have ignored them. Nevertheless, they are there.

For me they have nothing to do with the type of supervision, but with the type of load protection you're trying to offer. I will want a very different retry strategy when I'm talking to a constrained resource from distributed workers than when I'm talking to a local one where everything shares the same underlying resources.

The delayed restart is less useful in a simple_one_for_one scenario, because a circuit breaker buys you more for less as long as all the workers contact the same endpoint, but this is no longer true if each worker can contact a different place.

Some core variables here are:

a) what is the thing you are protecting by backing off, and is there no other way of protecting it, or no inherent way of doing so?
b) is there an ability to do an effective level of coordination across workers?
c) is there an expected ability to increase scale on the receiving end, or is it uncontrollable, and if so, what's the rate?
d) is there a local cost to the operation taking place?
e) what do different faults mean, and how should you react to each of them?
f) how do you handle invalidating state across failures, and how much of it do you keep?

All of these variables are about the worker, the endpoint it talks to, how it talks to it, and what it tells it. It's about the relationship between the worker and the thing it tries to talk to. It is not (or only superficially) related to supervision structures, and the supervision structure is consequently not the proper place for this decision to live.
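To make that concrete, here is a rough, minimal sketch of the kind of worker-owned retry policy I mean. Everything in it is made up for illustration: the db_worker name, the backoff numbers, and the try_connect/1 and do_request/2 helpers standing in for whatever client library the worker would actually use.

    -module(db_worker).
    -behaviour(gen_server).

    -export([start_link/1, request/2]).
    -export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

    start_link(Endpoint) ->
        gen_server:start_link(?MODULE, Endpoint, []).

    request(Pid, Req) ->
        gen_server:call(Pid, {request, Req}).

    init(Endpoint) ->
        %% Never block the supervisor on a slow or dead endpoint: return
        %% right away and make the first connection attempt asynchronously.
        self() ! connect,
        {ok, #{endpoint => Endpoint, conn => undefined, delay => 100}}.

    handle_info(connect, State = #{endpoint := Endpoint, delay := Delay}) ->
        case try_connect(Endpoint) of
            {ok, Conn} ->
                {noreply, State#{conn := Conn, delay := 100}};
            {error, _Reason} ->
                %% Back off inside the worker: the process stays alive, so
                %% no restart intensity is consumed, and each worker can
                %% pick a policy suited to the endpoint *it* talks to.
                erlang:send_after(Delay, self(), connect),
                {noreply, State#{delay := min(Delay * 2, 30000)}}
        end;
    handle_info(_Msg, State) ->
        {noreply, State}.

    handle_call({request, _Req}, _From, State = #{conn := undefined}) ->
        %% Callers get an explicit answer instead of a dead process.
        {reply, {error, disconnected}, State};
    handle_call({request, Req}, _From, State = #{conn := Conn}) ->
        {reply, do_request(Conn, Req), State}.

    handle_cast(_Msg, State) ->
        {noreply, State}.

    %% Placeholders standing in for a real client library.
    try_connect(_Endpoint) -> {error, no_client_library}.
    do_request(_Conn, _Req) -> {error, no_client_library}.

The child spec for a worker like this stays completely ordinary, and every one of the variables above gets answered (or deliberately ignored) in one place, right next to the code that talks to the endpoint.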
>> I dislike the possibility of running vs. active children because arguably the caller would need to have a way to check for that and it would be absolutely terrible to have to ask the supervisor on every hot call path; that semantic distinction should IMO be implemented in the worker, as per my post above.
>
> Sorry, I don't understand that one ^^; On what kind of calls would you want to ask the supervisor if a child is running or not?

Well, there's a distinction between "I can't talk to the remote end because it's failing on its side" and "I can't talk to the remote end because we're failing on our side." This distinction is represented in the "running" vs. "active" status by defining whether I'm still in the process of retrying, or whether I have given up. If both situations are represented by the process being gone, obtaining the answer to that question requires me to ask the supervisor whether the child is "running" or "active"; otherwise the ambiguity is unresolvable from a caller's point of view.
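To stay with the hypothetical db_worker sketch from earlier: the worker can answer that question itself, with extra handle_call/3 clauses placed before the request ones (the gave_up flag is my assumption here, something the reconnect loop would set after exhausting some retry budget), so the hot call path never has to touch the supervisor at all.

    %% Illustrative additions to the earlier db_worker sketch. 'gave_up'
    %% is assumed to be set by the reconnect loop once a retry budget
    %% is exhausted; until then the worker is still "retrying".
    handle_call(status, _From, State = #{gave_up := true}) ->
        {reply, given_up, State};     % we stopped trying on our side
    handle_call(status, _From, State = #{conn := undefined}) ->
        {reply, retrying, State};     % their side is failing; still trying
    handle_call(status, _From, State) ->
        {reply, connected, State};

Because the process stays alive in both cases, "their side is failing" and "we gave up" can be told apart by asking the worker directly, which resolves the ambiguity without a supervisor round-trip.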
>> I'm also not sure about using the max delay of all children to assign a delay when it applies to many. This makes sense as a preventive measure against being too aggressive, but it actively prevents considering distinct tasks as actually distinct.
>
> Sure, but it ensures that tasks (or parts thereof) which _do_ belong together _stay_ (in the sense of, are started) together.

Sure. I think it's the least problematic way of doing things, but I'll restate my point that retry policy isn't a question of workers vs. workers, but of a worker vs. the endpoints it talks to. They're grouped together as a convenience, but I don't think the grouping is semantically meaningful. It's a lesser evil, and I appreciate the thoroughness of covering this really funky edge case in the proposal; that's very considerate, and I don't think I could propose a better approach that wouldn't require a fundamental change to the retry mechanism.