delayed child restart with incremental back-off

Tue May 4 02:20:22 CEST 2021

On 2021/05/03 18:04, Nicolas Martyanoff wrote:
> Ingela Andin <ingela@REDACTED> writes:
> 
>> Erlang-patches is legacy, we use GitHub instead, and yes erlang-questions
>> is still a place for discussions.
> Got it. It would make sense to send an email to people posting on
> erlang-questions to inform them (instead of just telling them that the
> mailing list is "moderated").
> 
>> Well, this was some time ago so I am unsure of how it was
>> communicated. But the conclusion was that we did see merit in the idea
>> but that we were not able to include something that would be backwards
>> incompatible by default. To be able to change defaults we need to have
>> a phasing out mechanism and period of testing what problems it might
>> cause legacy code. We also did not have an immediate own use case for
>> this that could motivate it to be prioritized for us to put much of
>> our own time into it, and hence it requires a bigger effort from the
>> contributor to motivate and test and think through all scenarios.
> 
> Thank you for explaining.
> 
> While I understand your point, I fear that this line of reasoning leads
> to lots of developers having to skip various OTP components because they
> simply cannot be patched. Backward compatibility is important; but
> pushed to the extreme, it is tantamount to stagnation and death.
> 
> In this case, I am going to have to write a new supervisor module and
> apparently I'm not the first one to do so. In addition to a new
> gen_server so that I can get the right types and the infinite call
> timeout by default, among other things.

You don't have to implement your own supervisor to get this kind of 
behavior; simply move the connection out of initialization. As a general 
rule, initialization should never depend on anything outside your 
node's control -- especially not something across the network.
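
As a rough illustration of moving the connection out of initialization, 
here is a minimal sketch. The module name lazy_conn, the status call and 
the plain TCP connection are assumptions made for the example, not 
anything from the original patch. It relies on handle_continue/2, so it 
assumes OTP 21 or later. init/1 only builds state and always succeeds; 
the connection attempt happens afterwards, and a failed attempt leaves 
the worker alive but disconnected instead of crashing it.

```erlang
-module(lazy_conn).
-behaviour(gen_server).

-export([start_link/2]).
-export([init/1, handle_continue/2, handle_call/3, handle_cast/2,
         handle_info/2]).

start_link(Host, Port) ->
    gen_server:start_link(?MODULE, {Host, Port}, []).

%% init/1 touches nothing outside the node, so the supervisor never sees
%% a start failure caused by the remote end being unavailable.
init({Host, Port}) ->
    State = #{host => Host, port => Port, socket => undefined},
    {ok, State, {continue, connect}}.

%% Runs right after init/1 returns, inside the already-started process.
handle_continue(connect, State = #{host := Host, port := Port}) ->
    case gen_tcp:connect(Host, Port, [binary, {active, true}], 5000) of
        {ok, Socket} ->
            {noreply, State#{socket := Socket}};
        {error, _Reason} ->
            %% Stay up, just disconnected. Retry policy is left out here.
            {noreply, State}
    end.

handle_call(status, _From, State = #{socket := undefined}) ->
    {reply, disconnected, State};
handle_call(status, _From, State) ->
    {reply, connected, State}.

handle_cast(_Msg, State) ->
    {noreply, State}.

%% Socket data and close notifications are ignored in this sketch.
handle_info(_Info, State) ->
    {noreply, State}.
```

With this shape the supervisor's restart intensity only counts real 
crashes, not an unreachable peer.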

It is less complicated to either:
1. Write a service manager: a connection manager process whose job is 
to know which connections have failed and how long ago, and which 
implements *exactly* the kind of backoff you want by having the workers 
start up disconnected and expose a connect/0 call.
2. Write smarter workers: the connection processes themselves are written 
to handle the case where the connection is lost and implement reconnect 
backoff themselves.

Which way you choose to do this is up to you. Neither is very complicated.
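
To make option 2 concrete, here is one possible sketch of a "smarter 
worker". Everything named here is hypothetical: the module 
backoff_worker, the 500 ms base delay, the 30 second cap and the use of 
a raw TCP socket are assumptions picked for the example. The worker 
starts disconnected, tries to connect, and on failure schedules its own 
retry with a doubling delay. A lost connection is handled as an ordinary 
message, not as a crash.

```erlang
-module(backoff_worker).
-behaviour(gen_server).

-export([start_link/2, next_delay/1]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

-define(BASE_MS, 500).    %% assumed first retry delay
-define(MAX_MS, 30000).   %% assumed upper bound on the delay

-record(s, {host, port, socket = none, attempts = 0}).

start_link(Host, Port) ->
    gen_server:start_link(?MODULE, {Host, Port}, []).

%% Delay before the next attempt after Attempts consecutive failures:
%% the base delay doubled once per failure, capped at ?MAX_MS.
next_delay(Attempts) when is_integer(Attempts), Attempts >= 0 ->
    min(?MAX_MS, ?BASE_MS bsl min(Attempts, 16)).

%% Start disconnected; the first attempt happens after init/1 returns.
init({Host, Port}) ->
    self() ! connect,
    {ok, #s{host = Host, port = Port}}.

handle_call(_Request, _From, State) ->
    {reply, {error, unsupported}, State}.

handle_cast(_Msg, State) ->
    {noreply, State}.

handle_info(connect, State = #s{host = Host, port = Port, attempts = N}) ->
    case gen_tcp:connect(Host, Port, [binary, {active, true}], 5000) of
        {ok, Socket} ->
            %% Success resets the backoff counter.
            {noreply, State#s{socket = Socket, attempts = 0}};
        {error, _Reason} ->
            erlang:send_after(next_delay(N), self(), connect),
            {noreply, State#s{attempts = N + 1}}
    end;
%% Losing the peer is an expected event: go back to connecting.
handle_info({tcp_closed, Socket}, State = #s{socket = Socket}) ->
    self() ! connect,
    {noreply, State#s{socket = none}};
%% Incoming data and stale socket messages are dropped in this sketch.
handle_info(_Other, State) ->
    {noreply, State}.
```

The supervisor stays a plain one_for_one (or whatever you already have); 
all of the retry state lives in the worker.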

Losing an external resource is not a *fault* in your program, but rather 
an expected case that you know about and are discussing right now.

Putting this into supervisors is overloading and specializing 
supervisors to handle a state management task that belongs either in a 
sub-service manager process or inside the state of the workers 
themselves. I tend to opt for the "write a service manager" approach 
when we have a simple_one_for_one type supervisor structure (typical of 
the case where we have multiple incoming connections, the 
service->worker pattern), and the "write smarter workers" approach when 
we have a predetermined number of connections of various types (often 
meaning named workers connecting to specific external resources like one 
connection each to a DB, an upstream feed, and a presence service, all 
of which have totally different code internally).

"Write smarter children" sometimes becomes "write a backoff connection 
behavior" so that the details of backoff can be implemented just once, 
but if it is three or fewer modules... meh.

-Craig