EEP proposal - Delayed restarts of supervisor children

Viktor Söderqvist viktor@REDACTED
Mon Jun 21 18:23:46 CEST 2021


On 2021-06-21 17:21, Fred Hebert wrote:
> At the very least, we should find ways to provide guidance, some 
> libraries, demos or samples, or see if there could be a way to create a 
> "client" behaviour that could take that common state machine of 
> disconnected --> [connecting -->] connected and augment it with backoff 
> or even circuit breaker mechanisms ("give the name of the shared circuit 
> breaker your clients are using"), which would far more easily let people 
> put the fault-handling behaviour close to the error-handling mechanisms 
> and bring the decision making to app-specific concerns, and create 
> extensible mechanisms.

EEP XXX: New behaviour "gen_client"

Very nice, Fred! Feeling up for it?

How do you feel about the pattern where you have a manager process
alongside a supervisor? Connection pools typically have this structure.
Do you think such manager processes are a reasonable place for delay
logic?

I had this scenario some years ago: there are a few replicas of a
database, used for read-only access to offload a master database. For
each of these replicas, you have a connection pool (poolboy or some
other). Each db replica may be down, but it may also just be lagging too
far behind in replication (there's a way to query this), in which case
you don't want to use it until it has caught up.

I used a manager worker process alongside a supervisor of all the pools.
The manager could start/stop the connection pools by adding them to or
removing them from the supervisor, and additionally keep track of which
replicas are usable and which aren't. If a replica is down, there's no
point in having all its connection processes stuck in reconnect loops,
so I'd stop them and remove them from the supervision tree. Any pitfalls
with this design?
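
To make the design concrete, here is a rough sketch of the decision
logic such a manager might run on each health check. It is written in
Python only to keep it short, and it is not the code I actually used.
The names ReplicaStatus, Plan, plan and max_lag_seconds are made up for
illustration. In OTP, the "start" list would map to
supervisor:start_child/2 and the "stop" list to
supervisor:terminate_child/2 followed by supervisor:delete_child/2.

```python
from dataclasses import dataclass
from typing import Dict, List, NamedTuple, Set


@dataclass(frozen=True)
class ReplicaStatus:
    """Hypothetical result of one health check against a replica."""
    reachable: bool
    lag_seconds: float = 0.0  # replication lag; only meaningful if reachable


class Plan(NamedTuple):
    start: List[str]   # pools to add to the supervisor
    stop: List[str]    # pools to terminate and remove from the supervisor
    usable: List[str]  # pools that may currently serve reads


def plan(running: Set[str], statuses: Dict[str, ReplicaStatus],
         max_lag_seconds: float) -> Plan:
    # A pool should exist for every replica that is reachable.
    reachable = {name for name, s in statuses.items() if s.reachable}
    # A lagging replica keeps its pool but is not handed out to readers.
    usable = {name for name in reachable
              if statuses[name].lag_seconds <= max_lag_seconds}
    return Plan(
        start=sorted(reachable - running),
        # Pools for down (or unknown) replicas are stopped so their
        # connection processes do not sit in reconnect loops.
        stop=sorted(running - reachable),
        usable=sorted(usable),
    )
```

The point of the split is that "down" and "lagging" are handled
differently: a down replica loses its pool, while a lagging one keeps
its connections and is merely left out of the usable set until it has
caught up.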

On a different note regarding automatic reconnects in clients: they may
be problematic, since there may be some state associated with the
connection (such as an ongoing database transaction) which is lost if
an automatic reconnect is done without care. Crashing instead of
reconnecting makes this handling much simpler (or at least it moves the
problem somewhere else). How would you best solve this using the
hypothetical gen_client behaviour?
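
To illustrate the concern, one possible shape (purely a sketch, not a
proposal for the actual behaviour) is a client that follows the
disconnected --> connecting --> connected state machine with capped
exponential backoff, but refuses to reconnect silently when a
transaction is open. It is written in Python for brevity. All names
here (Client, ConnectionLost, in_transaction, base_delay, max_delay)
are invented for the example.

```python
import enum


class State(enum.Enum):
    DISCONNECTED = "disconnected"
    CONNECTING = "connecting"
    CONNECTED = "connected"


class ConnectionLost(Exception):
    """The connection dropped while it carried state that cannot be restored."""


class Client:
    def __init__(self, base_delay: float = 0.1, max_delay: float = 5.0):
        self.state = State.DISCONNECTED
        self.in_transaction = False  # connection-bound state
        self.attempts = 0            # consecutive failed connect attempts
        self.base_delay = base_delay
        self.max_delay = max_delay

    def connect_started(self) -> None:
        self.state = State.CONNECTING

    def connect_failed(self) -> float:
        """Record a failed attempt; return seconds to wait before retrying."""
        delay = min(self.max_delay, self.base_delay * 2 ** self.attempts)
        self.attempts += 1
        self.state = State.DISCONNECTED
        return delay

    def connect_succeeded(self) -> None:
        self.attempts = 0  # reset the backoff once connected
        self.state = State.CONNECTED

    def connection_lost(self) -> None:
        if self.in_transaction:
            # Reconnecting would silently drop the transaction, so fail
            # loudly and let the owner decide (the "crash" option).
            raise ConnectionLost("connection lost during a transaction")
        self.state = State.DISCONNECTED  # safe to reconnect later
```

This only moves the problem, of course: whoever owns the transaction
still has to deal with the failure, but at least the reconnect never
hides it.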

Viktor

