[erlang-questions] simple_one_for_one supervisor - what happens at restart? (also: gen_tcp)

Tue May 17 10:04:21 CEST 2016

Hello, Fred.

> Funnily enough, the supervision structure isn't all that is being
> trusted though. When an app is shut down, the application controller (or
> is it the master?) also runs through all of the processes on the node
> and looks for those for which it is the group leader and then force
> kills them -- preventing the terminate function from being called.

Thanks for this information, I did not know that!

Soooo... 

Do supervisors with strategies _other than_ simple_one_for_one restart dynamically started children? Like: I add a transient child dynamically with its full child spec. Will one_for_one restart it with its original parameters if it fails?
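To make the question concrete, here is a minimal sketch of the setup I mean. The module name dyn_sup_sketch, the child id {client, ClientId} and the stand-in worker are made up for illustration; the real worker is of course more involved.

```erlang
-module(dyn_sup_sketch).
-behaviour(supervisor).

-export([start_link/0, add_client/1]).
-export([init/1, start_worker/1]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

%% Supervisor callback: one_for_one strategy, no children at startup.
init([]) ->
    SupFlags = #{strategy => one_for_one, intensity => 5, period => 10},
    {ok, {SupFlags, []}}.

%% Dynamically add one transient child, passing its full child spec.
add_client(ClientId) ->
    ChildSpec = #{id => {client, ClientId},          % hypothetical id scheme
                  start => {?MODULE, start_worker, [ClientId]},
                  restart => transient,
                  shutdown => 5000,
                  type => worker,
                  modules => [?MODULE]},
    supervisor:start_child(?MODULE, ChildSpec).

%% Stand-in for the real worker (assumption): a linked process that
%% exits normally on 'stop' and abnormally on 'crash'.
start_worker(ClientId) ->
    {ok, proc_lib:spawn_link(fun() -> worker_loop(ClientId) end)}.

worker_loop(ClientId) ->
    receive
        stop  -> ok;                        % returning ends the process with reason normal
        crash -> exit({crashed, ClientId})  % abnormal exit
    end.
```

Sending crash to the pid returned by add_client/1 is the failure case I am asking about.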

Beyond that, my scenario is the following: 

I have thousands of clients at any time. These may run semi-complex procedures, setting stuff up, changing their centrally managed communication, quitting. 

I personally thought this was exactly the scenario transient was made for: cleaning up behind workers who exit normally (I mean: releasing the child info from the supervisor's data structures) and allowing restart of those who crashed. But in my system even those exiting normally seem to persist in the supervisor's data structures, which (I thought) should not happen according to the very definition of transient, and which will eventually leak memory (and cost performance when walking lists).

Now, I call terminate_child on the supervisor itself for the children in question, and that may be what causes the problem. This may be a bug in my design, which looks like this:

Level A) Supervisor (one_for_one strategy) 
           -- 1 to n relationship --> 
Level B)    Semi-permanent Worker that runs individual procedures in parallel and acts as monitor 
              -- 1 to n relationship --> 
Level C)         Individual short-lived procedure spawned to run one short message sequence that ends with an update of system or client state

Now, I want my Semi-Permanent Worker (B) to be as stateless as possible. It maintains a set of flags that can be reloaded from the DB at any time, and these flags determine which procedures are allowed to start. It starts C procedures (or routes messages to running ones) and monitors them. It acts like a supervisor that needs to know more about its children, because it is designed to start the right ones, so I implemented it as a worker.

The individual C procedures are short, either honoring the OTP principles or simple gen_fsms, walking through one or several steps of message exchange in the predefined protocol with the client. 

This would all be fine and dandy, but it is an individual procedure that handles client shutdown (and hence the need to terminate its "boss" B). Currently I finish it by making a call to the supervisor to terminate the boss child, and then cleanup occurs. This works, but it still requires me to call delete_child to make sure the supervisor's data structures are not full of zombie children.
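Roughly, the cleanup I do today looks like the sketch below. The module name client_cleanup_sketch and the helper remove_client/2 are made up for illustration, and the bare supervisor in the same module is only there so the snippet stands on its own.

```erlang
-module(client_cleanup_sketch).
-behaviour(supervisor).

-export([start_link/0, init/1, remove_client/2]).

%% Bare one_for_one supervisor (level A), only here to make the sketch runnable.
start_link() ->
    supervisor:start_link(?MODULE, []).

init([]) ->
    {ok, {#{strategy => one_for_one}, []}}.

%% Sup: pid or registered name of the supervisor.
%% ChildId: the id from the child spec of the B worker.
%% First stop the child, then remove its spec from the supervisor.
remove_client(Sup, ChildId) ->
    case supervisor:terminate_child(Sup, ChildId) of
        ok ->
            supervisor:delete_child(Sup, ChildId);
        {error, _} = Error ->
            Error
    end.
```

The second call is the part I had hoped transient would make unnecessary.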

Alternatively, the C procedure could signal through its exit reason to B, which monitors it, that this shutdown means taking down B altogether ("no more interaction with this client"). I don't want an error message to pop up because of this (as some EXIT signals seem to produce automatically), since this is supposed to be a regular case. So, maybe exit with shutdown?
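A sketch of that idea, assuming B is a gen_server and that the shutdown procedure exits with the made-up reason {shutdown, client_gone}; boss_sketch and run_procedure/2 are illustrative names, not my real code. As far as I read the gen_server and supervisor documentation, the reasons normal, shutdown and {shutdown, Term} neither produce an error report nor count as abnormal for a transient child.

```erlang
-module(boss_sketch).
-behaviour(gen_server).

-export([start_link/1, run_procedure/2]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2,
         terminate/2, code_change/3]).

start_link(ClientId) ->
    gen_server:start_link(?MODULE, ClientId, []).

%% Start one short-lived C procedure (a fun of arity 0) under B's monitor.
run_procedure(Boss, Fun) when is_function(Fun, 0) ->
    gen_server:call(Boss, {run, Fun}).

init(ClientId) ->
    {ok, #{client => ClientId}}.

handle_call({run, Fun}, _From, State) ->
    {Pid, _Ref} = spawn_monitor(Fun),
    {reply, {ok, Pid}, State}.

handle_cast(_Msg, State) ->
    {noreply, State}.

%% The procedure that handled the client shutdown exits with
%% {shutdown, client_gone} (assumed convention); B then stops itself
%% with reason shutdown.
handle_info({'DOWN', _Ref, process, _Pid, {shutdown, client_gone}}, State) ->
    {stop, shutdown, State};
%% Any other procedure ending, normally or not, leaves B running.
handle_info({'DOWN', _Ref, process, _Pid, _Reason}, State) ->
    {noreply, State};
handle_info(_Other, State) ->
    {noreply, State}.

terminate(_Reason, _State) ->
    ok.

code_change(_OldVsn, State, _Extra) ->
    {ok, State}.
```

B needs no extra state for this; it only pattern matches on the exit reason of the procedure that ended.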

The worst solution to me would be if the boss worker needed to track whether one of the procedures it runs is a client shutdown, and to finish when that procedure finishes. That would imply giving it more state and logic, which I hoped to avoid.

The end result, however, is that I want the supervisor to keep its internal state as clean as possible: not only will there be 1000s of clients in the system at any time, but hopefully this will add up to millions of clients over its uptime, due to the transient nature of these clients.

Now, can this requirement be fulfilled with the existing one_for_one supervisor? Is there any scenario where one_for_one is guaranteed to remove a child from its data structures when that child is configured as transient and exits normally? Or is there no such scenario? Previous emails left me confused about this.

Thank you and cheers,
Oliver