[erlang-questions] simple_one_for_one supervisor - what happens at restart? (also: gen_tcp)

Tue May 17 01:49:18 CEST 2016

On 05/16, Chandru wrote:
>
>No, it's not. The reason a terminate callback is provided in a gen_server
>is so that a process can clean up when it terminates, not to delegate it to
>other processes.
>

I'm gonna side with Loïc here. The terminate callback is good for any 
process-local cleanup or optimistic work, but is by no means a safe way 
to terminate anything.

For example, if you have many children to terminate and through some 
interleaving brutal_kill is triggered (or anyone calls exit(Pid, 
kill)), whatever work you wanted to do in terminate will be skipped, 
because kill is an exit signal that cannot be trapped.
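
A minimal sketch of that difference, under some assumptions: the module 
name terminate_demo, the run/0 helper and the cleaned_up message are all 
made up for illustration, and the calling process stands in for the 
supervisor by linking to the server and trapping exits itself.

```erlang
-module(terminate_demo).
-behaviour(gen_server).
-export([run/0]).
-export([init/1, handle_call/3, handle_cast/2, terminate/2]).

%% Hypothetical helper: returns {SeenOnShutdown, SeenOnKill}, i.e. whether
%% the cleanup message from terminate/2 arrived in each case.
run() ->
    {stop_with(shutdown), stop_with(kill)}.

stop_with(Reason) ->
    %% Trap exits so the linked server's death does not take us down too.
    Old = process_flag(trap_exit, true),
    {ok, Pid} = gen_server:start_link(?MODULE, self(), []),
    exit(Pid, Reason),
    receive {'EXIT', Pid, _} -> ok end,
    %% Anything the server sent us before dying is ordered before its
    %% exit signal, so it is already in the mailbox at this point.
    Seen = receive {cleaned_up, Pid} -> true after 0 -> false end,
    process_flag(trap_exit, Old),
    Seen.

init(Owner) ->
    %% Without trap_exit, terminate/2 would not run on shutdown either.
    process_flag(trap_exit, true),
    {ok, Owner}.

handle_call(_Req, _From, Owner) -> {reply, ok, Owner}.
handle_cast(_Msg, Owner) -> {noreply, Owner}.

%% The "cleanup" here is just a message to the owner (illustrative only).
terminate(_Reason, Owner) ->
    Owner ! {cleaned_up, self()},
    ok.
```

Calling terminate_demo:run() should give {true, false}: the shutdown 
signal from the parent goes through terminate/2, the kill does not.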

Using terminate as your sole termination cleanup is risky. It is better 
to assume that it will not be called every time, only in controlled 
terminations and some accidental ones. This matters most for resources 
that are not reclaimed automatically: not ports or ETS tables, but live 
dependencies such as other processes that are mid-conversation.

The other side has to be able to cope with the termination of its peer; 
this can be done through monitors, sometimes through link+trap_exit. If 
recovery is not possible, just dying is appropriate.
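
As a rough sketch of the monitor approach (peer_watch and run/0 are 
hypothetical names, and the peer is a stand-in process that does 
nothing), the surviving side learns about the death without any 
cooperation from the peer:

```erlang
-module(peer_watch).
-export([run/0]).

%% Hypothetical helper: returns how the surviving side saw its peer end.
run() ->
    Peer = spawn(fun() -> receive stop -> ok end end),
    Ref = monitor(process, Peer),
    %% The peer gets no chance to run any cleanup or say goodbye.
    exit(Peer, kill),
    receive
        {'DOWN', Ref, process, Peer, Reason} ->
            %% This is where the survivor would recover, or give up and
            %% exit itself if recovery is not possible.
            {peer_down, Reason}
    after 5000 ->
        timeout
    end.
```

Here peer_watch:run() returns {peer_down, killed}.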

>
>No, it's not. From the manual:
>
>The supervisor is responsible for starting, stopping and monitoring its
>child processes. The basic idea of a supervisor is that it shall keep its
>child processes alive by restarting them when necessary.
>

In practice, the release handling mechanisms will make use of that 
supervision structure to walk the tree: that's why you declare whether a 
supervisor's children are workers or supervisors (leaf or inner node!)

The tree is being walked the entire way through.
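
To give an idea of what such a walk looks like, here is a small sketch 
built on supervisor:which_children/1. It is not the release handler's 
actual code, and tree_walk and workers/1 are names I made up; it only 
shows how the declared child type tells you whether to recurse.

```erlang
-module(tree_walk).
-export([workers/1]).

%% Hypothetical helper: collects the pids of all live workers under Sup,
%% descending into every child declared as a supervisor.
workers(Sup) ->
    lists:flatmap(fun child_workers/1, supervisor:which_children(Sup)).

%% Inner node: recurse into the sub-supervisor.
child_workers({_Id, Pid, supervisor, _Mods}) when is_pid(Pid) ->
    workers(Pid);
%% Leaf node: a running worker.
child_workers({_Id, Pid, worker, _Mods}) when is_pid(Pid) ->
    [Pid];
%% Child is 'undefined' (not running) or 'restarting'.
child_workers(_NotRunning) ->
    [].
```

For instance, tree_walk:workers(kernel_sup) lists the worker pids under 
the kernel application's top supervisor.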

That being said, I personally try to avoid calling the supervisor to 
know who its children are and prefer named nodes. For me the supervisor 
is first and foremost a definition of a unit of failure, of dependencies 
between workers or subtrees.

>
>Look carefully at the example I provided in the gist and Oliver's use case.
>It is perfectly sound advice. If you are ever walking your supervisor
>hierarchy to do something with your application, you are doing it wrong.
>

See release upgrades; if you need to walk your entire system at once, 
doing it through supervisors is not a bad idea.

Funnily enough, the supervision structure isn't all that is being 
trusted though. When an app is shut down, the application controller (or 
is it the master?) also runs through all of the processes on the node, 
looks for those for which it is the group leader, and force-kills 
them, which prevents the terminate function from being called.
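
The lookup part of that can be approximated as below. This is only an 
illustration of the idea, not OTP's implementation; gl_scan and led_by/1 
are hypothetical names.

```erlang
-module(gl_scan).
-export([led_by/1]).

%% Hypothetical helper: lists every local process whose group leader is GL.
%% process_info/2 returns 'undefined' for a process that has already died,
%% which simply fails the comparison.
led_by(GL) ->
    [P || P <- processes(),
          process_info(P, group_leader) =:= {group_leader, GL}].
```

Any process spawned from your shell shows up in 
gl_scan:led_by(group_leader()), since spawned processes inherit the 
group leader of their parent.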

Regards,
Fred.