[erlang-questions] simple_one_for_one supervisor - what happens at restart? (also: gen_tcp)

Sat May 14 17:11:14 CEST 2016

On Sat, May 14, 2016 at 5:21 AM, Oliver Korpilla <Oliver.Korpilla@REDACTED> wrote:
> Hello,
>
> and thank you all for your responses.
>
> I originally adopted simple_one_for_one supervisor because I had a problem with how other supervisors clean up processes.
>
> For the TCP connectors simple_one_for_one will be fine. As noted by others, they cannot really come back unless they reconnect, so that is fine. So, a simple_one_for_one supervisor acts like every child, regardless of child spec, as if it was temporary?
>
> I have another big batch of processes independent of the connectors. These serve individual requests emanating from the TCP layer, where an ID establishes which handler belongs to which batch of messages (i.e. each TCP payload contains an ID in its own proprietary header). Now, I originally saw these as transient workers I would like to have restarted, but since they are stateless and can be created on demand, I either can supervise them simple_one_for_one (and create them on demand when the one for a given ID is missing) or I can create them as transient children under a one_for_one and let that restart it on a crash.

If these processes only ever act on behalf of the TCP connection,
consider not using them at all. Just let the TCP connections do the
work.

Processes should correspond to _real world_ independent threads of
execution, not mental abstractions.

If you do have separate threads of execution (e.g. the TCP connection
is providing updates to the client while it waits on these spawned
workers), use a separate simple_one_for_one (sofo) supervisor for the
workers and link each connection process to its workers.
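
A minimal sketch of that arrangement, assuming Erlang/OTP 18 or later
(map-based supervisor flags and child specs). The module name
worker_sup, the start_worker/1 helper and the placeholder worker are
hypothetical stand-ins, not anything from the original poster's code:

```erlang
-module(worker_sup).
-behaviour(supervisor).

-export([start_link/0, start_worker/1, worker_init/2]).
-export([init/1]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

%% Called from a connection process. Id is the request ID taken from
%% the payload header. The worker is linked to the caller, so a crash
%% on either side takes the other down too (unless it traps exits).
start_worker(Id) ->
    case supervisor:start_child(?MODULE, [Id, self()]) of
        {ok, Pid} ->
            link(Pid),
            {ok, Pid};
        {error, _} = Error ->
            Error
    end.

init([]) ->
    SupFlags = #{strategy => simple_one_for_one,
                 intensity => 0,
                 period => 1},
    %% temporary: never restarted by the supervisor; the connection
    %% asks for a new worker when it needs one.
    ChildSpec = #{id => worker,
                  start => {?MODULE, worker_init, []},
                  restart => temporary,
                  shutdown => brutal_kill,
                  type => worker},
    {ok, {SupFlags, [ChildSpec]}}.

%% Placeholder worker (assumption): echoes one request back to the
%% connection process and then exits normally.
worker_init(Id, Conn) ->
    Pid = spawn_link(fun() ->
                         receive
                             {request, Payload} ->
                                 Conn ! {reply, Id, Payload}
                         end
                     end),
    {ok, Pid}.
```

A connection process would call worker_sup:start_worker(Id) the first
time it sees a given ID. Because the children are temporary, nothing
comes back on its own after a crash, which matches the "just crash"
default described further down.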

> I originally went for simple_one_for_one because of the better performance and because it cleans up children after they terminate. I guess in case of one_for_one I have to clean up all children which shut down normally by calling terminate_child and delete_child on them. (I originally hoped one_for_one would do this if a child exited normally, but either I bungled my tests or it simply doesn't, even for transient children).

If you're ever routinely "cleaning up" after a supervisor, it's a bad
sign. Configure your supervisors once (one-time init payload) and let
them do their thing. If you're accumulating a lot of terminated child
processes, you want a sofo supervisor.
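
To illustrate the accumulation, here is a small sketch, again
assuming OTP 18 or later. The module accum_demo and its run_once/0
start function are made up for this example. It starts a transient
child under a one_for_one supervisor, lets it exit normally, and shows
that the child spec stays registered until it is deleted by hand:

```erlang
-module(accum_demo).
-behaviour(supervisor).

-export([demo/0, run_once/0]).
-export([init/1]).

init([]) ->
    {ok, {#{strategy => one_for_one, intensity => 1, period => 5}, []}}.

%% Child start function: a process that finishes immediately and
%% exits with reason normal.
run_once() ->
    {ok, spawn_link(fun() -> ok end)}.

demo() ->
    {ok, Sup} = supervisor:start_link(?MODULE, []),
    Spec = #{id => job1,
             start => {?MODULE, run_once, []},
             restart => transient},
    {ok, _Pid} = supervisor:start_child(Sup, Spec),
    ok = wait_until_stopped(Sup, job1),
    %% The process is gone, but its spec is still registered, so
    %% starting the same id again is refused.
    Again = supervisor:start_child(Sup, Spec),
    %% Only an explicit delete removes the leftover spec.
    ok = supervisor:delete_child(Sup, job1),
    {Again, supervisor:which_children(Sup)}.

%% Poll until the supervisor has processed the child's exit.
wait_until_stopped(Sup, Id) ->
    case lists:keyfind(Id, 1, supervisor:which_children(Sup)) of
        {Id, undefined, _Type, _Mods} ->
            ok;
        _ ->
            timer:sleep(10),
            wait_until_stopped(Sup, Id)
    end.
```

Calling accum_demo:demo() should return
{{error,already_present},[]}. With thousands of short-lived children
that bookkeeping adds up, which is the case a sofo supervisor is
meant for.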

> Any recommendations?

It sounds like you're motivated to get a "restart" scenario here. What
is your goal from the end-user (client of your app) point of view?
Without a specific goal that you understand and can defend, I think
your default approach should always be to crash and let the client
reestablish the connection.

Some worthy goals:

- Don't abruptly close the connection but return a well formed error
(e.g. HTTP 500, etc.)
- Handle specific well understood error conditions with limited
retries (e.g. reconnect to a database with the hope the outage is
short term)
- Tell the client to retry a different end-point (e.g. HTTP 302)

Each of these goals needs to be implemented explicitly - you're not
going to get any of them from a supervisor restarting a process. Short
of a worthy goal, just crash, maintaining your system integrity for
processing new connections, and rely on the client (outside your
system) to perform the "restart".
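
As an example of implementing one of these goals by hand, the
"limited retries" case could be sketched roughly as below. The module
retry_sketch, the function with_retries/4 and the gave_up exit reason
are invented for illustration. The idea is to retry only the error
reasons you have decided are well understood, and to crash on
everything else:

```erlang
-module(retry_sketch).
-export([with_retries/4]).

%% Fun        - zero-arity fun returning {ok, Result} | {error, Reason}
%% Retryable  - list of Reasons considered transient and worth retrying
%% Attempts   - total number of tries, including the first one
%% DelayMs    - fixed pause between tries, in milliseconds
with_retries(Fun, Retryable, Attempts, DelayMs)
  when is_function(Fun, 0), is_integer(Attempts), Attempts >= 1 ->
    case Fun() of
        {ok, _} = Ok ->
            Ok;
        {error, Reason} ->
            case Attempts > 1 andalso lists:member(Reason, Retryable) of
                true ->
                    timer:sleep(DelayMs),
                    with_retries(Fun, Retryable, Attempts - 1, DelayMs);
                false ->
                    %% Unknown error or retries used up: crash and
                    %% let the client reconnect.
                    exit({gave_up, Reason})
            end
    end.
```

A connection process might use it like this (the host name and port
are placeholders):

```erlang
Connect = fun() -> gen_tcp:connect("db.example", 5432, [binary], 1000) end,
{ok, Socket} = retry_sketch:with_retries(Connect, [econnrefused, timeout], 3, 500).
```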