[erlang-questions] Designing supervision trees

Tue May 4 14:27:14 CEST 2010

Hi Bernard,

On Mon, May 3, 2010 at 11:23 PM, Bernard Duggan <bernie@REDACTED> wrote:
> Hi list,
>    I'm in the process of going through the design of the supervision tree
> for our application and it's rapidly becoming obvious to me that I could be
> a lot clearer on how supervision trees are "meant" to be structured to make
> a system as fault-tolerant as possible.  Let me ask a couple of concrete
> questions and maybe it will help:
>
> * Is it kosher to have a supervised process (say a gen_server) start its own
> helper process(es) simply using spawn_link()?  It seems like it should be
> fine - any failure of either will propagate over the link, causing them both
> to be shut down and the supervisor will then restart the main one which in
> turn will restart the helper.  I say it "seems like it should be fine", but
> after reading all the supervisor and OTP docs I could lay my hands on I'm
> not really sure if there isn't some good reason to avoid this arrangement.

I think you'll find different opinions here. I prefer to use
supervisors to start and manage processes. If a process needs to
"spawn" something, it might make sense to call supervisor:start_child/2
on an appropriate supervisor. You can create a custom add_xxx_child
function on your supervisor module to act as a factory function.

This lets you fire-and-forget (in particular, using
simple_one_for_one, which can remove children when they terminate) and
avoids messing around with trap_exit in your gen_servers.
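
Here's a minimal sketch of that factory idea. The module name
helper_sup and the functions start_helper/1 and start_worker/1 are
made up for illustration, and the worker is a bare proc_lib process
rather than a real gen_server so that the example fits in one module.
The map-style supervisor flags assume OTP 18 or later.

```erlang
-module(helper_sup).
-behaviour(supervisor).

-export([start_link/0, start_helper/1, start_worker/1]).
-export([init/1]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

%% Factory function (hypothetical name): callers ask the supervisor
%% for a helper instead of calling spawn_link/1 themselves.
start_helper(Fun) when is_function(Fun, 0) ->
    supervisor:start_child(?MODULE, [Fun]).

%% Child start function. A real system would call something like
%% gen_server:start_link/3 here; this stand-in just runs Fun.
start_worker(Fun) ->
    {ok, proc_lib:spawn_link(Fun)}.

init([]) ->
    SupFlags = #{strategy => simple_one_for_one,
                 intensity => 5,
                 period => 10},
    %% temporary: never restarted, and dropped from the supervisor
    %% once the process terminates.
    Child = #{id => helper,
              start => {?MODULE, start_worker, []},
              restart => temporary,
              type => worker},
    {ok, {SupFlags, [Child]}}.
```

A gen_server that needs a helper would then call
helper_sup:start_helper(Fun) and not have to link to or trap exits
from the helper itself.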

> * Let's say I have two processes, A and B.  The state of B is dependent on
> the messages it has received from A.  The particular example I'm dealing
> with is a process (B) who is responsible for starting and stopping apps and
> another (A) which is responsible for synchronising data with a remote store.
>  The apps should be started when we are synced and stopped if we lose
> connection with the store.  I don't necessarily want to merge them into one
> process because A needs to be relatively responsive to the remote store, but
> the process of starting/stopping apps can take some time.  There may be a
> much better way to arrange this, but I'm not exactly sure what it is...
> So we're up and running, remote store is synced, all the apps are running,
> and B crashes.  I'm trying to figure out the "right" way to manage recovery
> - possibilities I can think of:
> - Have A and B under a one-for-all supervisor so that we just nuke the
> broader state and start it all again (seems like we should be able to
> recover with less impact than this).

This would be my starting point. If you have efficiency concerns about
tearing everything down and rebuilding it, consider breaking what's
expensive *and* independent into a separate supervisory hierarchy --
or its own app.
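
As a rough sketch of that starting point, assuming hypothetical names
throughout (sync_sup, store_sync for A, app_controller for B), a
one_for_all supervisor could look like the following. The two workers
are stand-in processes that just sit in a receive loop, so the module
is self-contained; real code would start gen_servers instead. The map
syntax assumes OTP 18 or later.

```erlang
-module(sync_sup).
-behaviour(supervisor).

-export([start_link/0, start_worker/1]).
-export([init/1]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

%% Stand-in worker: registers under Name and idles until it is killed.
start_worker(Name) ->
    Pid = proc_lib:spawn_link(fun idle/0),
    true = register(Name, Pid),
    {ok, Pid}.

idle() ->
    receive _ -> idle() end.

init([]) ->
    %% one_for_all: if either child dies, the other is terminated and
    %% both are started again, in the order listed.
    SupFlags = #{strategy => one_for_all,
                 intensity => 3,
                 period => 60},
    Children = [#{id => store_sync,
                  start => {?MODULE, start_worker, [store_sync]}},
                #{id => app_controller,
                  start => {?MODULE, start_worker, [app_controller]}}],
    {ok, {SupFlags, Children}}.
```

Killing app_controller here takes store_sync down with it and both
come back with fresh state, which is the "nuke and restart" behaviour
described above.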

> - Have B's state stored in an ETS table owned by its parent so that it can
> recover into its previous state (that seems far too much like global data
> to me).

If it's costly to grab the external state and you need the process
state to recover, this would make sense, I think. Though dets or
mnesia might be a better option for recovery. This might be that
independent component that you move outside your A/B hierarchy that
would survive crashes in those processes.
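
To make the ETS variant concrete, here is a small sketch under my own
assumptions: state_keeper, save/2, restore/1 and the table name
app_state_cache are all invented for the example. A dedicated process
owns a public named table and does nothing else, so it has very
little reason to crash; B would call save/2 whenever its state
changes and restore/1 in its init. The same shape would apply with
dets or mnesia behind the two functions.

```erlang
-module(state_keeper).

-export([start_link/0, init/1, save/2, restore/1]).

-define(TABLE, app_state_cache).

%% Start the table owner and wait until the table exists.
start_link() ->
    proc_lib:start_link(?MODULE, init, [self()]).

init(Parent) ->
    %% public + named_table: any process may read and write by name,
    %% but the table lives only as long as this owner process does.
    ?TABLE = ets:new(?TABLE, [named_table, public, set]),
    proc_lib:init_ack(Parent, {ok, self()}),
    hold().

%% The owner ignores all messages; its only job is to stay alive.
hold() ->
    receive _ -> hold() end.

save(Key, Value) ->
    true = ets:insert(?TABLE, {Key, Value}),
    ok.

restore(Key) ->
    case ets:lookup(?TABLE, Key) of
        [{Key, Value}] -> {ok, Value};
        [] -> not_found
    end.
```

If this owner is started under a supervisor outside the A/B
hierarchy, a restart of A and B leaves the table and its contents in
place.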

> - Have B query A for the current state on startup - that would work, except
> that it leaves us with multiple communication methods between A and B, one
> where B asks for the state and one where A pushes state updates - that seems
> a little redundant (and like extra code to maintain which nobody wants).
> - Get rid of B entirely and perhaps have A spawn a temporary process to do
> app start/stop when required - seems a little messy, and prone to race
> conditions if we do multiple start/stop operations in quick succession...

Yes, that's why the first option (total restart) makes more sense,
IMO. You don't need to get into the game of trying to restore your
state piecemeal. You just trash it all and start from scratch.

> Any thoughts (and pointers to docs that deal with this stuff) are
> appreciated.
>
> As always, thanks if you read this far :)
>
> Cheers,
>
> Bernard

Garrett