Designing supervision trees

Tue May 4 06:23:22 CEST 2010

Hi list,
     I'm in the process of going through the design of the supervision 
tree for our application and it's rapidly becoming obvious to me that I 
could be a lot clearer on how supervision trees are "meant" to be 
structured to make a system as fault-tolerant as possible.  Let me ask a 
couple of concrete questions and maybe it will help:

* Is it kosher to have a supervised process (say a gen_server) start its 
own helper process(es) simply using spawn_link()?  It seems like it 
should be fine - any failure of either will propagate over the link, 
causing them both to be shut down and the supervisor will then restart 
the main one which in turn will restart the helper.  I say it "seems 
like it should be fine", but after reading all the supervisor and OTP 
docs I could lay my hands on I'm not really sure if there isn't some 
good reason to avoid this arrangement.
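
To make the question concrete, here is a minimal sketch of the arrangement I mean. The module name main_server and the helper loop are made up for illustration, and the gen_server deliberately does not trap exits:

```erlang
-module(main_server).
-behaviour(gen_server).

-export([start_link/0]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2,
         terminate/2, code_change/3]).

%% Hypothetical supervised worker; a supervisor would call this.
start_link() ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

init([]) ->
    %% The helper is linked to this gen_server. Neither process traps
    %% exits, so an abnormal exit of one takes the other down as well.
    %% An exit with reason 'normal' does not propagate over the link.
    Helper = spawn_link(fun helper_loop/0),
    {ok, Helper}.

%% Expose the helper pid so the arrangement can be inspected.
handle_call(helper, _From, Helper) ->
    {reply, Helper, Helper};
handle_call(_Request, _From, Helper) ->
    {reply, ok, Helper}.

handle_cast(_Msg, Helper) ->
    {noreply, Helper}.

handle_info(_Info, Helper) ->
    {noreply, Helper}.

terminate(_Reason, _Helper) ->
    ok.

code_change(_OldVsn, Helper, _Extra) ->
    {ok, Helper}.

%% Placeholder helper: discards every message until told to stop.
helper_loop() ->
    receive
        stop -> ok;
        _Other -> helper_loop()
    end.
```

With this in place, killing the helper (for example with exit(Helper, kill)) also brings down main_server, and the supervisor restart of main_server runs init/1 again and so creates a fresh helper.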

* Let's say I have two processes, A and B.  The state of B is dependent 
on the messages it has received from A.  The particular example I'm 
dealing with is a process (B) which is responsible for starting and 
stopping apps and another (A) which is responsible for synchronising 
data with a remote store.  The apps should be started when we are synced 
and stopped if we lose connection with the store.  I don't necessarily 
want to merge them into one process because A needs to be relatively 
responsive to the remote store, but the process of starting/stopping 
apps can take some time.  There may be a much better way to arrange 
this, but I'm not exactly sure what it is...
So we're up and running, remote store is synced, all the apps are 
running, and B crashes.  I'm trying to figure out the "right" way to 
manage recovery - possibilities I can think of:
- Have A and B under a one-for-all supervisor so that we just nuke the 
broader state and start it all again (seems like we should be able to 
recover with less impact than this).
- Have B's state stored in an ETS table owned by its parent so that it 
can recover into its previous state (that seems far too much like 
global data to me).
- Have B query A for the current state on startup - that would work, 
except that it leaves us with multiple communication methods between A 
and B, one where B asks for the state and one where A pushes state 
updates - that seems a little redundant (and like extra code to maintain 
which nobody wants).
- Get rid of B entirely and perhaps have A spawn a temporary process 
to do app start/stop when required - seems a little messy, and prone to 
race conditions if we do multiple start/stop operations in quick 
succession...
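
For reference, the first option (A and B under a one-for-all supervisor) would look roughly like the sketch below. The names sync_sup, store_sync (standing in for A) and app_manager (standing in for B) are made up, and the restart limits and shutdown timeouts are arbitrary values picked for the example:

```erlang
-module(sync_sup).
-behaviour(supervisor).

-export([start_link/0, init/1]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

init([]) ->
    %% one_for_all: if either child terminates, the other is terminated
    %% too and both are restarted in the order listed (A, then B).
    %% Allow at most 3 restarts within 10 seconds (arbitrary limits).
    RestartStrategy = {one_for_all, 3, 10},
    %% store_sync and app_manager are hypothetical modules for A and B.
    A = {store_sync, {store_sync, start_link, []},
         permanent, 5000, worker, [store_sync]},
    B = {app_manager, {app_manager, start_link, []},
         permanent, 5000, worker, [app_manager]},
    {ok, {RestartStrategy, [A, B]}}.
```

Because A is listed first, it is always started before B, so after a restart B comes up against a freshly started A.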

Any thoughts (and pointers to docs that deal with this stuff) are 
appreciated.

As always, thanks if you read this far :)

Cheers,

Bernard