Designing supervision trees
Bernard Duggan
bernie@REDACTED
Tue May 4 06:23:22 CEST 2010
Hi list,
I'm in the process of going through the design of the supervision
tree for our application and it's rapidly becoming obvious to me that I
could be a lot clearer on how supervision trees are "meant" to be
structured to make a system as fault-tolerant as possible. Let me ask a
couple of concrete questions and maybe it will help:
* Is it kosher to have a supervised process (say a gen_server) start its
own helper process(es) simply using spawn_link()? It seems like it
should be fine - any failure of either will propagate over the link,
causing them both to be shut down and the supervisor will then restart
the main one which in turn will restart the helper. I say it "seems
like it should be fine", but after reading all the supervisor and OTP
docs I could lay my hands on, I'm still not sure whether there is some
good reason to avoid this arrangement.
* Let's say I have two processes, A and B. The state of B is dependent
on the messages it has received from A. The particular example I'm
dealing with is a process (B) which is responsible for starting and
stopping apps and another (A) which is responsible for synchronising
data with a remote store. The apps should be started when we are synced
and stopped if we lose connection with the store. I don't necessarily
want to merge them into one process because A needs to be relatively
responsive to the remote store, but the process of starting/stopping
apps can take some time. There may be a much better way to arrange
this, but I'm not exactly sure what it is...
So we're up and running, remote store is synced, all the apps are
running, and B crashes. I'm trying to figure out the "right" way to
manage recovery - possibilities I can think of:
- Have A and B under a one-for-all supervisor so that we just nuke the
broader state and start it all again (seems like we should be able to
recover with less impact than this).
- Have B's state stored in an ETS table owned by its parent so that it
can recover into its previous state (that seems far too much like
global data to me).
- Have B query A for the current state on startup - that would work,
except that it leaves us with multiple communication methods between A
and B, one where B asks for the state and one where A pushes state
updates - that seems a little redundant (and like extra code to maintain
which nobody wants).
- Get rid of B entirely and perhaps have A spawn temporary processes
to do app start/stop when required - seems a little messy, and prone to
race conditions if we do multiple start/stop operations in quick
succession...
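To make the first question concrete, here is a minimal sketch of the arrangement it describes: a gen_server whose init/1 starts a helper with plain spawn_link(). The module name worker_with_helper and the functions ping_helper/0 and helper_pid/0 are hypothetical, added only so the behaviour can be exercised. The server does not trap exits, so a helper crash takes the server down with the same reason, and a server exit with any reason other than normal (including the supervisor's shutdown) takes the helper down. proc_lib:spawn_link/1 is a drop-in alternative if crash reports for the helper are wanted.

```erlang
-module(worker_with_helper).
-behaviour(gen_server).

%% Hypothetical API, used only to exercise the sketch.
-export([start_link/0, ping_helper/0, helper_pid/0]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

start_link() ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

ping_helper() ->
    gen_server:call(?MODULE, ping_helper).

helper_pid() ->
    gen_server:call(?MODULE, helper_pid).

init([]) ->
    %% No trap_exit here, so the link cuts both ways: a helper crash kills
    %% this server, and this server dying with a non-normal reason kills the
    %% helper. When the supervisor restarts the server, init/1 runs again and
    %% spawns a fresh helper. Note: if the server stops with reason 'normal',
    %% the helper ignores the exit signal and keeps running.
    Helper = spawn_link(fun helper_loop/0),
    {ok, #{helper => Helper}}.

handle_call(ping_helper, _From, S = #{helper := Helper}) ->
    Ref = make_ref(),
    Helper ! {ping, self(), Ref},
    receive
        {pong, Ref} -> {reply, pong, S}
    after 1000 ->
        {reply, timeout, S}
    end;
handle_call(helper_pid, _From, S = #{helper := Helper}) ->
    {reply, Helper, S}.

handle_cast(_Msg, S) ->
    {noreply, S}.

handle_info(_Msg, S) ->
    {noreply, S}.

%% The helper: answers pings until it is killed or its server dies.
helper_loop() ->
    receive
        {ping, From, Ref} ->
            From ! {pong, Ref},
            helper_loop()
    end.
```

Killing the helper with exit(Helper, kill) makes it exit with reason killed, and the linked server then dies with that same reason.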
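For the third recovery option (B asking A for the current state on startup), a minimal sketch, assuming a hypothetical subscribe request whose reply carries A's current state and which also registers B for later {sync_state, Bool} pushes, so the initial query and the ongoing updates share one path. The module name store_sync and the functions set_synced/1, subscribe/0, start_app_ctl/0 and app_ctl_view/1 are all illustrative; B here is a bare process that only records the state instead of starting or stopping apps. Because A handles subscribe and set_synced/1 one at a time, and Erlang preserves message order between any single pair of processes, B cannot receive a push older than the state it was handed.

```erlang
-module(store_sync).
-behaviour(gen_server).

%% Process A (hypothetical names).
-export([start_link/0, set_synced/1, subscribe/0]).
%% Sketch of process B.
-export([start_app_ctl/0, app_ctl_view/1]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

start_link() ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

%% Called when the remote store becomes synced (true) or is lost (false).
set_synced(Bool) when is_boolean(Bool) ->
    gen_server:call(?MODULE, {set_synced, Bool}).

%% Returns A's current state and registers the caller for later pushes.
subscribe() ->
    gen_server:call(?MODULE, subscribe).

init([]) ->
    {ok, #{synced => false, subs => #{}}}.

handle_call(subscribe, {Pid, _Tag}, S = #{synced := Synced, subs := Subs}) ->
    %% Monitor the subscriber so a crashed B is dropped automatically.
    Ref = erlang:monitor(process, Pid),
    {reply, Synced, S#{subs := Subs#{Ref => Pid}}};
handle_call({set_synced, Bool}, _From, S = #{subs := Subs}) ->
    lists:foreach(fun(Pid) -> Pid ! {sync_state, Bool} end, maps:values(Subs)),
    {reply, ok, S#{synced := Bool}}.

handle_cast(_Msg, S) ->
    {noreply, S}.

handle_info({'DOWN', Ref, process, _Pid, _Reason}, S = #{subs := Subs}) ->
    {noreply, S#{subs := maps:remove(Ref, Subs)}};
handle_info(_Other, S) ->
    {noreply, S}.

%% Process B: on (re)start it asks A for the current state, then applies
%% pushed updates. A real B would start or stop apps on each change.
start_app_ctl() ->
    spawn_link(fun() -> app_ctl_loop(subscribe()) end).

%% Asks B which sync state it currently believes in.
app_ctl_view(B) ->
    B ! {view, self()},
    receive
        {view, B, Synced} -> Synced
    after 1000 -> timeout
    end.

app_ctl_loop(Synced) ->
    receive
        {sync_state, New} ->
            app_ctl_loop(New);
        {view, From} ->
            From ! {view, self(), Synced},
            app_ctl_loop(Synced)
    end.
```

In this shape a restarted B recovers A's current state with the same call it always makes at startup, so there is no separate "resync" code path to maintain.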
Any thoughts (and pointers to docs that deal with this stuff) are
appreciated.
As always, thanks if you read this far :)
Cheers,
Bernard
More information about the erlang-questions mailing list