[erlang-questions] supervisors & slow init's

Thu Dec 5 18:57:33 CET 2013

Answers inline.

On 12/05, Sean Cribbs wrote:
> I'd echo Jesper's comments in saying that it is most important to make sure
> the supervisor tree starts up quickly. There are several options I see:
> 

Outside the context of the current case, I don't necessarily agree with
that. The supervision tree can take as long as necessary to start, as
long as it ends up in a stable state. The requirement for speed is
application-specific. If you need to do data syncing for 10 minutes
before starting to boot, I would rather lock up the supervision tree than
implement 12 child applications that need to synchronize on things, if I
can afford it, of course.

To me the most important thing is figuring out what you can or can't do,
and picking the boot and supervision strategy best suited to that.
If booting fast works against the results you want, don't do
it.

Of course, requirements may change as usage grows over time, which means
you may very well end up refactoring towards one way or another as
people make use of the system and decide what they can and can't have.

> 1) Change A into an FSM (optional, but useful IMHO). Have its initial state
> be 'connect_to_hardware' with a timeout of 0 returned from init/1, e.g.
> {ok, connect_to_hardware, State, 0}. Then in 'connect_to_hardware', match
> timeout and do the connection there, then transition to the 'ready' state.
> Note that this state will be entered before any other messages are
> received, meaning that B and C should probably use sync_send_event to
> communicate with A.
> 
> 2) Keep A as a gen_server, but do the same timeout trick in init/1. Have A
> connect in handle_info when receiving 'timeout', and then notify B and C
> that it's ready after.
> 

No objection here; I agree that this is a good approach to
quick boots and whatnot.
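
A minimal sketch of option 2, for reference. The module name hw_server,
the state atoms and connect_to_hardware/0 are made up for illustration;
the sleep stands in for whatever slow connection work A really does.

```erlang
-module(hw_server).
-behaviour(gen_server).

-export([start_link/0, status/1]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

start_link() ->
    gen_server:start_link(?MODULE, [], []).

%% Ask the server whether the (simulated) hardware is connected yet.
status(Pid) ->
    gen_server:call(Pid, status).

init([]) ->
    %% Return right away with a timeout of 0 so the supervisor is not
    %% blocked. A 'timeout' message is delivered if the mailbox is empty.
    {ok, connecting, 0}.

handle_call(status, _From, connecting) ->
    %% A message that arrives before the timeout cancels it, so hand
    %% the 0 timeout back to make sure the connection still happens.
    {reply, connecting, connecting, 0};
handle_call(status, _From, State) ->
    {reply, State, State}.

handle_cast(_Msg, State) ->
    {noreply, State}.

handle_info(timeout, connecting) ->
    ok = connect_to_hardware(),
    {noreply, ready};
handle_info(_Msg, State) ->
    {noreply, State}.

%% Hypothetical stand-in for the slow hardware connection.
connect_to_hardware() ->
    timer:sleep(100),
    ok.
```

One thing to watch for: the timeout only fires if nothing else reaches
the mailbox first, which is why the sketch re-arms it in handle_call.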

> 3) Use Loic's proc_lib:init_ack + gen_server:enter_loop hack instead of the
> regular gen_server/gen_fsm flow. This is less clean, but allows you to do
> those slower blocky things at startup.
> 

I tend to prefer the 'send myself a message' approach in init. I
especially like it because if I do, say, a reconnect on a 'reconnect'
event (or in this case a 'check_for_hardware' message), I can trigger that
event from the init callback and keep using the same
mechanism and code path during regular operations afterwards.
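
Roughly what I mean, as a sketch. The module name hw_checker, the retry
interval and probe_hardware/1 are assumptions made up for the example;
the probe fails once on purpose so the retry path gets exercised.

```erlang
-module(hw_checker).
-behaviour(gen_server).

-export([start_link/0, status/1]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

-define(RETRY_MS, 200).  % assumed retry interval, tune to taste

start_link() ->
    gen_server:start_link(?MODULE, [], []).

status(Pid) ->
    gen_server:call(Pid, status).

init([]) ->
    %% Queue the first check and return immediately; init does no
    %% slow work itself. State is {Status, AttemptsSoFar}.
    self() ! check_for_hardware,
    {ok, {disconnected, 0}}.

handle_call(status, _From, {Status, _} = State) ->
    {reply, Status, State}.

handle_cast(_Msg, State) ->
    {noreply, State}.

%% The same clause serves the first check and every later retry.
handle_info(check_for_hardware, {_Status, Attempts}) ->
    Attempt = Attempts + 1,
    case probe_hardware(Attempt) of
        ok ->
            {noreply, {connected, Attempt}};
        {error, _Reason} ->
            erlang:send_after(?RETRY_MS, self(), check_for_hardware),
            {noreply, {disconnected, Attempt}}
    end;
handle_info(_Msg, State) ->
    {noreply, State}.

%% Hypothetical probe: fails on the first attempt, succeeds after.
probe_hardware(Attempt) when Attempt >= 2 -> ok;
probe_hardware(_Attempt) -> {error, not_found}.
```

If the hardware goes away later, anything that notices can send the
same 'check_for_hardware' message and reuse the code path.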

They're both entirely valid, of course, and for the few years I've been
around in the community, everybody has ended up picking their own
favorite and defending it without yielding. The truth is that they all
work fine and they all come down to aesthetic preferences.
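
For comparison, a sketch of what the init_ack + enter_loop variant from
option 3 could look like. The module name ack_first and slow_setup/0
are hypothetical; this is one way to arrange it, not necessarily how
Loic does it.

```erlang
-module(ack_first).
-behaviour(gen_server).

-export([start_link/0, status/1]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

start_link() ->
    %% proc_lib:start_link/3 blocks only until the child calls init_ack.
    proc_lib:start_link(?MODULE, init, [self()]).

status(Pid) ->
    gen_server:call(Pid, status).

%% Called by proc_lib, not by gen_server: acknowledge the parent first,
%% do the slow work, then become a regular gen_server.
init(Parent) ->
    proc_lib:init_ack(Parent, {ok, self()}),
    State = slow_setup(),
    gen_server:enter_loop(?MODULE, [], State).

handle_call(status, _From, State) ->
    {reply, State, State}.

handle_cast(_Msg, State) ->
    {noreply, State}.

handle_info(_Msg, State) ->
    {noreply, State}.

%% Hypothetical stand-in for the slow, blocking startup work.
slow_setup() ->
    timer:sleep(200),
    ready.
```

Calls made while slow_setup/0 is still running simply wait in the
mailbox until the process enters the loop.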

> I think the moral of the story is that starting up your system and
> implementing a protocol between processes should not be conflated. If
> there's a sequence of steps to be done with potential exit points or
> branches at each step, FSMs plus messages feels the most sane to me.
> 

Agreed.

The opposite rule is that if potential exit points are unforeseeable
and should (according to spec) not happen (say, not being able to open a
UDP port to localhost), then you may want to skip the
protocol design step entirely (all code carries a risk of bugs!). This means
you *may* suffer unexpected failures, in which case your choice will be
either to do whatever it takes to guarantee the preconditions your system
depends on, or to relax them and add the protocols as
you start needing them while you grow and groom your system.

The production systems I work with often end up being a mix of
both approaches.

Things like configuration files, access to the file system (say
for logging purposes), local resources that can be depended on (opening
UDP ports for logs, again), restoring a stable state from disk or
network, and so on, are things I'll readily put into the requirements of a
supervisor and may decide to load synchronously no matter how long it
takes. Some applications may end up with boot times of over 10 minutes
in rare cases, but that's okay because we're possibly syncing
gigabytes that we *need* to work with as a base state if we don't want
to serve incorrect information.
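
As a sketch of the synchronous side: the init below reads a config file
and opens a UDP socket before returning, so the supervisor waits for
both and the child refuses to start if either fails. The module name
log_sink and the udp_port config key are assumptions for the example.

```erlang
-module(log_sink).
-behaviour(gen_server).

-export([start_link/1, port/1]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2,
         terminate/2]).

start_link(ConfigFile) ->
    gen_server:start_link(?MODULE, ConfigFile, []).

%% Returns {ok, PortNumber} for the socket opened at startup.
port(Pid) ->
    gen_server:call(Pid, port).

%% Everything here is a hard precondition: block until it is done,
%% and stop with a descriptive reason if it cannot be done.
init(ConfigFile) ->
    case file:consult(ConfigFile) of
        {ok, Terms} ->
            %% 'udp_port' is an assumed key; 0 lets the OS pick a port.
            UdpPort = proplists:get_value(udp_port, Terms, 0),
            case gen_udp:open(UdpPort) of
                {ok, Socket} -> {ok, Socket};
                {error, Reason} -> {stop, {udp_open_failed, Reason}}
            end;
        {error, Reason} ->
            {stop, {bad_config, Reason}}
    end.

handle_call(port, _From, Socket) ->
    {reply, inet:port(Socket), Socket}.

handle_cast(_Msg, Socket) ->
    {noreply, Socket}.

handle_info(_Msg, Socket) ->
    {noreply, Socket}.

terminate(_Reason, Socket) ->
    gen_udp:close(Socket).
```

Anything started after this child in the tree can then assume the
config was readable and the socket is open.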

On the other hand, code that depends on non-local databases and external
services will get a partial startup with quick tree booting: if the
failure is expected to happen often during regular operations,
then there's no difference between now and later. You gotta handle it
the same way, and for these parts of the system, internal protocols and far
less strict guarantees are the solution.

Regards,
Fred.