[erlang-questions] Supervisor post start update of restart intensity and period

Tue Oct 20 13:33:34 CEST 2015

Hi Torben,

I did wonder about this as a solution, but I'm not terribly keen.

Take the case of 10 sup_10 supervisors with a restart intensity of 10, each
with 10 children. If there are 11 child deaths for children concentrated on
one of those supervisors, it will trigger a sup_10 restart, but if the 11
children that die are distributed across 2 or more sup_10 supervisors, it
won't... The sup_10 restart probably isn't a problem of course, but the
number of total deaths in a period of time that will cause a sup_sup to
restart is now variable, depending on exactly which of the children across
the sup_10 supervisors die.

In fact, in this situation, as few as 11 child deaths could cause a sup_10
death, while as many as 100 child deaths (10 on each sup_10, exactly at the
limit) could leave every sup_10 alive.
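
To make the arithmetic concrete, here is a minimal sketch. The module and
function names are made up for illustration; it only assumes the usual
supervisor rule that a supervisor gives up when the number of restarts in
one period exceeds its intensity.

```erlang
-module(sup10_example).
-export([tripped/2]).

%% Deaths: list of child-death counts, one entry per sup_10, all assumed
%% to fall within a single restart period.
%% Intensity: the restart intensity shared by every sup_10.
%% Returns the 1-based positions of the sup_10 supervisors that would give
%% up. A supervisor gives up only when restarts exceed the intensity.
tripped(Deaths, Intensity) ->
    Indexed = lists:zip(lists:seq(1, length(Deaths)), Deaths),
    [Idx || {Idx, Count} <- Indexed, Count > Intensity].
```

With an intensity of 10, tripped([11,0,0,0,0,0,0,0,0,0], 10) gives [1], the
same 11 deaths split as [6,5,0,0,0,0,0,0,0,0] give [], and
tripped(lists:duplicate(10, 10), 10), which is 100 deaths in total, also
gives [].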

Now, I accept that what you suggest is a pragmatic solution that could work
very well, because statistically the variance in the number of child deaths
needed to cause a sup_sup death is PROBABLY low (for sensible choices of
sup to sup and sup to child ratios), but the non-deterministic,
inconsistent, unpredictable nature just makes me wrinkle my nose a bit.

I say PROBABLY above because it assumes that the distribution of failures
is random, but of course it's at least possible that longer running
children, or children started at a similar time, are grouped on one
supervisor more often than not, so the difficulties I suggest above might
be more realistic than they seem... It depends on the reason for the crash,
which of course we never know.

Other concerns I have are that if the number of children varies by orders
of magnitude, our sup_N might have to have an N that's not too large, but
that means there might be 1000 of them, and which_children/1 becomes quite
a trawl. Also, if you start 101 children with sup_10 supervisors, there
will be one lonely child on a supervisor of its own.
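
For reference, a rough sketch of the placement step I have in mind when I
say "trawl". The module name and the layout are assumptions, not anything
that exists: sup_sup is taken to be a registered simple_one_for_one
supervisor whose children are sup_10 supervisors, each of them
simple_one_for_one over the workers. It ignores races between concurrent
callers.

```erlang
-module(sup_pool).
-export([start_worker/1, pick/2]).

%% Hypothetical limit: at most 10 workers per sup_10.
-define(MAX_PER_SUP, 10).

%% Start one worker under the first sup_10 that has room, starting a
%% fresh sup_10 under sup_sup first if they are all full.
start_worker(Args) when is_list(Args) ->
    Sups = [Pid || {_Id, Pid, supervisor, _Mods}
                       <- supervisor:which_children(sup_sup),
                   is_pid(Pid)],
    %% One count_children/1 call per sup_10: this is the linear trawl.
    Loads = [{Pid, proplists:get_value(active,
                                       supervisor:count_children(Pid))}
             || Pid <- Sups],
    Sup = case pick(Loads, ?MAX_PER_SUP) of
              {ok, Found} -> Found;
              none ->
                  {ok, NewSup} = supervisor:start_child(sup_sup, []),
                  NewSup
          end,
    supervisor:start_child(Sup, Args).

%% Pure selection step: first supervisor with fewer than Max active
%% children, or none if every supervisor is full.
pick([], _Max) -> none;
pick([{Pid, Active} | _], Max) when Active < Max -> {ok, Pid};
pick([_ | Rest], Max) -> pick(Rest, Max).
```

Every start walks the whole list of sup_10 supervisors, so with 1000 of
them that is up to 1000 calls before a single child is started, and the
101st child ends up alone on the 11th supervisor.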

I could write a relatively small supervisor that fits my use case and
requirements exactly (probably easier, too, than trying to make supervisor
work for me as it is). However, because I realised I was now coming across
this issue for the second time, I thought it was worth checking if anyone
else was interested, or if I'm just weird...

Michael.

On Tue, Oct 20, 2015 at 11:04 AM, Torben Hoffmann <thoffmann@REDACTED>
wrote:

> Hi Michael,
>
> Before diving into changes to the supervisor module there might be a
> quicker fix that can give you what you want.
>
> Say that you have a case where 10 children with a restart intensity of 10
> is fine. So your sup_10 supervisor fits 10 mod_a with that configuration.
>
> Now you create a sup_sup supervisor that supervises your sup_10
> supervisors.
>
> Before you start a new mod_a worker you determine if you need to start
> another sup_10 supervisor. Then you start the mod_a as a child of the
> appropriate sup_10 supervisor.
>
> It requires a bit of interrogation of the supervision tree under sup_sup
> (using which_children/1) before starting. But I would say that it beats
> forking supervisor.
>
> I haven't done the math to see if this two level solution would give you
> adequate control over the restart intensity... something for the
> interested reader ;-)
>
> Cheers,
> Torben
>
> Michael Wright writes:
>
> > Does anyone have any interest, approval or disapproval in respect of the
> > idea of adding capability to update the restart intensity of a supervisor
> > after start?
> >
> > Currently the only way to change it after start is by way of a release
> > change.
> >
> > My reason for the proposal is to optimise the case of a simple_one_for_one
> > supervisor where:
> >
> >     1. The likely number of children could vary a lot (perhaps by orders
> >        of magnitude).
> >     2. The children are homogeneous and the criticality of the service
> >        they collectively provide is shared across all of them.
> >     3. The probability of abnormal termination of any one child is
> >        relatively constant (not lessened or known or expected to be
> >        lessened by more children being spawned).
> >
> > So for the case of a simple_one_for_one supervisor with 10 children, a
> > restart intensity 10 might be appropriate, but for the same supervisor
> > with 10,000 children it might need to be 1,000, or 10,000.
> >
> > In some cases the likely maximum number of children might be known at
> > supervisor start time, but not always, and even then if it varies a lot
> > it probably doesn't help.
> >
> > I can't be certain how in demand this feature would be, but I've realised
> > I've needed it before, and compromised by setting the restart intensity
> > high to avoid unnecessary tear down of software infrastructure. It's
> > obviously not ideal though as it could lead to outage or service
> > degradation while a relatively small number of children churn their way
> > to an inappropriately large restart intensity.
> >
> > One could have a dynamic intensity value, {ch_multiple, N} say, making
> > it N times the number of children, but I slightly worry someone will
> > later want {sqrt_ch_mul_ln_moonphase, U, G, H} and then one may as well
> > allow {M, F, A} or add a new callback. However, really I think an API
> > call is probably the most sensible way forward:
> >
> >     supervisor:update_supflags/3    (SupRef, intensity | period, NewValue)
> >
> > I prefer this to passing a map since the above is more explicit that not
> > all the supflags are alterable.
> >
> > An API call is simple and low impact, and the only disadvantage is it
> > offers to do nothing clever, making the callback module perform all the
> > management, even if it means calling it every time a new child is
> > spawned.
> >
> > Michael.
>
> --
> Torben Hoffmann
> Architect, basho.com
> M: +45 25 14 05 38
>