[erlang-questions] Unbalanced Scheduler Problem caused by Using out of date active_no_runq

songlu cai caisonglu@REDACTED
Tue Nov 11 08:03:28 CET 2014


Hi Richard,

Thank you for your help & quick reply!

Following your suggestion, I ran another one-week test with four versions of
Erlang:
V1: R16B03 original, with the collapse problem
V2: R16B03 with my own fix patch
V3: R16B03 with the backport from the github branch you provided
V4: OTP 17 from erlang.org

During the one-week test, V1 collapsed twice, with each collapsed state
lasting nearly two days.
V2, V3, and V4 all behaved the same, staying balanced as normal.
So I will use the R16B03 backport to fix my problem.
Thanks again.

Best Regards,
Songlu Cai

2014-11-05 6:14 GMT+08:00 Rickard Green <rickard@REDACTED>:

> Hi Songlu Cai,
>
> Thanks for your work on this!
>
> Although it is not an error for the NONEMPTY and INACTIVE flags to be
> set at the same time, it should occur rather infrequently. Assuming
> that is the case, your fix should have little effect on the behavior,
> so the root cause should be elsewhere. I've found a potential race
> where the NONEMPTY flag could end up on the run-queue of a waiting
> scheduler, which could cause the problem you are seeing.
>
> When implementing support for balancing on scheduler utilization for
> OTP 17.0, I rewrote the code that sets the NONEMPTY flag. This rewrite
> removes that potential race. I've back-ported the rewrite based on
> R16B03-1. It can be found in the
> rickard/R16B03-1/load_balance/OTP-11385 branch
> <https://github.com/rickard-green/otp/tree/rickard/R16B03-1/load_balance/OTP-11385>
> of my github repo <https://github.com/rickard-green/otp.git>. Please
> try it out and see if it solves your problem (use the same
> configuration as before).
>
> Something similar to your fix should perhaps be introduced in the end
> anyway (one also has to check the SUSPENDED flag, though), since there
> is no good reason to prevent the activation just because the run-queue
> happens to be non-empty. It would, however, be good to confirm that
> the root issue has been fixed before introducing this.
>
> Regards,
> Rickard Green, Erlang/OTP, Ericsson AB
>
> On Mon, Nov 3, 2014 at 4:18 PM, songlu cai <caisonglu@REDACTED> wrote:
> > Hi Scott,
> >
> > Last week I fixed the bug in a simple way, then ran the fixed
> > version alongside the old unbalanced version.
> > The two nodes received the same pressure on the same timeline.
> > The unbalanced (collapsed) state came up several times on the old
> > version, but never showed up on the fixed version.
> > The pressure spread evenly among all 24 schedulers on the fixed
> > version (especially under high pressure).
> > In fact, the fixed version was under higher pressure when the old
> > version ran into the unbalanced state: the old version, left with
> > only 4 schedulers, easily hit its bottleneck at 400% CPU, while at
> > the same time the fixed version was running at 1200% CPU.
> > So I am confident that the root cause of the unbalanced schedulers
> > (scheduler collapse) is "using an out-of-date active_no_runq", just
> > as analyzed before.
> >
> > I just modified the chk_wake_sched function; the diff is below:
> >
> > Index: emulator/beam/erl_process.c
> > ===================================================================
> > --- emulator/beam/erl_process.c (revision 298281)
> > +++ emulator/beam/erl_process.c (working copy)
> > @@ -2694,6 +2694,16 @@
> >         return 0;
> >      wrq = ERTS_RUNQ_IX(ix);
> >      flags = ERTS_RUNQ_FLGS_GET(wrq);
> > +
> > +    if ( activate &&
> > +       (flags & ERTS_RUNQ_FLG_NONEMPTY)  &&
> > +       (flags & ERTS_RUNQ_FLG_INACTIVE)) {
> > +       if (try_inc_no_active_runqs(ix+1))
> > +               (void) ERTS_RUNQ_FLGS_UNSET(wrq, ERTS_RUNQ_FLG_INACTIVE);
> > +       wake_scheduler(wrq, 0);
> > +       return 1;
> > +    }
> > +
> >      if (!(flags & (ERTS_RUNQ_FLG_SUSPENDED|ERTS_RUNQ_FLG_NONEMPTY))) {
> >         if (activate) {
> >             if (try_inc_no_active_runqs(ix+1))
> >
> > It saves the scheduler from the weird state. It is not a perfect
> > fix, but an effective one.
> > Scott, would you please apply this patch to R16B03 and run your test
> > case again?
> > Thank you very much; I look forward to your reply.
> > I will also run it for a week to make sure we have really fixed the
> > problem.
> >
> > Best Regards,
> > Songlu Cai
> >
> > 2014-10-31 10:55 GMT+08:00 songlu cai <caisonglu@REDACTED>:
> >>
> >> Hi Scott,
> >>
> >>
> >>
> >> Thanks for your attention & quick reply.
> >>
> >> It seems that quite a few people suffer from this problem.
> >>
> >>
> >>
> >> Scott> The best workaround is to use "+scl false" and "+sfwi" with
> >> a value of 500 or a bit smaller
> >>
> >> 1. We set +sfwi 500.
> >>
> >> 2. At first we set +scl false, but it caused unbalanced run-queue
> >> lengths among all run queues on R16B03, so we went back to +scl
> >> true (the default); +scl false is not a safe choice on R16B03.
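[Editor's aside: as a concrete illustration of the workaround settings
discussed above, with values taken from the thread; this is a sketch, not
a recommendation for every workload.]

```shell
# Force schedulers to be woken periodically (interval in ms), so a
# collapsed scheduler gets another chance to pick up work:
erl +sfwi 500

# The thread notes that disabling scheduler load compaction (+scl false)
# helped on some releases but caused unbalanced run-queue lengths on
# R16B03, so it was left at the default there:
erl +sfwi 500 +scl false
```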
> >>
> >>
> >>
> >> Our test cmdline:
> >>
> >> /home/Xxx/erts-5.10.4/bin/beam.smp -zdbbl 8192 -sbt db -sbwt very_short
> >> -swt low -sfwi 500 -MBmmsbc 100 -MHmmsbc 100 -MBmmmbc 100 -MHmmmbc 100
> >> -MMscs 20480 -MBsmbcs 10240 -MHsbct 2048 -W w -e 50000 -Q 1000000 -hmbs
> >> 46422 -hms 2586 -P 1000000 -A 16 -K true -d -Bi -sct
> >>
> >> L23T0C0P0N0:L22T1C1P0N0:L21T2C2P0N0:L20T3C3P0N0:L19T4C4P0N0:L18T5C5P0N0:L17T6C0P1N1:L16T7C1P1N1:L15T8C2P1N1:L14T9C3P1N1:L13T10C4P1N1:L12T11C5P1N1:L11T12C0P0N0:L10T13C1P0N0:L9T14C2P0N0:L8T15C3P0N0:L7T16C4P0N0:L6T17C5P0N0:L5T18C0P1N1:L4T19C1P1N1:L3T20C2P1N1:L2T21C3P1N1:L1T22C4P1N1:L0T23C5P1N1
> >> -swct medium -- -root /home/Xxx/dir -progname Xxx -- -home /root
> >> -- -boot /home/Xxx/dir -mode interactive -config /home/Xxx/sys.config
> >> -shutdown_time 30000 -heart -setcookie Xxx -name proxy@REDACTED
> >> -- console
> >>
> >>
> >>
> >> And, apart from everything else, INACTIVE|NONEMPTY is not a normal
> >> run-queue flag state.
> >>
> >> Over the next few days, I will fix the suspected bug my own way,
> >> based on R16B03, and run the test cases again.
> >>
> >>
> >> Best Regards,
> >>
> >> Songlu Cai
> >>
> >>
> >> 2014-10-30 15:53 GMT+08:00 Scott Lystig Fritchie <fritchie@REDACTED
> >:
> >>>
> >>> songlu cai <caisonglu@REDACTED> wrote:
> >>>
> >>> slc> How to fix:
> >>>
> >>> slc> [...]
> >>>
> >>> slc> 3, Or Another Way?
> >>>
> >>> Wow, that's quite a diagnosis.  I'm not a good judge of the race
> >>> condition that you've found or of your fix.  I can provide some
> >>> context, however, in case you weren't aware of it.  It might help
> >>> to create a Real, Final, 100% Correct Fix ... something which does
> >>> not exist right now.
> >>>
> >>> The best workaround is to use "+scl false" and "+sfwi" with a
> >>> value of 500 or a bit smaller.  See the discussion last month
> >>> about it,
> >>>
> >>> http://erlang.org/pipermail/erlang-questions/2014-September/081017.html
> >>>
> >>> My colleague Joe Blomstedt wrote a demo program that can cause
> >>> scheduler collapse to happen pretty quickly.  It might be useful
> >>> for judging how well any fix works ... at Basho we had a terrible
> >>> time trying to reproduce this bug before Joe found a semi-reliable
> >>> trigger.
> >>>
> >>>     https://github.com/basho/nifwait
> >>>
> >>> It is discussed in this email thread (which is broken across two
> >>> URLs, sorry).
> >>>
> >>> http://erlang.org/pipermail/erlang-questions/2012-October/069503.html
> >>>         (first message only, I don't know why)
> >>> http://erlang.org/pipermail/erlang-questions/2012-October/069585.html
> >>>         (the rest of the thread)
> >>>
> >>> If your analysis is correct ... then hopefully this can lead
> >>> quickly to a Real, Final, 100% Correct Fix.  I'm tired of
> >>> diagnosing systems that suffer scheduler collapse, only to
> >>> discover that the customer forgot to add the magic +sfwi and +scl
> >>> flags to their runtime configuration to work around that !@#$!
> >>> bug.
> >>>
> >>> -Scott
> >>
> >>
> >
> >
> > _______________________________________________
> > erlang-questions mailing list
> > erlang-questions@REDACTED
> > http://erlang.org/mailman/listinfo/erlang-questions
> >
>
> --
> Rickard Green, Erlang/OTP, Ericsson AB
>