[erlang-questions] Unbalanced Scheduler Problem caused by Using out of date active_no_runq

Mon Nov 3 16:18:39 CET 2014

Hi Scott,

Last week I fixed the bug in a simple way, then ran the fixed version
side by side with the old, unbalanced version.
The two nodes were under the same load over the same period.
The unbalanced (collapsed) state came up several times on the old version,
but never showed up on the fixed version.
On the fixed version the load spreads evenly across all 24 schedulers
(especially under high load).
In fact, the fixed version carries a higher load while the old version is
in the unbalanced state: the old version is then running on only 4
schedulers and quickly hits a bottleneck at 400% CPU, while the fixed
version is at 1200% CPU at the same time.
So I can be sure that the root cause of the unbalanced schedulers (scheduler
collapse) is "using out of date active_no_runq", just as analyzed before.

I only modified the chk_wake_sched function; the diff is below:

Index: emulator/beam/erl_process.c
===================================================================

--- emulator/beam/erl_process.c (revision 298281)
+++ emulator/beam/erl_process.c (working copy)
@@ -2694,6 +2694,16 @@
        return 0;
     wrq = ERTS_RUNQ_IX(ix);
     flags = ERTS_RUNQ_FLGS_GET(wrq);
+
+    if ( activate &&
+       (flags & ERTS_RUNQ_FLG_NONEMPTY)  &&
+       (flags & ERTS_RUNQ_FLG_INACTIVE)) {
+       if (try_inc_no_active_runqs(ix+1))
+               (void) ERTS_RUNQ_FLGS_UNSET(wrq, ERTS_RUNQ_FLG_INACTIVE);
+       wake_scheduler(wrq, 0);
+       return 1;
+    }
+
     if (!(flags & (ERTS_RUNQ_FLG_SUSPENDED|ERTS_RUNQ_FLG_NONEMPTY))) {
        if (activate) {
            if (try_inc_no_active_runqs(ix+1))

It rescues the scheduler from the weird state. It is not a perfect fix, but
it is an effective one.
Scott, would you please apply this patch to R16B03 and run your test case
again?
Thank you very much; I look forward to your reply.
I will also keep it running for a week to make sure the problem really is
fixed.
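
As a rough illustration of what the added branch does, the stand-alone C
model below mimics the flag test in chk_wake_sched. It is only a sketch, not
ERTS code: the flag values, the model_runq struct and model_chk_wake are
made up for the example, and try_inc_no_active_runqs is assumed to always
succeed.

```c
/* Hypothetical flag values; the real ones are defined inside ERTS. */
#define RUNQ_FLG_NONEMPTY  (1u << 0)
#define RUNQ_FLG_INACTIVE  (1u << 1)
#define RUNQ_FLG_SUSPENDED (1u << 2)

/* Simplified stand-in for a run queue: only its flags and a wake marker. */
typedef struct {
    unsigned flags;
    int woken;
} model_runq;

/*
 * Model of the flag test in chk_wake_sched.
 * patched == 0: only the original test is applied.
 * patched != 0: the extra NONEMPTY && INACTIVE branch from the diff runs first.
 * Returns 1 if the run queue was woken, 0 otherwise.
 */
static int model_chk_wake(model_runq *rq, int activate, int patched)
{
    unsigned flags = rq->flags;

    /* Added branch: a queue that has work but is still marked inactive. */
    if (patched && activate
        && (flags & RUNQ_FLG_NONEMPTY)
        && (flags & RUNQ_FLG_INACTIVE)) {
        /* Assumption: try_inc_no_active_runqs() succeeded. */
        rq->flags &= ~RUNQ_FLG_INACTIVE;
        rq->woken = 1;
        return 1;
    }

    /* Original test: only an empty, non-suspended queue is woken. */
    if (!(flags & (RUNQ_FLG_SUSPENDED | RUNQ_FLG_NONEMPTY))) {
        if (activate)
            rq->flags &= ~RUNQ_FLG_INACTIVE; /* same assumption as above */
        rq->woken = 1;
        return 1;
    }
    return 0;
}
```

With flags set to NONEMPTY|INACTIVE and activate set, the unpatched path
(patched = 0) returns 0 and leaves the queue inactive, while the patched
path clears INACTIVE and reports a wake-up.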

Best Regards,
Songlu Cai

2014-10-31 10:55 GMT+08:00 songlu cai <caisonglu@REDACTED>:

> Hi Scott,
>
> Thanks for your attention & quick reply.
>
> It seems that quite a few people suffer from this problem.
>
> Scott>The best workaround is to use "+scl false" and "+sfwi" with a value
> of 500 or a bit smaller
>
> 1. We set +sfwi 500.
>
> 2. At first we set +scl false, but on R16B03 it caused unbalanced run
> queue lengths across all runqs, so we went back to +scl true (the
> default). So +scl false is not a safe choice on R16B03.
>
> Our test cmdline:
>
> /home/Xxx/erts-5.10.4/bin/beam.smp -zdbbl 8192 -sbt db -sbwt very_short
> -swt low -sfwi 500 -MBmmsbc 100 -MHmmsbc 100 -MBmmmbc 100 -MHmmmbc 100
> -MMscs 20480 -MBsmbcs 10240 -MHsbct 2048 -W w -e 50000 -Q 1000000 -hmbs
> 46422 -hms 2586 -P 1000000 -A 16 -K true -d -Bi -sct
> L23T0C0P0N0:L22T1C1P0N0:L21T2C2P0N0:L20T3C3P0N0:L19T4C4P0N0:L18T5C5P0N0:L17T6C0P1N1:L16T7C1P1N1:L15T8C2P1N1:L14T9C3P1N1:L13T10C4P1N1:L12T11C5P1N1:L11T12C0P0N0:L10T13C1P0N0:L9T14C2P0N0:L8T15C3P0N0:L7T16C4P0N0:L6T17C5P0N0:L5T18C0P1N1:L4T19C1P1N1:L3T20C2P1N1:L2T21C3P1N1:L1T22C4P1N1:L0T23C5P1N1
> -swct medium -- -root /home/Xxx/dir -progname Xxx -- -home /root -- -boot
> /home/Xxx/dir -mode interactive -config /home/Xxx/sys.config -shutdown_time
> 30000 -heart -setcookie Xxx -name proxy@REDACTED -- console
>
> And, apart from everything else, INACTIVE|NONEMPTY is not a normal state
> for the runq flags.
>
> Over the next few days I will fix the not-yet-confirmed bug in my own way,
> based on R16B03, and run the test cases again.
>
>
> Best Regards,
>
> Songlu Cai
>
> 2014-10-30 15:53 GMT+08:00 Scott Lystig Fritchie <fritchie@REDACTED>:
>
>> songlu cai <caisonglu@REDACTED> wrote:
>>
>> slc> How to fix:
>>
>> slc> [...]
>>
>> slc> 3, Or Another Way?
>>
>> Wow, that's quite a diagnosis.  I'm not a good judge of the race
>> condition that you've found or your fix.  I can provide some context,
>> however, in case that you weren't aware of it.  It might help to create
>> a Real, Final, 100% Correct Fix ... something which does not exist right
>> now.
>>
>> The best workaround is to use "+scl false" and "+sfwi" with a value of
>> 500 or a bit smaller.  See the discussion last month about it,
>>
>>
>> http://erlang.org/pipermail/erlang-questions/2014-September/081017.html
>>
>> My colleague Joe Blomstedt wrote a demo program that can cause scheduler
>> collapse to happen pretty quickly.  It might be useful for judging how
>> well any fix works ... at Basho we had a terrible time trying to
>> reproduce this bug before Joe found a semi-reliable trigger.
>>
>>     https://github.com/basho/nifwait
>>
>> It is discussed in this email thread (which is broken across two URLs,
>> sorry).
>>
>>     http://erlang.org/pipermail/erlang-questions/2012-October/069503.html
>>         (first message only, I don't know why)
>>     http://erlang.org/pipermail/erlang-questions/2012-October/069585.html
>>         (the rest of the thread)
>>
>> If your analysis is correct ... then hopefully this can lead quickly to
>> a Real, Final, 100% Correct Fix.  I'm tired of diagnosing systems that
>> suffer scheduler collapse and then discover that the customer forgot to
>> add the magic +sfwi and +scl flags in their runtime configuration to
>> work around that !@#$! bug.
>>
>> -Scott
>>
>
>