[erlang-questions] Unbalanced Scheduler Problem caused by Using out of date active_no_runq

Tue Nov 4 23:14:25 CET 2014

Hi Songlu Cai,

Thanks for your work on this!

Although it is not an error for the NONEMPTY and INACTIVE flags to be
set at the same time, it should occur rather infrequently. Assuming
that this is the case, your fix should have little effect on the
behavior, so the root cause should be elsewhere. I've found a potential
race where the NONEMPTY flag could end up set on the run-queue of a
waiting scheduler, which could cause the problem you are seeing.

When implementing support for balancing on scheduler utilization for
OTP 17.0 I rewrote the code that sets the NONEMPTY flag. This rewrite
removes that potential race. I've back-ported this rewrite based on
R16B03-1. It can be found in the
rickard/R16B03-1/load_balance/OTP-11385 branch
<https://github.com/rickard-green/otp/tree/rickard/R16B03-1/load_balance/OTP-11385>
of my github repo <https://github.com/rickard-green/otp.git>. Please
try it out and see if it solves your problem (use the same
configuration as before).

Something similar to your fix should perhaps be introduced in the end
anyway (one also has to check the SUSPENDED flag though), since there
is no good reason to prevent the activation just because the
run-queue happens to be non-empty. It would, however, be good to see
that the root issue has been fixed first before introducing this.
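
As a rough illustration of the condition described above, here is a
minimal, self-contained sketch of a predicate that honors an activation
request whether or not the run-queue is non-empty, but refuses it when
the run-queue is suspended. This is only one reading of the paragraph,
not code from the emulator: the DEMO_FLG_* bit values and the function
demo_should_activate are hypothetical names made up for this example,
and the real ERTS_RUNQ_FLG_* constants and chk_wake_sched logic live in
erl_process.c.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bit values, for illustration only. The real
 * ERTS_RUNQ_FLG_* constants are defined in the emulator sources. */
#define DEMO_FLG_NONEMPTY  (UINT32_C(1) << 0)
#define DEMO_FLG_INACTIVE  (UINT32_C(1) << 1)
#define DEMO_FLG_SUSPENDED (UINT32_C(1) << 2)

/* Hypothetical helper: returns 1 if an activation request should be
 * honored for a run-queue with the given flags, 0 otherwise.
 * NONEMPTY is deliberately not consulted, so a non-empty run-queue
 * does not block activation; SUSPENDED always blocks it. */
int demo_should_activate(int activate, uint32_t flags)
{
    if (!activate)
        return 0;                     /* caller did not ask to activate */
    if (flags & DEMO_FLG_SUSPENDED)
        return 0;                     /* never activate a suspended runq */
    return (flags & DEMO_FLG_INACTIVE) != 0; /* only inactive runqs need it */
}

/* Small usage example exercising the predicate. */
void demo_examples(void)
{
    /* The INACTIVE|NONEMPTY state discussed in this thread. */
    assert(demo_should_activate(1, DEMO_FLG_INACTIVE | DEMO_FLG_NONEMPTY));
    /* A suspended run-queue is left alone. */
    assert(!demo_should_activate(1, DEMO_FLG_INACTIVE | DEMO_FLG_SUSPENDED));
}
```

In the real code the activation itself would still have to go through
try_inc_no_active_runqs and clear the INACTIVE flag, as in the patch
quoted below; the sketch only covers the flag test.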

Regards,
Rickard Green, Erlang/OTP, Ericsson AB

On Mon, Nov 3, 2014 at 4:18 PM, songlu cai <caisonglu@REDACTED> wrote:
> Hi Scott,
>
> Last week I fixed the bug in a simple way, then ran the fixed version
> side by side with the old, unbalanced version.
> The two nodes were under the same pressure and timeline.
> The unbalanced (collapsed) state came up several times on the old version,
> but never showed up on the fixed version.
> The pressure spreads evenly among 24 schedulers on the fixed version
> (especially under high pressure).
> In fact, the fixed version is under higher pressure when the old version
> runs into the unbalanced state.
> Because the old version is then running on only 4 schedulers, it easily
> hits a bottleneck: its CPU is at 400%, while at the same time the fixed
> version is at 1200%.
> So I can be sure that the root cause of the unbalanced scheduler (scheduler
> collapse) is "using out of date active_no_runq", just as analyzed before.
>
> I only modified the chk_wake_sched function; the code diff is below:
>
> Index: emulator/beam/erl_process.c
> ===================================================================
> --- emulator/beam/erl_process.c (revision 298281)
> +++ emulator/beam/erl_process.c (working copy)
> @@ -2694,6 +2694,16 @@
>         return 0;
>      wrq = ERTS_RUNQ_IX(ix);
>      flags = ERTS_RUNQ_FLGS_GET(wrq);
> +
> +    if ( activate &&
> +       (flags & ERTS_RUNQ_FLG_NONEMPTY)  &&
> +       (flags & ERTS_RUNQ_FLG_INACTIVE)) {
> +       if (try_inc_no_active_runqs(ix+1))
> +               (void) ERTS_RUNQ_FLGS_UNSET(wrq, ERTS_RUNQ_FLG_INACTIVE);
> +       wake_scheduler(wrq, 0);
> +       return 1;
> +    }
> +
>      if (!(flags & (ERTS_RUNQ_FLG_SUSPENDED|ERTS_RUNQ_FLG_NONEMPTY))) {
>         if (activate) {
>             if (try_inc_no_active_runqs(ix+1))
>
> It saves the scheduler from the weird state. It is not a perfect fix, but
> it is an effective one.
> Scott, would you please apply this patch to R16B03 and run your test case
> again?
> Thank you very much; I look forward to your reply.
> I will also run it for a week to make sure that we really have fixed the
> problem.
>
> Best Regards,
> Songlu Cai
>
> 2014-10-31 10:55 GMT+08:00 songlu cai <caisonglu@REDACTED>:
>>
>> Hi Scott,
>>
>> Thanks for your attention & quick reply.
>>
>> It seems that quite a few people suffer from this problem.
>>
>> Scott>The best workaround is to use "+scl false" and "+sfwi" with a value
>> of 500 or a bit smaller
>>
>> 1. We set +sfwi 500.
>>
>> 2. At first we set +scl false, but it caused unbalanced run-queue
>> lengths across all run queues on R16B03, so we went back to +scl true
>> (the default). So +scl false is not a safe choice on R16B03.
>>
>> Our test cmdline:
>>
>> /home/Xxx/erts-5.10.4/bin/beam.smp -zdbbl 8192 -sbt db -sbwt very_short
>> -swt low -sfwi 500 -MBmmsbc 100 -MHmmsbc 100 -MBmmmbc 100 -MHmmmbc 100
>> -MMscs 20480 -MBsmbcs 10240 -MHsbct 2048 -W w -e 50000 -Q 1000000 -hmbs
>> 46422 -hms 2586 -P 1000000 -A 16 -K true -d -Bi -sct
>> L23T0C0P0N0:L22T1C1P0N0:L21T2C2P0N0:L20T3C3P0N0:L19T4C4P0N0:L18T5C5P0N0:L17T6C0P1N1:L16T7C1P1N1:L15T8C2P1N1:L14T9C3P1N1:L13T10C4P1N1:L12T11C5P1N1:L11T12C0P0N0:L10T13C1P0N0:L9T14C2P0N0:L8T15C3P0N0:L7T16C4P0N0:L6T17C5P0N0:L5T18C0P1N1:L4T19C1P1N1:L3T20C2P1N1:L2T21C3P1N1:L1T22C4P1N1:L0T23C5P1N1
>> -swct medium -- -root /home/Xxx/dir -progname Xxx -- -home /root -- -boot
>> /home/Xxx/dir -mode interactive -config /home/Xxx/sys.config -shutdown_time
>> 30000 -heart -setcookie Xxx -name proxy@REDACTED -- console
>>
>> And, apart from everything else, INACTIVE|NONEMPTY is not a normal state
>> for the runq flags.
>>
>> Over the next few days, I will fix the not-yet-confirmed bug in my own
>> way, based on R16B03, and run the test cases again.
>>
>>
>> Best Regards,
>>
>> Songlu Cai
>>
>>
>> 2014-10-30 15:53 GMT+08:00 Scott Lystig Fritchie <fritchie@REDACTED>:
>>>
>>> songlu cai <caisonglu@REDACTED> wrote:
>>>
>>> slc> How to fix:
>>>
>>> slc> [...]
>>>
>>> slc> 3, Or Another Way?
>>>
>>> Wow, that's quite a diagnosis.  I'm not a good judge of the race
>>> condition that you've found or your fix.  I can provide some context,
>>> however, in case that you weren't aware of it.  It might help to create
>>> a Real, Final, 100% Correct Fix ... something which does not exist right
>>> now.
>>>
>>> The best workaround is to use "+scl false" and "+sfwi" with a value of
>>> 500 or a bit smaller.  See the discussion last month about it,
>>>
>>>
>>> http://erlang.org/pipermail/erlang-questions/2014-September/081017.html
>>>
>>> My colleague Joe Blomstedt wrote a demo program that can cause scheduler
>>> collapse to happen pretty quickly.  It might be useful for judging how
>>> well any fix works ... at Basho we had a terrible time trying to
>>> reproduce this bug before Joe found a semi-reliable trigger.
>>>
>>>     https://github.com/basho/nifwait
>>>
>>> It is discussed in this email thread (which is broken across two URLs,
>>> sorry).
>>>
>>>     http://erlang.org/pipermail/erlang-questions/2012-October/069503.html
>>>         (first message only, I don't know why)
>>>     http://erlang.org/pipermail/erlang-questions/2012-October/069585.html
>>>         (the rest of the thread)
>>>
>>> If your analysis is correct ... then hopefully this can lead quickly to
>>> a Real, Final, 100% Correct Fix.  I'm tired of diagnosing systems that
>>> suffer scheduler collapse and then discover that the customer forgot to
>>> add the magic +sfwi and +scl flags in their runtime configuration to
>>> work around that !@#$! bug.
>>>
>>> -Scott
>>
>>