[erlang-questions] Unbalanced Scheduler Problem caused by Using out of date active_no_runq

Fri Oct 31 03:55:48 CET 2014

Hi Scott,

Thanks for your attention & quick reply.

It seems that quite a few people suffer from this problem.

Scott> The best workaround is to use "+scl false" and "+sfwi" with a value
Scott> of 500 or a bit smaller

1. We set +sfwi 500.

2. At first we set +scl false, but on R16B03 that caused unbalanced run
queue lengths across the run queues, so we went back to +scl true (the
default). In our experience +scl false is not a safe choice on R16B03.
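
For anyone reproducing this, here is a minimal sketch of those two
settings as they might appear in a vm.args-style arguments file. The file
itself is an assumption (it presumes the node is started by a release
script that passes such a file to erl); in our test the flags are passed
directly on the emulator command line shown below.

```
## Scheduler forced wakeup interval, in milliseconds (item 1 above).
+sfwi 500

## Scheduler compaction of load. true is the default; we went back to it
## after seeing unbalanced run queues with false on R16B03 (item 2 above).
+scl true
```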

Our test cmdline:

/home/Xxx/erts-5.10.4/bin/beam.smp -zdbbl 8192 -sbt db -sbwt very_short
-swt low -sfwi 500 -MBmmsbc 100 -MHmmsbc 100 -MBmmmbc 100 -MHmmmbc 100
-MMscs 20480 -MBsmbcs 10240 -MHsbct 2048 -W w -e 50000 -Q 1000000 -hmbs
46422 -hms 2586 -P 1000000 -A 16 -K true -d -Bi -sct
L23T0C0P0N0:L22T1C1P0N0:L21T2C2P0N0:L20T3C3P0N0:L19T4C4P0N0:L18T5C5P0N0:L17T6C0P1N1:L16T7C1P1N1:L15T8C2P1N1:L14T9C3P1N1:L13T10C4P1N1:L12T11C5P1N1:L11T12C0P0N0:L10T13C1P0N0:L9T14C2P0N0:L8T15C3P0N0:L7T16C4P0N0:L6T17C5P0N0:L5T18C0P1N1:L4T19C1P1N1:L3T20C2P1N1:L2T21C3P1N1:L1T22C4P1N1:L0T23C5P1N1
-swct medium -- -root /home/Xxx/dir -progname Xxx -- -home /root -- -boot
/home/Xxx/dir -mode interactive -config /home/Xxx/sys.config -shutdown_time
30000 -heart -setcookie Xxx -name proxy@REDACTED -- console

Also, apart from everything else, INACTIVE|NONEMPTY is not a normal state
for the runq flags.

Over the next few days I will apply my own fix for this not-yet-confirmed
bug on top of R16B03 and run the test cases again.

Best Regards,

Songlu Cai

2014-10-30 15:53 GMT+08:00 Scott Lystig Fritchie <fritchie@REDACTED>:

> songlu cai <caisonglu@REDACTED> wrote:
>
> slc> How to fix:
>
> slc> [...]
>
> slc> 3, Or Another Way?
>
> Wow, that's quite a diagnosis.  I'm not a good judge of the race
> condition that you've found or your fix.  I can provide some context,
> however, in case that you weren't aware of it.  It might help to create
> a Real, Final, 100% Correct Fix ... something which does not exist right
> now.
>
> The best workaround is to use "+scl false" and "+sfwi" with a value of
> 500 or a bit smaller.  See the discussion last month about it,
>
>
> http://erlang.org/pipermail/erlang-questions/2014-September/081017.html
>
> My colleague Joe Blomstedt wrote a demo program that can cause scheduler
> collapse to happen pretty quickly.  It might be useful for judging how
> well any fix works ... at Basho we had a terrible time trying to
> reproduce this bug before Joe found a semi-reliable trigger.
>
>     https://github.com/basho/nifwait
>
> It is discussed in this email thread (which is broken across two URLs,
> sorry).
>
>     http://erlang.org/pipermail/erlang-questions/2012-October/069503.html
>         (first message only, I don't know why)
>     http://erlang.org/pipermail/erlang-questions/2012-October/069585.html
>         (the rest of the thread)
>
> If your analysis is correct ... then hopefully this can lead quickly to
> a Real, Final, 100% Correct Fix.  I'm tired of diagnosing systems that
> suffer scheduler collapse and then discover that the customer forgot to
> add the magic +sfwi and +scl flags in their runtime configuration to
> work around that !@#$! bug.
>
> -Scott
>