[erlang-questions] Unbalanced Scheduler Problem caused by Using out of date active_no_runq

songlu cai caisonglu@REDACTED
Fri Oct 31 03:55:48 CET 2014

Hi Scott,

Thanks for your attention & quick reply.

It seems that quite a few people suffer from this problem.

Scott>The best workaround is to use "+scl false" and "+sfwi" with a value
of 500 or a bit smaller

1. We set +sfwi 500.

2. At first we also set +scl false, but on R16B03 that caused unbalanced
run queue lengths across the runqs, so we went back to +scl true (the
default). In our tests, +scl false is therefore not a safe choice on R16B03.
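
For anyone who starts the node through a release script rather than calling
beam.smp directly, here is a minimal sketch of the same two settings as
vm.args entries. The use of a vm.args file is an assumption for illustration
only; our test passes the flags on the beam.smp command line instead.

```
## Forced scheduler wakeup interval, in milliseconds
+sfwi 500
## Scheduler compaction of load; true is the default, written out
## here only to make the choice explicit
+scl true
```

In vm.args the flags keep the "+" prefix used by erl, whereas the beam.smp
command line shows them with a "-" prefix.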

Our test cmdline:

/home/Xxx/erts-5.10.4/bin/beam.smp -zdbbl 8192 -sbt db -sbwt very_short
-swt low -sfwi 500 -MBmmsbc 100 -MHmmsbc 100 -MBmmmbc 100 -MHmmmbc 100
-MMscs 20480 -MBsmbcs 10240 -MHsbct 2048 -W w -e 50000 -Q 1000000 -hmbs
46422 -hms 2586 -P 1000000 -A 16 -K true -d -Bi -sct
-swct medium -- -root /home/Xxx/dir -progname Xxx -- -home /root -- -boot
/home/Xxx/dir -mode interactive -config /home/Xxx/sys.config -shutdown_time
30000 -heart -setcookie Xxx -name proxy@REDACTED -- console

And, apart from everything else, INACTIVE|NONEMPTY is not a normal state
for the runq flags.

Over the next few days, I will apply my own fix for this not-yet-confirmed
bug on top of R16B03 and run the test cases again.

Best Regards,

Songlu Cai

2014-10-30 15:53 GMT+08:00 Scott Lystig Fritchie <fritchie@REDACTED>:

> songlu cai <caisonglu@REDACTED> wrote:
> slc> How to fix:
> slc> [...]
> slc> 3, Or Another Way?
> Wow, that's quite a diagnosis.  I'm not a good judge of the race
> condition that you've found or your fix.  I can provide some context,
> however, in case that you weren't aware of it.  It might help to create
> a Real, Final, 100% Correct Fix ... something which does not exist right
> now.
> The best workaround is to use "+scl false" and "+sfwi" with a value of
> 500 or a bit smaller.  See the discussion last month about it,
> http://erlang.org/pipermail/erlang-questions/2014-September/081017.html
> My colleague Joe Blomstedt wrote a demo program that can cause scheduler
> collapse to happen pretty quickly.  It might be useful for judging how
> well any fix works ... at Basho we had a terrible time trying to
> reproduce this bug before Joe found a semi-reliable trigger.
>     https://github.com/basho/nifwait
> It is discussed in this email thread (which is broken across two URLs,
> sorry).
>     http://erlang.org/pipermail/erlang-questions/2012-October/069503.html
>         (first message only, I don't know why)
>     http://erlang.org/pipermail/erlang-questions/2012-October/069585.html
>         (the rest of the thread)
> If your analysis is correct ... then hopefully this can lead quickly to
> a Real, Final, 100% Correct Fix.  I'm tired of diagnosing systems that
> suffer scheduler collapse and then discover that the customer forgot to
> add the magic +sfwi and +scl flags in their runtime configuration to
> work around that !@#$! bug.
> -Scott