[erlang-questions] Unbalanced Scheduler Problem caused by Using out of date active_no_runq
Fri Oct 31 03:55:48 CET 2014
Thanks for your attention & quick reply.
It seems that quite a few people suffer from this problem.
Scott> The best workaround is to use "+scl false" and "+sfwi" with a value
Scott> of 500 or a bit smaller
1. We set +sfwi 500.
2. At first we set +scl false, but on R16B03 that caused unbalanced runq
lengths across the runqs, so we went back to +scl true (the default). In our
tests, +scl false is not a safe choice on R16B03.
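For anyone collecting the flags mentioned above, here is a minimal sketch of how they might be kept in an args file. It assumes the node is started with `erl -args_file vm.args`; the file name vm.args is only an illustration and is not taken from our setup (our test passes the flags directly to beam.smp, as shown below). The values are the ones discussed in this thread.

```
# vm.args (hypothetical file name), loaded with: erl -args_file vm.args

# Scheduler forced wakeup interval, in milliseconds: all run queues are
# scanned at this interval (the workaround value discussed in this thread).
+sfwi 500

# Scheduler compaction of load. true is the default; "+scl false" is the
# suggested workaround, but it gave us unbalanced run queues on R16B03.
+scl true
```

Whether +scl should be true or false probably depends on the release and the workload, so treat the second setting as something to test rather than as a recommendation.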
Our test command line:
/home/Xxx/erts-5.10.4/bin/beam.smp -zdbbl 8192 -sbt db -sbwt very_short
-swt low -sfwi 500 -MBmmsbc 100 -MHmmsbc 100 -MBmmmbc 100 -MHmmmbc 100
-MMscs 20480 -MBsmbcs 10240 -MHsbct 2048 -W w -e 50000 -Q 1000000 -hmbs
46422 -hms 2586 -P 1000000 -A 16 -K true -d -Bi -sct
-swct medium -- -root /home/Xxx/dir -progname Xxx -- -home /root -- -boot
/home/Xxx/dir -mode interactive -config /home/Xxx/sys.config -shutdown_time
30000 -heart -setcookie Xxx -name proxy@REDACTED -- console
Also, apart from everything else, INACTIVE|NONEMPTY is not a normal state
for the runq flags.
Over the next few days, I will apply my own fix for the suspected (not yet
confirmed) bug on top of R16B03 and run the test cases again.
2014-10-30 15:53 GMT+08:00 Scott Lystig Fritchie <fritchie@REDACTED>:
> songlu cai <caisonglu@REDACTED> wrote:
> slc> How to fix:
> slc> [...]
> slc> 3, Or Another Way?
> Wow, that's quite a diagnosis. I'm not a good judge of the race
> condition that you've found or your fix. I can provide some context,
> however, in case that you weren't aware of it. It might help to create
> a Real, Final, 100% Correct Fix ... something which does not exist right
> now.
> The best workaround is to use "+scl false" and "+sfwi" with a value of
> 500 or a bit smaller. See the discussion last month about it,
> My colleague Joe Blomstedt wrote a demo program that can cause scheduler
> collapse to happen pretty quickly. It might be useful for judging how
> well any fix works ... at Basho we had a terrible time trying to
> reproduce this bug before Joe found a semi-reliable trigger.
> It is discussed in this email thread (which is broken across two URLs,
> (first message only, I don't know why)
> (the rest of the thread)
> If your analysis is correct ... then hopefully this can lead quickly to
> a Real, Final, 100% Correct Fix. I'm tired of diagnosing systems that
> suffer scheduler collapse and then discover that the customer forgot to
> add the magic +sfwi and +scl flags in their runtime configuration to
> work around that !@#$! bug.