<div dir="ltr"><p class="MsoNormal"><span lang="EN-US">Hi Scott,</span></p>
<p class="MsoNormal"><span lang="EN-US"> </span></p>
<p class="MsoNormal"><span lang="EN-US">Thanks for your attention and quick reply.</span></p>
<p class="MsoNormal"><span lang="EN-US">It seems that quite a few people suffer
from this problem.</span></p>
<p class="MsoNormal"><span lang="EN-US"> </span></p>
<p class="MsoNormal"><span lang="EN-US">Scott>The best workaround is to use
"+scl false" and "+sfwi" with a value of 500 or a bit
smaller</span></p>
<p class="MsoNormal"><span lang="EN-US">1. We set +sfwi 500.</span></p>
<p class="MsoNormal"><span lang="EN-US">2. At first we set +scl false, but on R16B03 it caused
unbalanced run queue lengths across the runqs, so we went back to +scl true (the default).
In our tests, +scl false is not a safe choice on R16B03.</span></p>
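<p class="MsoNormal"><span lang="EN-US">For reference, a minimal sketch of the two flag combinations discussed above, written as plain erl invocations. This is only an illustration: our real command line below calls beam.smp directly, where the same options appear with a leading "-" instead of "+".</span></p>
<pre>
# Scott's suggested workaround: disable scheduler compaction of load
# and force a scheduler wakeup check every 500 ms.
erl +scl false +sfwi 500

# What we run now on R16B03: leave +scl at its default (true), keep +sfwi.
erl +sfwi 500
</pre>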
<p class="MsoNormal"><span lang="EN-US"> </span></p>
<p class="MsoNormal"><span lang="EN-US">Our test cmdline:</span></p>
<p class="MsoNormal"><span lang="EN-US">/home/Xxx/erts-5.10.4/bin/beam.smp -zdbbl
8192 -sbt db -sbwt very_short -swt low -sfwi 500 -MBmmsbc 100 -MHmmsbc 100
-MBmmmbc 100 -MHmmmbc 100 -MMscs 20480 -MBsmbcs 10240 -MHsbct 2048 -W w -e
50000 -Q 1000000 -hmbs 46422 -hms 2586 -P 1000000 -A 16 -K true -d -Bi -sct
L23T0C0P0N0:L22T1C1P0N0:L21T2C2P0N0:L20T3C3P0N0:L19T4C4P0N0:L18T5C5P0N0:L17T6C0P1N1:L16T7C1P1N1:L15T8C2P1N1:L14T9C3P1N1:L13T10C4P1N1:L12T11C5P1N1:L11T12C0P0N0:L10T13C1P0N0:L9T14C2P0N0:L8T15C3P0N0:L7T16C4P0N0:L6T17C5P0N0:L5T18C0P1N1:L4T19C1P1N1:L3T20C2P1N1:L2T21C3P1N1:L1T22C4P1N1:L0T23C5P1N1
-swct medium -- -root /home/Xxx/dir -progname Xxx -- -home /root -- -boot
/home/Xxx/dir -mode interactive -config /home/Xxx/sys.config -shutdown_time
30000 -heart -setcookie Xxx -name proxy@Xxx -- console</span></p>
<p class="MsoNormal"><span lang="EN-US"> </span></p>
<p class="MsoNormal"><span lang="EN-US">Apart from everything else,
INACTIVE|NONEMPTY is not a normal state for the runq flags.</span></p>
<p class="MsoNormal"><span lang="EN-US">Over the next few days I will apply my own fix for the suspected (not yet confirmed)
bug on top of R16B03 and run the test cases again.</span></p><p class="MsoNormal"><br></p>
<p class="MsoNormal"><span lang="EN-US">Best Regards,</span></p>
<p class="MsoNormal"><span lang="EN-US">Songlu Cai</span></p></div><div class="gmail_extra"><br><div class="gmail_quote">2014-10-30 15:53 GMT+08:00 Scott Lystig Fritchie <span dir="ltr"><<a href="mailto:fritchie@snookles.com" target="_blank">fritchie@snookles.com</a>></span>:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">songlu cai <<a href="mailto:caisonglu@gmail.com">caisonglu@gmail.com</a>> wrote:<br>
<br>
slc> How to fix:<br>
<br>
slc> [...]<br>
<br>
slc> 3, Or Another Way?<br>
<br>
Wow, that's quite a diagnosis. I'm not a good judge of the race<br>
condition that you've found or your fix. I can provide some context,<br>
however, in case that you weren't aware of it. It might help to create<br>
a Real, Final, 100% Correct Fix ... something which does not exist right<br>
now.<br>
<br>
The best workaround is to use "+scl false" and "+sfwi" with a value of<br>
500 or a bit smaller. See the discussion last month about it,<br>
<br>
<a href="http://erlang.org/pipermail/erlang-questions/2014-September/081017.html" target="_blank">http://erlang.org/pipermail/erlang-questions/2014-September/081017.html</a><br>
<br>
My colleague Joe Blomstedt wrote a demo program that can cause scheduler<br>
collapse to happen pretty quickly. It might be useful for judging how<br>
well any fix works ... at Basho we had a terrible time trying to<br>
reproduce this bug before Joe found a semi-reliable trigger.<br>
<br>
<a href="https://github.com/basho/nifwait" target="_blank">https://github.com/basho/nifwait</a><br>
<br>
It is discussed in this email thread (which is broken across two URLs,<br>
sorry).<br>
<br>
<a href="http://erlang.org/pipermail/erlang-questions/2012-October/069503.html" target="_blank">http://erlang.org/pipermail/erlang-questions/2012-October/069503.html</a><br>
(first message only, I don't know why)<br>
<a href="http://erlang.org/pipermail/erlang-questions/2012-October/069585.html" target="_blank">http://erlang.org/pipermail/erlang-questions/2012-October/069585.html</a><br>
(the rest of the thread)<br>
<br>
If your analysis is correct ... then hopefully this can lead quickly to<br>
a Real, Final, 100% Correct Fix. I'm tired of diagnosing systems that<br>
suffer scheduler collapse and then discover that the customer forgot to<br>
add the magic +sfwi and +scl flags in their runtime configuration to<br>
work around that !@#$! bug.<br>
<span class="HOEnZb"><font color="#888888"><br>
-Scott<br>
</font></span></blockquote></div><br></div>