<div dir="ltr"><p class="MsoNormal"><span lang="EN-US">Hi Scott,</span></p>


<p class="MsoNormal"><span lang="EN-US"> </span></p>


<p class="MsoNormal"><span lang="EN-US">Thanks for your attention and quick reply.</span></p>


<p class="MsoNormal"><span lang="EN-US">It seems that quite a few people suffer

from this problem.</span></p>


<p class="MsoNormal"><span lang="EN-US"> </span></p>


<p class="MsoNormal"><span lang="EN-US">Scott>The best workaround is to use

"+scl false" and "+sfwi" with a value of 500 or a bit

smaller</span></p>


<p class="MsoNormal"><span lang="EN-US">1. We set +sfwi 500.</span></p>


<p class="MsoNormal"><span lang="EN-US">2. At first we set +scl false, but on R16B03 it caused

unbalanced run queue lengths across the runqs, so we went back to +scl true (the default).

In our experience, +scl false is not a safe choice on R16B03.</span></p>
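
<p class="MsoNormal"><span lang="EN-US">For anyone comparing the two settings, here is a minimal sketch of how they might look in a vm.args file. The vm.args layout is only an assumption for illustration; our test passes the flags directly to beam.smp, as in the command line below.</span></p>

<pre>
## Forced wakeup interval: scan all run queues every 500 ms.
+sfwi 500

## Scheduler compaction of load. Scott's workaround disables it:
## +scl false
## On R16B03 that left our run queues unbalanced, so we keep the default:
+scl true
</pre>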


<p class="MsoNormal"><span lang="EN-US"> </span></p>


<p class="MsoNormal"><span lang="EN-US">Our test command line:</span></p>


<p class="MsoNormal"><span lang="EN-US">/home/Xxx/erts-5.10.4/bin/beam.smp -zdbbl

8192 -sbt db -sbwt very_short -swt low -sfwi 500 -MBmmsbc 100 -MHmmsbc 100

-MBmmmbc 100 -MHmmmbc 100 -MMscs 20480 -MBsmbcs 10240 -MHsbct 2048 -W w -e

50000 -Q 1000000 -hmbs 46422 -hms 2586 -P 1000000 -A 16 -K true -d -Bi -sct

L23T0C0P0N0:L22T1C1P0N0:L21T2C2P0N0:L20T3C3P0N0:L19T4C4P0N0:L18T5C5P0N0:L17T6C0P1N1:L16T7C1P1N1:L15T8C2P1N1:L14T9C3P1N1:L13T10C4P1N1:L12T11C5P1N1:L11T12C0P0N0:L10T13C1P0N0:L9T14C2P0N0:L8T15C3P0N0:L7T16C4P0N0:L6T17C5P0N0:L5T18C0P1N1:L4T19C1P1N1:L3T20C2P1N1:L2T21C3P1N1:L1T22C4P1N1:L0T23C5P1N1

-swct medium -- -root /home/Xxx/dir -progname Xxx -- -home /root -- -boot

/home/Xxx/dir -mode interactive -config /home/Xxx/sys.config -shutdown_time

30000 -heart -setcookie Xxx -name proxy@Xxx -- console</span></p>


<p class="MsoNormal"><span lang="EN-US"> </span></p>


<p class="MsoNormal"><span lang="EN-US">And, apart from everything else,

INACTIVE|NONEMPTY is not a normal state for the runq flags.</span></p>


<p class="MsoNormal"><span lang="EN-US">Over the next few days, I will apply my own fix for this not-yet-confirmed

bug on top of R16B03 and run the test cases again.</span></p><p class="MsoNormal"><br></p>


<p class="MsoNormal"><span lang="EN-US">Best Regards,</span></p>


<p class="MsoNormal"><span lang="EN-US">Songlu Cai</span></p></div><div class="gmail_extra"><br><div class="gmail_quote">2014-10-30 15:53 GMT+08:00 Scott Lystig Fritchie <span dir="ltr"><<a href="mailto:fritchie@snookles.com" target="_blank">fritchie@snookles.com</a>></span>:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">songlu cai <<a href="mailto:caisonglu@gmail.com">caisonglu@gmail.com</a>> wrote:<br>

<br>

slc> How to fix:<br>

<br>

slc> [...]<br>

<br>

slc> 3, Or Another Way?<br>

<br>

Wow, that's quite a diagnosis.  I'm not a good judge of the race<br>

condition that you've found or your fix.  I can provide some context,<br>

however, in case that you weren't aware of it.  It might help to create<br>

a Real, Final, 100% Correct Fix ... something which does not exist right<br>

now.<br>

<br>

The best workaround is to use "+scl false" and "+sfwi" with a value of<br>

500 or a bit smaller.  See the discussion last month about it,<br>

<br>

    <a href="http://erlang.org/pipermail/erlang-questions/2014-September/081017.html" target="_blank">http://erlang.org/pipermail/erlang-questions/2014-September/081017.html</a><br>

<br>

My colleague Joe Blomstedt wrote a demo program that can cause scheduler<br>

collapse to happen pretty quickly.  It might be useful for judging how<br>

well any fix works ... at Basho we had a terrible time trying to<br>

reproduce this bug before Joe found a semi-reliable trigger.<br>

<br>

    <a href="https://github.com/basho/nifwait" target="_blank">https://github.com/basho/nifwait</a><br>

<br>

It is discussed in this email thread (which is broken across two URLs,<br>

sorry).<br>

<br>

    <a href="http://erlang.org/pipermail/erlang-questions/2012-October/069503.html" target="_blank">http://erlang.org/pipermail/erlang-questions/2012-October/069503.html</a><br>

        (first message only, I don't know why)<br>

    <a href="http://erlang.org/pipermail/erlang-questions/2012-October/069585.html" target="_blank">http://erlang.org/pipermail/erlang-questions/2012-October/069585.html</a><br>

        (the rest of the thread)<br>

<br>

If your analysis is correct ... then hopefully this can lead quickly to<br>

a Real, Final, 100% Correct Fix.  I'm tired of diagnosing systems that<br>

suffer scheduler collapse and then discover that the customer forgot to<br>

add the magic +sfwi and +scl flags in their runtime configuration to<br>

work around that !@#$! bug.<br>

<span class="HOEnZb"><font color="#888888"><br>

-Scott<br>

</font></span></blockquote></div><br></div>