[erlang-questions] Unbalanced Scheduler Problem caused by Using out of date active_no_runq
Scott Lystig Fritchie
fritchie@REDACTED
Thu Oct 30 08:53:22 CET 2014
songlu cai <caisonglu@REDACTED> wrote:
slc> How to fix:
slc> [...]
slc> 3, Or Another Way?
Wow, that's quite a diagnosis. I'm not a good judge of the race
condition that you've found or your fix. I can provide some context,
however, in case you weren't aware of it. It might help to create
a Real, Final, 100% Correct Fix ... something which does not exist right
now.
The best workaround is to use "+scl false" and "+sfwi" with a value of
500 or a bit smaller. See the discussion last month about it,
http://erlang.org/pipermail/erlang-questions/2014-September/081017.html
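For concreteness, here is a minimal sketch of how those two flags might be set. It assumes a release that reads emulator flags from a vm.args file; the file name and layout depend on your packaging, so treat them as an assumption. "+scl false" turns off scheduler compaction of load, and "+sfwi" takes a forced wakeup interval in milliseconds. The value 500 is just the upper end of the range suggested above, not a tuned number.

```
## vm.args fragment (hypothetical file location; adjust for your release)

## Disable scheduler compaction of load
+scl false

## Force a scan of all run queues every 500 milliseconds
+sfwi 500
```

The same flags can also be passed directly on the command line, e.g. "erl +scl false +sfwi 500".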
My colleague Joe Blomstedt wrote a demo program that can cause scheduler
collapse to happen pretty quickly. It might be useful for judging how
well any fix works ... at Basho we had a terrible time trying to
reproduce this bug before Joe found a semi-reliable trigger.
https://github.com/basho/nifwait
It is discussed in this email thread (which is broken across two URLs,
sorry).
http://erlang.org/pipermail/erlang-questions/2012-October/069503.html
(first message only, I don't know why)
http://erlang.org/pipermail/erlang-questions/2012-October/069585.html
(the rest of the thread)
If your analysis is correct ... then hopefully this can lead quickly to
a Real, Final, 100% Correct Fix. I'm tired of diagnosing systems that
suffer scheduler collapse and then discovering that the customer forgot to
add the magic +sfwi and +scl flags to their runtime configuration to
work around that !@#$! bug.
-Scott