[erlang-questions] Unbalanced Scheduler Problem caused by Using out of date active_no_runq

Thu Oct 30 08:53:22 CET 2014

songlu cai <caisonglu@REDACTED> wrote:

slc> How to fix:

slc> [...]

slc> 3, Or Another Way?

Wow, that's quite a diagnosis.  I'm not a good judge of the race
condition that you've found or of your fix.  I can provide some context,
however, in case you weren't aware of it.  It might help in creating
a Real, Final, 100% Correct Fix ... something which does not exist right
now.

The best workaround is to use "+scl false" and "+sfwi" with a value of
500 (milliseconds) or a bit smaller.  See last month's discussion about it:

    http://erlang.org/pipermail/erlang-questions/2014-September/081017.html
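
For reference, here is a minimal sketch of how those two flags might
look in a vm.args file.  The file name and the exact interval are
assumptions, so tune the value for your own system.  The same flags can
also be passed directly on the erl command line.

    # Disable scheduler compaction of load
    +scl false
    # Scan all run queues every 500 ms (assumed value; a bit smaller also works)
    +sfwi 500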

My colleague Joe Blomstedt wrote a demo program that can cause scheduler
collapse to happen pretty quickly.  It might be useful for judging how
well any fix works ... at Basho we had a terrible time trying to
reproduce this bug before Joe found a semi-reliable trigger.

    https://github.com/basho/nifwait

It is discussed in this email thread (which is broken across two URLs,
sorry).

    http://erlang.org/pipermail/erlang-questions/2012-October/069503.html
        (first message only, I don't know why)
    http://erlang.org/pipermail/erlang-questions/2012-October/069585.html
        (the rest of the thread)

If your analysis is correct ... then hopefully this can lead quickly to
a Real, Final, 100% Correct Fix.  I'm tired of diagnosing systems that
suffer scheduler collapse and then discovering that the customer forgot
to add the magic +sfwi and +scl flags to their runtime configuration to
work around that !@#$! bug.

-Scott