[erlang-bugs] Schedulers getting "stuck", part II

Wed May 1 00:13:15 CEST 2013

Patrik, here are a couple of synthetic load cases whose end result
matches what we occasionally see Riak and Riak CS doing in the wild.
Many, many thanks to Joseph Blomstedt for inventing these two modules.

  test10.erl:
    https://gist.github.com/jtuple/0d9ca553b7e58adcb6f4
  test11.erl:
    https://gist.github.com/jtuple/8f12ce9c21471f5d6f01

Both can be used by running the 'go/0' function.
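
For example, assuming both files have been saved from the gists into
the directory where the Erlang shell was started (and that they compile
without extra options, which I am taking for granted here), a session
would look roughly like this:

```erlang
%% Typed at the Erlang shell prompt, one expression at a time.
%% c/1 compiles and loads the module from the current directory.
c(test10).
test10:go().

%% Same procedure for the second module.
c(test11).
test11:go().
```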

The test10:go() function creates an oscillation between a couple of
workloads: one that tends toward scheduler collapse, and one that tends
to wake them up again.

The test11:go() function uses only a single load that tends toward
scheduler collapse.

Both of them fail fairly regularly on my 8-core MBP using R15B01,
R15B03, and R16B.

The io:format() messages are sent while no load is running, with very
generous pauses before starting the next phase of workload.  If you call
io:format() during an unfairly scheduled workload (which these tests
excel at creating), the messages can be delayed by dozens of seconds.

Note that these synthetic tests are using two different functions to
cause scheduler collapse: test10.erl with crypto:md5_update/2, a NIF,
and test11.erl with erlang:external_size/1, a BIF.  It's quite likely
that erlang:term_to_binary/1 is similarly effective/buggy.
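
To illustrate the general shape of such a load, here is a minimal
sketch.  It is not the code from either gist: the module name
collapse_sketch, the term shape, and all of the sizes and durations are
made up for illustration, and it is not guaranteed to trigger the
collapse.  It only shows many processes repeatedly calling
erlang:external_size/1 on a large term.

```erlang
%% Hypothetical sketch; not the code from test10.erl or test11.erl.
-module(collapse_sketch).
-export([go/0, go/3]).

%% Defaults are arbitrary: two workers per scheduler, a list of
%% 100000 elements, and five seconds of load.
go() ->
    go(erlang:system_info(schedulers) * 2, 100000, 5000).

go(NumProcs, TermLen, DurationMs) ->
    %% Print only while no load is running.
    io:format("starting ~p workers~n", [NumProcs]),
    Pids = [spawn(fun() -> loop(make_term(TermLen)) end)
            || _ <- lists:seq(1, NumProcs)],
    timer:sleep(DurationMs),
    lists:foreach(fun(P) -> exit(P, kill) end, Pids),
    io:format("workers stopped~n"),
    ok.

%% Each worker builds its own large term.
make_term(TermLen) ->
    lists:duplicate(TermLen, {some, term, <<"payload">>}).

%% Call the BIF on the large term over and over until killed.
loop(Big) ->
    _ = erlang:external_size(Big),
    loop(Big).
```

Swapping erlang:external_size/1 for erlang:term_to_binary/1 in loop/1
would be one way to check the guess about term_to_binary.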

Neither of them fails when using this patch on any of those three VM
versions:

    https://github.com/slfritchie/otp/compare/erlang:maint...disable-scheduler-sleeps
  or
    https://github.com/slfritchie/otp/tree/disable-scheduler-sleeps

... when also using "+scl false +zdnfgtse 500:500".
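
Spelled out as a command line, and assuming the erl on your PATH is a
VM built from the disable-scheduler-sleeps branch above (the +zdnfgtse
flag comes from that branch, so a stock VM is not expected to accept
it):

```shell
# +scl false turns off scheduler compaction of load.
# +zdnfgtse 500:500 is the setting provided by the patched VM.
erl +scl false +zdnfgtse 500:500
```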

-Scott