[erlang-bugs] Schedulers getting "stuck", part II

Patrik Nyblom
Tue May 7 14:23:22 CEST 2013

Hi Scott (and Joe)!

Thank you for these tests!

I would say Joe's comment at the end of the test10 gist says it all, and 
is spot on:

"This isn't just a NIF problem. Any code that sits in C land and doesn't
accurately contribute towards scheduler reductions can cause this. So,
BIFs that don't estimate work and perform BIF_TRAPs are also bad. Turns
out that the commonly used term_to_binary and external_size BIFs have
this problem."

Joe points out a couple of misbehaving BIFs and NIFs which will cause
this, breaking the scheduling algorithm. I bet there are more of them. I
can see several problems that need to be fixed:

1) OTP should of course not have code (BIFs or NIFs or whatever) that
does not even bump reductions or trap properly.
2) If writing NIFs, you should have a way to monitor the scheduler
behavior to easily find long schedules. DTrace is nice, but not
available everywhere...
3) If writing NIFs, you should have a simple way to put the execution
of your code in a separate worker thread.

The answer to (1) is that we continue (or intensify) our work when it
comes to adding proper reductions and trapping to BIFs (and NIFs). A
first step would be to just add proper reductions to all relevant BIFs,
which is fairly easy to do. Whenever there's a BIF whose work depends on
the size of the input, it should at least charge the process a cost
proportional to that size. Some old BIFs do not even do that, which
really needs to be fixed. Contributions are always welcome...
term_to_binary and external_size are already being worked on, but there
are most probably more problem BIFs out there...
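
To make "charge a proportional cost and trap" concrete, here is a minimal
sketch in plain C. It is only a model of the idea and does not use the
real ERTS or erl_nif API: the names job_t, job_run_slice, job_run_all and
the cost of one reduction per 64 bytes are all assumptions made up for
illustration. The work is done in slices, each slice reports how many
reductions it used, and the function returns JOB_YIELD instead of running
to completion when the budget for the slice is spent.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cost model: one reduction per 64 bytes of input. */
#define BYTES_PER_REDUCTION 64

/* State that must survive between slices (what a trap would carry). */
typedef struct {
    const uint8_t *data;  /* input buffer */
    size_t len;           /* total input size in bytes */
    size_t pos;           /* how far we have come */
    uint32_t sum;         /* running result (a simple byte sum) */
} job_t;

typedef enum { JOB_DONE, JOB_YIELD } job_status_t;

/* Process at most `budget` reductions worth of input.
 * Stores the reductions charged for this slice in *used. */
static job_status_t job_run_slice(job_t *job, int budget, int *used)
{
    assert(budget > 0);
    size_t max_bytes = (size_t)budget * BYTES_PER_REDUCTION;
    size_t remaining = job->len - job->pos;
    size_t n = remaining < max_bytes ? remaining : max_bytes;

    for (size_t i = 0; i < n; i++)
        job->sum += job->data[job->pos + i];
    job->pos += n;

    /* Charge a cost proportional to the bytes handled, rounded up,
     * and never less than one reduction per call. */
    int cost = (int)((n + BYTES_PER_REDUCTION - 1) / BYTES_PER_REDUCTION);
    *used = cost > 0 ? cost : 1;

    return job->pos == job->len ? JOB_DONE : JOB_YIELD;
}

/* Stand-in for the scheduler: reschedule the job until it is done.
 * Returns the number of slices that were needed. */
static int job_run_all(job_t *job, int budget)
{
    int slices = 0;
    int used = 0;
    job_status_t st;
    do {
        st = job_run_slice(job, budget, &used);
        slices++;  /* other processes would get to run here */
    } while (st == JOB_YIELD);
    return slices;
}
```

With a budget of 5 reductions per slice, a 1000 byte input is handled in
four slices of at most 320 bytes each, so no single slice runs for a time
that depends on the total input size.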

One step towards (2) is the ability to monitor long schedules in the
system. I've extended erlang:system_monitor/2 to have an option to
monitor all schedules and port operations that run for more than a
specified amount of wall clock time. That should at least help in
identifying such problems (the code is not in maint yet, but will be
soon). More monitoring options to see the scheduler behavior may be
needed, but this is at least a start. As an example, monitoring long
schedules in test10 will inform you that the processes run
uninterrupted for a whopping 1.5 *seconds*. Just adding reduction cost
to the md5 calls will reduce this to a tenth of the scheduling time of
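
The check behind such a monitor can be modelled in a few lines of plain
C. This is a sketch under stated assumptions, not the system_monitor
implementation: slice_t, report_long_schedules and the microsecond
timestamps are hypothetical, and the real option name and message format
are not given in this mail. Each record holds the wall clock time at
which something was scheduled in and out, and every record that ran for
at least the threshold is reported.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical record of one uninterrupted run on a scheduler. */
typedef struct {
    int id;           /* stand-in for a pid or port identifier */
    uint64_t in_us;   /* wall clock time when scheduled in, microseconds */
    uint64_t out_us;  /* wall clock time when scheduled out, microseconds */
} slice_t;

/* Print every slice that ran for at least threshold_ms milliseconds
 * and return how many such slices were found. */
static int report_long_schedules(const slice_t *s, int n,
                                 uint64_t threshold_ms)
{
    assert(n >= 0);
    int hits = 0;
    for (int i = 0; i < n; i++) {
        assert(s[i].out_us >= s[i].in_us);
        uint64_t ms = (s[i].out_us - s[i].in_us) / 1000;
        if (ms >= threshold_ms) {
            printf("long schedule: id=%d ran %llu ms\n",
                   s[i].id, (unsigned long long)ms);
            hits++;
        }
    }
    return hits;
}
```

A slice like the one seen in test10, scheduled in at 1000 us and out at
1501000 us, is 1500 ms long and would be reported with any threshold up
to 1500 ms, while a well-behaved slice of under a millisecond would not.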

The answer to (3) is "dirty schedulers", which is on the roadmap for R17.

I think all three things need to be done for the scheduling to work
properly, but not only for that. A schedule that takes too long also
breaks the real-time properties of the VM, so fixing this by poking the
schedulers to wake up at certain intervals just handles one symptom, but
does not remove the cause and does not cure the impact on real time

So - it's not the scheduling algorithms as such that result in this
problem, it's still a problem with uninterrupted C-code. These examples
show that some (or many) of our BIFs need to be fixed, that we need to
intensify the work on monitoring options and that we need dirty
schedulers. At least that's how I see it.


On 05/01/2013 12:13 AM, Scott Lystig Fritchie wrote:
> Patrik, there are a couple of synthetic load cases that have an end
> result of what we occasionally see Riak and Riak CS doing in the wild.
> Many, many thanks to Joseph Blomstedt for inventing these two modules.
>    test10.erl:
>      https://gist.github.com/jtuple/0d9ca553b7e58adcb6f4
>    test11.erl:
>      https://gist.github.com/jtuple/8f12ce9c21471f5d6f01
> Both can be used by running the 'go/0' function.
> The test10:go() function creates an oscillation between a couple of
> workloads: one that tends toward scheduler collapse, and one that tends
> to wake them up again.
> The test11:go() function uses only a single load that tends toward
> scheduler collapse.
> Both of them fail mostly regularly on my 8 core MBP using R15B01,
> R15B03, and R16B.
> The io:format() messages are sent while load is not running, with very
> generous pauses before starting the next phase of workload.  If you call
> io:format() during unfairly-scheduled workload (which these tests excel
> at doing), the messages can be delayed by dozens of seconds.
> Note that these synthetic tests are using two different functions to
> cause scheduler collapse: test10.erl with crypto:md5_update/2, a NIF,
> and test11.erl with erlang:external_size/1, a BIF.  It's quite likely
> that erlang:term_to_binary/1 is similarly effective/buggy.
> Neither of them fails when using this patch on any of those three VM
> versions:
>      https://github.com/slfritchie/otp/compare/erlang:maint...disable-scheduler-sleeps
>    or
>      https://github.com/slfritchie/otp/tree/disable-scheduler-sleeps
> ... when also using "+scl false +zdnfgtse 500:500".
> -Scott
