[erlang-bugs] Schedulers getting "stuck", part II
Tue May 7 14:23:22 CEST 2013
Hi Scott (and Joe)!
Thank you for these tests!
I would say Joe's comment at the end of the test10 gist says it all, and
is spot on:
"This isn't just a NIF problem. Any code that sits in C land and doesn't
accurately contribute towards scheduler reductions can cause this. So,
BIFs that don't estimate work and perform BIF_TRAPs are also bad. Turns
out that the commonly used |term_to_binary| and |external_size|
BIFs have this problem."
Joe points out a couple of misbehaving BIFs and NIFs that will cause
this, breaking the scheduling algorithm. I bet there are more of them. I
can see several problems that need to be fixed:
1) OTP should of course not have code (BIFs, NIFs or whatever) that
does not even bump reductions or trap properly.
2) If writing NIFs, you should have a way to monitor the scheduler
behavior to easily find long schedules. DTrace is nice, but not
3) If writing NIFs, you should have a simple way to put the execution
of your code in a separate worker thread.
The answer to (1) is that we continue (or intensify) our work when it
comes to adding proper reductions and trapping to BIFs (and NIFs). A
first step would be to just add proper reductions to all relevant BIFs,
which is fairly easy to do. Whenever there's a BIF whose work depends on
the size of the input, it should also at least add a cost to the process
that's proportional to that size. Some old BIFs do not even do that,
which really needs to be fixed. Contributions are always welcome...
term_to_binary and external_size are already being worked on, but there
are most probably more problem BIFs out there...
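
Until a given BIF or NIF accounts for its own work, a caller can approximate the idea from the Erlang side. The following is only a sketch of a caller-side workaround, not the fix inside the emulator described above: it feeds md5 in bounded chunks so that no single call into C runs for long, and charges the process a cost proportional to the bytes hashed with erlang:bump_reductions/1. The module name, chunk size and bytes-per-reduction ratio are made-up values for illustration, not tuned numbers.

```erlang
-module(chunked_md5).
-export([md5/1]).

%% Hypothetical tuning constants, chosen only for illustration.
-define(CHUNK_BYTES, 65536).
-define(BYTES_PER_REDUCTION, 64).

%% Hash a binary in bounded chunks instead of one long call into C.
md5(Data) when is_binary(Data) ->
    update(crypto:hash_init(md5), Data).

%% A full chunk is available: hash it, then charge the calling process
%% a reduction cost proportional to the amount of data just hashed.
update(Ctx, <<Chunk:?CHUNK_BYTES/binary, Rest/binary>>) ->
    Ctx1 = crypto:hash_update(Ctx, Chunk),
    erlang:bump_reductions(?CHUNK_BYTES div ?BYTES_PER_REDUCTION),
    update(Ctx1, Rest);
%% Fewer than CHUNK_BYTES left (possibly zero): finish the digest.
update(Ctx, Last) ->
    crypto:hash_final(crypto:hash_update(Ctx, Last)).
```

The result is the same digest as a single crypto:hash(md5, Data) call; only the scheduling behavior differs, since the process can be scheduled out between chunks.
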
One step towards (2) is the ability to monitor long schedules in the
system. I've extended erlang:system_monitor/2 to have an option to
monitor all schedules and port operations that run for more than a
specified amount of wall clock time. That should at least help in
identifying such problems (the code is not in maint yet, but will be
soon). More monitoring options to see the scheduler behavior may be
needed, but this is at least a start. As an example, monitoring long
schedules in test10 will inform you that the processes run
uninterrupted for a whopping 1.5 *seconds*. Just adding reduction cost
to the md5 calls will reduce this to a tenth of the scheduling time of
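
As a rough illustration of the monitoring described above, here is a minimal sketch that assumes the option is named long_schedule, which is the name it carries in released OTP versions; the code was not yet in maint when this mail was written. The module name and the choice to simply print each event are illustrative assumptions.

```erlang
-module(long_schedule_watch).
-export([start/1]).

%% Start a process that receives a message whenever a process or port
%% runs uninterrupted for more than ThresholdMs milliseconds of wall
%% clock time. Only one system monitor is active per node, so this
%% replaces any monitor that was set earlier.
start(ThresholdMs) when is_integer(ThresholdMs), ThresholdMs > 0 ->
    spawn(fun() ->
                  erlang:system_monitor(self(), [{long_schedule, ThresholdMs}]),
                  loop()
          end).

loop() ->
    receive
        {monitor, PidOrPort, long_schedule, Info} ->
            %% Info is a list of tagged tuples describing the event,
            %% including how long the schedule took.
            io:format("long schedule in ~p: ~p~n", [PidOrPort, Info]),
            loop();
        _Other ->
            loop()
    end.
```

Calling long_schedule_watch:start(100) before running a load such as test10:go() should then report each schedule longer than 100 ms.
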
The answer to (3) is "dirty schedulers", which is in the roadmap for R17.
I think all three things need to be done for the scheduling to work
properly, but not only for that. A schedule that takes too long also
breaks the real-time properties of the VM, so fixing this by poking the
schedulers to wake up at certain intervals just handles one symptom, but
does not remove the cause and does not cure the impact on real time
So - it's not the scheduling algorithms as such that result in this
problem, it's still a problem with uninterrupted C-code. These examples
show that some (or many) of our BIFs need to be fixed, that we need to
intensify the work on monitoring options and that we need dirty
schedulers. At least that's how I see it.
On 05/01/2013 12:13 AM, Scott Lystig Fritchie wrote:
> Patrik, there are a couple of synthetic load cases that have an end
> result of what we occasionally see Riak and Riak CS doing in the wild.
> Many, many thanks to Joseph Blomstedt for inventing these two modules.
> Both can be used by running the 'go/0' function.
> The test10:go() function creates an oscillation between a couple of
> workloads: one that tends toward scheduler collapse, and one that tends
> to wake them up again.
> The test11:go() function uses only a single load that tends toward
> scheduler collapse.
> Both of them fail mostly regularly on my 8 core MBP using R15B01,
> R15B03, and R16B.
> The io:format() messages are sent while load is not running, with very
> generous pauses before starting the next phase of workload. If you call
> io:format() during unfairly-scheduled workload (which these tests excel
> at doing), the messages can be delayed by dozens of seconds.
> Note that these synthetic tests are using two different functions to
> cause scheduler collapse: test10.erl with crypto:md5_update/2, a NIF,
> and test11.erl with erlang:external_size/1, a BIF. It's quite likely
> that erlang:term_to_binary/1 is similarly effective/buggy.
> Neither of them fails when using this patch on any of those three VM
> ... when also using "+scl false +zdnfgtse 500:500".