Hi Scott (and Joe)!

Thank you for these tests!

I would say Joe's comment at the end of the test10 gist says it all,
and is spot on:

"This isn't just a NIF problem. Any code that sits in C land and
doesn't accurately contribute towards scheduler reductions can cause
this. So, BIFs that don't estimate work and perform BIF_TRAPs are
also bad. Turns out that the commonly used term_to_binary and
external_size BIFs have this problem."

Joe points out a couple of misbehaving BIFs and NIFs that will cause
this by breaking the scheduling algorithm, and I bet there are more
of them. I can see several problems that need to be fixed:

1) OTP should of course not contain code (BIFs, NIFs, or whatever)
that does not bump reductions or trap properly.
2) If writing NIFs, you should have a way to monitor scheduler
behavior, to easily find long schedules. DTrace is nice, but not
available everywhere...
3) If writing NIFs, you should have a simple way to put the
execution of your code on a separate worker thread.

The answer to (1) is that we continue (or intensify) our work on
adding proper reductions and trapping to BIFs (and NIFs). A first
step would be to just add proper reductions to all relevant BIFs,
which is fairly easy to do. Whenever there is a BIF whose work
depends on the size of its input, it should at least add a cost to
the calling process proportional to that size. Some old BIFs do not
even do that, which really needs to be fixed. Contributions are
always welcome... term_to_binary and external_size are already being
worked on, but there are most probably more problem BIFs out there...

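To illustrate the accounting principle from the Erlang side (the real
fixes belong in the C implementations inside erts, where the cost can
be charged and the work split up accurately), here is a minimal
sketch, applied to the md5 NIF from test10. The wrapper name and the
divisor are made up for illustration; erlang:bump_reductions/1 and
crypto:md5_update/2 are real:

    %% Charge the caller a reduction cost proportional to the input
    %% size before calling a NIF that does not bump reductions itself.
    %% The divisor 64 is an uncalibrated guess, not a measured cost.
    md5_update_counted(Ctx, Bin) ->
        erlang:bump_reductions(max(1, byte_size(Bin) div 64)),
        crypto:md5_update(Ctx, Bin).
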
One step towards (2) is the ability to monitor long schedules in the
system. I've extended erlang:system_monitor/2 with an option to
monitor all schedules and port operations that run for more than a
specified amount of wall clock time. That should at least help in
identifying such problems (the code is not in maint yet, but will be
soon). More monitoring options to inspect scheduler behavior may be
needed, but this is at least a start. As an example, monitoring long
schedules in test10 will inform you that the processes run
uninterrupted for a whopping 1.5 *seconds*. Just adding a reduction
cost to the md5 calls will of course cut that scheduling time to a
tenth.

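Something like the following sketch should work once it lands (the
option name long_schedule and the {monitor, PidOrPort, long_schedule,
Info} message shape follow the usual system_monitor conventions, but
may still change before the code reaches maint):

    %% Report every process or port that is scheduled in for more
    %% than LimitMs milliseconds of wall clock time.
    monitor_long_schedules(LimitMs) ->
        erlang:system_monitor(self(), [{long_schedule, LimitMs}]),
        report_loop().

    report_loop() ->
        receive
            {monitor, PidOrPort, long_schedule, Info} ->
                io:format("long schedule by ~p: ~p~n",
                          [PidOrPort, Info]),
                report_loop()
        end.

Running this with a limit of, say, 100 ms while test10:go() is
executing should then flag the 1.5-second schedules mentioned above.
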
The answer to (3) is "dirty schedulers", which are on the roadmap
for R17.

I think all three things need to be done for the scheduling to work
properly, but not only for that. A schedule that takes too long also
breaks the real-time properties of the VM, so fixing this by poking
the schedulers awake at certain intervals handles one symptom, but
does not remove the cause and does not cure the impact on real-time
behavior...

So, it's not the scheduling algorithm as such that causes this
problem; it's still a problem with uninterrupted C code. These
examples show that some (or many) of our BIFs need to be fixed, that
we need to intensify the work on monitoring options, and that we
need dirty schedulers. At least that's how I see it.

Cheers,
Patrik

On 05/01/2013 12:13 AM, Scott Lystig Fritchie wrote:

> Patrik, there are a couple of synthetic load cases whose end result
> is what we occasionally see Riak and Riak CS doing in the wild.
> Many, many thanks to Joseph Blomstedt for inventing these two
> modules.
>
> test10.erl:
> https://gist.github.com/jtuple/0d9ca553b7e58adcb6f4
>
> test11.erl:
> https://gist.github.com/jtuple/8f12ce9c21471f5d6f01
>
> Both can be used by running the 'go/0' function.
>
> The test10:go() function creates an oscillation between a couple of
> workloads: one that tends toward scheduler collapse, and one that
> tends to wake the schedulers up again.
>
> The test11:go() function uses only a single load that tends toward
> scheduler collapse.
>
> Both of them fail fairly regularly on my 8-core MBP using R15B01,
> R15B03, and R16B.
>
> The io:format() messages are sent while the load is not running,
> with very generous pauses before starting the next phase of the
> workload. If you call io:format() during an unfairly-scheduled
> workload (which these tests excel at creating), the messages can be
> delayed by dozens of seconds.
>
> Note that these synthetic tests use two different functions to cause
> scheduler collapse: test10.erl uses crypto:md5_update/2, a NIF, and
> test11.erl uses erlang:external_size/1, a BIF. It's quite likely
> that erlang:term_to_binary/1 is similarly effective/buggy.
>
> Neither of them fails when using this patch on any of those three VM
> versions:
>
> https://github.com/slfritchie/otp/compare/erlang:maint...disable-scheduler-sleeps
>
> or
>
> https://github.com/slfritchie/otp/tree/disable-scheduler-sleeps
>
> ... when also using "+scl false +zdnfgtse 500:500".
>
> -Scott