<div dir="ltr"><div><div>If memory serves, R14 can't experience scheduler collapse since it doesn't rebalance work the same way as R15 and onwards. So I think this is a red herring.<br><br></div>Have you established a baseline for the locking in R14? You are contending on the runqueue lock quite a lot, which could account for all the spinning you are seeing, but it is hard to say if this is a high or low number without some baseline to compare against. Also, many of the futex() calls are probably due to this contention as well. There is a chance your scheduler utilization isn't that high, but you are ending up in the spin loop all the time. If utilization is fairly low, then the 50% CPU isn't of concern: just load the system more :)<br><br></div><div>Chances are you are hunting the wrong mark as well: you have 2 or more pathologies, and they overlap in what you are seeing. Hence you get distracted by the noise generated by the other problems. It may be you have a CPU problem and, on top of that, a latency/blocking problem in an I/O layer as well. One could account for the latency spikes, whereas the other would explain the high CPU. But if you don't know you are sitting with two problems in the first place, then their interplay in the system confuses you.<br><br></div><div></div><div>If you have elevated CPU, then sampling the current thread stacks at 97 Hz[1] should tell you where things are taking time. This in itself could give you hints as to where you are spending all of your time in the system, and also what you are spinning on, if anything.<br><br></div><div>[1] Old trick: Never sample at 100 Hz or any other round rate, since that can put you in phase with other periodic jobs. 
Pick some prime around your target.<br><br></div></div><div class="gmail_extra"><br><div class="gmail_quote">On Tue, Sep 1, 2015 at 9:42 PM, Lukas Larsson <span dir="ltr"><<a href="mailto:lukas@erlang-solutions.com" target="_blank">lukas@erlang-solutions.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div class="gmail_extra"><div class="gmail_quote"><span class="">On Tue, Sep 1, 2015 at 9:14 PM, Paul Davis <span dir="ltr"><<a href="mailto:paul.joseph.davis@gmail.com" target="_blank">paul.joseph.davis@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div><div><span style="color:rgb(34,34,34)">Also, does anyone have a quick pointer to where the busy wait loop is?</span><br></div></div>
After I look at the scheduler time I was going to go find that code<br>
and see if I couldn't come up with a better idea of what exactly might<br>
be changing with that setting.</blockquote><div><br></div></span><div>This should be the code that does the waiting: <a href="https://github.com/erlang/otp/blob/master/erts/lib_src/pthread/ethr_event.c#L65-L161" target="_blank">https://github.com/erlang/otp/blob/master/erts/lib_src/pthread/ethr_event.c#L65-L161</a></div><div><br></div><div>The mutex implementation that calls it is in here: <a href="https://github.com/erlang/otp/blob/master/erts/lib_src/common/ethr_mutex.c" target="_blank">https://github.com/erlang/otp/blob/master/erts/lib_src/common/ethr_mutex.c</a><br></div><div><br></div><div>The different spin options are set here: <a href="https://github.com/erlang/otp/blob/master/erts/emulator/beam/erl_process.c#L5325-L5364" target="_blank">https://github.com/erlang/otp/blob/master/erts/emulator/beam/erl_process.c#L5325-L5364</a></div><div><br></div><div>There are also a couple of other places where it spins in erl_process.c; just search for spin and you'll find them :)</div></div></div></div>
</blockquote></div><br><br clear="all"><br>-- <br><div class="gmail_signature">J.</div>
</div>