<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<p>Hi Mark</p>
<p>I have seen the high-scheduler/low-CPU pattern before on a
system that had lock contention on ETS tables. I was able to
verify that this was the case using lcnt.<br>
</p>
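<p>For reference, this is roughly the lcnt session I used. It needs an
emulator built with lock counting (e.g. configured with
--enable-lock-counter), and the lock class passed to inspect/1 below is
just an example of what showed up for us:<br>
</p>
<pre>lcnt:start(),
lcnt:collect(),              %% snapshot lock statistics from the running node
lcnt:conflicts(),            %% list the lock classes with the most collisions
lcnt:inspect(db_hash_slot).  %% drill into the ETS hash-slot locks, for example
</pre>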
Cheers<br>
Andy<br>
<br>
<div class="moz-cite-prefix">On 17/02/2018 22:17, Mark Smith wrote:<br>
</div>
<blockquote type="cite"
cite="mid:CAFMFiFZfkCSLa5AthjXwdEY9v=wpiKgBrrj0vVnx1_XwUGN9wg@mail.gmail.com">
<div dir="ltr">Hello,
<div><br>
</div>
<div>I'm an SRE at Discord. We make heavy use of Elixir/Erlang
in production and it works really well for us. Lately we've
been seeing an interesting performance issue with our VMs and
have spent quite a few hours trying to understand it, but
haven't found a root cause yet.</div>
<div><br>
</div>
<div>The top-line summary is: weighted scheduler utilization
rises from 30% to 80% over a 1-2 minute span while user CPU
usage drops from 70% to 30%. This state lasts roughly 5-30
minutes and then recovers just as quickly as it began.</div>
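<div><br>
</div>
<div>(For those numbers: the weighted scheduler utilization comes
from erlang:statistics(scheduler_wall_time). A minimal way to
sample it, assuming recon is loaded on the node, is shown
below.)</div>
<pre>%% Per-scheduler weighted utilization over a one-second window.
%% recon:scheduler_usage/1 enables scheduler_wall_time for the sample
%% and restores the previous setting afterwards.
recon:scheduler_usage(1000).
</pre>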
<div><br>
</div>
<div>Reference: <a
href="https://cdn.discordapp.com/attachments/392075851132567552/414530154988175363/unknown.png"
moz-do-not-send="true">https://cdn.discordapp.com/attachments/392075851132567552/414530154988175363/unknown.png</a></div>
<div><br>
</div>
<div>We've managed to narrow the behavior down to memory
bandwidth/latency: during the periods where we see the above
impact, malloc/mmap latency (as measured by eBPF tracing)
increases by 5-10x. That gives us two leading theories for the
cause of these issues:</div>
<div><br>
</div>
<div>a) Noisy neighbors. We're in a cloud environment, so it's
possible that another guest on the shared host is saturating
memory bandwidth and causing this impact. We've been working
with Google Cloud to investigate this branch.</div>
<div><br>
</div>
<div>b) One of our Erlang processes is allocating in an
anti-pattern and causing the memory bandwidth saturation.
I.e., we're shooting ourselves in the foot.</div>
<div><br>
</div>
<div>We're trying to find the root cause because, if it's the
second situation, we'd of course like to understand and fix the
problem. We've added a fair bit of instrumentation and stats
recording based on recon_alloc, so we do have some pointers,
but none of us are knowledgeable enough about the BEAM VM to
know where to go next.</div>
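<div><br>
</div>
<div>(A sketch of the kind of recon_alloc calls we sample for
this; none of it is conclusive on its own, it is just the raw
material for the graphs below.)</div>
<pre>%% Assumes recon is available on the node.
recon_alloc:memory(usage).          %% fraction of mapped memory actually in use
recon_alloc:fragmentation(current). %% per-instance block vs carrier utilization
recon_alloc:cache_hit_rates().      %% how often mseg_alloc hits its carrier cache
</pre>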
<div><br>
</div>
<div>Some interesting things we have determined:</div>
<div><br>
</div>
<div>Reference: <a
href="https://cdn.discordapp.com/attachments/392075851132567552/414530948097376256/unknown.png"
moz-do-not-send="true">https://cdn.discordapp.com/attachments/392075851132567552/414530948097376256/unknown.png</a></div>
<div><br>
</div>
<div>* temp_alloc allocations drop quite a bit, sl_alloc usage
rises, and the two become mirror images of each other.</div>
<div>* fix_alloc rises somewhat during the bad periods.</div>
<div>* other allocators seem unaffected.</div>
<div><br>
</div>
<div>sl_alloc feels like an interesting culprit. The number of
MBCs in sl_alloc is ~500 during the good periods but climbs to
~15,000 during the bad periods. However, the number of
mseg_alloc calls (from sl_alloc) drops to nearly 0, which
doesn't seem to make sense to me.</div>
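<div><br>
</div>
<div>(To double-check the counters behind those two graphs, this
is roughly the per-instance data we pull for sl_alloc; the exact
layout of the mbcs/calls proplists varies a little between OTP
releases.)</div>
<pre>%% mbcs holds the multiblock-carrier counters, calls the per-allocator
%% call counters (including the calls sl_alloc makes into mseg_alloc).
lists:map(
  fun({instance, N, Info}) ->
        {N,
         proplists:get_value(mbcs, Info),
         proplists:get_value(calls, Info)}
  end,
  erlang:system_info({allocator, sl_alloc})).
</pre>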
<div><br>
</div>
<div>sl_alloc MBCs: <a
href="https://cdn.discordapp.com/attachments/392075851132567552/414543516782428161/unknown.png"
moz-do-not-send="true">https://cdn.discordapp.com/attachments/392075851132567552/414543516782428161/unknown.png</a></div>
<div><br>
</div>
<div>sl_alloc calls to mseg_alloc: <a
href="https://cdn.discordapp.com/attachments/392075851132567552/414543687201062923/unknown.png"
moz-do-not-send="true">https://cdn.discordapp.com/attachments/392075851132567552/414543687201062923/unknown.png</a></div>
<div><br>
</div>
<div>I have no idea whether this is cause or effect, so my
basic question is: how would you go about determining that, and
what would you suggest we do next with the investigation?</div>
<div>
<div><br>
</div>
<div>I'm available on IRC as 'zorkian' and on Discord as
zorkian#0001 if anybody wants to follow up more immediately or
ask for more data.</div>
<div><br>
</div>
<div>Thank you for reading!</div>
<div><br>
</div>
-- <br>
<div class="gmail_signature">
<div dir="ltr">
<div>Mark Smith</div>
<div>SRE @ Discord<br>
</div>
</div>
</div>
</div>
</div>
<br>
</blockquote>
<br>
</body>
</html>