<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<p>Hi Mark</p>
<p>I have seen the high-scheduler/low-CPU pattern before on a
system that had lock contention on ETS tables. I was able to
verify that this was the case using lcnt.<br>
</p>
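<p>For reference, this is roughly the lcnt session I used. It needs an
emulator built with lock counting (e.g. configured with
--enable-lock-counter), and the lock class passed to inspect/1 below is
just an example of what showed up for us:<br>
</p>
<pre>lcnt:start(),
lcnt:collect(),              %% snapshot lock statistics from the running node
lcnt:conflicts(),            %% list the lock classes with the most collisions
lcnt:inspect(db_hash_slot).  %% drill into the ETS hash-slot locks, for example
</pre>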
Cheers<br>
Andy<br>
<br>
<div class="moz-cite-prefix">On 17/02/2018 22:17, Mark Smith wrote:<br>
</div>
<blockquote type="cite"
cite="mid:CAFMFiFZfkCSLa5AthjXwdEY9v=wpiKgBrrj0vVnx1_XwUGN9wg@mail.gmail.com">
<div dir="ltr">Hello,
<div><br>
</div>
<div>I'm an SRE at Discord. We make heavy use of Elixir/Erlang
in production and it works really well for us. Lately we've
been seeing an interesting performance issue with our VMs and
have spent quite a few hours trying to understand it, but
haven't found a root cause yet.</div>
<div><br>
</div>
<div>The top-line summary is: weighted scheduler utilization
rises from 30% to 80% over a 1-2 minute span while user CPU
usage drops from 70% to 30%. This state lasts roughly 5-30
minutes and then recovers just as quickly as it began.</div>
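<div><br>
</div>
<div>(For those numbers: the weighted scheduler utilization comes
from erlang:statistics(scheduler_wall_time). A minimal way to
sample it, assuming recon is loaded on the node, is shown
below.)</div>
<pre>%% Per-scheduler weighted utilization over a one-second window.
%% recon:scheduler_usage/1 enables scheduler_wall_time for the sample
%% and restores the previous setting afterwards.
recon:scheduler_usage(1000).
</pre>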
<div><br>
</div>
<div>Reference: <a
href="https://cdn.discordapp.com/attachments/392075851132567552/414530154988175363/unknown.png"
moz-do-not-send="true">https://cdn.discordapp.com/attachments/392075851132567552/414530154988175363/unknown.png</a></div>
<div><br>
</div>
<div>We've managed to narrow the behavior down to memory
bandwidth/latency: during the periods where we see the above
impact, malloc/mmap latency (as measured by eBPF tracing)
increases by 5-10x. That gives us two leading theories for the
cause of these issues:</div>
<div><br>
</div>
<div>a) Noisy neighbors. We're in a cloud environment, so it's
possible that another guest on the shared host is saturating
memory bandwidth and causing this impact. We've been working
with Google Cloud to investigate this branch.</div>
<div><br>
</div>
<div>b) One of our Erlang processes is allocating in an
anti-pattern and causing the memory bandwidth saturation.
I.e., we're shooting ourselves in the foot.</div>
<div><br>
</div>
<div>We're trying to find the root cause because, if it's the
second situation, we'd of course like to understand and fix the
problem. We've added a fair bit of instrumentation and stats
recording based on recon_alloc, so we do have some pointers,
but none of us are knowledgeable enough about the BEAM VM to
know where to go next.</div>
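<div><br>
</div>
<div>(A sketch of the kind of recon_alloc calls we sample for
this; none of it is conclusive on its own, it is just the raw
material for the graphs below.)</div>
<pre>%% Assumes recon is available on the node.
recon_alloc:memory(usage).          %% fraction of mapped memory actually in use
recon_alloc:fragmentation(current). %% per-instance block vs carrier utilization
recon_alloc:cache_hit_rates().      %% how often mseg_alloc hits its carrier cache
</pre>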
<div><br>
</div>
<div>Some interesting things we have determined:</div>
<div><br>
</div>
<div>Reference: <a
href="https://cdn.discordapp.com/attachments/392075851132567552/414530948097376256/unknown.png"
moz-do-not-send="true">https://cdn.discordapp.com/attachments/392075851132567552/414530948097376256/unknown.png</a></div>
<div><br>
</div>
<div>* temp_alloc allocations drop quite a bit, sl_alloc usage
rises, and the two become mirror images of each other.</div>
<div>* fix_alloc rises somewhat during the bad periods.</div>
<div>* other allocators seem unaffected.</div>
<div><br>
</div>
<div>sl_alloc feels like an interesting culprit. The number of
MBCs in sl_alloc is ~500 during the good periods but climbs to
~15,000 during the bad periods. However, the number of
mseg_alloc calls (from sl_alloc) drops to nearly 0, which
doesn't seem to make sense to me.</div>
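<div><br>
</div>
<div>(To double-check the counters behind those two graphs, this
is roughly the per-instance data we pull for sl_alloc; the exact
layout of the mbcs/calls proplists varies a little between OTP
releases.)</div>
<pre>%% mbcs holds the multiblock-carrier counters, calls the per-allocator
%% call counters (including the calls sl_alloc makes into mseg_alloc).
lists:map(
  fun({instance, N, Info}) ->
        {N,
         proplists:get_value(mbcs, Info),
         proplists:get_value(calls, Info)}
  end,
  erlang:system_info({allocator, sl_alloc})).
</pre>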
<div><br>
</div>
<div>sl_alloc MBCs: <a
href="https://cdn.discordapp.com/attachments/392075851132567552/414543516782428161/unknown.png"
moz-do-not-send="true">https://cdn.discordapp.com/attachments/392075851132567552/414543516782428161/unknown.png</a></div>
<div><br>
</div>
<div>sl_alloc calls to mseg_alloc: <a
href="https://cdn.discordapp.com/attachments/392075851132567552/414543687201062923/unknown.png"
moz-do-not-send="true">https://cdn.discordapp.com/attachments/392075851132567552/414543687201062923/unknown.png</a></div>
<div><br>
</div>
<div>I have no idea whether this is cause or effect, so my
basic question is: how would you go about determining that, and
what would you suggest we do next with the investigation?</div>
<div>
<div><br>
</div>
<div>I'm available on IRC as 'zorkian' and on Discord as
zorkian#0001 if anybody wants to follow up more immediately or
ask for more data.</div>
<div><br>
</div>
<div>Thank you for reading!</div>
<div><br>
</div>
-- <br>
<div class="gmail_signature">
<div dir="ltr">
<div>Mark Smith</div>
<div>SRE @ Discord<br>
</div>
</div>
</div>
</div>
</div>
<br>
</blockquote>
<br>
</body>
</html>