<div dir="ltr"> Hi Danil! Thank you very much for the suggestions. For the time being, we can't switch the BAD cluster to a new Erlang build with extended msacc and lcnt. In the GOOD cluster we already have lcnt enabled and we are going to enable msacc, although we won't have anything to compare it against.<div><br></div><div>We'll report back any findings regarding the other suggestions very soon.</div><div><br></div><div>Thanks!</div><div>  </div></div><br><div class="gmail_quote"><div dir="ltr">On Wed, Jul 18, 2018 at 11:28 AM Danil Zagoskin &lt;<a href="mailto:z@gosk.in">z@gosk.in</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">Hi!<div><br></div><div>I know that feeling when the load is very high and you don't know why.<br><br>More things to look at:<br>* Extended msacc: configure OTP with --with-microstate-accounting=extra<div>* LCNT: lcnt:apply(timer, sleep, [5000]), lcnt:conflicts(), lcnt:inspect, etc.</div><div>* Check whether some processes are spending too much time running. You may find this tool useful: <a href="https://gist.github.com/stolen/9a28ed9403c724541b0ee5fcd822613e" target="_blank">https://gist.github.com/stolen/9a28ed9403c724541b0ee5fcd822613e</a></div><div>* Network buffers. Check whether rmem/wmem in sysctl are the same on both clusters. Also check the network interfaces: MTU, drops, etc.</div><div>* NUMA and scheduler bindings. Try running the whole application on a single NUMA node to avoid the interconnect cost.</div><div>* Other processes on the host. Once we saw a malware miner that hid itself from ps and was active only when the server was busy doing its main job. 
Use perf top on every CPU core to detect this.</div><div><br></div></div></div><br><div class="gmail_quote"><div dir="ltr">On Wed, Jul 18, 2018 at 6:34 AM Marcial Rosales &lt;<a href="mailto:mrosales@pivotal.io" target="_blank">mrosales@pivotal.io</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div><span style="color:rgb(36,41,46);font-family:-apple-system,system-ui,"Segoe UI",Helvetica,Arial,sans-serif,"Apple Color Emoji","Segoe UI Emoji","Segoe UI Symbol";font-size:16px;background-color:rgb(255,255,255);text-decoration-style:initial;text-decoration-color:initial;float:none;display:inline">We are experiencing very high CPU utilization in 3 clustered Erlang VMs running RabbitMQ. We have deployed another cluster in an attempt to reproduce the same behaviour, without much success.</span><br></div><div><span style="color:rgb(36,41,46);font-family:-apple-system,system-ui,"Segoe UI",Helvetica,Arial,sans-serif,"Apple Color Emoji","Segoe UI Emoji","Segoe UI Symbol";font-size:16px;background-color:rgb(255,255,255);text-decoration-style:initial;text-decoration-color:initial;float:none;display:inline"><br></span></div><div><p style="box-sizing:border-box;margin-top:0px;margin-bottom:16px;color:rgb(36,41,46);font-family:-apple-system,system-ui,"Segoe UI",Helvetica,Arial,sans-serif,"Apple Color Emoji","Segoe UI Emoji","Segoe UI Symbol";font-size:16px;text-decoration-style:initial;text-decoration-color:initial">Our goals are:</p><ul style="box-sizing:border-box;padding-left:2em;margin-top:0px;margin-bottom:16px;color:rgb(36,41,46);font-family:-apple-system,system-ui,"Segoe UI",Helvetica,Arial,sans-serif,"Apple Color Emoji","Segoe UI Emoji","Segoe UI Symbol";font-size:16px;text-decoration-style:initial;text-decoration-color:initial"><li style="box-sizing:border-box">Find out where the CPU is being utilized</li><li style="box-sizing:border-box;margin-top:0.25em">Choose the right tools to 
analyze CPU utilization</li></ul><br class="m_7907647030859876833m_-2256524670992938224gmail-Apple-interchange-newline"><p style="box-sizing:border-box;margin-top:0px;margin-bottom:16px;color:rgb(36,41,46);font-family:-apple-system,system-ui,"Segoe UI",Helvetica,Arial,sans-serif,"Apple Color Emoji","Segoe UI Emoji","Segoe UI Symbol";font-size:16px;text-decoration-style:initial;text-decoration-color:initial">Our observations so far:</p><ul style="box-sizing:border-box;padding-left:2em;margin-top:0px;margin-bottom:16px;color:rgb(36,41,46);font-family:-apple-system,system-ui,"Segoe UI",Helvetica,Arial,sans-serif,"Apple Color Emoji","Segoe UI Emoji","Segoe UI Symbol";font-size:16px;text-decoration-style:initial;text-decoration-color:initial"><li style="box-sizing:border-box">The<span> </span><strong style="box-sizing:border-box;font-weight:600">BAD</strong><span> </span>cluster shows excessive CPU utilization, both user and system, as well as higher network traffic.</li><li style="box-sizing:border-box;margin-top:0.25em">The<span> </span><strong style="box-sizing:border-box;font-weight:600">BAD</strong><span> </span>cluster also shows higher Erlang scheduler utilization, especially in the microstates<span> </span><code style="box-sizing:border-box;font-family:SFMono-Regular,Consolas,"Liberation Mono",Menlo,Courier,monospace;font-size:13.6px;padding:0.2em 0.4em;margin:0px;background-color:rgba(27,31,35,0.05);border-radius:3px">emulator</code><span> </span>and<span> </span><code style="box-sizing:border-box;font-family:SFMono-Regular,Consolas,"Liberation Mono",Menlo,Courier,monospace;font-size:13.6px;padding:0.2em 0.4em;margin:0px;background-color:rgba(27,31,35,0.05);border-radius:3px">other</code>. 
We have yet to understand what<span> </span><code style="box-sizing:border-box;font-family:SFMono-Regular,Consolas,"Liberation Mono",Menlo,Courier,monospace;font-size:13.6px;padding:0.2em 0.4em;margin:0px;background-color:rgba(27,31,35,0.05);border-radius:3px">other</code><span> </span>could be. According to the Erlang documentation, it covers<span> </span><em style="box-sizing:border-box">unaccounted things</em>.</li><li style="box-sizing:border-box;margin-top:0.25em">The<span> </span><strong style="box-sizing:border-box;font-weight:600">BAD</strong><span> </span>cluster makes a considerably higher number of system calls; we have yet to identify why (and are not sure how to).</li><li style="box-sizing:border-box;margin-top:0.25em">The<span> </span><strong style="box-sizing:border-box;font-weight:600">BAD</strong><span> </span>cluster does not necessarily run a higher number of reductions. In fact, the<span> </span><strong style="box-sizing:border-box;font-weight:600">GOOD</strong><span> </span>cluster runs more reductions and yet has lower scheduler utilization.</li></ul><div><font color="#24292e" face="-apple-system, system-ui, Segoe UI, Helvetica, Arial, sans-serif, Apple Color Emoji, Segoe UI Emoji, Segoe UI Symbol"><span style="font-size:16px"><table style="box-sizing:border-box;border-collapse:collapse;margin-top:0px;margin-bottom:16px;display:block;width:888px;overflow:auto;text-decoration-style:initial;text-decoration-color:initial"><thead style="box-sizing:border-box"><tr style="box-sizing:border-box;background-color:rgb(255,255,255);border-top:1px solid rgb(198,203,209)"><th style="box-sizing:border-box;padding:6px 13px;font-weight:600;border:1px solid rgb(223,226,229)">METRIC</th><th style="box-sizing:border-box;padding:6px 13px;font-weight:600;border:1px solid rgb(223,226,229)">BAD</th><th style="box-sizing:border-box;padding:6px 13px;font-weight:600;border:1px solid rgb(223,226,229)">GOOD</th></tr></thead><tbody style="box-sizing:border-box"><tr 
style="box-sizing:border-box;background-color:rgb(255,255,255);border-top:1px solid rgb(198,203,209)"><td style="box-sizing:border-box;padding:6px 13px;border:1px solid rgb(223,226,229)"><a href="https://gist.github.com/MarcialRosales/226716f0cb9e27cd9ab02eac04702841#dstat" style="box-sizing:border-box;background-color:transparent;color:rgb(3,102,214);text-decoration:none" target="_blank">user cpu</a></td><td style="box-sizing:border-box;padding:6px 13px;border:1px solid rgb(223,226,229)">46% - 57%</td><td style="box-sizing:border-box;padding:6px 13px;border:1px solid rgb(223,226,229)">19% - 40%</td></tr><tr style="box-sizing:border-box;background-color:rgb(246,248,250);border-top:1px solid rgb(198,203,209)"><td style="box-sizing:border-box;padding:6px 13px;border:1px solid rgb(223,226,229)"><a href="https://gist.github.com/MarcialRosales/226716f0cb9e27cd9ab02eac04702841#dstat" style="box-sizing:border-box;background-color:transparent;color:rgb(3,102,214);text-decoration:none" target="_blank">system cpu</a></td><td style="box-sizing:border-box;padding:6px 13px;border:1px solid rgb(223,226,229)">20% - 37%</td><td style="box-sizing:border-box;padding:6px 13px;border:1px solid rgb(223,226,229)">1% - 10%</td></tr><tr style="box-sizing:border-box;background-color:rgb(255,255,255);border-top:1px solid rgb(198,203,209)"><td style="box-sizing:border-box;padding:6px 13px;border:1px solid rgb(223,226,229)"><a href="https://gist.github.com/MarcialRosales/226716f0cb9e27cd9ab02eac04702841#dstat" style="box-sizing:border-box;background-color:transparent;color:rgb(3,102,214);text-decoration:none" target="_blank">network traffic</a></td><td style="box-sizing:border-box;padding:6px 13px;border:1px solid rgb(223,226,229)">6M - 19M</td><td style="box-sizing:border-box;padding:6px 13px;border:1px solid rgb(223,226,229)">up to 8M</td></tr><tr style="box-sizing:border-box;background-color:rgb(246,248,250);border-top:1px solid rgb(198,203,209)"><td 
style="box-sizing:border-box;padding:6px 13px;border:1px solid rgb(223,226,229)"><a href="https://gist.github.com/MarcialRosales/226716f0cb9e27cd9ab02eac04702841#dstat" style="box-sizing:border-box;background-color:transparent;color:rgb(3,102,214);text-decoration:none" target="_blank">system interrupts</a></td><td style="box-sizing:border-box;padding:6px 13px;border:1px solid rgb(223,226,229)">120k - 196k</td><td style="box-sizing:border-box;padding:6px 13px;border:1px solid rgb(223,226,229)">10k - 20k</td></tr><tr style="box-sizing:border-box;background-color:rgb(255,255,255);border-top:1px solid rgb(198,203,209)"><td style="box-sizing:border-box;padding:6px 13px;border:1px solid rgb(223,226,229)"><a href="https://gist.github.com/MarcialRosales/226716f0cb9e27cd9ab02eac04702841#syscalls" style="box-sizing:border-box;background-color:transparent;color:rgb(3,102,214);text-decoration:none" target="_blank">syscalls</a></td><td style="box-sizing:border-box;padding:6px 13px;border:1px solid rgb(223,226,229)">1.6M - 2.1M</td><td style="box-sizing:border-box;padding:6px 13px;border:1px solid rgb(223,226,229)">49k - 110k</td></tr><tr style="box-sizing:border-box;background-color:rgb(246,248,250);border-top:1px solid rgb(198,203,209)"><td style="box-sizing:border-box;padding:6px 13px;border:1px solid rgb(223,226,229)"><a href="https://gist.github.com/MarcialRosales/226716f0cb9e27cd9ab02eac04702841#perf-stat" style="box-sizing:border-box;background-color:transparent;color:rgb(3,102,214);text-decoration:none" target="_blank">task-clock 10sec</a></td><td style="box-sizing:border-box;padding:6px 13px;border:1px solid rgb(223,226,229)">68255</td><td style="box-sizing:border-box;padding:6px 13px;border:1px solid rgb(223,226,229)">12324</td></tr><tr style="box-sizing:border-box;background-color:rgb(255,255,255);border-top:1px solid rgb(198,203,209)"><td style="box-sizing:border-box;padding:6px 13px;border:1px solid rgb(223,226,229)"><a 
href="https://gist.github.com/MarcialRosales/226716f0cb9e27cd9ab02eac04702841#perf_record_cpu_cycles" style="box-sizing:border-box;background-color:transparent;color:rgb(3,102,214);text-decoration:none" target="_blank">cpu profiling info</a></td></tr></tbody></table></span></font></div><div><br></div></div><div><p style="background-color:rgb(255,255,255);text-decoration-style:initial;text-decoration-color:initial;box-sizing:border-box;margin-top:0px;margin-bottom:16px;color:rgb(36,41,46);font-family:-apple-system,system-ui,"Segoe UI",Helvetica,Arial,sans-serif,"Apple Color Emoji","Segoe UI Emoji","Segoe UI Symbol";font-size:16px">We have gathered lots of metrics in an attempt to identify why the BAD cluster uses so much CPU. All the information can be found here: <a href="https://gist.github.com/MarcialRosales/226716f0cb9e27cd9ab02eac04702841" target="_blank">https://gist.github.com/MarcialRosales/226716f0cb9e27cd9ab02eac04702841</a>, along with the environment information.</p><p style="background-color:rgb(255,255,255);text-decoration-style:initial;text-decoration-color:initial;box-sizing:border-box;margin-top:0px;margin-bottom:16px;color:rgb(36,41,46);font-family:-apple-system,system-ui,"Segoe UI",Helvetica,Arial,sans-serif,"Apple Color Emoji","Segoe UI Emoji","Segoe UI Symbol";font-size:16px"><br></p><p style="background-color:rgb(255,255,255);text-decoration-style:initial;text-decoration-color:initial;box-sizing:border-box;margin-top:0px;margin-bottom:16px;color:rgb(36,41,46);font-family:-apple-system,system-ui,"Segoe UI",Helvetica,Arial,sans-serif,"Apple Color Emoji","Segoe UI Emoji","Segoe UI Symbol";font-size:16px">We would greatly appreciate any insights into what could be causing the issue, or suggestions for additional tools we could use.</p></div><div><span style="color:rgb(36,41,46);font-family:-apple-system,system-ui,"Segoe UI",Helvetica,Arial,sans-serif,"Apple Color Emoji","Segoe UI Emoji","Segoe UI 
Symbol";font-size:16px;background-color:rgb(255,255,255);text-decoration-style:initial;text-decoration-color:initial;float:none;display:inline">Many thanks</span></div><div> </div><div>-- <br><div dir="ltr" class="m_7907647030859876833m_-2256524670992938224gmail-m_6433092018425678905gmail_signature"><div dir="ltr"><div dir="ltr"><span style="color:rgb(80,0,80);font-size:12.8px">Marcial Rosales</span><div style="color:rgb(80,0,80);font-size:12.8px"><span style="font-size:12.8px">Pivotal, Inc.  EMEA</span><br></div></div><div dir="ltr"><div><br></div></div></div></div></div></div>
_______________________________________________<br>
erlang-questions mailing list<br>
<a href="mailto:erlang-questions@erlang.org" target="_blank">erlang-questions@erlang.org</a><br>
<a href="http://erlang.org/mailman/listinfo/erlang-questions" rel="noreferrer" target="_blank">http://erlang.org/mailman/listinfo/erlang-questions</a><br>
</blockquote></div><br clear="all"><div><br></div>-- <br><div dir="ltr" class="m_7907647030859876833gmail_signature" data-smartmail="gmail_signature"><div dir="ltr"><div><font face="'courier new', monospace">Danil Zagoskin | <a href="mailto:z@gosk.in" target="_blank">z@gosk.in</a></font></div></div></div>
</blockquote></div><br clear="all"><div><br></div>-- <br><div dir="ltr" class="gmail_signature" data-smartmail="gmail_signature"><div dir="ltr"><div><div dir="ltr"><span style="color:rgb(80,0,80);font-size:12.8000001907349px">Marcial Rosales</span><div style="color:rgb(80,0,80);font-size:12.8000001907349px"><span style="font-size:12.8000001907349px;color:rgb(34,34,34)">Advisory Solution Architect (</span><span style="font-size:12.8000001907349px">Customer Success Organization)</span></div><div style="color:rgb(80,0,80);font-size:12.8000001907349px">Pivotal, Inc.  EMEA</div></div><div dir="ltr"><br><div><br></div></div></div></div></div>