<div dir="ltr"><div><div>Hi,<br><br>Try the lock counter (lcnt) during load testing:<br><a href="http://erlang.org/doc/apps/tools/lcnt_chapter.html">http://erlang.org/doc/apps/tools/lcnt_chapter.html</a><br></div><div>It is very unsafe to run it in production.<br></div><div><br></div><div>Check the locking time of the port drivers. Sometimes setting the ERL_DRV_FLAG_USE_PORT_LOCKING flag can help.<br><br></div>All long-running operations in ports cause the same scheduling problems you have heard about with NIFs,<br></div><div>unless these operations are async and the driver is well written.<br></div><div>Scheduling and locking problems cause random latency spikes in all time metrics, so check the 95th percentile of those metrics.<br></div><div><br></div><div>Regards,<br></div><div>Michael<br></div><div><br></div></div><div class="gmail_extra"><br><div class="gmail_quote">On 23 June 2016 at 12:35, Eli Iser <span dir="ltr">&lt;<a href="mailto:eli.iser@gmail.com" target="_blank">eli.iser@gmail.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><span style="font-size:12.8px">I'm running several ejabberd nodes in a cluster. Sadly, the cluster uses a very old Erlang version, R13B04, on CentOS with 4 cores. I am using 5 async threads with kernel poll enabled.</span><div style="font-size:12.8px"><br></div><div style="font-size:12.8px">The old Erlang version might explain the problem I saw, but I believe that the question is general enough.<div><br></div><div>At one point the CPU usage of some of the nodes started climbing (from about 20% to near 100%). 
This happened only once and has not recurred since the whole cluster was powered off and on again.</div><div><br></div><div>Connecting to the affected nodes with a remote shell showed that almost everything looked the same as on the unaffected nodes:</div><div><br></div><div>* cprof and eprof showed the same usage patterns.</div><div>* Listing the top processes by reductions and message_queue_len (via erlang:process_info on all processes) showed similar patterns.</div><div>* erlang:statistics looked the same for wall_clock and reductions.</div><div><br></div><div>The only concrete differences between the affected and unaffected nodes were:</div><div><br></div><div>* run_queue - affected nodes had a run queue of several dozen (fewer than 100), while unaffected nodes always had 0. Since run_queues is undocumented (at least I didn't see it in the documentation), I didn't run it at the time of the problem.</div><div>* runtime - on affected nodes the runtime advanced at about 150% of the wall time, while on unaffected nodes it advanced at about 20% of the wall time.</div><div><br></div><div>All of this made me suspect that some NIFs and/or ports were taking a long time to complete, consuming CPU time without increasing reductions or calls.</div><div><br></div><div>Looking at the various profilers for Erlang, I couldn't find anything that can profile NIFs or even reveal that they do take a long time to finish.</div><div><br></div><div>Is there a better way to diagnose a high CPU usage issue?</div><div><br></div></div><div style="font-size:12.8px">Cheers,</div><div style="font-size:12.8px">Eli</div></div>

<br>_______________________________________________<br>

erlang-questions mailing list<br>

<a href="mailto:erlang-questions@erlang.org">erlang-questions@erlang.org</a><br>

<a href="http://erlang.org/mailman/listinfo/erlang-questions" rel="noreferrer" target="_blank">http://erlang.org/mailman/listinfo/erlang-questions</a><br>

<br></blockquote></div><br><br clear="all"><br>-- <br><div class="gmail_signature" data-smartmail="gmail_signature">Best regards,<br>Uvarov Michael</div>

</div>
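<div><br></div><div>A minimal sketch of the checks discussed in this thread, not a definitive tool. The module name diag_sketch and its three function names are made up for illustration. The lcnt calls assume the tools application is available and that the emulator was built with lock counting enabled; on a regular emulator they are expected to fail. The nearest-rank percentile is just one common definition, chosen here for simplicity.<br></div>

```erlang
-module(diag_sketch).
-export([percentile/2, load_sample/1, lock_profile/1]).

%% Nearest-rank percentile of a non-empty list of numbers.
%% P is an integer in 1..100, e.g. 95 for the 95th percentile.
percentile(Samples, P)
  when is_list(Samples), Samples =/= [], is_integer(P), P > 0, P =< 100 ->
    Sorted = lists:sort(Samples),
    %% Integer ceiling of P * N / 100 gives the 1-based rank.
    Rank = (P * length(Sorted) + 99) div 100,
    lists:nth(Rank, Sorted).

%% Samples the run queue length and the CPU-time / wall-time ratio
%% over IntervalMs milliseconds. The "since last call" counters are
%% node-wide, so another caller of erlang:statistics/1 can skew them.
load_sample(IntervalMs) ->
    _ = erlang:statistics(runtime),      % reset the since-last-call values
    _ = erlang:statistics(wall_clock),
    timer:sleep(IntervalMs),
    {_, CpuMs} = erlang:statistics(runtime),
    {_, WallMs} = erlang:statistics(wall_clock),
    [{run_queue, erlang:statistics(run_queue)},
     {cpu_per_wall, CpuMs / erlang:max(WallMs, 1)}].

%% Runs Workload (a fun of arity 0) between lcnt:clear/0 and
%% lcnt:collect/0, then prints the most contended locks.
%% Assumption: lock-counting emulator, load-test environment only.
lock_profile(Workload) when is_function(Workload, 0) ->
    case lcnt:start() of
        {ok, _} -> ok;
        {error, {already_started, _}} -> ok
    end,
    lcnt:clear(),
    Result = Workload(),
    lcnt:collect(),
    lcnt:conflicts(),
    lcnt:stop(),
    Result.
```

<div>For example, diag_sketch:load_sample(1000) on an affected node would be expected to show a non-zero run_queue and a cpu_per_wall value well above the unaffected nodes (the question reports roughly 1.5 versus 0.2), and diag_sketch:percentile(Latencies, 95) gives the figure to compare across nodes.<br></div>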