[erlang-questions] Identifying causes for high CPU usage

Michael Uvarov freeakk@REDACTED
Thu Jun 23 14:25:52 CEST 2016


Try the lock counter during load testing.
It is very unsafe to run it in production.
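
A minimal sketch of such a lock-counter session, using the lcnt module from the tools application. It assumes the load-test node runs an emulator built with lock counting enabled (a regular build will not give lock data), and the load-generation step in the middle is a placeholder for whatever test you run.

```erlang
%% Type these in the Erlang shell of a load-test node, never in production.
%% Assumption: the emulator was built with lock counting enabled.
lcnt:start(),      % start the lcnt server that stores collected data
lcnt:clear(),      % reset the lock counters in the runtime system
%% ... drive the load test here (placeholder) ...
lcnt:collect(),    % copy lock statistics from the runtime into the server
lcnt:conflicts().  % print the locks, most contended first
```

Locks that belong to port drivers and show a lot of collisions or waiting time are the first ones to look at.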

Check the locking time of the port drivers. Sometimes setting

All long operations in ports cause the same scheduling problems you have
heard about with NIFs, unless these operations are async and the driver is
well written.
Scheduling/locking problems cause random latency spikes in all time
metrics, so check the 95th percentile of those metrics.
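
To illustrate the percentile check, here is a small sketch that computes a nearest-rank percentile over a list of latency samples. The module name latency_pct and the sample values are made up for the example; how you collect the timings is up to your own metrics code.

```erlang
-module(latency_pct).
-export([percentile/2]).

%% Nearest-rank percentile of a non-empty list of numbers.
%% P is an integer percentile in 1..100.
percentile(Samples, P) when is_list(Samples), Samples =/= [],
                            is_integer(P), P >= 1, P =< 100 ->
    Sorted = lists:sort(Samples),
    N = length(Sorted),
    %% Integer ceiling of P * N / 100, so the rank is always in 1..N.
    Rank = (P * N + 99) div 100,
    lists:nth(Rank, Sorted).
```

With the samples [12, 15, 11, 14, 13, 250, 12, 13, 14, 12] (milliseconds, say), latency_pct:percentile(Samples, 50) gives 13 while latency_pct:percentile(Samples, 95) gives 250. The median hides the spike and the high percentile shows it.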


On 23 June 2016 at 12:35, Eli Iser <eli.iser@REDACTED> wrote:

> I'm running several ejabberd nodes in a cluster. Sadly, it uses a very old
> Erlang version - R13B04, on top of CentOS with 4 cores. I am using 5
> async-threads with kernel poll enabled.
> The old Erlang version might perhaps explain the problem I saw, but I
> believe that the question is general enough.
> At one time the CPU usage of some of the nodes started climbing (from
> about 20% to near 100%). This happened only once and doesn't happen anymore
> (after a full cluster power off and power on again).
> Connecting to the nodes with a remote shell showed almost everything was
> the same as on the nodes that were un-affected:
> * cprof and eprof - showed the same usage patterns.
> * Listing top most erlang:process_info on all processes for reductions and
> message_queue_len showed similar patterns.
> * erlang:statistics - looked the same for wall_clock and reductions.
> The only concrete differences between the affected and un-affected nodes
> were:
> * run_queue - affected nodes had a run queue of several dozens (less than
> 100), while un-affected nodes had 0 (always). Since run_queues is
> undocumented (at least I didn't see it in the documentation), I didn't run
> it at the time of the problem.
> * runtime - affected nodes progress the runtime at about 150% of the wall
> time, while un-affected nodes progress the runtime at about 20% of the wall
> time.
> All of this made me suspect some NIFs and/or ports taking a long time to
> complete, taking CPU time but not increasing reductions and calls.
> Looking at the various profilers for Erlang, I couldn't find anything that
> can profile NIFs or even reveal that they indeed do take a long time to
> finish.
> Is there a better way to diagnose a high CPU usage issue?
> Cheers,
> Eli

Best regards,
Uvarov Michael