[erlang-questions] Identifying causes for high CPU usage

Thu Jun 23 12:35:19 CEST 2016

I'm running several ejabberd nodes in a cluster. Sadly, the cluster runs a
very old Erlang version (R13B04) on top of CentOS with 4 cores. I am using
5 async-threads with kernel poll enabled.

The old Erlang version might explain the problem I saw, but I believe the
question is general enough.

At one point the CPU usage of some of the nodes started climbing (from
about 20% to near 100%). This happened only once and has not recurred
since a full power-off and restart of the cluster.

Connecting to the nodes with a remote shell showed that almost everything
looked the same as on the unaffected nodes:

* cprof and eprof - showed the same usage patterns.
* Listing the top processes by reductions and message_queue_len (using
erlang:process_info on all processes) showed similar patterns.
* erlang:statistics - looked the same for wall_clock and reductions.
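
For reference, the per-process listing can be sketched roughly as below.
This is a minimal sketch, not the exact commands I ran: the module name
proc_top and the function top/2 are made up for illustration, and it
assumes that only numeric items such as reductions or message_queue_len
are passed in.

```erlang
-module(proc_top).
-export([top/2]).

%% Return the N processes with the largest value for Item
%% (for example reductions or message_queue_len), largest first.
%% proc_top and top/2 are hypothetical names used for illustration.
top(Item, N) when is_atom(Item), is_integer(N), N >= 0 ->
    %% process_info/2 returns undefined for a process that has already
    %% exited; the {_, Val} pattern silently skips those entries.
    Pairs = [{Pid, Val} || Pid <- erlang:processes(),
                           {_, Val} <- [erlang:process_info(Pid, Item)]],
    lists:sublist(lists:reverse(lists:keysort(2, Pairs)), N).
```

Calling proc_top:top(reductions, 10) and proc_top:top(message_queue_len, 10)
on an affected and an unaffected node gives two lists that can be compared
side by side.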

The only concrete differences between the affected and unaffected nodes
were:

* run_queue - affected nodes had a run queue of several dozen (fewer than
100), while unaffected nodes always had 0. Since run_queues is
undocumented (at least I didn't see it in the documentation), I didn't run
it at the time of the problem.
* runtime - on affected nodes the runtime advanced at about 150% of the
wall-clock time, while on unaffected nodes it advanced at about 20% of the
wall-clock time.
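
The runtime and run_queue comparison can be reproduced with something like
the sketch below. The module name cpu_sample and the function sample/1 are
hypothetical. As far as I understand the documentation,
erlang:statistics(runtime) sums CPU time over all threads of the emulator,
so a value above 100% of wall-clock time is possible on a multi-core
machine.

```erlang
-module(cpu_sample).
-export([sample/1]).

%% Sleep for IntervalMs milliseconds and report how much CPU time the VM
%% used relative to the elapsed wall-clock time, together with the run
%% queue length at the end of the interval.
%% cpu_sample and sample/1 are hypothetical names used for illustration.
sample(IntervalMs) when is_integer(IntervalMs), IntervalMs > 0 ->
    %% Use the running totals (first tuple element) rather than the
    %% "since last call" values, which are shared with any other caller.
    {Run0, _} = erlang:statistics(runtime),
    {Wall0, _} = erlang:statistics(wall_clock),
    timer:sleep(IntervalMs),
    {Run1, _} = erlang:statistics(runtime),
    {Wall1, _} = erlang:statistics(wall_clock),
    WallDelta = Wall1 - Wall0,
    Percent = case WallDelta of
                  0 -> undefined;
                  _ -> 100 * (Run1 - Run0) / WallDelta
              end,
    [{runtime_ms, Run1 - Run0},
     {wall_clock_ms, WallDelta},
     {runtime_percent_of_wall, Percent},
     {run_queue, erlang:statistics(run_queue)}].
```

Running cpu_sample:sample(10000) on each node gives one comparable sample
per node. A single sample can be noisy, so it is worth repeating a few
times.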

All of this made me suspect that some NIFs and/or ports were taking a long
time to complete, consuming CPU time without increasing reductions or call
counts.
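
One rough way to look at ports from the shell is to rank them by the
number of bytes they have moved, as in the sketch below. The names
port_top and busiest/1 are hypothetical. This only shows I/O volume per
port, not CPU time spent inside a driver, so it is at best a hint about
which ports are busy.

```erlang
-module(port_top).
-export([busiest/1]).

%% Return up to N ports as {TotalBytes, Port, Name} tuples, ordered by
%% bytes read from plus bytes written to the port, largest first.
%% port_top and busiest/1 are hypothetical names used for illustration.
busiest(N) when is_integer(N), N >= 0 ->
    %% port_info/2 returns undefined for a port that has been closed;
    %% the tuple patterns below skip such ports.
    Rows = [{In + Out, Port, Name}
            || Port <- erlang:ports(),
               {name, Name} <- [erlang:port_info(Port, name)],
               {input, In} <- [erlang:port_info(Port, input)],
               {output, Out} <- [erlang:port_info(Port, output)]],
    lists:sublist(lists:reverse(lists:sort(Rows)), N).
```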

Looking at the various profilers for Erlang, I couldn't find anything that
can profile NIFs or even reveal whether they do in fact take a long time
to finish.

Is there a better way to diagnose a high CPU usage issue?

Cheers,
Eli