[erlang-questions] Identifying causes for high CPU usage
Thu Jun 23 14:25:52 CEST 2016
Try the lock counter (lcnt) during load testing.
It is very unsafe to run it in production.
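A rough sketch of such a lock-counting run, assuming "lock counter" means the lcnt tool from the tools application and that the test node runs an emulator built with lock counting enabled. The module name lock_profile and the LoadFun argument are made up for illustration; only the lcnt calls are real.

```erlang
-module(lock_profile).
-export([run/1]).

%% Run LoadFun (a fun of arity 0 that drives the load test) and print
%% the most contended locks afterwards. Only meaningful on an emulator
%% built with lock counting, and only on a test node.
run(LoadFun) when is_function(LoadFun, 0) ->
    lcnt:start(),      % start the lcnt server
    lcnt:clear(),      % reset the counters so old activity is ignored
    LoadFun(),         % hypothetical load generator supplied by the caller
    lcnt:collect(),    % fetch lock statistics from the runtime system
    lcnt:conflicts(),  % print locks sorted by contention
    lcnt:stop().
```

Locks that belong to ports or drivers near the top of the conflicts list would point in the direction of the port driver advice in this message.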
Check the locking time of the port drivers. Sometimes setting the
ERL_DRV_FLAG_USE_PORT_LOCKING flag can help.
All long operations in ports cause the same scheduling problems you have
heard about, unless these operations are async and the driver is well written.
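One way to catch such long operations on a newer release is the long_schedule option of erlang:system_monitor/2. This is only a sketch: the option does not exist on releases as old as R13B04, the module name sched_watch is made up, and installing a system monitor replaces any monitor the application has already set.

```erlang
-module(sched_watch).
-export([start/1, stop/1]).

%% Log every process or port that occupies a scheduler for more than
%% ThresholdMs milliseconds in one go.
%% Note: this replaces any system monitor that is already installed.
start(ThresholdMs) when is_integer(ThresholdMs), ThresholdMs > 0 ->
    Pid = spawn(fun loop/0),
    erlang:system_monitor(Pid, [{long_schedule, ThresholdMs}]),
    Pid.

%% Clear the system monitor settings and stop the logging process.
stop(Pid) when is_pid(Pid) ->
    erlang:system_monitor(undefined),
    exit(Pid, kill),
    ok.

loop() ->
    receive
        {monitor, PidOrPort, long_schedule, Info} ->
            io:format("long schedule: ~p ~p~n", [PidOrPort, Info]),
            loop();
        _Other ->
            loop()
    end.
```

A port that shows up here repeatedly is a candidate for the kind of driver problem described above.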
Scheduling/locking problems cause random latency spikes in all time
metrics. So check the 95th percentile for this.
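For the percentile check, a minimal sketch using the nearest-rank method. The module name latency_stats is made up, and the samples are assumed to be latencies you have already collected yourself (for example in milliseconds).

```erlang
-module(latency_stats).
-export([percentile/2]).

%% Nearest-rank percentile: the smallest sample such that at least
%% P percent of all samples are less than or equal to it.
%% Samples is a non-empty list of numbers, P is an integer in 1..100.
percentile(Samples, P)
  when is_list(Samples), Samples =/= [], is_integer(P), P > 0, P =< 100 ->
    Sorted = lists:sort(Samples),
    N = length(Sorted),
    %% Integer ceiling of P * N / 100, written with div so that it
    %% does not depend on newer BIFs.
    Rank = (P * N + 99) div 100,
    lists:nth(Rank, Sorted).
```

With the samples [12,11,13,12,250,11,12,13,12,11], percentile(Samples, 50) gives 12 while percentile(Samples, 95) gives 250. The median hides the spike and the 95th percentile exposes it, which is why averages and medians are not enough here.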
On 23 June 2016 at 12:35, Eli Iser <eli.iser@REDACTED> wrote:
> I'm running several ejabberd nodes in a cluster. Sadly, it uses a very old
> Erlang version - R13B04, on top of CentOS with 4 cores. I am using 5
> async-threads with kernel poll enabled.
> The old Erlang version might perhaps explain the problem I saw, but I
> believe that the question is general enough.
> At one time the CPU usage of some of the nodes started climbing (from
> about 20% to near 100%). This happened only once and doesn't happen anymore
> (after a full cluster power off and power on again).
> Connecting to the nodes with a remote shell showed almost everything was
> the same as on the nodes that were un-affected:
> * cprof and eprof - showed the same usage patterns.
> * Listing the top processes by erlang:process_info reductions and
> message_queue_len showed similar patterns.
> * erlang:statistics - looked the same for wall_clock and reductions.
> The only concrete differences between the affected and un-affected nodes
> were:
> * run_queue - affected nodes had a run queue of several dozen (less than
> 100), while un-affected nodes had 0 (always). Since run_queues is
> undocumented (at least I didn't see it in the documentation), I didn't run
> it at the time of the problem.
> * runtime - affected nodes progress the runtime at about 150% of the wall
> time, while un-affected nodes progress the runtime at about 20% of the wall
> time.
> All of this made me suspect some NIFs and/or ports taking a long time to
> complete, taking CPU time but not increasing reductions and calls.
> Looking at the various profilers for Erlang, I couldn't find anything that
> can profile NIFs or even reveal that they indeed do take a long time to
> complete.
> Is there a better way to diagnose a high CPU usage issue?
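The runtime versus wall clock comparison from the quoted message can be repeated with a small helper like the one below. It is only a sketch: the module name cpu_sample is made up, and it relies on erlang:statistics(runtime) being the CPU time summed over all emulator threads, so the ratio can exceed 1.0 on a multi-core machine.

```erlang
-module(cpu_sample).
-export([sample/1]).

%% Sleep for IntervalMs milliseconds, then return
%% {RuntimePerWallClock, RunQueueLength}.
%% A ratio of 1.5 corresponds to "runtime at about 150% of wall time".
sample(IntervalMs) when is_integer(IntervalMs), IntervalMs > 0 ->
    {Run0, _} = erlang:statistics(runtime),
    {Wall0, _} = erlang:statistics(wall_clock),
    timer:sleep(IntervalMs),
    {Run1, _} = erlang:statistics(runtime),
    {Wall1, _} = erlang:statistics(wall_clock),
    %% Guard against a zero wall-clock delta on very short intervals.
    WallDelta = erlang:max(1, Wall1 - Wall0),
    Ratio = (Run1 - Run0) / WallDelta,
    {Ratio, erlang:statistics(run_queue)}.
```

Calling cpu_sample:sample(5000) from a remote shell on an affected and on an un-affected node gives the two numbers side by side. A high ratio together with a non-empty run queue while reductions stay flat fits the suspicion of time spent in NIFs or port drivers.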