[erlang-questions] Identifying causes for high CPU usage

Michael Uvarov freeakk@REDACTED
Thu Jun 23 14:25:52 CEST 2016


Hi,

Try the lock counter (lcnt) during load testing:
http://erlang.org/doc/apps/tools/lcnt_chapter.html
It is very unsafe to run in production.
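
A minimal sketch of a lock-counting run around a load test, assuming the emulator was started with lock counting enabled (a build configured with --enable-lock-counter on older releases, or `erl -emu_type lcnt` on OTP 20 and later). The module name lcnt_sketch and the LoadFun argument are hypothetical; only the lcnt calls come from the tools application.

```erlang
-module(lcnt_sketch).
-export([profile/1]).

%% Run LoadFun (a hypothetical zero-arity fun that drives the load test)
%% with fresh lock counters, then print the most contended locks.
%% Assumption: the emulator supports lock counting; on an ordinary
%% emulator the lcnt calls below are expected to fail.
profile(LoadFun) when is_function(LoadFun, 0) ->
    %% Start the lcnt server unless it is already running.
    case lcnt:start() of
        {ok, _Pid} -> ok;
        {error, {already_started, _Pid}} -> ok
    end,
    %% Reset the counters so that only the load test is measured.
    lcnt:clear(),
    Result = LoadFun(),
    %% Fetch the counters from the runtime system into the lcnt server.
    lcnt:collect(),
    %% Print the ten locks with the highest accumulated waiting time.
    lcnt:conflicts([{sort, time}, {max_locks, 10}]),
    Result.
```

Usage would look like `lcnt_sketch:profile(fun() -> my_load:run() end).`, where my_load:run/0 stands in for whatever drives the test. Locks that top the list with a high collision ratio are the ones worth inspecting further with lcnt:inspect/1.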

Check the locking time of the port drivers. Sometimes setting the
ERL_DRV_FLAG_USE_PORT_LOCKING flag can help.

Long-running operations in ports cause the same scheduling problems you have
heard about with NIFs, unless these operations are async and the driver is
well written.
Scheduling and locking problems cause random latency spikes in all timing
metrics, so check the 95th percentile of those metrics.
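
To illustrate why a high percentile exposes such spikes while a median hides them, here is a small sketch of a nearest-rank percentile over a list of latency samples. The module name latency_stats and the sample values are made up for the example; a real system would more likely take the figure from its metrics library.

```erlang
-module(latency_stats).
-export([percentile/2]).

%% Nearest-rank percentile: the smallest sample such that at least
%% P percent of all samples are less than or equal to it.
%% P is an integer in 1..100; Samples is a non-empty list of numbers.
percentile(P, Samples)
  when is_integer(P), P >= 1, P =< 100, Samples =/= [] ->
    Sorted = lists:sort(Samples),
    N = length(Sorted),
    %% Integer ceiling of P * N / 100; always within 1..N here.
    Rank = (P * N + 99) div 100,
    lists:nth(Rank, Sorted).
```

With eighteen samples of 10 ms and two spikes of 500 ms, `latency_stats:percentile(50, Samples)` returns 10 while `latency_stats:percentile(95, Samples)` returns 500, so the spikes are visible only in the upper percentile.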

Regards,
Michael


On 23 June 2016 at 12:35, Eli Iser <eli.iser@REDACTED> wrote:

> I'm running several ejabberd nodes in a cluster. Sadly, it uses a very old
> Erlang version - R13B04, on top of CentOS with 4 cores. I am using 5
> async-threads with kernel poll enabled.
>
> The old Erlang version might perhaps explain the problem I saw, but I
> believe that the question is general enough.
>
> At one time the CPU usage of some of the nodes started climbing (from
> about 20% to near 100%). This happened only once and doesn't happen anymore
> (after a full cluster power off and power on again).
>
> Connecting to the nodes with a remote shell showed that almost everything
> was the same as on the unaffected nodes:
>
> * cprof and eprof - showed the same usage patterns.
> * Listing the top processes by reductions and message_queue_len (via
> erlang:process_info on all processes) showed similar patterns.
> * erlang:statistics - looked the same for wall_clock and reductions.
>
> The only concrete differences between the affected and unaffected nodes
> were:
>
> * run_queue - affected nodes had a run queue of several dozens (less than
> 100), while un-affected nodes had 0 (always). Since run_queues is
> undocumented (at least I didn't see it in the documentation), I didn't run
> it at the time of the problem.
> * runtime - on affected nodes the runtime advances at about 150% of wall
> time, while on unaffected nodes it advances at about 20% of wall time.
>
> All of this made me suspect some NIFs and/or ports taking a long time to
> complete, taking CPU time but not increasing reductions and calls.
>
> Looking at the various profilers for Erlang, I couldn't find anything that
> can profile NIFs or even reveal that they indeed do take a long time to
> finish.
>
> Is there a better way to diagnose a high CPU usage issue?
>
> Cheers,
> Eli


-- 
Best regards,
Uvarov Michael