[erlang-questions] How to pin-point High CPU utilization in Erlang VM

Wed Jul 18 11:28:10 CEST 2018

Hi!

I know that feeling when the load is very high and you don't know why.

More things to look at:
* extended msacc: configure OTP with --with-microstate-accounting=extra
* LCNT (needs an emulator built with lock counting): lcnt:apply(timer,
sleep, [5000]), lcnt:conflicts(), lcnt:inspect/1, etc.
* Check if there are some processes spending too much time running. You may
find this tool useful:
https://gist.github.com/stolen/9a28ed9403c724541b0ee5fcd822613e
* Network buffers. Check if rmem/wmem in sysctl are the same on both
clusters. Also check the network interfaces: MTU, drops, etc.
* NUMA and scheduler bindings. Try running the whole application on a single
NUMA node to avoid the interconnect cost.
* Other processes on the host. Once we saw a malware miner that hid itself
from ps and was active only when the server was busy doing its main job.
Use perf top on every CPU core to detect this.
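
For the msacc item, the quickest route is msacc:start(10000), msacc:print().
from runtime_tools. If you would rather get the numbers as data, here is a
minimal sketch that collects microstate accounting for a while and reports
how the time splits between states, summed over all threads. The module
name msacc_summary and its functions sample/1 and share/1 are made up for
this example; only the erlang:system_flag/2 and erlang:statistics/1 calls
are the real API.

```erlang
-module(msacc_summary).
-export([sample/1, share/1]).

%% Collect microstate accounting data for IntervalMs milliseconds.
%% Returns the raw per-thread list from erlang:statistics/1.
sample(IntervalMs) ->
    erlang:system_flag(microstate_accounting, reset),
    erlang:system_flag(microstate_accounting, true),
    timer:sleep(IntervalMs),
    erlang:system_flag(microstate_accounting, false),
    erlang:statistics(microstate_accounting).

%% Sum the counters of every thread per state and return
%% [{State, Percent}] sorted with the largest share first.
%% Note: this mixes all thread types (schedulers, async, aux, ...),
%% so it is coarser than what msacc:print() shows.
share(Threads) ->
    Sum = lists:foldl(
            fun(#{counters := Counters}, Acc) ->
                    maps:fold(
                      fun(State, Ticks, A) ->
                              maps:update_with(State,
                                               fun(T) -> T + Ticks end,
                                               Ticks, A)
                      end, Acc, Counters)
            end, #{}, Threads),
    Total = lists:sum(maps:values(Sum)),
    Shares = [{State, 100 * Ticks / max(Total, 1)}
              || {State, Ticks} <- maps:to_list(Sum)],
    lists:reverse(lists:keysort(2, Shares)).
```

Usage from the shell: msacc_summary:share(msacc_summary:sample(10000)).
With the extra states compiled in, the "other" bucket should shrink because
more of the work gets its own state.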

On Wed, Jul 18, 2018 at 6:34 AM Marcial Rosales <mrosales@REDACTED> wrote:

> We are experiencing a very high cpu utilization in 3 clustered Erlang VMs
> running RabbitMQ. We have deployed another cluster in an attempt to
> reproduce the same behaviour without much success.
>
> Our goals are:
>
>    - Find out where the CPU is being utilized
>    - Choose the right tools to analyze CPU utilization
>
>
> Our observations so far:
>
>    - The *BAD* cluster shows excessive CPU utilization, both user and
>    system, and also more network traffic.
>    - The *BAD* cluster also shows a higher Erlang scheduler
>    utilization, especially in the microstates emulator and other. We have
>    yet to understand what other could be. According to the Erlang
>    documentation it is *unaccounted things*.
>    - The *BAD* cluster makes a considerably higher number of system
>    calls, and we have yet to identify why (we don't know how).
>    - The *BAD* cluster does not necessarily run a higher number of
>    reductions. In fact, the *GOOD* cluster runs more reductions and yet
>    has a lower scheduler utilization.
>
> METRIC                   BAD            GOOD
> user cpu [1]             46% - 57%      19% - 40%
> system cpu [1]           20% - 37%      1% - 10%
> network traffic [1]      6M - 19M       up to 8M
> system interrupts [1]    120k - 196k    10k - 20k
> syscalls [2]             1.6M - 2.1M    49k - 110k
> task-clock 10sec [3]     68255          12324
> cpu profiling info [4]
>
> [1] <https://gist.github.com/MarcialRosales/226716f0cb9e27cd9ab02eac04702841#dstat>
> [2] <https://gist.github.com/MarcialRosales/226716f0cb9e27cd9ab02eac04702841#syscalls>
> [3] <https://gist.github.com/MarcialRosales/226716f0cb9e27cd9ab02eac04702841#perf-stat>
> [4] <https://gist.github.com/MarcialRosales/226716f0cb9e27cd9ab02eac04702841#perf_record_cpu_cycles>
>
> We have gathered lots of metrics in an attempt to identify why the BAD
> cluster uses so much CPU. All the information can be found here
> https://gist.github.com/MarcialRosales/226716f0cb9e27cd9ab02eac04702841
> along with the environment information.
>
>
> We would greatly appreciate any insights into what could be causing the
> issue, or pointers to additional tools we could use.
> Many thanks
>
> --
> Marcial Rosales
> Pivotal, Inc.  EMEA

-- 
Danil Zagoskin | z@REDACTED