[erlang-questions] How to pin-point High CPU utilization in Erlang VM

Thu Jul 19 11:50:57 CEST 2018

Hello,

On Wed, Jul 18, 2018 at 5:34 AM Marcial Rosales <mrosales@REDACTED> wrote:

> We are experiencing a very high cpu utilization in 3 clustered Erlang VMs
> running RabbitMQ. We have deployed another cluster in an attempt to
> reproduce the same behaviour without much success.
>
> Our goals are:
>
>    - Find out where the CPU is being utilized
>    - Choose the right tools to analyze CPU utilization
>
>
> Our observations so far:
>
>    - The *BAD* cluster observes a pretty excessive CPU utilization, both
>    user and system ones, and also network.
>    - The *BAD* cluster also observes a higher Erlang scheduler
>    utilization, specially on microstate emulator and other. We are yet to
>    understand what other could be. According to Erlang documentation is *unaccounted
>    things*.
>
> If you compile Erlang with "./configure
--with-microstate-accounting=extra" (as suggested by Danil), the other part
will be broken into more granular parts. However, looking at your perf
recordings I would guess that most of the other time is time spent spinning
before going to sleep.

>
>    - The *BAD* cluster observes a considerably higher number of system
>    calls which we are yet to identify (dunno how) why is that.
>
> Maybe use strace and then write some small script that counts the
different syscall made?

>
>    - The *BAD* cluster does not necessarily run higher number of
>    reductions. In fact, the *GOOD* cluster runs more reductions and yet
>    has a lower scheduler utilization.
>
> METRICBADGOOD
> user cpu
> <https://gist.github.com/MarcialRosales/226716f0cb9e27cd9ab02eac04702841#dstat> 46%
> - 57% 19% - 40%
> system cpu
> <https://gist.github.com/MarcialRosales/226716f0cb9e27cd9ab02eac04702841#dstat> 20%
> - 37% 1% - 10%
> network traffic
> <https://gist.github.com/MarcialRosales/226716f0cb9e27cd9ab02eac04702841#dstat> 6M
> - 19M up to 8M
> system interrupts
> <https://gist.github.com/MarcialRosales/226716f0cb9e27cd9ab02eac04702841#dstat> 120k
> - 196k 10k - 20k
> syscalls
> <https://gist.github.com/MarcialRosales/226716f0cb9e27cd9ab02eac04702841#syscalls> 1.6M
> - 2.1M 49k - 110k
> task-clock 10sec
> <https://gist.github.com/MarcialRosales/226716f0cb9e27cd9ab02eac04702841#perf-stat>
> 68255 12324
> cpu profiling info
> <https://gist.github.com/MarcialRosales/226716f0cb9e27cd9ab02eac04702841#perf_record_cpu_cycles>
>
> We have gathered lots of metrics in attempt to identify why the BAD
> cluster uses so much CPU. All the information can be found here
> https://gist.github.com/MarcialRosales/226716f0cb9e27cd9ab02eac04702841
> along with the environment information.
>
>
> In the BAD cluster, would it be possible to see the annotated versions of
process_main and scheduler_wait in the "perf record" run?
There doesn't seem to be any kernel symbols in the BAD cluster perf run, do
you know why?

> We appreciate a lot any insights as to what could be causing the issue
> and/or in relation to additional tools we could use.
>
My gut tells me that there is some syscall that is a lot slower on XEN than
it is in the KVM. In virtualized environments I always tend to suspect the
time source first. Different hypervisors have very different performance
for getting the time and the Erlang VM does a lot of time fetching.

Lukas
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://erlang.org/pipermail/erlang-questions/attachments/20180719/c7a280aa/attachment.htm>