[erlang-questions] How to pin-point High CPU utilization in Erlang VM

Marcial Rosales mrosales@REDACTED
Wed Jul 25 10:18:35 CEST 2018


 Hi Lukas, sorry for responding this late. I don't yet have all the
information, but I wanted to send you a reply.
 See my replies in-line:

On Thu, Jul 19, 2018 at 11:51 AM Lukas Larsson <lukas@REDACTED> wrote:

> Hello,
>
> On Wed, Jul 18, 2018 at 5:34 AM Marcial Rosales <mrosales@REDACTED>
> wrote:
>
>> We are experiencing very high CPU utilization in 3 clustered Erlang VMs
>> running RabbitMQ. We have deployed another cluster in an attempt to
>> reproduce the same behaviour, without much success.
>>
>> Our goals are:
>>
>>    - Find out where the CPU is being utilized
>>    - Choose the right tools to analyze CPU utilization
>>
>>
>> Our observations so far:
>>
>>    - The *BAD* cluster shows very high CPU utilization, both user and
>>    system, as well as high network traffic.
>>    - The *BAD* cluster also shows higher Erlang scheduler utilization,
>>    especially in the microstate categories emulator and other. We have
>>    yet to understand what other covers; according to the Erlang
>>    documentation it is *unaccounted things*.
>>
> If you compile Erlang with "./configure
> --with-microstate-accounting=extra" (as suggested by Danil), the "other"
> part will be broken down into more granular categories. However, looking at
> your perf recordings, I would guess that most of the "other" time is time
> spent spinning before going to sleep.
>
>
MR > Thanks! We are going to recompile Erlang with lock counting and extra
microstate-accounting. Hopefully we can get more visibility the next time
we get a cluster in that bad state.
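
For reference, once a node is running the rebuilt emulator, both facilities
can be driven from the Erlang shell. A minimal sketch, assuming runtime_tools
and tools are available and the emulator was built with lock counting:

    %% Microstate accounting (msacc ships with runtime_tools):
    msacc:start(1000).             %% gather data for 1000 ms
    msacc:print().                 %% per-thread breakdown, incl. "other"
    msacc:to_file("msacc.stats").  %% keep the raw data for later analysis

    %% Lock counting (lcnt ships with tools; needs an lcnt-enabled emulator):
    lcnt:start().
    lcnt:clear().                  %% reset the counters
    timer:sleep(10000).            %% let the workload run for 10 seconds
    lcnt:collect().                %% pull the counters from the runtime
    lcnt:conflicts().              %% print the most contended locks

If most of the "other" time really is spin-before-sleep, as Lukas suspects,
lowering the scheduler busy-wait threshold (e.g. starting the node with
"erl +sbwt none") would be one way to test that hypothesis.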


>>    - The *BAD* cluster shows a considerably higher number of system
>>    calls; we have yet to identify why that is (or how to find out).
>>
> Maybe use strace and then write a small script that counts the
> different syscalls made?
>

MR > strace reports that 97% of the calls are `futex`, in case that
helps.
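
For completeness: "strace -c -p <pid>" prints exactly such a per-syscall
count summary on its own. The dominance of futex usually points at contended
pthread mutexes inside the VM, which the lock-counting build above should be
able to attribute to specific locks. As for the small counting script Lukas
suggested, a sketch along these lines could work (count_syscalls.escript is a
hypothetical name; it assumes plain "strace -p <pid> -o trace.out" output
where each line starts with the syscall name):

    #!/usr/bin/env escript
    %% count_syscalls.escript: tally syscall names in an strace output
    %% file; lines look like "futex(0x7f..., FUTEX_WAIT, ...) = 0".
    main([File]) ->
        {ok, Bin} = file:read_file(File),
        Lines = binary:split(Bin, <<"\n">>, [global]),
        Counts = lists:foldl(fun count_line/2, #{}, Lines),
        Sorted = lists:reverse(lists:keysort(2, maps:to_list(Counts))),
        [io:format("~10w ~s~n", [N, Name]) || {Name, N} <- Sorted],
        ok.

    %% Take everything before the first '(' as the syscall name; lines
    %% without a '(' (signals, exit status) are ignored.
    count_line(Line, Acc) ->
        case binary:split(Line, <<"(">>) of
            [Name, _Rest] -> maps:update_with(Name, fun(N) -> N + 1 end, 1, Acc);
            _ -> Acc
        end.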


>
>>
>>    - The *BAD* cluster does not necessarily run a higher number of
>>    reductions. In fact, the *GOOD* cluster runs more reductions and yet
>>    has lower scheduler utilization.
>>
>> METRIC               BAD            GOOD
>> user cpu             46% - 57%      19% - 40%
>> system cpu           20% - 37%      1% - 10%
>> network traffic      6M - 19M       up to 8M
>> system interrupts    120k - 196k    10k - 20k
>> syscalls             1.6M - 2.1M    49k - 110k
>> task-clock (10 sec)  68255          12324
>>
>> (Per-metric sources: the dstat, syscalls, perf-stat and
>> perf_record_cpu_cycles sections of the gist linked below, which also
>> holds the cpu profiling info.)
>>
>> We have gathered lots of metrics in an attempt to identify why the BAD
>> cluster uses so much CPU. All the information can be found here
>> https://gist.github.com/MarcialRosales/226716f0cb9e27cd9ab02eac04702841
>> along with the environment information.
>>
>>
> In the BAD cluster, would it be possible to see the annotated versions of
> process_main and scheduler_wait in the "perf record" run?
>
> There don't seem to be any kernel symbols in the BAD cluster perf run; do
> you know why?
>

MR> I don't really know; we are running the same OS image on both
clusters. Any idea what could cause that?
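
One common culprit worth checking (an assumption, not a confirmed diagnosis)
is the kernel hiding symbol addresses from perf via the kernel.kptr_restrict
and kernel.perf_event_paranoid sysctls, which can differ between nodes even
on the same image. A quick comparison from the Erlang shell on each node:

    %% kptr_restrict: non-zero values can hide kernel symbols from
    %% (non-root) perf; perf_event_paranoid >= 2 stops unprivileged
    %% users from profiling the kernel at all.
    io:format("kptr_restrict: ~s", [os:cmd("cat /proc/sys/kernel/kptr_restrict")]).
    io:format("perf_event_paranoid: ~s", [os:cmd("cat /proc/sys/kernel/perf_event_paranoid")]).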


>
>> We would greatly appreciate any insights into what could be causing the
>> issue, as well as suggestions for additional tools we could use.
>>
> My gut tells me that there is some syscall that is a lot slower on XEN
> than it is on KVM. In virtualized environments I always tend to suspect
> the time source first. Different hypervisors have very different
> performance for getting the time, and the Erlang VM does a lot of time
> fetching.
>
MR> We are in the process of provisioning a cluster in AWS, in an attempt
to rule out or confirm the hypervisor as the cause.
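
For reference, the VM can report which OS time source it picked, which makes
the two environments easy to compare; these are documented
erlang:system_info/1 items:

    %% How the VM reads monotonic and system time (clock function,
    %% clock_id, resolution, whether reads can run in parallel, ...):
    erlang:system_info(os_monotonic_time_source).
    erlang:system_info(os_system_time_source).

On Linux guests it is also worth comparing
/sys/devices/system/clocksource/clocksource0/current_clocksource: a "xen"
clocksource typically forces clock_gettime(2) to be a real syscall, while
"tsc" or "kvm-clock" can be served from the vDSO.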


> Lukas
>


-- 
Marcial Rosales
Advisory Solution Architect (Customer Success Organization)
Pivotal, Inc.  EMEA

