[erlang-questions] How to pin-point High CPU utilization in Erlang VM

Wed Jul 25 13:18:06 CEST 2018

We've had issues recently with Xen VMs on AWS being much slower than usual.
They weren't even running Erlang.

On Wed, Jul 25, 2018, 4:18 AM Marcial Rosales <mrosales@REDACTED> wrote:

>  Hi Lukas, sorry for the late reply. I don't have all the information
> yet, but I owe you a reply.
>  See my replies in-line:
>
> On Thu, Jul 19, 2018 at 11:51 AM Lukas Larsson <lukas@REDACTED> wrote:
>
>> Hello,
>>
>> On Wed, Jul 18, 2018 at 5:34 AM Marcial Rosales <mrosales@REDACTED>
>> wrote:
>>
>>> We are experiencing very high CPU utilization in 3 clustered Erlang
>>> VMs running RabbitMQ. We have deployed another cluster in an attempt to
>>> reproduce the same behaviour without much success.
>>>
>>> Our goals are:
>>>
>>>    - Find out where the CPU is being utilized
>>>    - Choose the right tools to analyze CPU utilization
>>>
>>>
>>> Our observations so far:
>>>
>>>    - The *BAD* cluster shows excessive CPU utilization, both user and
>>>    system, as well as higher network traffic.
>>>    - The *BAD* cluster also shows higher Erlang scheduler utilization,
>>>    especially in the microstate accounting states emulator and other. We
>>>    have yet to understand what other could be. According to the Erlang
>>>    documentation it covers *unaccounted things*.
>>>
>> If you compile Erlang with "./configure
>> --with-microstate-accounting=extra" (as suggested by Danil), the other
>> part will be broken into more granular parts. However, looking at your perf
>> recordings I would guess that most of the other time is time spent spinning
>> before going to sleep.
>>
>>
> MR > Thanks! We are going to recompile Erlang with lock counting and extra
> microstate-accounting. Hopefully we can get more visibility the next time
> we get a cluster in that bad state.
>
>
>>>    - The *BAD* cluster makes a considerably higher number of system
>>>    calls, and we have yet to identify why (we don't know how).
>>>
>> Maybe use strace and then write a small script that counts the
>> different syscalls made?
>>
>
> MR >  strace reports that 97% of the calls are `futex`, in case that
> helps.
>
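The syscall counting suggested above could look roughly like the sketch below. It assumes the trace was saved as plain text, for example with `strace -f -o trace.txt -p <pid>`; the file name and the helper names `count_syscalls` and `format_report` are made up for illustration. Note that `strace -c` already prints a per-syscall summary, so a script like this is mainly useful when the raw trace is wanted for other reasons as well.

```python
import re
import sys
from collections import Counter

# One strace line per syscall, optionally prefixed by a pid ("[pid 42] " or
# "42 ") and by a timestamp ("12:00:01.000123 " or "1532512686.123456 ").
SYSCALL_RE = re.compile(
    r"^(?:\[pid\s+\d+\]\s+|\d+\s+)?"
    r"(?:\d+:\d+:\d+(?:\.\d+)?\s+|\d+\.\d+\s+)?"
    r"([a-z_][a-z0-9_]*)\("
)


def count_syscalls(lines):
    """Count syscalls by name in an iterable of strace output lines.

    "<... name resumed>" continuations, signal lines ("--- SIG... ---") and
    exit lines ("+++ exited ... +++") do not match and are skipped, so an
    interrupted call is counted once, on its "unfinished" line.
    """
    counts = Counter()
    for line in lines:
        match = SYSCALL_RE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def format_report(counts):
    """Return one line per syscall: name, count, share of all calls."""
    total = sum(counts.values())
    rows = []
    for name, n in counts.most_common():
        rows.append(f"{name:<20} {n:>10} {100.0 * n / total:6.1f}%")
    return "\n".join(rows)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (hypothetical file name): python count_syscalls.py trace.txt
    with open(sys.argv[1]) as fh:
        print(format_report(count_syscalls(fh)))
```

Sorting by count puts the dominant call first, which is where a skew such as the `futex` share reported above would show up.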
>
>>
>>>
>>>    - The *BAD* cluster does not necessarily run a higher number of
>>>    reductions. In fact, the *GOOD* cluster runs more reductions and yet
>>>    has a lower scheduler utilization.
>>>
>>> METRIC                   BAD            GOOD
>>> user cpu [1]             46% - 57%      19% - 40%
>>> system cpu [1]           20% - 37%      1% - 10%
>>> network traffic [1]      6M - 19M       up to 8M
>>> system interrupts [1]    120k - 196k    10k - 20k
>>> syscalls [2]             1.6M - 2.1M    49k - 110k
>>> task-clock 10sec [3]     68255          12324
>>> cpu profiling info [4]
>>>
>>> [1] <https://gist.github.com/MarcialRosales/226716f0cb9e27cd9ab02eac04702841#dstat>
>>> [2] <https://gist.github.com/MarcialRosales/226716f0cb9e27cd9ab02eac04702841#syscalls>
>>> [3] <https://gist.github.com/MarcialRosales/226716f0cb9e27cd9ab02eac04702841#perf-stat>
>>> [4] <https://gist.github.com/MarcialRosales/226716f0cb9e27cd9ab02eac04702841#perf_record_cpu_cycles>
>>>
>>> We have gathered lots of metrics in an attempt to identify why the BAD
>>> cluster uses so much CPU. All the information can be found here
>>> https://gist.github.com/MarcialRosales/226716f0cb9e27cd9ab02eac04702841
>>> along with the environment information.
>>>
>>>
>> In the BAD cluster, would it be possible to see the annotated versions
>> of process_main and scheduler_wait in the "perf record" run?
>>
>> There don't seem to be any kernel symbols in the BAD cluster perf run;
>> do you know why?
>>
>
> MR> I don't really know. We are running with the same OS image. Any idea
> what that could be?
>
>
>>
>>> We would greatly appreciate any insights as to what could be causing the
>>> issue and/or any additional tools we could use.
>>>
>> My gut tells me that there is some syscall that is a lot slower on Xen
>> than it is on KVM. In virtualized environments I always tend to suspect
>> the time source first. Different hypervisors have very different
>> performance for getting the time and the Erlang VM does a lot of time
>> fetching.
>>
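One rough way to compare time-source cost between the two environments is to check which clocksource the kernel is using and time a batch of clock reads on each host. The sketch below is only an illustration under assumptions: it relies on the Linux sysfs file `current_clocksource`, the function names are made up, and the per-call figure includes Python interpreter overhead, so it is only meaningful as a relative comparison between hosts, not as the absolute cost seen by the Erlang VM.

```python
import time
from pathlib import Path

# Linux sysfs file naming the active kernel clocksource (e.g. tsc, xen, kvm-clock).
CLOCKSOURCE_PATH = Path(
    "/sys/devices/system/clocksource/clocksource0/current_clocksource"
)


def current_clocksource(path=CLOCKSOURCE_PATH):
    """Return the active clocksource name, or None if it cannot be read."""
    try:
        return path.read_text().strip()
    except OSError:
        return None


def ns_per_clock_read(iterations=200_000):
    """Average wall time in nanoseconds per clock_gettime(CLOCK_MONOTONIC) call.

    The loop and call overhead of the interpreter are included in the result.
    """
    read = time.clock_gettime
    clock = time.CLOCK_MONOTONIC
    start = time.perf_counter_ns()
    for _ in range(iterations):
        read(clock)
    elapsed = time.perf_counter_ns() - start
    return elapsed / iterations


if __name__ == "__main__":
    print("clocksource:", current_clocksource() or "unknown")
    print(f"approx. {ns_per_clock_read():.0f} ns per clock read "
          "(includes interpreter overhead)")
```

A large gap in the per-read figure between the BAD and GOOD hosts, together with a different clocksource name, would support the time-source suspicion; similar numbers would point elsewhere.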
> MR> We are in the process of provisioning a cluster in AWS in an attempt
> to rule out or confirm the hypervisor as the cause.
>
>
>> Lukas
>>
>
>
> --
> Marcial Rosales
> Advisory Solution Architect (Customer Success Organization)
> Pivotal, Inc.  EMEA
>
>
> _______________________________________________
> erlang-questions mailing list
> erlang-questions@REDACTED
> http://erlang.org/mailman/listinfo/erlang-questions
>