[erlang-questions] How to pin-point High CPU utilization in Erlang VM
Thu Jul 19 12:02:54 CEST 2018
Hi Danil! Thank you very much for the suggestions. For the time being, we
can't switch the BAD cluster to a new Erlang build with extended msacc and
lcnt. In the GOOD cluster, however, we have lcnt enabled and we are going to
enable msacc, although we won't have anything to compare against.
We'll report back any findings regarding the other suggestions very soon.
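
For reference, a minimal sketch of how msacc and lcnt data could be collected
from a shell attached to the node. The 10000 ms and 5000 ms sampling windows
and the run_queue lock class are assumptions chosen only for illustration,
and the lcnt calls only return useful data on an emulator built with lock
counting.

```erlang
%% Microstate accounting: reset the counters, collect for 10 s
%% (arbitrary, hypothetical window), then print per-thread percentages.
msacc:start(10000),
msacc:print().

%% Lock counting: start the lcnt server, profile while sleeping for 5 s
%% (again an arbitrary window), then list the most contended lock classes.
lcnt:start(),
lcnt:apply(timer, sleep, [5000]),
lcnt:conflicts().

%% Drill into one lock class from the conflicts output. run_queue is only
%% an example name; substitute whichever class ranks highest.
lcnt:inspect(run_queue).
```

Running the same commands on both clusters with the same windows would make
the two outputs directly comparable.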
On Wed, Jul 18, 2018 at 11:28 AM Danil Zagoskin wrote:
> I know that feel when a load is very high and you don't know why.
> More things to see:
> * extended msacc: configure OTP with --with-microstate-accounting=extra
> * LCNT: lcnt:apply(timer, sleep, ), lcnt:conflicts(), lcnt:inspect,
> * Check if there are some processes spending too much time running. You
> may find this tool useful:
> * Network buffers. Check whether the rmem/wmem settings in sysctl are the
> same on both clusters. Also check the network interfaces: MTU, drops, etc.
> * NUMA and scheduler bindings. Try running the whole application on a single
> NUMA node to avoid the interconnect cost.
> * Other processes on the host. Once we saw a malware miner that hid itself
> from ps and was active only when the server was busy doing its main job.
> Use perf top on every CPU core to detect this.
> On Wed, Jul 18, 2018 at 6:34 AM Marcial Rosales wrote:
>> We are experiencing very high CPU utilization in 3 clustered Erlang VMs
>> running RabbitMQ. We have deployed another cluster in an attempt to
>> reproduce the same behaviour, without much success.
>> Our goals are:
>> - Find out where the CPU is being utilized
>> - Choose the right tools to analyze CPU utilization
>> Our observations so far:
>> - The *BAD* cluster shows excessive CPU utilization, both user and
>> system, and also higher network traffic.
>> - The *BAD* cluster also shows higher Erlang scheduler utilization,
>> especially in the microstates emulator and other. We have yet to
>> understand what other covers. According to the Erlang documentation
>> it is *unaccounted things*.
>> - The *BAD* cluster makes a considerably higher number of system
>> calls. We have yet to identify why, and are not sure how to.
>> - The *BAD* cluster does not necessarily run a higher number of
>> reductions. In fact, the *GOOD* cluster runs more reductions and yet
>> has a lower scheduler utilization.
>> Metric              BAD cluster    GOOD cluster
>> user cpu            46% - 57%      19% - 40%
>> system cpu          20% - 37%      1% - 10%
>> network traffic     6M - 19M       up to 8M
>> system interrupts   120k - 196k    10k - 20k
>> system calls        1.6M - 2.1M    49k - 110k
>> task-clock 10sec    68255          12324
>> dstat data:
>> <https://gist.github.com/MarcialRosales/226716f0cb9e27cd9ab02eac04702841#dstat>
>> syscall data:
>> <https://gist.github.com/MarcialRosales/226716f0cb9e27cd9ab02eac04702841#syscalls>
>> cpu profiling info
>> We have gathered lots of metrics in an attempt to identify why the BAD
>> cluster uses so much CPU. All the information can be found here
>> along with the environment information.
>> We would greatly appreciate any insights into what could be causing the
>> issue, or suggestions of additional tools we could use.
>> Many thanks
>> Marcial Rosales
>> Pivotal, Inc. EMEA
> Danil Zagoskin
Advisory Solution Architect (Customer Success Organization)
Pivotal, Inc. EMEA