[erlang-questions] Diagnosing performance issue in Discord usage of Erlang on Google Cloud

Mark Smith mark@REDACTED
Sat Feb 17 23:17:35 CET 2018


Hello,

I'm an SRE at Discord. We make heavy use of Elixir/Erlang in production and
it works really well for us. Lately we've been seeing an interesting
performance issue with our VMs and have spent quite a few hours trying to
understand it, but we haven't found a root cause yet.

The top-line summary: weighted scheduler utilization rises from 30% to
80% in a 1-2 minute span while user CPU usage drops from 70% to 30%. This
state lasts approximately 5-30 minutes and then recovers just as quickly as
it began.

Reference:
https://cdn.discordapp.com/attachments/392075851132567552/414530154988175363/unknown.png
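
(For anyone who wants to compare against their own node: a scheduler
utilization sample can be taken from the shell roughly like this. This is a
minimal sketch, not our actual collector.)

    %% Enable scheduler wall time accounting, sample it twice one second
    %% apart, and compute utilization over the interval as
    %% total active time / total wall time (a fraction, not a percent).
    erlang:system_flag(scheduler_wall_time, true),
    S0 = lists:sort(erlang:statistics(scheduler_wall_time)),
    timer:sleep(1000),
    S1 = lists:sort(erlang:statistics(scheduler_wall_time)),
    {Active, Total} =
        lists:foldl(fun({{_Id0, A0, T0}, {_Id1, A1, T1}}, {A, T}) ->
                            {A + (A1 - A0), T + (T1 - T0)}
                    end, {0, 0}, lists:zip(S0, S1)),
    io:format("scheduler utilization: ~.2f~n", [Active / Total]).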

We've managed to narrow the behavior down to something related to memory
bandwidth/latency: during the periods where we see the above impact,
malloc/mmap latency (as measured with eBPF tracing) increases by 5-10x.
This leaves us with two leading theories for the cause:

a) Noisy neighbors. We're in a cloud environment, so it's possible that
another guest on the same physical host is saturating memory bandwidth and
causing this impact. We've been working with Google Cloud to investigate
this branch.

b) One of our Erlang processes is allocating in an anti-pattern and causing
the memory bandwidth saturation. I.e., we're shooting ourselves in the foot.

We're trying to root-cause this because, if it's the second situation, we'd
of course like to understand and fix the problem. We've added a fair bit of
instrumentation and stats recording based on recon_alloc, so we do have some
pointers, but none of us are knowledgeable enough about the BEAM VM
internals to know where to go next.
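
For concreteness, the sort of recon_alloc sampling we're doing looks roughly
like this (a sketch of the kinds of calls, not our exact instrumentation; we
record values like these over time):

    %% Rough sketch of a per-interval allocator sample.
    allocator_sample() ->
        [{usage,         recon_alloc:memory(usage)},           % used/allocated ratio
         {fragmentation, recon_alloc:fragmentation(current)},  % per-instance MBC usage
         {cache_hits,    recon_alloc:cache_hit_rates()},       % mseg_alloc cache hit rates
         {sbcs_to_mbcs,  recon_alloc:sbcs_to_mbcs(current)},   % single- vs. multi-block carriers
         {block_sizes,   recon_alloc:average_block_sizes(current)}].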

Some interesting things we have determined:

Reference:
https://cdn.discordapp.com/attachments/392075851132567552/414530948097376256/unknown.png

* temp_alloc allocations drop quite a bit, sl_alloc usage rises, and the
two become mirror images of each other.
* fix_alloc usage rises somewhat during the bad periods.
* other allocators seem unaffected.

sl_alloc feels like an interesting culprit. The number of MBCs in sl_alloc
is ~500 during the good periods, but climbs to ~15,000 during the bad
periods. However, the number of mseg_alloc calls (from sl_alloc) drops to
nearly zero, which doesn't seem to make sense to me.

sl_alloc MBCs:
https://cdn.discordapp.com/attachments/392075851132567552/414543516782428161/unknown.png

sl_alloc calls to mseg_alloc:
https://cdn.discordapp.com/attachments/392075851132567552/414543687201062923/unknown.png
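
For reference, counters like those can be pulled straight from
erlang:system_info/1; a rough sketch (exact tuple shapes vary a little
between OTP releases, so treat this as illustrative):

    %% Per-instance info for sl_alloc.
    Instances = [Info || {instance, _, Info}
                             <- erlang:system_info({allocator, sl_alloc})],

    %% Current multiblock-carrier count: {carriers, Current, ...} in the
    %% mbcs section of each instance.
    MbcCount = fun(Info) ->
                       {mbcs, Mbcs} = lists:keyfind(mbcs, 1, Info),
                       element(2, lists:keyfind(carriers, 1, Mbcs))
               end,

    %% Call counters are split as {Name, GigaCalls, Calls}.
    MsegCalls = fun(Info) ->
                        {calls, Calls} = lists:keyfind(calls, 1, Info),
                        {mseg_alloc, Giga, C} = lists:keyfind(mseg_alloc, 1, Calls),
                        Giga * 1000000000 + C
                end,

    io:format("sl_alloc MBCs: ~p, mseg_alloc calls: ~p~n",
              [lists:sum([MbcCount(I) || I <- Instances]),
               lists:sum([MsegCalls(I) || I <- Instances])]).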

I have no idea whether this is cause or effect, so my basic questions are:
how would you go about determining that, and where would you suggest we
take the investigation next?

I'm available on IRC as 'zorkian' and on Discord as zorkian#0001, if
anybody wants to follow up more directly or ask for more data.

Thank you for reading!

-- 
Mark Smith
SRE @ Discord