[erlang-questions] Diagnosing performance issue in Discord usage of Erlang on Google Cloud

Sun Feb 18 14:20:43 CET 2018

Hi Mark

I have seen the same pattern (high scheduler utilization, low CPU 
usage) before, on a system that had lock contention on ets tables. I 
was able to verify this was the case using lcnt.
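
For anyone who has not used lcnt, here is a minimal sketch of such a 
check. It assumes the node runs a lock-counting emulator (for example, 
started with `erl -emu_type lcnt` on OTP 20 or later). The 30-second 
sampling window is an arbitrary choice, and db_tab is assumed to be the 
lock class covering ets tables; neither is taken from the system 
described below.

```erlang
%% Run in the shell of a node started with a lock-counting emulator.
%% Assumption: OTP 20 or later, started with `erl -emu_type lcnt`.
lcnt:start(),                    % start the lcnt server
lcnt:rt_opt({copy_save, true}),  % keep stats for locks destroyed while sampling
lcnt:clear(),                    % reset the runtime's lock counters
timer:sleep(30000),              % arbitrary sampling window, ideally during a bad period
lcnt:collect(),                  % pull the counters from the runtime system
lcnt:conflicts(),                % print lock classes ordered by contention
lcnt:inspect(db_tab).            % assumed lock class for ets tables
```

If db_tab (or another ets-related lock) dominates the conflicts output 
during a bad period but not during a good one, that would point at ets 
lock contention rather than at the allocators.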

Cheers
Andy

On 17/02/2018 22:17, Mark Smith wrote:
> Hello,
>
> I'm an SRE at Discord. We make heavy use of Elixir/Erlang in 
> production and it works really well for us. Lately we've been seeing 
> an interesting performance issue with our VMs and have spent quite 
> some hours on understanding it, but haven't found a root cause yet.
>
> The top line summary is: weighted scheduler utilization rises from 30% 
> to 80% in a 1-2 minute span while user CPU usage drops from 70% to 
> 30%. This state lasts approximately 5-30 minutes and then recovers 
> just as quickly as it began.
>
> Reference: 
> https://cdn.discordapp.com/attachments/392075851132567552/414530154988175363/unknown.png
>
> We've managed to isolate the behavior to being related to memory 
> bandwidth/latency. I.e., during the periods we're seeing the above 
> impact, the malloc/mmap latency (as measured by eBPF tracing) 
> increases by 5-10x. This gives us two leading theories for the cause 
> of these issues:
>
> a) Noisy neighbors. We're in a cloud environment, so it's possible 
> that another instance on the shared host is saturating memory 
> bandwidth and causing this impact. We've been working with Google 
> Cloud to investigate this branch.
>
> b) One of our Erlang processes is allocating in an anti-pattern and 
> causing the memory bandwidth saturation. I.e., we're shooting 
> ourselves in the foot.
>
> We are trying to root cause this because of course if it's the second 
> situation, we'd like to understand and fix the problem. We've added a 
> fair bit of instrumentation and stats recording based on recon_alloc 
> and so we do have some pointers, but none of us are knowledgeable 
> enough about the BEAM VM to understand where to go next.
>
> Some interesting things we have determined:
>
> Reference: 
> https://cdn.discordapp.com/attachments/392075851132567552/414530948097376256/unknown.png
>
> * temp_alloc allocations drop quite a bit, sl_alloc usage rises, and 
> the two become mirrors of each other.
> * fix_alloc rises some during the bad periods.
> * other allocators seem unaffected.
>
> The sl_alloc feels like an interesting culprit. The number of MBCs in 
> sl_alloc is ~500 in the good periods, but during the bad period climbs 
> to 15,000. However, the number of mseg_alloc calls (from sl_alloc) 
> drops to nearly 0, which doesn't seem to make sense to me.
>
> sl_alloc MBCs: 
> https://cdn.discordapp.com/attachments/392075851132567552/414543516782428161/unknown.png
>
> sl_alloc calls to mseg_alloc: 
> https://cdn.discordapp.com/attachments/392075851132567552/414543687201062923/unknown.png
>
> I have no idea if this is cause or effect, and so my basic question is 
> to ask for advice on how to determine that and also what people would 
> suggest we do with the investigation.
>
> I'm available on IRC as 'zorkian' and on Discord as zorkian#0001, if 
> anybody wants to follow up more immediately/ask for more data.
>
> Thank you for reading!
>
> -- 
> Mark Smith
> SRE @ Discord