[erlang-questions] Diagnosing performance issue in Discord usage of Erlang on Google Cloud
Andy Till
atill@REDACTED
Sun Feb 18 14:20:43 CET 2018
Hi Mark
I have seen this pattern of high scheduler utilization with low CPU
usage before, in a system that had lock contention on ets tables. I was
able to verify that this was the case using lcnt.
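
For readers who have not used lcnt, here is a minimal sketch of what such
a check might look like in an Erlang shell. It assumes the node runs a
lock-counting emulator (on OTP 20 or later, typically a build configured
with --enable-lock-counter and started with `erl -emu_type lcnt`). The
10-second window is arbitrary, and db_tab is the lock class usually
associated with ets tables. This illustrates the tool; it is not the
exact procedure used in the case described above.

```erlang
%% Run on a node started with the lock-counting emulator
%% (assumption: `erl -emu_type lcnt` is available in your build).
lcnt:start().           % start the lcnt server process
lcnt:clear().           % reset counters so only the window below is measured
timer:sleep(10000).     % let the workload run for 10 s (arbitrary choice)
lcnt:collect().         % pull the lock statistics from the runtime
lcnt:conflicts().       % print lock classes ranked by contention
lcnt:inspect(db_tab).   % drill into ets table locks, if they rank highly
```

If ets locks dominate the lcnt:conflicts() output during a bad period but
not during a good one, that would point toward contention rather than the
allocator as the thing to chase first.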
Cheers
Andy
On 17/02/2018 22:17, Mark Smith wrote:
> Hello,
>
> I'm an SRE at Discord. We make heavy use of Elixir/Erlang in
> production and it works really well for us. Lately we've been seeing
> an interesting performance issue with our VMs and have spent quite
> some hours on understanding it, but haven't found a root cause yet.
>
> The top line summary is: weighted scheduler utilization rises from 30%
> to 80% in a 1-2 minute span while user CPU usage drops from 70% to
> 30%. This state lasts approximately 5-30 minutes and then recovers
> just as quickly as it began.
>
> Reference:
> https://cdn.discordapp.com/attachments/392075851132567552/414530154988175363/unknown.png
>
> We've managed to isolate the behavior to being related to memory
> bandwidth/latency. I.e., during the periods we're seeing the above
> impact, the malloc/mmap latency (as measured by eBPF tracing)
> increases by 5-10x. This gives us two leading theories for the cause
> of these issues:
>
> a) Noisy neighbors. We're in a cloud environment, so it's possible
> that another host on the shared instance is saturating memory
> bandwidth and causing this impact. We've been working with Google
> Cloud to investigate this branch.
>
> b) One of our Erlang processes is allocating in an anti-pattern and
> causing the memory bandwidth saturation. I.e., we're shooting
> ourselves in the foot.
>
> We are trying to root cause this because of course if it's the second
> situation, we'd like to understand and fix the problem. We've added a
> fair bit of instrumentation and stats recording based on recon_alloc
> and so we do have some pointers, but none of us are knowledgeable
> enough about the BEAM VM to understand where to go next.
>
> Some interesting things we have determined:
>
> Reference:
> https://cdn.discordapp.com/attachments/392075851132567552/414530948097376256/unknown.png
>
> * temp_alloc allocations drop quite a bit, sl_alloc usage rises, and
> the two become mirrors of each other.
> * fix_alloc rises some during the bad periods.
> * other allocators seem unaffected.
>
> The sl_alloc feels like an interesting culprit. The number of MBCs in
> sl_alloc is ~500 in the good periods, but during the bad period climbs
> to 15,000. However, the number of mseg_alloc calls (from sl_alloc)
> drops to nearly 0, which doesn't seem to make sense to me.
>
> sl_alloc MBCs:
> https://cdn.discordapp.com/attachments/392075851132567552/414543516782428161/unknown.png
>
> sl_alloc calls to mseg_alloc:
> https://cdn.discordapp.com/attachments/392075851132567552/414543687201062923/unknown.png
>
> I have no idea if this is cause or effect, and so my basic question is
> to ask for advice on how to determine that and also what people would
> suggest we do with the investigation.
>
> I'm available on IRC as 'zorkian' and on Discord as zorkian#0001, if
> anybody wants to follow up more immediately/ask for more data.
>
> Thank you for reading!
>
> --
> Mark Smith
> SRE @ Discord
>
>
> _______________________________________________
> erlang-questions mailing list
> erlang-questions@REDACTED
> http://erlang.org/mailman/listinfo/erlang-questions