[erlang-questions] Diagnosing performance issue in Discord usage of Erlang on Google Cloud

Lukas Larsson lukas@REDACTED
Mon Feb 19 17:58:17 CET 2018


Hello,

It would be useful to know which Erlang/OTP version you are using, and
also which emulator options (if any) you are using.
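
For reference, the following shell calls report the release and the
options each sl_alloc instance was started with (assuming sl_alloc is
enabled; the exact contents of the options list vary between OTP
versions):

    %% OTP release and the full version banner.
    erlang:system_info(otp_release).
    erlang:system_info(system_version).

    %% The options each sl_alloc instance was started with
    %% (reflects any +MS... emulator flags given on the command line).
    [proplists:get_value(options, Info)
     || {instance, _, Info} <- erlang:system_info({allocator, sl_alloc})].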

I would start by trying to find out which allocations are taking place in
sl_alloc.

Can you run perf in your production environment? By sampling the call
stack at mmap/malloc calls (or just at some time interval) it should be
possible to pinpoint which allocation is causing problems.
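
For example, something along these lines (assuming Linux perf is
available on the host; the beam OS pid can be found with os:getpid() in
an Erlang shell):

    # Sample call stacks ~99 times/second for a minute, then inspect.
    # If stacks look truncated, add --call-graph dwarf.
    perf record -g -F 99 -p <beam-os-pid> -- sleep 60
    perf report

    # Or sample only when the mmap syscall is entered.
    perf record -g -e syscalls:sys_enter_mmap -p <beam-os-pid> -- sleep 60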

Do you notice any difference in the erlang:memory/0 stats during this
period? sl_alloc can allocate in most of the erlang:memory categories, so
if one category rises more than the others, that could help pinpoint what
the allocations are. The mappings can be found here:
https://github.com/erlang/otp/blob/master/erts/emulator/beam/erl_alloc.types.
My guess would be that the allocations end up in the system category, which
does not tell us all that much, but it would at least rule out processes.
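
A quick way to see which category moves is to diff two snapshots taken a
few seconds apart; a minimal sketch to run in a shell:

    %% Per-category change, in bytes, over Secs seconds.
    MemDelta = fun(Secs) ->
        Before = erlang:memory(),
        timer:sleep(Secs * 1000),
        [{Cat, Bytes - proplists:get_value(Cat, Before, 0)}
         || {Cat, Bytes} <- erlang:memory()]
    end,
    MemDelta(10).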


On Sat, Feb 17, 2018 at 11:17 PM, Mark Smith <mark@REDACTED> wrote:

> Some interesting things we have determined:
>
> Reference: https://cdn.discordapp.com/attachments/392075851132567552/414530948097376256/unknown.png
>

> * temp_alloc allocations drop quite a bit, sl_alloc usage rises, and
> the two become mirrors of each other.
>

Does free also show the same pattern?


> * fix_alloc rises some during the bad periods.
> * other allocators seem unaffected.
>
> The sl_alloc feels like an interesting culprit. The number of MBCs in
> sl_alloc is ~500 in the good periods, but during the bad period climbs to
> 15,000. However, the number of mseg_alloc calls (from sl_alloc) drops to
> nearly 0, which doesn't seem to make sense to me.
>
> sl_alloc MBCs: https://cdn.discordapp.com/attachments/392075851132567552/414543516782428161/unknown.png
>
>
First, a bit of naming: what you call MBCs, I would call MBC blocks. The
number of MBCs would be what you get from mbcs->carriers.

The allocations could be done from the mbcs pool, which will not show up in
the calls statistics for mseg or sys. So it would be interesting to look at
mbcs->carriers and mbcs_pool->carriers to see if the allocations come from
there.
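
A sketch for pulling those counters out of erlang:system_info/1; the
exact shape of the carriers tuples differs between OTP releases, so the
element/2 access below (taking the first, current, value) is an
assumption:

    %% Current carrier counts, summed over all sl_alloc instances.
    %% keyfind returns false when an instance has no mbcs_pool section.
    SlCarriers = fun(Section) ->
        lists:sum(
          [case lists:keyfind(carriers, 1,
                              proplists:get_value(Section, Info, [])) of
               T when is_tuple(T) -> element(2, T);
               false -> 0
           end
           || {instance, _, Info}
                  <- erlang:system_info({allocator, sl_alloc})])
    end,
    {SlCarriers(mbcs), SlCarriers(mbcs_pool)}.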


> sl_alloc calls to mseg_alloc: https://cdn.discordapp.com/attachments/392075851132567552/414543687201062923/unknown.png
>
> I have no idea if this is cause or effect, and so my basic question is to
> ask for advice on how to determine that and also what people would suggest
> we do with the investigation.
>

My wild guess: for some reason, some syscalls/libc calls (maybe mmap) start
being slower. This causes the system to start building internal queues. The
VM's internal queues are usually allocated using sl_alloc, so you see an
increase in the number of sl_alloc blocks. temp_alloc is used for a lot of
the actual work, so its usage decreases while the system is blocking in a
syscall instead of working. The reason sl_alloc and temp_alloc end up in
lock step is that each of the queued jobs results in one temp allocation.
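
If that is what happens, the buildup should be visible in the port driver
queues; a hedged sketch that lists the ports holding the most queued
bytes:

    %% Ports sorted by driver queue size (bytes not yet written out).
    %% Closed ports, for which port_info/2 returns undefined, are skipped.
    TopPortQueues = fun(N) ->
        Sizes = [{Bytes, P}
                 || P <- erlang:ports(),
                    {queue_size, Bytes} <- [erlang:port_info(P, queue_size)]],
        lists:sublist(lists:reverse(lists:sort(Sizes)), N)
    end,
    TopPortQueues(10).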

One place where this could happen is if a port gets stuck in writev when
sending data on a socket. Each port operation speculatively does a temp
alloc for the data to send, and then, if the port is busy, it does an sl
alloc to schedule the job for later. If the port is not busy, only a temp
alloc is done and no sl alloc. TCP ports use non-blocking I/O, but I'm not
sure what guarantees that gives in different cloud environments.
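
Whether ports actually go busy can be checked with the system monitor; a
minimal sketch:

    %% Receive a message whenever a process gets suspended because the
    %% port it is sending to is busy (e.g. a socket stuck in writev).
    erlang:system_monitor(self(), [busy_port]),
    receive
        {monitor, SusPid, busy_port, Port} ->
            io:format("~p suspended on busy port ~p (~p)~n",
                      [SusPid, Port, erlang:port_info(Port, name)])
    after 60000 ->
        no_busy_port_seen_within_a_minute
    end.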

Lukas