Carrier migration, madvise and poor performance

Pablo Polvorin pablo.polvorin@REDACTED
Thu Aug 27 05:53:35 CEST 2020



> On 26 Aug 2020, at 06:50, Lukas Larsson <lukas@REDACTED> wrote:
> 
> 
> 
> On Tue, 25 Aug 2020, 17:28 Pablo Polvorin, <pablo.polvorin@REDACTED> wrote:
> 
> 
>> On 25 Aug 2020, at 09:05, Lukas Larsson <lukas@REDACTED> wrote:
>> 
>> Hello,
>> 
>> On Tue, Aug 25, 2020 at 9:26 AM Pablo Polvorin <pablo.polvorin@REDACTED> wrote:
>> Hi,
>> We recently migrated our systems from OTP 21 to OTP 22. On one of them, performance noticeably degraded when we switched to OTP 22 (a 20% drop).
>> After some investigation, we found the apparent cause was the madvise() call made when returning carriers to the carrier pool, which led to lots of
>> minor page faults, ultimately killing our performance. We run in a virtualised cloud; not sure how much that affects the minor page fault overhead.
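The page-fault effect described above can be reproduced outside the VM. What follows is only a rough sketch, assuming Linux and a private anonymous mapping; the names refault_count, touch_pages and minor_faults are made up for illustration and are not ERTS code. It touches a region, optionally advises it away with MADV_DONTNEED, touches it again, and reports how many minor faults the second pass took according to getrusage().

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE            /* for madvise() and MAP_ANONYMOUS */
#endif
#include <stddef.h>
#include <sys/mman.h>
#include <sys/resource.h>
#include <unistd.h>

/* Minor page faults taken by this process so far, or -1 on error. */
static long minor_faults(void) {
    struct rusage ru;
    if (getrusage(RUSAGE_SELF, &ru) != 0) return -1;
    return ru.ru_minflt;
}

/* Write one byte per page so that every page gets touched. */
static void touch_pages(volatile unsigned char *p, size_t len, size_t page) {
    for (size_t off = 0; off < len; off += page) p[off] = 1;
}

/* Hypothetical helper: number of minor faults taken when re-touching
 * `npages` pages, with (advise != 0) or without a MADV_DONTNEED call
 * in between. Returns -1 if the experiment could not be set up. */
long refault_count(size_t npages, int advise) {
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0 || npages == 0) return -1;
    size_t len = npages * (size_t)page;

    unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return -1;

    touch_pages(p, len, (size_t)page);        /* first touch: fault pages in */
    if (advise && madvise(p, len, MADV_DONTNEED) != 0) {
        munmap(p, len);
        return -1;
    }

    long before = minor_faults();
    touch_pages(p, len, (size_t)page);        /* second touch */
    long after = minor_faults();

    munmap(p, len);
    if (before < 0 || after < 0) return -1;
    return after - before;
}
```

Comparing refault_count(256, 1) with refault_count(256, 0) on the target machine gives a rough idea of how many extra faults a discarded range costs when it is reused; the absolute numbers will depend on the kernel and the virtualisation layer.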
>> 
>> Do you know if you have access to MADV_FREE or use MADV_DONTNEED?
>> 
> Looks like we don’t :/ 
> 
> #include <sys/mman.h>
> #include <stdio.h>
> int main() {
>    #ifdef MADV_FREE
> 	   printf("Have it\n");
>    #else
> 	   printf("Dont have it\n");
>    #endif
>    return 0;
> }
> > Dont have it
> 
> This is on Amazon Linux, 4.14.186-110.268.amzn1.x86_64.
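Note that the #ifdef test above only tells whether the installed C library headers define the macro. The madvise(2) man page lists MADV_FREE as available since Linux 4.5, so the running kernel may still accept it. A minimal runtime probe, assuming Linux (where the kernel headers define MADV_FREE as 8), could look like the sketch below; the function name kernel_supports_madv_free is hypothetical.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE            /* for madvise() and MAP_ANONYMOUS */
#endif
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Assumption: Linux value of MADV_FREE, used only when the installed
 * headers are too old to define it. */
#ifndef MADV_FREE
#define MADV_FREE 8
#endif

/* Returns 1 if the running kernel accepts MADV_FREE on an anonymous page,
 * 0 if madvise() rejects it, and -1 if the probe could not be set up. */
int kernel_supports_madv_free(void) {
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0) return -1;

    void *p = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return -1;

    *(volatile char *)p = 1;                  /* make sure the page is populated */
    int supported = (madvise(p, (size_t)page, MADV_FREE) == 0) ? 1 : 0;

    munmap(p, (size_t)page);
    return supported;
}
```

Calling it from main() and printing the result complements the header check: "Dont have it" from the preprocessor together with a 1 from the probe would point at old headers rather than an old kernel.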
> 
> That may explain it. Maybe we should not use madvise when MADV_FREE is not available.
> 
> Have you tried to delete this line: https://github.com/erlang/otp/blob/7ad81c674d1aa705ae41743b343043d05ea1944b/erts/emulator/sys/common/erl_mmap.h#L215 and see what happens then?
> 
Just tested today: after removing the call to madvise and re-enabling carrier migration, the performance problem is gone.
In our case, we actually don't see a difference in throughput between this and disabling carrier migration. Everything else being the same,
this looks like the safer bet.

> 
> 
>> 
>> So we spent some time tweaking the allocator settings (which hadn't been touched in many years, while the system itself evolved a lot in that time) and got
>> a good improvement. But for carrier migration, ultimately the thing that worked best for us was to just disable it entirely. Is this a terrible idea?
>> Our load is a fairly stable flow of homogeneous requests that lead to several short-lived (milliseconds) processes being spawned. We have plenty of memory, so I'm not too worried about a badly utilised carrier being stuck within a scheduler.
>> 
>> Got a few questions regarding this:
>> 
>> * I wonder, is it common out there to disable carrier migration? It feels a bit strange that nobody hit the same problem when updating to OTP 22; I'm
>> assuming there are lots of not-so-great allocator settings out there, as was our case. (Disabling it was actually our last try: we tried the settings suggested by
>> erts_alloc_config, and then made the +M<S>acul and +M<S>acfml settings significantly stricter as well, and while that helped, we still had too many page faults.)
>> 
>> 
>> Carrier migration can help a lot to deal with memory fragmentation issues. It is however not free as you have noticed. I know that other people have disabled it with some success, but as far as I remember that was to work around bugs in the migration logic, not because of the performance overhead.
> Given the frequency at which we were abandoning carriers and taking them from the pool, something smells fishy in our settings and allocation pattern. But so far I couldn't really figure out how to bring that down to a low enough level that the madvise() won't affect us much.
> 
> Yes, that does seem odd. Carriers are not meant to be pushed in and out of the pool at a rapid pace.
> 
> I don't suppose you have a relatively small example that will reproduce the behaviour? Or if not, then maybe a couple of recon_alloc snapshots?
Don’t have a good example to reproduce this,  but here are two alloc snapshots during our test:
https://gist.github.com/polvorin/20ff77e7bfd4d3f81cb6d1ac20c21f7a
https://gist.github.com/polvorin/1a3579259cc9dfb57cde169178cbed5b

> 
> 
>>  
>> * What's the tradeoff of having large multiblock carriers, other than the memory overhead when they aren't fully used?
>> 
>> * Does it make sense to add a config flag that allows carrier migration but disables the madvise() on free blocks?
>> 
>> Given your experiences with it, I think that would make sense. We did not notice any degradation when testing madvise ourselves, but it is not possible to test all scenarios in all environments.
>> 
> I might work on this, although I guess it would require re-learning autoconf sorcery, so it probably will not happen soon.
> 
> No need to do any autoconf sorcery, I was thinking that this could probably be a start flag passed to erl?
That sounds easier, will have a look :)
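To make the idea concrete, here is a rough sketch of what a runtime switch around the discard call might look like. Everything in it is an assumption for illustration: discard_enabled, set_discard_option and discard_range are invented names, the "true"/"false" value syntax is made up, and none of this corresponds to an existing erl flag or to the actual code in erl_mmap.h. It also follows the suggestion earlier in the thread of skipping the advice entirely when MADV_FREE is not available at build time.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE            /* for madvise() */
#endif
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/* Hypothetical runtime switch; enabled by default. */
static int discard_enabled = 1;

/* Parse a hypothetical "true"/"false" option value.
 * Returns 0 on success, -1 if the value is not recognised. */
int set_discard_option(const char *value) {
    if (value != NULL && strcmp(value, "true") == 0) {
        discard_enabled = 1;
        return 0;
    }
    if (value != NULL && strcmp(value, "false") == 0) {
        discard_enabled = 0;
        return 0;
    }
    return -1;
}

/* Advise the kernel that [ptr, ptr + len) is no longer needed.
 * Returns 1 if advice was given, 0 if it was skipped, -1 on failure. */
int discard_range(void *ptr, size_t len) {
    if (!discard_enabled) return 0;           /* switched off at runtime */
#ifdef MADV_FREE
    return madvise(ptr, len, MADV_FREE) == 0 ? 1 : -1;
#else
    (void)ptr;
    (void)len;
    return 0;                                 /* no MADV_FREE: skip, no MADV_DONTNEED fallback */
#endif
}
```

With the switch set to "false", discard_range() returns before touching the kernel at all, which matches the experiment of deleting the madvise line while keeping carrier migration enabled.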

thanks!

