Carrier migration, madvise and poor performance

Tue Aug 25 03:27:03 CEST 2020

Hi,
we recently migrated our systems from OTP21 to OTP22. On one of them, performance noticeably degraded after the switch to OTP22 (a 20% drop).
After some investigation, we found the apparent cause was the madvise() call made when returning carriers to the carrier pool, which led to lots of
minor page faults, ultimately killing our performance. We run in a virtualised cloud, and are not sure how much that affects the minor page fault overhead.

So we spent some time tweaking the allocator settings (which hadn't been touched in many years, while the system itself has evolved a lot since then) and got
a good improvement. But for carrier migration, ultimately the thing that worked best for us was to just disable it entirely. Is this a terrible idea?
Our load is a fairly stable flow of homogeneous requests that lead to several short-lived (milliseconds) processes being spawned. We have plenty of memory, so I'm not too worried about a badly utilised carrier being stuck within a scheduler.
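
For illustration, disabling carrier migration amounts to setting the abandon carrier utilization limit to 0. A minimal sketch of a vm.args fragment, assuming it is applied to all alloc_util allocators at once (it can also be set per allocator with +M<S>acul 0); this is not necessarily the exact configuration in use here:

```
## Sketch only: disable carrier migration for all alloc_util allocators.
## A utilization limit of 0 means carriers are never abandoned,
## so they are never handed to the carrier pool.
+Muacul 0
```

The same flag can be passed directly on the command line, e.g. `erl +Muacul 0`, which is handy for a quick comparison run.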

Got a few questions regarding this:

* Is it common out there to disable carrier migration? It feels a bit strange that nobody hit the same problem when updating to OTP22; I'm
assuming there are lots of not-so-great allocator settings out there, as was our case. (Disabling it was actually our last try: we first tried the settings suggested by
erts_alloc_config, and then made the +M<S>acul and +M<S>acfml settings significantly stricter as well, and while that helped, we still had too many page faults.)

* What's the tradeoff of having large multiblock carriers, other than the memory overhead when they aren't fully used?

* Does it make sense to add a config flag that allows carrier migration but disables the madvise() on free blocks?

* Any suggestions on what to look at regarding our allocator settings and usage? I know this is as vague as it gets :(

regards,
Pablo