OTP22 performance on lists

Dmytro Lytovchenko dmytro.lytovchenko@REDACTED
Thu Jul 2 14:53:13 CEST 2020


Yielding means that the process running the ++ BIF will save state and
self-suspend trying to find another process to share the CPU with, there's
extra work involved (Erlang process switching) when working on a big list,
even if it is a clean empty VM running, there will be time lost on
scheduling out and in.

The 1st list has to be fully copied for ++ operation to continue with the
2nd list.
Length of 2nd list is irrelevant as it's just appended on build.

If this particular BIF is important for your project, please modify that
threshold value and rebuild the VM.
The entry point is `erl_bif_lists.c:230` in master branch, see `static
Eterm append/2` then proceeds to `int append_start/4` and from there to
`int append_continue/2` and to `BIF_RETTYPE lists_append_alloc/2`

The threshold at which the switch is happening can be found from the C
source code for ++ BIF.  The threshold value is *60 list cells* per
reduction (erl_bif_lists.c:150 CELLS_PER_RED) and the value is multiplied
by remaining reductions ERTS_BIF_REDS_LEFT. Max reductions value is 4000
but your test process might have less remaining which causes process
scheduling out early.

On Thu, 2 Jul 2020 at 10:34, Fernando Benavides <elbrujohalcon@REDACTED>
wrote:

> Hi Dániel,
>
> Thanks for checking!
> I can definitely tell it's not that OTP21 bug. In all my test environments
> (being those in my machine or in the servers), this returns *1*
>  everytime: erlang:convert_time_unit(1000, native, microsecond).
>
> You got your results with smaller samples… So, I tried with smaller
> samples, too… And you're right! OTP22 seems faster with those.
> For instance,
>
> test:bench({test, rec}, 2500, 500, 25). % reports avg times over 100µs in
> OTP21, but under 70µs in OTP22…
>
> …but if I increase the size of the individual lists to concat…
>
> test:bench({test, rec}, 2500, 500, 200). % reports ~550µs in OTP21 /
> ~1300µs in OTP22
>
> …and if I increase the # of lists…
>
> test:bench({test, rec}, 2500, 5000, 25). % reports ~600µs in OTP21 and it *jumps
> between* 700µs and 1400µs in OTP22 *(O_O)!!*
>
> This is OTP22… and it happened multiple times…
> 72> [test:bench({test, rec}, 2500, 5000, 25) || _ <- lists:seq(1,10)].
> [1296.7716,*700.2968*,1317.4632,*698.1416*,1316.9648,*696.0352*,
>  1294.4548,*706.8484*,1304.8368,*697.4452*]
> How is that *shaking *even possible?
> Aaanyway… if I only run the code 250 times (instead of 2500)… OTP22 is
> consistently in *~1800µs*.
>
> So, extrapolating wildly from very incomplete data… the problem seems to
> present itself when applying ++  to either larger lists or to a large # of
> lists (which, in time, results in ++'ing larger lists too… although AFAICT
> the time it takes to run ++ is more affected by the length of the first
> list and not the second one).
>
> I'm still puzzled…
>
>
> On Thu, Jul 2, 2020 at 1:43 AM Dániel Szoboszlay <dszoboszlay@REDACTED>
> wrote:
>
>> Hi Brujo,
>>
>> I tried to reproduce your results, and for me OTP 22 seems to be
>> consistently faster for rec and about the same for lrec:
>>
>> Test case     Min         Max        Median     Average
>> rec  / OTP 21 124.170us   305.990us  127.120us  128.688us
>> rec  / OTP 22  62.570us   289.190us   67.610us   68.976us
>> lrec / OTP 21  16.110us    22.570us   17.790us   17.881us
>> lrec / OTP 22  15.390us    22.710us   17.170us   17.441us
>>
>> (These are times for single invocation of the functions for much smaller
>> inputs than in your example.)
>>
>> However, there are some OTP 21 versions where due to a configuration bug
>> the resolution of the native clock was much lower. For example erlang:convert_time_unit(1000,
>> native, microsecond). normally gives 1 on my machine (so the native
>> clock has nanosecond resolution), and on the buggy OTP 21 versions I get
>> 610. On these OTP 21 versions I see that your rec benchmark seems to run
>> ~6% faster than with OTP 22, but this is measurement error due to the lower
>> clock resolution.
>>
>> Could you please check you have the same clock resolution on both OTP
>> versions?
>>
>> Cheers,
>> Daniel
>>
>> On Wed, 1 Jul 2020 at 12:56, Fernando Benavides <elbrujohalcon@REDACTED>
>> wrote:
>>
>>> (Sorry if you receive this email twice, I sent it from another account
>>> and it seems like it didn't reach the mailing list then)
>>>
>>> Hi erlangers,
>>>
>>> Yeah! It's me talking about performance, you read that subject right
>>> ��‍♂️
>>> Anyway… At NextRoll we're in the process of migrating our systems from
>>> OTP21 to OTP22 (not 23, yet) and our tests showed a *huge* impact on
>>> performance in general, that was not associated to anything in particular.
>>>
>>> While trying to figure out what was causing it, we came up with a very
>>> very basic example of things that are consistently slower in OTP22.
>>>
>>> I created this gist to show it:
>>> https://gist.github.com/elbrujohalcon/d4e995fbc4b93fadddfd1f0d6b9f8121
>>>
>>> I'm aware that these kinds of micro-benchmarks are treacherous and they
>>> may vary wildly depending on context. Nevertheless, on both on one of our
>>> servers in AWS and on my machine (A MacOS Pro running Catalina 10.15.5) and
>>> using kerl to install multiple versions of OTP 21 and 22… every single time
>>> I run the tests I found the same results…
>>>
>>> Always starting the nodes with *erl -boot start_clean *then running the
>>> following in both OTP21 and OTP22…
>>>
>>> c(test), test:bench({test, lrec}, 250, 5000, 2000).
>>>
>>> …generates very similar numbers in both versions, regardless of the
>>> numbers used for the different parameters. But…
>>>
>>> c(test), test:bench({test, rec}, 250, 5000, 2000).
>>>
>>> …consistently generates larger results in OTP22 than OTP21. I tried with
>>> different values for the number of tests, the number of lists and their
>>> length and sometimes the difference is more evident, sometimes less… but
>>> OTP22 times are *always* larger.
>>>
>>> I found this in the OTP22 readme…
>>>
>>>   OTP-15427    Application(s): erts
>>>                Appending lists (The ++ operator) will now yield
>>>                properly on large inputs.
>>>
>>> So… questions…
>>> 1. Has anybody experienced (and hopefully solved) this problem before
>>> when migrating to OTP22?
>>> 2. Do you think OTP-15427 can be related to what I'm seeing?
>>> 3. Can someone confirm if you also experience the same difference in
>>> performance when running the same benchmarks that I pasted on that gist?
>>> 4. Is there anything else I should try/test/use to check?
>>> 5. Am I going slightly mad?
>>>
>>> Thanks in advance, cheers :)
>>>
>>>
>>> --
>>>
>>> <https://about.me/elbrujohalcon?promo=email_sig&utm_source=product&utm_medium=email_sig&utm_campaign=gmail_api&utm_content=thumb>
>>> Brujo Benavides
>>> about.me/elbrujohalcon
>>> <https://about.me/elbrujohalcon?promo=email_sig&utm_source=product&utm_medium=email_sig&utm_campaign=gmail_api&utm_content=thumb>
>>>
>>
>
> --
>
> <https://about.me/elbrujohalcon?promo=email_sig&utm_source=product&utm_medium=email_sig&utm_campaign=gmail_api&utm_content=thumb>
> Brujo Benavides
> about.me/elbrujohalcon
> <https://about.me/elbrujohalcon?promo=email_sig&utm_source=product&utm_medium=email_sig&utm_campaign=gmail_api&utm_content=thumb>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://erlang.org/pipermail/erlang-questions/attachments/20200702/5354bc72/attachment.htm>


More information about the erlang-questions mailing list