[erlang-questions] On OTP rand module difference between OTP 19 and OTP 20
Raimo Niskanen
raimo+erlang-questions@REDACTED
Mon Sep 4 10:48:15 CEST 2017
On Sat, Sep 02, 2017 at 04:35:39PM -0700, Michael Truog wrote:
> On 09/01/2017 05:13 AM, Raimo Niskanen wrote:
> > On Fri, Sep 01, 2017 at 04:00:59AM -0700, Michael Truog wrote:
> >> On 09/01/2017 01:54 AM, Raimo Niskanen wrote:
> >>> On Thu, Aug 31, 2017 at 10:29:34PM -0700, Michael Truog wrote:
> >>> :
> >>>> I have some examples that can make this desire a bit clearer:
> >>>>
> >>>> https://github.com/CloudI/cloudi_core/blob/a1c10a02245f0f4284d701a2ee5f07aad17f6e51/src/cloudi_core_i_runtime_testing.erl#L139-L149
> >>>>
> >>>> % use Box-Muller transformation to generate Gaussian noise
> >>>> % (G. E. P. Box and Mervin E. Muller,
> >>>> % A Note on the Generation of Random Normal Deviates,
> >>>> % The Annals of Mathematical Statistics (1958),
> >>>> % Vol. 29, No. 2 pp. 610–611)
> >>>> X1 = random(),
> >>>> X2 = PI2 * random(),
> >>>> K = StdDev * math:sqrt(-2.0 * math:log(X1)),
> >>>> Result1 = erlang:max(erlang:round(Mean + K * math:cos(X2)), 1),
> >>>> Result2 = erlang:max(erlang:round(Mean + K * math:sin(X2)), 1),
> >>>> sleep(Result2),
> >>> Why not use rand:normal/3?
> >>>
> >>> It uses the Ziggurat Method and is supposed to be much faster and
> >>> numerically more stable than the basic Box-Muller method.
> >>>
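To make that concrete, a rough, untested sketch of producing the same two
deviates with rand:normal/3 (note that its second argument is the variance,
i.e. StdDev * StdDev, not the standard deviation, and that the state has to
be threaded explicitly):

    S0 = rand:seed_s(exsp),
    {R1, S1} = rand:normal(Mean, StdDev * StdDev, S0),
    {R2, _S2} = rand:normal(Mean, StdDev * StdDev, S1),
    Result1 = erlang:max(erlang:round(R1), 1),
    Result2 = erlang:max(erlang:round(R2), 1),
    sleep(Result2),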
> >> The Box-Muller transform is simpler and produces 2 results instead of 1. I believe I looked at the source code for rand:normal/3 and expected the Box-Muller to be faster only because it creates 2 results, though I should check that. I will have to investigate it more.
> > Simpler - yes.
> >
> > The basic benchmark in rand_SUITE indicates that rand:normal() is only
> > about 50% slower than rand:uniform(1 bsl 58) (internal word size),
> > which I think is a very good number.
> >
> > The Box-Muller transform method needs 4 calls to the 'math' module for
> > non-trivial floating point functions, i.e. log(), sqrt(), cos() and sin(),
> > which is why I think it "must" be slower.
> >
> > But I have also not measured... :-/
> >
> > Looking forward to hearing your results!
> I have some interesting results.
>
> These results use https://github.com/okeuday/erlbench which includes a copy of the source code at https://github.com/okeuday/quickrand :
>
> TEST pseudo_randomness
> N == 10000 (10 runs)
> 18_bxor_abs get: 1612.7 us ( 1.3)
> 18_erlang:system_tim get: 1254.1 us ( 1.0)
> 18_monotonic get: 1372.5 us ( 1.1)
> 18_os:system_time/1 get: 1221.7 us ( 1.0)
> 19_os:perf_counter/1 get: 3752.2 us ( 3.1)
> 20_rand:normal get: 6832.0 us ( 5.6)
> 20_rand_exrop get: 3949.3 us ( 3.2)
> 20_rand_exs1024s get: 12073.3 us ( 9.9)
> 20_rand_exsp get: 3390.4 us ( 2.8)
> os:timestamp/0 get: 1392.3 us ( 1.1)
> os_time:perf_counter get: 4109.4 us ( 3.4)
> quickrand_c:floatR/0 get: 5776.0 us ( 4.7)
> quickrand_c:floatR/1 get: 5704.3 us ( 4.7)
> quickrand_c:uni/1 get: 4015.2 us ( 3.3)
> quickrand_c:uni/2 get: 3960.7 us ( 3.2)
> quickrand_c_normal/2 get: 9329.5 us ( 7.6)
> quickrand_c_normal/3 get: 8917.7 us ( 7.3)
> random_wh06_int:unif get: 10777.5 us ( 8.8)
> random_wh82:uniform/ get: 4750.0 us ( 3.9)
> random_wh82_int:unif get: 4866.4 us ( 4.0)
>
> The function names that are relevant for a normal distribution are:
> 20_rand:normal -> rand:normal/0 (when using rand:seed(exsp, _))
> 20_rand_exsp -> rand:uniform/1 (when using rand:seed(exsp, _))
> quickrand_c:floatR/0 -> quickrand_cache:floatR/0
> quickrand_c:floatR/1 -> quickrand_cache:floatR/1
> quickrand_c_normal/2 -> quickrand_cache_normal:box_muller/2
> quickrand_c_normal/3 -> quickrand_cache_normal:box_muller/3
>
> The rand module exsp algorithm was used here because it is the fastest pseudo-random number generator in the rand module.
Thank you for sharing these numbers!
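(Side note for anyone reproducing this: selecting that algorithm is just a
matter of how you seed, e.g. something like

    _ = rand:seed(exsp),       %% implicit state in the process dictionary
    S0 = rand:seed_s(exsp),    %% or an explicit state to thread through calls

assuming OTP 20, where the 'exsp' algorithm name is available.)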
>
> A rough look at the latency associated with the normal distribution method, ignoring the latency of the random number source, is:
> rand:normal/0
> 3441.6 us = 6832.0 us - (rand:uniform/1 3390.4 us)
Should not the base value come from rand:uniform/0 instead? I know the
difference is not big - rand_SUITE:measure/1 suggests 3% - but it also
suggests that rand:normal/0 is about 50% slower than rand:uniform/0, while
your numbers suggest 100% slower. Slightly strange...
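For what it is worth, a quick way to compare the two in isolation would be
something like this (my own throwaway sketch, untested, N chosen arbitrarily,
seeded with exsp to match your setup):

    N = 1000000,
    _ = rand:seed(exsp),
    L = lists:seq(1, N),
    {TU, _} = timer:tc(fun() -> lists:foreach(fun(_) -> rand:uniform() end, L) end),
    {TN, _} = timer:tc(fun() -> lists:foreach(fun(_) -> rand:normal() end, L) end),
    io:format("uniform/0: ~w us  normal/0: ~w us~n", [TU, TN]).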
> quickrand_cache_normal:box_muller/2
> 3553.5 us = 9329.5 us - (quickrand_cache:floatR/0 5776.0 us)
> quickrand_cache_normal:box_muller/3
> 3213.4 us = 8917.7 us - (quickrand_cache:floatR/1 5704.3 us)
It is really interesting to see that the calls to the 'math' module
do not slow that algorithm down very much (hardly noticeable)!
>
> So, this helps to show that the latency of both methods is very similar if you ignore the random number generation. However, it likely requires some explanation: The quickrand_cache module is what I am using here for random number generation; it stores cached data from crypto:strong_rand_bytes/1 with a default cache size of 64KB. The difference between quickrand_cache_normal:box_muller/2 and quickrand_cache_normal:box_muller/3 is that the first uses the process dictionary while the second uses a state variable. With the large amount of cached random data, the latency of individual calls to crypto:strong_rand_bytes/1 is avoided at the cost of the extra memory consumption, and the cache makes the speed of random number generation similar to that of the pseudo-random number generation in the rand module.
We should add a 'rand' plugin to the 'crypto' module that does this
buffered crypto:strong_rand_bytes/1 trick. There is something like that
in rand_SUITE, but we should really have an official one.
I also wonder where the sweet spot is. 64 KB seems like a lot of buffer.
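Something along these lines, perhaps (a throwaway sketch only, nothing like an
official API; it also uses a plain rem, so a real implementation would need
rejection sampling to avoid modulo bias):

    -module(crypto_buf).
    -export([new/1, uniform/2]).

    %% Fill the buffer with BufBytes bytes of strong randomness.
    new(BufBytes) when is_integer(BufBytes), BufBytes >= 8 ->
        {crypto:strong_rand_bytes(BufBytes), BufBytes}.

    %% Uniform integer in 1..Range, served from the buffer and
    %% refilled from crypto when fewer than 8 bytes remain.
    uniform(Range, {Bin, BufBytes}) when byte_size(Bin) < 8 ->
        uniform(Range, {crypto:strong_rand_bytes(BufBytes), BufBytes});
    uniform(Range, {<<V:64, Rest/binary>>, BufBytes}) ->
        {V rem Range + 1, {Rest, BufBytes}}.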
>
> In CloudI, I instead use quickrand_normal:box_muller/2, avoiding the cached data to keep the memory use minimal (the use-case there doesn't require avoiding the latency of crypto:strong_rand_bytes/1, because it is adding latency for testing (at https://github.com/CloudI/cloudi_core/blob/299df02e6d22103415c8ba14379e90ca8c3d3b82/src/cloudi_core_i_runtime_testing.erl#L138), and it is best to use a cryptographic random source to keep the functionality widely applicable). However, the same function calls occur in the quickrand Box-Muller transformation source code, so the overhead is the same.
>
> I used Erlang/OTP 20.0 (without HiPE) using the hardware below:
> Core i7 2670QM 2.2GHz 1 cpu, 4 cores/cpu, 2 hts/core
> L2:4×256KB L3:6MB RAM:8GB:DDR3-1333MHz
> Sandy Bridge-HE-4 (Socket G2)
>
> Best Regards,
> Michael
Best regards
--
/ Raimo Niskanen, Erlang/OTP, Ericsson AB