[erlang-questions] On OTP rand module difference between OTP 19 and OTP 20

Fri Sep 1 14:53:38 CEST 2017

On Fri, Sep 01, 2017 at 09:36:37AM +0000, Hugo Mills wrote:
> On Fri, Sep 01, 2017 at 10:41:15AM +0200, Raimo Niskanen wrote:
> > On Thu, Aug 31, 2017 at 10:29:34PM -0700, Michael Truog wrote:
> > > On 08/31/2017 07:57 PM, Richard A. O'Keefe wrote:
> > > >
> > > > On 1/09/17 6:35 AM, Michael Truog wrote:
> > > >> As I argued in the original pull request for these recent 20.0 random
> > > >> number changes, a uniform distribution is much more intuitive if it is
> > > >> inclusive: [0,1]
> > > >
> > > > Intuitive?  For integers, I'll grant that.  For reals?  Not so much.
> > > > I certainly cannot grant "MUCH more intuitive".
> > > > I've had occasion to do (random next * 2 - 1) arcTanh, which of
> > > > course breaks down if *either* 0 or 1 is returned by (random next).
> > > 
> > > A uniform distribution should be uniformly distributed.  I understand the woes of floating-point prevent perfect uniform distribution, but we could at least try to pay attention to the limits involved, and if we did, that would make the idea much more intuitive.
> > 
> > If I try to be philosophical, picking a random number in the range
> > 0.0 to 1.0 of real numbers, the probability of getting a number exactly 0.0
> > (or exactly 1.0) is infinitely low.  Therefore the range (0.0,1.0) is more
> > natural.
> 
>    Mathematically, there's a distinction. What you've just described
> is that in a random variable over the interval [0.0, 1.0], 0.0 and 1.0
> happen *almost never* (which is a very specific technical term), and
> that values in the open interval (0.0, 1.0) occur *almost surely*.
> 
>    Being discrete, the computer implementation based on floating point
> numbers ensures that the probability of getting 0.0 or 1.0 in that
> case is measurably non-zero, whereas in the ideal version over the
> reals, above, it is infinitesimally small. In that distinction lie
> most of the problems that people are talking about here, I think.

Precisely!

> 
> > > My belief is that the [0,1) distribution is the most common
> > > because it is the easiest to implement given the IEEE floating
> > > point standard format.  However, I would also like to be proven
> > > wrong, to have more faith in the current situation.
> 
> > I think that is very possible.
> 
>    From my relatively limited practical experience, either I've wanted
> [0, 1) or I don't care. Example:
> 
>    Red = int(random() * 256)
> 
> where I don't want the value 256, because it's out of range for my
> 8-bit graphics mode, but I do want the probability of 255 to be the
> same as every other value. So I want [0, 1) as my range.
> 
>    Alternatively:
> 
>    P = random(),
>    if
>       P =< 0.3 -> ...;
>       P =< 0.7 -> ...;
>       P > 0.7 -> ...
>    end
> 
> where, in general, I don't care if I could get 0.0 or 1.0 or not,
> because the differences are immeasurably small for all practical
> purposes.

Especially since decimal numbers below 1.0 have no exact representation as
IEEE floating point numbers.

> 
>    I think it's clear to me that _several_ functions are needed for
> different cases, with fully-closed, fully-open and half-open
> intervals. IMO, the fully-closed and half-open are probably the most
> useful (and, modulo any floating-point issues which I'm not qualified
> to talk about, [0,1) can be turned into (0,1] with
> 1-random_halfopen()).

There should be no issues in this case with the updated algorithms in
the 'rand' module that produce numbers on the form N * 2^-53, and due to
the fact that IEEE floating point arithmetics is defined to produce
correctly rounded results for the simple operations +, -, *, /, ...

> 
>    With a fully-closed interval, it should be possible to write
> helpers for generating the other three by simply calling
> random_closed() again if you get an undesirable end-point. You can't
> easily extend the range of the half-open or open intervals to give you
> the closed ones. So I'd say at minimum, there should be a function
> giving the closed interval.

The closed interval [0.0,1.0] may be the most general for the reasons you
mention.  Unfortunately it is cumbersome to implement "efficiently".

You would have to generate integers [0, 1+2^53] and to do that with a 58
bit generator (the default) you generate one number and have roughly a
1/32 chance of landing in the high interval where you need to retry.

The amortized impact of this is only about 3% slower, but the code has to
do all the testing for these corner cases which adds up to maybe 20% slower.

But once in a million it will take 4 full attempts or more to get a number,
which is rather annoying, and the worst case is infinity (but will never
happen ;-).

Therefore the half open interval [0.0,1.0) is probably the most useful one,
and the question is if the closed interval [0.0,1.0] (and the open interval
(0.0,1.0) is worth to implement...

> 
>     Whether the "test and retry" approach is the best implementation
> or not is a matter for discussion, as is the question of whether all

I have never heard of a better alternative than "test and retry" when you
are limiting the interval, except possibly when the probability for retry is
high (approaching 50%) - then you may consider to generate twice as many
bits and do a simple 'mod' operation for the result; the skew would be
impossible to notice.  The "test and retry" method is still expected to be
as fast or faster, amortized, but the "generate double and 'mod'" method
has more or less fixed execution time.

> or some of these functions should be in the standard lib, or they are
> expected to be hacked together on an as-needed basis.

Yes.  That is the question.

> 
>    Hugo.
> 
> -- 
> Hugo Mills             | Anyone who says their system is completely secure
> hugo@REDACTED carfax.org.uk | understands neither systems nor security.
> http://carfax.org.uk/  |
> PGP: E2AB1DE4          |                                        Bruce Schneier

-- 

/ Raimo Niskanen, Erlang/OTP, Ericsson AB