Binary, List and Tuple Inequalities (Paradox?)

Fri Nov 1 12:36:31 CET 2019

Historical note: back in, oh, 1978 or 1979, I came across a PL/I
dialect which compared
strings X Y thus:
  if X is longer than Y,  X > Y
  if X is shorter than Y,  X < Y
  otherwise, compare the characters.
It was rather disconcerting to find that 'KETTLE' > 'CAT'.
However, it was also disconcerting to realise that the normal PL/I
rules are equally quirky:
  "If the strings are not the same length, pad the shorter on the
right with blanks,
   and then compare."
This means, for example, that you can find strings X Y such that X > X||Y
(where || is the string concatenation operator in PL/I).
Fortran does the same as PL/I here: first pad with blanks to the same
length, then compare.
And it has the same rather confusing result.
It turned out that the dialect I mentioned at the start had changed
the rules to get
something easier to reason about.  Of course, lexicographic order
would have been
easy to reason about too...

I built a string library once that stored strings in big-endian order
even on little-endian
machines, and word-aligned them, so that you could compare 4 (or 8)
bytes at a time.
It turned out not to be as much faster as I had hoped.

On Sat, 26 Oct 2019 at 03:37, Valentin Micic <v@REDACTED> wrote:
>
> Hi all,
>
> Consider the following inequalities:
>
> (tsdb_1_1@REDACTED)2876> <<0,0,1>> < <<1,0,0>>.
> true
>
> As well as:
>
> (tsdb_1_1@REDACTED)2878> [0,0,1] < [1,0,0].
> true
>
> The result (true) makes sense — in both cases operands are of the same size, and the first (leftmost) element value of the left operand is less than  first (again, leftmost) element value of the right operand.
>
> However:
>
> (tsdb_1_1@REDACTED)2877> <<0,0,1>> < <<1,0>>.
> true
>
> and
>
> (tsdb_1_1@REDACTED)2879> [0,0,1] < [1,0].
> true
>
> This indicates that the actual length of the operands are not considered, for clearly, three octets cannot be less then two, right?
>
> This becomes even more confusing if you check how tuples are compared given the similar test-case:
>
> (tsdb_1_1@REDACTED)2886> {0,0,1} < {1,0,0}.
> true
> (tsdb_1_1@REDACTED)2887> {0,0,1} < {1,0}.
> false
>
> Here, the number of elements in the tuple are clearly taken into the consideration.
>
> Then, if one converts all three pairs of operands to its (external) binary representation, e.g.
>
> Binary:
>
> (tsdb_1_1@REDACTED)2916> term_to_binary( <<0,0,1>> ).
> <<131,109,0,0,0,3,0,0,1>>
> (tsdb_1_1@REDACTED)2917> term_to_binary( <<1,0>> ).
> <<131,109,0,0,0,2,1,0>>
>
> List:
>
> tsdb_1_1@REDACTED)2918> term_to_binary( [0,0,1] ).
> <<131,107,0,3,0,0,1>>
> (tsdb_1_1@REDACTED)2919> term_to_binary( [1,0] ).
> <<131,107,0,2,1,0>>
>
> Tuple:
> (tsdb_1_1@REDACTED)2920> term_to_binary( {0,0,1} ).
> <<131,104,3,97,0,97,0,97,1>>
> (tsdb_1_1@REDACTED)2921> term_to_binary( {1,0} ).
> <<131,104,2,97,1,97,0>>
>
> One could see that the number of “elements” are known and available for comparison (I’ve highlighted this number using bold, yellow font) for all three situations.
>
> And yet, the different approaches between binary and lists on one side, and tuples, on another, appear to be deliberate as binary and list types are following the approach used by C standard library memcmp function, whereas tuples do not, e.g.
>
> int memcmp(const void *s1, const void *s2, size_t n);
>
> In other words, operands are reduced to some common size ’n', and then compared as if they were equal in length.
>
> I do understand this restriction in C -- if not for ’n’, the argument sizes are not known, and without it the execution may end up in a catastrophic failure (e.g. segment violation, hmm.. which can still happen anyway if ’n’ is inadequate).
>
> So, THE QUESTION:
>
> Why would it be wrong to consider inequalities for binary() data types the same way we do it for tuples — number of “elements” first and  “numeric values” of individual elements — second?
>
> In my view, this is not only not wrong, but it would be more logical, and this is why.
>
> Look at the following comparison:
>
> (tsdb_1_1@REDACTED)2957> <<0,0,1>> < <<0,0,1,0>>.
> true
>
>
> Well, this make sense, for less-is-less (e.g. three octets are less than four octets) and even if one considers  <<0,0,1>> as a 24-bit integer value, and <<0,0,1,0>> as a 32-bit integer, one may “normalise” the 24-bit integer to its 32-bit equivalent by adding a leading zero, and the comparison expression would still return true:
>
> (tsdb_1_1@REDACTED)2961> <<0,0,0,1>> < <<0,0,1,0>>.
> true
>
> Therefore, as expected, 1 <  256, and, thus, we may be forgiven if we have the same expectation when we write:
>
> (tsdb_1_1@REDACTED)2958> <<1>> < <<0,0,1,0>>.
>
> Right? Well, not so, because:
>
> (tsdb_1_1@REDACTED)2958> <<1>> < <<0,0,1,0>>.
> false
>
> Thus, "less is more", and therefore 1 appears to be greater-than-or-equal-to  256. Somehow.
>
> This could never happen if one considered a number of elements (in this case number of octets) before attempting to compare the individual element values.
>
> But wait, the "less-is-more" may easily become its opposite — “more is less”, because:
>
> (tsdb_1_1@REDACTED)2973> <<32, 97>> < <<97>>.
> true
>
> Yes, this approach may help us to sort a list of, say,  surnames alphabetically, so we can have all the “Ms” before the “Ns”, regardless of how long the surname is, but is this actually logical for two arbitrary binaries? Seeing that we actually care about number of elements when we deal with tuples, why not extend the same approach to arbitrary binary values?
>
> Why do we presume that binaries are given as STRINGS when we compare them? When it comes to lists, the situation is even worse, because lists do, but then do not follow the STRING comparison semantic:
>
> (tsdb_1_1@REDACTED)2991> [32, 97] < [97].
> true
>
> *** But! ***
>
> (tsdb_1_1@REDACTED)2992> [{32, 97}] < [97].
> false
>
> (tsdb_1_1@REDACTED)2993> [{32, 97}] < [{97}].
> false
>
> And just to be clear, I do not expect Erlang to change. Nor do I care about how lists are handled (well, at least not for the moment).
>
> My main concern is semantic behind comparing two arbitrary binary() operands. For one of my projects I’d like to employ tuple-like semantic (.e.g. number of octets first, values later) when comparing two arbitrary binaries and thus avoid “less-is-more” and “more-is-less” paradoxes.
>
> Is there any reason why I shouldn’t? (Okay, it may be a bit slower, but one may argue this approach may be even a bit faster for a long-enough binaries. But what else?)
>
> At any rate, I would appreciate your reasoning on the topic. What resonates better with you?
>
>
> Thanks in advance.
>
> V/
>
>