[erlang-questions] Strange difference between construction and matching of binaries

Wed Dec 23 16:24:27 CET 2015

While playing with a new testing tool for Erlang programs, we discovered 
a difference between construction and matching of binaries.  Although we 
understand it from an implementation point of view, we still find it 
weird enough to deserve at least some discussion here.

The simplest way of describing the difference between construction and 
matching of binaries is the following interaction with the Erlang shell:

=================================================================
Eshell V7.2.1  (abort with ^G)
1> <<42:7>> = <<42:7>>.
<<42:7>>
2> <<42:6>> = <<42:6>>.
<<42:6>>
3> <<42:5>> = <<42:5>>.
** exception error: no match of right hand side value <<10:5>>
=================================================================

For those who find the above surprising, it should be pointed out that 
the fine reference manual 
(http://www.erlang.org/doc/reference_manual/expressions.html#bit_syntax) 
contains the following note:

   When constructing binaries, if the size N of an integer segment is
   too small to contain the given integer, the most significant bits of
   the integer are silently discarded and only the N least significant
   bits are put into the binary.
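
As a minimal sketch of what that note implies (the module and function 
names below are made up for illustration, they are not part of OTP), 
constructing <<Value:N>> should behave like masking Value down to its N 
least significant bits first:

```erlang
-module(trunc_demo).
-export([build/2, check/0]).

%% Hypothetical helper: build an N-bit segment the way the note
%% describes, keeping only the N least significant bits of Value.
build(Value, N) ->
    Mask = (1 bsl N) - 1,
    <<(Value band Mask):N>>.

%% Every element of the returned list is expected to be true:
%% 42, 106 and 234 all have 01010 (decimal 10) as their low five bits.
check() ->
    [<<42:5>>  =:= build(42, 5),
     <<42:5>>  =:= <<10:5>>,
     <<106:5>> =:= <<10:5>>,
     <<234:5>> =:= <<10:5>>].
```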

So, the next line one may want to type in the shell could be:

=================================================================
4> <<42:5>> =:= <<234:5>>.
true
=================================================================

This may be a bit surprising but is fine in some sense.  The problem is 
that the fine reference manual nowhere explains what happens during 
matching when a segment contains either a concrete value (as in the 
examples above) or a variable bound to a value that does not fit in the 
size of the segment.  Judging from the examples above, the value is not 
truncated in matching as it is in construction: it appears to be 
compared as is with the integer extracted from the segment, so a value 
that does not fit can never match.
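
The bound-variable case can be observed with a small sketch like the 
following (match_demo, extract/1 and matches/2 are hypothetical names 
chosen for this example):

```erlang
-module(match_demo).
-export([extract/1, matches/2]).

%% Matching an unbound variable against a 5-bit segment yields the
%% stored value, which always lies in the range 0..31.
extract(<<Int:5>>) -> Int.

%% V is already bound here, so <<V:5>> in the pattern asks whether the
%% stored 5-bit value is equal to V. V is not truncated beforehand.
matches(V, Bin) ->
    case Bin of
        <<V:5>> -> true;
        _ -> false
    end.
```

With these definitions, match_demo:extract(<<42:5>>) returns 10, 
match_demo:matches(42, <<42:5>>) returns false, and 
match_demo:matches(10, <<42:5>>) returns true.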

Now, the problem with this difference between construction and matching 
of binaries containing values that do not fit in their segments is that 
it breaks many of the invariants that functional programmers (and their 
compilers!) expect to hold.  For example, the following clause heads are 
not all equivalent:

   foo(<<42:5>>) ->

   foo(<<Int:5>>) when Int =:= 42 ->

   foo(Bits) when Bits =:= <<42:5>> ->

and, perhaps surprisingly, only the third clause matches <<10:5>> (which 
is the same binary as <<42:5>>, <<106:5>>, <<234:5>>, ...).  I am 
willing to bet that many will find that this breaks the principle of 
least astonishment.
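
To try this out, the three clause heads can be put into separate 
functions, as in the sketch below.  The module and function names are 
invented for this example, and depending on the OTP release the compiler 
may warn that the first clause of literal/1 can never match.

```erlang
-module(clause_demo).
-export([literal/1, guarded/1, compared/1, run/0]).

%% Literal in the pattern: 42 is compared without truncation, so no
%% 5-bit binary can match this clause.
literal(<<42:5>>) -> true;
literal(_) -> false.

%% Int is extracted from five bits, so it lies in 0..31 and the
%% guard Int =:= 42 never succeeds.
guarded(<<Int:5>>) when Int =:= 42 -> true;
guarded(_) -> false.

%% Here <<42:5>> is a construction inside the guard, which truncates
%% it to <<10:5>> before the comparison.
compared(Bits) when Bits =:= <<42:5>> -> true;
compared(_) -> false.

%% Apply all three variants to the same binary.
run() ->
    Bin = <<42:5>>,
    {literal(Bin), guarded(Bin), compared(Bin)}.
```

Calling clause_demo:run() returns {false,false,true}.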

With this post, I want to initiate some discussion about the above in 
the hope that we can come up with better semantics and implementation 
for matching with bound binary segments than the current behavior.  (Or 
at least formally document this difference.)

Kostis