[erlang-questions] String versus variable in binary literal

Wed May 16 14:01:00 CEST 2012

On 05/16/2012 12:29 PM, Joe Armstrong wrote:
> On Wed, May 16, 2012 at 10:56 AM, Richard Carlsson
> <carlsson.richard@REDACTED>  wrote:
>> The bit syntax doesn't (currently) support encoding strings that are not
>> constant literals. This is something that should be fixed, IMO.
>
> It's a bug (or should be a bug) - try this for size:
>
>>   <<1223232321111,3476824682351,18368119>>.
> <<"Wow">>
>
> Isn't that beautiful :-)

Well, yes, but there is nothing strange about it. The default size for 
integer segments is 8 bits, and values are truncated to fit the 
requested size:

   1> <<1223232321111:8,3476824682351:8,18368119:8>>.
   <<"Wow">>
   2> <<1223232321111:16,3476824682351,18368119>>.
   <<"öWow">>
   3> <<1223232321111:32,3476824682351:32,18368119:32>>.
   <<206,83,246,87,130,230,111,111,1,24,70,119>>
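
The truncation above is ordinary masking: an integer segment of N bits
keeps only the N least significant bits of the value. A minimal sketch
of that arithmetic follows; the module and function names (trunc_demo,
low_bits/2, same_as_bit_syntax/2) are made up for illustration and are
not part of OTP.

```erlang
-module(trunc_demo).
-export([low_bits/2, same_as_bit_syntax/2]).

%% Keep only the lowest Bits bits of N. This is the value that ends up
%% stored when N is written into an integer segment of that size.
low_bits(N, Bits) when is_integer(N), is_integer(Bits), Bits >= 0 ->
    N band ((1 bsl Bits) - 1).

%% Build a segment with the bit syntax, read it back as an unsigned
%% integer, and compare with the masked value.
same_as_bit_syntax(N, Bits) ->
    <<Stored:Bits>> = <<N:Bits>>,
    Stored =:= low_bits(N, Bits).
```

For example, trunc_demo:low_bits(1223232321111, 8) gives 87, which is
$W, the first byte of <<"Wow">>.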

What tends to surprise people is that the default segment type is 
integer, even when the given value is a constant of some other type. 
If you add the correct type specifier, it works:

   1> << <<1,2>> >>.
   ** exception error: bad argument
   2> << <<1,2>>/binary >>.
   <<1,2>>
   3> << 3.14 >>.
   ** exception error: bad argument
   4> << 3.14/float >>.
   <<64,9,30,184,81,235,133,31>>

In fact, <<"abc">> works only because it is treated as special 
notation for a sequence of integers: << $a, $b, $c >> = <<"abc">>. But 
you're not allowed to write it as << [$a,$b,$c] >>, even though 
[$a,$b,$c] = "abc".
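
When the string is only known at run time, the usual workarounds are
a binary comprehension (one default integer segment per element, so it
truncates just like the literal form) or list_to_binary/1 (which
raises badarg instead of truncating). A small sketch; the module name
dyn_string and its function names are hypothetical.

```erlang
-module(dyn_string).
-export([truncating/1, strict/1]).

%% Encode each list element as a default 8-bit integer segment, the
%% same way <<"abc">> treats a literal. Codes above 255 are silently
%% truncated to their low byte.
truncating(String) when is_list(String) ->
    << <<C>> || C <- String >>.

%% list_to_binary/1 accepts an iolist and raises badarg for any
%% integer outside 0..255 rather than truncating it.
strict(String) ->
    list_to_binary(String).
```

Both return <<"abc">> for [$a,$b,$c]; they differ only in how they
treat elements that do not fit in a byte.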

Nowadays there is some extra syntax for UTF-<N> encodings:

   1> <<"åäö"/utf8>>.
   <<"Ã¥Ã¤Ã¶">>
   2> <<"åäö"/utf16>>.
   <<0,229,0,228,0,246>>
   3> <<"åäö"/utf32>>.
   <<0,0,0,229,0,0,0,228,0,0,0,246>>
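
The utf8, utf16 and utf32 types are defined on single integer code
points; the string-literal form is shorthand for one such segment per
character. A sketch that spells this out with binary comprehensions
over a list of code points (utf_demo and its functions are
illustrative names, not an existing API):

```erlang
-module(utf_demo).
-export([utf8/1, utf16/1, utf32/1]).

%% One UTF-encoded segment per code point in the list.
%% utf16 and utf32 default to big-endian byte order.
utf8(CodePoints)  -> << <<C/utf8>>  || C <- CodePoints >>.
utf16(CodePoints) -> << <<C/utf16>> || C <- CodePoints >>.
utf32(CodePoints) -> << <<C/utf32>> || C <- CodePoints >>.
```

With the code points of "åäö", utf_demo:utf8([229,228,246]) yields
<<195,165,195,164,195,182>>. Those are the bytes that the shell above
printed as the Latin-1 characters "Ã¥Ã¤Ã¶".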

However, <<String/utf8>> doesn't work if String is a variable. This 
could be fixed. But since <<String>> is interpreted as expecting String 
to be an integer, there would still be no easy way to insert a normal 
Latin-1 string dynamically into a binary, even if <<String/utf8>> were 
made to work. I would suggest adding the type specifiers 'latin1' and 
'ascii' for this purpose, where 'latin1' would accept only character 
codes 0-255 (no truncation), and 'ascii' would accept only codes 0-127 
(good for ensuring that HTTP headers and similar things are 7-bit 
clean). While you're at it, String should be allowed to be a chardata(), 
just as in the unicode module, not just a flat list of chars.
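
Until something like that exists, the proposed behaviour can be
approximated with ordinary functions. The sketch below is only a
stand-in for the suggested 'latin1' and 'ascii' specifiers: the module
strict_enc and its functions are hypothetical, and it relies on
unicode:characters_to_binary/3 returning an error tuple for characters
that cannot be represented in Latin-1.

```erlang
-module(strict_enc).
-export([latin1/1, ascii/1]).

%% Accept chardata() and return a Latin-1 binary, failing with badarg
%% (no truncation) if any character is outside 0..255.
%% Note: binaries inside the chardata are read as UTF-8 here.
latin1(Chardata) ->
    case unicode:characters_to_binary(Chardata, unicode, latin1) of
        Bin when is_binary(Bin) -> Bin;
        _ErrorOrIncomplete -> erlang:error(badarg, [Chardata])
    end.

%% Same as latin1/1, but additionally reject any byte above 127.
ascii(Chardata) ->
    Bin = latin1(Chardata),
    case << <<B>> || <<B>> <= Bin, B > 127 >> of
        <<>> -> Bin;
        _NonAscii -> erlang:error(badarg, [Chardata])
    end.
```

For instance, strict_enc:ascii("Host") returns <<"Host">>, while
strict_enc:ascii([229]) and strict_enc:latin1([256]) both fail with
badarg.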

Oh, and atoms should be allowed in chardata() and iolist(). I think 
that's about it.

    /Richard