[erlang-questions] cookbook entry #1 - unicode/UTF-8 strings

Thu Oct 20 15:26:15 CEST 2011

On Thu, Oct 20, 2011 at 2:54 PM, Fred Hebert <mononcqc@REDACTED> wrote:
> No, list_to_binary and iolist_to_binary are not considered harmful.

But they are dangerous. list_to_binary can fail when its argument is
not a list of integers in the range 0..255, whereas binary_to_list always
works. The round trip through term_to_binary and its inverse
binary_to_term can never fail
(and I know about the danger of doing this with Pids, Refs and funs ...)
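
To make the failure mode concrete, here is a minimal sketch. The module
and function names (bytes_demo, safe_list_to_binary, roundtrip) are
made up for illustration; only list_to_binary/1, term_to_binary/1 and
binary_to_term/1 are real BIFs.

```erlang
-module(bytes_demo).
-export([safe_list_to_binary/1, roundtrip/1]).

%% Hypothetical helper: wrap list_to_binary/1 so that an argument
%% containing anything other than bytes (0..255), binaries or nested
%% lists of those yields a tagged error instead of raising badarg.
safe_list_to_binary(List) ->
    try list_to_binary(List) of
        Bin -> {ok, Bin}
    catch
        error:badarg -> {error, not_bytes}
    end.

%% Serialize any term and read it back; the result equals the input.
roundtrip(Term) ->
    binary_to_term(term_to_binary(Term)).
```

With this, bytes_demo:safe_list_to_binary("10$") gives {ok,<<"10$">>},
bytes_demo:safe_list_to_binary([49,48,8364]) gives {error,not_bytes},
and bytes_demo:roundtrip([49,48,8364]) gives back [49,48,8364].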

>
> The problem is that they both implicitly convert integers on the range
> 0..255 inclusively. This works based on the definition that iolists only
> contain bytes or binaries. list_to_binary will accept all iolists (deeply
> nested lists of bytes (0..255) and binaries) and iolist_to_binary will
> accept both iolists or flat binaries.
> They were apparently meant to convert between lists and binaries of bytes,
> not lists or binaries of arbitrary large (or negative) numbers. Because we
> Erlang programmers have been so relying on the idea of lists as strings,
> using ASCII and most Latin1 sequences of bytes made for fine conversions to
> binaries and nobody ever had a problem.
> Unicode standards (and their respective UTF encodings) break this assumption
> that strings most of the time only contain ASCII and Latin1 integers for
> their codepoints. This is why you need the unicode module's conversion
> there.
> The same is then true of binary_to_list. The trouble there is that the
> binary representation (in bytes) of unicode strings doesn't match the list
> representation of unicode as accepted by ~ts printing and whatnot. The raw
> bytes representation isn't good enough, and that's what binary_to_list gives
> you. Again, the unicode module is clever enough to handle that.
> iolist_to_binary, list_to_binary and binary_to_list are fine when you know
> they're meant to be used for bytes, not arbitrary data.
> What's hurting Erlang more, I think, is the lack of Unicode algorithms as
> described by Michael Uvarov. If we have a unicode string, we currently can't
> get its length (in terms of graphemes, or 'characters to the human mind')

I don't understand the terminology here. What is "a unicode string?"

In Erlang "string" means "list of integers"

So when you write "a unicode string" my brain interprets this as a "a
list of unicode codepoints"

So the "unicode string" for "10(euros)" is the erlang list [49,48,8364]
and the length of this list is three (ie the number of 'characters in my mind')
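
As a small sketch of the difference between codepoints and encoded
bytes, the following uses the unicode module from stdlib. The module
name euro_demo and its function names are only illustrative.

```erlang
-module(euro_demo).
-export([codepoints/0, utf8_bytes/0, decode/1]).

%% "10(euros)" as a list of Unicode codepoints.
codepoints() ->
    [49, 48, 8364].

%% The same text encoded as UTF-8, as a binary of bytes.
utf8_bytes() ->
    unicode:characters_to_binary(codepoints(), unicode, utf8).

%% Decode UTF-8 bytes back into a list of codepoints.
decode(Bin) when is_binary(Bin) ->
    unicode:characters_to_list(Bin, utf8).
```

Here length(euro_demo:codepoints()) is 3, while euro_demo:utf8_bytes()
is <<49,48,226,130,172>>, which is 5 bytes. binary_to_list/1 on that
binary returns the raw bytes [49,48,226,130,172], whereas
euro_demo:decode/1 returns [49,48,8364] again.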

The problem is that if I just write X = [N1,N2,N3,....] where all the
N's are in 0..255, I cannot see at a glance whether this is supposed to
represent (say) an ascii string or a UTF-8 encoded string of unicode
codepoints. There is no way of knowing. So I must add some wrapper, ie

      X1 = {ascii, "abc"} or

      X2 = {utf8encoded,unicode,[49,48,226,130,172]}

      X3 = {unicode, "10\x{20ac}"} = {unicode, [49,48,8364]}

So now I know that the string in X1 is an "ascii string" (unencoded),
the string in X2 is the utf8 encoding of the unicode codepoints
[49,48,8364], and the list in X3 is what I call a "unicode string".

It's just a matter of defining a convention and sticking to it. The
alternative would be to make a new string type (a list with some
invisible bits), but this would be rather complicated and probably not
worth the effort.
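
One way such a convention might look in code is sketched below. The
module tagged_string, the function to_utf8/1 and the error atoms are
assumptions made for this example, not an existing library; the tags
are the ones used in X1, X2 and X3.

```erlang
-module(tagged_string).
-export([to_utf8/1]).

%% Hypothetical convention, following the tags used in the text:
%%   {ascii, Chars}                 integers in 0..127
%%   {unicode, Codepoints}          a list of Unicode codepoints
%%   {utf8encoded, unicode, Bytes}  a list of bytes that is already UTF-8
%% Each clause returns {ok, Utf8Binary} or {error, Reason}.
to_utf8({ascii, Chars}) ->
    IsAscii = fun(C) -> is_integer(C) andalso C >= 0 andalso C =< 127 end,
    case lists:all(IsAscii, Chars) of
        true  -> {ok, list_to_binary(Chars)};
        false -> {error, not_ascii}
    end;
to_utf8({unicode, Codepoints}) ->
    %% characters_to_binary/3 returns a binary on success and an
    %% error or incomplete tuple otherwise.
    case unicode:characters_to_binary(Codepoints, unicode, utf8) of
        Bin when is_binary(Bin) -> {ok, Bin};
        _Other -> {error, bad_codepoints}
    end;
to_utf8({utf8encoded, unicode, Bytes}) ->
    %% list_to_binary/1 raises badarg if Bytes is not a list of bytes.
    Bin = list_to_binary(Bytes),
    %% Validate the bytes by decoding them; a list means valid UTF-8.
    case unicode:characters_to_list(Bin, utf8) of
        List when is_list(List) -> {ok, Bin};
        _Other -> {error, bad_utf8}
    end.
```

Both tagged_string:to_utf8({unicode, [49,48,8364]}) and
tagged_string:to_utf8({utf8encoded, unicode, [49,48,226,130,172]})
give {ok,<<49,48,226,130,172>>}, while
tagged_string:to_utf8({ascii, [49,48,8364]}) gives {error,not_ascii}.
The tag tells the function which check to apply, which is exactly the
information a bare list of small integers does not carry.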

/Joe

> in
> any reliable way. We also can't specify locales, can't do casing and
> whatnot, etc. Ideally those would be the next step for Erlang's unicode
> support, I think.
> On Thu, Oct 20, 2011 at 4:23 AM, Joe Armstrong <erlang@REDACTED> wrote:
>>
>> Interesting comment: this is almost where I could write an article with
>> the title "list_to_binary considered harmful". I guess if Erlang is
>> serializing terms to be stored on disk etc. term_to_binary and its
>> inverse should be used. list_to_binary seems to imply that you are going
>> to send something to the outside world, and then you should stop and
>> think hard. This is because there is no universal agreement in the
>> outside world as to what an integer is (ie is it bounded or not).
>> Fixing a notion of an integer to something in the range 0..255 allows
>> communication of integers, but requires a framing protocol (ie UTF8, or
>> ASN.1) that tells how integers are encoded - but this is out of band.
>>
>> The problem is that I might write
>>
>>    X1 = "10$"    (10 dollars) or
>>    X2 = "10\x{20ac}"  (10 euros)
>>
>> Now list_to_binary(X1) will succeed but list_to_binary(X2) will fail
>>
>> So maybe I should write
>>
>>    X1 = {ascii, "10$"}
>>    X2 = {unicode,"10\x{20ac}"}
>>
>> If the libraries were written this way then life might be easier
>>
>> /Joe
>>
>
>