[erlang-questions] cookbook entry #1 - unicode/UTF-8 strings

Fred Hebert mononcqc@REDACTED
Thu Oct 20 14:54:19 CEST 2011


No, list_to_binary and iolist_to_binary are *not* considered harmful.

The problem is that they both only handle integers in the range 0..255,
inclusive, treating each one as a byte. This follows from the definition of
iolists, which contain only bytes and binaries. list_to_binary will accept
all iolists (deeply nested lists of bytes (0..255) and binaries) and
iolist_to_binary will accept both iolists and flat binaries.
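
As a rough sketch of that behaviour (the module name bytes_demo and the
helper try_convert/1 are made up for illustration, they are not part of OTP),
wrapping list_to_binary/1 shows which inputs it takes and which it rejects:

```erlang
-module(bytes_demo).
-export([try_convert/1]).

%% Hypothetical helper: returns {ok, Binary} when the argument is a valid
%% iolist (nested lists of bytes 0..255 and binaries), and {error, badarg}
%% when list_to_binary/1 rejects it.
try_convert(IoList) ->
    try list_to_binary(IoList) of
        Bin -> {ok, Bin}
    catch
        error:badarg -> {error, badarg}
    end.
```

With this, bytes_demo:try_convert(["10", [$$], <<" ok">>]) gives
{ok, <<"10$ ok">>}, while bytes_demo:try_convert([$1, $0, 16#20AC]) gives
{error, badarg}, because 16#20AC (the euro sign) does not fit in a byte.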

They were apparently meant to convert between lists and binaries of bytes,
not lists or binaries of arbitrarily large (or negative) numbers. Because we
Erlang programmers have relied so heavily on the idea of lists as strings,
and because ASCII and most Latin1 strings are just sequences of bytes, the
conversions to binaries worked fine and nobody ever had a problem.

Unicode (and its UTF encodings) breaks the assumption that strings contain,
most of the time, only ASCII and Latin1 integers for their codepoints. This
is why you need the unicode module's conversion functions there.
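
A minimal sketch of that conversion, assuming the target encoding is UTF-8
(the default for unicode:characters_to_binary/1). The module name utf8_demo
and the wrapper encode/1 are hypothetical; only the unicode module call is
real:

```erlang
-module(utf8_demo).
-export([encode/1]).

%% Hypothetical wrapper: encodes a list of Unicode codepoints as a UTF-8
%% binary. unicode:characters_to_binary/1 returns a binary on success, or an
%% {error, ...} / {incomplete, ...} tuple on bad input, which is passed
%% through unchanged.
encode(Chars) ->
    case unicode:characters_to_binary(Chars) of
        Bin when is_binary(Bin) -> {ok, Bin};
        {error, _Converted, _Rest} = Err -> Err;
        {incomplete, _Converted, _Rest} = Inc -> Inc
    end.
```

Here utf8_demo:encode([$1, $0, 16#20AC]) gives {ok, <<49,48,226,130,172>>}:
the euro sign becomes three bytes. Note that Latin1 characters above 127 are
also re-encoded, so utf8_demo:encode([16#E9]) gives {ok, <<195,169>>} where
list_to_binary([16#E9]) would give <<233>>.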

The same is then true of binary_to_list. The trouble there is that the
binary representation (in bytes) of unicode strings doesn't match the list
representation of unicode (one integer per codepoint) as accepted by ~ts
printing and whatnot. The raw bytes representation isn't good enough, and
that's what binary_to_list gives you. Again, the unicode module is clever
enough to handle that.
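
To make the difference concrete, here is a small sketch (decode_demo and
compare/1 are made-up names) that decodes the same UTF-8 binary both ways:

```erlang
-module(decode_demo).
-export([compare/1]).

%% Hypothetical helper: returns {RawBytes, Codepoints} for a UTF-8 binary.
%% binary_to_list/1 yields one integer per byte, while
%% unicode:characters_to_list/1 decodes UTF-8 into one integer per codepoint.
compare(Utf8Bin) when is_binary(Utf8Bin) ->
    {binary_to_list(Utf8Bin), unicode:characters_to_list(Utf8Bin)}.
```

For the euro sign, decode_demo:compare(<<226,130,172>>) gives
{[226,130,172], [8364]}. Only the second list is what ~ts expects.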

iolist_to_binary, list_to_binary and binary_to_list are fine when you know
they're meant to be used for bytes, not arbitrary data.

What's hurting Erlang more, I think, is the lack of Unicode algorithms as
described by Michael Uvarov. If we have a unicode string, we currently can't
get its length (in terms of graphemes, or 'characters to the human mind') in
any reliable way. We also can't specify locales, can't do case conversion,
and so on. Ideally those would be the next step for Erlang's unicode
support, I think.
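
A small illustration of the length problem, as a sketch only (length_demo
and codepoint_lengths/0 are hypothetical names). The same visible character
can be one codepoint or two, so length/1 on a codepoint list does not count
what a human would call characters:

```erlang
-module(length_demo).
-export([codepoint_lengths/0]).

%% Both lists below are displayed as a single "e with acute accent",
%% but they contain a different number of codepoints.
codepoint_lengths() ->
    Precomposed = [16#E9],       % U+00E9, precomposed form
    Decomposed  = [$e, 16#301],  % "e" followed by U+0301 combining acute
    {length(Precomposed), length(Decomposed)}.
```

length_demo:codepoint_lengths() gives {1, 2}, even though both strings are a
single grapheme.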

On Thu, Oct 20, 2011 at 4:23 AM, Joe Armstrong <erlang@REDACTED> wrote:

> Interesting comment: this is almost where I could write an article with the
> title "list_to_binary considered harmful" - I guess if Erlang is
> serializing terms to be stored on disk etc. term_to_binary and its inverse
> should be used. list_to_binary seems to imply that you are going to send
> something to the outside world - and then you should stop and think hard,
> this is because there is no universal agreement in the outside world as to
> what an integer is (ie is it bounded or not) fixing a notion of an integer
> to something in the range 0..255 allows communication of integers, but
> requires a framing protocol (ie UTF8, or ASN.1) that tells how integers are
> encoded - but this is out of band.
>
> The problem is that I might write
>
>    X1 = "10$"    (10 dollars) or
>    X2 = "10\x{20ac}"  (10 euros)
>
> Now list_to_binary(X1) will succeed but list_to_binary(X2) will fail
>
> So maybe I should write
>
>    X1 = {ansii, "10$"}
>    X2 = {unicode,"10\x{20ac}"}
>
> If the libraries were written this way then life might be easier
>
> /Joe
>
>