[erlang-questions] cookbook entry #1 - unicode/UTF-8 strings

Thu Oct 20 10:23:21 CEST 2011

On Wed, Oct 19, 2011 at 1:19 PM, Fred Hebert <mononcqc@REDACTED> wrote:
>
>
> On Wed, Oct 19, 2011 at 6:14 AM, Joe Armstrong <erlang@REDACTED> wrote:
>>
>> cookbook # 1 - draft 1
>>
>> <aside>
>>  We're going to write a cookbook.
>>
>>  This will be free (in an electronic version, PDF, epub)
>>  And you will be able to buy a paper version (POD)
>>
>>  The development model is
>>
>>  - a few authors
>>  - many reviewers (you are the reviewers)
>>    the reviewers report errors/suggest changes
>>    the authors make the changes
>
> That sounds neat.
>
>>
>>  We hope the POD version will generate some income, which
>>  will be split according to the contributions. Authors
>>  will be paid, as will reviewers whose suggestions are incorporated.
>>
>>  Payment (if we make a profit) will be in direct relation to the size
>> of the contribution
>>
>>  Expensive things like professional proofreading will be paid for by
>>  sponsorship, crowdsourcing, or some other form of financing.
>>
>>  To start the ball rolling I have some text below.
>>
>>  Please comment on this text. If your comments are accepted one day you
>> might get paid :-)
>>
>>  Note: 1) By commenting you are implicitly agreeing that if your comments
>> are accepted into the final text then you will be subject to the
>> licensing conditions of that text. The text will always be free and
>> open source.
>>
>> </aside>
>>
>> Cookbook Question:
>>
>> I have often seen the words "UTF-8 string" used in sentences like
>> "Java has UTF-8 strings". What does this mean when applied to Erlang?
>>
>> ----------------------------------------------------------------------
>>
>> Answer:
>>
>> In Erlang, strings are syntactic sugar for "lists of integers".
>>
>> Imagine the string "10(Euro)" - (Euro) is the glyph representing the
>> Euro currency symbol.
>>
>> The term "UTF-8 string" representing "10(Euro)" in Erlang could
>> mean one of two things:
>>
>>   Either a) [49,48,8364]           (i.e. it's a list of three Unicode
>> code points)
>>   Or     b) [49,48,226,130,172]    (i.e. it's the UTF-8 encoding of the
>>                                     Unicode characters)
>>
>> So the words "UTF-8 string" might mean a) or might mean b).
>
> No, it won't mean b). See this (using ~ts instead of ~s to specify we want
> unicode handling):
> 4> io:format("~ts~n",[<<49,48,226,130,172>>]).
> 10€
> ok
> 5> io:format("~ts~n",[[49,48,226,130,172]]).
> 10â‚¬
> ok
> 6> io:format("~ts~n",[[49,48,8364]]).
> 10€
> ok
> They are not the same thing.
> See http://ferd.ca/will-the-real-unicode-wrangler-please-stand-up.html for a
> bit of stuff I've written on it, although my terminology in that blog post
> is far from stellar and exact according to unicode standards.
> The gist of it is that binaries and strings (lists) do not represent
> Unicode text the same way. Lists hold full code points, one integer per
> character, and binaries hold the byte-based representation you show.
>
>>
>> Erlang folks have always said "unicode/UTF-8 is easy in Erlang
>> since strings are just lists of integers" - by this we mean that
>> Erlang programs should always manipulate strings given the type a)
>> interpretation. *all* library functions assume type a) encoding.
>
> Not always. list_to_binary and binary_to_list won't work well with that. You
> need to use unicode:characters_to_[binary|list]/1-3 and make sure the
> original string has been encoded correctly.

Interesting comment. This is almost where I could write an article with the
title "list_to_binary considered harmful". I guess if Erlang is serializing
terms to be stored on disk etc., term_to_binary and its inverse should be used.
list_to_binary seems to imply that you are going to send something to the
outside world, and then you should stop and think hard. This is because there
is no universal agreement in the outside world as to what an integer is (i.e.
is it bounded or not). Fixing the notion of an integer to something in the
range 0..255 allows communication of integers, but requires a framing protocol
(e.g. UTF-8 or ASN.1) that tells how integers are encoded, and this is out of
band.

The problem is that I might write

    X1 = "10$"    (10 dollars) or
    X2 = "10\x{20ac}"  (10 euros)

Now list_to_binary(X1) will succeed, but list_to_binary(X2) will fail with
badarg, because 8364 does not fit in a byte.
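
To make the difference concrete, here is a minimal sketch. The module name
l2b_demo and its two functions are made up for this example; only
list_to_binary/1 and unicode:characters_to_binary/1 come from OTP.

```erlang
-module(l2b_demo).
-export([bytes/1, utf8/1]).

%% list_to_binary/1 only accepts integers in the range 0..255,
%% so it raises badarg for any code point above 255.
bytes(String) ->
    try list_to_binary(String) of
        Bin -> {ok, Bin}
    catch
        error:badarg -> {error, not_bytes}
    end.

%% unicode:characters_to_binary/1 treats the list elements as
%% code points and encodes them as UTF-8.
utf8(String) ->
    unicode:characters_to_binary(String).
```

With this, l2b_demo:bytes("10$") gives {ok,<<"10$">>}, while
l2b_demo:bytes([49,48,8364]) gives {error,not_bytes} and
l2b_demo:utf8([49,48,8364]) gives <<49,48,226,130,172>>.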

So maybe I should write

    X1 = {ansii, "10$"}
    X2 = {unicode,"10\x{20ac}"}

If the libraries were written this way then life might be easier
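
A rough sketch of what such a library function might look like, assuming
flat strings and the two tags exactly as spelled above. The module
tagged_string and the function to_binary/1 are hypothetical; nothing like
this exists in OTP.

```erlang
-module(tagged_string).
-export([to_binary/1]).

%% Hypothetical tagged strings: {ansii, String} | {unicode, String}.
%% The tag says how the integers in the list are to be encoded.
to_binary({ansii, S}) ->
    %% Only 7-bit characters are accepted under this tag.
    IsAscii = fun(C) -> is_integer(C) andalso C >= 0 andalso C < 128 end,
    case lists:all(IsAscii, S) of
        true  -> {ok, list_to_binary(S)};
        false -> {error, not_ascii}
    end;
to_binary({unicode, S}) ->
    %% Code points are encoded as UTF-8 bytes.
    case unicode:characters_to_binary(S, unicode, utf8) of
        Bin when is_binary(Bin) -> {ok, Bin};
        {error, _, _}           -> {error, bad_unicode};
        {incomplete, _, _}      -> {error, bad_unicode}
    end.
```

The point of the tag is that the caller can no longer hand a list of code
points to a byte-oriented conversion by accident.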

/Joe

>
>>
>> The type b) interpretation only has meaning when you write data to a
>> file etc. and should be as invisible to the user as possible (but when
>> things go wrong and you get the wrong character printed you need to
>> understand the difference)
>
> This can work when printing to a file because you can print raw bytes using
> a list, but it's rather hard and I would prefer people to push the
> distinction above (binary for bytes, lists for codepoints and numbers larger
> than bytes)
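
A minimal sketch of that distinction when writing a file: the list of code
points is converted to a UTF-8 binary first, and only bytes reach the file.
The module name utf8_file and its function names are invented for this
example.

```erlang
-module(utf8_file).
-export([write/2, read/1]).

%% Encode a list of code points as UTF-8 and write the bytes to Path.
write(Path, Chars) ->
    case unicode:characters_to_binary(Chars, unicode, utf8) of
        Bin when is_binary(Bin) -> file:write_file(Path, Bin);
        _Other                  -> {error, bad_unicode}
    end.

%% Read the bytes back and decode them into a list of code points.
read(Path) ->
    case file:read_file(Path) of
        {ok, Bin} ->
            case unicode:characters_to_list(Bin, utf8) of
                Chars when is_list(Chars) -> {ok, Chars};
                _Other                    -> {error, bad_utf8}
            end;
        {error, _} = Err ->
            Err
    end.
```

After utf8_file:write("demo.txt", [49,48,8364]) the file holds the five
bytes 49,48,226,130,172, and utf8_file:read("demo.txt") gives back
{ok,[49,48,8364]}.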
>
>>
>> Question 1) How can we get a Unicode character into a list item?
>>            Or: what does a string literal look like?
>>
>>   > X = "10\x{20ac}"
>>   [49,48,8364]
>
> This is one way to do it, yes. You can also copy/paste the character
> directly in the shell if I recall.
>
>>
>>   This is not described in my book since the change came after the
>>   book was published (is it in the other Erlang books yet?)
>
> Not that I know of.
>>
>> Question 2) How can we convert between representations a) and b) above?
>>
>>   Easy - though one has to dig in the documentation a bit.
>>
>>   > B = unicode:characters_to_binary(X, unicode, utf8).
>>   <<49,48,226,130,172>>
>>   > unicode:characters_to_list(B).
>>   [49,48,8364]
>
> Yep.
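
One thing worth adding to the conversion above: the unicode functions do
not always return a binary or a list. On bad or truncated input they return
an {error, ...} or {incomplete, ...} tuple instead. A small sketch that makes
this explicit (the module name utf8_roundtrip and the wrapper functions are
made up for illustration):

```erlang
-module(utf8_roundtrip).
-export([encode/1, decode/1]).

%% a) -> b): list of code points to UTF-8 binary.
encode(Chars) ->
    case unicode:characters_to_binary(Chars, unicode, utf8) of
        Bin when is_binary(Bin)  -> {ok, Bin};
        {error, Good, Rest}      -> {error, {invalid, Good, Rest}};
        {incomplete, Good, Rest} -> {error, {incomplete, Good, Rest}}
    end.

%% b) -> a): UTF-8 binary to list of code points.
decode(Bin) when is_binary(Bin) ->
    case unicode:characters_to_list(Bin, utf8) of
        List when is_list(List)  -> {ok, List};
        {error, Good, Rest}      -> {error, {invalid, Good, Rest}};
        {incomplete, Good, Rest} -> {error, {incomplete, Good, Rest}}
    end.
```

For example, decoding <<49,48,226,130>> (the Euro sign with its last byte
cut off) reports an incomplete sequence rather than crashing or returning
garbage.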
>>
>> Question 3) Can I write "10(Euro)" in an editor which supports
>> unicode/UTF-8 and does the erlang tool chain support this?
>>
>> Will "erlc foo.erl" automatically detect that foo.erl is unicode
>> encoded and do the right thing when scanning and tokenising strings?
>>
>>   Answer: I don't know.
>
> Nope. Erlang assumes all files to be in latin-1. Anything that looks like
> unicode support is luck at best. There is no good Unicode support from
> static files, although the shell does support it.
>
>>
>> Question 4)  Can string literals be improved on?
>>
>> I hope so -- In HTML I can say (I hope) &#8364;
>>
>> I'd like to say:
>>
>>      X = "10&#8364;" in Erlang
>>
>>      People who know far more about this than I do can tell me if this
>> is OK.
>
> This might be funny because people might already use that escaping for HTML
> content on their sites. If you start inlining it and converting it to Unicode
> symbols for them, there's no telling how backwards compatible this will be.
> In any case, I'd much prefer the compiler to support UTF-8, UTF-16 or UTF-32
> encoded files over support for '&<code>;' or '&#ijk;' encoding. I'm pretty
> sure the non-latin-1 users of this mailing list would agree there, too.
>
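
If the '&#ijk;' form is wanted at all, it can be handled by an ordinary
function at run time, which leaves string literals and existing HTML content
alone. A minimal sketch, assuming flat strings and decimal references only;
the module entity_demo and its decode/1 are hypothetical, and no check is
made that the number is a valid code point.

```erlang
-module(entity_demo).
-export([decode/1]).

%% Replace decimal numeric character references such as "&#8364;"
%% with the corresponding code point. Everything else is copied through.
decode([$&, $# | Rest]) ->
    case take_digits(Rest, []) of
        {[_ | _] = Digits, [$; | Tail]} ->
            [list_to_integer(Digits) | decode(Tail)];
        _NotAReference ->
            %% Not of the form &#digits; so keep the "&" and carry on.
            [$& | decode([$# | Rest])]
    end;
decode([C | Rest]) ->
    [C | decode(Rest)];
decode([]) ->
    [].

%% Collect leading decimal digits; return {Digits, Remainder}.
take_digits([D | Rest], Acc) when D >= $0, D =< $9 ->
    take_digits(Rest, [D | Acc]);
take_digits(Rest, Acc) ->
    {lists:reverse(Acc), Rest}.
```

Here entity_demo:decode("10&#8364;") gives [49,48,8364], and a string
without such a reference comes back unchanged.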