[erlang-questions] cookbook entry #1 - unicode/UTF-8 strings

Wed Oct 19 13:19:19 CEST 2011

On Wed, Oct 19, 2011 at 6:14 AM, Joe Armstrong <erlang@REDACTED> wrote:

> cookbook # 1 - draft 1
>
> <aside>
>  We're going to write a cookbook.
>
>  This will be free (in an electronic version, PDF, epub)
>  And you will be able to buy a paper version (POD)
>
>  The development model is
>
>  - a few authors
>  - many reviewers (you are the reviewers)
>    the reviewers report errors/suggest changes
>    the authors make the changes
>

That sounds neat.

>
>  The POD version we hope will generate some income
>  this will be split according to the contributions. Authors
>  will be paid as will reviewers whose suggestions are incorporated.
>
>  Payment (if we make a profit) will be in direct relation to the size
> of the contribution
>
>  Expensive things like professional proofreading will be financed by
>  sponsorship, crowd sourcing, or some other means.
>
>  To start the ball rolling I have some text below.
>
>  Please comment on this text. If your comments are accepted one day you
> might get paid :-)
>
>  Note: 1) By commenting you are implicitly agreeing that if your comments
> are accepted into the final text then you will be subject to the
> licensing conditions of that text. The text will always be free and
> open source.
>
> </aside>
>
> Cookbook Question:
>
> I have often seen the words "UTF-8 string" used in sentences like
> "Java has UTF-8 strings". What does this mean when applied to Erlang?
>
> ----------------------------------------------------------------------
>
> Answer:
>
> In Erlang strings are syntactic sugar for "lists of integers"
>
> Imagine the string "10(Euro)" - (Euro) is the glyph representing the
> Euro currency symbol.
>
> The term "UTF-8 string" representing "10(Euro)" in Erlang could
> mean one of two things:
>
>   Either a) [49,48,8364]           (i.e. it's a list of three Unicode
>                                     code points)
>   Or     b) [49,48,226,130,172]    (i.e. it's the UTF-8 encoding of the
>                                     Unicode characters)
>
> So the words "UTF-8 string" might mean a) or might mean b)
>

No, it won't mean b). See this (using ~ts instead of ~s to specify we want
unicode handling):

4> io:format("~ts~n",[<<49,48,226,130,172>>]).
10€
ok
5> io:format("~ts~n",[[49,48,226,130,172]]).
10â‚¬
ok
6> io:format("~ts~n",[[49,48,8364]]).
10€
ok

They are not the same thing. See
http://ferd.ca/will-the-real-unicode-wrangler-please-stand-up.html for a bit
of stuff I've written on it, although my terminology in that blog post is
far from stellar and exact according to unicode standards.

The gist of it is that lists and binaries do not represent Unicode text the
same way. A string (a list) holds full-length codepoints, one integer per
character, while a binary holds the byte-based UTF-8 representation you show.
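
To make the distinction concrete, here is a minimal sketch. The module and
function names are made up for illustration; only length/1, byte_size/1 and
unicode:characters_to_binary/1 come from Erlang itself.

```erlang
-module(euro_repr).
-export([codepoints/0, utf8_bytes/0, sizes/0]).

%% "10" followed by the Euro sign (U+20AC), one integer per codepoint.
codepoints() ->
    [49, 48, 16#20AC].

%% The same text as a UTF-8 encoded binary.
utf8_bytes() ->
    unicode:characters_to_binary(codepoints()).

%% Number of codepoints in the list versus number of bytes in the binary.
sizes() ->
    {length(codepoints()), byte_size(utf8_bytes())}.
```

The list has three elements, while the binary needs five bytes because the
Euro sign takes three bytes in UTF-8.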

>
> Erlang folks have always said "unicode/UTF-8 is easy in Erlang
> since strings are just lists of integers" - by this we mean that
> Erlang programs should always manipulate strings given the type a)
> interpretation. *all* library functions assume type a) encoding.
>

Not always. list_to_binary and binary_to_list won't work well with that: they
deal in bytes, so list_to_binary fails on any integer above 255. You need to
use unicode:characters_to_binary/1,2,3 or unicode:characters_to_list/1,2 and
make sure the original string has been encoded correctly.
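
A small sketch of the difference, assuming a hypothetical module conv_check
(the wrapper names are mine, the BIFs are the real ones):

```erlang
-module(conv_check).
-export([naive/1, proper/1, bytes_as_list/1]).

%% list_to_binary/1 only accepts integers in 0..255 (plus nested
%% lists and binaries), so it raises badarg on larger codepoints.
naive(Chars) ->
    try list_to_binary(Chars) of
        Bin -> {ok, Bin}
    catch
        error:badarg -> {error, badarg}
    end.

%% Encodes a list of codepoints as UTF-8.
proper(Chars) ->
    unicode:characters_to_binary(Chars).

%% binary_to_list/1 returns the raw bytes, not decoded codepoints.
bytes_as_list(Bin) ->
    binary_to_list(Bin).
```

Note that even when list_to_binary does not crash, it is not doing UTF-8:
for codepoint 233 it produces the single byte 233, where the UTF-8 encoding
is the two bytes 195,169.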

>
> The type b) interpretation only has meaning when you write data to a
> file etc. and should be as invisible to the user as possible (but when
> things go wrong and you get the wrong character printed you need to
> understand the difference)
>

This can work when printing to a file because you can print raw bytes using
a list, but it's rather hard to get right, and I would prefer people to push
the distinction above (binaries for bytes, lists for codepoints, which can be
larger than a byte).
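
One way to keep that distinction when writing to a file is to convert to a
binary at the boundary and back to a list when reading. This is only a
sketch: the module euro_file and its function names are hypothetical, and it
assumes the file is meant to hold UTF-8.

```erlang
-module(euro_file).
-export([write_text/2, read_text/1]).

%% Encode the codepoint list as UTF-8 and write the resulting bytes.
%% Assumes Chars is valid; otherwise characters_to_binary/1 returns
%% an error tuple and file:write_file/2 will reject it.
write_text(Path, Chars) ->
    file:write_file(Path, unicode:characters_to_binary(Chars)).

%% Read the raw bytes and decode them back into codepoints.
read_text(Path) ->
    case file:read_file(Path) of
        {ok, Bin} ->
            case unicode:characters_to_list(Bin, utf8) of
                List when is_list(List) -> {ok, List};
                _ -> {error, bad_utf8}
            end;
        {error, Reason} ->
            {error, Reason}
    end.
```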

>
> Question 1) How can we get Unicode characters into a list item?
>            or what does a string literal look like?
>
>   > X = "10\x{20ac}"
>   [49,48,8364]
>

This is one way to do it, yes. You can also copy/paste the character
directly in the shell if I recall.
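
In a module, the escape form keeps the source itself plain ASCII. A minimal
sketch (the module name euro_literal is made up):

```erlang
-module(euro_literal).
-export([hex_escape/0, char_literal/0]).

%% String literal using the \x{...} escape for U+20AC.
hex_escape() ->
    "10\x{20ac}".

%% The same list built from character literals.
char_literal() ->
    [$1, $0, $\x{20ac}].
```

Both functions return the same three-element list of codepoints.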

>
>   This is not described in my book since the change came after the
>   book was published (is it in the other Erlang books yet?)
>

Not that I know of.

>
> Question 2) How can we convert between representations a) and b) above?
>
>   Easy - though one has to dig in the documentation a bit.
>
>   > B = unicode:characters_to_binary(X, unicode, utf8).
>   <<49,48,226,130,172>>
>   > unicode:characters_to_list(B).
>   [49,48,8364]
>

Yep.
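
One thing worth adding to the recipe: unicode:characters_to_list does not
crash on bad input, it returns an {error, ...} or {incomplete, ...} tuple
instead of a list, so the result should be checked. A sketch, with a
hypothetical wrapper module utf8_decode and return shapes of my own choosing:

```erlang
-module(utf8_decode).
-export([decode/1]).

%% Decode a UTF-8 binary into a list of codepoints, turning the
%% three possible results of unicode:characters_to_list/2 into
%% tagged tuples.
decode(Bin) when is_binary(Bin) ->
    case unicode:characters_to_list(Bin, utf8) of
        {error, Decoded, Rest} ->
            %% Invalid byte sequence; Decoded is what was read so far.
            {error, {invalid, Decoded, Rest}};
        {incomplete, Decoded, Rest} ->
            %% Input ended in the middle of a multi-byte character.
            {error, {truncated, Decoded, Rest}};
        List when is_list(List) ->
            {ok, List}
    end.
```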

>
> Question 3) Can I write "10(Euro)" in an editor which supports
> unicode/UTF-8 and does the erlang tool chain support this?
>
> Will "erlc foo.erl" automatically detect that foo.erl is unicode
> encoded and do the right thing when scanning and tokenising strings?
>
>   Answer: I don't know?
>

Nope. Erlang assumes all source files to be in latin-1. Anything that looks
like Unicode support there is luck at best. There is no good Unicode support
for source files, although the shell does support it.

>
> Question 4)  Can string literals be improved on?
>
> I hope so -- In Html I can say (I hope) €
>
> I'd like to say:
>
>      X = "10€" in Erlang
>
>      People who know far more about this than I do can tell me if this
> is OK
>

This might be funny because people might already use that escaping for HTML
content on their sites. If you start inlining it and converting it to Unicode
symbols for them, there's no telling how backwards compatible this will be.
In any case, I'd much prefer the compiler to support utf-8, utf-16 or utf-32
encoded files than support for '&<code>;' or '&#ijk;' encoding. I'm pretty
sure the non-latin-1 users of this mailing list would agree there, too.