[erlang-questions] cookbook entry #1 - unicode/UTF-8 strings

Wed Oct 19 12:14:30 CEST 2011

cookbook # 1 - draft 1

<aside>
 We're going to write a cookbook.

 This will be free in an electronic version (PDF, epub),
 and you will be able to buy a paper version (POD).

 The development model is

  - a few authors
  - many reviewers (you are the reviewers)
    the reviewers report errors/suggest changes
    the authors make the changes

 We hope the POD version will generate some income;
 this will be split according to the contributions. Authors
 will be paid, as will reviewers whose suggestions are incorporated.

 Payment (if we make a profit) will be in direct relation to the size
of the contribution.

 Expensive things like professional proofreading will be paid for by
 sponsorship, crowdsourcing, or some other means.

 To start the ball rolling I have some text below.

 Please comment on this text. If your comments are accepted one day you
might get paid :-)

 Note: 1) By commenting you are implicitly agreeing that if your comments
are accepted into the final text then you will be subject to the
licensing conditions of that text. The text will always be free and
open source.

</aside>

Cookbook Question:

I have often seen the words "UTF-8 string" used in sentences like
"Java has UTF-8 strings". What does this mean when applied to Erlang?

----------------------------------------------------------------------

Answer:

In Erlang, strings are syntactic sugar for "lists of integers".

Imagine the string "10(Euro)" - (Euro) is the glyph representing the
Euro currency symbol.

The term "UTF-8 string" representing "10(Euro)" in Erlang could
mean one of two things:

   Either a) [49,48,8364]           (i.e. it's a list of three Unicode
                                     code points)
   Or     b) [49,48,226,130,172]    (i.e. it's the UTF-8 encoding of the
                                     Unicode characters)

So the words "UTF-8 string" might mean a) or might mean b).
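
To make the difference concrete, here is a minimal sketch (the module
name utf8_repr and its function names are made up for illustration)
that builds both representations of "10(Euro)" and shows that length/1
gives different answers for them:

```erlang
-module(utf8_repr).
-export([code_points/0, utf8_bytes/0, lengths/0]).

%% Representation a): one integer per Unicode code point.
code_points() ->
    [$1, $0, 16#20AC].

%% Representation b): the UTF-8 bytes of the same text, as a list.
utf8_bytes() ->
    binary_to_list(unicode:characters_to_binary(code_points(), unicode, utf8)).

%% length/1 counts list elements, not characters, so the two
%% representations report different lengths for the same text.
lengths() ->
    {length(code_points()), length(utf8_bytes())}.
```

Calling utf8_repr:lengths() should return {3,5}: three code points in
a), five bytes in b).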

Erlang folks have always said "unicode/UTF-8 is easy in Erlang
since strings are just lists of integers" - by this we mean that
Erlang programs should always manipulate strings using the type a)
interpretation. *All* library functions assume type a) encoding.

The type b) interpretation only has meaning when you write data to a
file, socket, etc., and should be as invisible to the user as possible
(but when things go wrong and you get the wrong character printed, you
need to understand the difference).
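
One way to keep type b) at the edges of a program is to encode only
when writing and decode only when reading. The following is a sketch
under that assumption, not the only way to do it; the module name
utf8_file and its two functions are hypothetical, and error handling
on the read side is left out for brevity:

```erlang
-module(utf8_file).
-export([write_text/2, read_text/1]).

%% Encode a type a) string to UTF-8 bytes just before writing.
write_text(Path, Chars) ->
    case unicode:characters_to_binary(Chars, unicode, utf8) of
        Bytes when is_binary(Bytes) -> file:write_file(Path, Bytes);
        Error -> {error, Error}
    end.

%% Read the raw bytes and decode them back to a type a) string.
%% Crashes if the file cannot be read.
read_text(Path) ->
    {ok, Bytes} = file:read_file(Path),
    unicode:characters_to_list(Bytes, utf8).
```

The rest of the program then only ever sees lists of code points.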

Question 1) How can we get a Unicode character into a list item?
            Or, what does a string literal look like?

   > X = "10\x{20ac}".
   [49,48,8364]

   This is not described in my book, since the change came after the
   book was published. (Is it in the other Erlang books yet?)
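
   The same list can be spelled in several ways. A small sketch (the
   module name euro_literal is made up for illustration) that checks
   that they all denote the same term:

```erlang
-module(euro_literal).
-export([forms/0, all_equal/0]).

%% Four spellings of the same three-element list.
forms() ->
    ["10\x{20ac}",            % hex escape inside a string literal
     [$1, $0, $\x{20ac}],     % character literals
     [49, 48, 16#20AC],       % plain integers, Euro sign in hex
     [49, 48, 8364]].         % plain integers, all decimal

%% True when every spelling is exactly equal to the first one.
all_equal() ->
    [First | Rest] = forms(),
    lists:all(fun(F) -> F =:= First end, Rest).
```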

Question 2) How can we convert between representations a) and b) above?

   Easy - though one has to dig in the documentation a bit.

   > B = unicode:characters_to_binary(X, unicode, utf8).
   <<49,48,226,130,172>>
   > unicode:characters_to_list(B).
   [49,48,8364]
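
   Both functions can also return {error, ...} or {incomplete, ...}
   tuples when the input is not valid. A hedged sketch of wrappers that
   make this explicit (the module name utf8_convert and the shape of
   the error terms are my own choices, not part of the library):

```erlang
-module(utf8_convert).
-export([to_utf8/1, from_utf8/1]).

%% Representation a) -> b): list of code points to a UTF-8 binary.
to_utf8(Chars) ->
    case unicode:characters_to_binary(Chars, unicode, utf8) of
        Bin when is_binary(Bin)   -> {ok, Bin};
        {error, _Good, Rest}      -> {error, {invalid, Rest}};
        {incomplete, _Good, Rest} -> {error, {incomplete, Rest}}
    end.

%% Representation b) -> a): UTF-8 binary to a list of code points.
from_utf8(Bin) when is_binary(Bin) ->
    case unicode:characters_to_list(Bin, utf8) of
        List when is_list(List)   -> {ok, List};
        {error, _Good, Rest}      -> {error, {invalid, Rest}};
        {incomplete, _Good, Rest} -> {error, {incomplete, Rest}}
    end.
```

   For example, a binary that stops in the middle of the Euro sign,
   such as <<49,48,226,130>>, comes back from from_utf8/1 as an
   incomplete error rather than as a list.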

Question 3) Can I write "10(Euro)" in an editor that supports
Unicode/UTF-8, and does the Erlang tool chain support this?

Will "erlc foo.erl" automatically detect that foo.erl is UTF-8
encoded and do the right thing when scanning and tokenising strings?

   Answer: I don't know.

Question 4)  Can string literals be improved on?

I hope so -- in HTML I can say (I hope) &euro;

I'd like to say:

      X = "10&euro;" in Erlang

      People who know far more about this than I do can tell me if this
is OK.

----------------------------------------------------------------------