[erlang-questions] cookbook entry #1 - unicode/UTF-8 strings
Joe Armstrong
erlang@REDACTED
Wed Oct 19 12:14:30 CEST 2011
cookbook # 1 - draft 1
<aside>
We're going to write a cookbook.
It will be free in electronic form (PDF, epub),
and you will be able to buy a paper version (POD).
The development model is:
- a few authors
- many reviewers (you are the reviewers)
The reviewers report errors and suggest changes;
the authors make the changes.
We hope the POD version will generate some income, which
will be split according to contribution. Authors
will be paid, as will reviewers whose suggestions are incorporated.
Payment (if we make a profit) will be in direct relation to the size
of the contribution.
Expensive things like professional proofreading will be paid for by
sponsorship, crowdsourcing, or some other form of financing.
To start the ball rolling I have some text below.
Please comment on this text. If your comments are accepted one day you
might get paid :-)
Note: 1) By commenting you are implicitly agreeing that if your comments
are accepted into the final text then you will be subject to the
licensing conditions of that text. The text will always be free and
open source.
</aside>
Cookbook Question:
I have often seen the words "UTF-8 string" used in sentences like
"Java has UTF-8 strings". What does this mean when applied to Erlang?
----------------------------------------------------------------------
Answer:
In Erlang, strings are syntactic sugar for "lists of integers".
Imagine the string "10(Euro)", where (Euro) is the glyph representing the
Euro currency symbol.
The term "UTF-8 string" representing "10(Euro)" in Erlang could
mean one of two things:
Either a) [49,48,8364] (i.e. a list of three Unicode code points)
Or b) [49,48,226,130,172] (i.e. the UTF-8 encoding of those
code points)
So the words "UTF-8 string" might mean a) or might mean b).
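To see where the three bytes 226,130,172 in b) come from, here is a small sketch that encodes the code point 8364 by hand and compares it with what the bit syntax produces. The variable names are mine, chosen only for illustration; the lines can be pasted into the Erlang shell as they are.

```erlang
%% 8364 is 16#20AC. A code point of this size is encoded in UTF-8 as three
%% bytes of the form 1110xxxx 10xxxxxx 10xxxxxx, where the x's are the
%% 16 bits of the code point, most significant first.
Euro = 16#20AC,
HandEncoded = [2#11100000 bor (Euro bsr 12),
               2#10000000 bor ((Euro bsr 6) band 2#111111),
               2#10000000 bor (Euro band 2#111111)],
%% The /utf8 segment type in the bit syntax performs the same encoding.
<<B1, B2, B3>> = <<Euro/utf8>>,
HandEncoded = [B1, B2, B3].
```

Both routes give [226,130,172], so representation b) is just representation a) with every code point expanded into its UTF-8 bytes.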
Erlang folks have always said "unicode/UTF-8 is easy in Erlang
since strings are just lists of integers" - by this we mean that
Erlang programs should always manipulate strings using the type a)
interpretation. *All* library functions assume the type a) encoding.
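A short sketch of why the type a) interpretation is the one to manipulate: ordinary list functions give the answer you expect on code points, and the wrong one on UTF-8 bytes. The variable names are illustrative only.

```erlang
CodePoints = [49,48,8364],         % type a): one integer per character
Utf8Bytes  = [49,48,226,130,172],  % type b): one integer per byte
3 = length(CodePoints),            % three characters, as expected
5 = length(Utf8Bytes),             % five bytes, not three characters
[8364,48,49] = lists:reverse(CodePoints),
%% Reversing the bytes splits up the Euro sign; the result starts with a
%% continuation byte and can no longer be decoded as UTF-8.
Reversed = lists:reverse(Utf8Bytes),
{error, _, _} = unicode:characters_to_list(list_to_binary(Reversed)).
```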
The type b) interpretation only has meaning when you write data to a
file, a socket, etc., and should be as invisible to the user as possible
(but when things go wrong and the wrong character gets printed, you need
to understand the difference).
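As a minimal sketch of where type b) shows up, the lines below convert a type a) string to UTF-8 only at the moment it is written to a file, and convert back straight after reading. The file name euro_example.txt is a made-up name for this example, and the file is created in the current directory.

```erlang
Text = [49,48,8364],                % type a) inside the program
Path = "euro_example.txt",          % hypothetical file name
%% Encode to UTF-8 bytes only at the boundary, when writing.
ok = file:write_file(Path, unicode:characters_to_binary(Text)),
%% The file now holds the five bytes of the type b) representation.
{ok, FileBin} = file:read_file(Path),
<<49,48,226,130,172>> = FileBin,
%% Decode again as soon as the data is back inside the program.
Text = unicode:characters_to_list(FileBin).
```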
Question 1) How can we get a Unicode character into a list element?
Or: what does a string literal look like?
> X = "10\x{20ac}".
[49,48,8364]
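For comparison, here are a few other ways of writing the same list. This is only a sketch; each line matches against the first, so the shell will complain with a badmatch if any of them differs.

```erlang
Literal = "10\x{20ac}",
Literal = [$1, $0, 16#20AC],        % character literals plus a hex integer
Literal = [$1, $0, $\x{20ac}],      % the escape also works after $
Literal = "10" ++ [8364].           % a string is just a list, so ++ works
```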
This is not described in my book since the change came after the
book was published (is it in the other Erlang books yet?)
Question 2) How can we convert between representations a) and b) above?
Easy - though one has to dig in the documentation a bit.
> B = unicode:characters_to_binary(X, unicode, utf8).
<<49,48,226,130,172>>
> unicode:characters_to_list(B).
[49,48,8364]
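A sketch of the round trip, plus what happens when the bytes are cut off in the middle of a character. It assumes the default encodings of the unicode module (unicode in, UTF-8 out); the variable names are mine.

```erlang
Chars = [49,48,8364],
%% With one argument the defaults are used: unicode in, UTF-8 out.
Encoded = unicode:characters_to_binary(Chars),
<<49,48,226,130,172>> = Encoded,
Chars = unicode:characters_to_list(Encoded),
%% If the last byte of the Euro sign is missing, the conversion does not
%% crash. It returns what it could decode and the bytes left over.
{incomplete, "10", Rest} = unicode:characters_to_list(<<49,48,226,130>>),
<<226,130>> = Rest.
```

Checking for the {incomplete, ...} and {error, ...} results is worth doing whenever the binary comes from outside the program.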
Question 3) Can I write "10(Euro)" in an editor which supports
unicode/UTF-8, and does the Erlang toolchain support this?
Will "erlc foo.erl" automatically detect that foo.erl is unicode
encoded and do the right thing when scanning and tokenising strings?
Answer: I don't know.
Question 4) Can string literals be improved on?
I hope so -- In HTML I can say (I hope) &euro;
I'd like to say:
X = "10&euro;" in Erlang
People who know far more about this than I do can tell me if this
is OK
----------------------------------------------------------------------