[erlang-questions] cookbook entry #1 - unicode/UTF-8 strings
Angel J. Alvarez Miguel
clist@REDACTED
Fri Oct 21 10:41:39 CEST 2011
questions below...
On Wednesday, 19 October 2011 12:14:30 Joe Armstrong wrote:
> cookbook # 1 - draft 1
>
> <aside>
> We're going to write a cookbook.
>
> This will be free (in an electronic version, PDF, epub)
> And you will be able to buy a paper version (POD)
>
> The development model is
>
> - a few authors
> - many reviewers (you are the reviewers)
> the reviewers report errors/suggest changes
> the authors make the changes
>
> The POD version we hope will generate some income
> this will be split according to the contributions. Authors
> will be paid as will reviewers whose suggestions are incorporated.
>
> Payment (if we make a profit) will be in direct relation to the size
> of the contribution
>
> Expensive things like professional proofreading will be paid for by
> sponsorship, crowdsourced, or otherwise financed.
>
> To start the ball rolling I have some text below.
>
> Please comment on this text. If your comments are accepted one day you
> might get paid :-)
>
> Note: 1) By commenting you are implicitly agreeing that if your comments
> are accepted into the final text then you will be subject to the
> licensing conditions of that text. The text will always be free and
> open source.
>
> </aside>
>
> Cookbook Question:
>
> I have often seen the words "UTF-8 string" used in sentences like
> "Java has UTF-8 strings". What does this mean when applied to Erlang?
>
> ----------------------------------------------------------------------
>
> Answer:
>
> In Erlang strings are syntactic sugar for "lists of integers"
>
> Imagine the string "10(Euro)" - (Euro) is the glyph representing the
> Euro currency symbol.
>
> The term "UTF-8 string" representing "10(Euro)" in Erlang could
> mean one of two things:
>
> Either a) [49,48,8364] (i.e. it's a list of three Unicode
> code points) Or b) [49,48,226,130,172] (i.e. it's the UTF-8 encoding of
> the Unicode characters)
>
> So the words "UTF-8 string" might mean a) or might mean b)
>
> Erlang folks have always said "unicode/UTF-8 is easy in Erlang
> since strings are just lists of integers" - by this we mean that
> Erlang programs should always manipulate strings given the type a)
> interpretation. *all* library functions assume type a) encoding.
>
> The type b) interpretation only has meaning when you write data to a
> file etc. and should be as invisible to the user as possible (but when
> things go wrong and you get the wrong character printed you need to
> understand the difference)
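To make the a) versus b) distinction concrete, here is a minimal sketch. The module and function names (`repr_demo`, `lengths/0`, `reversed_ok/0`) are made up for illustration; only `length/1`, `lists:reverse/1`, `list_to_binary/1` and `unicode:characters_to_list/2` are real library calls. It shows that ordinary list functions count and reorder characters under a) but bytes under b).

```erlang
-module(repr_demo).
-export([lengths/0, reversed_ok/0]).

%% "10(Euro)" as Unicode code points (a) and as UTF-8 bytes (b).
a() -> [49,48,8364].
b() -> [49,48,226,130,172].

%% length/1 counts list elements: characters for (a), bytes for (b).
lengths() ->
    {length(a()), length(b())}.

%% Reversing (a) still gives valid text. Reversing (b) scrambles the
%% three-byte Euro sequence, so the result no longer decodes as UTF-8
%% and unicode:characters_to_list/2 returns an error tuple, not a list.
reversed_ok() ->
    RevA = lists:reverse(a()),
    RevB = list_to_binary(lists:reverse(b())),
    {RevA, is_list(unicode:characters_to_list(RevB, utf8))}.
```

Under a) the string has three elements and reverses cleanly; under b) it has five elements and the reversed bytes are no longer valid UTF-8.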
>
> Question 1) How can we get a Unicode character into a list item?
> Or what does a string literal look like?
>
> > X = "10\x{20ac}".
>
> [49,48,8364]
>
> This is not described in my book since the change came after the
> book was published (is it in the other Erlang books yet?)
>
> Question 2) How can we convert between representations a) and b) above?
>
> Easy - though one has to dig in the documentation a bit.
>
> > B = unicode:characters_to_binary(X, unicode, utf8).
>
> <<49,48,226,130,172>>
>
> > unicode:characters_to_list(B).
>
> [49,48,8364]
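The two shell calls above can be wrapped as a pair of helpers. This is only a sketch: the module name `euro_roundtrip` and the functions `encode/1` and `decode/1` are hypothetical, and it assumes well-formed input (on bad input the `unicode` functions return an `{error, ...}` or `{incomplete, ...}` tuple instead of a binary or list).

```erlang
-module(euro_roundtrip).
-export([encode/1, decode/1]).

%% Representation a) (list of code points) -> representation b)
%% (UTF-8 bytes, held in a binary).
encode(CodePoints) when is_list(CodePoints) ->
    unicode:characters_to_binary(CodePoints, unicode, utf8).

%% Representation b) (UTF-8 binary) -> representation a).
decode(Utf8) when is_binary(Utf8) ->
    unicode:characters_to_list(Utf8, utf8).
```

For example, `euro_roundtrip:encode([49,48,8364])` gives `<<49,48,226,130,172>>`, and passing that binary to `euro_roundtrip:decode/1` gives back `[49,48,8364]`.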
>
> Question 3) Can I write "10(Euro)" in an editor which supports
> unicode/UTF-8 and does the erlang tool chain support this?
I would have said no! (Last year I posted a mail complaining about my Spanish
messages getting garbled when I used ñ, ó, á, etc.)
But right now it works!
I've just added some national characters to one of my strings and they seem
to survive the compilation step.
...io:format("Procesando ó ñ ü fichero ~s / ~s ~.16b ~n",
[filename:dirname(Path),filename:basename(Path),Digest]),....
this outputs:
Procesando ó ñ ü fichero /home/sinosuke / .bash_history
84a45c9c62121aec0d1860534377577a
Procesando ó ñ ü fichero /home/sinosuke / .xim.template
93d3a1252fe3069130b1cece05fd6d44
Procesando ó ñ ü fichero /home/sinosuke / .xinitrc.template
d3f5ce074afc0ef61c89d1d08c582457
(I'm using Kate on openSUSE 11.4 x64 and Erlang/OTP R14B04 (erts-5.8.5), and my
sources are in UTF-8.)
The same sources opened in ISO-8859-1 mode show:
io:format("Procesando Ã³ Ã± Ã¼ fichero ~s / ~s ~.16b ~n",
[filename:dirname(Path),filename:basename(Path),Digest]),
Those Ã³ Ã± Ã¼ are the infamous Unicode codepoints, aren't they?
I hope this helps about this question.
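One possible reading of the observation above, offered as a sketch rather than a statement about what the compiler does: if the source bytes are taken one byte per character (Latin-1 style), the literal ends up as the UTF-8 byte list, i.e. representation b), and a UTF-8 terminal then renders those bytes correctly by accident. The module name `latin1_view` and its functions are hypothetical; the byte values are the standard UTF-8 encodings of ó, ñ and ü.

```erlang
-module(latin1_view).
-export([bytes/0, code_points/0]).

%% The UTF-8 bytes of "ó ñ ü", one list element per byte. This is what
%% a byte-per-character reading of the source text would produce.
bytes() ->
    [195,179,32,195,177,32,195,188].

%% Decoding those bytes as UTF-8 recovers the three code points
%% (243 = ó, 241 = ñ, 252 = ü) separated by spaces.
code_points() ->
    unicode:characters_to_list(list_to_binary(bytes()), utf8).
```

Here `latin1_view:code_points()` returns `[243,32,241,32,252]`, five elements where the byte list has eight.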
regards, Angel
>
> Will "erlc foo.erl" automatically detect that foo.erl is unicode
> encoded and do the right thing when scanning and tokenising strings?
>
> Answer: I don't know?
>
> Question 4) Can string literals be improved on?
>
> I hope so -- In HTML I can say (I hope) &euro;
>
> I'd like to say:
>
> X = "10&euro;" in Erlang
>
> People who know far more about this than I do can tell me if this
> is OK
>
>
> ----------------------------------------------------------------------
> _______________________________________________
> erlang-questions mailing list
> erlang-questions@REDACTED
> http://erlang.org/mailman/listinfo/erlang-questions