[erlang-questions] cookbook entry #1 - unicode/UTF-8 strings
Richard Carlsson
carlsson.richard@REDACTED
Fri Oct 21 14:35:48 CEST 2011
On 10/21/2011 10:41 AM, Angel J. Alvarez Miguel wrote:
> (I'm using Kate on openSUSE 11.4 x64 and Erlang/OTP R14B04 (erts-5.8.5) and my
> sources are in UTF-8)
No, don't make this mistake. To the Erlang compiler, your sources are in
Latin-1, plain and simple. As far as the compiler knows, you have
actually written "Ã³ Ã± Ã¼" and nothing else. When you print the string
with io:format, you are printing the Latin-1 text "Ã³ Ã± Ã¼" (the bytes
[195, 179, 32, 195, 177, 32, 195, 188]) to the standard output. That
your console re-interprets these bytes as "ó ñ ü" just means that you
have managed to fool the system for this particular use case.
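
To make the two readings concrete, here is a minimal sketch that takes the same eight bytes and interprets them once as Latin-1 and once as UTF-8. The module name latin1_vs_utf8 is made up for illustration; the unicode module is part of OTP's stdlib. The code uses integer lists rather than string literals, so it does not depend on how the compiler reads the source file.

```erlang
-module(latin1_vs_utf8).
-export([demo/0]).

demo() ->
    %% The bytes stored in a UTF-8 encoded source file between the quotes.
    Bytes = [195, 179, 32, 195, 177, 32, 195, 188],
    Bin = list_to_binary(Bytes),
    %% Read as Latin-1: every byte is one character, so 8 characters.
    Bytes = unicode:characters_to_list(Bin, latin1),
    8 = length(Bytes),
    %% Read as UTF-8: the same bytes decode to only 5 code points
    %% (o-acute, space, n-tilde, space, u-umlaut).
    [243, 32, 241, 32, 252] = unicode:characters_to_list(Bin, utf8),
    ok.
```

Calling latin1_vs_utf8:demo() returns ok if all the matches hold. The console that shows the "right" characters is doing the second decoding on its own; the program only ever held the first list.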
(By the way, those characters are already in the Latin-1 charset, so you
don't *need* UTF-8 at all unless you have some additional characters you
want to use that are above 255 in Unicode.)
If/when Erlang supports other encodings in source code (this will
probably require adding a compiler flag for specifying the input
encoding), a string literal such as "ᚱ" should be equivalent to [5809],
not [225,154,177], just like your "óñü" should be equivalent to
[243,241,252] (which is what you would have got if your editor had been
set to Latin-1 to begin with).
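
The difference between the code points a literal should denote and the bytes used to store it can be checked with the unicode module. This is only a sketch with a made-up module name (codepoints_vs_bytes); it says nothing about how a future compiler flag would be implemented.

```erlang
-module(codepoints_vs_bytes).
-export([demo/0]).

demo() ->
    %% The Runic letter U+16B1 is one code point, 5809, but takes
    %% three bytes when stored as UTF-8.
    <<225, 154, 177>> = unicode:characters_to_binary([5809]),
    %% The three code points 243, 241, 252 take six bytes as UTF-8 ...
    <<195, 179, 195, 177, 195, 188>> =
        unicode:characters_to_binary([243, 241, 252]),
    %% ... and three bytes as Latin-1, where byte value equals code point.
    <<243, 241, 252>> =
        unicode:characters_to_binary([243, 241, 252], unicode, latin1),
    ok.
```

The lists on the input side are what the string literals should mean; the binaries are merely how an editor would write them to disk in each encoding.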
One can think about it like this: taking an existing, working, Latin-1
source file, converting it to UTF-8 (or any other encoding), and
compiling it with a flag that informs the compiler what the input
encoding is, should not change the semantics of the program in any
respect whatsoever compared to compiling the original source file. Thus,
a string literal that today contains "ß" ([223]) in a plain old Latin-1
encoded Erlang source file must *always* mean [223] no matter what you
change the input encoding to.
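
As a rough illustration of that invariant, the sketch below uses a hypothetical helper, decode_literal/2, as a stand-in for what a compiler would do with the bytes between the quotes once it has been told the input encoding. Neither the helper nor the module name literal_invariant corresponds to anything in the real compiler.

```erlang
-module(literal_invariant).
-export([decode_literal/2, demo/0]).

%% Hypothetical: turn the raw bytes of a string literal into code points,
%% given the declared encoding of the source file.
decode_literal(Bytes, Encoding)
  when Encoding =:= latin1; Encoding =:= utf8 ->
    unicode:characters_to_list(list_to_binary(Bytes), Encoding).

demo() ->
    %% The same literal (sharp s, U+00DF) as stored in two files.
    Latin1File = [223],
    Utf8File = [195, 159],
    %% With the correct encoding declared, both mean [223].
    [223] = decode_literal(Latin1File, latin1),
    [223] = decode_literal(Utf8File, utf8),
    %% Reading the UTF-8 file as Latin-1 changes the value of the
    %% literal, which is exactly what must not happen.
    [195, 159] = decode_literal(Utf8File, latin1),
    ok.
```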
>> Will "erlc foo.erl" automatically detect that foo.erl is unicode
>> encoded and do the right thing when scanning and tokenising strings?
No. Erlang source code is (currently) Latin-1 by definition. No matter
what your editor thinks it is using, the compiler will interpret the
bytes as Latin-1.
/ᚱᛁᚴᚼᛅᚱᛏ