[erlang-questions] cookbook entry #1 - unicode/UTF-8 strings

Fri Oct 21 14:35:48 CEST 2011

On 10/21/2011 10:41 AM, Angel J. Alvarez Miguel wrote:
> (I'm using Kate on openSUSE 11.4 x64 and Erlang/OTP R14B04 (erts-5.8.5) and my
> sources are in utf-8)

No, don't make this mistake. To the Erlang compiler, your sources are in 
Latin-1, plain and simple. As far as the compiler knows, you have 
actually written "Ã³ Ã± Ã¼" and nothing else. When you print the string 
with io:format, you are printing the Latin-1 text "Ã³ Ã± Ã¼" (the bytes 
[195, 179, 32, 195, 177, 32, 195, 188]) to the standard output. That 
your console decodes these bytes as UTF-8 and shows "ó ñ ü" just means 
that you have managed to fool the system for this particular use case.
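
To see the two readings of the same bytes side by side, here is a small 
sketch using the standard unicode module. The module name misread_demo 
and its function names are made up for illustration, and the text is 
written as integers so that the file itself is pure ASCII and does not 
depend on any source encoding.

```erlang
-module(misread_demo).
-export([utf8_bytes/0, as_latin1/0, as_utf8/0]).

%% Code points for "o-acute, space, n-tilde, space, u-umlaut",
%% written as integers so this file is pure ASCII.
text() -> [243, 32, 241, 32, 252].

%% The bytes a UTF-8 editor writes to disk for that text:
%% [195,179,32,195,177,32,195,188]
utf8_bytes() ->
    binary_to_list(unicode:characters_to_binary(text(), unicode, utf8)).

%% What a Latin-1 reader sees: one character per byte, so eight
%% characters, identical to the byte values above.
as_latin1() ->
    unicode:characters_to_list(list_to_binary(utf8_bytes()), latin1).

%% What a UTF-8 console reconstructs from the same bytes:
%% [243,32,241,32,252]
as_utf8() ->
    unicode:characters_to_list(list_to_binary(utf8_bytes()), utf8).
```

The compiler is in the position of as_latin1/0 and the console in that 
of as_utf8/0; the string only looks right because the second happens to 
undo the first.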

(By the way, those characters are already in the Latin-1 charset, so you 
don't *need* UTF-8 at all unless you have some additional characters you 
want to use that are above 255 in Unicode.)

If/when Erlang supports other encodings in source code (this will 
probably require adding a compiler flag for specifying the input 
encoding), a string literal such as "ᚱ" should be equivalent to [5809], 
not [225,154,177], just like your "óñü" should be equivalent to 
[243,241,252] (which is what you would have got if your editor had been 
set to Latin-1 to begin with).
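
The difference between the two lists is simply code points versus 
encoded bytes. A minimal sketch, again with the standard unicode module 
and a hypothetical module name (rune_demo), with the rune given by its 
number so that the file stays ASCII:

```erlang
-module(rune_demo).
-export([code_points/0, utf8_bytes/0]).

%% The rune U+16B1 as a one-element list of code points: [5809].
code_points() -> [16#16B1].

%% The same character encoded as UTF-8, one integer per byte:
%% [225,154,177].
utf8_bytes() ->
    binary_to_list(unicode:characters_to_binary(code_points(), unicode, utf8)).
```

A string literal should give you what code_points/0 returns; 
utf8_bytes/0 is what you get when UTF-8 input is read byte by byte.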

One can think about it like this: taking an existing, working, Latin-1 
source file, converting it to UTF-8 (or any other encoding), and 
compiling it with a flag that informs the compiler what the input 
encoding is, should not change the semantics of the program in any 
respect whatsoever compared to compiling the original source file. Thus, 
a string literal that today contains "ß" ([223]) in a plain old Latin-1 
encoded Erlang source file must *always* mean [223] no matter what you 
change the input encoding to.
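
That invariant can be checked for the whole Latin-1 range. The sketch 
below is only an illustration of the argument, not anything the compiler 
does today; the module name roundtrip_demo and its functions are 
invented, and only the unicode and lists calls are real library 
functions.

```erlang
-module(roundtrip_demo).
-export([reencode/1, stable/0]).

%% Re-encode Latin-1 characters as UTF-8, as an editor would when
%% converting a file. For example, [223] becomes <<195,159>>.
reencode(Latin1Chars) ->
    unicode:characters_to_binary(Latin1Chars, latin1, utf8).

%% Decoding the converted bytes as UTF-8 gives back the original
%% values for every Latin-1 character 0..255, so this returns true.
stable() ->
    All = lists:seq(0, 255),
    All =:= unicode:characters_to_list(reencode(All), utf8).
```

So a compiler that is told the input is UTF-8 and decodes it properly 
would still see [223] for "ß", even though the file now holds two bytes 
for it.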

>> Will "erlc foo.erl" automatically detect that foo.erl is unicode
>> encoded and do the right thing when scanning and tokenising strings?

No. Erlang source code is (currently) Latin-1 by definition. No matter 
what your editor thinks it is using, the compiler will interpret the 
bytes as Latin-1.

    /ᚱᛁᚴᚼᛅᚱᛏ