[erlang-questions] cookbook entry #1 - unicode/UTF-8 strings

Fri Oct 21 22:03:37 CEST 2011

On Fri, Oct 21, 2011 at 12:28 PM, Joe Armstrong <erlang@REDACTED> wrote:
> On Fri, Oct 21, 2011 at 2:35 PM, Richard Carlsson
> <carlsson.richard@REDACTED> wrote:
>> On 10/21/2011 10:41 AM, Angel J. Alvarez Miguel wrote:
>>>
>>> (I'm using Kate on openSUSE 11.4 x64 and Erlang/OTP R14B04 (erts-5.8.5),
>>> and my sources are in UTF-8.)
>>
>> No, don't make this mistake. To the Erlang compiler, your sources are in
>> Latin-1, plain and simple. As far as the compiler knows, you have actually
>> written "Ã³ Ã± Ã¼" and nothing else. When you print the string with
>> io:format, you are printing the Latin-1 text "Ã³ Ã± Ã¼" (the bytes [195,
>> 179, 32, 195, 177, 32, 195, 188]) to the standard output. That your console
>> re-interprets these bytes as "ó ñ ü" just means that you have managed to
>> fool the system for this particular use case.
>>
>> (By the way, those characters are already in the Latin-1 charset, so you
>> don't *need* UTF-8 at all unless you have some additional characters you
>> want to use that are above 255 in Unicode.)
>>
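
A minimal sketch of the point above, using the standard unicode module.
The module and function names (enc_demo, bytes/0, as_latin1/0, as_utf8/0)
are made up for illustration. The same eight bytes give different
character lists depending on which encoding you assume:

```erlang
-module(enc_demo).
-export([bytes/0, as_latin1/0, as_utf8/0]).

%% The bytes quoted above: what a UTF-8 editor writes to disk for
%% o-acute, space, n-tilde, space, u-umlaut.
bytes() ->
    <<195,179,32,195,177,32,195,188>>.

%% Read as Latin-1 (what the compiler does): one character per byte.
as_latin1() ->
    unicode:characters_to_list(bytes(), latin1).

%% Read as UTF-8 (what the console does): multi-byte sequences are
%% combined into single code points.
as_utf8() ->
    unicode:characters_to_list(bytes(), utf8).
```

Here as_latin1() gives back the eight byte values unchanged, while
as_utf8() gives [243,32,241,32,252], i.e. the three intended characters
separated by spaces.
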
>> If/when Erlang supports other encodings in source code (this will probably
>> require adding a compiler flag for specifying the input encoding), a string
>> literal such as "ᚱ" should be equivalent to [5809], not [225,154,177], just
>> like your "óñü" should be equivalent to [243,241,252] (which is what you
>> would have got if your editor had been set to Latin-1 to begin with).
>>
>> One can think about it like this: taking an existing, working, Latin-1
>> source file, converting it to UTF-8 (or any other encoding), and compiling
>> it with a flag that informs the compiler what the input encoding is, should
>> not change the semantics of the program in any respect whatsoever compared
>> to compiling the original source file. Thus, a string literal that today
>> contains "ß" ([223]) in a plain old Latin-1 encoded Erlang source file must
>> *always* mean [223] no matter what you change the input encoding to.
>>
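
The invariant described above can be checked by hand. This is only a
sketch: reencode/1 is a hypothetical helper that mimics "convert the file
to UTF-8, then compile with an encoding-aware compiler", and no such
compiler flag is assumed to exist.

```erlang
-module(enc_invariant).
-export([reencode/1, check/0]).

%% Take a string literal as the compiler sees it today (a list of
%% Latin-1 code points), re-encode it as UTF-8 bytes the way an editor
%% would when converting the file, then decode those bytes as UTF-8.
reencode(Latin1String) ->
    Utf8Bytes = unicode:characters_to_binary(Latin1String, latin1, utf8),
    unicode:characters_to_list(Utf8Bytes, utf8).

check() ->
    %% Sharp s is [223] before and after the conversion.
    [223] = reencode([223]),
    [243,241,252] = reencode([243,241,252]),
    %% A character above 255 decodes to one code point, not three bytes.
    [5809] = unicode:characters_to_list(<<225,154,177>>, utf8),
    ok.
```

If check() returns ok, the literals mean the same thing regardless of the
on-disk encoding, which is the property being asked for.
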
>>>> Will "erlc foo.erl" automatically detect that foo.erl is unicode
>>>> encoded and do the right thing when scanning and tokenising strings?
>>
>> No. Erlang source code is (currently) Latin-1 by definition. No matter what
>> your editor thinks it is using, the compiler will interpret the bytes as
>> Latin-1.
>
> I hate to say this - but just about the only thing XML got right was
> the declaration
>
>   <?xml version="1.0" encoding="UTF-8" standalone="no" ?>
>
> Should we have
>
>   -erlang("1.0","UTF-8","no"). :-)
>
> as the first line :-)
>
> (( I have argued in vain for a version for years - to allow for
> incompatible changes to the syntax ))

Python handles the encoding declaration with a specially formatted
comment on the first or second line of the file (PEP 263). Changing the
default from Latin-1 to UTF-8 (or ASCII!) would also be less surprising
to most people. The benefit of the comment approach is that some text
editors (e.g. Emacs) already know what to do with the declaration.

%% -*- coding: utf-8 -*-

http://www.python.org/dev/peps/pep-0263/
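
A rough sketch of how a tool might pick up such a comment, assuming the
same pattern PEP 263 uses. The module coding_comment and the function
encoding/1 are hypothetical, not an existing OTP API:

```erlang
-module(coding_comment).
-export([encoding/1]).

%% Look for a "coding: NAME" or "coding=NAME" declaration in one line of
%% source text. Returns {ok, Name} if found, otherwise none, in which
%% case the caller would fall back to the default encoding.
encoding(Line) ->
    case re:run(Line, "coding[:=]\\s*([-\\w.]+)",
                [{capture, all_but_first, list}]) of
        {match, [Name]} -> {ok, Name};
        nomatch -> none
    end.
```

For the comment above, encoding("%% -*- coding: utf-8 -*-") gives
{ok,"utf-8"}. PEP 263 only honours the declaration on the first or second
line of the file, so a caller would presumably apply the same limit.
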

-bob