[erlang-questions] cookbook entry #1 - unicode/UTF-8 strings

Fri Oct 21 22:37:21 CEST 2011

On 10/21/2011 01:03 PM, Bob Ippolito wrote:
> On Fri, Oct 21, 2011 at 12:28 PM, Joe Armstrong <erlang@REDACTED> wrote:
>> On Fri, Oct 21, 2011 at 2:35 PM, Richard Carlsson
>> <carlsson.richard@REDACTED> wrote:
>>> On 10/21/2011 10:41 AM, Angel J. Alvarez Miguel wrote:
>>>> (I'm using Kate on openSUSE 11.4 x64 and Erlang/OTP R14B04 (erts-5.8.5),
>>>> and my sources are in UTF-8)
>>> No, don't make this mistake. To the Erlang compiler, your sources are in
>>> Latin-1, plain and simple. As far as the compiler knows, you have actually
>>> written "Ã³ Ã± Ã¼" and nothing else. When you print the string with
>>> io:format, you are printing the Latin-1 text "Ã³ Ã± Ã¼" (the bytes [195,
>>> 179, 32, 195, 177, 32, 195, 188]) to the standard output. That your console
>>> re-interprets these bytes as "ó ñ ü" just means that you have managed to
>>> fool the system for this particular use case.
>>>
>>> (By the way, those characters are already in the Latin-1 charset, so you
>>> don't *need* UTF-8 at all unless you have some additional characters you
>>> want to use that are above 255 in Unicode.)
>>>
>>> If/when Erlang supports other encodings in source code (this will probably
>>> require adding a compiler flag for specifying the input encoding), a string
>>> literal such as "ᚱ" should be equivalent to [5809], not [225,154,177], just
>>> like your "óñü" should be equivalent to [243,241,252] (which is what you
>>> would have got if your editor had been set to Latin-1 to begin with).
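
To make the byte-versus-code-point distinction above concrete, here is a minimal sketch. The module name enc_demo and its function names are made up for illustration; the only library call is unicode:characters_to_list/2. The non-ASCII characters are written as explicit byte values so the result does not depend on how the compiler reads the source file.

```erlang
-module(enc_demo).
-export([as_latin1/1, as_utf8/1, demo/0]).

%% Read each byte as one Latin-1 character (what the compiler does today).
as_latin1(Bytes) when is_binary(Bytes) ->
    binary_to_list(Bytes).

%% Decode the same bytes as UTF-8 into Unicode code points.
as_utf8(Bytes) when is_binary(Bytes) ->
    unicode:characters_to_list(Bytes, utf8).

demo() ->
    %% Bytes a UTF-8 editor stores for "ó ñ ü" and for the rune U+16B1.
    Accented = <<195,179,32,195,177,32,195,188>>,
    Rune = <<225,154,177>>,
    [{as_latin1(Accented), as_utf8(Accented)},
     {as_latin1(Rune), as_utf8(Rune)}].
```

Calling enc_demo:demo() should give [195,179,32,195,177,32,195,188] against [243,32,241,32,252] for the first pair, and [225,154,177] against [5809] for the second: the same bytes, two different strings.
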
>>>
>>> One can think about it like this: taking an existing, working, Latin-1
>>> source file, converting it to UTF-8 (or any other encoding), and compiling
>>> it with a flag that informs the compiler what the input encoding is, should
>>> not change the semantics of the program in any respect whatsoever compared
>>> to compiling the original source file. Thus, a string literal that today
>>> contains "ß" ([223]) in a plain old Latin-1 encoded Erlang source file must
>>> *always* mean [223] no matter what you change the input encoding to.
>>>
>>>>> Will "erlc foo.erl" automatically detect that foo.erl is Unicode-encoded
>>>>> and do the right thing when scanning and tokenising strings?
>>> No. Erlang source code is (currently) Latin-1 by definition. No matter what
>>> your editor thinks it is using, the compiler will interpret the bytes as
>>> Latin-1.
>> I hate to say this - but just about the only thing XML got right was
>> the declaration
>>
>>   <?xml version="1.0" encoding="UTF-8" standalone="no" ?>
>>
>> Should we have
>>
>>   -erlang("1.0","UTF-8","no"). :-)
>>
>> as the first line :-)
>>
>> (( I have argued in vain for a version for years - to allow for
>> incompatible changes to the syntax ))
> Python does the encoding declaration with a comment near the top of
> the file. Changing the default from latin-1 to utf-8 (or ascii!) would
> also be less surprising to most. The benefit of this approach is that
> some text editors (e.g. Emacs) already know what to do with the
> declaration.
>
> %% -*- coding: utf-8 -*-
>
> http://www.python.org/dev/peps/pep-0263/
>
> -bob

Please keep in mind there are two modeline syntaxes, one for vi and one for Emacs (this is mentioned in the Python PEP linked above). So it would be nice to support both formats if they occur within the first 3 lines or so of the source file (editors usually expose this limit as a configuration item):

%%% -*- coding: utf-8; Mode: erlang; tab-width: 4; c-basic-offset: 4; indent-tabs-mode: nil -*-
%%% ex: set softtabstop=4 tabstop=4 shiftwidth=4 expandtab fileencoding=utf-8:

The rest of each modeline just makes sure the code is indented with 4 spaces and no tabs.
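
A rough sketch of what such a check could look like, borrowing the pattern from the PEP: since "fileencoding" ends in "coding", one regular expression covers both the Emacs and the vi form. The module coding_decl, the function detect/1 and the 3-line limit are assumptions for illustration, not anything the compiler does.

```erlang
-module(coding_decl).
-export([detect/1]).

%% Hypothetical limit: how many leading lines are searched.
-define(MAX_LINES, 3).

%% Returns {ok, Encoding} for the first declaration found in a comment
%% within the first ?MAX_LINES lines of the source, otherwise none.
detect(Source) when is_binary(Source) ->
    Lines = binary:split(Source, <<"\n">>, [global]),
    scan(lists:sublist(Lines, ?MAX_LINES)).

scan([]) ->
    none;
scan([Line | Rest]) ->
    %% A '%' comment containing "coding:" or "coding=" and a name.
    Pattern = "^[ \\t]*%.*?coding[:=][ \\t]*([-\\w.]+)",
    case re:run(Line, Pattern, [{capture, all_but_first, list}]) of
        {match, [Encoding]} -> {ok, Encoding};
        nomatch -> scan(Rest)
    end.
```

Given either of the two modelines above as the first line of a file, detect/1 should return {ok,"utf-8"}; a file with no declaration in its first three lines yields none.
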

- Michael