[erlang-questions] unicode in string literals

Mon Jul 30 15:02:13 CEST 2012

On 07/30/2012 02:35 PM, Joe Armstrong wrote:
> What is a literal string in Erlang? Originally it was a list of
> integers, each integer
> being a single character code - this made strings very easy to work with
>
> The code
>
>      test() -> "a∞b".
>
> Compiles to code which returns the list
> of integers [97,226,136,158,98].
>
> This is very inconvenient. I had expected it to return
> [97, 8734, 98]. The length of the list should be 3 not 5
> since it contains three unicode characters not five.
>
> Is this a bug or a horrible misfeature?

You saved your source file as UTF-8, so between the two double quotes
the source file contains exactly those five bytes. But the Erlang compiler
assumes your source code is Latin-1, so it thinks that you wrote a
Latin-1 string of 5 characters (some of which are non-printing). There is
as yet no support for telling the compiler that the input is anything
other than Latin-1, so you can't save your source files as UTF-8. (One
thing you can do is put the UTF-8 strings in another file and read them
at runtime.)
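
As a rough sketch of that workaround: read the bytes at runtime and decode
them with unicode:characters_to_list/2 from the standard library. The module
name utf8_demo and both function names are made up for illustration, and
the file path is whatever you choose. Only ASCII appears in the source, so
the result does not depend on how the compiler reads the file.

```erlang
-module(utf8_demo).
-export([decode/1, read_utf8_file/1]).

%% Decode a UTF-8 encoded binary into a list of Unicode code points.
%% On malformed input, unicode:characters_to_list/2 returns an
%% {error, ...} or {incomplete, ...} tuple instead of a list.
decode(Bytes) when is_binary(Bytes) ->
    unicode:characters_to_list(Bytes, utf8).

%% Read a whole file that is assumed to hold UTF-8 text and return
%% its contents as a list of code points.
read_utf8_file(Path) ->
    {ok, Bytes} = file:read_file(Path),
    decode(Bytes).
```

With the five bytes from your example, utf8_demo:decode(<<97,226,136,158,98>>)
should give [97,8734,98], a list of length 3. The same call can repair a
literal that was compiled as Latin-1 if you pass it through list_to_binary/1
first.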

> test() -> <<"a∞b"/utf8>>   seems to be a bug

Try <<"åäö"/utf8>>. The /utf8 specifier works, but as in your first
example, the source string itself is limited to Latin-1 characters.
Strings entered in the shell may be interpreted differently, though,
depending on your locale settings.
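
If you want to sidestep the source encoding entirely, one option is to give
the code points as integers, since the utf8 type specifier also accepts an
integer segment. This is only a sketch; the module name utf8_bin_demo and
its function names are invented for the example.

```erlang
-module(utf8_bin_demo).
-export([aao/0, infinity/0]).

%% The code points of a-ring (229), a-umlaut (228) and o-umlaut (246),
%% each encoded as UTF-8 by the utf8 segment type.
aao() ->
    <<229/utf8, 228/utf8, 246/utf8>>.

%% U+221E (8734), the infinity sign from the original question,
%% surrounded by $a and $b.
infinity() ->
    <<$a/utf8, 8734/utf8, $b/utf8>>.
```

Here utf8_bin_demo:aao() should return <<195,165,195,164,195,182>> and
utf8_bin_demo:infinity() should return <<97,226,136,158,98>>, the same five
bytes as in the original list.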

    /Richard