[erlang-questions] unicode in string literals
Richard Carlsson
carlsson.richard@REDACTED
Mon Jul 30 15:02:13 CEST 2012
On 07/30/2012 02:35 PM, Joe Armstrong wrote:
> What is a literal string in Erlang? Originally it was a list of
> integers, each integer
> being a single character code - this made strings very easy to work with
>
> The code
>
> test() -> "a∞b".
>
> Compiles to code which returns the list
> of integers [97,226,136,158,98].
>
> This is very inconvenient. I had expected it to return
> [97, 8734, 98]. The length of the list should be 3 not 5
> since it contains three unicode characters not five.
>
> Is this a bug or a horrible misfeature?
You saved your source file as UTF-8, so between the two double quotes
the source file contains exactly those five bytes. But the Erlang
compiler assumes your source code is Latin-1, so it thinks that you
wrote a Latin-1 string of 5 characters (some of which are non-printing).
There is as yet no support for telling the compiler that the input is
anything other than Latin-1, so you can't save your source files as
UTF-8 and have non-ASCII string literals come out as code points. (One
thing you can do is put the UTF-8 strings in another file and read them
at runtime.)
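
To make the byte-versus-code-point difference concrete, here is a
minimal sketch of that workaround. The module name utf8_demo and both
function names are made up for illustration, and the source is kept
pure ASCII so it compiles the same way whatever encoding the file is
saved in. It only relies on unicode:characters_to_list/2 and
file:read_file/1 from the standard library.

```erlang
-module(utf8_demo).
-export([decode/1, read_utf8_file/1]).

%% Reinterpret a list of UTF-8 bytes (what the compiler produced from
%% the literal) as a list of Unicode code points.
%% On invalid or truncated input, unicode:characters_to_list/2 returns
%% an {error, ...} or {incomplete, ...} tuple instead of a list.
decode(Bytes) ->
    unicode:characters_to_list(list_to_binary(Bytes), utf8).

%% Read a whole UTF-8 encoded file at runtime and return its contents
%% as a list of code points. Path is whatever file holds the strings.
read_utf8_file(Path) ->
    {ok, Bin} = file:read_file(Path),
    unicode:characters_to_list(Bin, utf8).
```

With this, utf8_demo:decode([97,226,136,158,98]) gives [97,8734,98],
the three-element list you expected.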
> test() -> <<"a∞b"/utf8>> seems to be a bug
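Try <<"åäö"/utf8>> in a source file saved as Latin-1. It works: the
/utf8 flag encodes each character of the literal as UTF-8. But as in
your first example, the characters in the source string are limited to
Latin-1. Strings entered in the shell may be interpreted differently,
though, depending on your locale settings.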
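
As a rough illustration of what the /utf8 flag does, the sketch below
writes the characters as integer code points instead of literal
characters, so the source stays pure ASCII and the file encoding no
longer matters. The module and function names are hypothetical. It
relies on the documented behaviour that a string literal with a /utf8
flag in a binary is shorthand for applying the flag to each character.

```erlang
-module(utf8_bin_demo).
-export([latin1_chars/0, with_infinity/0]).

%% Same result as <<"åäö"/utf8>> compiled from a Latin-1 source file:
%% code points 229, 228 and 246, each encoded as two UTF-8 bytes.
latin1_chars() ->
    <<229/utf8, 228/utf8, 246/utf8>>.

%% 8734 is the code point of the infinity sign (U+221E). Written as an
%% integer it can go past Latin-1 even though the source cannot.
with_infinity() ->
    <<$a/utf8, 8734/utf8, $b/utf8>>.
```

Here latin1_chars() returns <<195,165,195,164,195,182>> and
with_infinity() returns <<97,226,136,158,98>>, the same five bytes as
in your first example.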
/Richard