[erlang-questions] unicode in string literals

Mon Jul 30 15:13:49 CEST 2012

On 2012-07-30, at 15:02 , Richard Carlsson wrote:

> On 07/30/2012 02:35 PM, Joe Armstrong wrote:
>> What is a literal string in Erlang? Originally it was a list of
>> integers, each integer being a single character code - this made
>> strings very easy to work with.
>> 
>> The code
>> 
>>     test() -> "a∞b".
>> 
>> Compiles to code which returns the list
>> of integers [97,226,136,158,98].
>> 
>> This is very inconvenient. I had expected it to return
>> [97, 8734, 98]. The length of the list should be 3 not 5
>> since it contains three unicode characters not five.
>> 
>> Is this a bug or a horrible misfeature?
> 
> You saved your source file as UTF-8, so between the two double-quotes, the source file contains exactly those bytes. But the Erlang compiler assumes your source code is Latin-1.

I'd expect Erlang's string manipulation functions to assume that as
well (that strings are lists of "bytes"), don't they? E.g. that `words`
splits on 0x20 (and maybe 0xA0), not on the Unicode Zs (space
separator) general category?
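
To make the byte/character distinction concrete, here is a minimal sketch. The module name utf8_demo and its function names are made up for illustration; the only library calls are unicode:characters_to_list/2 and string:words/1 from the standard library. It takes the five bytes quoted above and reinterprets them as UTF-8, which should give back the three code points Joe expected.

```erlang
-module(utf8_demo).
-export([bytes/0, codepoints/0, lengths/0, word_counts/0]).

%% The five integers from the thread: the UTF-8 bytes of the literal,
%% each read as one Latin-1 character by the compiler.
bytes() ->
    [97, 226, 136, 158, 98].

%% Reinterpret those bytes as UTF-8 to recover the code points.
%% Returns [97, 8734, 98].
codepoints() ->
    unicode:characters_to_list(list_to_binary(bytes()), utf8).

%% Five bytes versus three characters. Returns {5, 3}.
lengths() ->
    {length(bytes()), length(codepoints())}.

%% Word counts for "a b" written with a plain space (0x20) and with a
%% no-break space (0xA0), to check which separators string:words/1 honours.
word_counts() ->
    {string:words([$a, 16#20, $b]), string:words([$a, 16#A0, $b])}.
```

Calling utf8_demo:word_counts() in the shell is a quick way to check the 0x20 versus 0xA0 question on whatever release is installed, rather than taking my guess for it.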