[erlang-questions] unicode in string literals
Masklinn
masklinn@REDACTED
Mon Jul 30 15:13:49 CEST 2012
On 2012-07-30, at 15:02 , Richard Carlsson wrote:
> On 07/30/2012 02:35 PM, Joe Armstrong wrote:
>> What is a literal string in Erlang? Originally it was a list of
>> integers, each integer being a single character code. This made
>> strings very easy to work with.
>>
>> The code
>>
>> test() -> "a∞b".
>>
>> Compiles to code which returns the list
>> of integers [97,226,136,158,98].
>>
>> This is very inconvenient. I had expected it to return
>> [97,8734,98]. The length of the list should be 3, not 5,
>> since it contains three Unicode characters, not five.
>>
>> Is this a bug or a horrible misfeature?
>
> You saved your source file as UTF-8, so between the two double-quotes, the source file contains exactly those bytes. But the Erlang compiler assumes your source code is Latin-1
I'd expect Erlang's string manipulation functions to make the same
assumption (that strings are lists of "bytes"), wouldn't they? E.g. that
`words` splits on 0x20 (and maybe 0xA0), not on every character in the
Unicode Zs (space separator) general category?
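
To make the difference concrete, here is a minimal sketch of both points.
The module name `utf8_literal` and the functions `decode/1` and `demo/0`
are made up for illustration. It assumes the list holds the raw UTF-8
bytes from Joe's example, and that `string:words/1` separates on the
space character (32) only, which is how I understand the classic `string`
module to behave.

```erlang
-module(utf8_literal).
-export([decode/1, demo/0]).

%% Hypothetical helper: reinterpret a list of byte values (what a compiler
%% reading the source as Latin-1 produces from a UTF-8 file) as UTF-8 and
%% return the list of Unicode code points.
decode(Bytes) ->
    unicode:characters_to_list(list_to_binary(Bytes), utf8).

demo() ->
    %% The five bytes from the example: $a, the UTF-8 encoding of U+221E, $b.
    Bytes = [97,226,136,158,98],
    5 = length(Bytes),
    %% Decoded, the list has the three elements Joe expected.
    [97,8734,98] = decode(Bytes),
    3 = length(decode(Bytes)),
    %% Assumption: string:words/1 uses 32 as its only separator, so a
    %% Latin-1 no-break space (160) does not split the string.
    2 = string:words([97,32,98]),
    1 = string:words([97,160,98]),
    ok.
```

Calling `utf8_literal:demo()` should return `ok` if those assumptions
hold. Decoding after the fact is only a workaround, since it relies on
knowing that the source file was saved as UTF-8.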