[erlang-questions] correct terminology for referring to strings

Tue Jul 31 16:04:05 CEST 2012

On 07/31/2012 01:48 PM, Ian wrote:
>> << A "string" is a list of integers where the integers
>>        represent Unicode codepoints. >>
>
> I think this is technically correct, but it is very confusing because it
> implies that the source file may be encoded as unicode.
>
> As I understand it, source files are always treated as being in Latin-1.
> This means that string literals are lists of Latin-1 values, and not
> lists of unicode codepoints. (The values from 128 to 255 have
> different/no meanings, and values > 255 will not happen).

No, you're confusing Unicode (where a string is a sequence of code
points) with specific encodings such as UTF-8 and UTF-16. The code
point numbering is backward compatible with Latin-1: the values from
128 to 255 mean the same thing. In UTF-8 they do not: each of those
characters is encoded as two bytes. At runtime, Erlang's strings are
just plain sequences of Unicode code points (you can think of it as
UTF-32 if you like). Whether the source code is encoded in UTF-8 or
Latin-1 or any other encoding is irrelevant as long as the compiler
knows how to transform the input to the one-integer-per-code-point
representation.
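
To make the distinction concrete, here is a minimal sketch using the
unicode module from stdlib. The module and function names
(codepoints_demo, demo/0) are made up for illustration; the string is
written as an integer list so that the result does not depend on how
the source file itself is encoded.

```erlang
-module(codepoints_demo).
-export([demo/0]).

%% One code point, three encodings. 229 is U+00E5 (a with ring above),
%% which has the same value in Latin-1.
demo() ->
    Str = [229],
    %% Latin-1: one byte, identical to the code point value.
    Latin1 = unicode:characters_to_binary(Str, unicode, latin1),
    %% UTF-8: the same code point takes two bytes.
    Utf8 = unicode:characters_to_binary(Str, unicode, utf8),
    %% UTF-32 (big-endian by default): four bytes per code point.
    Utf32 = unicode:characters_to_binary(Str, unicode, utf32),
    {Latin1, Utf8, Utf32}.
```

The list [229] is the string in all three cases; only the binaries
produced on the way out differ.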

As an example, take string reversal. Reversing a Unicode string is a
bad idea in general, because it could contain combining characters,
and reversing the order of the code points will then attach the
combining marks to the wrong base characters. But an expression like
lists:reverse("a∞b") will work on the list [97, 8734, 98] (once the
compiler has been extended to accept encodings other than Latin-1),
not on the list [97,226,136,158,98], so it will produce the intended
"b∞a". This string might then be encoded as UTF-8 on its way to your
terminal, but that's another story.
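
The difference can be sketched as follows. Again, the module and
function names (reverse_demo and friends) are hypothetical, and the
strings are given as integer lists so that the example does not rely
on the compiler accepting non-Latin-1 source.

```erlang
-module(reverse_demo).
-export([codepoints/0, bytes/0, combining/0]).

%% Reversing the code point list [97, 8734, 98] ("a", U+221E, "b").
codepoints() ->
    lists:reverse([97, 8734, 98]).

%% Reversing the UTF-8 bytes instead: the three bytes of U+221E end up
%% in the wrong order, so the result no longer decodes as UTF-8.
bytes() ->
    Utf8 = unicode:characters_to_binary([97, 8734, 98]),
    Reversed = lists:reverse(binary_to_list(Utf8)),
    unicode:characters_to_list(list_to_binary(Reversed)).

%% The caveat about combining characters: "e", U+0301 (combining acute
%% accent), "x". After reversal the accent follows "x", not "e".
combining() ->
    lists:reverse([101, 769, 120]).
```

Here codepoints/0 gives the intended list, bytes/0 returns an error
tuple from the unicode module rather than a string, and combining/0
shows why code point reversal is still not safe for arbitrary text.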

     /Richard