[erlang-questions] correct terminology for referring to strings
Richard Carlsson
carlsson.richard@REDACTED
Tue Jul 31 16:04:05 CEST 2012
On 07/31/2012 01:48 PM, Ian wrote:
>> << A "string" is a list of integers where the integers
>> represent Unicode codepoints. >>
>
> I think this is technically correct, but it is very confusing because it
> implies that the source file may be encoded as unicode.
>
> As I understand it, source files are always treated as being in Latin-1.
> This means that string literals are lists of Latin-1 values, and not
> lists of unicode codepoints. (The values from 128 to 255 have
> different/no meanings, and values > 255 will not happen).
No, you're confusing Unicode (a sequence of code points) with specific
encodings such as UTF-8 and UTF-16. Unicode code points are backwards
compatible with Latin-1: the values from 128 to 255 denote the same
characters. In UTF-8 they do not: each of those characters is encoded as
two bytes. At runtime, Erlang's strings are just plain sequences of
Unicode code points (you can think of it as UTF-32 if you like). Whether
the source code is encoded in UTF-8 or Latin-1 or any other encoding is
irrelevant as long as the compiler knows how to transform the input to
the one-integer-per-code-point representation.
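A minimal sketch of that distinction, using only the standard unicode
module. The module name codepoint_demo and the sample string are made up
for illustration, and the string is written as integers so the example
does not depend on how the source file itself is encoded:

```erlang
-module(codepoint_demo).
-export([run/0]).

%% Hypothetical demo module; only the calls into OTP's unicode module are real API.
run() ->
    %% "café" as a list of code points. 233 is U+00E9, which is also
    %% the value of that character in Latin-1.
    Str = [99, 97, 102, 233],

    %% Encoded as Latin-1: one byte per code point, same numeric values.
    <<99, 97, 102, 233>> = unicode:characters_to_binary(Str, unicode, latin1),

    %% Encoded as UTF-8: code point 233 becomes the two bytes 195, 169.
    <<99, 97, 102, 195, 169>> = unicode:characters_to_binary(Str, unicode, utf8),

    %% Decoding either byte sequence yields the same list of code points.
    Str = unicode:characters_to_list(<<99, 97, 102, 233>>, latin1),
    Str = unicode:characters_to_list(<<99, 97, 102, 195, 169>>, utf8),
    ok.
```

Each pattern match acts as an assertion, so codepoint_demo:run() returns
ok only if the encodings behave as described.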
For example, reversing a Unicode string is a bad idea anyway because it
could contain combining characters, and reversing the order of the
code points in that case will garble the string: each combining mark ends
up before its base character instead of after it. But an expression
like lists:reverse("a∞b") will be working on the list [97, 8734, 98]
(once the compiler has been extended to accept other encodings than
Latin-1), not the list [97,226,136,158,98], so it will produce the
intended "b∞a". This string might then become encoded as UTF-8 on its
way to your terminal, but that's another story.
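As a rough illustration of the reversal example, here is a sketch that
works on the code point list directly and, for contrast, on the UTF-8
bytes. The module name reverse_demo is hypothetical, and the strings are
again given as integers to sidestep the question of source encoding:

```erlang
-module(reverse_demo).
-export([run/0]).

%% Hypothetical demo module; lists and unicode are the standard OTP modules.
run() ->
    %% "a∞b" as code points: 8734 is U+221E (INFINITY).
    Str = [97, 8734, 98],

    %% Reversing the code point list gives the intended "b∞a".
    [98, 8734, 97] = lists:reverse(Str),

    %% The same string as UTF-8 bytes: U+221E takes three bytes.
    Utf8 = unicode:characters_to_binary(Str, unicode, utf8),
    <<97, 226, 136, 158, 98>> = Utf8,

    %% Reversing the bytes instead of the code points scrambles the
    %% three-byte sequence, and the result no longer decodes as UTF-8.
    BadBytes = list_to_binary(lists:reverse(binary_to_list(Utf8))),
    {error, _, _} = unicode:characters_to_list(BadBytes, utf8),

    %% The combining-character caveat: "e" followed by U+0301
    %% (COMBINING ACUTE ACCENT) reverses to a list that starts with
    %% the combining mark.
    [769, 101] = lists:reverse([101, 769]),
    ok.
```

Calling reverse_demo:run() returns ok if every match succeeds.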
/Richard