[erlang-questions] correct terminology for referring to strings

Michael Turner michael.eugene.turner@REDACTED
Tue Jul 31 16:19:49 CEST 2012


> At runtime, Erlang's strings are just plain sequences of Unicode code points
> (you can think of it as UTF-32 if you like).

Can you go further and say that it actually *is* UTF-32? A footnote
like "[*] Basically, UTF-32; see ref XYZ for details" might be
helpful.

-michael turner
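
A small sketch of the distinction behind this question, in case it helps. At runtime the string is a list of integer terms, one per code point, so it has no fixed byte layout the way a UTF-32 buffer does; each element does correspond to exactly one UTF-32 code unit, though. The module name codepoints_demo and the function demo/0 are made up for illustration; unicode:characters_to_binary/3 is the standard library call.

```erlang
-module(codepoints_demo).
-export([demo/0]).

%% Hypothetical demo: one list of code points, two different encodings.
demo() ->
    %% "a", U+221E (infinity sign), "b" as plain code points.
    Str = [97, 8734, 98],
    %% Encoding step: the same code points as big-endian UTF-32 bytes.
    Utf32 = unicode:characters_to_binary(Str, unicode, {utf32, big}),
    %% And as UTF-8 bytes, for comparison.
    Utf8 = unicode:characters_to_binary(Str, unicode, utf8),
    {Str, Utf32, Utf8}.
```

The UTF-32 binary spends four bytes on every code point, while the UTF-8 binary spends three bytes on the infinity sign and one byte on each ASCII letter. The list itself stays the same in both cases.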

On Tue, Jul 31, 2012 at 11:04 PM, Richard Carlsson
<carlsson.richard@REDACTED> wrote:
> On 07/31/2012 01:48 PM, Ian wrote:
>>>
>>> << A "string" is a list of integers where the integers
>>>        represent Unicode codepoints. >>
>>
>>
>> I think this is technically correct, but it is very confusing because it
>> implies that the source file may be encoded as Unicode.
>>
>> As I understand it, source files are always treated as being in Latin-1.
>> This means that string literals are lists of Latin-1 values, and not
>> lists of Unicode code points. (The values from 128 to 255 have
>> different/no meanings, and values > 255 will not happen).
>
>
> No, you're confusing Unicode (a sequence of code points) with specific
> encodings such as UTF-8 and UTF-16. The first is backward compatible with
> Latin-1: the code points from 128 to 255 mean the same thing in both. In
> UTF-8 they do not, because each of them is encoded as two bytes. At
> runtime, Erlang's strings are just plain sequences of Unicode code points
> (you can think of it as UTF-32 if you like). Whether the source code is
> encoded in UTF-8 or Latin-1 or any other encoding is irrelevant as long as
> the compiler knows how to transform the input to the
> one-integer-per-code-point representation.
>
> For example, reversing a Unicode string is a bad idea anyway, because it
> could contain combining characters, and reversing the order of the
> code points in that case will create an illegal string. But an expression
> like lists:reverse("a∞b") will work on the list [97,8734,98] (once
> the compiler has been extended to accept encodings other than Latin-1), not
> on the list [97,226,136,158,98], so it will produce the intended "b∞a". This
> string might then be encoded as UTF-8 on its way to your terminal, but
> that's another story.
>
>     /Richard
>
>
> _______________________________________________
> erlang-questions mailing list
> erlang-questions@REDACTED
> http://erlang.org/mailman/listinfo/erlang-questions
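
The reversal example above can be sketched as follows. This is only an illustration under stated assumptions: the module name reverse_demo and the function demo/0 are hypothetical, and the lists are written out as integers so the result does not depend on how the compiler reads the source file.

```erlang
-module(reverse_demo).
-export([demo/0]).

%% Hypothetical demo: reversing code points versus reversing UTF-8 bytes.
demo() ->
    CodePoints = [97, 8734, 98],            % "a∞b" as code points
    Bytes = [97, 226, 136, 158, 98],        % the same text as UTF-8 bytes
    Good = lists:reverse(CodePoints),       % code points in reverse order
    Bad = lists:reverse(Bytes),             % bytes of the infinity sign scrambled
    %% Decoding the reversed bytes as UTF-8 fails after the leading "b".
    BadDecoded = unicode:characters_to_list(list_to_binary(Bad), utf8),
    %% "e" followed by U+0301 (combining acute accent), then "x".
    Combined = [101, 769, 120],
    %% After reversal the accent follows "x" instead of "e".
    Reversed = lists:reverse(Combined),
    {Good, Bad, BadDecoded, Reversed}.
```

Reversing the code-point list gives [98,8734,97], which is "b∞a". Reversing the byte list gives a sequence that no longer decodes as UTF-8, and the combining-accent case shows why even code-point reversal is not safe in general.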


