[erlang-questions] correct terminology for referring to strings

CGS cgsmcmlxxv@REDACTED
Wed Aug 1 00:52:15 CEST 2012


On Tue, Jul 31, 2012 at 4:04 PM, Richard Carlsson <
carlsson.richard@REDACTED> wrote:

> On 07/31/2012 01:48 PM, Ian wrote:
>
>> << A "string" is a list of integers where the integers
>>>        represent Unicode codepoints. >>
>>>
>>
>> I think this is technically correct, but it is very confusing because it
>> implies that the source file may be encoded as unicode.
>>
>> As I understand it, source files are always treated as being in Latin-1.
>> This means that string literals are lists of Latin-1 values, and not
>> lists of unicode codepoints. (The values from 128 to 255 have
>> different/no meanings, and values > 255 will not happen).
>>
>
> No, you're confusing Unicode (a sequence of code points) with specific
> encodings such as UTF-8 and UTF-16. The first is downwards compatible with
> Latin-1: the values from 128 to 255 are the same. In UTF-8 they're not. At
> runtime, Erlang's strings are just plain sequences of Unicode code points
> (you can think of it as UTF-32 if you like). Whether the source code is
> encoded in UTF-8 or Latin-1 or any other encoding is irrelevant as long as
> the compiler knows how to transform the input to the single-codepoint
> representation.
>
> For example, reversing a Unicode string is a bad idea anyway because it
> could contain combining characters, and reversing the order of the
> codepoints in that case will create an illegal string. But an expression
> like lists:reverse("a∞b") will be working on the list [97, 8734, 98] (once
> the compiler has been extended to accept other encodings than Latin-1...


Actually, try this:

1. Set your environment to UTF-8 (in my case, a Linux terminal running
bash: export LANG="en_US.utf8"; use locale to find your current
language setting, which would be "en_US.latin1" for Latin-1).
2. in a module:

test_reverse(String) -> lists:reverse(String).

3. Pass your own example, "a∞b", as the argument.
4. Check the output.

Pretty interesting to see how Erlang "knows" about UTF-8 encoding, isn't
it? (You can also try lists:reverse("a∞b") directly in the shell, and it
will be reversed as expected, as a 3-element list.) Actually, Erlang knows
nothing about UTF-8 here: it relies on the environment to extract the
integers for the list, which in this case mimics knowledge of UTF-8.
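
A minimal sketch of the two cases, written with integer lists so that it
does not depend on the encoding of the source file or the terminal. The
module name reverse_demo and its function names are made up for this
illustration; 8734 is the code point of ∞ and 226,136,158 are its UTF-8
bytes.

```erlang
-module(reverse_demo).
-export([codepoints/0, utf8_bytes/0, reverse_both/0, is_valid_utf8/1]).

%% "a∞b" as a list of Unicode code points (3 elements).
codepoints() -> [97, 8734, 98].

%% The same text as a list of its UTF-8 bytes (5 elements).
utf8_bytes() -> [97, 226, 136, 158, 98].

%% Reverse both representations with the same function.
reverse_both() ->
    {lists:reverse(codepoints()), lists:reverse(utf8_bytes())}.

%% True if the byte list decodes as UTF-8, false otherwise.
is_valid_utf8(Bytes) ->
    case unicode:characters_to_list(list_to_binary(Bytes), utf8) of
        L when is_list(L) -> true;
        _ -> false
    end.
```

Reversing the code point list gives [98,8734,97], that is "b∞a", while
reversing the byte list gives [98,158,136,226,97], which is no longer a
valid UTF-8 sequence.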

> ...), not the list [97,226,136,158,98], so it will produce the intended
> "b∞a". This string might then become encoded as UTF-8 on its way to your
> terminal, but that's another story.


I would extend the last part ("on its way to your terminal") to read "to
and from your terminal": it seems that the conversion applies in both
directions, even if that can break the code.
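
As a rough sketch of both directions, assuming the bytes on the terminal
side are UTF-8: the module boundary_demo and its two functions are
hypothetical names, and they only wrap the standard unicode module to make
the decode and encode steps explicit.

```erlang
-module(boundary_demo).
-export([from_terminal/1, to_terminal/1]).

%% Inbound: decode UTF-8 bytes into a list of Unicode code points.
from_terminal(Bytes) when is_binary(Bytes) ->
    unicode:characters_to_list(Bytes, utf8).

%% Outbound: encode a list of code points as UTF-8 bytes.
to_terminal(CodePoints) when is_list(CodePoints) ->
    unicode:characters_to_binary(CodePoints, unicode, utf8).
```

If the inbound step is skipped and the bytes are taken one integer per
byte (for example with binary_to_list/1), the string arrives as the
5-element list rather than the 3-element one.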

I agree that for string literals, what you said is always true.

CGS