[erlang-questions] Strings as Lists
Richard Carlsson
richardc@REDACTED
Fri Feb 15 11:27:47 CET 2008
Dmitrii 'Mamut' Dimandt wrote:
> Richard Carlsson wrote:
>> Strings as lists is simple and flexible (i.e., if you already have lists,
>> you don't need to add another data type). Functions that work on lists,
>> such as append, reverse, etc., can be used directly on strings; you
>> don't need to program in different styles if you're traversing a list
>> or a string; etc.
> This is only true for ASCII text ;) Non-ASCII gets screwed up badly:
>
> lists:reverse("text") %% gives you "txet"
> lists:reverse("текст") %% Russian for text becomes
> [130,209,129,209,186,208,181,208,130,209] which is clearly not what I
> wanted :)
That's because the second line is currently not a legal Erlang program.
The tokenizer will assume that your source code is encoded using Latin-1,
and since you are giving the compiler garbage input, it gives you garbage
output. Basically, the compiler thinks that you wrote "Ñ\202ÐµÐºÑ\201Ñ\202",
not "текст", and the reverse of that is indeed "\202Ñ\201ÑºÐµÐ\202Ñ",
which is what you got (regardless of what you _wanted_).
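
To make the byte-level view concrete, here is a small sketch of what the
tokenizer effectively hands to lists:reverse/1. It assumes the source file
was saved as UTF-8; the module name latin1_demo and both function names
are made up for illustration.

```erlang
-module(latin1_demo).
-export([bytes/0, reversed/0]).

%% The ten UTF-8 bytes that encode the five-letter Russian word from
%% the example. A Latin-1 tokenizer treats each byte as one character,
%% so the string literal becomes a ten-element list.
bytes() ->
    [209,130, 208,181, 208,186, 209,129, 209,130].

%% Reversing the list also swaps the two bytes inside every
%% character, so the result is no longer valid UTF-8.
reversed() ->
    lists:reverse(bytes()).
```

Calling latin1_demo:reversed() returns
[130,209,129,209,186,208,181,208,130,209], the list quoted above.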
What Erlang needs to support non-Latin-1 languages is filters for decoding
input and encoding output. (Right now, you have to write the conversion
functions yourself if you want to work with Russian text.) The internal
string representation - lists of integers using one integer per code
point - needs no modification, whether it's ASCII, Latin-1, or Unicode;
what I said before applies equally well to all of them. Multibyte encodings
are not practical for general string manipulations regardless of how they
are stored in memory.
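
As a rough sketch of what such hand-written filters could look like for
UTF-8: decode/1 turns a list of bytes into a list of code points, and
encode/1 does the opposite. The module name utf8_sketch and both function
names are invented for this example. The decoder assumes well-formed input
and does not reject overlong or otherwise invalid sequences, so treat it
as an illustration rather than a production converter.

```erlang
-module(utf8_sketch).
-export([decode/1, encode/1]).

%% Decode a list of UTF-8 bytes into a list of code points,
%% one integer per code point.
decode(Bytes) when is_list(Bytes) ->
    decode_bin(list_to_binary(Bytes)).

decode_bin(<<>>) ->
    [];
%% 0xxxxxxx: one byte, 7 payload bits
decode_bin(<<0:1, C:7, Rest/binary>>) ->
    [C | decode_bin(Rest)];
%% 110xxxxx 10xxxxxx: two bytes, 11 payload bits
decode_bin(<<6:3, A:5, 2:2, B:6, Rest/binary>>) ->
    [(A bsl 6) bor B | decode_bin(Rest)];
%% 1110xxxx 10xxxxxx 10xxxxxx: three bytes, 16 payload bits
decode_bin(<<14:4, A:4, 2:2, B:6, 2:2, C:6, Rest/binary>>) ->
    [(A bsl 12) bor (B bsl 6) bor C | decode_bin(Rest)];
%% 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx: four bytes, 21 payload bits
decode_bin(<<30:5, A:3, 2:2, B:6, 2:2, C:6, 2:2, D:6, Rest/binary>>) ->
    [(A bsl 18) bor (B bsl 12) bor (C bsl 6) bor D | decode_bin(Rest)].

%% Encode a list of code points as a list of UTF-8 bytes.
encode([]) ->
    [];
encode([C | T]) when C >= 0, C < 16#80 ->
    [C | encode(T)];
encode([C | T]) when C < 16#800 ->
    [16#C0 bor (C bsr 6),
     16#80 bor (C band 16#3F) | encode(T)];
encode([C | T]) when C < 16#10000 ->
    [16#E0 bor (C bsr 12),
     16#80 bor ((C bsr 6) band 16#3F),
     16#80 bor (C band 16#3F) | encode(T)];
encode([C | T]) when C < 16#110000 ->
    [16#F0 bor (C bsr 18),
     16#80 bor ((C bsr 12) band 16#3F),
     16#80 bor ((C bsr 6) band 16#3F),
     16#80 bor (C band 16#3F) | encode(T)].
```

With filters like these, the ordinary list functions work on the decoded
form. For the bytes [209,130,208,181,208,186,209,129,209,130], decode/1
gives [1090,1077,1082,1089,1090]; lists:reverse/1 on that gives
[1090,1089,1082,1077,1090]; and encode/1 turns the result back into
[209,130,209,129,208,186,208,181,209,130], which is the reversed word in
valid UTF-8.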
/Richard