[erlang-questions] Strings as Lists

Fri Feb 15 12:06:12 CET 2008

On 15 Feb 2008, at 11:27, Richard Carlsson wrote:

> Dmitrii 'Mamut' Dimandt wrote:
>> Richard Carlsson wrote:
>>> Strings as lists is simple and flexible (i.e., if you already have lists,
>>> you don't need to add another data type). Functions that work on lists,
>>> such as append, reverse, etc., can be used directly on strings; you
>>> don't need to program in different styles if you're traversing a list
>>> or a string; etc.
>> This is only true for ASCII text ;) Non-ASCII gets screwed up badly:
>>
>> lists:reverse("text") %% gives you "txet"
>> lists:reverse("текст") %% Russian for text becomes
>> [130,209,129,209,186,208,181,208,130,209] which is clearly not what I
>> wanted :)
>
> That's because the second line is currently not a legal Erlang program.
> The tokenizer will assume that your source code is encoded using Latin-1,
> and since you are giving the compiler garbage input, it gives you garbage
> output. Basically, the compiler thinks that you wrote "Ñ\202ÐµÐºÑ\201Ñ\202",
> not "текст", and the reverse of that is indeed "\202Ñ\201ÑºÐµÐ\202Ñ",
> which is what you got (regardless of what you _wanted_).
>
> What Erlang needs to support non Latin-1 languages, is filters for decoding
> input and encoding output.

Yep. How extensive would the changes be to make the tokenizer
configurable? Something like Python, where you can declare the
encoding of your source code if you want something other than the
default (which, in Python, is ASCII)?
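
Until the tokenizer can do this, the decoding and encoding filters
Richard describes can be applied by hand at runtime. Here is a minimal
sketch, assuming the string arrives as a list of UTF-8 bytes and that
the unicode module shipped with current Erlang/OTP releases is
available. The module name utf8_reverse and its function names are
made up for illustration.

```erlang
-module(utf8_reverse).
-export([reverse_utf8/1, demo/0]).

%% Hypothetical helper: decode a list of UTF-8 bytes into a list of
%% code points, reverse that list, and encode it back to UTF-8 bytes.
%% Assumes the input is valid UTF-8; invalid input makes
%% unicode:characters_to_list/2 return an error tuple and this
%% function crash.
reverse_utf8(Utf8Bytes) when is_list(Utf8Bytes) ->
    Bin = list_to_binary(Utf8Bytes),
    %% Input filter: bytes -> code points
    CodePoints = unicode:characters_to_list(Bin, utf8),
    Reversed = lists:reverse(CodePoints),
    %% Output filter: code points -> bytes
    binary_to_list(unicode:characters_to_binary(Reversed, unicode, utf8)).

demo() ->
    %% The UTF-8 bytes of "текст", i.e. what the Latin-1 tokenizer
    %% actually saw in the example above.
    Bytes = [209,130,208,181,208,186,209,129,209,130],
    reverse_utf8(Bytes).
```

utf8_reverse:demo() returns [209,130,209,129,208,186,208,181,209,130],
the UTF-8 bytes of "тскет", because the reversal happens on the five
code points rather than on the ten raw bytes.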