[erlang-questions] unicode in string literals

Tue Jul 31 10:40:59 CEST 2012

On 2012-07-31, at 10:10 , Loïc Hoguin wrote:

> On 07/31/2012 09:53 AM, Masklinn wrote:
>> On 2012-07-31, at 09:39 , Michel Rijnders wrote:
>> 
>>> On Tue, Jul 31, 2012 at 9:05 AM, Joe Armstrong <erlang@REDACTED> wrote:
>>>> Is "encoding(...)"  a good idea?
>>>> 
>>>> There are four reasonable alternatives
>>>> 
>>>>    a) - all files are Latin1
>>>>    b) - all files are UTF8
>>>>    c) - all files are Latin1 or UTF8 and you guess
>>>>    d) - all files are Latin1 or UTF8 or anything else and you tell
>>> 
>>> I understand it is quite drastic but I would prefer a separate data
>>> type for (unicode) strings.
>> 
>> For historical reasons? Because on technical grounds, the existing
>> scheme would work nicely by declaring that the integers are code points.
>> And because Unicode is identical to latin-1 in the first 256 codepoints,
>> latin1 strings would be identical.
>> 
>> The `string` module would probably need to be fixed to be unicode-aware
>> (or deprecated and removed altogether in favor of the unicode one), but
>> I'm not sure there are good reasons to change the datatype.[-1]
>> 
>> On the other hand, a dedicated datatype could allow things like Python's
>> new Flexible String Representation[0] where an explicit "list of code
>> points" would not allow such flexibility.
>> 
>> The only thing I'd rather avoid is moving from "list of latin-1 bytes" to
>> "list of utf-8 bytes", that's just crap.
> 
> If strings are kept as lists:
> 
> - there is no way to identify a variable as being a list or latin1 string or utf8 string
> - you would have to keep track of what encoding your list is in

Neither applies: strings would be lists of code points. By that point
the original encoding has long been forgotten and is utterly
irrelevant.
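
To illustrate the point, here is a minimal sketch. The module and function names are made up for this example; only unicode:characters_to_list/2 is the real OTP call. It decodes the same character from two different byte encodings and shows that the resulting lists of code points are identical.

```erlang
-module(codepoints_demo).
-export([from_latin1/1, from_utf8/1, same_string/0]).

%% Hypothetical helper: decode Latin-1 bytes into a list of code points.
from_latin1(Bin) when is_binary(Bin) ->
    unicode:characters_to_list(Bin, latin1).

%% Hypothetical helper: decode UTF-8 bytes into a list of code points.
from_utf8(Bin) when is_binary(Bin) ->
    unicode:characters_to_list(Bin, utf8).

%% "é" is the single byte 233 in Latin-1 and the two bytes 195,169
%% in UTF-8. Once decoded, both are the same one-element list.
same_string() ->
    from_latin1(<<233>>) =:= from_utf8(<<195,169>>).
```

After decoding, nothing in the list records which encoding the bytes came from, so there is nothing left to keep track of.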

> - you would have to do some type conversion when you use them with functions like gen_tcp:send, which don't accept lists of integers > 255

You would have to encode the list into whatever encoding the other
side expects, just as you would have to decode input strings.
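
A sketch of what encoding at the boundary could look like, assuming the peer's expected encoding is known to the caller. The module name, function names and the Encoding argument are assumptions for illustration; unicode:characters_to_binary/3 and gen_tcp:send/2 are the real OTP calls.

```erlang
-module(send_demo).
-export([encode_for_peer/2, send_text/3]).

%% Hypothetical helper: turn a list of code points into bytes in the
%% encoding the other side expects (for example utf8 or latin1).
encode_for_peer(Chars, Encoding) ->
    unicode:characters_to_binary(Chars, unicode, Encoding).

%% Hypothetical helper: encode first, then hand the bytes to gen_tcp.
%% Encoding failures are returned to the caller instead of being sent.
send_text(Socket, Chars, Encoding) ->
    case encode_for_peer(Chars, Encoding) of
        Bin when is_binary(Bin) -> gen_tcp:send(Socket, Bin);
        {error, _, _} = Err -> Err;
        {incomplete, _, _} = Err -> Err
    end.
```

A code point above 255 encodes fine as UTF-8 but yields an error tuple when the target is latin1, which is the case where the sender has to decide what to do.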

> If strings are a new type:
> 
> - you don't care about the encoding most of the time, Erlang is the one who should; if you want to know the encoding you could use a new BIF encoding(String)
> - you can convert encoding without caring about what the previous encoding was, for example str:convert(Str, utf8); if it was utf8 it doesn't change a thing, if it wasn't it's converted
> - you can export it as a list or binary in the encoding you want, for example str:to_binary(Str, utf8)

See above: none of these make sense unless you assume that the
string-list is a list of bytes in a specific encoding, which does not
make sense in the first place.

> - you don't need to do type conversion when using it, Erlang can use the string type directly

Why couldn't Erlang use the string-list type directly? That is what it
currently does; there is no conversion.