[erlang-questions] unicode in string literals

Tue Jul 31 09:53:52 CEST 2012

On 2012-07-31, at 09:39 , Michel Rijnders wrote:

> On Tue, Jul 31, 2012 at 9:05 AM, Joe Armstrong <erlang@REDACTED> wrote:
>> Is "encoding(...)"  a good idea?
>> 
>> There are four reasonable alternatives
>> 
>>    a) - all files are Latin1
>>    b) - all files are UTF8
>>    c) - all files are Latin1 or UTF8 and you guess
>>    d) - all files are Latin1 or UTF8 or anything else and you tell
> 
> I understand it is quite drastic but I would prefer a separate data
> type for (unicode) strings.

For historical reasons? On technical grounds, the existing scheme would
work nicely by declaring that the integers are code points. And because
Unicode is identical to latin-1 in the first 256 code points, existing
latin-1 strings would stay exactly as they are.
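
A minimal sketch of that last point, written in Python purely for
illustration: the lists of integers stand in for Erlang strings, and the
variable names are made up for the example.

```python
# Every latin-1 byte value is also the Unicode code point of the same character.
all_bytes = bytes(range(256))
decoded_code_points = [ord(ch) for ch in all_bytes.decode("latin-1")]
latin1_matches_unicode = decoded_code_points == list(all_bytes)

# So an existing "list of latin-1 bytes" already is a "list of code points".
word = "caf\u00e9"                             # 'café'
as_latin1_bytes = list(word.encode("latin-1"))
as_code_points = [ord(ch) for ch in word]

print(latin1_matches_unicode)
print(as_latin1_bytes, as_code_points)
```

Nothing needs converting for strings that only ever held latin-1 data; the
reinterpretation only starts to matter once a list contains integers above 255.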

The `string` module would probably need to be fixed to be unicode-aware
(or deprecated and removed altogether in favor of the unicode one), but
I'm not sure there are good reasons to change the datatype.[-1]

On the other hand, a dedicated datatype could allow things like Python's
new Flexible String Representation[0], whereas an explicit "list of code
points" would not allow such flexibility.

The only thing I'd rather avoid is moving from "list of latin-1 bytes" to
"list of utf-8 bytes", that's just crap.

[-1] Actually there's one now that I re-think about it thanks to your
     previous mail about lists:reverse: naive list methods will completely
     break combining characters or decomposed (NFD and NFKD) strings, even
     if strings are encoded as lists of codepoints.
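
The breakage can be sketched as follows, again in Python for illustration
only, with a plain list of code points playing the role of an Erlang string
and a list reversal playing the role of lists:reverse. The variable names
are assumptions made for the example.

```python
import unicodedata

composed = "caf\u00e9"                               # 'é' as one code point (NFC)
decomposed = unicodedata.normalize("NFD", composed)  # 'e' followed by U+0301

code_points = [ord(ch) for ch in decomposed]         # list-of-code-points string
naive_reverse = "".join(chr(cp) for cp in reversed(code_points))

# The combining acute accent now leads the string, detached from its 'e'.
first_is_combining = unicodedata.combining(naive_reverse[0]) != 0

# Reversing the composed form gives a different, correct result.
expected = composed[::-1]
renormalized = unicodedata.normalize("NFC", naive_reverse)

print([hex(cp) for cp in code_points])
print(first_is_combining, renormalized == expected)
```

A reversal that keeps each base character together with its combining marks
would have to work on grapheme clusters rather than on single list elements.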

[0] http://www.python.org/dev/peps/pep-0393/ where strings are opaque and
    their internal representation is picked among latin-1, UCS2 and UCS4
    to best fit their content. One could even add rope-like structures so
    that strings internally mix the representations if there is cause to.
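
A rough way to observe this, assuming CPython 3.3 or later (where PEP 393
is implemented): sys.getsizeof reports implementation-specific sizes, so the
sketch only looks at how much one extra character costs. The helper name
bytes_per_char is made up for the example.

```python
import sys

def bytes_per_char(ch, n=1000):
    """Extra bytes needed to store one more copy of `ch` in a repeated string."""
    return sys.getsizeof(ch * (n + 1)) - sys.getsizeof(ch * n)

samples = {
    "ASCII U+0061": "a",
    "latin-1 U+00E9": "\u00e9",
    "BMP U+20AC": "\u20ac",
    "astral U+1F600": "\U0001F600",
}

# Width in bytes that the interpreter picked for each string's storage.
widths = {label: bytes_per_char(ch) for label, ch in samples.items()}

for label, width in widths.items():
    print(label, width)
```

The width depends only on the largest code point in the string, which is
the flexibility an opaque string type buys over a fixed list of integers.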