[erlang-questions] unicode in string literals

Loïc Hoguin essen@REDACTED
Tue Jul 31 10:10:14 CEST 2012


On 07/31/2012 09:53 AM, Masklinn wrote:
> On 2012-07-31, at 09:39 , Michel Rijnders wrote:
>
>> On Tue, Jul 31, 2012 at 9:05 AM, Joe Armstrong <erlang@REDACTED> wrote:
>>> Is "encoding(...)"  a good idea?
>>>
>>> There are four reasonable alternatives
>>>
>>>     a) - all files are Latin1
>>>     b) - all files are UTF8
>>>     c) - all files are Latin1 or UTF8 and you guess
>>>     d) - all files are Latin1 or UTF8 or anything else and you tell
>>
>> I understand it is quite drastic but I would prefer a separate data
>> type for (unicode) strings.
>
> For historical reasons? Because on technical grounds, the existing
> scheme would work nicely by declaring that the integers are code points.
> And because Unicode is identical to latin-1 in the first 256 codepoints,
> latin1 strings would be identical.
>
> The `string` module would probably need to be fixed to be unicode-aware
> (or deprecated and removed altogether in favor of the unicode one), but
> I'm not sure there are good reasons to change the datatype.[-1]
>
> On the other hand, a dedicated datatype could allow things like Python's
> new Flexible String Representation[0] where an explicit "list of code
> points" would not allow such flexibility.
>
> The only thing I'd rather avoid is moving from "list of latin-1 bytes" to
> "list of utf-8 bytes", that's just crap.

If strings are kept as lists:

- there is no way to tell whether a variable holds a plain list, a
latin1 string or a utf8 string
- you would have to keep track of what encoding your list is in
- you would have to do some type conversion when you use them with 
functions like gen_tcp:send, which don't accept lists of integers > 255
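
To make the points above concrete, here is a minimal sketch. The module
name list_strings and the sample values are mine, purely for
illustration, and list_to_binary/1 stands in for any function that
expects iodata, which is the same constraint gen_tcp:send has.

```erlang
-module(list_strings).
-export([demo/0]).

%% A string is just a list of integers; nothing marks it as text,
%% and nothing records which encoding the integers are in.
demo() ->
    Latin1     = [104, 233],       % "h" + e-acute as Latin-1 bytes
    Utf8Bytes  = [104, 195, 169],  % the same text as a list of UTF-8 bytes
    CodePoints = [104, 8364],      % "h" + U+20AC; 8364 does not fit in a byte

    %% All three are plain lists; is_list/1 cannot tell them apart.
    true = is_list(Latin1) andalso is_list(Utf8Bytes)
               andalso is_list(CodePoints),

    %% iodata only allows integers 0..255, so this raises badarg.
    badarg = try list_to_binary(CodePoints) catch error:R -> R end,

    %% The explicit conversion step: encode the code points as UTF-8.
    Bin = unicode:characters_to_binary(CodePoints, unicode, utf8),
    <<104, 226, 130, 172>> = Bin,
    ok.
```

Calling list_strings:demo() should return ok. The caller has to remember
that CodePoints needs the unicode conversion while Latin1 does not; the
list itself carries no such information.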

If strings are a new type:

- most of the time you don't need to care about the encoding, Erlang
keeps track of it for you; if you want to know the encoding you could
use a new BIF such as encoding(String)
- you don't need to do type conversion when using it, Erlang can use the 
string type directly
- you can convert encoding without caring about what the previous 
encoding was, for example str:convert(Str, utf8); if it was utf8 it 
doesn't change a thing, if it wasn't it's converted
- you can export it as a list or binary in the encoding you want, for 
example str:to_binary(Str, utf8)
- you still need to specify the encoding when converting a list or 
binary to string, but maybe we could have niceties like << 
Str/string-utf8 >>?
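
A rough sketch of how the list above might look, emulated in today's
Erlang with a tagged tuple. None of this exists: the str module, new/2
and the {str, Encoding, Binary} representation are assumptions made up
for illustration, and encoding/1 is a plain function here rather than a
BIF. Only the calls into the unicode module are real.

```erlang
-module(str).
-export([new/2, encoding/1, convert/2, to_binary/2]).

%% Hypothetical representation: {str, Encoding, Binary},
%% where Encoding is latin1 | utf8.

%% Build a string from a binary; the encoding must be given once, here.
new(Bin, Enc) when is_binary(Bin) andalso
                   (Enc =:= latin1 orelse Enc =:= utf8) ->
    %% Round-trip through unicode to reject invalid input early.
    case unicode:characters_to_binary(Bin, Enc, Enc) of
        Valid when is_binary(Valid) -> {str, Enc, Valid};
        _ -> erlang:error(badarg)
    end.

%% Ask the string which encoding it is in.
encoding({str, Enc, _}) -> Enc.

%% Convert without knowing the previous encoding; same encoding is a no-op.
convert({str, Enc, _} = S, Enc) -> S;
convert({str, From, Bin}, To) ->
    case unicode:characters_to_binary(Bin, From, To) of
        Out when is_binary(Out) -> {str, To, Out};
        _ -> erlang:error(badarg)   % e.g. code point > 255 going to latin1
    end.

%% Export as a binary in whatever encoding the caller wants.
to_binary(S, Enc) ->
    {str, Enc, Bin} = convert(S, Enc),
    Bin.
```

For example, S = str:new(<<104,233>>, latin1) followed by
str:to_binary(S, utf8) gives <<104,195,169>>, and converting a utf8
string to utf8 again returns it unchanged. A real string type would hide
the tuple, of course; the sketch only shows that the encoding can travel
with the data instead of living in the programmer's head.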

-- 
Loïc Hoguin
Erlang Cowboy
Nine Nines
http://ninenines.eu




