[erlang-questions] unicode in string literals
Tue Jul 31 10:10:14 CEST 2012
On 07/31/2012 09:53 AM, Masklinn wrote:
> On 2012-07-31, at 09:39 , Michel Rijnders wrote:
>> On Tue, Jul 31, 2012 at 9:05 AM, Joe Armstrong wrote:
>>> Is "encoding(...)" a good idea?
>>> There are four reasonable alternatives
>>> a) - all files are Latin1
>>> b) - all files are UTF8
>>> c) - all files are Latin1 or UTF8 and you guess
>>> d) - all files are Latin1 or UTF8 or anything else and you tell
>> I understand it is quite drastic but I would prefer a separate data
>> type for (unicode) strings.
> For historical reasons? Because on technical grounds, the existing
> scheme would work nicely by declaring that the integers are code points.
> And because Unicode is identical to latin-1 in the first 256 codepoints,
> latin1 strings would be identical.
> The `string` module would probably need to be fixed to be unicode-aware
> (or deprecated and removed altogether in favor of the unicode one), but
> I'm not sure there are good reasons to change the datatype.[-1]
> On the other hand, a dedicated datatype could allow things like Python's
> new Flexible String Representation where an explicit "list of code
> points" would not allow such flexibility.
> The only thing I'd rather avoid is moving from "list of latin-1 bytes" to
> "list of utf-8 bytes", that's just crap.
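The point about the first 256 code points can be checked with the existing unicode module. This is only a small sketch; the module name latin1_codepoints and the sample word are made up for illustration. It also shows why a "list of utf-8 bytes" is a different thing from a list of code points.

```erlang
-module(latin1_codepoints).
-export([check/0]).

check() ->
    %% "café" encoded as Latin-1: the last byte is 233 (U+00E9).
    Latin1Bytes = <<"caf", 233>>,
    CodePoints = unicode:characters_to_list(Latin1Bytes, latin1),
    %% Every Latin-1 byte value equals its Unicode code point, so the
    %% decoded list is the same as the raw byte list.
    Same = CodePoints =:= binary_to_list(Latin1Bytes),
    %% The UTF-8 byte list is different: 233 becomes two bytes.
    Utf8Bytes = binary_to_list(unicode:characters_to_binary(CodePoints)),
    {Same, Utf8Bytes}.
```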
If strings are kept as lists:
- there is no way to tell whether a variable holds a plain list, a
latin1 string or a utf8 string
- you would have to keep track of what encoding your list is in
- you would have to do some type conversion when you use them with
functions like gen_tcp:send, which don't accept lists of integers > 255
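To make the conversion step concrete, here is a minimal sketch that avoids opening a socket by using iolist_to_binary, which applies the same byte-range rule to lists. The module name list_string_demo and the helper to_wire/1 are hypothetical names chosen for this example; unicode:characters_to_binary/3 is the existing library call.

```erlang
-module(list_string_demo).
-export([to_wire/1, demo/0]).

%% Encode a list of Unicode code points as UTF-8 bytes, which is the
%% form a function like gen_tcp:send/2 can accept.
to_wire(CodePoints) ->
    unicode:characters_to_binary(CodePoints, unicode, utf8).

demo() ->
    Euro = [8364],   % U+20AC EURO SIGN, a code point greater than 255
    %% A plain iolist conversion rejects integers above 255 with badarg.
    Rejected = try iolist_to_binary(Euro) of
                   _ -> false
               catch
                   error:badarg -> true
               end,
    {Rejected, to_wire(Euro), to_wire("abc")}.
```

The caller has to remember that the list holds code points and not bytes; nothing in the list itself says so.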
If strings are a new type:
- you don't need to care about the encoding most of the time, Erlang
should; if you do want to know the encoding, a new BIF could report it
- you don't need to do type conversion when using it, since Erlang can
use the string type directly
- you can convert to another encoding without caring what the previous
one was, for example str:convert(Str, utf8); if the string was already
utf8 nothing changes, otherwise it is converted
- you can export it as a list or binary in the encoding you want, for
example str:to_binary(Str, utf8)
- you still need to specify the encoding when converting a list or
binary to string, but maybe we could have niceties like <<
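As a rough illustration of the interface described in the list above, here is a sketch built on a record that carries its encoding. Nothing like this exists in OTP: the module name str_sketch, the record layout, and new/2, encoding/1 and to_list/1 are all assumptions; only convert/2 and to_binary/2 follow the hypothetical str:convert and str:to_binary calls named above. It handles latin1 and utf8 only and does not validate the bytes on entry.

```erlang
-module(str_sketch).
-export([new/2, encoding/1, convert/2, to_binary/2, to_list/1]).

%% Hypothetical representation: the encoding travels with the data.
-record(str, {enc :: latin1 | utf8, data :: binary()}).

%% The encoding is stated once, when raw bytes enter the type.
new(Bytes, Enc) when is_binary(Bytes),
                     (Enc =:= latin1 orelse Enc =:= utf8) ->
    #str{enc = Enc, data = Bytes}.

%% Stand-in for the "new BIF" that reports the encoding.
encoding(#str{enc = Enc}) -> Enc.

%% Converting to the encoding already in use changes nothing.
convert(#str{enc = Enc} = S, Enc) -> S;
convert(#str{enc = From, data = Data}, To) ->
    case unicode:characters_to_binary(Data, From, To) of
        Bin when is_binary(Bin) -> #str{enc = To, data = Bin};
        _ErrorOrIncomplete -> error(badarg)
    end.

%% Export as a binary in whatever encoding the caller asks for.
to_binary(S, Enc) ->
    (convert(S, Enc))#str.data.

%% Export as a list of code points.
to_list(#str{enc = Enc, data = Data}) ->
    unicode:characters_to_list(Data, Enc).
```

With this shape, str_sketch:to_binary(S, utf8) gives the same result whether S was created from latin1 or utf8 bytes, which is the property the list above asks for.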