[erlang-questions] unicode in string literals

Tue Jul 31 11:36:03 CEST 2012

There are pros and cons to switching from Latin-1 to UTF-8 (or to any
other encoding that breaks the assumption of one byte per character). On
the one hand, lists:reverse/1 really messes up multi-byte characters in a
list (to follow the first example, the output of "a∞b" in Latin-1, the
current default, is totally different from the output of
lists:reverse("b∞a")). On the other hand, having, for example, Polish
characters like "Ą Ę Ć", French "Ç Î", German "Ö ß", Turkish "Ş" and so on
in the code (things get more complicated once we add languages based on
other alphabets or symbols) would require your editor to support those
languages, or else you will see really strange characters there. I do not
deny that some specific projects would benefit from such a character
encoding, but think of maintaining such code in an international
environment.
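
To make the reversal problem concrete, here is a minimal sketch. The module
name reverse_demo is made up for illustration; the calls into the unicode
module are standard OTP. It assumes the same text is held once as a list of
code points and once as a list of UTF-8 bytes:

```erlang
-module(reverse_demo).
-export([run/0]).

run() ->
    %% "a∞b" as Unicode code points; U+221E is the infinity sign.
    CodePoints = [$a, 16#221E, $b],
    %% The same text as a list of UTF-8 bytes: ∞ becomes E2 88 9E.
    Bytes = binary_to_list(unicode:characters_to_binary(CodePoints)),
    [$a, 16#E2, 16#88, 16#9E, $b] = Bytes,
    %% Reversing the code-point list keeps ∞ intact.
    [$b, 16#221E, $a] = lists:reverse(CodePoints),
    %% Reversing the byte list also reverses the three bytes of ∞,
    %% and the result is no longer valid UTF-8.
    Reversed = lists:reverse(Bytes),
    [$b, 16#9E, 16#88, 16#E2, $a] = Reversed,
    {error, _, _} =
        unicode:characters_to_list(list_to_binary(Reversed), utf8),
    ok.
```

Calling reverse_demo:run() returns ok only if every match above holds, so
it doubles as a quick check in the shell.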

"-encoding()" can make quite a mess in a file. Think of an open source
project in which devs from different countries append their own code. You
will see a lot of "-encoding()" directives in a single file.

I might be wrong, but wouldn't switching to UTF-8 by default force the
compiler to use at least 2 bytes per character? If so, what about, for
example, the databases written in Erlang for projects that use strictly
Latin-1?
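
For what it is worth, UTF-8 is a variable-width encoding, so the answer
depends on the character: ASCII stays at one byte, while the non-ASCII
half of Latin-1 takes two. This concerns encoded bytes (source files,
binaries); a string held as a list of integers is not affected by it. A
small sketch with the standard unicode module (the module name
utf8_size_demo is hypothetical) shows the sizes:

```erlang
-module(utf8_size_demo).
-export([run/0]).

run() ->
    %% ASCII text: one byte per character in Latin-1 and in UTF-8.
    5 = byte_size(unicode:characters_to_binary("hello", latin1, utf8)),
    %% "Öß" in Latin-1 is the two bytes D6 DF.
    Latin1 = <<16#D6, 16#DF>>,
    %% In UTF-8 each of these characters needs two bytes.
    Utf8 = unicode:characters_to_binary(Latin1, latin1, utf8),
    <<16#C3, 16#96, 16#C3, 16#9F>> = Utf8,
    %% Converting back restores the original Latin-1 bytes.
    Latin1 = unicode:characters_to_binary(Utf8, utf8, latin1),
    ok.
```

So strictly Latin-1 data would grow only for characters above 127, and it
can be converted back without loss.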

My point is that string manipulation should be kept apart from the code
itself, with two separate modules for manipulating normal lists and
IO-lists (e.g., by extending the unicode module). But that is just my own
preference.

CGS

On Tue, Jul 31, 2012 at 10:10 AM, Loïc Hoguin <essen@REDACTED> wrote:

> On 07/31/2012 09:53 AM, Masklinn wrote:
>
>> On 2012-07-31, at 09:39 , Michel Rijnders wrote:
>>
>>  On Tue, Jul 31, 2012 at 9:05 AM, Joe Armstrong <erlang@REDACTED> wrote:
>>>
>>>> Is "encoding(...)"  a good idea?
>>>>
>>>> There are four reasonable alternatives
>>>>
>>>>     a) - all files are Latin1
>>>>     b) - all files are UTF8
>>>>     c) - all files are Latin1 or UTF8 and you guess
>>>>     d) - all files are Latin1 or UTF8 or anything else and you tell
>>>>
>>>
>>> I understand it is quite drastic but I would prefer a separate data
>>> type for (unicode) strings.
>>>
>>
>> For historical reasons? Because on technical grounds, the existing
>> scheme would work nicely by declaring that the integers are code points.
>> And because Unicode is identical to latin-1 in the first 256 codepoints,
>> latin1 strings would be identical.
>>
>> The `string` module would probably need to be fixed to be unicode-aware
>> (or deprecated and removed altogether in favor of the unicode one), but
>> I'm not sure there are good reasons to change the datatype.[-1]
>>
>> On the other hand, a dedicated datatype could allow things like Python's
>> new Flexible String Representation[0] where an explicit "list of code
>> points" would not allow such flexibility.
>>
>> The only thing I'd rather avoid is moving from "list of latin-1 bytes" to
>> "list of utf-8 bytes", that's just crap.
>>
>
> If strings are kept as lists:
>
> - there is no way to identify a variable as being a list or latin1 string
> or utf8 string
> - you would have to keep track of what encoding your list is in
> - you would have to do some type conversion when you use them with
> functions like gen_tcp:send, which don't accept lists of integers > 255
>
> If strings are a new type:
>
> - you don't care about the encoding most of the time, Erlang is the one
> who should; if you want to know the encoding you could use a new BIF
> encoding(String)
> - you don't need to do type conversion when using it, Erlang can use the
> string type directly
> - you can convert encoding without caring about what the previous encoding
> was, for example str:convert(Str, utf8); if it was utf8 it doesn't change a
> thing, if it wasn't it's converted
> - you can export it as a list or binary in the encoding you want, for
> example str:to_binary(Str, utf8)
> - you still need to specify the encoding when converting a list or binary
> to string, but maybe we could have niceties like << Str/string-utf8 >>?
>
> --
> Loïc Hoguin
> Erlang Cowboy
> Nine Nines
> http://ninenines.eu