[erlang-questions] unicode in string literals
Tue Jul 31 11:36:03 CEST 2012
There are many pros and cons to switching from Latin-1 to UTF-8 (or to any
other encoding that breaks the assumption that one byte is one
character). On one hand, lists:reverse/1 really messes up the characters in
the list (to follow the first example, the output of "a∞b" in Latin-1 is
totally different from the output of lists:reverse("b∞a") in Latin-1, the
current default). On the other hand, having, for example, Polish characters
like "Ą Ę Ć" or French "Ç Î" or German "Ö ß" or Turkish "Ş" and so on
(things get more complicated once we add languages based on different
alphabets or symbols) in the code would require your editor to support
those languages, or else you will see really strange characters there. I do
not deny that some specific projects would benefit from such a character
encoding, but think of maintaining such code in an international
"-encoding()" can make quite a mess in a file. Think of an open source
project in which devs from different countries append their own code. You
will see a lot of "-encoding()" directives in a single file.
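To make the lists:reverse/1 point concrete, here is a minimal sketch. The module name reverse_demo is made up for illustration, and the string is written as code points so the result does not depend on how the source file is encoded:

```erlang
-module(reverse_demo).
-export([run/0]).

%% Compare reversing a string held as UTF-8 bytes with reversing the
%% same string held as a list of Unicode code points.
run() ->
    %% "a∞b" as code points (U+221E is the infinity sign).
    CodePoints = [$a, 16#221E, $b],

    %% The same text as a list of UTF-8 bytes: [97,226,136,158,98].
    Utf8Bytes = binary_to_list(unicode:characters_to_binary(CodePoints)),

    %% Reversing code points keeps every character intact.
    ReversedPoints = lists:reverse(CodePoints),

    %% Reversing bytes scrambles the three-byte sequence for U+221E,
    %% so decoding the result as UTF-8 reports an error.
    ReversedBytes = lists:reverse(Utf8Bytes),
    Decoded = unicode:characters_to_list(list_to_binary(ReversedBytes), utf8),

    {ReversedPoints, ReversedBytes, Decoded}.
```

The first element is still a valid string ("b∞a"); the second is a byte list that no longer decodes as UTF-8, which is what a Latin-1 compiler effectively hands you when the source file contains UTF-8 text.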
I might be wrong, but wouldn't switching to UTF-8 by default force the
compiler to use at least 2 bytes per character? If so, what about, for
example, the databases based on Erlang for projects using strict Latin-1?
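As a rough check on the byte-width question, the sketch below measures a few characters with the standard unicode module. The module name width_demo and the choice of sample characters are mine, and this only shows encoded binary sizes, not how the compiler or a database stores strings internally:

```erlang
-module(width_demo).
-export([sizes/0]).

%% Byte sizes of single characters encoded as UTF-8, plus one
%% Latin-1 encoding for comparison.
sizes() ->
    Chars = [$a,        % U+0061, ASCII
             16#D6,     % U+00D6, "Ö", inside the Latin-1 range
             16#104,    % U+0104, "Ą", outside the Latin-1 range
             16#221E],  % U+221E, "∞"
    Utf8Sizes = [byte_size(unicode:characters_to_binary([C])) || C <- Chars],
    %% The same "Ö" encoded as Latin-1 takes a single byte.
    Latin1Size = byte_size(unicode:characters_to_binary([16#D6], unicode, latin1)),
    {Utf8Sizes, Latin1Size}.
```

So UTF-8 is variable width: ASCII stays at one byte, and only characters from 128 upward grow to two or more bytes.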
My point here is that string manipulation should be kept apart from the
code itself, with two separate modules for manipulating normal lists and
IO-lists (e.g., by extending the unicode module). But that would be my own
On Tue, Jul 31, 2012 at 10:10 AM, Loïc Hoguin <> wrote:
> On 07/31/2012 09:53 AM, Masklinn wrote:
>> On 2012-07-31, at 09:39 , Michel Rijnders wrote:
>> On Tue, Jul 31, 2012 at 9:05 AM, Joe Armstrong <> wrote:
>>>> Is "encoding(...)" a good idea?
>>>> There are four reasonable alternatives
>>>> a) - all files are Latin1
>>>> b) - all files are UTF8
>>>> c) - all files are Latin1 or UTF8 and you guess
>>>> d) - all files are Latin1 or UTF8 or anything else and you tell
>>> I understand it is quite drastic but I would prefer a separate data
>>> type for (unicode) strings.
>> For historical reasons? Because on technical grounds, the existing
>> scheme would work nicely by declaring that the integers are code points.
>> And because Unicode is identical to latin-1 in the first 256 codepoints,
>> latin1 strings would be identical.
>> The `string` module would probably need to be fixed to be unicode-aware
>> (or deprecated and removed altogether in favor of the unicode one), but
>> I'm not sure there are good reasons to change the datatype.[-1]
>> On the other hand, a dedicated datatype could allow things like Python's
>> new Flexible String Representation where an explicit "list of code
>> points" would not allow such flexibility.
>> The only thing I'd rather avoid is moving from "list of latin-1 bytes" to
>> "list of utf-8 bytes", that's just crap.
> If strings are kept as lists:
> - there is no way to identify a variable as being a list or latin1 string
> or utf8 string
> - you would have to keep track of what encoding your list is in
> - you would have to do some type conversion when you use them with
> functions like gen_tcp:send, which don't accept lists of integers > 255
> If strings are a new type:
> - you don't care about the encoding most of the time, Erlang is the one
> who should; if you want to know the encoding you could use a new BIF
> - you don't need to do type conversion when using it, Erlang can use the
> string type directly
> - you can convert encoding without caring about what the previous encoding
> was, for example str:convert(Str, utf8); if it was utf8 it doesn't change a
> thing, if it wasn't it's converted
> - you can export it as a list or binary in the encoding you want, for
> example str:to_binary(Str, utf8)
> - you still need to specify the encoding when converting a list or binary
> to string, but maybe we could have niceties like << Str/string-utf8 >>?
> Loïc Hoguin
> Erlang Cowboy
> Nine Nines
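Regarding the quoted point that functions like gen_tcp:send do not accept lists of integers > 255: gen_tcp:send/2 takes iodata, so the sketch below uses iolist_to_binary/1 as a stand-in to avoid opening a socket. The module name send_demo is hypothetical, and the explicit conversion shown is just one way to do it with today's unicode module, not the proposed str API:

```erlang
-module(send_demo).
-export([run/0]).

%% Show that a list of code points is not valid iodata once it holds
%% an integer above 255, and that encoding it first fixes that.
run() ->
    Str = [$a, 16#221E, $b],

    %% iolist_to_binary/1 stands in for anything that expects iodata;
    %% it raises badarg because 16#221E does not fit in a byte.
    Raw = try iolist_to_binary(Str)
          catch error:badarg -> badarg
          end,

    %% Explicit conversion to a UTF-8 binary, which is valid iodata.
    Encoded = unicode:characters_to_binary(Str, unicode, utf8),

    {Raw, Encoded}.
```

This is the bookkeeping the list representation pushes onto the caller: you have to know the list holds code points and pick the output encoding yourself before sending.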