[erlang-questions] unicode in string literals

Tue Jul 31 09:36:40 CEST 2012

Hi,

On Tue, Jul 31, 2012 at 9:05 AM, Joe Armstrong <erlang@REDACTED> wrote:
> Is "encoding(...)"  a good idea?
>
> There are four reasonable alternatives
>     a) - all files are Latin1
>     b) - all files are UTF8
>     c) - all files are Latin1 or UTF8 and you guess
>     d) - all files are Latin1 or UTF8 or anything else and you tell

By the question above, do you mean to imply that '-encoding(...)' will
allow mixed encodings in a project, which is not a reasonable
alternative?

> Today we do a).
> What would be the consequences of changing to b) in (say) the next
> major release?
>
> This would break some code - but how much? - how much code is there
> with non Latin1 printable characters
> in string literals?

I don't think that would be the only problem: there is also all the
code that assumes that source code is Latin-1. In addition, tools that
handle source code will need to recognize both the old and the new
encoding, as they may have to work with an older version of a file
from before the conversion.
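
As a rough illustration of what such a tool would have to do, here is a
minimal sketch that guesses the encoding of a source file's bytes. The
module and function names (src_encoding, guess/1) are made up for this
example; only unicode:characters_to_list/2 is a real OTP call. Note that
a pure ASCII file is valid in both encodings, and that some Latin-1
byte sequences also happen to be valid UTF-8, so this is a heuristic
and not a reliable detector.

```erlang
-module(src_encoding).
-export([guess/1]).

%% Hypothetical helper: guess whether the bytes of a source file are
%% UTF-8 or Latin-1. Any byte sequence is valid Latin-1, so we can only
%% test for UTF-8 and fall back to Latin-1 when decoding fails.
guess(Bin) when is_binary(Bin) ->
    case unicode:characters_to_list(Bin, utf8) of
        Chars when is_list(Chars) ->
            utf8;      % decoded cleanly (pure ASCII also ends up here)
        _ErrorOrIncomplete ->
            latin1     % {error, ...} or {incomplete, ...} tuple
    end.
```

For example, src_encoding:guess(<<195,165>>) gives utf8, while
src_encoding:guess(<<229,32>>) gives latin1.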

Another question that needs to be answered is what encoding the source
code will use outside strings, quoted atoms and comments: do we want
atoms and variable names to be UTF-8 too? I ask because I have seen at
least one example of code that uses extended Latin-1 characters in
those places.

Also, what should string manipulation functions do by default? Should
they assume an encoding? I think the only way to remain sane would be
to have a special string type, tagged with its encoding. As it is now,
one can use string manipulation functions on lists of arbitrary
integers and list manipulation functions on strings.
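
To make the point concrete, here is a small sketch (the module name
string_list_demo is invented for this example) showing that a string is
just a list of integers, and that nothing in the list records which
encoding was intended:

```erlang
-module(string_list_demo).
-export([demo/0]).

demo() ->
    %% A string literal is just a list of integers.
    true = ("abc" =:= [97, 98, 99]),
    %% List functions work on strings...
    "cba" = lists:reverse("abc"),
    %% ...and string functions accept a plain list of integers.
    "ABC" = string:to_upper([97, 98, 99]),
    %% The same two integers can be read as one UTF-8 encoded character
    %% or as two Latin-1 characters; the list itself does not say which.
    Bytes = list_to_binary([195, 165]),
    [229] = unicode:characters_to_list(Bytes, utf8),
    [195, 165] = unicode:characters_to_list(Bytes, latin1),
    ok.
```

Every match in demo/0 succeeds, so the function returns ok.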

Would a syntactic construct like u"some string" that returns a tagged
utf8 string help?
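
A minimal sketch of what such a tagged string could look like as a
library, without any new syntax. Everything here is hypothetical: the
module name ustring, the {ustring, Encoding, Binary} representation and
the function names are assumptions for illustration, not an existing
OTP API. Only the calls into the unicode module are real.

```erlang
-module(ustring).
-export([from_utf8/1, from_latin1/1, len/1, to_utf8/1]).

%% Hypothetical representation: {ustring, Encoding, Binary}.
%% The tag travels with the data, so functions need not guess.
from_utf8(Bin) when is_binary(Bin) -> {ustring, utf8, Bin}.
from_latin1(Bin) when is_binary(Bin) -> {ustring, latin1, Bin}.

%% Length in characters, not bytes. This sketch assumes the binary is
%% valid in its tagged encoding; invalid input makes it crash.
len({ustring, Enc, Bin}) ->
    length(unicode:characters_to_list(Bin, Enc)).

%% Convert to a UTF-8 tagged string, re-encoding only when needed.
to_utf8({ustring, utf8, _} = S) ->
    S;
to_utf8({ustring, latin1, Bin}) ->
    {ustring, utf8, unicode:characters_to_binary(Bin, latin1, utf8)}.
```

With this, ustring:len(ustring:from_utf8(<<195,165>>)) is 1, while the
same bytes tagged as Latin-1 have length 2. A u"..." literal would
then only be shorthand for building the utf8-tagged value.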

best regards,
Vlad