[erlang-questions] unicode in string literals

Wed Aug 1 03:56:49 CEST 2012

On 31/07/2012, at 7:36 PM, Vlad Dumitrescu wrote:
> By the question above, do you mean to imply that '-encoding(...)' will
> allow mixed encodings in a project, which is not a reasonable
> alternative?

It's not clear to me what you mean by a 'project',
but why shouldn't a module written by someone who wants
comments in Māori (note the macron? Latin-4 or Unicode needed)
use a module written by someone who wants comments in Swedish?

It's no worse (and no better!) than having a 'project' where
some of the files assume tabs are set every 8 characters and
some of them assume tabs are set every 4 characters.  It's a
thing you need written down; it's a thing your tools need to
understand; and it's a situation that doesn't need to persist
with sources that are under your control.

> I don't think that would be the single problem, but also all the code
> that assumes that source code is latin-1. Also, tools that handle
> source code will need to be able to recognize both the old and new
> encodings, as they might need to have to work with an older version of
> a file, before the conversion.

The whole point of an -encoding directive is that it is something
that syntax tools should handle; by the time your code gets an AST
or a token list, encodings are entirely a thing of the past.
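
A minimal sketch of that division of labour, in Python rather than Erlang
purely so that it can be run standalone.  The `-encoding("...").` spelling,
the `read_source` helper, the 256-byte search window and the Latin-1 default
are all assumptions made for illustration; none of them is an existing
Erlang feature.  The point is only that one function looks at the
declaration, and everything downstream receives code points.

```python
import re

# Hypothetical directive syntax: -encoding("utf-8"). near the top of the file.
# It is matched on raw bytes, which works for any ASCII-compatible encoding.
_DIRECTIVE = re.compile(rb'-encoding\(\s*"([A-Za-z0-9_.-]+)"\s*\)\s*\.')


def read_source(raw: bytes, default: str = "latin-1") -> list:
    """Decode source bytes into a list of Unicode code points.

    Only this function looks at the encoding; callers see code points.
    """
    head = raw[:256]  # assumption: the directive sits in the first 256 bytes
    match = _DIRECTIVE.search(head)
    encoding = match.group(1).decode("ascii") if match else default
    return [ord(ch) for ch in raw.decode(encoding)]


# An old file with no directive (treated as Latin-1) and a new UTF-8 file.
old_file = 'X = "caf\u00e9".\n'.encode("latin-1")
new_file = '-encoding("utf-8").\nX = "caf\u00e9".\n'.encode("utf-8")

old_points = read_source(old_file)
new_points = read_source(new_file)
```

Apart from the directive line itself, the two files yield the same code
points, so a tokeniser fed from `read_source` cannot tell which encoding
either file was stored in.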

Gambit Scheme allows different files in a program to use different
encodings.  It's no big deal:  _only_ the code that converts between
a stream of bytes and a stream of characters knows anything about
encodings; internally it's all Unicode.

I haven't done this yet for my Smalltalk compiler because there
are other more urgent issues (like working around C compilers that
are trying to be helpful but fail), but the design work is done and
it should leave the tokeniser running at about the same speed as
the old Latin-1-only tokeniser.

There *will* be a period when I want to keep my old Latin-1 files
(don't fix what isn't broken) but want to start using Unicode in
new work.

SWI Prolog actually lets you change the encoding within a file,
which sounds crazy but maybe Jan wanted the machinery to be there
in case someone wanted ISO 2022 support.  (Because that's basically
what 2022 *is*: switching encoding aspects on the fly.)

Why should a Japanese programmer be forbidden to write in her own
script just because some of the source files that get loaded at
run time are encoded in Latin 1?

> 
> Another question that needs to be answered is also what encoding will
> the source code use outside strings and quoted atoms and comments

"Encoding" is a whole-file property.  If the comments are encoded in
ISO 8859-5 (ISO Cyrillic), so are the strings, and if the strings are
encoded in ISO 8859-5, so are the atoms, both quoted and unquoted.
Encoding logically concerns the interface between the tokeniser and
the external byte stream (in the Unisys ClearPath MCP systems,
translation between encodings is done by the operating system before
the data become available to the program).  Once the changeover has
been made, the tokeniser should think that *all* characters are
Unicode characters.

> : do
> we want atoms and variable names to be utf8 too? Because  I've seen at
> least an example of code that uses extended latin-1 characters in
> those places.

That's not a problem.  If a file is encoded in ISO Latin 1, then certain
Unicode characters are encoded a certain way, BUT once into the tokeniser,
nobody knows or cares what that was.  If another file is encoded in UTF-8,
then certain Unicode characters are encoded in a different way, BUT once
into the tokeniser, nobody knows or cares what that was.

Encode "(a×2)÷4 = ½a" as 28,61,d7,32,29,f7,34,20,3d,20,bd,61 (Latin-1)
or as 28,61,c3,97,32,29,c3,b7,34,20,3d,20,c2,bd,61 (UTF-8),
and as long as the tokeniser knows what it's getting, it should make
*no* difference to what you get, namely the list
[40,97,215,50,41,247,52,32,61,32,189,97] of integers one per Unicode
code-point.  That's how it works in SWI Prolog.
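
The same check can be run mechanically.  This is a small Python sketch, not
SWI Prolog or Erlang code; the variable names are illustrative only.  It
decodes the Latin-1 bytes (with × as d7) and the UTF-8 bytes and compares
both results with the list of code points.

```python
# "(a×2)÷4 = ½a" as Latin-1 bytes and as UTF-8 bytes (hex).
latin1_bytes = bytes.fromhex("28 61 d7 32 29 f7 34 20 3d 20 bd 61")
utf8_bytes = bytes.fromhex("28 61 c3 97 32 29 c3 b7 34 20 3d 20 c2 bd 61")

# Decoding is the only step that needs to know the encoding.
from_latin1 = [ord(ch) for ch in latin1_bytes.decode("latin-1")]
from_utf8 = [ord(ch) for ch in utf8_bytes.decode("utf-8")]

expected = [40, 97, 215, 50, 41, 247, 52, 32, 61, 32, 189, 97]
print(from_latin1 == expected, from_utf8 == expected)  # prints: True True
```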

> Also, what should string manipulation functions do by default, should
> they assume an encoding?

No.  That would make life insanely complicated.  (Well, let's face it,
Unicode is already barking mad; this would make it *rabid* barking mad.)

> I think the only way to remain sane would be
> to have a special string type, tagged with the encoding

No, that's a way to go completely crazy.

The simple way is to distinguish between an inside and an outside.
INSIDE, everything is just Unicode.  OUTSIDE is where the wild
things are.  Encodings are *ONLY* relevant when you switch
between text encoded as byte sequences and text represented as
Unicode code point sequences.

I mean, can you *imagine* the complexity if "0" =:= "0" fails
because the first is tagged as Latin-1 and the second is tagged
as UTF-8? 
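
A rough illustration of the difference, in Python for runnability.  The
`decode` helper and the tuple-based "tagged string" are hypothetical
stand-ins, not a proposal for how Erlang would represent anything.  Untagged
strings that came from different encodings compare equal; tagged ones do
not, even though the text is identical.

```python
def decode(raw: bytes, encoding: str) -> str:
    """Boundary function: the only place an encoding name appears."""
    return raw.decode(encoding)


zero_a = decode(b"\x30", "latin-1")
zero_b = decode(b"\x30", "utf-8")
half_a = decode(b"\xbd", "latin-1")    # "½" as one Latin-1 byte
half_b = decode(b"\xc2\xbd", "utf-8")  # "½" as two UTF-8 bytes

# Inside, the strings are plain code point sequences and compare equal.
same_zero = zero_a == zero_b
same_half = half_a == half_b

# A naive tagged representation carries the encoding along with the text,
# so equality now depends on where the text came from.
tagged_a = ("latin-1", zero_a)
tagged_b = ("utf-8", zero_b)
tagged_equal = tagged_a == tagged_b  # False: the tags differ, the text does not
```

Every comparison, concatenation and search on tagged strings would need a
rule for mismatched tags; with untagged strings the question never arises.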

How Unicode code-point sequences are represented inside the
machine-level representation of an Erlang atom, Erlang source code
should have no reason whatever to care.  They could be UTF-8; they
could be UTF-16; they could be SCSU; they could be BOCU; they could
be something else entirely.

Converting between strings and binaries is the one place where Erlang
source code should have any reason to care, and it does have a reason
to care.  But you will perceive that it is the *binary* that needs to
be associated with an encoding, not the *string*.
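
To make that concrete, here is a short Python sketch (Python's `str` and
`bytes` standing in for Erlang strings and binaries; the variable names are
illustrative).  One string yields different byte sequences under different
encodings, and reading bytes under the wrong label silently yields different
text, which is why the label belongs with the bytes.

```python
text = "(a\u00d72)\u00f74 = \u00bda"  # "(a×2)÷4 = ½a", 12 code points

# The same string, three different byte sequences.
as_latin1 = text.encode("latin-1")     # 12 bytes
as_utf8 = text.encode("utf-8")         # 15 bytes
as_utf16 = text.encode("utf-16-be")    # 24 bytes

# Round trips succeed when the bytes are read with the right label.
round_trip_ok = (
    as_latin1.decode("latin-1") == text
    and as_utf8.decode("utf-8") == text
    and as_utf16.decode("utf-16-be") == text
)

# Reading UTF-8 bytes as Latin-1 does not fail; it produces different text.
misread = as_utf8.decode("latin-1")
```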
> 
> Would a syntactic construct like u"some string" that returns a tagged
> utf8 string help?

No.  However, <<"some string"/utf8>> *would* make sense.