[erlang-questions] A proposal for Unicode variable and atom names in Erlang.

Thu Oct 25 06:20:21 CEST 2012

On 23/10/2012, at 10:20 PM, Jesper Louis Andersen wrote:
> 
> Google Go takes two stances differently:
> 
> * There is *no* normalization. This means that you can write the same symbol using one codepoint or with two code points combining into the same representation. Of course this is the conservative stance where it is expected that people do not do silly things. But my guess is that it is much easier to handle. Is there a specific reason to pick normalization, apart from the obvious one? I see some similarities to tabs vs spaces for indentation here.

Normalisation is a pain in the πρωκτος.  The only thing worse is _not_ doing it.
(As it happens, I am planning to rewrite the tokeniser of my Smalltalk system to
accept Unicode -- the run-time already does -- and this is one of the issues I've
been thinking about.)

I can see four options:
 (1) say that different encodings of the same text are different
 (2) leave it undefined whether they are different
 (3) say that it's someone else's problem (like XML 1.0, which says
     "Characters in names should be expressed using Normalization Form C"
     but leaves it to the author to make it so)
 (4) require normalisation.

The issue is a severely practical one:  can two people with different editors
edit the same source file?  As you sapiently observe, this _is_ very like tabs
vs spaces: your editor may think tabs are every 3 columns, but mine thinks they
are every 8, and you didn't tell _me_ otherwise.  (Again, my Smalltalk system
discerns method and class boundaries using indentation, and it has paid off to
enforce no-tabs-in-source-files at check-in.)  Of the options above, it is
only option (4) that makes multiple editors safe to use.

As it happens, I _have_ had the experience of typing exactly what I saw and having
it fail to match, so I do not want to see anyone else suffering the same fate.
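
To make the hazard concrete, here is a minimal sketch in Python (not
Erlang), using only the standard unicodedata module.  The identifier
"état" and the helper name same_name are made up for illustration; the
point is just that two editors can emit different code point sequences
for text that looks identical, and that comparing after conversion to
Normalization Form C removes the difference.

```python
import unicodedata

# Two spellings of what displays as the same identifier.
composed = "\u00e9tat"      # e-acute as the single code point U+00E9
decomposed = "e\u0301tat"   # "e" followed by U+0301 COMBINING ACUTE ACCENT


def same_name(a: str, b: str) -> bool:
    """Compare two identifiers after normalising both to NFC.

    Hypothetical helper: it stands in for what a tokeniser would do
    under option (4).
    """
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# Option (1): the raw code point sequences are simply different.
raw_equal = composed == decomposed

# Option (4): after normalisation the two spellings are the same name.
normalised_equal = same_name(composed, decomposed)

print(len(composed), len(decomposed), raw_equal, normalised_equal)
```

Under option (1) these would be two distinct variables that no reader
could tell apart on screen.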

> * In Go, identifiers are exported if they begin with a codepoint in class Lu. This is also a very conservative stance since now your programs must use an Lu codepoint for variable names if we just ported that solution to Erlang. But it is quite simple again, and very easy to handle from a parser perspective.

Restriction to Lu is not an option for Erlang.  We *have* to continue to
allow "_" as well, which is a Pc character, not an Lu character.  And if
we allow _that_ Pc character, why not the others?  They aren't used for
anything else in Erlang.

We really have to allow Lt as well.  It would be surpassing strange if
Ljudevit was a variable but ǈudevit was not.
There are 31 "Lt" letters in Unicode 6.  Of those, 27 are Greek.
The other 4 are Latin digraphs (ǅ, ǈ, ǋ, ǲ), the first three of which
exist for the sake of Croatian (which has an alphabet of 30 letters).
As it happens, my maternal grandfather came from a small
town not far from Dubrovnik.  Do I want to be the one to tell 4.4 million
people who look rather like Granddad Covič they can't write a variable
name in their own language using their own letters?  No, not really.
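
The categories are easy to check for yourself.  The following is a
rough sketch in Python, again using only the standard unicodedata
module; the variable names are mine, and the counts it produces depend
on the version of the Unicode tables shipped with your Python rather
than on Unicode 6 specifically.

```python
import sys
import unicodedata

# General categories of the three kinds of first character discussed above.
first_letters = {
    "L": unicodedata.category("L"),            # ordinary capital letter
    "\u01c8": unicodedata.category("\u01c8"),  # the titlecase digraph Lj
    "_": unicodedata.category("_"),            # underscore
}

# Every titlecase (Lt) code point known to this Python's Unicode tables.
titlecase = [
    chr(cp)
    for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)) == "Lt"
]

# Split them by script, going by the start of the official character name.
greek = [c for c in titlecase if unicodedata.name(c).startswith("GREEK")]
latin = [c for c in titlecase if unicodedata.name(c).startswith("LATIN")]

print(first_letters)
print(len(titlecase), len(greek), len(latin))
```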

From a lexical analyser perspective, scanning variable names requires
just two character sets: things that can begin a variable and things
that can continue one.  How those sets are derived really has no effect
whatever on how complicated the scanning is.  Scanning unquoted atoms is
admittedly tricky, but that's entirely down to Erlang's _existing_
treatment of "." and "@".  Without those two to worry about, we'd just
have atom starts and atom continuations, and again the derivation of
the sets would make no difference to the scanner's complexity.
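
The two-set view of variable scanning can be sketched as follows, in
Python for convenience.  The particular category sets VAR_START and
VAR_CONTINUE are assumptions chosen for illustration, not a proposal
for what Erlang should adopt, and scan_variable is a hypothetical
helper.  Notice that the loop never looks at how the sets were built:
swap in different sets and the scanner is unchanged.

```python
import unicodedata

# Assumed sets of general categories, for illustration only.
VAR_START = {"Lu", "Lt", "Pc"}
VAR_CONTINUE = VAR_START | {"Ll", "Lm", "Lo", "Nl", "Mn", "Mc", "Nd"}


def can_start(ch: str) -> bool:
    """True if ch may begin a variable name under the assumed sets."""
    return unicodedata.category(ch) in VAR_START


def can_continue(ch: str) -> bool:
    """True if ch may continue a variable name under the assumed sets."""
    return unicodedata.category(ch) in VAR_CONTINUE


def scan_variable(text: str, pos: int = 0):
    """Return (name, next_pos) if a variable name starts at pos, else None."""
    if pos >= len(text) or not can_start(text[pos]):
        return None
    end = pos + 1
    while end < len(text) and can_continue(text[end]):
        end += 1
    return text[pos:end], end


print(scan_variable("Ljudevit = 1"))
print(scan_variable("\u01c8udevit, X"))
print(scan_variable("foo"))
```

With these sets both spellings of Ljudevit scan as variables, and
"foo", which starts with an Ll character, does not.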