[erlang-questions] EEP 40 - A proposal for Unicode variable and atom names in Erlang.

Wed Oct 31 15:44:14 CET 2012

Although there might be opinions on whether allowing Unicode variable
and atom names is a good idea, I would like to discuss EEP 40 itself.
In a previous thread there was much said about Unicode or not but I only
found the following about EEP 40, hoping I did not miss anything valuable:

On Thu, Oct 25, 2012 at 05:20:21PM +1300, Richard O'Keefe wrote:
> 
> On 23/10/2012, at 10:20 PM, Jesper Louis Andersen wrote:
> > 
> > Google Go takes two stances differently:
> > 
> > * There is *no* normalization. This means that you can write the same symbol using one codepoint or with two code points combining into the same representation. Of course this is the conservative stance where it is expected that people do not do silly things. But my guess is that it is much easier to handle. Is there a specific reason to pick normalization, apart from the obvious one? I see some similarities to tabs vs spaces for indentation here.
> 
> Normalisation is a pain in the πρωκτος.  The only thing worse is _not_ doing it.
> (As it happens, I am planning to rewrite the tokeniser of my Smalltalk system to
> accept Unicode -- the run-time already does -- and this is one of the issues I've
> been thinking about.)
> 
> I can see four options:
>  (1) say that different encodings of the same text are different
>  (2) leave it undefined whether they are different
>  (3) say that it's someone else's problem (like XML 1.0, which says
>      "Characters in names should be expressed using Normalization Form C"
>      but leaves it to the author to make it so)
>  (4) require normalisation.
> 
> The issue is a severely practical one:  can two people with different editors
> edit the same source file?  As you sapiently observe, this _is_ very like tabs
> vs spaces: your editor may think tabs are every 3 columns, but mine thinks they
> are every 8, and you didn't tell _me_ otherwise.  (Again, my Smalltalk system
> discerns method and class boundaries using indentation, and it has paid off to
> enforce no-tabs-in-source-files at check-in.)  Of the options above, it is
> only option (4) that makes multiple editors safe to use.
> 
> As it happens, I _have_ had the experience of typing exactly what I saw and having
> it fail to match, so I do not want to see anyone else suffering the same fate.
> 
> > * In Go, identifiers are exported if they begin with a codepoint in class Lu. This is also a very conservative stance since now your programs must use an Lu codepoint for variable names if we just ported that solution to Erlang. But it is quite simple again, and very easy to handle from a parser perspective.
> 
> Restriction to Lu is not an option for Erlang.  We *have* to continue to
> allow "_" as well, which is a Pc character, not an Lu character.  And if
> we allow _that_ Pc character, why not the others?  They aren't used for
> anything else in Erlang.
> 
> We really have to allow Lt as well.  It would be surpassing strange if
> Ljudevit was a variable but ǈudevit was not.
> There are 31 "Lt" letters in Unicode 6.  Of those, 27 are Greek.
> The other 4 exist for the sake of Croatian (which has an alphabet of 30
> letters).  As it happens, my maternal grandfather came from a small
> town not far from Dubrovnik.  Do I want to be the one to tell 4.4 million
> people who look rather like Granddad Covič they can't write a variable
> name in their own language using their own letters?  No, not really.
> 
> From a lexical analyser perspective, scanning variable names requires
> just two character sets: things that can begin a variable and things
> that can continue one.  How those sets are derived really has no effect
> whatever on how complicated the parsing is.  Scanning unquoted atoms is
> admittedly tricky, but that's entirely down to Erlang's _existing_
> treatment of "." and "@"; without those two to worry about we'd just
> have atom starts and atom continuations and again the derivation of
> the sets would make no difference to the scanner's complexity.
> 

That was the discussion so far. Here follow my thoughts.

Set notation mistake?
---------------------

I do not understand the BNF definition of variable in the EEP:
    variable ::= var_start var_continue*

    var_start ::= XID_Start ∩ (Lu ∪ Lt ∪ Pc ∪ Other_ID_Start)

    var_continue ::= XID_Continue ∪ "@"

As I read the Unicode XID_Start definition
<http://www.unicode.org/Public/6.2.0/ucd/DerivedCoreProperties.txt>
there are no general category Pc (Connector_Punctuation) characters in
XID_Start, hence there will be none in the set intersection
(which, as I understand it, is what '∩' means) defining var_start.
Therefore U+5F LOW LINE, aka '_' (underscore), is not allowed to start a
variable.

Is there something wrong in that set notation, or what did I misunderstand?

Was it not meant to be:
    var_start ::= (XID_Start ∩ (Lu ∪ Lt ∪ Other_ID_Start)) ∪ Pc
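The difference between the two definitions can be checked mechanically.
The following is only a rough sketch in Python, not anything from the EEP.
It assumes that Python's str.isidentifier() tests the first character
against XID_Start plus a special case for "_" (so the special case is
stripped here), that the Unicode version bundled with Python is close
enough to 6.2 for this purpose, and that Other_ID_Start consists of the
four code points listed (my reading of PropList.txt for 6.2; please
verify). The function names are made up for illustration.

```python
import unicodedata

# Assumption: contents of Other_ID_Start in Unicode 6.2 (from PropList.txt).
OTHER_ID_START = {"\u2118", "\u212e", "\u309b", "\u309c"}


def xid_start(c):
    # str.isidentifier() accepts XID_Start or "_" as the first character,
    # so the underscore special case is removed explicitly.
    return c != "_" and c.isidentifier()


def in_categories(c, categories):
    return unicodedata.category(c) in categories


def var_start_eep(c):
    # var_start ::= XID_Start ∩ (Lu ∪ Lt ∪ Pc ∪ Other_ID_Start)
    return xid_start(c) and (
        in_categories(c, {"Lu", "Lt", "Pc"}) or c in OTHER_ID_START
    )


def var_start_fixed(c):
    # var_start ::= (XID_Start ∩ (Lu ∪ Lt ∪ Other_ID_Start)) ∪ Pc
    return (
        xid_start(c) and (in_categories(c, {"Lu", "Lt"}) or c in OTHER_ID_START)
    ) or in_categories(c, {"Pc"})


if __name__ == "__main__":
    for c in ["A", "a", "_"]:
        print(repr(c), var_start_eep(c), var_start_fixed(c))
```

With the intersection as written in the EEP, "_" is rejected because it
is not in XID_Start; with Pc added by union it is accepted.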

More restricted variable names
------------------------------

Nevertheless, I would like a slightly more conservative change in how Erlang
should use Unicode in variable names and unquoted atoms.

I want to be able to read source code printed on paper and at least
understand if Ƽ = count() has a variable, an atom or an integer to the left.
This is an impossible goal because today we can already put e.g. a Cyrillic А
in any .erl file, and it will look as if it should compile, but it will not.

So I have to change that requirement into: if it compiles, I want to be able
to tell from a non-colour printed source code listing what the semantics are.
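To make the look-alike problem concrete, here is a small Python sketch
(an illustration only; the helper name describe is made up) that prints
the code point, general category and Unicode name of a few characters
that are hard to tell apart on paper. It relies on the standard
unicodedata module.

```python
import unicodedata


def describe(ch):
    # Code point, general category and official character name.
    return "U+%04X %s %s" % (ord(ch), unicodedata.category(ch), unicodedata.name(ch))


# Latin A, Cyrillic A, Latin "tone five" and the digit 5.
LOOKALIKES = ["A", "\u0410", "\u01bc", "5"]

if __name__ == "__main__":
    for ch in LOOKALIKES:
        print(describe(ch))
```

Both capital A's and U+01BC are category Lu, so a rule based on Lu alone
treats all three as variable starts even though one of them looks like a
digit.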

Therefore I think a more conservative rule for variable start is needed:
    variable ::= var_start var_continue*

    var_start ::= ("A".."Z" ∪ "_")

    var_continue ::= XID_Continue ∪ "@"

I hereby ditch the characters "À".."Ö" ∪ "Ø".."Þ" that are allowed today,
since if they are allowed there is no telling which accents are allowed,
and so we would have to allow all LATIN CAPITAL and therefore all GREEK,
CYRILLIC, ARMENIAN, GEORGIAN, GLAGOLITIC, COPTIC and DESERET CAPITAL
letters, and that is too big a set for a human to handle. Tools would
become essential.
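The conservative rule above is small enough to sketch in a few lines of
Python. This is an illustration under assumptions, not a proposed
implementation: XID_Continue is approximated by asking whether "a"
followed by the character is a valid Python identifier (Python uses
XID_Continue for non-initial characters), and the Unicode version is
whatever Python bundles. The names VAR_START, xid_continue and
is_variable are made up here.

```python
import string

# var_start ::= ("A".."Z" ∪ "_")
VAR_START = set(string.ascii_uppercase) | {"_"}


def xid_continue(c):
    # Assumption: Python checks non-initial identifier characters
    # against XID_Continue, so "a" + c is an identifier iff c is in it.
    return ("a" + c).isidentifier()


def is_variable(s):
    # variable ::= var_start var_continue*
    # var_continue ::= XID_Continue ∪ "@"
    if not s or s[0] not in VAR_START:
        return False
    return all(c == "@" or xid_continue(c) for c in s[1:])


if __name__ == "__main__":
    for name in ["Count", "_x", "X隠者", "count", "\u0410bc", "\u01bc"]:
        print(repr(name), is_variable(name))
```

Under this rule a name starting with Cyrillic А or with Ƽ is simply not
a variable, while Unicode is still free to appear after the first
character.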

I think it is better to restrict to a subset of 7-bit US-ASCII. Decent
editors have means (vim: ga, emacs: M-x describe-char) to show which
character is under the cursor, and if it is A..Z or _ below U+7F it is a
variable start. That is a feasible set to memorize even for non-English
programmers, especially considering that all reserved words are in 7-bit
US-ASCII and hence Erlang programmers must be somewhat familiar with that
charset.

Removing the Latin-1 characters > 128 will need warnings in one release
before the change is introduced in a later one, and probably a non-Unicode
compile flag. But I do not think that many have used such characters to
start variables so far.

We can then define mst_variable (maybe singleton variable) much like
in the proposed EEP:
    mst_variable ::= mst_var_start var_continue*

    mst_var_start ::= "_" ("A".."Z" ∪ "a".."z" ∪ "0".."9" ∪ "_" ∪ "@")

An alternative suggestion is to allow "@" as var_start:
    variable ::= var_start var_continue*
    var_start ::= ("A".."Z" ∪ "_" ∪ "@")

which requires no change from today for maybe singleton variables:
    mst_var_start ::= "_"

I can not think of anything particularly bad with allowing @隠者 as a
variable name. The "@" makes it distinct from an atom, and "@" is
one of the variable prefix characters in Perl (good or bad?!).

The underscore
--------------

I would like to argue against allowing all Unicode general category Pc
(Connector_Punctuation) characters in place of "_". In Unicode 6.2 this
class contains these characters:
    U+5F;   LOW LINE
    U+203F; UNDERTIE
    U+2040; CHARACTER TIE
    U+2054; INVERTED UNDERTIE
    U+FE33; PRESENTATION FORM FOR VERTICAL LOW LINE
    U+FE34; PRESENTATION FORM FOR VERTICAL WAVY LOW LINE
    U+FE4D; DASHED LOW LINE
    U+FE4E; CENTERLINE LOW LINE
    U+FE4F; WAVY LOW LINE
    U+FF3F; FULLWIDTH LOW LINE

Of these, at least U+2040 "⁀" is horizontal at the top of the line
and U+FE33 "︳" looks like a vertical bar (I guess intended for
vertically set Chinese), so they do not resemble "_" very much.
Allowing all of these would make it hard to remember whether a given
character is category Pc or something else, e.g. "|". Therefore
I think it will be enough to allow U+5F LOW LINE ("_", underscore).

An Erlang programmer will have to be able to enter many other
7-bit US-ASCII punctuation characters, e.g. ".,?:;%'", so
the underscore should pose no particular problem.
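The Pc list can be regenerated rather than copied by hand. A minimal
Python sketch, assuming only the standard unicodedata module (which
reflects the Unicode version bundled with the Python in use, so the
result may differ from 6.2 if the category has changed since); the
function name connector_punctuation is made up for illustration:

```python
import sys
import unicodedata


def connector_punctuation():
    # Every code point whose general category is Pc.
    return [
        chr(cp)
        for cp in range(sys.maxunicode + 1)
        if unicodedata.category(chr(cp)) == "Pc"
    ]


if __name__ == "__main__":
    for ch in connector_punctuation():
        print("U+%04X; %s" % (ord(ch), unicodedata.name(ch)))
```

Running such a script against the database is a cheap guard against
transcription slips in a hand-typed table.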

Unquoted atoms
--------------

The EEP proposes:
    atom_start ::= XID_Start ∖ (Lu ∪ Lt ∪ Lo ∪ Pc)
	| "." (Ll ∪ Lo)

I agree that Lu (Uppercase_Letter) and Lt (Titlecase_Letter) should
be excluded so an atom can not start with a capital-looking letter,
but Pc ∩ XID_Start = ∅ so there is no reason to subtract it, and why
subtract Lo (Other_Letter)?

There also seems to be a typo in the definition of unquoted_atom
where an iteration of atom_continue is missing.

I propose:
    unquoted_atom ::= atom_start atom_continue*

    atom_start ::= atom_start_char
	| "." atom_start_char

    atom_start_char ::= XID_Start ∖ (Lu ∪ Lt)

    atom_continue ::= XID_Continue ∪ "@"
	| "." XID_Continue
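To show that the grammar I propose is straightforward to scan, here is
a hedged Python sketch of a recogniser for it. It is an illustration
only: XID_Start and XID_Continue are approximated through
str.isidentifier() (with the "_" special case for the first character
removed), the Unicode version is whatever Python bundles, reserved words
are not considered, and all function names are made up.

```python
import unicodedata


def xid_start(c):
    # Assumption: isidentifier() accepts XID_Start or "_" first.
    return c != "_" and c.isidentifier()


def xid_continue(c):
    # Assumption: non-initial identifier characters are XID_Continue.
    return ("a" + c).isidentifier()


def atom_start_char(c):
    # atom_start_char ::= XID_Start ∖ (Lu ∪ Lt)
    return xid_start(c) and unicodedata.category(c) not in ("Lu", "Lt")


def is_unquoted_atom(s):
    # atom_start ::= atom_start_char | "." atom_start_char
    i = 1 if s[:1] == "." else 0
    if i >= len(s) or not atom_start_char(s[i]):
        return False
    i += 1
    # atom_continue ::= XID_Continue ∪ "@" | "." XID_Continue
    while i < len(s):
        c = s[i]
        if c == ".":
            if i + 1 >= len(s) or not xid_continue(s[i + 1]):
                return False
            i += 2
        elif c == "@" or xid_continue(c):
            i += 1
        else:
            return False
    return True


if __name__ == "__main__":
    for a in ["foo", "foo@bar", "foo.bar", ".foo", "隠者", "Foo", "foo.", "_foo"]:
        print(repr(a), is_unquoted_atom(a))
```

The only lookahead needed is the one character after a ".", which is
the same complication the existing treatment of "." already causes.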

General explanation
-------------------

I think the EEP could benefit from explaining more about the character
classes used, what kind of stability Unicode Standard Annex #31 is
designed to give, and such.

When I read the EEP it took several days of Unicode standard reading to
start understanding it, and I think many will hesitate before trying to
understand the EEP, which is a pity.

My first concern was whether code I write for one Unicode Erlang release
will still be valid in subsequent Erlang releases based on later Unicode
standards. It seems annex #31 is very much targeted
at solving that problem, and Unicode in itself is much about stability in
subsequent standards, so that problem seems handled, but I am not sure yet.

For example, the EEP and my proposal both define atom_start to be XID_Start
minus a set containing uppercase and titlecase letters. XID_Start is
derived from ID_Start, and ID_Start contains Other_ID_Start. I have failed
to find which codepoints are contained in Other_ID_Start. All I have
found is that it is used to give future stability to ID_Start, so that when
the standard has to remove some codepoint from ID_Start it will be added
to Other_ID_Start, and therefore XID_Start will not have lost a codepoint,
so old code will still be valid.

But since we here define atom_start as above, moving a character from Lu
or Lt into Other_ID_Start will remove it from atom_start and old code
using it will not compile. If I am not mistaken. The same applies to
the EEP's definition of var_start.

I have not managed to find any stability statements from the Unicode
Consortium about whether that could happen, mostly because I have not had
the time yet. Maybe instead the definition of atom_start above is
flawed and should use set unions only...?

In any case, I miss this kind of stability reasoning/explanation in the EEP.

-- 

/ Raimo Niskanen, Erlang/OTP, Ericsson AB