[erlang-questions] A proposal for Unicode variable and atom names in Erlang.

Fri Oct 19 08:06:43 CEST 2012

If it were still possible to submit EEPs in plain text,
this would be an EEP.  If someone else would like to
package this up as an EEP and submit it (under their
name, mine, or both), feel free.

Forces:
 (1) Support for Unicode continues to increase, with
     minimal source code support about to arrive.
 (2) Unicode variable names and unquoted atoms are not
     here yet, so now is the time to settle on a design.
 (3) They will need to come.  There may be legal or
     institutional reasons why Unicode-capable languages
     are required.  Some people just want to use their
     own language and script.  Erlang's strength in
     network applications means that writing
     Internationalized Domain Names as unquoted atoms
     would be as convenient as writing ASCII domain names
     like www.example.com (which needs no quotes in Erlang).
 (4) There is a framework for Unicode identifiers in
     Unicode Standard Annex 31 (UAX#31), which several
     programming languages follow, including Ada, Java,
     C++, C, C#, JavaScript, and Python (see section 2.3 of
     http://docs.python.org/release/3.1.5/reference/lexical_analysis.html
     and also http://www.python.org/dev/peps/pep-3131/).
 (5) Existing Erlang identifiers should remain valid,
     including ones containing "@" and ".".
 (6) Existing Erlang support features, such as ignoring
     names of the form [_][a-zA-Z0-9_]* when reporting
     singleton variables, should not be broken.
 (7) We should not "steal" any characters to use as "magic
     markers" for variables because they might be needed for
     other purposes.  A good (bad) example of this is "?", which
     could be used for several things if it were not used for macros.     

Reference

    Names of sets of characters, XID_Start, XID_Continue, Lu, Lt, Lo, Pc,
    Other_Id_Start, are drawn from Unicode and UAX#31.

	Lu = upper case letters
	Lt = title case letters
	Ll = lower case letters
	Lo = other letters (letters without case, such as ideographs)
	Pc = connector punctuators, including the low line (_) and
	     a number of other characters like undertie (‿).
	Other_Id_Start = script capital p, estimated symbol,
	     katakana-hiragana voiced sound mark, and
	     katakana-hiragana semi-voiced sound mark.
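
    These categories can be inspected directly.  The following is a
    minimal sketch in Python, used here only because its standard
    unicodedata module exposes general categories; it is not part of
    the proposal.  The helper name bmp_chars_in_category is made up
    for illustration, and the output depends on the Unicode version
    bundled with the interpreter.

```python
import unicodedata

def bmp_chars_in_category(category):
    """Return all BMP characters whose general category equals `category`."""
    return [chr(cp) for cp in range(0x10000)
            if unicodedata.category(chr(cp)) == category]

if __name__ == "__main__":
    # List the connector punctuators (Pc) with their code points and names.
    for ch in bmp_chars_in_category("Pc"):
        print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```

    Running it lists the low line, the undertie, and the other
    connector punctuators discussed under "Variables".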

Variables

    variable ::= var_start var_continue*

    var_start ::= (XID_Start ∩ (Lu ∪ Lt ∪ Other_Id_Start)) ∪ Pc

    var_continue ::= XID_Continue ∪ "@"

    The choice of XID here follows Python.  It ensures that the normalisation
    of a variable is still a variable.  In fact Unicode variables should be
    normalised.  Unicode has enough look-alike characters that we cannot hope
    for "look the same <=> are the same" to be true, but we should go _some_
    way in that direction.
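
    As a rough illustration, the variable rule can be sketched in
    Python.  Several things here are assumptions rather than part of
    the proposal: the normalisation form is taken to be NFKC (as in
    Python's PEP 3131), Pc characters are accepted as starters
    directly because the text below relies on underscore-led
    variables, Other_Id_Start is left out because its treatment is
    still open, and XID membership is probed through CPython's
    str.isidentifier().  All function names are hypothetical.

```python
import unicodedata

def is_xid_start(ch):
    # CPython accepts XID_Start or "_" as the first identifier character,
    # so exclude "_" to get a plain XID_Start test.
    return ch != "_" and ch.isidentifier()

def is_xid_continue(ch):
    # After a valid first character, CPython accepts any XID_Continue character.
    return ("a" + ch).isidentifier()

def is_var_start(ch):
    cat = unicodedata.category(ch)
    # Assumption: Pc starters are accepted as-is; Lu/Lt must also be XID_Start.
    return cat == "Pc" or (cat in ("Lu", "Lt") and is_xid_start(ch))

def is_var_continue(ch):
    return ch == "@" or is_xid_continue(ch)

def is_variable(name):
    # Assumption: NFKC is the normalisation form applied to source names.
    name = unicodedata.normalize("NFKC", name)
    return (name != "" and is_var_start(name[0])
            and all(is_var_continue(c) for c in name[1:]))
```

    Under these assumptions "X", "Δx@node" and "_隠者" are accepted,
    while "x", "隠者" and "@X" are not.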

    Variables in scripts that do not distinguish letter case have to
    begin with _some_ special character to ensure that they are not
    mistaken for unquoted atoms.  There are 10 Pc characters in the Basic
    Multilingual Plane.  The Erlang parser treats a variable beginning
    with an underscore specially: there will be no complaint if it is a
    singleton.  There are 9 other Pc characters for which this special
    treatment is not applied.  Of course, someone might be using fonts
    that include, say, Arabic letters but not the undertie.  We can
    deal with that by revising the underscore rule.

	Variable does not begin with a Pc character =>
		should not be a singleton.

	Variable is just a Pc character and nothing else =>
		is a wild card.

	Variable begins with a Pc character followed by a
	Latin-1 character =>
		may be a singleton.

	Variable begins with a Pc character followed by
	a character outside the Latin-1 range =>
		should not be a singleton.

    Thus ‿ is a wild card, 隠者 is an atom, _隠者 should not be
    a singleton, but __隠者 _may_ be a singleton.  This rule is a
    consistent generalisation of the existing rule.
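
    The four cases above can be written down as a small decision
    function.  This is only a sketch in Python: the name
    singleton_status and the returned strings are invented for
    illustration, and the input is assumed to be a well-formed,
    non-empty variable name.

```python
import unicodedata

def singleton_status(var):
    """Classify a variable name under the revised underscore rule."""
    if unicodedata.category(var[0]) != "Pc":
        return "should not be a singleton"
    if len(var) == 1:
        return "wild card"
    if ord(var[1]) <= 0xFF:  # second character is in the Latin-1 range
        return "may be a singleton"
    return "should not be a singleton"

if __name__ == "__main__":
    for v in ["‿", "_", "X", "_Acc", "_隠者", "__隠者"]:
        print(v, "=>", singleton_status(v))
```

    Note that __隠者 falls under the third case because its second
    character, the low line, is itself a Latin-1 character.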

Unquoted atoms

    unquoted_atom ::= atom_start atom_continue*

    atom_start ::= XID_Start \ (Lu ∪ Lt ∪ Lo ∪ Pc)
                |  "." (Ll ∪ Lo)

    atom_continue ::= XID_Continue ∪ "@"
                   |  "." (Ll ∪ Lo)

    Again the choice of XID follows Python, and ensures that the
    normalisation of an unquoted atom is still an unquoted atom.
    Unquoted atoms should be normalised.

    The details of Erlang unquoted atoms are somewhat subtle; I have
    checked my understanding experimentally.
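
    The point about normalisation can be seen in a short Python
    sketch.  It assumes NFKC as the normalisation form (the form
    Python uses for identifiers; the proposal does not name one) and
    probes XID_Start through CPython's str.isidentifier().  The
    helper names are hypothetical.

```python
import unicodedata

def nfkc(name):
    """Normalise a name with NFKC (assumed normalisation form)."""
    return unicodedata.normalize("NFKC", name)

def is_xid_start(ch):
    # CPython accepts XID_Start or "_" as the first identifier character.
    return ch != "_" and ch.isidentifier()

if __name__ == "__main__":
    # The "fi" ligature normalises to two ordinary letters.
    print(nfkc("\ufb01le"))
    # GREEK YPOGEGRAMMENI is a letter (Lm), but its NFKC form starts with a
    # space, so it is left out of XID_Start.
    print(unicodedata.category("\u037a"), is_xid_start("\u037a"))
```

    A name written with the ligature and one written with plain
    letters therefore denote the same atom after normalisation, and
    characters whose normal form could not start a name are already
    excluded by XID_Start.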

Keywords

    Keywords have the form of unquoted atoms.  No new keywords are
    introduced.

Specifics

-  Any Python identifier or keyword is
   an Erlang variable or unquoted atom or keyword.

-  @ signs may occur freely in variables and unquoted atoms except as the
   first character, as now.

-  Dots may not be followed by capital letters, digits, or underscores,
   as now.

-  I am not sure whether modifier letters should be allowed after a dot.

-  I am not sure what to do with the Other_ID_Start characters.
   Script capital p _looks_ like a capital p and even has "capital" in
   its name.  All other "* SCRIPT CAPITAL *" characters are upper case
   letters.  Surely it should be allowed to start a variable.
   The estimated symbol looks like an enlarged lower case e; other symbols
   that look like letters are classified as letters.  You'd expect this
   to begin an atom.  As for the Katakana-Hiragana voicing marks, I have
   no intuition whatever.  Assigning the whole group to atoms seems
   safest.

-  All existing variable names and unquoted atoms remain legal, and no
   new variable or atom forms using only Latin-1 characters have been
   introduced.
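
   The dot rule from the list above, read together with the grammar's
   "." (Ll ∪ Lo) alternative, can be checked in isolation.  This is a
   sketch in Python with a made-up function name; it tests only the
   character after each dot and is not a full atom recogniser.

```python
import unicodedata

def dots_ok(atom):
    """True if every "." in `atom` is followed by an Ll or Lo character."""
    for i, ch in enumerate(atom):
        if ch == ".":
            nxt = atom[i + 1:i + 2]  # empty string if the dot is last
            if nxt == "" or unicodedata.category(nxt) not in ("Ll", "Lo"):
                return False
    return True

if __name__ == "__main__":
    for a in ["www.example.com", "пример.испытание", "例え.テスト",
              "a.B", "a.1", "a._b", "a."]:
        print(a, "=>", dots_ok(a))
```

   The first three names pass; the remaining four fail because the dot
   is followed by a capital letter, a digit, an underscore, or nothing.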

Trouble spot