[erlang-questions] A proposal for Unicode variable and atom names in Erlang.

Fri Oct 19 17:10:27 CEST 2012

I have converted the text below into EEP 40 and listed it.

    http://www.erlang.org/eeps/eep-0040.html

It would be nice if Richard could check it and see if anything
got lost in translation. Especially the odd line "Trouble Spot"
is not supposed to be there, I guess, but I kept it.

/ Raimo Niskanen, as EEP editor

On Fri, Oct 19, 2012 at 07:06:43PM +1300, Richard O'Keefe wrote:
> If it were still possible to submit EEPs in plain text,
> this would be an EEP.  If someone else would like to
> package this up as an EEP and submit it (under their
> name, mine, or both), feel free.
> 
> Forces:
>  (1) Support for Unicode continues to increase, with
>      minimal source code support about to arrive.
>  (2) Unicode variable names and unquoted atoms are not
>      here yet, so now is the time to settle on a design.
>  (3) They will need to come.  There may be legal or
>      institutional reasons why Unicode-capable languages
>      are required.  Some people just want to use their
>      own language and script.  Erlang's strength in
>      network applications means that being able to
>      represent Internationalized Domain Names as unquoted
>      atoms would be just as much of a convenience as
>      being able to represent ASCII domain names like
>      www.example.com (which needs no quotes in Erlang) is.
>  (4) There is a framework for Unicode identifiers in
>      Unicode Standard Annex 31 (UAX#31), and several
>      programming languages draw on it, including Ada, Java,
>      C++, C, C#, JavaScript, and Python (see section 2.3 of
>      http://docs.python.org/release/3.1.5/reference/lexical_analysis.html
>      and also http://www.python.org/dev/peps/pep-3131/).
>  (5) Existing Erlang identifiers should remain valid,
>      including ones containing "@" and ".".
>  (6) Existing Erlang support features, such as ignoring
>      names of the form [_][a-zA-Z0-9_]* when reporting
>      singleton variables, should not be broken.
>  (7) We should not "steal" any characters to use as "magic
>      markers" for variables because they might be needed for
>      other purposes.  A good (bad) example of this is "?", which
>      could be used for several things if it were not used for macros.     
> 
> Reference
> 
>     Names of sets of characters, XID_Start, XID_Continue, Lu, Lt, Ll, Lo,
>     Pc, Other_ID_Start, are drawn from Unicode and UAX#31.
> 
>         Lu = upper case letters
>         Lt = title case letters
>         Ll = lower case letters
>         Lo = other letters (letters with no case distinction)
>         Pc = connector punctuators, including the low line (_) and
>              a number of other characters like undertie (‿).
>         Other_ID_Start = script capital p, estimated symbol,
>              katakana-hiragana voiced sound mark, and
>              katakana-hiragana semi-voiced sound mark.
> 
> Variables
> 
>     variable ::= var_start var_continue*
> 
>     var_start ::= XID_Start ∩ (Lu ∪ Lt ∪ Pc ∪ Other_ID_Start)
> 
>     var_continue ::= XID_Continue ∪ "@"
> 
>     The choice of XID here follows Python.  It ensures that the normalisation
>     of a variable is still a variable.  In fact Unicode variables should be
>     normalised.  Unicode has enough look-alike characters that we cannot hope
>     for "look the same <=> are the same" to be true, but we should go _some_
>     way in that direction.
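A small illustration of the normalisation point above, written in Python because its standard library exposes Unicode normalisation directly. The message does not say which normalisation form is meant; NFKC is assumed here only because that is the form Python (PEP 3131) applies to identifiers. The helper name normalise_name is made up for this sketch.

```python
import unicodedata


def normalise_name(name: str) -> str:
    """Return the NFKC form of a name.

    NFKC is an assumption: the proposal does not name a form.
    """
    return unicodedata.normalize("NFKC", name)


# ANGSTROM SIGN (U+212B) and LATIN CAPITAL LETTER A WITH RING ABOVE
# (U+00C5) look the same but are different code points.
angstrom_sign = "\u212Bngstrom"
a_with_ring = "\u00C5ngstrom"

# After normalisation the two spellings compare equal.
same_after = normalise_name(angstrom_sign) == normalise_name(a_with_ring)

# A base letter followed by COMBINING ACUTE ACCENT (U+0301) is composed
# into a single precomposed code point.
composed = normalise_name("e\u0301")
```

Normalising before comparison means that two spellings differing only in such encodings name the same variable. It does nothing for look-alikes that are distinct letters, such as Latin A and Greek Alpha, which is why the text only claims to go some way towards "look the same <=> are the same".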
> 
>     Variables in scripts that do not distinguish letter case have to
>     begin with _some_ special character to ensure that they are not
>     mistaken for unquoted atoms.  There are 10 Pc characters in the Basic
>     Multilingual Plane.  The Erlang parser treats a variable beginning
>     with an underscore specially: there will be no complaint if it is a
>     singleton.  There are 9 other Pc characters for which this special
>     treatment is not applied.  Of course, someone might be using fonts
>     that do include, say, Arabic letters but not, say, the undertie.
>     We can deal with that by revising the underscore rule.
> 
> 	Variable does not begin with a Pc character =>
> 		should not be a singleton.
> 
> 	Variable is just a Pc character and nothing else =>
> 		is a wild card.
> 
> 	Variable begins with a Pc character followed by a
> 	Latin-1 character =>
> 		may be a singleton.
> 
> 	Variable begins with a Pc character followed by
> 	a character outside the Latin-1 range =>
> 		should not be a singleton.
> 
>     Thus ‿ is a wild-card, 隠者 is an atom, _隠者 should not be
>     a singleton, but __隠者 _may_ be a singleton.  This rule is a
>     consistent generalisation of the existing rule.
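The four cases of the revised underscore rule can be sketched as follows. This is an illustration in Python, not part of the proposal and not how the Erlang compiler is implemented. The function names are made up, the input is assumed to be a lexically valid variable name already, and "Latin-1" is taken to mean code points U+0000 to U+00FF.

```python
import unicodedata


def is_pc(ch: str) -> bool:
    """True if ch is in Unicode general category Pc (connector punctuation)."""
    return unicodedata.category(ch) == "Pc"


def singleton_status(var: str) -> str:
    """Classify a variable name under the revised underscore rule.

    Assumes var is already a valid variable name; no lexical check is done.
    """
    if not var:
        raise ValueError("empty variable name")
    if not is_pc(var[0]):
        # Ordinary variable: a single occurrence deserves a warning.
        return "should not be a singleton"
    if len(var) == 1:
        # A lone Pc character, such as _ or the undertie.
        return "wild card"
    if ord(var[1]) <= 0xFF:
        # Pc followed by a Latin-1 character (assumed range U+0000..U+00FF).
        return "may be a singleton"
    # Pc followed by a character outside Latin-1.
    return "should not be a singleton"
```

With the examples from the text, singleton_status("‿") is a wild card, singleton_status("_隠者") should not be a singleton, and singleton_status("__隠者") may be a singleton, because its second character is the Latin-1 underscore.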
> 
> Unquoted atoms
> 
>     unquoted_atom ::= atom_start atom_continue*
> 
>     atom_start ::= XID_Start \ (Lu ∪ Lt ∪ Lo ∪ Pc)
>                 |  "." (Ll ∪ Lo)
> 
>     atom_continue ::= XID_Continue ∪ "@"
>                    |  "." (Ll ∪ Lo)
> 
>     Again the choice of XID follows Python, and ensures that the
>     normalisation of an unquoted atom is still an unquoted atom.
>     Unquoted atoms should be normalised.
> 
>     The details of Erlang unquoted atoms are somewhat subtle; I have
>     checked my understanding experimentally.
> 
> Keywords
> 
>     Keywords have the form of unquoted atoms.  No new keywords are
>     introduced.
> 
> Specifics
> 
> -  Any Python identifier or keyword is
>    an Erlang variable or unquoted atom or keyword.
> 
> -  @ signs may occur freely in variables and unquoted atoms except as the
>    first character, as now.
> 
> -  dots may not be followed by capital letters, digits, or underscores,
>    as now.
> 
> -  I am not sure whether modifier letters should be allowed after a dot.
> 
> -  I am not sure what to do with the Other_ID_Start characters.
>    Script capital p _looks_ like a capital p and even has "capital" in
>    its name.  All other "* SCRIPT CAPITAL *" characters are upper case
>    letters.  Surely it should be allowed to start a variable.
>    The estimated symbol looks like an enlarged lower case e; other symbols
>    that look like letters are classified as letters.  You'd expect this
>    to begin an atom.  As for the Katakana-Hiragana voicing marks, I have
>    no intuition whatever.  Assigning the whole group to atoms seems
>    safest.
> 
> -  All existing variable names and unquoted atoms remain legal, and no
>    new variable or atom forms using only Latin-1 characters have been
>    introduced.
> 
> Trouble spot

-- 

/ Raimo Niskanen, Erlang/OTP, Ericsson AB