[erlang-questions] A proposal for Unicode variable and atom names in Erlang.

Fri Oct 19 17:10:10 CEST 2012

Hi Richard,

I don't concur with the motion, but here, with my compliments, is my attempt at converting your proposal to Markdown.

I don't second it because, whenever I dared to use umlauts where they were allowed, it always ended in tears somewhere else in the compile chain.

I suspect that you have not been burned by such earlier attempts to shake off ASCII, which in those cases were made for the valid reason of leaving crippled names (umlauts replaced with ASCII characters) behind.

I did not do this: "Send your EEP submission to the EEP editors <eeps@REDACTED>"

I left a note on Markdown and the additional EEP requirements in the text below, for your convenience.

See the attachment for an HTML version that I generated using markdown.lua ( http://www.frykholm.se/files/markdown.lua ). As you pointed out before, Markdown is a trial-and-error thing, and your results, and those of the OTP team, may vary. But it's a start, I guess.

Best,
Henning

Begin EEP (this line is not part of it):
    Author: Richard O'Keefe <ok(at)cs(dot)otago(dot)ac(dot)nz>
    Status: Draft
    Type: Standards Track
    Created: 19-Oct-2012
    Erlang-Version: R16B
    Post-History: 19-Oct-2012
    Replaces:
****
EEP XXX: Unicode Variable And Atom Names
----

Abstract
========

Variable and atom names should be allowed to use any Unicode characters
instead of only Latin-1 characters.

Forces
------

1. Support for Unicode continues to increase, with
    minimal source code support about to arrive.  

2. Unicode variable names and unquoted atoms are not
    here yet, so now is the time to settle on a design.  

3. They will need to come.  There may be legal or
    institutional reasons why unicode-capable languages
    are required.  Some people just want to use their
    own language and script.  Erlang's strength in
    network applications means that being able to
    represent Internationalized Domain Names as unquoted
    atoms would be just as much of a convenience as
    being able to represent ASCII domain names like
    www.example.com (which needs no quotes in Erlang) is.  

4. There is a framework for Unicode identifiers in
    Unicode Standard Annex 31 (UAX#31), and several
    programming languages already allow Unicode
    identifiers, including Ada, Java, C++, C, C#,
    JavaScript, and Python (see section 2.3 of
    [Python's Lexical Analysis][PyLex] and also
    [PEP 3131][]).  

5. Existing Erlang identifiers should remain valid,
    including ones containing "@" and ".".  

6. Existing Erlang support features, such as ignoring
    names of the form \[\_][a-zA-Z0-9\_]* when reporting
    singleton variables, should not be broken.  

7. We should not "steal" any characters to use as "magic
    markers" for variables because they might be needed for
    other purposes.  A good (bad) example of this is "?", which
    could be used for several things if it were not used for macros.     

Rationale
=========

Names of sets of characters, XID\_Start, XID\_Continue, Lu, Ll, Lt,
Lo, Pc, Other\_ID\_Start, are drawn from Unicode and UAX#31.

    Lu = upper case letters  
    Ll = lower case letters  
    Lt = title case letters  
    Lo = other letters, i.e. letters without case that are not
            modifier letters  
    Pc = connector punctuators, including the low line (_) and
            a number of other characters like undertie (‿).  
    Other_ID_Start = script capital p, estimated symbol,
            katakana-hiragana voiced sound mark, and
            katakana-hiragana semi-voiced sound mark.  

Variables
---------

    variable ::= var_start var_continue*  

    var_start ::= XID_Start ∩ (Lu ∪ Lt ∪ Pc ∪ Other_ID_Start)  

    var_continue ::= XID_Continue ∪ "@"  

The choice of XID here follows Python.  It ensures that the normalisation
of a variable is still a variable.  In fact Unicode variables should be
normalised.  Unicode has enough look-alike characters that we cannot hope
for "look the same <=> are the same" to be true, but we should go _some_
way in that direction.  
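
As a rough illustration only, the variable grammar can be sketched in
Python.  Everything named here is an assumption made for the sketch:
the function names are hypothetical, `var_start` is approximated by
general category alone because the `unicodedata` module does not
expose XID\_Start, XID\_Continue is borrowed from Python's own
identifier rule, and NFC is picked as the normalisation form purely
for illustration, since no form is fixed above.

```python
import unicodedata

# Other_ID_Start as listed above: script capital p, estimated symbol,
# and the two katakana-hiragana sound marks.
OTHER_ID_START = {"\u2118", "\u212e", "\u309b", "\u309c"}


def is_var_start(ch):
    """Approximate var_start by general category (an assumption)."""
    return (unicodedata.category(ch) in ("Lu", "Lt", "Pc")
            or ch in OTHER_ID_START)


def is_var_continue(ch):
    """XID_Continue or "@"."""
    # "a" + ch is a Python identifier exactly when ch is XID_Continue.
    return ch == "@" or ("a" + ch).isidentifier()


def is_variable(name, form="NFC"):
    """Normalise the name, then match var_start var_continue*."""
    name = unicodedata.normalize(form, name)
    return (name != ""
            and is_var_start(name[0])
            and all(is_var_continue(ch) for ch in name[1:]))
```

Under these assumptions `is_variable("Node@Host")` and
`is_variable("_隠者")` hold, while `is_variable("隠者")` does not,
because its first character is neither cased nor a connector
punctuator.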

Variables in scripts that do not distinguish letter case have to
begin with _some_ special character to ensure that they are not
mistaken for unquoted atoms.  There are 10 Pc characters in the Basic
Multilingual Plane.  The Erlang parser treats a variable beginning
with an underscore specially: there will be no complaint if it is a
singleton.  There are 9 other Pc characters for which this special
treatment is not applied.  Of course, someone might be using fonts
that include, say, Arabic letters but not the undertie.  We can
deal with that by revising the underscore rule.

    Variable does not begin with a Pc character =>
        should not be a singleton.  

    Variable is just a Pc character and nothing else =>
        is a wild card.  

    Variable begins with a Pc character followed by a
    Latin-1 character =>
        may be a singleton.  

    Variable begins with a Pc character followed by
    a character outside the Latin-1 range =>
        should not be a singleton.  

Thus ‿ is a wild-card, 隠者 is an atom, \_隠者 should not be
a singleton, but \_\_隠者 _may_ be a singleton.  This rule is a
consistent generalisation of the existing rule.  
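
The four rules above, together with the count of Pc characters, can
be sketched in Python.  This is a minimal sketch, not a description
of the Erlang compiler: `singleton_rule` and `bmp_pc_characters` are
hypothetical names, the input is assumed to be a non-empty, already
valid variable name, and "Latin-1 character" is taken to mean any
code point up to U+00FF.

```python
import unicodedata


def is_pc(ch):
    """True for connector punctuators such as "_" and the undertie."""
    return unicodedata.category(ch) == "Pc"


def bmp_pc_characters():
    """All connector punctuators in the Basic Multilingual Plane."""
    return [chr(cp) for cp in range(0x10000) if is_pc(chr(cp))]


def singleton_rule(var):
    """Classify a non-empty variable name under the four rules."""
    if not is_pc(var[0]):
        return "should not be a singleton"
    if len(var) == 1:
        return "wild card"
    if ord(var[1]) <= 0xFF:  # second character is in Latin-1
        return "may be a singleton"
    return "should not be a singleton"
```

With these definitions `singleton_rule("__隠者")` gives
`"may be a singleton"` and `singleton_rule("_隠者")` gives
`"should not be a singleton"`, matching the examples above.  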

Unquoted Atoms
--------------

    unquoted_atom ::= atom_start atom_continue*  

    atom_start ::= XID_Start \ (Lu ∪ Lt ∪ Lo ∪ Pc)
               |  "." (Ll ∪ Lo)  

    atom_continue ::= XID_Continue ∪ "@"
                  |  "." (Ll ∪ Lo)  

Again the choice of XID follows Python, and ensures that the
normalisation of an unquoted atom is still an unquoted atom.
Unquoted atoms should be normalised.  

The details of Erlang unquoted atoms are somewhat subtle; I have
checked my understanding experimentally.  
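
For readers who want to experiment, here is a hedged Python sketch of
the unquoted atom grammar.  It assumes an atom is an `atom_start`
followed by any number of `atom_continue` items, it transcribes the
rules literally (including the exclusion of Lo from `atom_start`), and
it borrows XID\_Start and XID\_Continue from Python's own identifier
rule.  The function names are hypothetical.

```python
import unicodedata


def _category(ch):
    return unicodedata.category(ch)


def _is_xid_start(ch):
    # A one-character Python identifier is XID_Start or "_".
    return ch != "_" and ch.isidentifier()


def _is_xid_continue(ch):
    # "a" + ch is a Python identifier exactly when ch is XID_Continue.
    return ("a" + ch).isidentifier()


def is_unquoted_atom(name):
    """Match atom_start atom_continue*, transcribed literally."""
    pos, first = 0, True
    while pos < len(name):
        ch = name[pos]
        if ch == ".":
            # "." must be followed by an Ll or Lo character.
            nxt = name[pos + 1:pos + 2]
            if nxt == "" or _category(nxt) not in ("Ll", "Lo"):
                return False
            pos += 2
        elif first:
            excluded = ("Lu", "Lt", "Lo", "Pc")
            if not _is_xid_start(ch) or _category(ch) in excluded:
                return False
            pos += 1
        else:
            if not (ch == "@" or _is_xid_continue(ch)):
                return False
            pos += 1
        first = False
    return name != ""
```

Under this reading `is_unquoted_atom("www.example.com")` and
`is_unquoted_atom("node@host")` hold, while `"a.B"`, `"a.1"` and
`"a._b"` are rejected because of what follows the dot.  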

Keywords
--------

Keywords have the form of unquoted atoms.  No new keywords are
introduced.

Specifics
---------

-  Any Python identifier or keyword is
  an Erlang variable or unquoted atom or keyword.  

-   @ signs may occur freely in variables and unquoted atoms except as the
  first character, as now.  

-   Dots may not be followed by capital letters, digits, or underscores,
  as now.  

-   I am not sure whether modifier letters should be allowed after a dot.  

-   I am not sure what to do with the Other\_ID\_Start characters.
  Script capital p _looks_ like a capital p and even has "capital" in
  its name.  All other "* SCRIPT CAPITAL *" characters are upper case
  letters.  Surely it should be allowed to start a variable.
  The estimated symbol looks like an enlarged lower case e; other symbols
  that look like letters are classified as letters.  You'd expect this
  to begin an atom.  As for the Katakana-Hiragana voicing marks, I have
  no intuition whatever.  Assigning the whole group to atoms seems
  safest.  

-   All existing variable names and unquoted atoms remain legal, and no
  new variable or atom forms using only Latin-1 characters have been
  introduced.  

Edit Notes
==========

For convenience, quoting [EEP 33], the EEP markdown template:  

See the [Markdown][] Syntax for general formatting syntax.  *On top* of
this Markdown EEPs has these requirements:  

You must adhere to the Emacs convention of adding two spaces at the
end of every sentence.  You should fill your paragraphs to column 70,
but under no circumstances should your lines extend past column 79.
If your code samples spill over column 79, you should rewrite them.  

Tab characters must never appear in the document at all.  
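
The two mechanical requirements above (no tabs, nothing past column
79) are easy to check automatically.  The following is a small sketch
of such a check, not an official EEP tool; `check_eep_text` is a
hypothetical name, and it counts code points rather than display
columns, which is only an approximation for wide characters.

```python
def check_eep_text(text, limit=79):
    """Return (line_number, problem) pairs for tabs and long lines."""
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        if "\t" in line:
            problems.append((number, "tab character"))
        # len() counts code points, not display columns.
        if len(line) > limit:
            problems.append((number, "line too long"))
    return problems
```

For example, `check_eep_text("a\tb\n")` reports a tab on line 1, and
a line of exactly 79 characters is accepted.  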

When referencing an external web page in the body of an EEP, you
should include the title of the page in the text, with a footnote
reference to the URL.  Do not include the URL in the body text of the
EEP.  E.g:

    Refer to the [Erlang Language web site][1] for more details.

    :

    [1]: http://www.erlang.org
        "Erlang Programming Language"

Footnote references ... are invisible in the [Markdown][]
generated HTML.

[PyLex]: http://docs.python.org/release/3.1.5/reference/lexical_analysis.html
    "2. Lexical Analysis - Python 3.1.5 Documentation"

[PEP 3131]: http://www.python.org/dev/peps/pep-3131/
    "PEP 3131 -- Supporting Non-ASCII Identifiers"

[EEP 33]: eep-0033.md
    "Sample Markdown EEP Template"

[Markdown]: http://daringfireball.net/projects/markdown/
   "Markdown Home Page"

Copyright
=========

*Pending the author's acknowledgement:*

This document has been placed in the public domain.

[EmacsVar]: <> "Local Variables:"
[EmacsVar]: <> "mode: indented-text"
[EmacsVar]: <> "indent-tabs-mode: nil"
[EmacsVar]: <> "sentence-end-double-space: t"
[EmacsVar]: <> "fill-column: 70"
[EmacsVar]: <> "coding: utf-8"
[EmacsVar]: <> "End:"

:End of EEP (this line is not part of it)

On Oct 19, 2012, at 8:06 AM, "Richard O'Keefe" <ok@REDACTED> wrote:

> If it were still possible to submit EEPs in plain text,
> this would be an EEP.  If someone else would like to
> package this up as an EEP and submit it (under their
> name, mine, or both), feel free.
> 
> Forces:
> (1) Support for Unicode continues to increase, with
>     minimal source code support about to arrive.
> (2) Unicode variable names and unquoted atoms are not
>     here yet, so now is the time to settle on a design.
> (3) They will need to come.  There may be legal or
>     institutional reasons why unicode-capable languages
>     are required.  Some people just want to use their
>     own language and script.  Erlang's strength in
>     network applications means that being able to
>     represent Internationalized Domain Names as unquoted
>     atoms would be just as much of a convenience as
>     being able to represent ASCII domain names like
>     www.example.com (which needs no quotes in Erlang) is.
> (4) There is a framework for Unicode identifiers in
>     Unicode standard annex 31 (UAX#31), and several
>     programming languages, including Ada, Java,
>     C++, C, C#, Javascript, and Python (section 2.3 of
>     http://docs.python.org/release/3.1.5/reference/lexical_analysis.html
>     and see also http://www.python.org/dev/peps/pep-3131/
> (5) Existing Erlang identifiers should remain valid,
>     including ones containing "@" and ".".
> (6) Existing Erlang support features, such as ignoring
>     names of the form [_][a-zA-Z0-9_]* when reporting
>     singleton variables, should not be broken.
> (7) We should not "steal" any characters to use as "magic
>     markers" for variables because they might be needed for
>     other purposes.  A good (bad) example of this is "?", which
>     could be used for several things if it were not used for macros.     
> 
> Reference
> 
>    Names of sets of characters, XID_Start, XID_Continue, Lu, Lt, Lo, Pc,
>    Other_Id_Start, are drawn from Unicode and UAX#31.
> 
> 	Lu = upper case letters
> 	Lt = title case letters
>        Pc = connector punctuators, including the low line (_) and
>             a number of other characters like undertie (‿).
> 	Other_Id_Start = script capital p, estimated symbol,
>             katakana-hiragana voiced sound mark, and
>             katakana-hiragana semi-voiced sound mark.
> 
> Variables
> 
>    variable ::= var_start var_continue*
> 
>    var_start ::= XID_Start ∩ (Lu ∪ Lt ∪ Pc ∪ Other_Id_Start)
> 
>    var_continue ::= XID_Continue U "@"
> 
>    The choice of XID here follows Python.  It ensures that the normalisation
>    of a variable is still a variable.  In fact Unicode variables should be
>    normalised.  Unicode has enough look-alike characters that we cannot hope
>    for "look the same <=> are the same" to be true, but we should go _some_
>    way in that direction.
> 
>    Variables in scripts that do not distinguish letter case have to
>    begin with _some_ special character to ensure that they are not
>    mistaken for unquoted atoms.  There are 10 Pc characters in the Basic
>    Multilingual Plane.  The Erlang parser treats a variable beginning
>    with an underscore specially: there will be no complaint if it is a
>    singleton.  There are 9 other Pc characters for which this special
>    treatment is not applied.  Of course, someone might be using fonts
>    that do include say Arabic letters but not say the undertie.  We can
>    deal with that by revising the underscore rule.
> 
> 	Variable does not begin with a Pc character =>
> 		should not be a singleton.
> 
> 	Variable is just a Pc character and nothing else =>
> 		is a wild card.
> 
> 	Variable begins with a Pc character followed by a
> 	Latin-1 character =>
> 		may be a singleton.
> 
> 	Variable begins with a Pc character following by
> 	a character outside the Latin-1 range =>
> 		should not be a singleton.
> 
>    Thus ‿ is a wild-card, 隠者 is an atom, _隠者 should not be
>    a singleton, but __隠者 _may_ be a singleton.  This rule is a
>    consistent generalisation of the existing rule.
> 
> Unquoted atoms
> 
>    unquoted_atom ::= atom_start atom_continue
> 
>    atom_start ::= XID_Start \ (Lu ∪ Lt ∪ Lo ∪ Pc)
>                |  "." (Ll ∪ Lo)
> 
>    atom_continue ::= XID_Continue U "@"
>                   |  "." (Ll ∪ Lo)
> 
>    Again the choice of XID follows Python, and ensures that the
>    normalisation of an unquoted atom is still an unquoted atom.
>    Unquoted atoms should be normalised.
> 
>    The details of Erlang unquoted atoms are somewhat subtle; I have
>    checked my understanding experimentally.
> 
> Keywords
> 
>    Keywords have the form of unquoted atoms.  No new keywords are
>    introduced.
> 
> Specifics
> 
> -  Any Python identifier or keyword is
>   an Erlang variable or unquoted atom or keyword.
> 
> -  @ signs may occur freely in variables and unquoted atoms except as the
>   first character, as now.
> 
> -  dots may not be followed by capital letters, digits, or underscores,
>   as now.
> 
> -  I am not sure whether modifier letters should be allowed after a dot.
> 
> -  I am not sure what to do with the Other_ID_Start characters.
>   Script capital p _looks_ like a capital p and even has "capital" in
>   its name.  All other "* SCRIPT CAPITAL *" characters are upper case
>   letters.  Surely it should be allowed to start a variable.
>   The estimated sign looks like an enlarged lower case e; other symbols
>   that look like letters are classified as letters.  You'd expect this
>   to begin an atom.  As for the Katakana-Hiragana voicing marks, I have
>   no intuition whatever.  Assigning the whole group to atoms seems
>   safest.
> 
> -  All existing variable names and unquoted atoms remain legal, and no
>   new variable or atom forms using only Latin-1 characters have been
>   introduced.
> 
> Trouble spot

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://erlang.org/pipermail/erlang-questions/attachments/20121019/63ddb06c/attachment.html>
-------------- next part --------------