<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<html>

<head>

        <meta http-equiv="content-type" content="text/html; charset=utf-8" />

        <title>Abstract</title>

        <link rel="stylesheet" type="text/css" href="default.css" />

</head>

<body>

<p>I don't concur with the motion but here you are, with compliments:</p>

<p>("Send your EEP submission to the EEP editors <a href="&#x6d;a&#x69;l&#x74;o&#x3a;ee&#x70;s&#x40;e&#x72;l&#x61;ng&#x2e;o&#x72;g">&#x65;e&#x70;s&#x40;e&#x72;la&#x6e;g&#x2e;o&#x72;g</a>").</p>

<pre><code>Author: Richard O'Keefe &lt;ok(at)cs(dot)otago(dot)ac(dot)nz&gt;

Status: Draft

Type: Standards Track

Created: 19-Oct-2012

Erlang-Version: R16B

Post-History: 19-Oct-2012

Replaces:

</code></pre>

<hr/>

<h2>EEP XXX: Unicode Variable And Atom Names</h2>

<h1>Abstract</h1>

<p>Variable and atom names should be allowed to use any Unicode characters

instead of only Latin-1 characters.</p>

<h2>Forces</h2>

<ol>

    <li><p>Support for Unicode continues to increase, with

    minimal source code support about to arrive.  </p></li>

    <li><p>Unicode variable names and unquoted atoms are not

    here yet, so now is the time to settle on a design.  </p></li>

    <li><p>They will need to come.  There may be legal or

    institutional reasons why Unicode-capable languages

    are required.  Some people just want to use their

    own language and script.  Erlang's strength in

    network applications means that representing

    Internationalized Domain Names as unquoted atoms

    would be as convenient as representing ASCII domain

    names such as www.example.com, which needs no quotes

    in Erlang.  </p></li>

    <li><p>There is a framework for Unicode identifiers in

    Unicode Standard Annex 31 (UAX#31), and several

    programming languages draw on it, including Ada, Java,

    C++, C, C#, Javascript, and Python (section 2.3 of

    <a href="http://docs.python.org/release/3.1.5/reference/lexical_analysis.html" title="2. Lexical Analysis - Python 3.1.5 Documentation">Python's Lexical Analysis</a>; see also

    <a href="http://www.python.org/dev/peps/pep-3131/" title="PEP 3131 -- Supporting Non-ASCII Identifiers">PEP 3131</a>).    </p></li>

    <li><p>Existing Erlang identifiers should remain valid,

    including ones containing "@" and ".".  </p></li>

    <li><p>Existing Erlang support features, such as ignoring

    names of the form [_][a-zA-Z0-9_]* when reporting

    singleton variables, should not be broken.  </p></li>

    <li><p>We should not "steal" any characters to use as "magic

    markers" for variables because they might be needed for

    other purposes.  A good (bad) example of this is "?", which

    could be used for several things if it were not used for macros.     </p></li>

</ol>

<h1>Rationale</h1>

<p>Names of sets of characters, XID_Start, XID_Continue, Lu, Lt, Lo, Pc,

Other_Id_Start, are drawn from Unicode and UAX#31.</p>

<pre><code>Lu = upper case letters  

Lt = title case letters  

Ll = lower case letters  

Lo = other letters (letters without case)  

Pc = connector punctuators, including the low line (_) and

        a number of other characters like undertie (‿).  

Other_Id_Start = script capital p, estimated symbol,

        katakana-hiragana voiced sound mark, and

        katakana-hiragana semi-voiced sound mark.  

</code></pre>

<h2>Variables</h2>

<pre><code>variable ::= var_start var_continue*  

var_start ::= XID_Start ∩ (Lu ∪ Lt ∪ Pc ∪ Other_Id_Start)  

var_continue ::= XID_Continue ∪ "@"   

</code></pre>

<p>The choice of XID here follows Python.  It ensures that the normalisation

of a variable is still a variable.  In fact Unicode variables should be

normalised.  Unicode has enough look-alike characters that we cannot hope

for "look the same &lt;=&gt; are the same" to be true, but we should go <em>some</em>

way in that direction.  </p>
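<p>The variable grammar above can be sketched in Python, whose
<code>str.isidentifier()</code> is itself defined in terms of XID_Start and
XID_Continue.  This is only an illustration under stated assumptions, not
a proposed implementation: the function names are hypothetical, NFC is
assumed as the normalisation form (the text does not name one), and Pc
characters are accepted as starters because the prose requires it, even
though Unicode does not list them in XID_Start.  </p>

```python
import unicodedata

# The four Other_Id_Start characters listed in this document.
OTHER_ID_START = {"\u2118", "\u212e", "\u309b", "\u309c"}


def is_xid_start(ch: str) -> bool:
    # str.isidentifier() also accepts "_", which is not XID_Start.
    return ch != "_" and ch.isidentifier()


def is_xid_continue(ch: str) -> bool:
    # "a" is a valid start, so the result depends only on ch.
    return ("a" + ch).isidentifier()


def is_var_start(ch: str) -> bool:
    cat = unicodedata.category(ch)
    if cat == "Pc":
        # Assumption: Pc characters may start a variable, as the prose
        # requires, although Unicode does not put them in XID_Start.
        return True
    return is_xid_start(ch) and (cat in ("Lu", "Lt") or ch in OTHER_ID_START)


def is_variable(name: str) -> bool:
    # Assumption: names are normalised to NFC before being checked.
    name = unicodedata.normalize("NFC", name)
    if not name or not is_var_start(name[0]):
        return False
    return all(c == "@" or is_xid_continue(c) for c in name[1:])
```

<p>Under these assumptions <code>is_variable("Name@node")</code> and
<code>is_variable("_隠者")</code> hold, while <code>is_variable("隠者")</code>
does not, since 隠 is an uncased letter and not a Pc character.  </p>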

<p>Variables in scripts that do not distinguish letter case have to

begin with <em>some</em> special character to ensure that they are not

mistaken for unquoted atoms.  There are 10 Pc characters in the Basic

Multilingual Plane.  The Erlang parser treats a variable beginning

with an underscore specially: there will be no complaint if it is a

singleton.  There are 9 other Pc characters for which this special

treatment is not applied.  Of course, someone might be using fonts

that include, say, Arabic letters but not the undertie.  We can

deal with that by revising the underscore rule.</p>

<pre><code>Variable does not begin with a Pc character =>

    should not be a singleton.  

Variable is just a Pc character and nothing else =>

    is a wild card.  

Variable begins with a Pc character followed by a

Latin-1 character =>

    may be a singleton.  

Variable begins with a Pc character followed by

a character outside the Latin-1 range =>

    should not be a singleton.  

</code></pre>

<p>Thus ‿ is a wild-card, 隠者 is an atom, _隠者 should not be

a singleton, but __隠者 <em>may</em> be a singleton.  This rule is a

consistent generalisation of the existing rule.  </p>
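<p>The revised underscore rule and the examples above can be checked with
a short Python sketch.  The names <code>bmp_pc_characters</code> and
<code>singleton_status</code> are hypothetical, the input is assumed to be
a non-empty, already valid variable name, and the list of Pc characters
depends on the Unicode database bundled with the Python interpreter.  </p>

```python
import unicodedata


def bmp_pc_characters() -> list:
    """Return the Pc characters of the Basic Multilingual Plane."""
    return [chr(cp) for cp in range(0x10000)
            if unicodedata.category(chr(cp)) == "Pc"]


def singleton_status(var: str) -> str:
    """Classify a variable name under the revised underscore rule.

    Assumes var is a non-empty, syntactically valid variable name.
    """
    if unicodedata.category(var[0]) != "Pc":
        return "should not be a singleton"
    if len(var) == 1:
        return "wild card"
    # A second character in the Latin-1 range (U+0000 to U+00FF)
    # keeps the existing leniency for names such as _Acc.
    if ord(var[1]) in range(0x100):
        return "may be a singleton"
    return "should not be a singleton"
```

<p>With these definitions <code>singleton_status("‿")</code> gives
<code>"wild card"</code>, <code>singleton_status("_隠者")</code> gives
<code>"should not be a singleton"</code>, and
<code>singleton_status("__隠者")</code> gives
<code>"may be a singleton"</code>, matching the examples in the text.  </p>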

<h2>Unquoted Atoms</h2>

<pre><code>unquoted_atom ::= atom_start atom_continue*  

atom_start ::= XID_Start \ (Lu ∪ Lt ∪ Lo ∪ Pc)

           |  "." (Ll ∪ Lo)  

atom_continue ::= XID_Continue ∪ "@"

              |  "." (Ll ∪ Lo)  

</code></pre>

<p>Again the choice of XID follows Python, and ensures that the

normalisation of an unquoted atom is still an unquoted atom.

Unquoted atoms should be normalised.  </p>

<p>The details of Erlang unquoted atoms are somewhat subtle; I have

checked my understanding experimentally.  </p>

<h2>Keywords</h2>

<p>Keywords have the form of unquoted atoms.  No new keywords are

introduced.</p>

<h2>Specifics</h2>

<ul>

    <li><p>Any Python identifier or keyword is

    an Erlang variable or unquoted atom or keyword.  </p></li>

    <li><p>@ signs may occur freely in variables and unquoted atoms except as the

    first character, as now.  </p></li>

    <li><p>Dots may not be followed by capital letters, digits, or underscores,

    as now.  </p></li>

    <li><p>I am not sure whether modifier letters should be allowed after a dot.  </p></li>

    <li><p>I am not sure what to do with the Other_Id_Start characters.

    Script capital p <em>looks</em> like a capital p and even has "capital" in

    its name.  All other "* SCRIPT CAPITAL *" characters are upper case

    letters.  Surely it should be allowed to start a variable.

    The estimated symbol looks like an enlarged lower case e; other symbols

    that look like letters are classified as letters.  You'd expect this

    to begin an atom.  As for the Katakana-Hiragana voicing marks, I have

    no intuition whatever.  Assigning the whole group to atoms seems

    safest.  </p></li>

    <li><p>All existing variable names and unquoted atoms remain legal, and no

    new variable or atom forms using only Latin-1 characters have been

    introduced.  </p></li>

</ul>

<h1>Edit Notes</h1>

<p>For convenience, quoting [EEP 33], the EEP markdown template:  </p>

<p>See the <a href="http://daringfireball.net/projects/markdown/" title="Markdown Home Page">Markdown</a> Syntax for general formatting syntax.  <em>On top</em> of

this Markdown EEPs has these requirements:  </p>

<p>You must adhere to the Emacs convention of adding two spaces at the

end of every sentence.  You should fill your paragraphs to column 70,

but under no circumstances should your lines extend past column 79.

If your code samples spill over column 79, you should rewrite them.  </p>

<p>Tab characters must never appear in the document at all.  </p>

<p>When referencing an external web page in the body of an EEP, you

should include the title of the page in the text, with a footnote

reference to the URL.  Do not include the URL in the body text of the

EEP.  E.g:</p>

<pre><code>Refer to the [Erlang Language web site][1] for more details.

:

[1]: http://www.erlang.org

    "Erlang Programming Language"

</code></pre>

<p>Footnote references ... are invisible in the <a href="http://daringfireball.net/projects/markdown/" title="Markdown Home Page">Markdown</a>

generated HTML.</p>

<h1>Copyright</h1>

<p><em>Pending the author's acknowledgement:</em></p>

<p>This document has been placed in the public domain.</p>

</body></html>