Author: Richard A. O'Keefe <ok(at)cs(dot)otago(dot)ac(dot)nz>
    Status: Draft
    Type: Standards Track
    Erlang-version: R15B02
    Created: 19-Oct-2012
    Post-History: 19-Oct-2012
****
EEP 40: A proposal for Unicode variable and atom names in Erlang
----


Abstract
========

This EEP proposes how to extend variable and atom names in Erlang
to contain Unicode characters in a backwards compatible way.


[Note]: <> (Underscores in regular text below are backslash escaped)
[Note]: <> (due to a weird Markdown rule for emphasis within words.)
[Note]: <> (So e.g. where you find XID\_Start it means XID_Start.)


Forces
======

1.  Support for Unicode continues to increase, with
    minimal source code support about to arrive.
2.  Unicode variable names and unquoted atoms are not
    here yet, so now is the time to settle on a design.
3.  They will need to come.  There may be legal or
    institutional reasons why unicode-capable languages
    are required.  Some people just want to use their
    own language and script.  Erlang's strength in
    network applications means that being able to
    represent Internationalized Domain Names as unquoted
    atoms would be just as much of a convenience as
    being able to represent ASCII domain names like
    www.example.com (which needs no quotes in Erlang) is.
4.  There is a framework for Unicode identifiers in [UAX#31][],
    used by several programming languages, including Ada, Java,
    C++, C, C#, Javascript, and Python (section 2.3 of [Python Lexical][],
    and see also [PEP 3131][]).
5.  Existing Erlang identifiers should remain valid,
    including ones containing "@" and ".".
6.  Existing Erlang support features, such as ignoring
    variables that start with underscore when reporting
    singleton variables, should not be broken.
7.  We should not "steal" any characters to use as "magic
    markers" for variables because they might be needed for
    other purposes.  A good (bad) example of this is "?", which
    could be used for several things if it were not used for macros.
8.  Character sequences in Latin-1 that are not legal variable or
    atom names now should not be made into such by this specification.    


Reference
=========

Names of sets of characters, XID\_Start, XID\_Continue, Lu, Lt, Lo, Pc,
Other\_Id\_Start, are drawn from [Unicode][] and [UAX#31][].

	Lu = upper case letters
	Lt = title case letters
        Pc = connector punctuators, including the low line (_) and
             a number of other characters like undertie (‿).
	Other_Id_Start = script capital p, estimated symbol,
             katakana-hiragana voiced sound mark, and
             katakana-hiragana semi-voiced sound mark.


Specification
=============

Variables
---------

    variable ::= var_start var_continue*

    var_start ::= (XID_Start ∩ (Lu ∪ Lt ∪ Other_Id_Start)) ∪ Pc

    var_continue ::= XID_Continue ∪ "@"

The choice of XID here follows Python.  It ensures that the normalisation
of a variable is still a variable.  In fact Unicode variables should be
normalised.  Unicode has enough look-alike characters that we cannot hope
for "look the same <=> are the same" to be true, but we should go _some_
way in that direction.

Variables in scripts that do not distinguish letter case have to
begin with _some_ special character to ensure that they are not
mistaken for unquoted atoms.  There are 10 Pc characters in the Basic
Multilingual Plane.  The Erlang parser treats a variable beginning
with an underscore specially: there will be no complaint if it is a
singleton.  There are 9 other Pc characters for which this special
treatment is not applied.  Of course, someone might be using fonts
that do include say Arabic letters but not say the undertie.  We can
deal with that by revising the underscore rule.

	Variable does not begin with a Pc character =>
		should not be a singleton.

	Variable is just a Pc character and nothing else =>
		is a wild card.

	Variable begins with a Pc character followed by a
	Latin-1 character =>
		may be a singleton.

	Variable begins with a Pc character following by
	a character outside the Latin-1 range =>
		should not be a singleton.

Thus ‿ is a wild-card, 隠者 is an atom, \_隠者 should not be
a singleton, but \_\_隠者 _may_ be a singleton.  This rule is a
consistent generalisation of the existing rule.

Unquoted atoms
--------------

	unquoted_atom ::= atom_start atom_continue*

	atom_start ::= XID_Start \ (Lu ∪ Lt ∪ "ªº")
	            |  "." (Ll ∪ Lo)

	atom_continue ::= XID_Continue | "@"
	               |  "." (Ll ∪ Lo)

Again the choice of XID follows Python, and ensures that the
normalisation of an unquoted atom is still an unquoted atom.
Unquoted atoms should be normalised.

The details of Erlang unquoted atoms are somewhat subtle; I have
checked my understanding experimentally.

Keywords
--------

Keywords have the form of unquoted atoms.  No new keywords are
introduced.

### Specifics ###

-   Any Python identifier or keyword is
    an Erlang variable or unquoted atom or keyword
    unless it begins with "ª" or "º".

-   @ signs may occur freely in variables and unquoted atoms except as the
    first character, as now.

-   dots may not be followed by capital letters, digits, or underscores,
    as now.

-   I am not sure whether modifier letters should be allowed after a dot.

-   I am not sure what to do with the Other\_ID\_Start characters.
    Script capital p _looks_ like a capital p and even has "capital" in
    its name.  All other "* SCRIPT CAPITAL *" characters are upper case
    letters.  Surely it should be allowed to start a variable.
    The estimated sign looks like an enlarged lower case e; other symbols
    that look like letters are classified as letters.  You'd expect this
    to begin an atom.  As for the Katakana-Hiragana voicing marks, I have
    no intuition whatever.  Assigning the whole group to atoms seems
    safest.

-   All existing variable names and unquoted atoms remain legal, and no
    new variable or atom forms using only Latin-1 characters have been
    introduced.

Rationale
=========

While Erlang files meant to be shared with a wide audience should
still be written in English, if people are working in a group fluent
in some language on requirements also written in that language, it
is desirable that they should be able to stay close to the terminology
of the requirements lest they introduce translation errors.

The whole design flows in the direction "if someone wants to use their
own script in an Erlang file, they should be able to do so comfortably
in a way that is generally consistent with other programming
languages."

This _does_ mean that there will be Erlang source files that a skilled
Erlang programmer is unable to decipher because of the unfamiliarity
of the script.  With over 110,000 characters in Unicode 6, this is
just going to happen no matter what we do.  Once Unicode strings are
available, can quoted Unicode atoms be far behind?  And once they are
possible, refusing unquoted Unicode atoms does not salvage universal
readability.  All it would accomplish is to annoy people by requiring
single quotation marks to be used liberally.  Old Algol programmers
will recall only too clearly how much of an impairment to readability
a hailstorm of single quotation marks was.  And if you can use
γαμμα as an atom, does it make any sense to refuse Γαμμα?

There are three ways we have to customize the UAX 31 definition.

	- We have to continue to support "@" in variables and
          "@" and "." in unquoted atoms for backwards compatibility.

        - We have to continue to forbid unquoted atoms beginning
          with the Latin-1 masculine and feminine ordinal indicators.

	- We have to distinguish between variables and unquoted atoms.

Trouble spot
------------

It is highly desirable that a legal Erlang text should remain legal
even as Unicode is revised.  [UAX#31][] and [Stability][] very nearly
give us what we need.  The one problem that seems to be technically
possible is that an upper or lower case letter without an opposite
case counterpart might change its General Category (while being given
the Other\_ID\_Start property if it ceased to be a letter at all),
so an identifier beginning with such a cased orphan might switch from
variable to unquoted atom or vice versa.  Some cased orphans do exist,
like LATIN LETTER SMALL CAPITAL M, but what would a capital capital M
be?

One possibility is to raise the issue with the Unicode consortium and
leave this unresolved until they reply.

Another possibility would be to say that an Lu character may only
begin a variable if it has a lower-case counterpart, and an Ll
character may only begin an unquoted atom if it has an upper-case
counterpart.  Since "ß" and "ÿ" have upper-case counterparts in
Unicode, Latin-1 unquoted atoms would not be affected by such a rule.
The great mass of Lo characters would also be unaffected.

[References]: <>
"==========="

[Unicode]: http://www.unicode.org/versions/Unicode6.2.0/
    "The Unicode Standard version 6.2.0"

[UAX#31]: http://www.unicode.org/reports/tr31/
    "Unicode Standard Annex 31"

[Python Lexical]: http://docs.python.org/release/3.1.5/reference/lexical_analysis.html
    "Python Lexical Analysis"

[PEP 3131]: http://www.python.org/dev/peps/pep-3131/
    "Python Enhancement Proposal 3131"

[Stability]: http://www.unicode.org/policies/stability_policy.html
    "Unicode Character Encoding Stability Policy"


Copyright
=========
This document has been placed in the public domain.


[EmacsVar]: <> "Local Variables:"
[EmacsVar]: <> "mode: indented-text"
[EmacsVar]: <> "indent-tabs-mode: nil"
[EmacsVar]: <> "sentence-end-double-space: t"
[EmacsVar]: <> "fill-column: 70"
[EmacsVar]: <> "coding: utf-8"
[EmacsVar]: <> "End:"
[VimVar]: <> " vim: set fileencoding=utf-8 "