Language change proposal

Wed Nov 5 02:25:29 CET 2003

I wrote:
> By the way, the Unicode book spells out clear, simple, and usable rules
> for identifier syntax.

Joachim Durchholz <joachim.durchholz@REDACTED> replied:
    Ah, wonderful.
    Do you have a URL, or a set of promising Google keywords?

Well, it doesn't take the brain of a Feynman to figure out that
the Unicode book is the best place to look, or failing that, www.unicode.org.

In fact it's Section 5.15 "Identifiers" in the Unicode 4.0 book,
and a draft replacement for that section can be found at
http://www.unicode.org/reports/tr31/

    "The formal syntax provided here is intended to capture the general
    intent that an identifier consists of a string of characters that
    begins with a letter or an ideograph, and then includes any number
    of letters, ideographs, digits, or underscores.  Each programming
    language standard has its own identifier syntax; different
    programming languages have different conventions for the use of
    certain characters from the ASCII range ($, @, #, _) in identifiers.
    To extend such a syntax to cover the full behavior of a Unicode
    implementation, implementers need only combine these specific rules
    with the sample syntax provided here.

    Syntactic Rule

    <identifier> := <identifier_start>
                   (<identifier_start> | <identifier_extend>)* "

Since Erlang _doesn't_ use anything other than letters, digits, and
underscores, the Unicode rules would apply exactly.
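
To make the quoted rule concrete, here is a minimal sketch in Python
(not Erlang, purely for illustration).  The two category sets are my
assumption about how <identifier_start> and <identifier_extend> map onto
Unicode General Category values; check them against the section itself
before relying on them.  The function names are hypothetical.

```python
import unicodedata

# ASSUMPTION: these General Category sets approximate <identifier_start>
# and <identifier_extend>; verify against the Unicode text before use.
START_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
EXTEND_CATEGORIES = {"Mn", "Mc", "Nd", "Pc", "Cf"}


def is_identifier_start(ch):
    """True if ch may begin an identifier under the assumed sets."""
    return unicodedata.category(ch) in START_CATEGORIES


def is_identifier_extend(ch):
    """True if ch may only continue an identifier under the assumed sets."""
    return unicodedata.category(ch) in EXTEND_CATEGORIES


def is_identifier(s):
    """<identifier> := <identifier_start>
                       (<identifier_start> | <identifier_extend>)*"""
    if not s or not is_identifier_start(s[0]):
        return False
    return all(is_identifier_start(c) or is_identifier_extend(c)
               for c in s[1:])
```

Under these assumptions the underscore (category Pc) may continue an
identifier but not start one, so a language that allows a leading
underscore would add that as one of its own specific rules, as the
quoted text suggests.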

There are some subtleties to all this concerning normalisation
and the non-breaking format characters, but once you've figured out how
to represent a classification scheme for over a million code points
economically (not, actually, all that hard), the rest is easy.
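
One economical representation, sketched in Python under my own
assumptions (it is one possible scheme, not the only one): neighbouring
code points usually share a category, so store only the code point where
each run begins and binary-search that list.  The names build_ranges,
STARTS, CATS and category are hypothetical; the table is derived here
from Python's unicodedata module rather than from the Unicode data files.

```python
import bisect
import sys
import unicodedata


def build_ranges():
    """Collapse per-code-point categories into runs.

    Returns two parallel lists: the first code point of each run, and
    the General Category shared by every code point in that run.
    """
    starts, cats = [], []
    previous = None
    for cp in range(sys.maxunicode + 1):
        cat = unicodedata.category(chr(cp))
        if cat != previous:
            starts.append(cp)
            cats.append(cat)
            previous = cat
    return starts, cats


STARTS, CATS = build_ranges()


def category(cp):
    """Look up the category of code point cp by binary search."""
    # bisect_right finds the first run starting after cp; step back one.
    return CATS[bisect.bisect_right(STARTS, cp) - 1]
```

A lookup such as category(ord("a")) then costs a handful of comparisons,
and the table holds one entry per run instead of one per code point.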