[erlang-questions] A proposal for Unicode variable and atom names in Erlang.
Raimo Niskanen
raimo+erlang-questions@REDACTED
Fri Oct 19 17:10:27 CEST 2012
I have converted the text below into EEP 40 and listed it.
http://www.erlang.org/eeps/eep-0040.html
It would be nice if Richard could check it and see if anything
got lost in translation. Especially the odd line "Trouble Spot"
is not supposed to be there, I guess, but I kept it.
/ Raimo Niskanen, as EEP editor
On Fri, Oct 19, 2012 at 07:06:43PM +1300, Richard O'Keefe wrote:
> If it were still possible to submit EEPs in plain text,
> this would be an EEP. If someone else would like to
> package this up as an EEP and submit it (under their
> name, mine, or both), feel free.
>
> Forces:
> (1) Support for Unicode continues to increase, with
> minimal source code support about to arrive.
> (2) Unicode variable names and unquoted atoms are not
> here yet, so now is the time to settle on a design.
> (3) They will need to come. There may be legal or
> institutional reasons why Unicode-capable languages
> are required. Some people just want to use their
> own language and script. Erlang's strength in
> network applications means that being able to
> represent Internationalized Domain Names as unquoted
> atoms would be just as much of a convenience as
> being able to represent ASCII domain names like
> www.example.com (which needs no quotes in Erlang) is.
> (4) There is a framework for Unicode identifiers in
> Unicode Standard Annex 31 (UAX#31), and several
> programming languages support Unicode identifiers,
> including Ada, Java, C++, C, C#, JavaScript, and Python
> (see section 2.3 of
> http://docs.python.org/release/3.1.5/reference/lexical_analysis.html
> and also http://www.python.org/dev/peps/pep-3131/).
> (5) Existing Erlang identifiers should remain valid,
> including ones containing "@" and ".".
> (6) Existing Erlang support features, such as ignoring
> names of the form [_][a-zA-Z0-9_]* when reporting
> singleton variables, should not be broken.
> (7) We should not "steal" any characters to use as "magic
> markers" for variables because they might be needed for
> other purposes. A good (bad) example of this is "?", which
> could be used for several things if it were not used for macros.
>
> Reference
>
> Names of sets of characters, XID_Start, XID_Continue, Lu, Lt, Lo, Pc,
> Other_Id_Start, are drawn from Unicode and UAX#31.
>
> Lu = upper case letters
> Lt = title case letters
> Pc = connector punctuators, including the low line (_) and
> a number of other characters like undertie (‿).
> Other_Id_Start = script capital p, estimated symbol,
> katakana-hiragana voiced sound mark, and
> katakana-hiragana semi-voiced sound mark.
>
> Variables
>
> variable ::= var_start var_continue*
>
> var_start ::= XID_Start ∩ (Lu ∪ Lt ∪ Pc ∪ Other_Id_Start)
>
> var_continue ::= XID_Continue ∪ "@"
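A rough sketch of the variable grammar above, written in Python because the proposal follows Python's identifier rules and Python's standard library exposes the Unicode data needed. The names OTHER_ID_START, is_xid_start, is_xid_continue, starts_variable and is_variable are illustrative only and belong to no Erlang tool. One assumption is labelled in the code: Unicode does not place Pc characters (including "_") in XID_Start, so the sketch admits a leading Pc character separately, which is what the prose about underscore-initial variables calls for.

```python
import unicodedata

# The four Other_Id_Start characters named in the Reference section.
OTHER_ID_START = {"\u2118", "\u212E", "\u309B", "\u309C"}

def is_xid_start(ch):
    # For a single character, str.isidentifier() accepts XID_Start plus "_",
    # so "_" is excluded here to test XID_Start alone.
    return ch != "_" and ch.isidentifier()

def is_xid_continue(ch):
    # Prefixing a known start character tests ch as a continuation character.
    return ("a" + ch).isidentifier()

def starts_variable(ch):
    cat = unicodedata.category(ch)
    if cat == "Pc":
        # Assumption: a leading Pc character is allowed even though Pc is
        # not part of XID_Start; the proposal's prose relies on this.
        return True
    return is_xid_start(ch) and (cat in ("Lu", "Lt") or ch in OTHER_ID_START)

def is_variable(name):
    if not name or not starts_variable(name[0]):
        return False
    # var_continue is XID_Continue plus "@".
    return all(ch == "@" or is_xid_continue(ch) for ch in name[1:])

for name in ["X", "X@home", "_隠者", "‿", "隠者", "foo", "@X"]:
    print(name, is_variable(name))
```

This checks only the shape of a name; it does not normalise the name or apply the singleton rules discussed later.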
>
> The choice of XID here follows Python. It ensures that the normalisation
> of a variable is still a variable. In fact Unicode variables should be
> normalised. Unicode has enough look-alike characters that we cannot hope
> for "look the same <=> are the same" to be true, but we should go _some_
> way in that direction.
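To illustrate why normalisation matters, here is a small Python sketch. The proposal does not name a normalisation form; NFKC is assumed here only because it is the form Python applies to identifiers. The helper names normalise_name and same_name are made up for the example.

```python
import unicodedata

def normalise_name(name):
    # Assumption: NFKC, the form Python applies to its own identifiers.
    return unicodedata.normalize("NFKC", name)

def same_name(a, b):
    # Two spellings denote the same name if they normalise to the same string.
    return normalise_name(a) == normalise_name(b)

angstrom_sign = "\u212Bngstr\u00F6m"    # starts with U+212B ANGSTROM SIGN
precomposed = "\u00C5ngstr\u00F6m"      # starts with U+00C5
decomposed = "A\u030Angstro\u0308m"     # base letters plus combining marks

# The three spellings look alike but are different code point sequences.
print(angstrom_sign == precomposed, decomposed == precomposed)

# After normalisation they compare equal, and the result is still an identifier.
print(same_name(angstrom_sign, precomposed))
print(same_name(decomposed, precomposed))
print(normalise_name(angstrom_sign).isidentifier())
```

Normalisation removes this kind of duplicate spelling; it does nothing about look-alikes from different scripts, which is the limit the paragraph above concedes.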
>
> Variables in scripts that do not distinguish letter case have to
> begin with _some_ special character to ensure that they are not
> mistaken for unquoted atoms. There are 10 Pc characters in the Basic
> Multilingual Plane. The Erlang parser treats a variable beginning
> with an underscore specially: there will be no complaint if it is a
> singleton. There are 9 other Pc characters for which this special
> treatment is not applied. Of course, someone might be using fonts
> that do include say Arabic letters but not say the undertie. We can
> deal with that by revising the underscore rule.
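The Pc characters in the Basic Multilingual Plane can be listed with a few lines of Python. This is only a sketch: the result depends on the Unicode version bundled with the Python interpreter in use, and pc_chars is a name chosen for the example.

```python
import unicodedata

# Collect every BMP code point whose general category is Pc
# (connector punctuation).
pc_chars = [
    chr(cp) for cp in range(0x10000)
    if unicodedata.category(chr(cp)) == "Pc"
]

for ch in pc_chars:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

print(len(pc_chars), "Pc characters found")
```

The low line and the undertie both appear in the list, so a variable could begin with either.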
>
> Variable does not begin with a Pc character =>
> should not be a singleton.
>
> Variable is just a Pc character and nothing else =>
> is a wild card.
>
> Variable begins with a Pc character followed by a
> Latin-1 character =>
> may be a singleton.
>
> Variable begins with a Pc character followed by
> a character outside the Latin-1 range =>
> should not be a singleton.
>
> Thus ‿ is a wild-card, 隠者 is an atom, _隠者 should not be
> a singleton, but __隠者 _may_ be a singleton. This rule is a
> consistent generalisation of the existing rule.
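The four singleton rules above can be sketched in Python as follows. The function singleton_policy and its return strings are invented for the example, and the input is assumed to be a name already known to be a variable.

```python
import unicodedata

def is_pc(ch):
    # Pc is the Unicode general category for connector punctuation.
    return unicodedata.category(ch) == "Pc"

def singleton_policy(variable):
    """Classify a variable name according to the four rules."""
    if not is_pc(variable[0]):
        return "should not be a singleton"
    if len(variable) == 1:
        return "wild card"
    if ord(variable[1]) <= 0xFF:
        # The second character is in the Latin-1 range.
        return "may be a singleton"
    return "should not be a singleton"

for name in ["‿", "_", "X", "_Ignored", "_隠者", "__隠者"]:
    print(name, "=>", singleton_policy(name))
```

For names made only of Latin-1 characters the function agrees with the existing underscore rule, which is the sense in which the new rule generalises the old one.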
>
> Unquoted atoms
>
> unquoted_atom ::= atom_start atom_continue*
>
> atom_start ::= XID_Start \ (Lu ∪ Lt ∪ Lo ∪ Pc)
> | "." (Ll ∪ Lo)
>
> atom_continue ::= XID_Continue ∪ "@"
> | "." (Ll ∪ Lo)
>
> Again the choice of XID follows Python, and ensures that the
> normalisation of an unquoted atom is still an unquoted atom.
> Unquoted atoms should be normalised.
>
> The details of Erlang unquoted atoms are somewhat subtle; I have
> checked my understanding experimentally.
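The dot alternatives in the grammar above, "." (Ll ∪ Lo), amount to a check on the character that follows each dot. A minimal Python sketch of just that check, with dots_ok as a made-up name; it deliberately ignores the rest of the atom grammar:

```python
import unicodedata

def dots_ok(text):
    """Return True if every "." in text is followed by an Ll or Lo character."""
    for i, ch in enumerate(text):
        if ch == ".":
            following = text[i + 1:i + 2]   # empty string at end of text
            if not following:
                return False
            if unicodedata.category(following) not in ("Ll", "Lo"):
                return False
    return True

for text in ["www.example.com", "a.B", "a.1", "a._b", "a.", "a..b", "例.隠者"]:
    print(text, dots_ok(text))
```

A modifier letter (category Lm) after a dot is rejected by this sketch, which matches the grammar as written.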
>
> Keywords
>
> Keywords have the form of unquoted atoms. No new keywords are
> introduced.
>
> Specifics
>
> - Any Python identifier or keyword is
> an Erlang variable or unquoted atom or keyword.
>
> - @ signs may occur freely in variables and unquoted atoms except as the
> first character, as now.
>
> - dots may not be followed by capital letters, digits, or underscores,
> as now.
>
> - I am not sure whether modifier letters should be allowed after a dot.
>
> - I am not sure what to do with the Other_Id_Start characters.
> Script capital p _looks_ like a capital p and even has "capital" in
> its name. All other "* SCRIPT CAPITAL *" characters are upper case
> letters. Surely it should be allowed to start a variable.
> The estimated symbol looks like an enlarged lower case e; other symbols
> that look like letters are classified as letters. You'd expect this
> to begin an atom. As for the Katakana-Hiragana voicing marks, I have
> no intuition whatever. Assigning the whole group to atoms seems
> safest.
>
> - All existing variable names and unquoted atoms remain legal, and no
> new variable or atom forms using only Latin-1 characters have been
> introduced.
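For the item above about the Other_Id_Start characters, it may help to see how Unicode itself classifies them. The Python sketch below prints the name and general category of each, with SCRIPT CAPITAL B added for comparison as an ordinary upper case letter. OTHER_ID_START and describe are names chosen for the example.

```python
import unicodedata

# The four characters listed under Other_Id_Start in the Reference section.
OTHER_ID_START = ["\u2118", "\u212E", "\u309B", "\u309C"]

def describe(ch):
    # Code point, Unicode name, and general category of one character.
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))

# U+212C SCRIPT CAPITAL B is included only as a point of comparison.
for ch in OTHER_ID_START + ["\u212C"]:
    print(*describe(ch))
```

None of the four carries a letter category, so general category alone cannot decide whether they start a variable or an atom; the choice has to be made explicitly.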
>
> Trouble spot
--
/ Raimo Niskanen, Erlang/OTP, Ericsson AB