[erlang-questions] EEP 40 - A proposal for Unicode variable and atom names in Erlang.

Thu Nov 1 06:27:10 CET 2012

A non-text attachment was scrubbed...
Name: eep-0040.md
Type: application/octet-stream
Size: 9906 bytes
Desc: not available
URL: <http://erlang.org/pipermail/erlang-questions/attachments/20121101/b20d5f5c/attachment.obj>
-------------- next part --------------

On 1/11/2012, at 3:44 AM, Raimo Niskanen wrote:
> 
> Was it not meant to be:
>    var_start ::= (XID_Start ∩ (Lu ∪ Lt ∪ Other_ID_Start)) ∪ Pc

Yes.  I made a mistake there.
> 
> More restricted variable names
> ------------------------------
> 
> Nevertheless, I would like a slightly more conservative change in how Erlang
> should use Unicode in variable names and unquoted atoms.
> 
> I want to be able to read printed source code on a paper and at least
> understand if Ƽ = count() has a variable, an atom or an integer to the left.
> This is an impossible goal because we can today put e.g. Cyrillic А in any .erl
> file and that will look as if it should compile but it will not.

I am a little puzzled here.  U+0410 (CYRILLIC CAPITAL LETTER A) looks
like this:  А.  I grant you that it is somewhere between exceptionally
difficult and impossible to tell an A from an А from an Α (Latin
capital A, Cyrillic, and Greek respectively).  But they are all capital
letters.  The point of the proposal is that since А (U+0410) is a
capital letter, А = count() _should_ compile.

If the example had been U+1EFD ỽ (LATIN SMALL LETTER MIDDLE-WELSH V)
that would have been hard to tell from a six, true.
But I don't see how this is any different from the fact that in a script
you don't know, you cannot tell _what_ a character is.
For example, I had a student this year whose native language was, I
believe, Malayalam.  I can't tell a Malayalam letter from a digit from
a punctuation mark.

Did you mean U+0417 (CYRILLIC CAPITAL LETTER ZE) "З", which resembles 3?

Ah!  Emacs to the rescue.  It's the LATIN CAPITAL LETTER TONE FIVE.
Nothing to do with Cyrillic.
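
The look-alikes mentioned so far can be checked mechanically rather than by eye.  What follows is only a sketch, in Python rather than Erlang, because Python's standard unicodedata module exposes the General_Category directly.  The names `samples` and `describe` are made up for this illustration, and the results depend on the Unicode database bundled with the interpreter.

```python
import unicodedata

# Characters mentioned in this message, written as escapes so the
# listing does not depend on the reader's font.
samples = {
    "Latin A": "\u0041",
    "Cyrillic A": "\u0410",
    "Greek Alpha": "\u0391",
    "Cyrillic Ze": "\u0417",
    "Tone Five": "\u01BC",
    "Middle-Welsh v": "\u1EFD",
}

def describe(ch):
    """Return (code point, general category, Unicode name) for one character."""
    return ("U+%04X" % ord(ch), unicodedata.category(ch), unicodedata.name(ch))

for label, ch in samples.items():
    print(label, describe(ch))
```

The three A's and the tone five all come out as Lu, which is what makes them variable starts under the proposal; the Middle Welsh v comes out as Ll.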

Reverting to the Middle Welsh letter, if I cannot tell a small letter
from a digit, does that mean that every unquoted atom should begin
with an English letter?  (I cannot say "a Latin letter", because
ỽ _is_ a member of the extended Latin script.)

No, I'm sorry.  This is ridiculous.  Expecting everybody to begin
_their_ variables, which you will almost certainly never see, with an
ASCII letter so _you_ can tell this from that: what sense does
that make?  If it is in a script you cannot read, then you cannot read it.

Can we just try, for a minute or two, to entertain a rather wild idea?
Here's the idea:  most programmers are adults.  They can make informed
choices.  If they *want* you to read their code, they are smart enough
to write in a script you can read.  If they decide that it's more
important to them that _they_ can read comfortably, that's their
decision to make.  If you want a Malayalam-speaker to write code for
you, put the language (English, Finnish, whatever) in the contract.

I have a confession to make.  My multiple-programming-languages to
multiple-styled-output-formats tool is currently Latin-1 only.
That's because it's for _me_; nobody paid me to write it and I didn't
expect anyone else to find it useful (although someone did).  It can,
for example, be configured to generate HTML, and it can be made to
wrap keywords in <B> and could as easily wrap variables in <U>.  It
would probably take me about a week to revise the thing to use
Unicode.  So then I'd have a tool that could generate printed listings
with variables underlined, without needing to slap untold numbers of
people in the face with the notion that they are and must remain
second-class world citizens.

> So I have to change that requirement into; if it compiles I want to be able
> to tell from a noncolour printed source code listing what the semantics is.

You are, in fact, proposing a backwards-incompatible change to Erlang,
in order to achieve a goal which is not in general achievable, and not
in my view worth achieving if you could.

Let's be realistic here.  If you cannot read any of the words, it is not
going to do you any good to tell the variables from the atoms from the
numbers.  Let's take an example.  I took a snippet of Erlang out of
the Erlang/OTP release and transliterated the English letters to
Russian ones.  If you _don't_ read the Cyrillic script, precisely what
good does it do you to know which are the variables?  If you _do_ read
the Cyrillic script, this will seem to you to be complete gibberish,
so imagine it's a language you don't know.

ҵӄҽҲӃҸҾҽ({ҵӄҽҲӃҸҾҽ,ҝҰҼҴ,ҐӁҸӃӈ,Ґӂ0,ҥұ,ҥҳұ}, ҐӃҾҼҜҾҳ, ҢӃ0) ->
    try
        {ҐӂҼ,ҔҽӃӁӈқҰұҴһ,ҢӃ} = ҲҶ_ҵӄҽ(ҥұ, Ґӂ0, ҥҳұ, ҐӃҾҼҜҾҳ, {ҝҰҼҴ,ҐӁҸӃӈ}, ҢӃ0),
        ҕӄҽҲ = {ҵӄҽҲӃҸҾҽ,ҝҰҼҴ,ҐӁҸӃӈ,ҔҽӃӁӈқҰұҴһ,ҐӂҼ},
        {ҕӄҽҲ,ҢӃ}
    catch
        ҒһҰӂӂ:ҔӁӁҾӁ ->
            ҢӃҰҲҺ = ҴӁһҰҽҶ:ҶҴӃ_ӂӃҰҲҺӃӁҰҲҴ(),
            ҸҾ:ҵӆӁҸӃҴ("ҕӄҽҲӃҸҾҽ: ~ӆ/~ӆ\ҽ", [ҝҰҼҴ,ҐӁҸӃӈ]),
            ҴӁһҰҽҶ:ӁҰҸӂҴ(ҒһҰӂӂ, ҔӁӁҾӁ, ҢӃҰҲҺ)
    end.

ҲҶ_ҵӄҽ(қҴӂ, җӅӂ, ҥҳұ, ҐӃҾҼҜҾҳ, ҝҰҼҴҐӁҸӃӈ, ҢӃ0) ->
    {ҕҸ,ҢӃ1} = ҽҴӆ_һҰұҴһ(ҢӃ0),
    {ҕһ,ҢӃ2} = һҾҲҰһ_ҵӄҽҲ_һҰұҴһ(ҝҰҼҴҐӁҸӃӈ, ҢӃ1),

    ґҴҵ = ҲһҴҰӁ_ҳҴҰҳ(#ӂӁ{ӁҴҶ=ҵҾһҳһ(fun ({ӅҰӁ,ҥ}, ҡҴҶ) ->
                                           ҿӄӃ_ӁҴҶ(ҥ, ҡҴҶ)
                                     end, [], җӅӂ),
                        ӂӃҺ=[]}, 0, ҥҳұ),
    {ґ2,_ҐҵӃ,ҢӃ3} = ҲҶ_һҸӂӃ(қҴӂ, 0, ҥҳұ, ґҴҵ,
       ҢӃ2#ҲҶ{ұӃӈҿҴ=ҴӇҸӃ,ұҵҰҸһ=ҕҸ,ҵҸҽҵҾ=ҕҸ,Ҹӂ_ӃҾҿ_ұһҾҲҺ=ӃӁӄҴ}),
    {ҝҰҼҴ,ҐӁҸӃӈ} = ҝҰҼҴҐӁҸӃӈ,
    Ґ = [{һҰұҴһ,ҕҸ},{ҵӄҽҲ_ҸҽҵҾ,ҐӃҾҼҜҾҳ,{ҰӃҾҼ,ҝҰҼҴ},ҐӁҸӃӈ},
         {һҰұҴһ,ҕһ}|ґ2],
    {Ґ,ҕһ,ҢӃ3}.

I don't know about you, but I wouldn't dare to touch this.
It DOES NOT MATTER TO me which words are variables and which
are not, because that knowledge is not useful to me.

(By the way, it should now be clear that in a context like this
you'll _know_ that something is a Cyrillic capital A because
everything else is Cyrillic -- there are no capital letters in
keywords -- so what would a Latin capital A be doing there?)

Does that mean there will be Erlang files that I cannot read and
Raimo Niskanen cannot read?  Certainly it does. Does that mean a
big problem for us?  No.  Nobody is going to _expect_ us to read
it.  If someone ships us source code we can't read we shan't use
it.

Is this a NEW problem?  No.  It is already possible to use some
surprising languages in ASCII (Klingon, Ancient Egyptian, Greek
with a little ingenuity, ...) so ever since Erlang began, we've
had the possibility of entire files being written in words that
we did not understand.  If you don't know what the *functions*
are about, what good does it do you to know which tokens are
variables?

I once had to maintain a large chunk of Prolog written by a
very clever programmer whose idea of good variable naming
style came from old BASIC (one letter, or one letter and one
digit).  I could see _which_ tokens were the variables, but
not _what_ the variable names meant.  I had to figure it out
from the predicate names.  So from actual experience I can
tell you

	JUST KNOWING WHICH TOKENS ARE VARIABLES IS
	NEXT TO USELESS.

> I think it is better to restrict to a subset of 7-bit US-ASCII.

Yeah!  Let's make Erlang ASCII-only!  (Too bad about my father's
middle name: Æneas.  Perfectly good English name, from Latin.)

> Decent
> editors have means (vim: ga, emacs: Ctrl-X describe-char) to show which
> character is under the cursor and if it is A..Z or _ under U+7F it is a
> variable start.

I'm using Aquamacs.
From the Aquamacs help:
	Emacs buffers and strings support a large repertoire of 
	characters from many different scripts, allowing users to
	type and display text in almost any known written language.

	To support this multitude of characters and scripts,
	Emacs closely follows the Unicode Standard.
It's Meta-X describe-char, not Ctrl-X describe-char,
and it works perfectly with Unicode characters.
Here's sample output:

        character: Ҳ (1202, #o2262, #x4b2)
preferred charset: unicode (Unicode (ISO10646))
       code point: 0x04B2
           syntax: w 	which means: word
         category: .:Base, y:Cyrillic
      buffer code: #xD2 #xB2
        file code: #xD2 #xB2 (encoded by coding system utf-8)
          display: by this font (glyph code)
    nil:-apple-Lucida_Grande-medium-normal-normal-*-13-*-*-*-p-0-iso10646-1 (#x8A3)

Character code properties: customize what to show
  name: CYRILLIC CAPITAL LETTER HA WITH DESCENDER
  old-name: CYRILLIC CAPITAL LETTER KHA WITH RIGHT DESCENDER
  general-category: Lu (Letter, Uppercase)

Trying this in Vim, it tells me what the numeric codes
of a letter are, but not that it is a letter.

> 
> The underscore
> --------------
> 
> I would like to argue against allowing all Unicode general category Pc
> (Connector_Punctuation) characters in place of "_". This class contains
> in Unicode 6.2 these characters:
>    U+5F;   LOW LINE
>    U+203F; UNDERTIE
>    U+2040; CHARACTER TIE
>    U+2054; INVERTED UNDERTIE
>    U+FE33; PRESENTATION FORM FOR VERTICAL LOW LINE
>    U+FE34; PRESENTATION FORM FOR VERTICAL WAVY LOW LINE
>    U+FE4D; DASHED LOW LINE
>    U+FE4E; CENTERLINE LOW LINE
>    U+FE4F; WAVY LOW LINE
>    U+FF3F; FULLWIDTH LOW LINE
> 
> Of these at least U+2040 "⁀" is horizontal at the top of the line

If it looks horizontal, you have a very poor font.
It's _supposed_ to look more like a c rotated 90 degrees
clockwise and flattened a bit.

> and U+FE33 "︳" looks like a vertical bar (I guess intended for
> vertical flow chinese) so they do not resemble "_" very much.

Who said they were _supposed_ to resemble "_"?
Not me.

I can see your point here, but allowing-all-of-Pc *is* the
Unicode UAX#31 recommendation.    We *have* to tailor the
definition somewhat for the sake of backwards compatibility
(dots and at signs).  We *could* tailor it here, but it is
definitely advantageous to have at least one more Pc
character, for reasons given in the EEP.
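
Nobody needs to memorise the Pc set; it can be regenerated on demand.  A minimal sketch in Python, assuming only the standard unicodedata module (the function name `connector_punctuation` is invented here, and the result reflects whatever Unicode version the interpreter ships with, not necessarily 6.2):

```python
import sys
import unicodedata

def connector_punctuation():
    """All code points whose General_Category is Pc in this Python's Unicode data."""
    return [cp for cp in range(sys.maxunicode + 1)
            if unicodedata.category(chr(cp)) == "Pc"]

for cp in connector_punctuation():
    print("U+%04X; %s" % (cp, unicodedata.name(chr(cp))))
```

The vertical bar "|" does not appear in the output, so the question "is this character Pc or something else" has a mechanical answer.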

> Allowing all these would make it hard to remember if a given
> character is category Pc or something else e.g "|".

You are not *supposed* to remember what each and every character is.

BECAUSE YOU CAN'T.

If there's anyone who can, I don't want to meet them.
What _else_ could we talk about?

There are 110,117 defined characters in Unicode 6.2.
(The figure was 110,116 in Unicode 6.1 and 6.2 added one more.)
NOBODY is expected to know what all these characters are.

The idea is not
	"if a character is to appear in an Erlang file,
	 everybody must know what it means"
but
	"if someone wants to use their own script in
	 an Erlang file, they should be able to do so
	 in a way that is generally consistent with
	 other programming languages."

The idea that a character should be forbidden unless YOU
recognise it would take us right back to ASCII or Latin 1.
Please, do not put the cart before the horse.

It is perfectly acceptable to say "If someone wants to share
Erlang code with people in other countries, they should use
characters that all those people recognise."  In the 21st
century it is no longer acceptable to say "nobody may use a
character unless I remember what it is."
> 
> Unquoted atoms
> --------------
> 
> The EEP proposes:
>    atom_start ::= XID_Start ∖ (Lu ∪ Lt ∪ Lo ∪ Pc)
> 	| "." (Ll ∪ Lo)
> 
> I agree that Lu (Uppercase_Letter) and Lt (Titlecase_Letter) should
> be excluded so an atom can not start with a capital looking letter,
> but Pc ⊄ XID_Start so there is no reason to subtract it, and why
> subtract Lo (Other_Letter)?

There is also no *harm* in making it obvious that variables
*can* start with Pc characters and unquoted atoms *cannot*.

Why subtract Lo?  That was a combination of a backwards compatibility
issue and an oversight.

The backwards compatibility issue is that
ªº are Lo characters and are not allowed to begin an Erlang atom.
The oversight was forgetting that this category was the one with
most of the characters I wanted to allow.

This should read

    atom_start ::= XID_Start ∖ (Lu ∪ Lt ∪ "ªº")
                |  "." (Ll ∪ Lo)
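
To make that rule concrete, here is a rough sketch of it as a predicate.  It is Python, not a patch to the Erlang scanner, and it rests on one assumption that should be stated loudly: that a one-character `str.isidentifier()` is true for XID_Start characters and for "_" only.  The helper names `is_xid_start`, `is_atom_start_char` and `atom_start_length` are invented for the illustration.

```python
import unicodedata

def is_xid_start(ch):
    # Assumption: a one-character str.isidentifier() is true exactly for
    # XID_Start characters and "_", so "_" is excluded by hand.
    return ch != "_" and ch.isidentifier()

def is_atom_start_char(ch):
    """XID_Start minus Lu, Lt and the two Latin-1 ordinal indicators."""
    return (is_xid_start(ch)
            and unicodedata.category(ch) not in ("Lu", "Lt")
            and ch not in "\u00AA\u00BA")

def atom_start_length(text):
    """Length (0, 1 or 2) of the atom_start matched at the front of text."""
    if text[:1] and is_atom_start_char(text[0]):
        return 1
    # Second alternative of the rule: "." followed by an Ll or Lo character.
    if text[:1] == "." and text[1:2] and unicodedata.category(text[1]) in ("Ll", "Lo"):
        return 2
    return 0
```

For example, `atom_start_length("abc")` is 1, `atom_start_length(".a")` is 2, and `atom_start_length("Abc")` is 0.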

> There also seems to be a typo in the definition of unquoted_atom
> where an iteration of atom_continue is missing.
> 
> I propose:
>    unquoted_atom ::= atom_start atom_continue*

Yes.
> 
>    atom_start ::= atom_start_char
> 	| "." atom_start_char

That will allow Latin-1 atoms that are not now legal.
> 
>    atom_start_char ::= XID_Start ∖ (Lu ∪ Lt)
> 
>    atom_continue ::= XID_Continue ∪ "@"
> 	| "." XID_Continue

That will allow Latin-1 atoms that are not now legal.

> General explanation
> -------------------
> 
> I think the EEP could benefit from explaining more about the used character
> classes, what kind of stability annex #31 is designed to give and such.
> 
> When I did read the EEP it took several days of Unicode standard reading to
> start understanding, and I think many hesitate before trying to understand
> the EEP, which is a pity.

Well, yes.  Is it my job to repeat all the material in the Unicode
standard?  I don't think so.  I mean, the thing's telephone-book size!
> 
> My first concern was about if I write code for one Unicode Erlang release
> in the future, will then that code be valid for subsequent Erlang releases
> based on later Unicode standards.

Yes.  Section 1.1 of UAX#31 could hardly be more explicit.   Well,
maybe it could, which is why it points to
http://www.unicode.org/policies/stability_policy.html
which says

 - Once a character is XID_Continue,
   it must continue to be so in all future versions.
 - If a character is XID_Start then it must also be XID_Continue.
 - Once a character is XID_Start,
   it must continue to be so in all future versions.

amongst other things.
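
The middle rule (XID_Start implies XID_Continue) can at least be spot-checked against one snapshot of the data.  This is a sketch only: it uses Python's `str.isidentifier()` as a stand-in for the two properties, which is an assumption about that method rather than a direct read of the Unicode files, and it says nothing about stability across versions, which only the policy itself can promise.

```python
import sys

def is_xid_start(ch):
    # Assumption: true for XID_Start characters; "_" is excluded by hand
    # because Python also accepts it as an identifier start.
    return ch != "_" and ch.isidentifier()

def is_xid_continue(ch):
    # A character may continue an identifier if "a" followed by it is one.
    return ("a" + ch).isidentifier()

# Every code point that may start an identifier but may not continue one.
violations = [cp for cp in range(sys.maxunicode + 1)
              if is_xid_start(chr(cp)) and not is_xid_continue(chr(cp))]
print(len(violations))
```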

> For example the EEP and my proposal both define atom_start to be XID_Start
> minus a set containing uppercase and titlecase letters. XID_Start is
> derived from ID_Start, and ID_Start contains Other_ID_Start. I have failed
> in finding which codepoints are contained in Other_ID_Start.

To start with, the purpose of Other_ID_Start is to provide stability.
Any character which _used_ to be an ID_Start but because of some change
would have ceased to be so will be given that property to compensate.

The properties Other_ID_Start and Other_ID_Continue are listed in
PropList.txt in the Unicode Character Database.  Here's the current set:

# ================================================

2118          ; Other_ID_Start # Sm       SCRIPT CAPITAL P
212E          ; Other_ID_Start # So       ESTIMATED SYMBOL
309B..309C    ; Other_ID_Start # Sk   [2] KATAKANA-HIRAGANA VOICED SOUND MARK..KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK

# Total code points: 4

# ================================================

00B7          ; Other_ID_Continue # Po       MIDDLE DOT
0387          ; Other_ID_Continue # Po       GREEK ANO TELEIA
1369..1371    ; Other_ID_Continue # No   [9] ETHIOPIC DIGIT ONE..ETHIOPIC DIGIT NINE
19DA          ; Other_ID_Continue # No       NEW TAI LUE THAM DIGIT ONE

# Total code points: 12
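
The Other_ID_Start entries can be inspected the same way as any other character.  A small sketch in Python, with the four code points copied by hand from the excerpt above (the list and the function name `other_id_start_report` are illustrative, not anything a library provides).  Note that `str.isidentifier()` follows XID_Start rather than ID_Start, so it need not agree with Other_ID_Start for characters that have compatibility decompositions.

```python
import unicodedata

# Copied from the PropList.txt excerpt above (Unicode 6.2).
OTHER_ID_START = [0x2118, 0x212E, 0x309B, 0x309C]

def other_id_start_report():
    """(code point, general category, starts-an-identifier?, name) per entry."""
    rows = []
    for cp in OTHER_ID_START:
        ch = chr(cp)
        rows.append(("U+%04X" % cp, unicodedata.category(ch),
                     ch.isidentifier(), unicodedata.name(ch)))
    return rows

for row in other_id_start_report():
    print(row)
```

None of the four is a letter by General_Category; they are identifier starts only because the property says so.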

> But since we here define atom_start as above, moving a character from Lu
> or Lt into Other_ID_Start will remove it from atom_start and old code
> using it will not compile.

Lu and Lt are "General Categories".  Other_ID_Start is a "property".

OK, now we've got a genuine technical problem.

The set of characters that can begin a variable-OR-an-unquoted-atom
can only grow.  That much stability we're promised.

If a character changes from Lu to Lt or Other_ID_Start,
no problem.  If a character changes from Lt to Lu or
Other_ID_Start, no problem.  But if a character changes
from Lu/Lt to Ll/Lo or vice versa, we have a problem.

Perhaps we can appeal to this:
	Once a character is encoded, its properties may still be
	changed, but not in such a way as to change the fundamental
	identity of the character.
	...
	For example, the representative glyph for U+0041 “A”
	cannot be changed to “B”; the General_Category for
	U+0041 “A” cannot be changed to Ll (lowercase letter)
	...

Case Pair stability _nearly_ gives us what we want.
	If two characters form a case pair in a version of Unicode,
	they will remain a case pair in each subsequent version of Unicode.

	If two characters do not form a case pair in a version of Unicode,
	they will never become a case pair in any subsequent version of Unicode.
That is, if "D" and "d" are unequal defined characters such that
lower("D") = "d" and upper("d") = "D", then this will remain true.
This means that
	If "D" is an Lu character now and "d" the corresponding Ll
	character, they are going to remain a case pair.
So we could fiddle a bit and say
	Lu + Lt + Pc + (Other_ID_Start such that lower(x) != x)
is what we're after.
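
As a sanity check on that formula, here is a sketch of it as a Python predicate.  The Other_ID_Start set is hard-coded from the Unicode 6.2 excerpt earlier in this message, `is_var_start` is an invented name, and the intersection with XID_Start from the original var_start rule is deliberately left out to keep the sketch short.

```python
import unicodedata

# Unicode 6.2 Other_ID_Start, copied from the PropList.txt excerpt.
OTHER_ID_START = {0x2118, 0x212E, 0x309B, 0x309C}

def is_var_start(ch):
    """Lu + Lt + Pc + (Other_ID_Start such that lower(x) != x)."""
    if unicodedata.category(ch) in ("Lu", "Lt", "Pc"):
        return True
    # Only Other_ID_Start characters that actually have a lower-case
    # counterpart are treated as capitals.
    return ord(ch) in OTHER_ID_START and ch.lower() != ch
```

With the four code points listed, none has a lower-case mapping, so the last clause adds nothing today; it only matters if a cased letter ever ends up in Other_ID_Start.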

This doesn't handle the situation where there is a cased letter now
but not its case opposite, as Latin-1 had y-umlaut and sharp s as
lower case letters with no upper case version.  But when case opposites
for them did go into Unicode, they didn't change.

I don't think we actually have a problem.

However, the attached revision to EEP 40 has two recommendations.