[erlang-questions] EEP 40 - A proposal for Unicode variable and atom names in Erlang.

Thu Nov 1 17:52:39 CET 2012

On Thu, Nov 01, 2012 at 06:27:10PM +1300, Richard O'Keefe wrote:
> 
> 
> On 1/11/2012, at 3:44 AM, Raimo Niskanen wrote:
: :
> > 
> > More restricted variable names
> > ------------------------------
> > 
> > Nevertheless, I would like a slightly more conservative change in how Erlang
> > should use Unicode in variable names and unquoted atoms.
> > 
> > I want to be able to read printed source code on paper and at least
> > understand if Ƽ = count() has a variable, an atom or an integer to the left.
> > This is an impossible goal because today we can put e.g. a Cyrillic А in any
> > .erl file, and it will look as if it should compile, but it will not.
> 
> I am a little puzzled here.  U+0410 (CYRILLIC CAPITAL LETTER A) looks
> like this:  А.  I grant you that it is somewhere between exceptionally
> difficult and impossible to tell an A from an А from an Α (Latin
> capital A, Cyrillic, and Greek respectively).  But they are all capital
> letters.  The point of the proposal is that since А (U+0410) is a
> capital letter, А = count() _should_ compile.

I think that point, which is a good one, did not come through in the
proposal, but your updated version has a very good rationale that
makes it clearer.

> 
> If the example had been U+1EFD ỽ (LATIN SMALL LETTER MIDDLE-WELSH V)
> that would have been hard to tell from a six, true.
> But I don't see how this is any different from the fact that in a script
> you don't know, you cannot tell _what_ a character is.
> For example, I had a student this year whose native language was I
> believe Malayalam.  I can't tell a Malayalam letter from a digit from
> a punctuation mark.
> 
> Did you mean U+0417 (CYRILLIC CAPITAL LETTER ZE) "З", which resembles 3?
> 
> Ah!  Emacs to the rescue.  It's the LATIN CAPITAL LETTER TONE FIVE.
> Nothing to do with Cyrillic.

Sorry, I mixed examples here and pushed you onto a side track. The TONE FIVE
was an example of not knowing the symbol's general category. The Cyrillic A
was an example of a glyph that looks similar to A in US-ASCII.
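
To make the two examples concrete, here is a small sketch in Python (not
part of the EEP; it assumes the Unicode database bundled with the Python
interpreter, which is newer than Unicode 6.2). The helper name describe()
is invented for this illustration. It looks up the general category and
name of the look-alike characters mentioned above:

```python
import unicodedata

# Latin A, Cyrillic A, Greek Alpha, TONE FIVE, Middle-Welsh V, Cyrillic ZE, digit 3
SAMPLES = ["A", "\u0410", "\u0391", "\u01BC", "\u1EFD", "\u0417", "3"]

def describe(ch):
    """Return (code point, general category, character name) for one character."""
    return (f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch))

for ch in SAMPLES:
    print(*describe(ch))
```

The three A look-alikes and the TONE FIVE all come out as Lu, the
Middle-Welsh V as Ll, and only the digit as Nd, which is the distinction
the two examples were about.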

> 
> Reverting to the Middle Welsh letter, if I cannot tell a small letter
> from a digit, does that mean that every unquoted atom should begin
> with an English letter?  (I cannot say "a Latin letter", because
> ỽ _is_ a member of the extended Latin script.)
> 
> No, I'm sorry.  This is ridiculous.  Expecting everybody to begin
> _their_ variables, which you will almost certainly never see, with
> an ASCII letter so _you_ can tell this from that; what sense does
> that make?  If it is in a script you cannot read, then you cannot read it.
> 
> Can we just try, for a minute or two, to entertain a rather wild idea?
> Here's the idea:  most programmers are adults.  They can make informed
> choices.  If they *want* you to read their code, they are smart enough
> to write in a script you can read.  If they decide that it's more
> important to them that _they_ can read comfortably, that's their
> decision to make.  If you want a Malayalam-speaker to write code for
> you, put the language (English, Finnish, whatever) in the contract.
> 
> I have a confession to make.  My multiple-programming-languages to
> multiple-styled-output-formats tool is currently Latin-1 only.
> That's because it's for _me_; nobody paid me to write it and I didn't
> expect anyone else to find it useful (although someone did).  It can,
> for example, be configured to generate HTML, and it can be made to
> wrap keywords in <B> and could as easily wrap variables in <U>.  It
> would probably take me about a week to revise the thing to use
> Unicode.  So then I'd have a tool that could generate printed listings
> with variables underlined, without needing to slap untold numbers of
> people in the face with the notion that they are and must remain
> second-class world citizens.
> 
> > So I have to change that requirement into; if it compiles I want to be able
> > to tell from a noncolour printed source code listing what the semantics is.
> 
> You are, in fact, proposing a backwards-incompatible change to Erlang,
> in order to achieve a goal which is not in general achievable, and not
> in my view worth achieving if you could.
> 
> Let's be realistic here.  If you cannot read any of the words, it is not
> going to do you any good to tell the variables from the atoms from the
> numbers.  Let's take an example.  I took a snippet of Erlang out of
> the Erlang/OTP release and transliterated the English letters to
> Russian ones.  If you _don't_ read the Cyrillic script, precisely what
> good does it do you to know which are the variables?  If you _do_ read
> the Cyrillic script, this will seem to you to be complete gibberish,
> so imagine it's a language you don't know.

So here is what seems to be the core question:

I say I want to be able to see the difference between a variable and an
unquoted atom even if I cannot make sense of the variable and atom names.
I say it would be possible to achieve this by enforcing a small set of first
characters for variables. Then we would require a variable to start with
a US-ASCII capital letter, "_" or "@".

You say that goal of mine is a lost cause because being able to tell the
difference between a variable and an atom would be of no use to me anyway.
And trying to achieve this by making backwards incompatible changes is
plain ridiculous.

Fair enough.

Just adding "@" to the current set of characters allowed to start a variable
would not be a backwards incompatible change, would it? But it would be ugly
to allow some Latin capitals but not the extended Latin or Cyrillic ones, etc.
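
For reference, the restricted rule discussed above (a variable must start
with a US-ASCII capital, "_" or "@") can be sketched in a few lines of
Python. This is only an illustration of the rule as worded in this mail;
the function name is hypothetical and nothing here is taken from the
Erlang scanner:

```python
import string

# First characters the restricted rule would accept for a variable.
RESTRICTED_VAR_START = set(string.ascii_uppercase) | {"_", "@"}

def starts_like_restricted_variable(token):
    """True if token's first character is A..Z, "_" or "@"."""
    return bool(token) and token[0] in RESTRICTED_VAR_START

for token in ["Count", "_Ignored", "@Æneas", "Æneas", "\u0410bc", "count"]:
    print(token, starts_like_restricted_variable(token))
```

Under this rule Æneas and a name starting with Cyrillic А are both
rejected as variables, which is exactly the cost being objected to.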

> 
> ҵӄҽҲӃҸҾҽ({ҵӄҽҲӃҸҾҽ,ҝҰҼҴ,ҐӁҸӃӈ,Ґӂ0,ҥұ,ҥҳұ}, ҐӃҾҼҜҾҳ, ҢӃ0) ->
>     try
>         {ҐӂҼ,ҔҽӃӁӈқҰұҴһ,ҢӃ} = ҲҶ_ҵӄҽ(ҥұ, Ґӂ0, ҥҳұ, ҐӃҾҼҜҾҳ, {ҝҰҼҴ,ҐӁҸӃӈ}, ҢӃ0),
>         ҕӄҽҲ = {ҵӄҽҲӃҸҾҽ,ҝҰҼҴ,ҐӁҸӃӈ,ҔҽӃӁӈқҰұҴһ,ҐӂҼ},
>         {ҕӄҽҲ,ҢӃ}
>     catch
>         ҒһҰӂӂ:ҔӁӁҾӁ ->
>             ҢӃҰҲҺ = ҴӁһҰҽҶ:ҶҴӃ_ӂӃҰҲҺӃӁҰҲҴ(),
>             ҸҾ:ҵӆӁҸӃҴ("ҕӄҽҲӃҸҾҽ: ~ӆ/~ӆ\ҽ", [ҝҰҼҴ,ҐӁҸӃӈ]),
>             ҴӁһҰҽҶ:ӁҰҸӂҴ(ҒһҰӂӂ, ҔӁӁҾӁ, ҢӃҰҲҺ)
>     end.
> 
> ҲҶ_ҵӄҽ(қҴӂ, җӅӂ, ҥҳұ, ҐӃҾҼҜҾҳ, ҝҰҼҴҐӁҸӃӈ, ҢӃ0) ->
>     {ҕҸ,ҢӃ1} = ҽҴӆ_һҰұҴһ(ҢӃ0),
>     {ҕһ,ҢӃ2} = һҾҲҰһ_ҵӄҽҲ_һҰұҴһ(ҝҰҼҴҐӁҸӃӈ, ҢӃ1),
> 
>     ґҴҵ = ҲһҴҰӁ_ҳҴҰҳ(#ӂӁ{ӁҴҶ=ҵҾһҳһ(fun ({ӅҰӁ,ҥ}, ҡҴҶ) ->
>                                            ҿӄӃ_ӁҴҶ(ҥ, ҡҴҶ)
>                                      end, [], җӅӂ),
>                         ӂӃҺ=[]}, 0, ҥҳұ),
>     {ґ2,_ҐҵӃ,ҢӃ3} = ҲҶ_һҸӂӃ(қҴӂ, 0, ҥҳұ, ґҴҵ,
>        ҢӃ2#ҲҶ{ұӃӈҿҴ=ҴӇҸӃ,ұҵҰҸһ=ҕҸ,ҵҸҽҵҾ=ҕҸ,Ҹӂ_ӃҾҿ_ұһҾҲҺ=ӃӁӄҴ}),
>     {ҝҰҼҴ,ҐӁҸӃӈ} = ҝҰҼҴҐӁҸӃӈ,
>     Ґ = [{һҰұҴһ,ҕҸ},{ҵӄҽҲ_ҸҽҵҾ,ҐӃҾҼҜҾҳ,{ҰӃҾҼ,ҝҰҼҴ},ҐӁҸӃӈ},
>          {һҰұҴһ,ҕһ}|ґ2],
>     {Ґ,ҕһ,ҢӃ3}.
> 
> I don't know about you, but I wouldn't dare to touch this.
> It DOES NOT MATTER TO me which words are variables and which
> are not, because that knowledge is not useful to me.
> 
> (By the way, it should now be clear that in a context like this
> you'll _know_ that something is a Cyrillic capital A because
> everything else is Cyrillic -- there are no capital letters in
> keywords -- so what would a Latin capital A be doing there?)
> 
> Does that mean there will be Erlang files that I cannot read and
> Raimo Niskanen cannot read?  Certainly it does. Does that mean a
> big problem for us?  No.  Nobody is going to _expect_ us to read
> it.  If someone ships us source code we can't read we shan't use
> it.
> 
> Is this a NEW problem?  No.  It is already possible to use some
> surprising languages in ASCII (Klingon, Ancient Egyptian, Greek
> with a little ingenuity, ...) so ever since Erlang began, we've
> had the possibility of entire files being written in words that
> we did not understand.  If you don't know what the *functions*
> are about, what good does it do you to know which tokens are
> variables?
> 
> I once had to maintain a large chunk of Prolog written by a
> very clever programmer whose idea of good variable naming
> style came from old BASIC (one letter, or one letter and one
> digit).  I could see _which_ tokens were the variables, but
> not _what_ the variable names meant.  I had to figure it out
> from the predicate names.  So from actual experience I can
> tell you
> 
> 	JUST KNOWING WHICH TOKENS ARE VARIABLES IS
> 	NEXT TO USELESS.

You have a point. Now it is clearer to me.

> 
> > I think it is better to restrict to a subset of 7-bit US-ASCII.
> 
> Yeah!  Let's make Erlang ASCII-only!  (Too bad about my father's
> middle name: Æneas.  Perfectly good English name, from Latin.)

I was of course talking about the start of a variable, not the
entire language. I am not that stupid. His variable could be
__Æneas, or @Æneas (the latter is unreadable).

> 
> > Decent
> > editors have means (vim: ga, emacs: Ctrl-X describe-char) to show which
> > character is under the cursor and if it is A..Z or _ under U+7F it is a
> > variable start.
> 
> I'm using Aquamacs.
> From the Aquamacs help:
> 	Emacs buffers and strings support a large repertoire of 
> 	characters from many different scripts, allowing users to
> 	type and display text in almost any known written language.
> 
> 	To support this multitude of characters and scripts,
> 	Emacs closely follows the Unicode Standard.
> It's Meta-X describe-char, not Ctrl-X describe-char,

Yes. Meta-X. My mistake.

> and it works perfectly with Unicode characters.
> Here's sample output:
> 
>         character: Ҳ (1202, #o2262, #x4b2)
> preferred charset: unicode (Unicode (ISO10646))
>        code point: 0x04B2
>            syntax: w 	which means: word
>          category: .:Base, y:Cyrillic
>       buffer code: #xD2 #xB2
>         file code: #xD2 #xB2 (encoded by coding system utf-8)
>           display: by this font (glyph code)
>     nil:-apple-Lucida_Grande-medium-normal-normal-*-13-*-*-*-p-0-iso10646-1 (#x8A3)
> 
> Character code properties: customize what to show
>   name: CYRILLIC CAPITAL LETTER HA WITH DESCENDER
>   old-name: CYRILLIC CAPITAL LETTER KHA WITH RIGHT DESCENDER
>   general-category: Lu (Letter, Uppercase)
> 
> Trying this in Vim, it tells me what the numeric codes
> of a letter are, but not that it is a letter.

Yes. I know. I gave the example.

So in Vim you can easily see whether the character is less than 128,
but not whether it is a letter.
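
The difference between the two editors can be sketched in Python: the
numeric codes (what Vim's ga shows) and the "is it below 128" test need
no database, while the category and name (what describe-char adds) do.
The function name and dictionary keys below are invented for the
illustration, and bytes.hex() with a separator needs Python 3.8 or later:

```python
import unicodedata

def describe_char(ch):
    """Collect the numeric codes plus the Unicode properties of one character."""
    cp = ord(ch)
    return {
        "decimal": cp,
        "hex": f"{cp:#x}",
        "octal": f"{cp:#o}",
        "utf8": ch.encode("utf-8").hex(" ").upper(),  # bytes as stored in a UTF-8 file
        "ascii": cp < 128,                            # answerable from the code alone
        "category": unicodedata.category(ch),         # needs the Unicode database
        "name": unicodedata.name(ch, "<unnamed>"),
    }

for key, value in describe_char("\u04B2").items():
    print(f"{key:>8}: {value}")
```

For U+04B2 this should agree with the Emacs output quoted above: code
1202, UTF-8 bytes D2 B2, general category Lu.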

> 
> > 
> > The underscore
> > --------------
> > 
> > I would like to argue against allowing all Unicode general category Pc
> > (Connector_Punctuation) characters in place of "_". This class contains
> > in Unicode 6.2 these characters:
> >    U+005F; LOW LINE
> >    U+203F; UNDERTIE
> >    U+2040; CHARACTER TIE
> >    U+2054; INVERTED UNDERTIE
> >    U+FE33; PRESENTATION FORM FOR VERTICAL LOW LINE
> >    U+FE34; PRESENTATION FORM FOR VERTICAL WAVY LOW LINE
> >    U+FE4D; DASHED LOW LINE
> >    U+FE4E; CENTERLINE LOW LINE
> >    U+FE4F; WAVY LOW LINE
> >    U+FF3F; FULLWIDTH LOW LINE
> > 
> > Of these at least U+2040 "⁀" is horizontal at the top of the line
> 
> If it looks horizontal, you have a very poor font.
> It's _supposed_ to look more like a c rotated 90 degrees
> clockwise and flattened a bit.

Yes that describes it better. A horizontal flat C, rounded up.

> 
> > and U+FE33 "︳" looks like a vertical bar (I guess intended for
> > vertical-flow Chinese) so they do not resemble "_" very much.
> 
> Who said they were _supposed_ to resemble "_"?
> Not me.

No. I did, because for me that would indicate the character's purpose.

> 
> I can see your point here, but allowing-all-of-Pc *is* the
> Unicode UAX#31 recommendation.    We *have* to tailor the
> definition somewhat for the sake of backwards compatibility
> (dots and at signs).  We *could* tailor it here, but it is
> definitely advantageous to have at least one more Pc
> character, for reasons given in the EEP.

Sorry, I cannot find those reasons. I find reasons for, and agree with,
the point that if we allow more than "_" we should allow all of Pc,
but I do not see why we need more than "_", other than because
it is UAX#31's recommendation.
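
For anyone who wants to see what "all of Pc" amounts to, the category can
be enumerated from a Unicode database. The sketch below uses Python's
unicodedata module; note that its database is whatever version ships with
the interpreter, not necessarily Unicode 6.2, so the result may differ
from the list quoted above. The function name is made up for the example:

```python
import sys
import unicodedata

def connector_punctuation():
    """Return every character whose general category is Pc."""
    return [chr(cp) for cp in range(sys.maxunicode + 1)
            if unicodedata.category(chr(cp)) == "Pc"]

for ch in connector_punctuation():
    print(f"U+{ord(ch):04X}; {unicodedata.name(ch)}")
```

A character such as "|" (category Sm) does not show up in this list, but
telling that from the glyph alone is the difficulty raised in the quoted
text.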

> 
> > Allowing all these would make it hard to remember if a given
> > character is category Pc or something else e.g "|".
> 
> You are not *supposed* to remember what each and every character is.
> 
> BECAUSE YOU CAN'T.
> 
> If there's anyone who can, I don't want to meet them.
> What _else_ could we talk about?
> 
> There are 110,117 defined characters in Unicode 6.2.
> (The figure was 110,116 in Unicode 6.1 and 6.2 added one more.)
> NOBODY is expected to know what all these characters are.
> 
> The idea is not
> 	"if a character is to appear in an Erlang file,
> 	 everybody must know what it means"
> but
> 	"if someone wants to use their own script in
> 	 an Erlang file, they should be able to do so
> 	 in a way that is generally consistent with
> 	 other programming languages."
> 
> The idea that a character should be forbidden unless YOU
> recognise it would take us right back to ASCII or Latin 1.
> Please, do not put the cart before the horse.
> 
> It is perfectly acceptable to say "If someone wants to share
> Erlang code with people in other countries, they should use
> characters that all those people recognise."  In the 21st
> century it is no longer acceptable to say "nobody may use a
> character unless I remember what it is."

I said I want to be able to understand the semantics without
knowing all characters. Is that a straw man attack?

The wildcard variable is "_" and starting a variable with that
character has a special meaning to the compiler. Why do we need
more aliases for that character?

> > 
> > Unquoted atoms
> > --------------
> > 
> > The EEP proposes:
> >    atom_start ::= XID_Start ∖ (Lu ∪ Lt ∪ Lo ∪ Pc)
> > 	| "." (Ll ∪ Lo)
> > 
> > I agree that Lu (Uppercase_Letter) and Lt (Titlecase_Letter) should
> > be excluded so an atom can not start with a capital looking letter,
> > but Pc ⊄ XID_Start so there is no reason to subtract it, and why
> > subtract Lo (Other_Letter)?
> 
> There is also no *harm* in making it obvious that variables
> *can* start with Pc characters and unquoted atoms *cannot*.

Point taken. I agree.

> 
> Why subtract Lo?  That was a combination of a backwards compatibility
> issue and an oversight.
> 
> The backwards compatibility issue is that
> ªº are Lo characters and are not allowed to begin an Erlang atom.

Would that be an issue? Since they are in Lo should we not start
allowing them?

> The oversight was forgetting that this category was the one with
> most of the characters I wanted to allow.

I guessed so.

> 
> This should read
> 
>     atom_start ::= XID_Start \ (Lu ∪ Lt ∪ "ªº")
>                 |  "." (Ll ∪ Lo)

Ok. Now I get it. But should it not be the same set after a dot
as at the start?

> 
> > There also seems to be a typo in the definition of unquoted_atom
> > where an iteration of atom_continue is missing.
> > 
> > I propose:
> >    unquoted_atom ::= atom_start atom_continue*
> 
> Yes.
> > 
> >    atom_start ::= atom_start_char
> > 	| "." atom_start_char
> 
> That will allow Latin-1 atoms that are not now legal.
> > 
> >    atom_start_char ::= XID_Start ∖ (Lu ∪ Lt)
> > 
> >    atom_continue ::= XID_Continue ∪ "@"
> > 	| "." XID_Continue
> 
> That will allow Latin-1 atoms that are not now legal.
> 
> > General explanation
> > -------------------
> > 
> > I think the EEP could benefit from explaining more about the character
> > classes used, what kind of stability annex #31 is designed to give and such.
> > 
> > When I read the EEP it took several days of Unicode standard reading to
> > start understanding, and I think many hesitate before trying to understand
> > the EEP, which is a pity.
> 
> Well, yes.  Is it my job to repeat all the material in the Unicode
> standard?  I don't think so.  I mean, the thing's telephone-book size!

No. The rationale in your new version is a great improvement.
Pointers and reasons are what is needed.

> > 
> > My first concern was whether code I write for one Unicode Erlang release
> > in the future will still be valid for subsequent Erlang releases
> > based on later Unicode standards.
> 
> Yes.  Section 1.1 of UAX#31 could hardly be more explicit.   Well,
> maybe it could, which is why it points to
> http://www.unicode.org/policies/stability_policy.html
> which says
> 
>  - Once a character is XID_Continue,
>    it must continue to be so in all future versions.
>  - If a character is XID_Start then it must also be XID_Continue.
>  - Once a character is XID_Start,
>    it must continue to be so in all future versions.
> 
> amongst other things.

Thank you. The Unicode standard is hard to navigate.
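
The second of the quoted guarantees (every XID_Start character is also
XID_Continue) can be spot-checked against one database version. The
sketch below leans on the fact that Python defines its own identifiers in
terms of XID_Start and XID_Continue, so str.isidentifier() serves as a
stand-in for the two properties; that is an approximation chosen for
convenience, and the helper names are invented here:

```python
import sys

def is_xid_start(ch):
    # isidentifier() accepts "_" in addition to XID_Start as a first
    # character, so "_" is excluded to approximate XID_Start alone.
    return ch != "_" and ch.isidentifier()

def is_xid_continue(ch):
    # "a" is a valid start, so "a" + ch is an identifier only if ch
    # is acceptable as a continuation character.
    return ("a" + ch).isidentifier()

def start_without_continue():
    """Code points that are XID_Start but not XID_Continue (should be none)."""
    return [cp for cp in range(sys.maxunicode + 1)
            if is_xid_start(chr(cp)) and not is_xid_continue(chr(cp))]

violations = start_without_continue()
print(len(violations), "violations found")
```

This says nothing about stability across versions, of course; that part
rests on the policy itself.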

> 
> > For example the EEP and my proposal both define atom_start to be XID_Start
> > minus a set containing uppercase and titlecase letters. XID_Start is
> > derived from ID_Start, and ID_Start contains Other_ID_Start. I have failed
> > to find which code points are contained in Other_ID_Start.
> 
> To start with, the purpose of Other_ID_Start is to provide stability.
> Any character which _used_ to be an ID_Start but because of some change
> would have ceased to be so will be given that property to compensate.
> 
> The properties Other_ID_Start and Other_ID_Continue are listed in
> Proplist.txt in the Unicode data base.  Here's the current set:

So that's where it is... It is difficult to find out where the
different properties are attached to characters.

> 
> # ================================================
> 
> 2118          ; Other_ID_Start # Sm       SCRIPT CAPITAL P
> 212E          ; Other_ID_Start # So       ESTIMATED SYMBOL
> 309B..309C    ; Other_ID_Start # Sk   [2] KATAKANA-HIRAGANA VOICED SOUND MARK..KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK
> 
> # Total code points: 4
> 
> # ================================================
> 
> 00B7          ; Other_ID_Continue # Po       MIDDLE DOT
> 0387          ; Other_ID_Continue # Po       GREEK ANO TELEIA
> 1369..1371    ; Other_ID_Continue # No   [9] ETHIOPIC DIGIT ONE..ETHIOPIC DIGIT NINE
> 19DA          ; Other_ID_Continue # No       NEW TAI LUE THAM DIGIT ONE
> 
> # Total code points: 12
> 
> > But since we here define atom_start as above, moving a character from Lu
> > or Lt into Other_ID_Start will remove it from atom_start and old code
> > using it will not compile.
> 
> 
> Lu and Lt are "General Categories".  Other_ID_Start is a "property".
> 
> OK, now we've got a genuine technical problem.
> 
> The set of characters that can begin a variable-OR-an-unquoted-atom
> can only grow.  That much stability we're promised.
> 
> If a character changes from Lu to Lt or Other_ID_Start,
> no problem.  If a character changes from Lt to Lu or
> Other_ID_Start, no problem.  But if a character changes
> from Lu/Lt to Ll/Lo or vice versa, we have a problem.

I agree that moving a character from Lu or Lt to Other_ID_Start would
increase the set of atom_start characters.
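
As an illustration of why: an Other_ID_Start character is in XID_Start
without being in any letter category, so subtracting Lu and Lt does not
remove it. The sketch below checks two of the code points from the list
quoted above, using Python's str.isidentifier() as an approximation of
XID_Start (an assumption of this example, with an invented helper name):

```python
import unicodedata

def is_xid_start(ch):
    # isidentifier() allows "_" plus XID_Start as a first character.
    return ch != "_" and ch.isidentifier()

for cp in (0x2118, 0x212E):
    ch = chr(cp)
    print(f"U+{cp:04X}", unicodedata.category(ch), is_xid_start(ch),
          unicodedata.name(ch))
```

Both are symbols by general category (Sm and So) and still qualify as
identifier starts.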

For the characters "ªº" you above called that a backwards compatibility
issue, which I doubt it is. Ignoring that issue would simplify atom_start.

I think I still see a problem, though:
	unquoted_atom ::= atom_start atom_continue*

	atom_start ::= XID_Start \ (Lu ∪ Lt ∪ Pc ∪ "ªº")
	            |  "." (Ll ∪ Lo)

	atom_continue ::= XID_Continue | "@"
	               |  "." (Ll ∪ Lo)

Where XID_Start is practically:
	(Lu ∪ Ll ∪ Lt ∪ Lm ∪ Lo ∪ Nl ∪ Other_ID_Start)
	    \ Pattern_Syntax \ Pattern_White_Space

If a character moves from Ll or Lo to Other_ID_Start it will suddenly
become not allowed after a ".". Right?

Should not the set after a "." be about the same as at the start?
	unquoted_atom ::= atom_start atom_continue*
	atom_start ::= atom_start_char | "." atom_start_char
	atom_continue ::= XID_Continue | "@" | "." atom_start_char
	atom_start_char ::= XID_Start \ (Lu ∪ Lt ∪ Pc ∪ "ªº")

> 
> Perhaps we can appeal to this:
> 	Once a character is encoded, its properties may still be
> 	changed, but not in such a way as to change the fundamental
> 	identity of the character.
> 	...
> 	For example, the representative glyph for U+0041 “A”
> 	cannot be changed to “B”; the General_Category for
> 	U+0041 “A” cannot be changed to Ll (lowercase letter)
> 	...
> 
> Case Pair stability _nearly_ gives us what we want.
> 	If two characters form a case pair in a version of Unicode,
> 	they will remain a case pair in each subsequent version of Unicode.
> 
> 	If two characters do not form a case pair in a version of Unicode,
> 	they will never become a case pair in any subsequent version of Unicode.
> That is, if "D" and "d" are unequal defined characters such that
> lower("D") = "d" and upper("d") = "D", then this will remain true.
> This means that
> 	If "D" is an Lu character now and "d" the corresponding Ll
> 	character, they are going to remain a case pair.
> So we could fiddle a bit and say
> 	Lu + Lt + Pc + (Other_ID_Start such that lower(x) != x)
> is what we're after.
> 
> This doesn't handle the situation where there is a cased letter now
> but not its case opposite, as Latin-1 had y-umlaut and sharp s as
> lower case letters with no upper case version.  But when case opposites
> for them did go into Unicode, they didn't change.
> 
> I don't think we actually have a problem.

I think you are right.
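
For completeness, the "fiddled" variable-start set from the quoted text,
Lu + Lt + Pc + (Other_ID_Start such that lower(x) != x), can be sketched
as follows. Python does not expose the Other_ID_Start property, so the
four code points listed earlier in this thread are hard-coded; that set,
like the function name, is an assumption of the example:

```python
import unicodedata

# Other_ID_Start as listed earlier in this thread (Unicode 6.2 PropList.txt).
OTHER_ID_START = {"\u2118", "\u212E", "\u309B", "\u309C"}

def is_variable_start(ch):
    """Lu, Lt or Pc, or an Other_ID_Start character that has a lowercase form."""
    if unicodedata.category(ch) in ("Lu", "Lt", "Pc"):
        return True
    return ch in OTHER_ID_START and ch.lower() != ch

for ch in ["A", "\u0410", "\u01C5", "_", "a", "\u00DF", "\u00FF", "\u2118"]:
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), is_variable_start(ch))
```

None of the four listed Other_ID_Start characters has a lowercase
mapping, so the extra clause currently adds nothing, and sharp s and
y-umlaut are still Ll, in line with the conclusion that there is no
actual problem.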

> 
> However, the attached revision to EEP 40 has two recommendations.
> 
> 

-- 

/ Raimo Niskanen, Erlang/OTP, Ericsson AB