[erlang-questions] EEP 40 - A proposal for Unicode variable and atom names in Erlang.
Raimo Niskanen
raimo+erlang-questions@REDACTED
Thu Nov 1 17:52:39 CET 2012
On Thu, Nov 01, 2012 at 06:27:10PM +1300, Richard O'Keefe wrote:
>
>
> On 1/11/2012, at 3:44 AM, Raimo Niskanen wrote:
: :
> >
> > More restricted variable names
> > ------------------------------
> >
> > Nevertheless, I would like a slightly more conservative change in how Erlang
> > should use Unicode in variable names and unquoted atoms.
> >
> > I want to be able to read printed source code on paper and at least
> > understand if Ƽ = count() has a variable, an atom or an integer to the left.
> > This is an impossible goal because today we can put e.g. a Cyrillic А in
> > any .erl file, and that will look as if it should compile, but it will not.
>
> I am a little puzzled here. U+0410 (CYRILLIC CAPITAL LETTER A) looks
> like this: А. I grant you that it is somewhere between exceptionally
> difficult and impossible to tell an A from an А from an Α (Latin
> capital A, Cyrillic, and Greek respectively). But they are all capital
> letters. The point of the proposal is that since А (U+0410) is a
> capital letter, А = count() _should_ compile.
I think that point, which is a good one, did not come through in the
proposal, but your updated version has a very good
rationale that makes it clearer.
>
> If the example had been U+1EFD ỽ (LATIN SMALL LETTER MIDDLE-WELSH V)
> that would have been hard to tell from a six, true.
> But I don't see how this is any different from the fact that in a script
> you don't know, you cannot tell _what_ a character is.
> For example, I had a student this year whose native language was I
> believe Malayalam. I can't tell a Malayalam letter from a digit from
> a punctuation mark.
>
> Did you mean U+0417 (CYRILLIC CAPITAL LETTER ZE) "З", which resembles 3?
>
> Ah! Emacs to the rescue. It's the LATIN CAPITAL LETTER TONE FIVE.
> Nothing to do with Cyrillic.
Sorry, I mixed examples here and sent you down a side track. The TONE FIVE
was an example of not knowing the symbol's general category. The Cyrillic A
was an example of a glyph that looks similar to A in US-ASCII.
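
To make the two separate examples concrete, here is a minimal sketch in
Python, chosen only because its standard unicodedata module exposes
general categories. The function token_kind is hypothetical, and its rule
(a first character in Lu, Lt or Pc starts a variable, an ASCII digit starts
an integer, any other letter starts an atom) is a simplification of what
is discussed in this thread, not the EEP's exact grammar.

```python
import unicodedata

def token_kind(first_char):
    """Hypothetical classifier: guess a token's kind from its first character."""
    cat = unicodedata.category(first_char)
    if cat in ("Lu", "Lt", "Pc"):      # capitals, titlecase, "_" and friends
        return "variable"
    if first_char in "0123456789":     # assumption: only ASCII digits start integers
        return "integer"
    if cat.startswith("L"):            # any remaining letter category
        return "atom"
    return "other"

# Latin A, Cyrillic A, Greek Alpha, TONE FIVE, Middle-Welsh v, digit five
for ch in ["A", "\u0410", "\u0391", "\u01BC", "\u1EFD", "5"]:
    print(f"U+{ord(ch):04X} {unicodedata.category(ch)} {token_kind(ch)}")
```

The three look-alike capital A's differ only in code point; the category,
and hence the token kind under this rule, is the same for all of them.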
>
> Reverting to the Middle Welsh letter, if I cannot tell a small letter
> from a digit, does that mean that every unquoted atom should begin
> with an English letter? (I cannot say "a Latin letter", because
> ỽ _is_ a member of the extended Latin script.)
>
> No, I'm sorry. This is ridiculous. Expecting everybody to begin
> _their_ variables, which you will almost certainly never see,
> with an ASCII letter so _you_ can tell this from that; what sense does
> that make? If it is in a script you cannot read, then you cannot read it.
>
> Can we just try, for a minute or two, to entertain a rather wild idea?
> Here's the idea: most programmers are adults. They can make informed
> choices. If they *want* you to read their code, they are smart enough
> to write in a script you can read. If they decide that it's more
> important to them that _they_ can read comfortably, that's their
> decision to make. If you want a Malayalam-speaker to write code for
> you, put the language (English, Finnish, whatever) in the contract.
>
> I have a confession to make. My multiple-programming-languages to
> multiple-styled-output-formats tool is currently Latin-1 only.
> That's because it's for _me_; nobody paid me to write it and I didn't
> expect anyone else to find it useful (although someone did). It can,
> for example, be configured to generate HTML, and it can be made to
> wrap keywords in <B> and could as easily wrap variables in <U>. It
> would probably take me about a week to revise the thing to use
> Unicode. So then I'd have a tool that could generate printed listings
> with variables underlined, without needing to slap untold numbers of
> people in the face with the notion that they are and must remain
> second-class world citizens.
>
> > So I have to change that requirement into: if it compiles, I want to be able
> > to tell from a non-colour printed source code listing what the semantics are.
>
> You are, in fact, proposing a backwards-incompatible change to Erlang,
> in order to achieve a goal which is not in general achievable, and not
> in my view worth achieving if you could.
>
> Let's be realistic here. If you cannot read any of the words, it is not
> going to do you any good to tell the variables from the atoms from the
> numbers. Let's take an example. I took a snippet of Erlang out of
> the Erlang/OTP release and transliterated the English letters to
> Russian ones. If you _don't_ read the Cyrillic script, precisely what
> good does it do you to know which are the variables? If you _do_ read
> the Cyrillic script, this will seem to you to be complete gibberish,
> so imagine it's a language you don't know.
So here is what seems to be the core question:
I say I want to be able to see the difference between a variable and an
unquoted atom even if I cannot make sense of the variable and atom names.
I say it would be possible to achieve this by enforcing a small set of first
characters for variables. Then we would require a variable to start with
a US-ASCII capital, "_" or "@".
You say that goal of mine is a lost cause because being able to tell a
variable from an atom would be of no use to me anyway, and that trying to
achieve it by making backwards-incompatible changes is plain ridiculous.
Fair enough.
Just adding "@" to the current set of characters allowed to start a variable
would not be a backwards compatible change, or? But it would be ugly to allow
some Latin capitals while not the Latin extended nor Cyrillic etc.
>
> ҵӄҽҲӃҸҾҽ({ҵӄҽҲӃҸҾҽ,ҝҰҼҴ,ҐӁҸӃӈ,Ґӂ0,ҥұ,ҥҳұ}, ҐӃҾҼҜҾҳ, ҢӃ0) ->
> try
> {ҐӂҼ,ҔҽӃӁӈқҰұҴһ,ҢӃ} = ҲҶ_ҵӄҽ(ҥұ, Ґӂ0, ҥҳұ, ҐӃҾҼҜҾҳ, {ҝҰҼҴ,ҐӁҸӃӈ}, ҢӃ0),
> ҕӄҽҲ = {ҵӄҽҲӃҸҾҽ,ҝҰҼҴ,ҐӁҸӃӈ,ҔҽӃӁӈқҰұҴһ,ҐӂҼ},
> {ҕӄҽҲ,ҢӃ}
> catch
> ҒһҰӂӂ:ҔӁӁҾӁ ->
> ҢӃҰҲҺ = ҴӁһҰҽҶ:ҶҴӃ_ӂӃҰҲҺӃӁҰҲҴ(),
> ҸҾ:ҵӆӁҸӃҴ("ҕӄҽҲӃҸҾҽ: ~ӆ/~ӆ\ҽ", [ҝҰҼҴ,ҐӁҸӃӈ]),
> ҴӁһҰҽҶ:ӁҰҸӂҴ(ҒһҰӂӂ, ҔӁӁҾӁ, ҢӃҰҲҺ)
> end.
>
> ҲҶ_ҵӄҽ(қҴӂ, җӅӂ, ҥҳұ, ҐӃҾҼҜҾҳ, ҝҰҼҴҐӁҸӃӈ, ҢӃ0) ->
> {ҕҸ,ҢӃ1} = ҽҴӆ_һҰұҴһ(ҢӃ0),
> {ҕһ,ҢӃ2} = һҾҲҰһ_ҵӄҽҲ_һҰұҴһ(ҝҰҼҴҐӁҸӃӈ, ҢӃ1),
>
> ґҴҵ = ҲһҴҰӁ_ҳҴҰҳ(#ӂӁ{ӁҴҶ=ҵҾһҳһ(fun ({ӅҰӁ,ҥ}, ҡҴҶ) ->
> ҿӄӃ_ӁҴҶ(ҥ, ҡҴҶ)
> end, [], җӅӂ),
> ӂӃҺ=[]}, 0, ҥҳұ),
> {ґ2,_ҐҵӃ,ҢӃ3} = ҲҶ_һҸӂӃ(қҴӂ, 0, ҥҳұ, ґҴҵ,
> ҢӃ2#ҲҶ{ұӃӈҿҴ=ҴӇҸӃ,ұҵҰҸһ=ҕҸ,ҵҸҽҵҾ=ҕҸ,Ҹӂ_ӃҾҿ_ұһҾҲҺ=ӃӁӄҴ}),
> {ҝҰҼҴ,ҐӁҸӃӈ} = ҝҰҼҴҐӁҸӃӈ,
> Ґ = [{һҰұҴһ,ҕҸ},{ҵӄҽҲ_ҸҽҵҾ,ҐӃҾҼҜҾҳ,{ҰӃҾҼ,ҝҰҼҴ},ҐӁҸӃӈ},
> {һҰұҴһ,ҕһ}|ґ2],
> {Ґ,ҕһ,ҢӃ3}.
>
> I don't know about you, but I wouldn't dare to touch this.
> It DOES NOT MATTER TO me which words are variables and which
> are not, because that knowledge is not useful to me.
>
> (By the way, it should now be clear that in a context like this
> you'll _know_ that something is a Cyrillic capital A because
> everything else is Cyrillic -- there are no capital letters in
> keywords -- so what would a Latin capital A be doing there?)
>
> Does that mean there will be Erlang files that I cannot read and
> Raimo Niskanen cannot read? Certainly it does. Does that mean a
> big problem for us? No. Nobody is going to _expect_ us to read
> it. If someone ships us source code we can't read we shan't use
> it.
>
> Is this a NEW problem? No. It is already possible to use some
> surprising languages in ASCII (Klingon, Ancient Egyptian, Greek
> with a little ingenuity, ...) so ever since Erlang began, we've
> had the possibility of entire files being written in words that
> we did not understand. If you don't know what the *functions*
> are about, what good does it do you to know which tokens are
> variables?
>
> I once had to maintain a large chunk of Prolog written by a
> very clever programmer whose idea of good variable naming
> style came from old BASIC (one letter, or one letter and one
> digit). I could see _which_ tokens were the variables, but
> not _what_ the variable names meant. I had to figure it out
> from the predicate names. So from actual experience I can
> tell you
>
> JUST KNOWING WHICH TOKENS ARE VARIABLES IS
> NEXT TO USELESS.
You have a point. Now it is clearer to me.
>
> > I think it is better to restrict to a subset of 7-bit US-ASCII.
>
> Yeah! Let's make Erlang ASCII-only! (Too bad about my father's
> middle name: Æneas. Perfectly good English name, from Latin.)
I was of course talking about the start of a variable, not the
entire language. I am not that stupid. His variable could be
__Æneas, or @Æneas (the latter is unreadable).
>
> > Decent
> > editors have means (vim: ga, emacs: Ctrl-X describe-char) to show which
> > character is under the cursor and if it is A..Z or _ under U+7F it is a
> > variable start.
>
> I'm using Aquamacs.
> From the Aquamacs help:
> Emacs buffers and strings support a large repertoire of
> characters from many different scripts, allowing users to
> type and display text in almost any known written language.
>
> To support this multitude of characters and scripts,
> Emacs closely follows the Unicode Standard.
> It's Meta-X describe-char, not Ctrl-X describe-char,
Yes. Meta-X. My mistake.
> and it works perfectly with Unicode characters.
> Here's sample output:
>
> character: Ҳ (1202, #o2262, #x4b2)
> preferred charset: unicode (Unicode (ISO10646))
> code point: 0x04B2
> syntax: w which means: word
> category: .:Base, y:Cyrillic
> buffer code: #xD2 #xB2
> file code: #xD2 #xB2 (encoded by coding system utf-8)
> display: by this font (glyph code)
> nil:-apple-Lucida_Grande-medium-normal-normal-*-13-*-*-*-p-0-iso10646-1 (#x8A3)
>
> Character code properties: customize what to show
> name: CYRILLIC CAPITAL LETTER HA WITH DESCENDER
> old-name: CYRILLIC CAPITAL LETTER KHA WITH RIGHT DESCENDER
> general-category: Lu (Letter, Uppercase)
>
> Trying this in Vim, it tells me what the numeric codes
> of a letter are, but not that it is a letter.
Yes. I know. I gave the example.
So in Vim you can easily see if the character is less than 128.
But not if it is a letter.
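
For those without describe-char at hand, the same information can be had
from a few lines of Python. This is only a sketch; the function name
describe and the dictionary keys are my own choices, while the calls into
the standard unicodedata module are real.

```python
import unicodedata

def describe(ch):
    """Return code point, UTF-8 bytes, Unicode name and general category."""
    return {
        "code_point": f"U+{ord(ch):04X}",
        "utf8": " ".join(f"{b:02X}" for b in ch.encode("utf-8")),
        "name": unicodedata.name(ch, "<unnamed>"),
        "category": unicodedata.category(ch),
        "below_128": ord(ch) < 128,   # the only thing Vim's ga lets you read off
    }

# The character from the describe-char sample output above.
for key, value in describe("\u04B2").items():
    print(f"{key}: {value}")
```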
>
> >
> > The underscore
> > --------------
> >
> > I would like to argue against allowing all Unicode general category Pc
> > (Connector_Punctuation) characters in place of "_". In Unicode 6.2
> > this class contains these characters:
> > U+5F; LOW LINE
> > U+203F; UNDERTIE
> > U+2040; CHARACTER TIE
> > U+2054; INVERTED UNDERTIE
> > U+FE33; PRESENTATION FORM FOR VERTICAL LOW LINE
> > U+FE34; PRESENTATION FORM FOR VERTICAL WAVY LOW LINE
> > U+FE4D; DASHED LOW LINE
> > U+FE4E; CENTRELINE LOW LINE
> > U+FE4F; WAVY LOW LINE
> > U+FF3F; FULLWIDTH LOW LINE
> >
> > Of these at least U+2040 "⁀" is horizontal at the top of the line
>
> If it looks horizontal, you have a very poor font.
> It's _supposed_ to look more like a c rotated 90 degrees
> clockwise and flattened a bit.
Yes that describes it better. A horizontal flat C, rounded up.
>
> > and U+FE33 "︳" looks like a vertical bar (I guess intended for
> > vertical-flow Chinese) so they do not resemble "_" very much.
>
> Who said they were _supposed_ to resemble "_"?
> Not me.
No. I did, because for me that would indicate the character's purpose.
>
> I can see your point here, but allowing-all-of-Pc *is* the
> Unicode UAX#31 recommendation. We *have* to tailor the
> definition somewhat for the sake of backwards compatibility
> (dots and at signs). We *could* tailor it here, but it is
> definitely advantageous to have at least one more Pc
> character, for reasons given in the EEP.
Sorry, I cannot find those reasons. I find reasons and agree
that if we allow more than "_" we should allow all in Pc,
but I do not see why we need more than "_" other than because
it is UAX#31's recommendation.
>
> > Allowing all these would make it hard to remember if a given
> > character is category Pc or something else e.g "|".
>
> You are not *supposed* to remember what each and every character is.
>
> BECAUSE YOU CAN'T.
>
> If there's anyone who can, I don't want to meet them.
> What _else_ could we talk about?
>
> There are 110,117 defined characters in Unicode 6.2.
> (The figure was 110,116 in Unicode 6.1 and 6.2 added one more.)
> NOBODY is expected to know what all these characters are.
>
> The idea is not
> "if a character is to appear in an Erlang file,
> everybody must know what it means"
> but
> "if someone wants to use their own script in
> an Erlang file, they should be able to do so
> in a way that is generally consistent with
> other programming languages."
>
> The idea that a character should be forbidden unless YOU
> recognise it would take us right back to ASCII or Latin 1.
> Please, do not put the cart before the horse.
>
> It is perfectly acceptable to say "If someone wants to share
> Erlang code with people in other countries, they should use
> characters that all those people recognise." In the 21st
> century it is no longer acceptable to say "nobody may use a
> character unless I remember what it is."
I said I want to be able to understand the semantics without
knowing all characters. Is that a straw man attack?
The wildcard variable is "_" and starting a variable with that
character has a special meaning to the compiler. Why do we need
more aliases for that character?
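
For reference, the characters that would become such aliases can be listed
mechanically. A minimal sketch in Python follows; the function name
connector_punctuation is made up for illustration, and the result depends
on the Unicode version bundled with the Python build, which may be newer
than 6.2.

```python
import sys
import unicodedata

def connector_punctuation():
    """Collect every code point whose general category is Pc."""
    return [chr(cp) for cp in range(sys.maxunicode + 1)
            if unicodedata.category(chr(cp)) == "Pc"]

for ch in connector_punctuation():
    print(f"U+{ord(ch):04X}; {unicodedata.name(ch)}")
```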
> >
> > Unquoted atoms
> > --------------
> >
> > The EEP proposes:
> > atom_start ::= XID_Start ∖ (Lu ∪ Lt ∪ Lo ∪ Pc)
> > | "." (Ll ∪ Lo)
> >
> > I agree that Lu (Uppercase_Letter) and Lt (Titlecase_Letter) should
> > be excluded so an atom can not start with a capital looking letter,
> > but Pc ⊄ XID_Start so there is no reason to subtract it, and why
> > subtract Lo (Other_Letter)?
>
> There is also no *harm* in making it obvious that variables
> *can* start with Pc characters and unquoted atoms *cannot*.
Point taken. I agree.
>
> Why subtract Lo? That was a combination of a backwards compatibility
> issue and an oversight.
>
> The backwards compatibility issue is that
> ªº are Lo characters and are not allowed to begin an Erlang atom.
Would that be an issue? Since they are in Lo should we not start
allowing them?
> The oversight was forgetting that this category was the one with
> most of the characters I wanted to allow.
I guessed so.
>
> This should read
>
> atom_start ::= XID_Start \ (Lu ∪ Lt ∪ "ªº")
> | "." (Ll ∪ Lo)
Ok. Now I get it. But should it not be the same set after a dot
as at the start?
>
> > There also seems to be a typo in the definition of unquoted_atom
> > where an iteration of atom_continue is missing.
> >
> > I propose:
> > unquoted_atom ::= atom_start atom_continue*
>
> Yes.
> >
> > atom_start ::= atom_start_char
> > | "." atom_start_char
>
> That will allow Latin-1 atoms that are not now legal.
> >
> > atom_start_char ::= XID_Start ∖ (Lu ∪ Lt)
> >
> > atom_continue ::= XID_Continue ∪ "@"
> > | "." XID_Continue
>
> That will allow Latin-1 atoms that are not now legal.
>
> > General explanation
> > -------------------
> >
> > I think the EEP could benefit from explaining more about the used character
> > classes, what kind of stability annex #31 is designed to give and such.
> >
> > When I read the EEP it took several days of Unicode standard reading to
> > start understanding, and I think many hesitate before trying to understand
> > the EEP, which is a pity.
>
> Well, yes. Is it my job to repeat all the material in the Unicode
> standard? I don't think so. I mean, the thing's telephone-book size!
No. The rationale in your new version is a great improvement.
Pointers and reasons are what is needed.
> >
> > My first concern was about if I write code for one Unicode Erlang release
> > in the future, will then that code be valid for subsequent Erlang releases
> > based on later Unicode standards.
>
> Yes. Section 1.1 of UAX#31 could hardly be more explicit. Well,
> maybe it could, which is why it points to
> http://www.unicode.org/policies/stability_policy.html
> which says
>
> - Once a character is XID_Continue,
> it must continue to be so in all future versions.
> - If a character is XID_Start then it must also be XID_Continue.
> - Once a character is XID_Start,
> it must continue to be so in all future versions.
>
> amongst other things.
Thank you. The Unicode standard is hard to navigate.
>
> > For example the EEP and my proposal both define atom_start to be XID_Start
> > minus a set containing uppercase and titlecase letters. XID_Start is
> > derived from ID_Start, and ID_Start contains Other_ID_Start. I have failed
> > in finding which codepoints are contained in Other_ID_Start.
>
> To start with, the purpose of Other_ID_Start is to provide stability.
> Any character which _used_ to be an ID_Start but because of some change
> would have ceased to be so will be given that property to compensate.
>
> The properties Other_ID_Start and Other_ID_Continue are listed in
> PropList.txt in the Unicode data base. Here's the current set:
So that's where it is... It is difficult to find out where the
different properties are attached to characters.
>
> # ================================================
>
> 2118 ; Other_ID_Start # Sm SCRIPT CAPITAL P
> 212E ; Other_ID_Start # So ESTIMATED SYMBOL
> 309B..309C ; Other_ID_Start # Sk [2] KATAKANA-HIRAGANA VOICED SOUND MARK..KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK
>
> # Total code points: 4
>
> # ================================================
>
> 00B7 ; Other_ID_Continue # Po MIDDLE DOT
> 0387 ; Other_ID_Continue # Po GREEK ANO TELEIA
> 1369..1371 ; Other_ID_Continue # No [9] ETHIOPIC DIGIT ONE..ETHIOPIC DIGIT NINE
> 19DA ; Other_ID_Continue # No NEW TAI LUE THAM DIGIT ONE
>
> # Total code points: 12
>
> > But since we here define atom_start as above, moving a character from Lu
> > or Lt into Other_ID_Start will remove it from atom_start and old code
> > using it will not compile.
>
>
> Lu and Lt are "General Categories". Other_ID_Start is a "property".
>
> OK, now we've got a genuine technical problem.
>
> The set of characters that can begin a variable-OR-an-unquoted-atom
> can only grow. That much stability we're promised.
>
> If a character changes from Lu to Lt or Other_ID_Start,
> no problem. If a character changes from Lt to Lu or
> Other_ID_Start, no problem. But if a character changes
> from Lu/Lt to Ll/Lo or vice versa, we have a problem.
I agree that moving a character from Lu or Lt to Other_ID_Start would
increase the set of atom_start characters.
For the characters "ªº" you above called that a backwards compatibility
issue, which I doubt it is. Ignoring that issue would simplify atom_start.
I think I still see a problem, though:
    unquoted_atom ::= atom_start atom_continue*
    atom_start    ::= XID_Start \ (Lu ∪ Lt ∪ Pc ∪ "ªº")
                    | "." (Ll ∪ Lo)
    atom_continue ::= XID_Continue | "@"
                    | "." (Ll ∪ Lo)
Where XID_Start is practically:
    (Lu ∪ Ll ∪ Lt ∪ Lm ∪ Lo ∪ Nl ∪ Other_ID_Start)
        \ Pattern_Syntax \ Pattern_White_Space
If a character moves from Ll or Lo to Other_ID_Start it will suddenly
become not allowed after a ".". Right?
Should not the set after a "." be about the same as at the start?
    unquoted_atom   ::= atom_start atom_continue*
    atom_start      ::= atom_start_char | "." atom_start_char
    atom_continue   ::= XID_Continue | "@" | "." atom_start_char
    atom_start_char ::= XID_Start \ (Lu ∪ Lt ∪ Pc ∪ "ªº")
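
To check that this last grammar behaves as intended, here is a rough
recognizer sketched in Python. It rests on assumptions: str.isidentifier()
is used as a stand-in for XID_Start and XID_Continue (as far as I
understand, Python defines its identifiers through those properties, with
"_" additionally allowed at the start), and all function names are
invented for this illustration.

```python
import unicodedata

def is_xid_start(ch):
    # Assumption: a one-character Python identifier is XID_Start or "_".
    return ch != "_" and ch.isidentifier()

def is_xid_continue(ch):
    # Assumption: a character is XID_Continue if it may follow "a".
    return ("a" + ch).isidentifier()

def is_atom_start_char(ch):
    # atom_start_char ::= XID_Start \ (Lu ∪ Lt ∪ Pc ∪ "ªº")
    return (is_xid_start(ch)
            and unicodedata.category(ch) not in ("Lu", "Lt", "Pc")
            and ch not in "ªº")

def is_unquoted_atom(s):
    # unquoted_atom ::= atom_start atom_continue*
    i, first = 0, True
    while i < len(s):
        ch = s[i]
        if ch == ".":
            # "." atom_start_char, both at the start and later on
            if i + 1 >= len(s) or not is_atom_start_char(s[i + 1]):
                return False
            i += 2
        elif first:
            if not is_atom_start_char(ch):
                return False
            i += 1
        else:
            if not (ch == "@" or is_xid_continue(ch)):
                return False
            i += 1
        first = False
    return len(s) > 0

for word in ["foo", "Foo", "foo.bar", "foo.Bar", "node@host", "\u1EFDx", "\u0410x"]:
    print(word, is_unquoted_atom(word))
```

With a single atom_start_char used in both places, a character that is
accepted at the start is also accepted after a ".".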
>
> Perhaps we can appeal to this:
> Once a character is encoded, its properties may still be
> changed, but not in such a way as to change the fundamental
> identity of the character.
> ...
> For example, the representative glyph for U+0041 “A”
> cannot be changed to “B”; the General_Category for
> U+0041 “A” cannot be changed to Ll (lowercase letter)
> ...
>
> Case Pair stability _nearly_ gives us what we want.
> If two characters form a case pair in a version of Unicode,
> they will remain a case pair in each subsequent version of Unicode.
>
> If two characters do not form a case pair in a version of Unicode,
> they will never become a case pair in any subsequent version of Unicode.
> That is, if "D" and "d" are unequal defined characters such that
> lower("D") = "d" and upper("d") = "D", then this will remain true.
> This means that
> If "D" is an Lu character now and "d" the corresponding Ll
> character, they are going to remain a case pair.
> So we could fiddle a bit and say
> Lu + Lt + Pc + (Other_ID_Start such that lower(x) != x)
> is what we're after.
>
> This doesn't handle the situation where there is a cased letter now
> but not its case opposite, as Latin-1 had y-umlaut and sharp s as
> lower case letters with no upper case version. But when case opposites
> for them did go into Unicode, they didn't change.
>
> I don't think we actually have a problem.
I think you are right.
>
> However, the attached revision to EEP 40 has two recommendations.
>
>
--
/ Raimo Niskanen, Erlang/OTP, Ericsson AB