[erlang-questions] EEP 40 - A proposal for Unicode variable and atom names in Erlang.

Thu Nov 1 13:36:13 CET 2012

I've looked through the proposal and don't understand why there are no proposal to add localized keywords?

Suppose I will be using atoms and variables that are easy to read in my own language. Then I'll definitely be frustrated if I have to write keywords in any other language. More than that, it will be very annoying to anyone who has to switch keyboard layout from English to native.

-- 
Dmitry Belyaev

On 01.11.2012, at 9:27, Richard O'Keefe wrote:

> <eep-0040.md>
> 
> On 1/11/2012, at 3:44 AM, Raimo Niskanen wrote:
>> 
>> Was it not ment to be:
>>   var_start ::= (XID_Start ∩ (Lu ∪ Lt ∪ Other_ID_Start)) ∪ Pc
> 
> Yes.  I made a mistake there.
>> 
>> More restricted variable names
>> ------------------------------
>> 
>> Nevertheless, I would like a slightly more conservative change in how Erlang
>> should use Unicode in variable names and unquoted atoms.
>> 
>> I want to be able to read printed source code on a paper and at least
>> understand if Ƽ = count() has a variable, an atom or an integer to the left.
>> This is an impossible goal because we can today e.g Cyrillic А in any .erl
>> file and that will look as it should compile but it will not.
> 
> I am a little puzzled here.  U+0410 (CYRILLIC CAPITAL LETTER A) looks
> like this:  А.  I grant you that it is somewhere between exceptionally
> difficult and impossible to tell an A from an А from an Α (Latin
> capital A, Cyrillic, and Greek respectively).  But they are all capital
> letters.  The point of the proposal is that since А (U+0410) is a
> capital letter, А = count() _should_ compile.
> 
> If the example had been U+1EFD ỽ (LATIN SMALL LETTER MIDDLE-WELSH V)
> that would have been hard to tell from a six, true.
> But I don't see how this is any different from the fact that in a script
> you don't know, you cannot tell _what_ a character is.
> For example, I had a student this year whose native language was I
> believe Malayalam.  I can't tell a Malayalam letter from a digit from
> a punctuation mark.
> 
> Did you mean U+0417 (CYRILLIC CAPITAL LETTER ZE) "З", which resembles 3?
> 
> Ah!  Emacs to the rescue.  It's the LATIN CAPITAL LETTER TONE FIVE.
> Nothing to do with Cyrillic.
> 
> Reverting to the Middle Welsh letter, if I cannot tell a small letter
> from a digit, does that mean that every unquoted atom should begin
> with an English letter?  (I cannot say "a Latin letter", because
> ỽ _is_ a member of the extended Latin script.)
> 
> No, I'm sorry.  This is ridiculous.  Expecting everybody to begin
> _their_ variables which you will almost certainly never see to begin
> with an ASCII letter so _you_ can tell this from that; what sense does
> that make?  If it is in a script you cannot read, then you cannot read it.
> 
> Can we just try, for a minute or to, to entertain a rather wild idea?
> Here's the idea:  most programmers are adults.  They can make informed
> choices.  If they *want* you to read their code, they are smart enough
> to write in a script you can read.  If they decide that it's more
> important to them that _they_ can read comfortably, that's their
> decision to make.  If you want a Malayalam-speaker to write code for
> you, put the language (English, Finnish, whatever) in the contract.
> 
> I have a confession to make.  My multiple-programming-languages to
> multiple-styled-output-formats tool is currently Latin-1 only.
> That's because it's for _me_; nobody paid me to write it and I didn't
> expect anyone else to find it useful (although someone did).  It can,
> for example, be configured to generate HTML, and it can be made to
> wrap keywords in <B> and could as easily wrap variables in <U>.  It
> would probably take me about a week to revised the thing to use
> Unicode.  So then I'd have a tool that could generate printed listings
> with variables underlined, without needing to slap untold numbers of
> people in the face with the notion that they are and must remain
> second-class world citizens.
> 
>> So I have to change that requirement into; if it compiles I want to be able
>> to tell from a noncolour printed source code listing what the semantics is.
> 
> You are, in fact, proposing a backwards-incompatible change to Erlang,
> in order to achieve a goal which is not in general achievable, and not
> in my view worth achieving if you could.
> 
> Let's be realistic here.  If you cannot read any of the words, it is not
> going to do you any good to tell the variables from the atoms from the
> numbers.  Let's take an example.  I took a snippet of Erlang out of
> the Erlang/OTP release and transliterated the English letters to
> Russian ones.  If you _don't_ read the Cyrillic script, precisely what
> good does it do you to know which are the variables?  If you _do_ read
> the Cyrillic script, this will seem to you to be complete gibberish,
> so imagine it's a language you don't know.
> 
> ҵӄҽҲӃҸҾҽ({ҵӄҽҲӃҸҾҽ,ҝҰҼҴ,ҐӁҸӃӈ,Ґӂ0,ҥұ,ҥҳұ}, ҐӃҾҼҜҾҳ, ҢӃ0) ->
>    try
>        {ҐӂҼ,ҔҽӃӁӈқҰұҴһ,ҢӃ} = ҲҶ_ҵӄҽ(ҥұ, Ґӂ0, ҥҳұ, ҐӃҾҼҜҾҳ, {ҝҰҼҴ,ҐӁҸӃӈ}, ҢӃ0),
>        ҕӄҽҲ = {ҵӄҽҲӃҸҾҽ,ҝҰҼҴ,ҐӁҸӃӈ,ҔҽӃӁӈқҰұҴһ,ҐӂҼ},
>        {ҕӄҽҲ,ҢӃ}
>    catch
>        ҒһҰӂӂ:ҔӁӁҾӁ ->
>            ҢӃҰҲҺ = ҴӁһҰҽҶ:ҶҴӃ_ӂӃҰҲҺӃӁҰҲҴ(),
>            ҸҾ:ҵӆӁҸӃҴ("ҕӄҽҲӃҸҾҽ: ~ӆ/~ӆ\ҽ", [ҝҰҼҴ,ҐӁҸӃӈ]),
>            ҴӁһҰҽҶ:ӁҰҸӂҴ(ҒһҰӂӂ, ҔӁӁҾӁ, ҢӃҰҲҺ)
>    end.
> 
> ҲҶ_ҵӄҽ(қҴӂ, җӅӂ, ҥҳұ, ҐӃҾҼҜҾҳ, ҝҰҼҴҐӁҸӃӈ, ҢӃ0) ->
>    {ҕҸ,ҢӃ1} = ҽҴӆ_һҰұҴһ(ҢӃ0),
>    {ҕһ,ҢӃ2} = һҾҲҰһ_ҵӄҽҲ_һҰұҴһ(ҝҰҼҴҐӁҸӃӈ, ҢӃ1),
> 
>    ґҴҵ = ҲһҴҰӁ_ҳҴҰҳ(#ӂӁ{ӁҴҶ=ҵҾһҳһ(fun ({ӅҰӁ,ҥ}, ҡҴҶ) ->
>                                           ҿӄӃ_ӁҴҶ(ҥ, ҡҴҶ)
>                                     end, [], җӅӂ),
>                        ӂӃҺ=[]}, 0, ҥҳұ),
>    {ґ2,_ҐҵӃ,ҢӃ3} = ҲҶ_һҸӂӃ(қҴӂ, 0, ҥҳұ, ґҴҵ,
>       ҢӃ2#ҲҶ{ұӃӈҿҴ=ҴӇҸӃ,ұҵҰҸһ=ҕҸ,ҵҸҽҵҾ=ҕҸ,Ҹӂ_ӃҾҿ_ұһҾҲҺ=ӃӁӄҴ}),
>    {ҝҰҼҴ,ҐӁҸӃӈ} = ҝҰҼҴҐӁҸӃӈ,
>    Ґ = [{һҰұҴһ,ҕҸ},{ҵӄҽҲ_ҸҽҵҾ,ҐӃҾҼҜҾҳ,{ҰӃҾҼ,ҝҰҼҴ},ҐӁҸӃӈ},
>         {һҰұҴһ,ҕһ}|ґ2],
>    {Ґ,ҕһ,ҢӃ3}.
> 
> I don't know about you, but I wouldn't dare to touch this.
> It DOES NOT MATTER TO me which words are variables and which
> are not, because that knowledge is not useful to me.
> 
> (By the way, it should now be clear that in a context like this
> you'll _know_ that something is a Cyrillic capital A because
> everything else is Cyrillic -- there are no capital letters in
> keywords -- so what would a Latin capital A be doing there?)
> 
> Does that mean there will be Erlang files that I cannot read and
> Raimo Niskanen cannot read?  Certainly it does. Does that mean a
> big problem for us?  No.  Nobody is going to _expect_ us to read
> it.  If someone ships us source code we can't read we shan't use
> it.
> 
> Is this a NEW problem?  No.  It is already possible to use some
> surprising languages in ASCII (Klingon, Ancient Egyptian, Greek
> with a little ingenuity, ...) so ever since Erlang began, we've
> had the possibility of entire files being written in words that
> we did not understand.  If you don't know what the *functions*
> are about, what good does it do you to know which tokens are
> variables?
> 
> I once had to maintain a large chunk of Prolog written by a
> very clever programmer whose idea of good variable naming
> style came from old BASIC (one letter, or one letter and one
> digit).  I could see _which_ tokens were the variables, but
> not _what_ the variable names meant.  I had to figure it out
> from the predicate names.  So from actual experience I can
> tell you
> 
> 	JUST KNOWING WHICH TOKENS ARE VARIABLES IS
> 	NEXT TO USELESS.
> 
>> I think it is better to restrict to a subset of 7-bit US-ASCII.
> 
> Yeah!  Let's make Erlang ASCII-only!  (Too bad about my father's
> middle name: Æneas.  Perfectly good English name, from Latin.)
> 
>> Decent
>> editors have means (vim: ga, emacs: Ctrl-X describe-char) to show which
>> character is under the cursor and if it is A..Z or _ under U+7F it is a
>> variable start.
> 
> I'm using Aquamacs.
> From the Aquamacs help:
> 	Emacs buffers and strings support a large repertoire of 
> 	characters from many different scripts, allowing users to
> 	type and display text in almost any known written language.
> 
> 	To support this multitude of characters and scripts,
> 	Emacs closely follows the Unicode Standard.
> It's Meta-X describe-char, not Ctrl-X describe-char,
> and it works perfectly with Unicode characters.
> Here's sample output:
> 
>        character: Ҳ (1202, #o2262, #x4b2)
> preferred charset: unicode (Unicode (ISO10646))
>       code point: 0x04B2
>           syntax: w 	which means: word
>         category: .:Base, y:Cyrillic
>      buffer code: #xD2 #xB2
>        file code: #xD2 #xB2 (encoded by coding system utf-8)
>          display: by this font (glyph code)
>    nil:-apple-Lucida_Grande-medium-normal-normal-*-13-*-*-*-p-0-iso10646-1 (#x8A3)
> 
> Character code properties: customize what to show
>  name: CYRILLIC CAPITAL LETTER HA WITH DESCENDER
>  old-name: CYRILLIC CAPITAL LETTER KHA WITH RIGHT DESCENDER
>  general-category: Lu (Letter, Uppercase)
> 
> Trying this in Vim, it tells me what the numeric codes
> of a letter are, but not that it is a letter.
> 
>> 
>> The underscore
>> --------------
>> 
>> I would like to argue against allowing all Unicode general category Pc
>> (Connector_Punctuation) character in place of "_". This class contain
>> in Unicode 6.2 these characters:
>>   U+5F;   LOW LINE
>>   U+2034; UNDERTIE
>>   U+2040; CHARACTER TIE
>>   U+2054; INVERTED UNDERTIE
>>   U+FE33; PRESENTATION FORM FOR VERTICAL LOW LINE
>>   U+FE33; PRESENTATION FORM FOR VERTICAL WAVY LOW LINE
>>   U+FE4D; DASHED LOW LINE
>>   U+FE4E; CENTERLINE LOW LINE
>>   U+FE4F; WAVY LOW LINE
>>   U+FF3F; FULLWIDTH LOW LINE
>> 
>> Of these at least U+2040 "⁀" is horizontal at the top of the line
> 
> If it looks horizontal, you have a very poor font.
> It's _supposed_ to look more like a c rotated 90 degrees
> clockwise and flattened a bit.
> 
>> and U+FE33 "︳" looks like a vertical bar (I guess intended for
>> vertical flow chinese) so they do not resemble "_" very much.
> 
> Who said they were _supposed_ to resemble "_"?
> Not me.
> 
> I can see your point here, but allowing-all-of-Pc *is* the
> Unicode UAX#31 recommendation.    We *have* to tailor the
> definition somewhat for the sake of backwards compatibility
> (dots and at signs).  We *could* tailor it here, but it is
> definitely advantageous to have at least one more Pc
> character reasons given in the EEP.
> 
>> Allowing all these would make it hard to remember if a given
>> character is category Pc or something else e.g "|".
> 
> You are not *supposed* to remember what each and every character is.
> 
> BECAUSE YOU CAN'T.
> 
> If there's anyone who can, I don't want to meet them.
> What _else_ could we talk about?
> 
> There are 110,117 defined characters in Unicode 6.2.
> (The figure was 110,116 in Unicode 6.1 and 6.2 added one more.)
> NOBODY is expected to know what all these characters are.
> 
> The idea is not
> 	"if a character is to appear in an Erlang file,
> 	 everybody must know what it means"
> but
> 	"if someone wants to use their own script in
> 	 an Erlang file, they should be able to do so
> 	 in a way that is generally consistent with
> 	 other programming languages."
> 
> The idea that a character should be forbidden unless YOU
> recognise it would take us right back to ASCII or Latin 1.
> Please, do not put the cart before the horse.
> 
> It is perfectly acceptable to say "If someone wants to share
> Erlang code with people in other countries, they should use
> characters that all those people recognise."  In the 21st
> century it is no longer acceptable to say "nobody may use a
> character unless I remember what it is."
>> 
>> Unquoted atoms
>> --------------
>> 
>> The EEP proposes:
>>   atom_start ::= XID_Start ∖ (Lu ∪ Lt ∪ Lo ∪ Pc)
>> 	| "." (Ll ∪ Lo)
>> 
>> I agree that Lu (Uppercase_Letter) and Lt (Titlecase_Letter) should
>> be excluded so an atom can not start with a capital looking letter,
>> but Pc ⊄ XID_Start so there is no reason to subtract it, and why
>> subtract Lo (Other_Letter)?
> 
> There is also no *harm* in making it obvious that variables
> *can* start with Pc characters and unquoted atoms *cannot*.
> 
> Why subtract Lo?  That was a combination of a backwards compatibility
> issue and an oversight.
> 
> The backwards compatibility issue is that
> ªº are Lo characters and are not allowed to begin an Erlang atom.
> The oversight was forgetting that this category was the one with
> most of the characters I wanted to allow.
> 
> This should read
> 
>    atom_start ::= XID_Start \ (Lu ∪ Lt ∪ "ªº")
>                |  "." (Ll ∪ Lo)
> 
>> There also seems to be a typo in the definition of unquoted_atom
>> where an iteration of atom_continue is missing.
>> 
>> I propose:
>>   unquoted_atom ::= atom_start atom_continue*
> 
> Yes.
>> 
>>   atom_start ::= atom_start_char
>> 	| "." atom_start_char
> 
> That will allow Latin-1 atoms that are not now legal.
>> 
>>   atom_start_char ::= XID_Start ∖ (Lu ∪ Lt)
>> 
>>   atom_continue ::= XID_Continue ∪ "@"
>> 	| "." XID_Continue
> 
> That will allow Latin-1 atoms that are not now legal.
> 
>> General explanation
>> -------------------
>> 
>> I think the EEP could benefit from explaining more about the used character
>> classes, what kind of stability annex #31 is designed to give and such.
>> 
>> When I did read the EEP it took several days of Unicode standard reading to
>> start understanding, and I think many hesitate before trying to understand
>> the EEP, which is a pity.
> 
> Well, yes.  Is it my job to repeat all the material in the Unicode
> standard?  I don't think so.  I mean, the thing's telephone-book size!
>> 
>> My first concern was about if I write code for one Unicode Erlang release
>> in the future, will then that code be valid for subsequent Erlang releases
>> based on later Unicode standards.
> 
> Yes.  Section 1.1 of UAX#31 could hardly be more explicit.   Well,
> maybe it could, which is why it points to
> http://www.unicode.org/policies/stability_policy.html
> which says
> 
> - Once a character is XID_Continue,
>   it must continue to be so in all future versions.
> - If a character is XID_Start then it must also be XID_Continue.
> - Once a character is XID_Start,
>   it must continue to be so in all future versions.
> 
> amongst other things.
> 
>> For example the EEP and my proposal both define atom_start to be XID_Start
>> minus a set containing uppercase and titlecase letters. XID_Start is
>> derived from ID_Start, and ID_Start contains Other_ID_Start. I have failed
>> in finding which codepoints are contained in Other_ID_Start.
> 
> To start with, the purpose of Other_ID_Start is to provide stability.
> Any character which _used_ to be an ID_Start but because of some change
> would have ceased to be so will be given that property to compensate.
> 
> The properties Other_ID_Start and Other_ID_Continue are listed in
> Proplist.txt in the Unicode data base.  Here's the current set:
> 
> # ================================================
> 
> 2118          ; Other_ID_Start # Sm       SCRIPT CAPITAL P
> 212E          ; Other_ID_Start # So       ESTIMATED SYMBOL
> 309B..309C    ; Other_ID_Start # Sk   [2] KATAKANA-HIRAGANA VOICED SOUND MARK..KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK
> 
> # Total code points: 4
> 
> # ================================================
> 
> 00B7          ; Other_ID_Continue # Po       MIDDLE DOT
> 0387          ; Other_ID_Continue # Po       GREEK ANO TELEIA
> 1369..1371    ; Other_ID_Continue # No   [9] ETHIOPIC DIGIT ONE..ETHIOPIC DIGIT NINE
> 19DA          ; Other_ID_Continue # No       NEW TAI LUE THAM DIGIT ONE
> 
> # Total code points: 12
> 
>> But since we here define atom_start as above, moving a character from Lu
>> or Lt into Other_ID_Start will remove it from atom_start and old code
>> using it will not compile.
> 
> 
> Lu and Lt are "General Categories".  Other_ID_Start is a "property".
> 
> OK, now we've got a genuine technical problem.
> 
> The set of characters that can begin a variable-OR-an-unquoted-atom
> can only grow.  That much stability we're promised.
> 
> If a character changes from Lu to Lt or Other_ID_Start,
> no problem.  If a character changes from Lt to Lu or
> Other_ID_Start, no problem.  But if a character changes
> from Lu/Lt to Ll/Lo or vice versa, we have a problem.
> 
> Perhaps we can appeal to this:
> 	Once a character is encoded, its properties may still be
> 	changed, but not in such a way as to change the fundamental
> 	identity of the character.
> 	...
> 	For example, the representative glyph for U+0061 “A”
> 	cannot be changed to “B”; the General_Category for
> 	U+0061 “A” cannot be changed to Ll (lowercase letter)
> 	...
> 
> Case Pair stability _nearly_ gives us what we want.
> 	If two characters form a case pair in a version of Unicode,
> 	they will remain a case pair in each subsequent version of Unicode.
> 
> 	If two characters do not form a case pair in a version of Unicode,
> 	they will never become a case pair in any subsequent version of Unicode.
> That is, if "D" and "d" are unequal defined characters such that
> lower("D") = "d" and upper("d") = "D", then this will remain true.
> This means that
> 	If "D" is an Lu character now and "d" the corresponding Ll
> 	character, they are going to remain a case pair.
> So we could fiddle a bit and say
> 	Lu + Lt + Pc + (Other_ID_Start such that lower(x) != x)
> is what we're after.
> 
> This doesn't handle the situation where there is a cased letter now
> but not its case opposite, as Latin-1 had y-umlaut and sharp s as
> lower case letters with no upper case version.  But when case opposites
> for them did go into Unicode, they didn't change.
> 
> I don't think we actually have a problem.
> 
> However, the attached revision to EEP 40 has two recommendations.
> 
> 
> _______________________________________________
> erlang-questions mailing list
> erlang-questions@REDACTED
> http://erlang.org/mailman/listinfo/erlang-questions