[erlang-questions] EEP 40 - A proposal for Unicode variable and atom names in Erlang.
Dmitry Belyaev
be.dmitry@REDACTED
Thu Nov 1 13:36:13 CET 2012
I've looked through the proposal and don't understand why there is no proposal to add localized keywords.
Suppose I use atoms and variables that are easy to read in my own language. Then I'll definitely be frustrated if I have to write keywords in another language. More than that, it will be very annoying for anyone who has to keep switching keyboard layouts between English and their native language.
--
Dmitry Belyaev
On 01.11.2012, at 9:27, Richard O'Keefe wrote:
> <eep-0040.md>
>
> On 1/11/2012, at 3:44 AM, Raimo Niskanen wrote:
>>
>> Was it not meant to be:
>> var_start ::= (XID_Start ∩ (Lu ∪ Lt ∪ Other_ID_Start)) ∪ Pc
>
> Yes. I made a mistake there.
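The corrected rule can be checked one character at a time. What follows is a minimal Python sketch under several assumptions: `str.isidentifier()` on a single character stands in for XID_Start (Python also accepts "_" there, so it is excluded by hand), the Other_ID_Start set is hard-coded from the four code points listed later in this thread, and the answers depend on the Unicode version bundled with the Python interpreter. The names `is_xid_start` and `is_var_start` are illustrative only.

```python
import unicodedata

# Other_ID_Start as listed in this thread (Unicode 6.2); later versions may add more.
OTHER_ID_START = {0x2118, 0x212E, 0x309B, 0x309C}

def is_xid_start(ch):
    """Approximate XID_Start: Python lets an identifier start with XID_Start or '_'."""
    return ch != "_" and ch.isidentifier()

def is_var_start(ch):
    """var_start ::= (XID_Start ∩ (Lu ∪ Lt ∪ Other_ID_Start)) ∪ Pc"""
    cat = unicodedata.category(ch)
    if cat == "Pc":
        return True
    return is_xid_start(ch) and (cat in ("Lu", "Lt") or ord(ch) in OTHER_ID_START)

if __name__ == "__main__":
    # Latin A, Cyrillic A, Latin tone five, underscore, small a, digit five
    for ch in ["A", "\u0410", "\u01BC", "_", "a", "5"]:
        print(f"U+{ord(ch):04X} {is_var_start(ch)}")
```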
>>
>> More restricted variable names
>> ------------------------------
>>
>> Nevertheless, I would like a slightly more conservative change in how Erlang
>> should use Unicode in variable names and unquoted atoms.
>>
>> I want to be able to read printed source code on a paper and at least
>> understand if Ƽ = count() has a variable, an atom or an integer to the left.
>> This is an impossible goal because today we can put e.g. a Cyrillic А in any .erl
>> file, and it will look as if it should compile, but it will not.
>
> I am a little puzzled here. U+0410 (CYRILLIC CAPITAL LETTER A) looks
> like this: А. I grant you that it is somewhere between exceptionally
> difficult and impossible to tell an A from an А from an Α (Latin
> capital A, Cyrillic, and Greek respectively). But they are all capital
> letters. The point of the proposal is that since А (U+0410) is a
> capital letter, А = count() _should_ compile.
>
> If the example had been U+1EFD ỽ (LATIN SMALL LETTER MIDDLE-WELSH V)
> that would have been hard to tell from a six, true.
> But I don't see how this is any different from the fact that in a script
> you don't know, you cannot tell _what_ a character is.
> For example, I had a student this year whose native language was, I
> believe, Malayalam. I can't tell a Malayalam letter from a digit from
> a punctuation mark.
>
> Did you mean U+0417 (CYRILLIC CAPITAL LETTER ZE) "З", which resembles 3?
>
> Ah! Emacs to the rescue. It's the LATIN CAPITAL LETTER TONE FIVE.
> Nothing to do with Cyrillic.
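The same lookup can be done outside an editor. A small Python sketch using the standard `unicodedata` module prints the general category and name of the look-alike characters mentioned in this thread; the helper name `describe` is made up for illustration, and the output reflects whatever Unicode version the interpreter ships with.

```python
import unicodedata

# Characters from this thread that are easy to confuse on paper:
# Latin A, Cyrillic A, Greek Alpha, Latin tone five, Middle-Welsh v, Cyrillic Ze.
LOOKALIKES = ["A", "\u0410", "\u0391", "\u01BC", "\u1EFD", "\u0417"]

def describe(ch):
    """Return (code point label, general category, character name)."""
    label = f"U+{ord(ch):04X}"
    return (label, unicodedata.category(ch), unicodedata.name(ch, "<unnamed>"))

if __name__ == "__main__":
    for ch in LOOKALIKES:
        print(*describe(ch))
```

Only the category column matters for the variable-versus-atom question: the capitals report Lu whatever their script, and the Middle-Welsh letter reports Ll.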
>
> Reverting to the Middle Welsh letter, if I cannot tell a small letter
> from a digit, does that mean that every unquoted atom should begin
> with an English letter? (I cannot say "a Latin letter", because
> ỽ _is_ a member of the extended Latin script.)
>
> No, I'm sorry. This is ridiculous. Expecting everybody to begin
> _their_ variables, which you will almost certainly never see, with
> an ASCII letter so _you_ can tell this from that: what sense does
> that make? If it is in a script you cannot read, then you cannot read it.
>
> Can we just try, for a minute or two, to entertain a rather wild idea?
> Here's the idea: most programmers are adults. They can make informed
> choices. If they *want* you to read their code, they are smart enough
> to write in a script you can read. If they decide that it's more
> important to them that _they_ can read comfortably, that's their
> decision to make. If you want a Malayalam-speaker to write code for
> you, put the language (English, Finnish, whatever) in the contract.
>
> I have a confession to make. My multiple-programming-languages to
> multiple-styled-output-formats tool is currently Latin-1 only.
> That's because it's for _me_; nobody paid me to write it and I didn't
> expect anyone else to find it useful (although someone did). It can,
> for example, be configured to generate HTML, and it can be made to
> wrap keywords in <B> and could as easily wrap variables in <U>. It
> would probably take me about a week to revise the thing to use
> Unicode. So then I'd have a tool that could generate printed listings
> with variables underlined, without needing to slap untold numbers of
> people in the face with the notion that they are and must remain
> second-class world citizens.
>
>> So I have to change that requirement into: if it compiles, I want to be able
>> to tell from a noncolour printed source code listing what the semantics are.
>
> You are, in fact, proposing a backwards-incompatible change to Erlang,
> in order to achieve a goal which is not in general achievable, and not
> in my view worth achieving if you could.
>
> Let's be realistic here. If you cannot read any of the words, it is not
> going to do you any good to tell the variables from the atoms from the
> numbers. Let's take an example. I took a snippet of Erlang out of
> the Erlang/OTP release and transliterated the English letters to
> Russian ones. If you _don't_ read the Cyrillic script, precisely what
> good does it do you to know which are the variables? If you _do_ read
> the Cyrillic script, this will seem to you to be complete gibberish,
> so imagine it's a language you don't know.
>
> ҵӄҽҲӃҸҾҽ({ҵӄҽҲӃҸҾҽ,ҝҰҼҴ,ҐӁҸӃӈ,Ґӂ0,ҥұ,ҥҳұ}, ҐӃҾҼҜҾҳ, ҢӃ0) ->
> try
> {ҐӂҼ,ҔҽӃӁӈқҰұҴһ,ҢӃ} = ҲҶ_ҵӄҽ(ҥұ, Ґӂ0, ҥҳұ, ҐӃҾҼҜҾҳ, {ҝҰҼҴ,ҐӁҸӃӈ}, ҢӃ0),
> ҕӄҽҲ = {ҵӄҽҲӃҸҾҽ,ҝҰҼҴ,ҐӁҸӃӈ,ҔҽӃӁӈқҰұҴһ,ҐӂҼ},
> {ҕӄҽҲ,ҢӃ}
> catch
> ҒһҰӂӂ:ҔӁӁҾӁ ->
> ҢӃҰҲҺ = ҴӁһҰҽҶ:ҶҴӃ_ӂӃҰҲҺӃӁҰҲҴ(),
> ҸҾ:ҵӆӁҸӃҴ("ҕӄҽҲӃҸҾҽ: ~ӆ/~ӆ\ҽ", [ҝҰҼҴ,ҐӁҸӃӈ]),
> ҴӁһҰҽҶ:ӁҰҸӂҴ(ҒһҰӂӂ, ҔӁӁҾӁ, ҢӃҰҲҺ)
> end.
>
> ҲҶ_ҵӄҽ(қҴӂ, җӅӂ, ҥҳұ, ҐӃҾҼҜҾҳ, ҝҰҼҴҐӁҸӃӈ, ҢӃ0) ->
> {ҕҸ,ҢӃ1} = ҽҴӆ_һҰұҴһ(ҢӃ0),
> {ҕһ,ҢӃ2} = һҾҲҰһ_ҵӄҽҲ_һҰұҴһ(ҝҰҼҴҐӁҸӃӈ, ҢӃ1),
>
> ґҴҵ = ҲһҴҰӁ_ҳҴҰҳ(#ӂӁ{ӁҴҶ=ҵҾһҳһ(fun ({ӅҰӁ,ҥ}, ҡҴҶ) ->
> ҿӄӃ_ӁҴҶ(ҥ, ҡҴҶ)
> end, [], җӅӂ),
> ӂӃҺ=[]}, 0, ҥҳұ),
> {ґ2,_ҐҵӃ,ҢӃ3} = ҲҶ_һҸӂӃ(қҴӂ, 0, ҥҳұ, ґҴҵ,
> ҢӃ2#ҲҶ{ұӃӈҿҴ=ҴӇҸӃ,ұҵҰҸһ=ҕҸ,ҵҸҽҵҾ=ҕҸ,Ҹӂ_ӃҾҿ_ұһҾҲҺ=ӃӁӄҴ}),
> {ҝҰҼҴ,ҐӁҸӃӈ} = ҝҰҼҴҐӁҸӃӈ,
> Ґ = [{һҰұҴһ,ҕҸ},{ҵӄҽҲ_ҸҽҵҾ,ҐӃҾҼҜҾҳ,{ҰӃҾҼ,ҝҰҼҴ},ҐӁҸӃӈ},
> {һҰұҴһ,ҕһ}|ґ2],
> {Ґ,ҕһ,ҢӃ3}.
>
> I don't know about you, but I wouldn't dare to touch this.
> It DOES NOT MATTER TO me which words are variables and which
> are not, because that knowledge is not useful to me.
>
> (By the way, it should now be clear that in a context like this
> you'll _know_ that something is a Cyrillic capital A because
> everything else is Cyrillic -- there are no capital letters in
> keywords -- so what would a Latin capital A be doing there?)
>
> Does that mean there will be Erlang files that I cannot read and
> Raimo Niskanen cannot read? Certainly it does. Does that mean a
> big problem for us? No. Nobody is going to _expect_ us to read
> it. If someone ships us source code we can't read we shan't use
> it.
>
> Is this a NEW problem? No. It is already possible to use some
> surprising languages in ASCII (Klingon, Ancient Egyptian, Greek
> with a little ingenuity, ...) so ever since Erlang began, we've
> had the possibility of entire files being written in words that
> we did not understand. If you don't know what the *functions*
> are about, what good does it do you to know which tokens are
> variables?
>
> I once had to maintain a large chunk of Prolog written by a
> very clever programmer whose idea of good variable naming
> style came from old BASIC (one letter, or one letter and one
> digit). I could see _which_ tokens were the variables, but
> not _what_ the variable names meant. I had to figure it out
> from the predicate names. So from actual experience I can
> tell you
>
> JUST KNOWING WHICH TOKENS ARE VARIABLES IS
> NEXT TO USELESS.
>
>> I think it is better to restrict to a subset of 7-bit US-ASCII.
>
> Yeah! Let's make Erlang ASCII-only! (Too bad about my father's
> middle name: Æneas. Perfectly good English name, from Latin.)
>
>> Decent
>> editors have means (vim: ga, emacs: Ctrl-X describe-char) to show which
>> character is under the cursor and if it is A..Z or _ under U+7F it is a
>> variable start.
>
> I'm using Aquamacs.
> From the Aquamacs help:
> Emacs buffers and strings support a large repertoire of
> characters from many different scripts, allowing users to
> type and display text in almost any known written language.
>
> To support this multitude of characters and scripts,
> Emacs closely follows the Unicode Standard.
> It's Meta-X describe-char, not Ctrl-X describe-char,
> and it works perfectly with Unicode characters.
> Here's sample output:
>
> character: Ҳ (1202, #o2262, #x4b2)
> preferred charset: unicode (Unicode (ISO10646))
> code point: 0x04B2
> syntax: w which means: word
> category: .:Base, y:Cyrillic
> buffer code: #xD2 #xB2
> file code: #xD2 #xB2 (encoded by coding system utf-8)
> display: by this font (glyph code)
> nil:-apple-Lucida_Grande-medium-normal-normal-*-13-*-*-*-p-0-iso10646-1 (#x8A3)
>
> Character code properties: customize what to show
> name: CYRILLIC CAPITAL LETTER HA WITH DESCENDER
> old-name: CYRILLIC CAPITAL LETTER KHA WITH RIGHT DESCENDER
> general-category: Lu (Letter, Uppercase)
>
> Trying this in Vim, it tells me what the numeric codes
> of a letter are, but not that it is a letter.
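For readers without Emacs to hand, roughly the same report can be produced in Python. This is a sketch only: `describe_char` here is a hypothetical helper that borrows the Emacs command's name, not an existing API, and it covers just the code point, the UTF-8 bytes, the name and the general category.

```python
import unicodedata

def describe_char(ch):
    """Collect a few of the fields that Emacs's describe-char reports."""
    cp = ord(ch)
    return {
        "decimal": cp,
        "octal": format(cp, "o"),
        "hex": format(cp, "x"),
        "utf8": " ".join(f"{b:02X}" for b in ch.encode("utf-8")),
        "name": unicodedata.name(ch, "<unnamed>"),
        "category": unicodedata.category(ch),
    }

if __name__ == "__main__":
    for key, value in describe_char("\u04B2").items():
        print(f"{key}: {value}")
```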
>
>>
>> The underscore
>> --------------
>>
>> I would like to argue against allowing all Unicode general category Pc
>> (Connector_Punctuation) characters in place of "_". This class contains
>> these characters in Unicode 6.2:
>> U+5F; LOW LINE
>> U+203F; UNDERTIE
>> U+2040; CHARACTER TIE
>> U+2054; INVERTED UNDERTIE
>> U+FE33; PRESENTATION FORM FOR VERTICAL LOW LINE
>> U+FE34; PRESENTATION FORM FOR VERTICAL WAVY LOW LINE
>> U+FE4D; DASHED LOW LINE
>> U+FE4E; CENTERLINE LOW LINE
>> U+FE4F; WAVY LOW LINE
>> U+FF3F; FULLWIDTH LOW LINE
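A list like this can be regenerated rather than typed by hand. The Python sketch below scans every code point and keeps those whose general category is Pc; the function name `connector_punctuation` is illustrative, and the result reflects the Unicode version built into the interpreter, which need not be 6.2.

```python
import sys
import unicodedata

def connector_punctuation():
    """Return every code point whose general category is Pc."""
    return [cp for cp in range(sys.maxunicode + 1)
            if unicodedata.category(chr(cp)) == "Pc"]

if __name__ == "__main__":
    for cp in connector_punctuation():
        print(f"U+{cp:04X}; {unicodedata.name(chr(cp))}")
```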
>>
>> Of these at least U+2040 "⁀" is horizontal at the top of the line
>
> If it looks horizontal, you have a very poor font.
> It's _supposed_ to look more like a c rotated 90 degrees
> clockwise and flattened a bit.
>
>> and U+FE33 "︳" looks like a vertical bar (I guess intended for
>> vertical-flow Chinese) so they do not resemble "_" very much.
>
> Who said they were _supposed_ to resemble "_"?
> Not me.
>
> I can see your point here, but allowing-all-of-Pc *is* the
> Unicode UAX#31 recommendation. We *have* to tailor the
> definition somewhat for the sake of backwards compatibility
> (dots and at signs). We *could* tailor it here, but it is
> definitely advantageous to have at least one more Pc
> character, for reasons given in the EEP.
>
>> Allowing all these would make it hard to remember if a given
>> character is category Pc or something else e.g "|".
>
> You are not *supposed* to remember what each and every character is.
>
> BECAUSE YOU CAN'T.
>
> If there's anyone who can, I don't want to meet them.
> What _else_ could we talk about?
>
> There are 110,117 defined characters in Unicode 6.2.
> (The figure was 110,116 in Unicode 6.1 and 6.2 added one more.)
> NOBODY is expected to know what all these characters are.
>
> The idea is not
> "if a character is to appear in an Erlang file,
> everybody must know what it means"
> but
> "if someone wants to use their own script in
> an Erlang file, they should be able to do so
> in a way that is generally consistent with
> other programming languages."
>
> The idea that a character should be forbidden unless YOU
> recognise it would take us right back to ASCII or Latin 1.
> Please, do not put the cart before the horse.
>
> It is perfectly acceptable to say "If someone wants to share
> Erlang code with people in other countries, they should use
> characters that all those people recognise." In the 21st
> century it is no longer acceptable to say "nobody may use a
> character unless I remember what it is."
>>
>> Unquoted atoms
>> --------------
>>
>> The EEP proposes:
>> atom_start ::= XID_Start ∖ (Lu ∪ Lt ∪ Lo ∪ Pc)
>> | "." (Ll ∪ Lo)
>>
>> I agree that Lu (Uppercase_Letter) and Lt (Titlecase_Letter) should
>> be excluded so an atom can not start with a capital looking letter,
>> but Pc ⊄ XID_Start so there is no reason to subtract it, and why
>> subtract Lo (Other_Letter)?
>
> There is also no *harm* in making it obvious that variables
> *can* start with Pc characters and unquoted atoms *cannot*.
>
> Why subtract Lo? That was a combination of a backwards compatibility
> issue and an oversight.
>
> The backwards compatibility issue is that
> ªº are Lo characters and are not allowed to begin an Erlang atom.
> The oversight was forgetting that this category was the one with
> most of the characters I wanted to allow.
>
> This should read
>
> atom_start ::= XID_Start ∖ (Lu ∪ Lt ∪ "ªº")
> | "." (Ll ∪ Lo)
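The revised atom_start rule can be sketched in Python as follows. The assumptions are the same as for any such check: a single-character `str.isidentifier()` call approximates XID_Start (with "_" excluded by hand), and the function names `atom_start_char` and `atom_start_len` are invented for this example.

```python
import unicodedata

# Feminine and masculine ordinal indicators: category Lo, but not legal atom starts today.
EXCLUDED = {"\u00AA", "\u00BA"}

def is_xid_start(ch):
    """Approximate XID_Start: Python lets an identifier start with XID_Start or '_'."""
    return ch != "_" and ch.isidentifier()

def atom_start_char(ch):
    """XID_Start minus (Lu, Lt, and the two ordinal indicators)."""
    return (is_xid_start(ch)
            and unicodedata.category(ch) not in ("Lu", "Lt")
            and ch not in EXCLUDED)

def atom_start_len(s):
    """Number of leading characters of s matched by atom_start (0 if none)."""
    if s[:1] and atom_start_char(s[0]):
        return 1
    # Second alternative of the rule: "." followed by an Ll or Lo character.
    if len(s) >= 2 and s[0] == "." and unicodedata.category(s[1]) in ("Ll", "Lo"):
        return 2
    return 0

if __name__ == "__main__":
    for s in ["abc", "Abc", "\u00AAx", ".a", ".A", "_x"]:
        print(repr(s), atom_start_len(s))
```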
>
>> There also seems to be a typo in the definition of unquoted_atom
>> where an iteration of atom_continue is missing.
>>
>> I propose:
>> unquoted_atom ::= atom_start atom_continue*
>
> Yes.
>>
>> atom_start ::= atom_start_char
>> | "." atom_start_char
>
> That will allow Latin-1 atoms that are not now legal.
>>
>> atom_start_char ::= XID_Start ∖ (Lu ∪ Lt)
>>
>> atom_continue ::= XID_Continue ∪ "@"
>> | "." XID_Continue
>
> That will allow Latin-1 atoms that are not now legal.
>
>> General explanation
>> -------------------
>>
>> I think the EEP could benefit from explaining more about the character
>> classes used, what kind of stability Annex #31 is designed to give, and so on.
>>
>> When I read the EEP, it took several days of reading the Unicode standard to
>> start understanding it, and I think many will hesitate before trying to
>> understand the EEP, which is a pity.
>
> Well, yes. Is it my job to repeat all the material in the Unicode
> standard? I don't think so. I mean, the thing's telephone-book size!
>>
>> My first concern was whether code I write for one Unicode Erlang release
>> will still be valid in subsequent Erlang releases based on later
>> Unicode standards.
>
> Yes. Section 1.1 of UAX#31 could hardly be more explicit. Well,
> maybe it could, which is why it points to
> http://www.unicode.org/policies/stability_policy.html
> which says
>
> - Once a character is XID_Continue,
> it must continue to be so in all future versions.
> - If a character is XID_Start then it must also be XID_Continue.
> - Once a character is XID_Start,
> it must continue to be so in all future versions.
>
> amongst other things.
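The second guarantee (every XID_Start character is also XID_Continue) can be spot-checked for a single Unicode version. The Python sketch below leans on the fact that Python's own identifier syntax is built on XID_Start and XID_Continue; it says nothing about stability across versions, which only the policy itself can promise. The function name is illustrative.

```python
import sys

def xid_start_not_continue():
    """Code points that may start a Python identifier but may not continue one."""
    return [cp for cp in range(sys.maxunicode + 1)
            if chr(cp).isidentifier() and not ("a" + chr(cp)).isidentifier()]

if __name__ == "__main__":
    offenders = xid_start_not_continue()
    print("violations:", len(offenders))
```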
>
>> For example the EEP and my proposal both define atom_start to be XID_Start
>> minus a set containing uppercase and titlecase letters. XID_Start is
>> derived from ID_Start, and ID_Start contains Other_ID_Start. I have failed
>> to find which code points are contained in Other_ID_Start.
>
> To start with, the purpose of Other_ID_Start is to provide stability.
> Any character which _used_ to be an ID_Start but because of some change
> would have ceased to be so will be given that property to compensate.
>
> The properties Other_ID_Start and Other_ID_Continue are listed in
> PropList.txt in the Unicode database. Here's the current set:
>
> # ================================================
>
> 2118 ; Other_ID_Start # Sm SCRIPT CAPITAL P
> 212E ; Other_ID_Start # So ESTIMATED SYMBOL
> 309B..309C ; Other_ID_Start # Sk [2] KATAKANA-HIRAGANA VOICED SOUND MARK..KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK
>
> # Total code points: 4
>
> # ================================================
>
> 00B7 ; Other_ID_Continue # Po MIDDLE DOT
> 0387 ; Other_ID_Continue # Po GREEK ANO TELEIA
> 1369..1371 ; Other_ID_Continue # No [9] ETHIOPIC DIGIT ONE..ETHIOPIC DIGIT NINE
> 19DA ; Other_ID_Continue # No NEW TAI LUE THAM DIGIT ONE
>
> # Total code points: 12
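Lines in this format are easy to turn into code point sets. A minimal Python sketch, using only the excerpt quoted above as input (a real tool would read the full PropList.txt from the Unicode Character Database); `parse_proplist` is a hypothetical helper name.

```python
import re

SAMPLE = """\
2118          ; Other_ID_Start # Sm       SCRIPT CAPITAL P
212E          ; Other_ID_Start # So       ESTIMATED SYMBOL
309B..309C    ; Other_ID_Start # Sk   [2] KATAKANA-HIRAGANA VOICED SOUND MARK..KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK
# Total code points: 4
00B7          ; Other_ID_Continue # Po       MIDDLE DOT
0387          ; Other_ID_Continue # Po       GREEK ANO TELEIA
1369..1371    ; Other_ID_Continue # No   [9] ETHIOPIC DIGIT ONE..ETHIOPIC DIGIT NINE
19DA          ; Other_ID_Continue # No       NEW TAI LUE THAM DIGIT ONE
# Total code points: 12
"""

# "XXXX" or "XXXX..YYYY", then ";", then the property name.
LINE = re.compile(r"^([0-9A-F]+)(?:\.\.([0-9A-F]+))?\s*;\s*(\w+)")

def parse_proplist(text):
    """Map each property name to the set of code points carrying it."""
    props = {}
    for line in text.splitlines():
        m = LINE.match(line.strip())
        if not m:
            continue  # comment or blank line
        first = int(m.group(1), 16)
        last = int(m.group(2) or m.group(1), 16)
        props.setdefault(m.group(3), set()).update(range(first, last + 1))
    return props

if __name__ == "__main__":
    for name, cps in parse_proplist(SAMPLE).items():
        print(name, len(cps))
```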
>
>> But since we here define atom_start as above, moving a character from Lu
>> or Lt into Other_ID_Start will remove it from atom_start and old code
>> using it will not compile.
>
>
> Lu and Lt are "General Categories". Other_ID_Start is a "property".
>
> OK, now we've got a genuine technical problem.
>
> The set of characters that can begin a variable-OR-an-unquoted-atom
> can only grow. That much stability we're promised.
>
> If a character changes from Lu to Lt or Other_ID_Start,
> no problem. If a character changes from Lt to Lu or
> Other_ID_Start, no problem. But if a character changes
> from Lu/Lt to Ll/Lo or vice versa, we have a problem.
>
> Perhaps we can appeal to this:
> Once a character is encoded, its properties may still be
> changed, but not in such a way as to change the fundamental
> identity of the character.
> ...
> For example, the representative glyph for U+0041 “A”
> cannot be changed to “B”; the General_Category for
> U+0041 “A” cannot be changed to Ll (lowercase letter)
> ...
>
> Case Pair stability _nearly_ gives us what we want.
> If two characters form a case pair in a version of Unicode,
> they will remain a case pair in each subsequent version of Unicode.
>
> If two characters do not form a case pair in a version of Unicode,
> they will never become a case pair in any subsequent version of Unicode.
> That is, if "D" and "d" are unequal defined characters such that
> lower("D") = "d" and upper("d") = "D", then this will remain true.
> This means that
> If "D" is an Lu character now and "d" the corresponding Ll
> character, they are going to remain a case pair.
> So we could fiddle a bit and say
> Lu + Lt + Pc + (Other_ID_Start such that lower(x) != x)
> is what we're after.
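The case-pair condition and the "fiddled" class can be written down as a Python sketch. The assumptions are worth stating plainly: Python's `str.lower()` and `str.upper()` apply full case mappings from the interpreter's Unicode version, which is close to but not the same thing as the simple mappings the stability policy talks about; the Other_ID_Start set is the four code points quoted earlier in this thread; and both function names are invented for illustration.

```python
import unicodedata

# Other_ID_Start as listed in this thread (Unicode 6.2).
OTHER_ID_START = {0x2118, 0x212E, 0x309B, 0x309C}

def is_case_pair(upper, lower):
    """True when two distinct characters map to each other under lower() and upper()."""
    return upper != lower and upper.lower() == lower and lower.upper() == upper

def in_fiddled_class(ch):
    """Lu + Lt + Pc + (Other_ID_Start such that lower(x) != x)."""
    if unicodedata.category(ch) in ("Lu", "Lt", "Pc"):
        return True
    return ord(ch) in OTHER_ID_START and ch.lower() != ch

if __name__ == "__main__":
    print(is_case_pair("D", "d"))
    print(in_fiddled_class("D"), in_fiddled_class("d"), in_fiddled_class("\u2118"))
```

With the four Other_ID_Start characters listed in the thread, the last term contributes nothing, since none of them has a lowercase mapping; it would only matter if a cased character were ever moved into that property.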
>
> This doesn't handle the situation where there is a cased letter now
> but not its case opposite, as Latin-1 had y-umlaut and sharp s as
> lower case letters with no upper case version. But when case opposites
> for them did go into Unicode, they didn't change.
>
> I don't think we actually have a problem.
>
> However, the attached revision to EEP 40 has two recommendations.
>
>
> _______________________________________________
> erlang-questions mailing list
> erlang-questions@REDACTED
> http://erlang.org/mailman/listinfo/erlang-questions