[erlang-questions] EEP 40 - A proposal for Unicode variable and atom names in Erlang.

Fri Nov 2 10:35:39 CET 2012

On Fri, Nov 02, 2012 at 11:41:46AM +1300, Richard O'Keefe wrote:
> I'm not going to answer every point, because I'm supposed to be marking exams.
> That doesn't mean they aren't good points.

Looking forward to later then...

> 
> Next revision of the EEP: 

It is now updated and published.

Formally, EEP updates should go to eeps@REDACTED, according to
http://www.erlang.org/eep.html. I slipped on that procedure by not
mailing to that list when accepting this EEP, but that will improve...

> 
:
: :
> > 
> > The wildcard variable is "_" and starting a variable with that
> > character has a special meaning to the compiler. Why do we need
> > more aliases for that character?
> 
> BECAUSE that character has a special meaning,
> and the other characters are NOT aliases for it.
> 
> Maybe it's not in the EEP, but it certainly was in this mailing list.
> Someone was arguing against internationalisation on the grounds that
> 變量 couldn't be used as a variable name, and to the proposal that
> _變量 be used, it was claimed that the compiler would have to treat
> this as something that was supposed to occur just once, and so I
> pointed out that there are other Pc characters available, so that
> ⁀變量 or ‿變量 could be used.  It wasn't that word, and I think I
> didn't mention ⁀.  But the point was that we could retain the
> current reading of "_" unchanged and begin caseless words used as
> variable names with some other Pc character.  The idea is that the
> other Pc characters would or could be treated differently from "_".
> 
> In fact I do prefer that all the Pc characters should be treated
> the same, but at the moment the EEP offers both alternatives for
> consideration.

Ok. I misread it as containing only one suggestion: to treat all Pc
characters alike. I think it is still somewhat unclear that treating
only "_" specially _is_ an alternative in the EEP.

Also, I do not clearly see what problem revising the underscore rule
solves for someone using fonts that have, say, Arabic letters but not,
say, the undertie. Bear with me; I have never used any keyboard other
than Swedish or English. Is it the case that with such a font no Pc
character is available except "_" (and why is that one available?), so
there must be a way to express both non-singleton and maybe-singleton
variables using just "_"?
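
For reference, the set of Pc (Connector_Punctuation) characters under
discussion can be listed mechanically. The following is only a sketch in
Python, using the Unicode tables bundled with the interpreter; the exact
set depends on that Unicode version, and it says nothing about which of
these characters a given font or keyboard provides.

```python
import sys
import unicodedata

# Collect every code point whose General_Category is Pc.
pc_chars = [
    chr(cp)
    for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)) == "Pc"
]

# Only the ordinary underscore falls inside the Latin-1 range.
pc_in_latin1 = [c for c in pc_chars if ord(c) < 256]

for c in pc_chars:
    print("U+%04X %s" % (ord(c), unicodedata.name(c)))
```

The undertie (U+203F) and the character tie (U+2040) mentioned in the
quoted text both appear in that list.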

:
> You cannot even understand the lexical semantics without knowing
> the characters.  The most primitive level of "understand(ing)
> the semantics" I can imagine is being able to answer the question
> "Is this sequence of characters legal or not?"
> 
> Consider this example: "؂र॰." (U+0930, U+0970, usual full stop.)
> If you were trying to read that from a file, would it be a legal
> term?
> 
> No.  The first character is a letter, but the second character is
> classified as a punctuation mark.  I only know this because I was
> constantly referring to the tables while constructing the example.
> It will be instantly obvious, I imagine, to anyone familiar with
> the Devanagari script.  For that matter, hawaiɁi is or ought to
> be a perfectly good atom.  That glottal stop letter looked a lot
> like a question mark, didn't it?  So it might not have _looked_
> like an atom, but it would be one.

I have realized that. I wanted a lesser degree of understanding the
lexical semantics: If it passes the compiler (which that example
does not) I would like to be able to see which identifiers are
variables and which are atoms.

Also, someone writing e.g. a syntax highlighter for Vim would, I guess,
appreciate a simple rule for how to recognize a variable.
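
To make "a simple rule" concrete, here is a rough sketch in Python of the
kind of first-character test a highlighter could apply. The rule itself is
an assumption for illustration (Lu, Lt and all Pc characters start a
variable; any other XID_Start character starts an atom), not the EEP's
exact grammar, and classify_start is a made-up name.

```python
import unicodedata

def classify_start(ch: str) -> str:
    """Classify one character by what kind of identifier it could start."""
    cat = unicodedata.category(ch)
    # Assumed rule: uppercase, titlecase, and connector punctuation
    # (including "_") begin a variable.
    if cat in ("Lu", "Lt", "Pc"):
        return "variable"
    # str.isidentifier() on a single character is true for XID_Start
    # characters (and for "_", which was already handled above as Pc).
    if ch.isidentifier():
        return "atom"
    return "other"

# U+0930 is a letter (Lo); U+0970 is punctuation (Po), which is why the
# quoted Devanagari example is not a legal term.
for ch in ["X", "a", "_", "\u203f", "\u8b8a", "\u0930", "\u0970"]:
    print("U+%04X %s %s" % (ord(ch), unicodedata.category(ch), classify_start(ch)))
```

A highlighter would still need the continuation rules to find where an
identifier ends; this only covers the first character.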

> 
> If someone gives you an Erlang file written entirely in ASCII,
> but using the Klingon language, just how much would it help you
> to know where the variables began?  (Google Translate offers
> translation to Esperanto, why not Klingon?  I haven't opened my
> copy of the how-to-learn-Klingon book in 20 years.  Sigh.)

It would not help much, I agree. But if, for example, I get a bug report
about the compiler or runtime system mishandling a few lines of Klingon
Erlang, it would be helpful to be able to easily distinguish variables
from atoms.

> 
> >> 
> >> The backwards compatibility issue is that
> >> ªº are Lo characters and are not allowed to begin an Erlang atom.
> > 
> > Would that be an issue? Since they are in Lo should we not start
> > allowing them?
> 
> I wanted to preserve a somewhat stronger property than any I mentioned,
> namely that
> 	"this is a legal Erlang text using Latin-1 characters
> 	 under the old rules"
>      if and only if
> 	"this is a legal Erlang text using Latin-1 characters
> 	 under the new rules".
> 
> If anyone wants to propose allowing "ªº" at the beginning of an atom
> in Latin-1 Erlang, fine.  Doesn't bother me.  But I wasn't about to
> introduce _any_ incompatibility if I could avoid it.  In particular,
> it seems like a nice thing for the transition period that if you have
> an Erlang file that works in Unicode Erlang and happens to include
> nothing outside Latin-1 (a trivial mechanical check) it should be
> guaranteed to work in Latin-1 Erlang.

Ok. Good point. That sounds possibly essential. And that goal is now in
the latest version of the EEP. Very good.
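
The "trivial mechanical check" mentioned in the quoted text could look
something like the following Python sketch. The function name is
hypothetical, and it assumes the source text has already been decoded
into a string.

```python
def is_latin1_only(source: str) -> bool:
    """Return True if every character of source fits in Latin-1."""
    try:
        # Encoding fails on the first code point above U+00FF.
        source.encode("latin-1")
    except UnicodeEncodeError:
        return False
    return True

print(is_latin1_only("X = a.b."))
print(is_latin1_only("\u8b8a\u91cf"))
```

A file that passes this check is the kind that, under the stated goal,
should behave the same under the old and the new rules.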

:
: ::
> >> This should read
> >> 
> >>    atom_start ::= XID_Start \ (Lu ∪ Lt ∪ "ªº")
> >>                |  "." (Ll ∪ Lo)
> > 
> > Ok. Now I get it. But should it not be the same set after a dot
> > as at the start?
> 
> Consider
> 1> X = a.B.
> * 1: syntax error before: B
> 1> X = a._2.
> * 1: syntax error before: _2
> 1> X = a.3.
> * 1: syntax error before: 3
> 1> X = a.b.
> 'a.b'
> 
> That tells us that currently, only Ll characters are allowed
> after a dot in the continuation of an identifier.  That naturally
> generalised to (Ll ∪ Lo).  So I made "what can follow a dot" the
> same everywhere in an atom.  The mental model I had was to think
> of dot-followed-by-Ll-or-Lo as a single extended character.

Yes. And currently only Ll characters are allowed at the start
of an atom. So currently the same set is allowed at the start
as after a ".".

Your current suggestion allows a.ª as an unquoted atom since the character
after the dot is in Lo, but it is not allowed in Erlang today.

It also allows ᛮᛯᛰ as an atom but not ᛮᛯᛰ.ᛮᛯᛰ since these characters
are in Nl (Letter_Number), which is part of XID_Start.

So I think the mental model should be that after a dot it is
as if a new atom were starting.
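
The difference between the two rules can be sketched in Python as below.
This is only an illustration under assumptions: the function names are
made up, XID_Start is approximated with str.isidentifier() on a single
character, and the categories come from the Unicode tables shipped with
the interpreter (which must be recent enough to classify "ª" as Lo).

```python
import unicodedata

def is_xid_start(ch: str) -> bool:
    # A single character is a Python identifier iff it is XID_Start or "_".
    return ch != "_" and ch.isidentifier()

def atom_start(ch: str) -> bool:
    # atom_start ::= XID_Start \ (Lu ∪ Lt ∪ "ªº")
    return (
        is_xid_start(ch)
        and unicodedata.category(ch) not in ("Lu", "Lt")
        and ch not in "\u00aa\u00ba"
    )

def after_dot_quoted_rule(ch: str) -> bool:
    # The quoted proposal: "." may be followed by Ll or Lo.
    return unicodedata.category(ch) in ("Ll", "Lo")

def after_dot_as_new_atom(ch: str) -> bool:
    # The alternative argued here: after "." the atom_start set applies.
    return atom_start(ch)

# "ª" (U+00AA, Lo) and a runic character (U+16EE, Nl) show the difference.
for ch in ["b", "\u00aa", "\u16ee"]:
    print(
        "U+%04X" % ord(ch),
        unicodedata.category(ch),
        atom_start(ch),
        after_dot_quoted_rule(ch),
        after_dot_as_new_atom(ch),
    )
```

Under the second rule a.ª is rejected, as in Erlang today, and an atom made
of Nl characters may be repeated after a dot.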

:
> 
> Concerning stability, I did send a message to the Unicode consortium.
> I've had an informal response:
> 
> 	An interesting question you raise, which I will pass along
> 	to some people here.  I think the short answer is that you
> 	can tailor these things to particular environments, and you
> 	may not be able to rely on any given standard property for
> 	special purposes.  Especially if that property is not
> 	formally stable.  But I'll see what others say.
> 
> There are sufficiently many programming languages that depend on
> initial alphabetic case that we may be looking at a revision of
> UAX#31.  Wouldn't that be fun‽  (Groan.)

I think we need an XID_Start_Uppercase and XID_Start_Lowercase,
containing Other_ID_Start_Uppercase and Other_ID_Start_Lowercase.

> 
> Remaining points skipped for now.
> 
> 

I am especially looking forward to a reply about what happens if a
character moves from Ll or Lo to Other_ID_Start...

Good luck with the exams!

-- 

/ Raimo Niskanen, Erlang/OTP, Ericsson AB