[erlang-questions] EEP 40 - A proposal for Unicode variable and atom names in Erlang.

Thu Nov 1 23:41:46 CET 2012

I'm not going to answer every point, because I'm supposed to be marking exams.
That doesn't mean they aren't good points.

Next revision of the EEP: 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: eep-0040.md
Type: application/octet-stream
Size: 11511 bytes
Desc: not available
URL: <http://erlang.org/pipermail/erlang-questions/attachments/20121102/91a0a8e7/attachment.obj>
-------------- next part --------------

> So here is what seems to be the core question:
> 
> I say I want to be able to see the difference between a variable and an
> unquoted atom even if I can not make sense of the variables' and atoms' names.

And I say that I don't see any significant benefit in being able to do this.

I also note that Haskell and Prolog have identifiers whose properties
depend on the case of their initial letter.  In Haskell, "conid"s begin with
a "large" letter and "varid"s begin with a "small" one (section 2.4,
Identifiers and Operators), where they take "_" as a "small" letter so that
it can begin a variable.  And they do not require either varids or conids to
begin with an ASCII letter.  Nor does SWI Prolog require this:
m% swipl
Welcome to SWI-Prolog (Multi-threaded, 64 bits, Version 6.1.4)
...
?- ????? = ?????.
????? = ?????.

[Meta-X describe-character]
> Yes. I know. I gave the example.

You seemed to be saying that describe-character didn't work
with non-Latin-1 characters.  I am sorry to have misunderstood you.
> 
> So in Vim you can easily see if the character is less than 128.
> But not if it is a letter.

>>> and U+FE33 "?" looks like a vertical bar (I guess intended for
>>> vertical flow Chinese) so they do not resemble "_" very much.
>> 
>> Who said they were _supposed_ to resemble "_"?
>> Not me.
> 
> No. I did, because for me that would indicate the character's purpose.

That's rather like saying that the Greeks should stop using
; for questions, because only ? would indicate the character's purpose.
> 
> Sorry I can not find those reasons. I find reasons and agree
> that if we allow more than "_" we should allow all in Pc,
> but I do not see why we need more than "_" other than because
> it is UAX#31's recommendation.

> 
> The wildcard variable is "_" and starting a variable with that
> character has a special meaning to the compiler. Why do we need
> more aliases for that character?

BECAUSE that character has a special meaning,
and the other characters are NOT aliases for it.

Maybe it's not in the EEP, but it certainly was in this mailing list.
Someone was arguing against internationalisation on the grounds that
?? couldn't be used as a variable name, and to the proposal that
_?? be used, it was claimed that the compiler would have to treat
this as something that was supposed to occur just once, and so I
pointed out that there are other Pc characters available, so that
??? or ??? could be used.  It wasn't that word, and I think I
didn't mention ?.  But the point was that we could retain the
current reading of "_" unchanged and begin caseless words used as
variable names with some other Pc character.  The idea is that the
other Pc characters would or could be treated differently from "_".

In fact I do prefer that all the Pc characters should be treated
the same, but at the moment the EEP offers both alternatives for
consideration.
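
The two alternatives can be sketched in a few lines of Python (used here only because its standard `unicodedata` module exposes general categories). The function names `starts_variable` and `has_underscore_meaning` are hypothetical, and the rule "a variable starts with Lu, Lt or Pc" is an assumed reading of the proposal for illustration, not the EEP's normative grammar.

```python
import unicodedata

# Assumption for this sketch: a variable may start with an uppercase letter
# (Lu), a titlecase letter (Lt), or a connector punctuation character (Pc).
VARIABLE_START = {"Lu", "Lt", "Pc"}

def starts_variable(name):
    """Return True if the first character of `name` would begin a variable."""
    return unicodedata.category(name[0]) in VARIABLE_START

def has_underscore_meaning(name, all_pc_alike):
    """Return True if `name` gets the compiler's special "_" treatment.

    all_pc_alike=True  : every Pc character behaves like "_".
    all_pc_alike=False : only "_" itself is special; other Pc characters
                         merely mark the start of a variable.
    """
    first = name[0]
    if all_pc_alike:
        return unicodedata.category(first) == "Pc"
    return first == "_"

# A caseless word (two CJK ideographs, category Lo) reads as an atom,
# but prefixing it with U+203F UNDERTIE (Pc) makes it a variable.
print(starts_variable("\u53d8\u91cf"))         # caseless word alone
print(starts_variable("\u203f\u53d8\u91cf"))   # same word after a Pc character
```

Under the second alternative, `has_underscore_meaning("\u203fx", False)` is false, so the current reading of "_" is left unchanged.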

>> It is perfectly acceptable to say "If someone wants to share
>> Erlang code with people in other countries, they should use
>> characters that all those people recognise."  In the 21st
>> century it is no longer acceptable to say "nobody may use a
>> character unless I remember what it is."
> 
> I said I want to be able to understand the semantics without
> knowing all characters. Is that a straw man attack?

You cannot even understand the lexical semantics without knowing
the characters.  The most primitive level of "understand(ing)
the semantics" I can imagine is being able to answer the question
"Is this sequence of characters legal or not?"

Consider this example: "???." (U+0930, U+0970, usual full stop.)
If you were trying to read that from a file, would it be a legal
term?

No.  The first character is a letter, but the second character is
classified as a punctuation mark.  I only know this because I was
constantly referring to the tables while constructing the example.
It will be instantly obvious, I imagine, to anyone familiar with
the Devanagari script.  For that matter, hawai?i is or ought to
be a perfectly good atom.  That glottal stop letter looked a lot
like a question mark, didn't it?  So it might not have _looked_
like an atom, but it would be one.
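
The table lookup described above can be reproduced mechanically. This is a minimal Python sketch using the standard `unicodedata` module; the helper name `describe` is hypothetical, and the categories reported depend on the Unicode version bundled with the Python in use.

```python
import unicodedata

def describe(text):
    """Return (code point, general category, character name) per character."""
    return [
        (
            "U+%04X" % ord(ch),
            unicodedata.category(ch),
            unicodedata.name(ch, "<unnamed>"),
        )
        for ch in text
    ]

# U+0930 is a letter (Lo); U+0970 is punctuation (Po), as is the full stop.
for row in describe("\u0930\u0970."):
    print(*row)
```

The first character alone would be a legal start of an atom under the proposal; the second is what makes the sequence illegal as a term.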

If someone gives you an Erlang file written entirely in ASCII,
but using the Klingon language, just how much would it help you
to know where the variables began?  (Google Translate offers
translation to Esperanto, why not Klingon?  I haven't opened my
copy of the how-to-learn-Klingon book in 20 years.  Sigh.)

>> 
>> The backwards compatibility issue is that
>> ªº are Lo characters and are not allowed to begin an Erlang atom.
> 
> Would that be an issue? Since they are in Lo should we not start
> allowing them?

I wanted to preserve a somewhat stronger property than any I mentioned,
namely that
	"this is a legal Erlang text using Latin-1 characters
	 under the old rules"
     if and only if
	"this is a legal Erlang text using Latin-1 characters
	 under the new rules".

If anyone wants to propose allowing "ªº" at the beginning of an atom
in Latin-1 Erlang, fine.  Doesn't bother me.  But I wasn't about to
introduce _any_ incompatibility if I could avoid it.  In particular,
it seems like a nice thing for the transition period that if you have
an Erlang file that works in Unicode Erlang and happens to include
nothing outside Latin-1 (a trivial mechanical check) it should be
guaranteed to work in Latin-1 Erlang.
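
The "trivial mechanical check" could look like the following Python sketch. The function name `is_latin1_only` is hypothetical, and it assumes the source file has already been decoded to text.

```python
def is_latin1_only(source_text):
    """Return True if every character lies in U+0000..U+00FF (Latin-1)."""
    return all(ord(ch) <= 0xFF for ch in source_text)

# U+00E9 (e with acute) is inside Latin-1; U+0930 (Devanagari RA) is not.
print(is_latin1_only("X = caf\u00e9."))
print(is_latin1_only("X = '\u0930'."))
```

A file that passes this check is the kind of file the stronger property is meant to cover: it should be legal under the new rules if and only if it is legal under the old ones.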

Oh FLAMING SWEARWORDS.  Erlang doesn't currently allow "ªº" anywhere
in an unquoted atom.  OK.  There are two reasonable alternatives:

Backwards compatible: do not allow "ªº" in identifiers.
UAX#31 compatible:    treat "ªº" just like any other Ll characters.

I never thought to check whether Erlang allowed "ªº" at the end of
an identifier because it _obviously_ would.  But it doesn't.  Sigh.

>> This should read
>> 
>>    atom_start ::= XID_Start \ (Lu ∪ Lt ∪ "ªº")
>>                |  "." (Ll ∪ Lo)
> 
> Ok. Now I get it. But should it not be the same set after a dot
> as at the start?

Consider
1> X = a.B.
* 1: syntax error before: B
1> X = a._2.
* 1: syntax error before: _2
1> X = a.3.
* 1: syntax error before: 3
1> X = a.b.
'a.b'

That tells us that currently, only Ll characters are allowed
after a dot in the continuation of an identifier.  That naturally
generalised to (Ll ∪ Lo).  So I made "what can follow a dot" the
same everywhere in an atom.  The mental model I had was to think
of dot-followed-by-Ll-or-Lo as a single extended character.
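
The "single extended character" model can be sketched as a one-character lookahead. This is Python for illustration only; the function name `dot_continues_atom` is hypothetical, and the rule is the generalised one described above, not the current Latin-1 scanner.

```python
import unicodedata

def dot_continues_atom(next_char):
    """Return True if a "." followed by `next_char` stays inside the atom.

    Sketch of the generalised rule: the dot is accepted only when the
    character after it is a lowercase letter (Ll) or an "other" letter (Lo).
    """
    return unicodedata.category(next_char) in {"Ll", "Lo"}

# Mirrors the shell session: a.B, a._2 and a.3 are rejected, a.b is accepted.
for ch in ["B", "_", "3", "b"]:
    print(repr(ch), dot_continues_atom(ch))
```

Because the same test applies wherever a dot appears, the set allowed after a dot is identical at the start of an atom and in its continuation.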

> I agree that moving a character from Lu or Lt to Other_Id_Start would
> increase the set of atom_start characters.
> 
> For the characters "ªº" you above called that a backwards compatibility
> issue, which I doubt it is.

There is definitely a backwards compatibility issue (whether one can
safely move a new-rules file that is entirely in Latin-1 back to an
old-rules system).  Whether it is of any practical significance is
another matter.  What's also clear is that I haven't quite got there
yet.  One reason for revising the EEP again.

Concerning stability, I did send a message to the Unicode consortium.
I've had an informal response:

	An interesting question you raise, which I will pass along
	to some people here.  I think the short answer is that you
	can tailor these things to particular environments, and you
	may not be able to rely on any given standard property for
	special purposes.  Especially if that property is not
	formally stable.  But I'll see what others say.

There are sufficiently many programming languages that depend on
initial alphabetic case that we may be looking at a revision of
UAX#31.  Wouldn't that be fun?  (Groan.)

Remaining points skipped for now.