[erlang-questions] Pmods, packages, Unicode source code and column numbers in compiler - what will happen in R16?

Thu Oct 18 01:34:10 CEST 2012

On 17/10/2012, at 8:11 PM, Vlad Dumitrescu wrote:

>> On 17/10/2012, at 2:51 AM, Patrik Nyblom wrote:
>>> 
>>> The OTP Technical board decisions from last Thursday are now published on the erlang.org website, which means that the answers to some questions about changes in R16 are finally officially answered.
> 
> Great initiative to publish these decisions and explain them!
> 
> On Wed, Oct 17, 2012 at 1:22 AM, Richard O'Keefe <ok@REDACTED> wrote:
>> 
>> "Variable names will continue to be limited to Latin characters."
>> 
>>        I hope that means "for this release."
> 
> That's an interesting problem. Variable names are defined as starting
> with an upper case letter, but the only scripts that I know of that
> have those are roman, greek, cyrillic and armenian.

"Variables are defined as starting with an upper case letter"
isn't exactly true, unless you do what Quintus did back in the
80s and redefine "_" from being a 'punctuation connector' to
an 'upper case letter'.  Quintus did that for CJK, so that 日付
was an unquoted atom and _日付 was a Prolog variable.
This was apparently acceptable, and the same practice is followed in
other Prologs.  I see no reason why it would not work for Erlang,
where _1 is a perfectly good variable.

> So would variables
> with names in other scripts be forced to start with an uppercase latin
> letter? We might just as well have them start with '§' or something,
> and drop the capitalization rule.

It is _already_ the case that Erlang variables are not forced to
start with an uppercase latin letter, but may start with "_".
(The section sign is not allowed in Unicode identifiers.)
Having Latin, Greek, Cyrillic, and Armenian scripts already covers a
lot of languages.

Here I am in New Zealand.  There are two official languages in this country.
English isn't actually one of them, although it is in practice the language
of government, commerce, and practically everything.  One of the two
official languages is New Zealand Sign Language.  The other is Māori.  Note
the little bar over the "a"?  It's called a macron.  And a-with-macron is
not a Latin-1 character.  The city I'm living in has the name Dunedin in
English and Ōtepoti in Māori.  (Note the macron on the "O".)  The organisation
I work for is the University of Otago/Te Whare Wānanga o Otāgo.  Notice a
pattern?  Do I look forward to being able to tell those of my students who
are Māori that it is now possible for them to use words of their own
language as Erlang identifiers?  You bet.  Did I mention that although the
language of _instruction_ in this University is English, by official decree
students may submit assignments and answers to examination questions in Māori?
If I ask
them to write programs in Erlang (which I did last year and will again next
year), am I actually _allowed_ to do this if they cannot use Māori words as
freely as English ones?  I'd rather not find out, thanks.  I'd definitely
rather not be told to require C, C++, Java, or Ada, which _do_ allow
non-Latin-1 letters in identifiers.

Unicode Standard Annex #31 (UAX 31),
'Unicode Identifier and Pattern Syntax',
http://www.unicode.org/reports/tr31/
says how to handle identifiers in Unicode.
In particular, Coptic, Deseret, and Glagolitic are in table 4:
"candidate characters for exclusion from identifiers".
Section 5 recommends NFC for case-sensitive identifiers.
_ is not an ID_Start character, but if C can have that extension, so can we.

As for '§', SECTION SIGN is _not_ allowed in identifiers.
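
To make the two UAX 31 points above concrete (NFC for case-sensitive
identifiers, and SECTION SIGN being excluded), here is a minimal sketch in
Python, chosen only because its standard unicodedata module exposes
normalisation and general categories.  The names nfc,
may_appear_in_identifier, and IDENTIFIER_CATEGORIES are made up for
illustration; the category set is the one used by the Ada 2012 rules quoted
below, not an exact copy of UAX 31's ID_Start/ID_Continue properties.

```python
import unicodedata

# General categories that the Ada 2012 identifier rules admit somewhere in
# an identifier (an assumption for illustration, not the exact UAX 31 sets).
IDENTIFIER_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl",
                         "Mn", "Mc", "Nd", "Pc"}


def nfc(text):
    """Return text in Normalization Form C."""
    return unicodedata.normalize("NFC", text)


def may_appear_in_identifier(ch):
    """True if the character's general category is one of the above."""
    return unicodedata.category(ch) in IDENTIFIER_CATEGORIES


# Two spellings of the same word that differ code point by code point.
decomposed = "Ma\u0304ori"    # 'a' followed by COMBINING MACRON
precomposed = "M\u0101ori"    # LATIN SMALL LETTER A WITH MACRON
```

Without normalisation the two spellings compare unequal, so a scanner that
skipped NFC could treat them as two different identifiers; after nfc() they
are the same string.  SECTION SIGN falls in none of the listed categories,
while "_" is connector punctuation.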

Ada 2012 identifier syntax (section 2.3) is closely based on Unicode
(technically, on ISO 10646).  Let's see what they say:

  identifier ::= identifier_start {identifier_start | identifier_extend}

  identifier_start ::= letter_uppercase | letter_lowercase |
                       letter_titlecase | letter_modifier |
                       letter_other | number_letter

  identifier_extend ::= mark_non_spacing | mark_spacing_combining |
                        number_decimal | punctuation_connector

  An identifier shall not contain two consecutive characters in
  category punctuation_connector or end with a character in that category.

Ada's restrictions on underscores (the one Latin-1 character that is a
punctuation_connector) have always been idiosyncratic.  This is otherwise
pretty close to what UAX31 says.
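
As a rough illustration of the Ada rules just quoted, the following Python
sketch checks a string against them using general categories from the
standard unicodedata module.  The function name is_ada_identifier and the
two set names are hypothetical; the sketch ignores Ada's reserved words and
assumes the input is already normalised.

```python
import unicodedata

# identifier_start: upper, lower, title, modifier, other letters and
# letter numbers.
ADA_START = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
# identifier_extend: non-spacing marks, spacing combining marks, decimal
# digits and connector punctuation.
ADA_EXTEND = {"Mn", "Mc", "Nd", "Pc"}


def is_ada_identifier(text):
    """Check text against the Ada 2012 section 2.3 syntax quoted above."""
    if not text or unicodedata.category(text[0]) not in ADA_START:
        return False
    previous = ""
    for ch in text:
        category = unicodedata.category(ch)
        if category not in ADA_START and category not in ADA_EXTEND:
            return False
        # No two consecutive punctuation_connector characters.
        if category == "Pc" and previous == "Pc":
            return False
        previous = category
    # An identifier must not end with a punctuation_connector.
    return previous != "Pc"
```

Under these rules 日付 and Māori are both accepted, while _x, a__b, and a_
are rejected, which is the idiosyncratic underscore handling mentioned
above.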

Ada is case-insensitive, so they don't greatly care about which letters
are upper+title-case and which are not.  So adapt the rules like this:

unquoted_atom ::= atom_start identifier_continuation*
atom_start ::= letter_lowercase | letter_modifier | letter_other |
   number_letter | "."    % the last is Erlang-specific

variable ::= variable_start identifier_continuation*
variable_start ::= letter_uppercase | letter_titlecase |
    punctuation_connector  % this includes "_" and some others

identifier_continuation ::= atom_start | variable_start |
    number_decimal | mark_non_spacing | mark_spacing_combining |
    "@"     % this is Erlang-specific

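The adapted rules above can be sketched as a small classifier.  This is
Python rather than Erlang purely because Python's standard unicodedata
module gives direct access to general categories; the name classify and the
helper names are invented for illustration, and this is one possible
reading of the proposal, not how any Erlang scanner actually works.

```python
import unicodedata

# atom_start: lowercase, modifier and other letters, letter numbers
# (plus "." which is handled separately as Erlang-specific).
ATOM_START = {"Ll", "Lm", "Lo", "Nl"}
# variable_start: uppercase and titlecase letters, connector punctuation
# (which includes "_").
VARIABLE_START = {"Lu", "Lt", "Pc"}
# identifier_continuation: either kind of start character, decimal digits
# and combining marks (plus "@" which is handled separately).
CONTINUATION = ATOM_START | VARIABLE_START | {"Nd", "Mn", "Mc"}


def is_atom_start(ch):
    return ch == "." or unicodedata.category(ch) in ATOM_START


def is_variable_start(ch):
    return unicodedata.category(ch) in VARIABLE_START


def is_continuation(ch):
    return ch in ".@" or unicodedata.category(ch) in CONTINUATION


def classify(token):
    """Return "variable", "atom", or None if token fits neither rule."""
    if not token or not all(is_continuation(ch) for ch in token[1:]):
        return None
    if is_variable_start(token[0]):
        return "variable"
    if is_atom_start(token[0]):
        return "atom"
    return None
```

With this reading, 日付 and māori classify as unquoted atoms, while _日付, _1,
Māori, and Ōtepoti classify as variables; a token starting with '§' or a
digit fits neither rule.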
> My guess is that atoms will be allowed to contain unicode just so that
> atom_to_list and list_to_atom can still be used, but usage of such
> atoms in source code will be discouraged because these will have to be
> written as quoted.

Why _should_ Unicode identifiers that would be legal identifiers in Ada
but do not begin with an upper case letter, title case letter, or
connector punctuation mark require quoting?  You might as well require
atoms containing the letter 'z' to be quoted.