[erlang-questions] Atom Unicode Support

Wed Feb 3 22:48:43 CET 2016

On 4/02/16 1:52 am, Max Lapshin wrote:
> Ok, ok, maybe I'm wrong and it is ok to have isolated pieces of code 
> that is impossible to edit by foreign programmer.
It is *already* possible, even *easy*, to have huge amounts of code that
are impossible for a foreign programmer to understand.
ISO Latin 1 is said to be adequate for writing Swahili.
I don't know one word of Swahili.
Write an Erlang module in Swahili -- except for keywords and library
references, of course -- and I guarantee that I won't be able to do
anything with it.
> Problem in UTF8 is not only in nice hieroglyphs, but also in 20 
> different dash-like symbols and about 8 different spaces.
>
> This is one single atom:    my package
> It is not 2 words, it is a single word that has non-breakable space 
> inside it. Good luck for debugging =)

Bad choice.  NO-BREAK SPACE (NBSP) is in Latin-1, which Erlang already
accepts.

You are confounding two different issues:

(a) should Erlang allow programmers to write quoted and/or unquoted atoms
      in their own language, even if that requires characters outside
      Latin-1?  People are arguing that the answer is "yes".  UAX#31 deals
      with this.  By the way, if you check DerivedCoreProperties.txt,
      you'll notice that no white space characters are included in any of
      the four sets used to define identifiers: {,X}ID_{Start,Continue}.

(b) should Erlang accept absolutely every Unicode character?
      Nobody is arguing for that.  There is no particular problem in
      allowing all the "Zs" characters as separating white space.
      Terminating % comments at LINE SEPARATOR and PARAGRAPH SEPARATOR
      doesn't seem like a huge problem either.  But there's a lot we're
      not obliged to handle.  Unicode 6.1 section 15.6 describes four
      invisible mathematical characters:
      - an invisible separator for i<here>j subscripts and superscripts,
      - an invisible multiplication operator,
      - an invisible addition operator,
      - an invisible function application operator, to distinguish
         f<APP>x from f<TIMES>x.
      Does Erlang need to support those?  Nope.  Does Erlang have to support
      "20 different dash-like symbols"?  Nope.  More precisely, there is no
      obligation to process *as Erlang syntax* any Unicode character just
      because it is allowed *as data*.
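
To make (b) concrete, here is a minimal sketch of what "Zs as separating
white space" and "end % comments at LINE SEPARATOR and PARAGRAPH SEPARATOR"
could look like over a list of code points.  The module and function names
are made up for illustration, the Zs table is typed in by hand (its exact
membership varies a little between Unicode versions), and none of this is
what the real Erlang scanner does.

```erlang
-module(uni_ws).
-export([is_zs/1, ends_comment/1, skip_comment/1]).

%% Hand-written table of General_Category=Zs code points.
%% Assumption: a recent Unicode version; older versions also had U+180E.
is_zs(16#0020) -> true;                              % SPACE
is_zs(16#00A0) -> true;                              % NO-BREAK SPACE
is_zs(16#1680) -> true;                              % OGHAM SPACE MARK
is_zs(C) when C >= 16#2000, C =< 16#200A -> true;    % EN QUAD .. HAIR SPACE
is_zs(16#202F) -> true;                              % NARROW NO-BREAK SPACE
is_zs(16#205F) -> true;                              % MEDIUM MATHEMATICAL SPACE
is_zs(16#3000) -> true;                              % IDEOGRAPHIC SPACE
is_zs(_) -> false.

%% Characters that would terminate a % comment in this sketch.
ends_comment($\n)     -> true;
ends_comment(16#2028) -> true;                       % LINE SEPARATOR
ends_comment(16#2029) -> true;                       % PARAGRAPH SEPARATOR
ends_comment(_)       -> false.

%% Drop a comment starting at "%"; the terminator itself is kept in the
%% returned remainder so that the caller can treat it as a line end.
skip_comment([$% | Rest]) ->
    lists:dropwhile(fun(C) -> not ends_comment(C) end, Rest).
```

Note that the invisible mathematical operators simply fall into the
"false" clauses: they stay legal as data without being given any meaning
as syntax.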

It has to be admitted that Unicode is appallingly complex, and that some
things have crept into it (like language tags) which are now strongly
deprecated.  Let me change a word: s/appallingly/terrifyingly/.  *Full*
support of *everything* in this dauntingly large and still growing
standard may never come in my lifetime.  That doesn't mean we shouldn't
do *anything*.
>
> Of course you may say me: hire programmer that makes such things. Ok, 
> no problems. But what to do with copy-paste from skype/slack, where 
> such symbols are translated into nice utf8 automatically?

Invert the translation automatically.  If your editor can't do that (and
to be honest, mine can't), program a command "unskype-region".  And if
you can't do that, write an outboard program unskype-file to do it.
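
For what it's worth, an outboard "unskype" need not be much code.  The
sketch below is only an illustration: the module name, the function names,
and above all the substitution table are my guesses at what a chat client
might have done, not a record of what skype or slack actually do.  The two
dash rules in particular are assumptions you would have to check against
your own tools.

```erlang
-module(unskype).
-export([region/1, file/2]).

%% region/1: map a list of Unicode code points back to plain ASCII
%% wherever the (assumed) table below knows an inverse.
region(Chars) when is_list(Chars) ->
    lists:flatmap(fun plain/1, Chars).

%% Guessed inverse table; extend or correct it for your own tools.
plain(16#00A0) -> " ";      % NO-BREAK SPACE
plain(16#2018) -> "'";      % LEFT SINGLE QUOTATION MARK
plain(16#2019) -> "'";      % RIGHT SINGLE QUOTATION MARK
plain(16#201C) -> "\"";     % LEFT DOUBLE QUOTATION MARK
plain(16#201D) -> "\"";     % RIGHT DOUBLE QUOTATION MARK
plain(16#2013) -> "-";      % EN DASH (assumed to come from "-")
plain(16#2014) -> "--";     % EM DASH (assumed to come from "--")
plain(16#2026) -> "...";    % HORIZONTAL ELLIPSIS
plain(C)       -> [C].      % everything else passes through unchanged

%% file/2: read a UTF-8 file, clean it, write the result as UTF-8.
%% Crashes (deliberately) if the input is not valid UTF-8.
file(In, Out) ->
    {ok, Bin} = file:read_file(In),
    Chars = unicode:characters_to_list(Bin, utf8),
    file:write_file(Out, unicode:characters_to_binary(region(Chars))).
```

So unskype:region/1 applied to a pasted 'my package' with curly quotes and
an NBSP gives back the plain ASCII text.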

If the translation done by skype/slack is not invertible, then you have
a major problem due to using buggy software that destroys information
without warrant.

The one thing that's likely to be a problem for Erlang is "-" (number
subtraction) vs "--" (list difference) where if you're handed some other
"dash-like symbol", it might not be clear what to do.  If anyone's
interested I could suggest some ideas.

Let's revert to "my package", which as noted would NOT be acceptable as an
unquoted identifier under Unicode rules.  There *is* a problem here.
     'my package' (ASCII space)
     'my package' (pasted from a file where it's Latin-1 NBSP)
are visually indistinguishable.  But there is a well-trodden path to
safety in such cases: a programming language may ban white space
characters other than the plain space from string literals, and some
programming languages do.  If you want something else, you have to use
some sort of escape mechanism.
I note that we ALREADY have this problem in ASCII, where it may be
impossible to tell a space from a tab.
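
A check of that kind is easy to sketch.  The following is a hypothetical
lint, not anything the Erlang compiler does: the module name and the
choice of which characters count as suspicious are my assumptions.  It
takes the contents of a quoted atom or string as a list of code points and
reports every white space character other than the plain ASCII space,
with its position.

```erlang
-module(atom_ws).
-export([check/1]).

%% check/1 returns ok, or {error, [{Position, CodePoint}]} listing each
%% white space character that is not U+0020.  Positions count from 1.
check(Chars) when is_list(Chars) ->
    Indexed = lists:zip(lists:seq(1, length(Chars)), Chars),
    case [{Pos, C} || {Pos, C} <- Indexed, suspicious(C)] of
        []  -> ok;
        Bad -> {error, Bad}
    end.

%% White space that should be written with an escape instead.
%% Hand-written table; assumed, not taken from any standard library.
suspicious($\t)     -> true;                         % the ASCII case
suspicious(16#00A0) -> true;                         % NO-BREAK SPACE
suspicious(16#1680) -> true;                         % OGHAM SPACE MARK
suspicious(C) when C >= 16#2000, C =< 16#200A -> true;
suspicious(16#2028) -> true;                         % LINE SEPARATOR
suspicious(16#2029) -> true;                         % PARAGRAPH SEPARATOR
suspicious(16#202F) -> true;                         % NARROW NO-BREAK SPACE
suspicious(16#205F) -> true;                         % MEDIUM MATHEMATICAL SPACE
suspicious(16#3000) -> true;                         % IDEOGRAPHIC SPACE
suspicious(_)       -> false.
```

With this, the ASCII version of "my package" passes and the NBSP version
is reported as {error, [{3, 160}]}, which is the distinction the eye
cannot make.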
> It is very good that we all have about 80-90 symbols to write code 
> that other people understand, but I really don't understand what is 
> the profit of adding ability to make code non-understandable by people 
> from other cultures.
That ability ALREADY EXISTS.  (Swahili, for example.)

The ability to make code non-understandable by other people
already existed in plain ASCII.  Here's a function from one of the
standard library modules in OTP 18.2, where I've replaced identifiers
local to the module or function by random 2- or 3-letter words
from an English Scrabble dictionary.

ken(SYN, _NAG) ->
     EFF = bra(),
     case catch woo(SYN) of
         {'EXIT',REB} -> nus(EFF), exit(REB);
         {error,REF,AX} -> {error, [{nus(EFF), [{REF, ?MODULE, AX}]}], []};
         EME ->
             case arf() of
                 [] -> nus(EFF), EME;
                 HO -> YAM = nus(EFF), {warning, EME,
                       [{YAM, [{ALA, ?MODULE, AX}]} || {ALA,AX} <- HO]}
             end
     end.

Why isn't this already a problem?  Because Erlang programmers don't
*want* to write bizarre code.  I think at some point you just have to
trust people[%] to be sensible.

[%] Standards committees excepted (:-).
