[erlang-questions] literal character syntax

Richard A. O'Keefe ok@REDACTED
Mon May 19 07:50:11 CEST 2014


On 17/05/2014, at 2:30 AM, zxq9 wrote:
> No. "Whitespace" means a whole slew of things, including, in this case, "non 
> printable characters". "Whitespace" and "non-printable characters" are 
> uselessly broad definitions in the days of Unicode Erlang.

Ah, Unicode!

Unicode 6.3 has 18 characters in category Zs.
(ZERO WIDTH SPACE isn't a Zs any more, it's a Cf.
 However MONGOLIAN VOWEL SEPARATOR and MEDIUM MATHEMATICAL SPACE
 have made up for the loss.)

The following code fragment is a list of character literals
separated by commas.  Each comma is followed by a plain old
U+0020 space.  The space in the first literal is also a plain
U+0020, but every other space that does not follow a comma is
a *different* character.

[$ , $ , $ , $ , $ , $ , $ , $ , $ ]

I generated these by writing the following program:

#include <locale.h>
#include <stdio.h>
#include <wchar.h>

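/* Emit one Erlang character literal: '$' followed by code point c.
   flag != 0: more literals follow, so emit ", ".
   flag == 0: this is the last literal, so close the list. */
static void foo(int c, int flag) {
    putwc('$', stdout);
    putwc(c,   stdout);
    if (flag) putwc(',', stdout), putwc(' ',  stdout);
    else      putwc(']', stdout), putwc('\n', stdout);
}

int main(void) {
    setlocale(LC_CTYPE, "en_NZ.UTF-8");  /* wide output encoded as UTF-8 */
    fwide(stdout, 1);                    /* make stdout wide-oriented */
    putwc('[', stdout);
    foo(0x0020, 1);    /* SPACE */
    foo(0x00A0, 1);    /* NO-BREAK SPACE */
    foo(0x2002, 1);    /* EN SPACE */
    foo(0x2003, 1);    /* EM SPACE */
    foo(0x2004, 1);    /* THREE-PER-EM SPACE */
    foo(0x2007, 1);    /* FIGURE SPACE */
    foo(0x202F, 1);    /* NARROW NO-BREAK SPACE */
    foo(0x205F, 1);    /* MEDIUM MATHEMATICAL SPACE */
    foo(0x3000, 0);    /* IDEOGRAPHIC SPACE */
    return 0;
}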

I wrote the output to a file, verified with od(1) that the
file contained the bytes it was supposed to, opened that
file in TextEdit, and copied-and-pasted the line to Mail.
I've attached a copy of the file with the different spaces
so that you can view it in your preferred editor to see if
you can tell the difference.  In Mail I cannot, in TextEdit
I cannot, in Aquamacs I can if I look really closely.
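
If od(1) is not to hand, the same check can be sketched in C: decode
the UTF-8 bytes and print each code point, so that look-alike spaces
show up as different numbers.  This is only a minimal sketch.  The
names decode_utf8 and dump_code_points are hypothetical, and the
decoder assumes reasonably well-formed input (it does not reject
overlong forms or surrogates).

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Decode one UTF-8 sequence starting at s, with n bytes available.
 * Stores the code point in *cp and returns the number of bytes used,
 * or 0 if the sequence is truncated or malformed.
 * Hypothetical helper: overlong forms and surrogates are not rejected. */
size_t decode_utf8(const unsigned char *s, size_t n, unsigned long *cp) {
    size_t len, i;
    assert(s != NULL && cp != NULL);
    if (n == 0) return 0;
    if (s[0] < 0x80)                { *cp = s[0];        len = 1; }
    else if ((s[0] & 0xE0) == 0xC0) { *cp = s[0] & 0x1F; len = 2; }
    else if ((s[0] & 0xF0) == 0xE0) { *cp = s[0] & 0x0F; len = 3; }
    else if ((s[0] & 0xF8) == 0xF0) { *cp = s[0] & 0x07; len = 4; }
    else return 0;                  /* stray continuation or invalid lead byte */
    if (len > n) return 0;          /* sequence runs past the end of the buffer */
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;
        *cp = (*cp << 6) | (s[i] & 0x3F);
    }
    return len;
}

/* Print every code point in the n-byte buffer s as U+XXXX, one per line.
 * Returns 0 on success, -1 if a malformed sequence is encountered. */
int dump_code_points(const unsigned char *s, size_t n, FILE *out) {
    while (n > 0) {
        unsigned long cp;
        size_t used = decode_utf8(s, n, &cp);
        if (used == 0) return -1;
        fprintf(out, "U+%04lX\n", cp);
        s += used;
        n -= used;
    }
    return 0;
}
```

Feeding the pasted line through dump_code_points makes the nine
different characters after the $ signs distinguishable even when the
editor renders them identically.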

All that Unicode has done for us here is to make ($ )
*more* dangerous than it already was.  (And this is without
considering the 145 'Cf' characters in Unicode 6.3, amongst
which TAB and LF may be found.  The thought of someone
innocently copying a "character" from a file and dropping
it in just after a $ sign without realising that the "character"
is a sequence beginning with a language tag or a digit shape
selector does *not* appeal.)

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: spaces.txt
URL: <http://erlang.org/pipermail/erlang-questions/attachments/20140519/cdda28a0/attachment.txt>
-------------- next part --------------


> 
> Were you taught how to write a space in school?

Yes.

> How about how to write a tab?

Not me personally, but the students who took typing, yes.

> Columnar delimitation? How large is a space in a file?

With Unicode, isn't the answer *always* "it depends"?
> 
> It seems like the general feeling is to impose a new restriction in the 
> language. It's not a game-changer either way, but I wanted to present a case 
> for not introducing special-case rules that outlaw a rather broader category 
> of characters than the OP may have realized in one very specific circumstance 
> -- just to make sure this is thoroughly considered before someone spends time 
> tinkering with the parser yet again.

I *DO* want to make a case for reporting a warning for *ALL* characters
that do not result in a visible mark.  And I think the example above
makes that case very strongly.
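
To make the kind of check I have in mind concrete, here is a rough
sketch in C.  It is not how the Erlang scanner works and not a
proposed patch.  The function name leaves_no_visible_mark and the
range table are assumptions made for illustration, and the table is
deliberately incomplete: a real check would consult the Unicode
Character Database general categories (Cc, Cf, Zs, Zl, Zp) rather
than a hand-written list.

```c
#include <assert.h>
#include <stddef.h>

struct range { unsigned long lo, hi; };

/* Illustrative, NOT exhaustive: code points that normally leave no
 * visible mark on the page.  Assumption: hand-picked ranges only. */
static const struct range no_mark[] = {
    {0x0000,  0x0020},  /* C0 controls (incl. TAB, LF) and SPACE */
    {0x007F,  0x00A0},  /* DEL, C1 controls, NO-BREAK SPACE */
    {0x00AD,  0x00AD},  /* SOFT HYPHEN */
    {0x2000,  0x200F},  /* EN QUAD..HAIR SPACE, zero-width and directional marks */
    {0x2028,  0x202F},  /* line/paragraph separators, bidi controls, NARROW NO-BREAK SPACE */
    {0x205F,  0x2064},  /* MEDIUM MATHEMATICAL SPACE, WORD JOINER, invisible operators */
    {0x206A,  0x206F},  /* deprecated format characters, incl. digit shape selectors */
    {0x3000,  0x3000},  /* IDEOGRAPHIC SPACE */
    {0xFEFF,  0xFEFF},  /* ZERO WIDTH NO-BREAK SPACE */
    {0xE0001, 0xE0001}, /* LANGUAGE TAG */
    {0xE0020, 0xE007F}  /* tag characters */
};

/* Return 1 if cp is in the table above, i.e. a $-literal built from it
 * would deserve a warning under this sketch; return 0 otherwise. */
int leaves_no_visible_mark(unsigned long cp) {
    size_t i;
    assert(cp <= 0x10FFFF);  /* must be a valid Unicode code point */
    for (i = 0; i < sizeof no_mark / sizeof no_mark[0]; i++)
        if (cp >= no_mark[i].lo && cp <= no_mark[i].hi)
            return 1;
    return 0;
}
```

A scanner using something like this would warn on every one of the
nine literals in the list above, and stay silent for $a or $é.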



