some lexical questions

Mon Jul 12 16:10:56 CEST 2010

Hello,

I've been trying to make an Erlang scanner/parser from a PEG-style
grammar and some specialized parser combinators.  The grammar at the
end of this mail can tokenize all but two files in the standard OTP
distribution. It's not blindingly efficient but it does the job and is
far easier to understand than erl_scan and friends. My goal is to
start with a grammar like the one below and transform it into
something as efficient as erl_scan ..

Two files defeat my scanner:

        /usr/local/lib/erlang/lib/stdlib-1.16.5/src/edlin.erl
        /usr/local/lib/erlang/lib/asn1-1.6.13/src/asn1ct.erl

They contain character sequences not allowed by the grammar below,
which also appear to be illegal according to the documentation.

Now I have to ask:

1) Is there a "definitive" reference for the meaning of the
allowed escape sequences within a string?

The best I've found is in the on-line documentation in section 2.14
of http://www.erlang.org/doc/reference_manual/data_types.html

But the shell seems to parse "\^[" as [27] - now either table 2.14 is
incorrect or the implementation is incorrect, since following \^ there
should only be a character matching [a-zA-Z].
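
To make the [27] result concrete, here is a small Python sketch. It
assumes (this is a guess about the implementation, not something the
documentation states) that the scanner computes \^X by keeping only the
low five bits of X. The function names are made up for illustration.

```python
def control_escape(ch: str) -> int:
    """Value of the escape \\^ch, ASSUMING the scanner masks the low 5 bits."""
    return ord(ch) & 31


def allowed_by_table(ch: str) -> bool:
    """True if ch matches [a-zA-Z], the only characters I expected after \\^."""
    return len(ch) == 1 and ch.isascii() and ch.isalpha()


# '[' is character 91; masking with 31 gives 27 (ESC), matching the shell.
print(control_escape("["), allowed_by_table("["))
# 'a' and 'A' both map to 1 under the same assumption.
print(control_escape("a"), control_escape("A"))
```

Under that assumption any character is accepted after \^, which would
explain why "\^[" scans even though "[" is outside [a-zA-Z].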

Also, \X where X is none of the alternatives in table 2.14 defaults to
X - but by table 2.14 it should be an error, since X is none of the
defined alternatives. The shell thinks "\a" is the same as "a", which I
find confusing; in any case it's an undocumented feature.
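
The two readings can be put side by side in a short Python sketch. The
table below is my reading of the single-character escapes in the
reference manual and may be incomplete; the function name and the
`strict` flag are hypothetical, and this is not how erl_scan is written.

```python
# Single-character escapes as I read them from the table (assumed subset).
NAMED_ESCAPES = {
    "b": 8, "d": 127, "e": 27, "f": 12, "n": 10,
    "r": 13, "s": 32, "t": 9, "v": 11,
    "'": 39, '"': 34, "\\": 92,
}


def decode_simple_escape(ch: str, strict: bool = False) -> int:
    """Decode the character following a backslash.

    strict=True  follows the table literally: unknown escapes are errors.
    strict=False mimics what the shell appears to do: \\X means X.
    """
    if ch in NAMED_ESCAPES:
        return NAMED_ESCAPES[ch]
    if strict:
        raise ValueError("undefined escape sequence: \\" + ch)
    return ord(ch)


print(decode_simple_escape("n"))   # newline, defined in the table
print(decode_simple_escape("a"))   # lenient reading: same as plain "a"
```

With `strict=True` the call for "a" raises an error instead, which is the
behaviour I would have expected from the documentation.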

2) There is a strange character in line 1678 of asn1ct.erl
   Emacs thinks this character is #xA0 encoded with iso-latin-1-unix

   What is this strange character?
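
   If Emacs is right about the encoding, byte #xA0 in ISO-8859-1 is the
   no-break space. A rough Python sketch for locating such bytes in a
   file follows; the function name is made up, and decoding as latin-1
   is an assumption about the file.

```python
import unicodedata


def find_non_ascii(data: bytes, encoding: str = "latin-1"):
    """Return (line, column, byte, name) for every byte above 0x7F."""
    hits = []
    for line_no, line in enumerate(data.split(b"\n"), start=1):
        for col, byte in enumerate(line, start=1):
            if byte > 0x7F:
                ch = bytes([byte]).decode(encoding)
                name = unicodedata.name(ch, "UNKNOWN")
                hits.append((line_no, col, hex(byte), name))
    return hits


# Tiny in-memory example; for a real file pass open(path, "rb").read().
sample = b"foo\nbar\xa0baz\n"
print(find_non_ascii(sample))
```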

   /Joe

PS - the Erlang lexical grammar so far is:

    start     <- spaces / comment / DOT
               / ATOM / VAR
               / QATOM / STRING / FLOAT / CHAR / INTEGER / recorddef
               / recordsel;
    spaces    <- ws+;
    recorddef <- '#';
    recordsel <- '.';
    DOT       <- '.' & '%' / '.' ws;
    ws        <- '\r' / '\n' / '\s' / '\t' / '\14';
    CHAR      <- '$' schar;
    INTEGER   <- [0-9]+ '#' [0-9a-fA-F]+ / [0-9]+ ;
    FLOAT     <- [0-9]+ '.' [0-9]+ ('e' ('+'/'-')? [0-9]+)? ;
    ATOM      <- [a-z][a-zA-Z0-9_@]* ;
    VAR       <- [_A-Z][a-zA-Z0-9_]*;
    QATOM     <- ['] (![']  schar)* ['];
    STRING    <- '"' (! '"' schar)* '"';
    schar     <- [\\] echar
               / .;
    echar     <-  [0-3][0-7][0-7]
               /  [0-7][0-7]
               /  [0-7]
               /  '^' .
               /  .;
    comment   <- '%' eat_line
               / '/*' ((!eof / ! '*/') .)* '*/';
    eat_line  <- (!eol .)*;
    eol       <- '\r\n' / '\n';
    eof       <- !.;

This seems far easier to understand than a yacc or LL(k) grammar :-)
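
To show how directly rules like these turn into code, here is a
hand-written Python sketch of STRING, schar and echar only. It is not the
parser combinator library mentioned above; the function names are
invented, and each function returns the index after the match or None on
failure, trying alternatives in order as a PEG does.

```python
OCTAL = "01234567"


def match_echar(s, i):
    """echar <- [0-3][0-7][0-7] / [0-7][0-7] / [0-7] / '^' . / ."""
    if i >= len(s):
        return None
    if i + 2 < len(s) and s[i] in "0123" and s[i + 1] in OCTAL and s[i + 2] in OCTAL:
        return i + 3
    if i + 1 < len(s) and s[i] in OCTAL and s[i + 1] in OCTAL:
        return i + 2
    if s[i] in OCTAL:
        return i + 1
    if s[i] == "^" and i + 1 < len(s):
        return i + 2
    return i + 1  # last alternative: any single character


def match_schar(s, i):
    """schar <- [\\\\] echar / ."""
    if i >= len(s):
        return None
    if s[i] == "\\":
        j = match_echar(s, i + 1)
        if j is not None:
            return j
    return i + 1  # second alternative: any single character


def match_string(s, i=0):
    """STRING <- '"' (! '"' schar)* '"'"""
    if i >= len(s) or s[i] != '"':
        return None
    i += 1
    while i < len(s) and s[i] != '"':
        i = match_schar(s, i)
    if i < len(s) and s[i] == '"':
        return i + 1
    return None


print(match_string('"abc"'))        # whole input is one string token
print(match_string('"\\^["'))       # the \^[ case is accepted by echar
print(match_string('"unterminated'))
```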