some lexical questions
Mon Jul 12 16:10:56 CEST 2010
I've been trying to make an Erlang scanner/parser from a PEG-style
grammar and some specialized parser combinators. The grammar at the
end of this mail can tokenize all but two files in the standard OTP
distribution. It's not blindingly efficient but it does the job and is
far easier to understand than erl_scan and friends. My goal is to
start with a grammar like the one below and transform it into
something as efficient as erl_scan.
Two files defeat my scanner
and contain character sequences not allowed by the grammar below - and
which also appear to be illegal according to the documentation.
Now I have to ask:
1) Is there a "definitive" reference for the meaning of the
allowed escape sequences within a string?
The best I've found is in the on-line documentation in section 2.14.
But the shell seems to parse "\^[" as  - now either table 2.14 is
incorrect or the implementation is incorrect since following \^ there
should only be a character matching [a-zA-Z].
Also \X where X is none of the alternatives in table 2.14 defaults to
X - but it should be an error by table 2.14 since X is none of the defined
alternatives. The shell thinks "\a" is the same as "a" but I think this
is confusing; in any case it's an undocumented feature.
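To make the difference concrete, here is a rough Python model of the two readings: a strict one that follows the table, and a lenient one that behaves the way the shell appears to. This is only a sketch, not erl_scan's actual code. The function name decode_escape, the NAMED table, and the rule that \^X masks the low five bits of X are all assumptions made for illustration; under that assumed rule "\^[" would come out as 27 (ESC).

```python
import string

# Named escapes, as an assumed subset of what the documentation lists.
NAMED = {"b": 8, "d": 127, "e": 27, "f": 12, "n": 10,
         "r": 13, "s": 32, "t": 9, "v": 11}

OCTAL_DIGITS = "01234567"


def decode_escape(body, strict=False):
    """Decode the text that follows a backslash.

    Returns (code_point, characters_consumed). With strict=True anything
    outside the documented alternatives raises ValueError.
    """
    first = body[0]
    if first == "^":
        ctrl = body[1]
        if strict and ctrl not in string.ascii_letters:
            raise ValueError("\\^ must be followed by a letter")
        # Assumed rule: keep only the low five bits of the character.
        return ord(ctrl) & 31, 2
    if first in OCTAL_DIGITS:
        digits = first
        for ch in body[1:3]:
            if ch not in OCTAL_DIGITS:
                break
            digits += ch
        return int(digits, 8), len(digits)
    if first in NAMED:
        return NAMED[first], 1
    if strict:
        raise ValueError("undefined escape: \\" + first)
    # Lenient reading: \X is just X.
    return ord(first), 1
```

With strict=False, decode_escape("a") gives the code for "a", which matches what the shell does with "\a"; with strict=True both "a" and "^[" are rejected, which is what the table seems to require.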
2) There is a strange character in line 1678 of asn1ct.erl.
Emacs thinks this character is #xA0 encoded with iso-latin-1-unix.
What is this strange character?
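One way to pin such a character down is to scan the raw bytes and look up each non-ASCII byte by name. The sketch below assumes the file really is ISO-8859-1, as Emacs reports; in that encoding #xA0 is the no-break space. The function name find_non_ascii is made up for this example.

```python
import unicodedata


def find_non_ascii(data, encoding="latin-1"):
    """Return (line, column, hex byte, character name) for each byte > 0x7F.

    Lines and columns are 1-based; columns count bytes, not characters.
    """
    hits = []
    for line_no, line in enumerate(data.split(b"\n"), start=1):
        for col, byte in enumerate(line, start=1):
            if byte > 0x7F:
                ch = bytes([byte]).decode(encoding)
                name = unicodedata.name(ch, "UNNAMED")
                hits.append((line_no, col, hex(byte), name))
    return hits
```

Reading the file with open(path, "rb").read() and passing the result in lists every suspicious byte with its position, which also shows whether line 1678 is the only place it occurs.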
PS - the Erlang lexical grammar so far is:
start <- spaces / comment / DOT
/ ATOM / VAR
/ QATOM / STRING / FLOAT / CHAR / INTEGER / recorddef;
spaces <- ws+;
recorddef <- '#';
recordsel <- '.';
DOT <- '.' & '%' / '.' ws;
ws <- '\r' / '\n' / '\s' / '\t' / '\14';
CHAR <- '$' schar;
INTEGER <- [0-9]+ '#' [0-9a-fA-F]+ / [0-9]+ ;
FLOAT <- [0-9]+ '.' [0-9]+ ('e' ('+'/'-')? [0-9]+)? ;
ATOM <- [a-z][a-zA-Z0-9_@]* ;
VAR <- [_A-Z][a-zA-Z0-9_]*;
QATOM <- ['] (!['] schar)* ['];
STRING <- '"' (! '"' schar)* '"';
schar <- [\\] echar;
echar <- [0-3][0-7][0-7]
/ '^' .;
comment <- '%' eat_line
/ '/*' (! '*/' .)* '*/';
eat_line <- (!eol .)*;
eol <- '\r\n' / '\n';
eof <- !.;
This seems far easier to understand than a yacc or LL(k) grammar :-)
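For anyone who wants to see how a few of the rules above map onto parser combinators, here is a minimal Python sketch. It is not the combinator library I am using; every name in it (lit, rx, seq, choice, many, not_, tokenize) is hypothetical, and it covers only spaces, '%' comments, ATOM, VAR and INTEGER. Each parser takes the text and a position and returns the new position, or None on failure.

```python
import re


def lit(s):
    """Match the literal string s."""
    return lambda text, pos: pos + len(s) if text.startswith(s, pos) else None


def rx(pattern):
    """Match a regular expression anchored at the current position."""
    compiled = re.compile(pattern)

    def parse(text, pos):
        m = compiled.match(text, pos)
        return m.end() if m else None
    return parse


def seq(*parsers):
    """PEG sequence: every parser must succeed in turn."""
    def parse(text, pos):
        for p in parsers:
            pos = p(text, pos)
            if pos is None:
                return None
        return pos
    return parse


def choice(*parsers):
    """PEG ordered choice: the first alternative that succeeds wins."""
    def parse(text, pos):
        for p in parsers:
            end = p(text, pos)
            if end is not None:
                return end
        return None
    return parse


def many(p):
    """PEG star: repeat p until it fails or stops consuming input."""
    def parse(text, pos):
        while True:
            end = p(text, pos)
            if end is None or end == pos:
                return pos
            pos = end
    return parse


def not_(p):
    """PEG not-predicate: succeed without consuming iff p fails."""
    return lambda text, pos: pos if p(text, pos) is None else None


any_char = rx(r"(?s).")
eol = choice(lit("\r\n"), lit("\n"))
eat_line = many(seq(not_(eol), any_char))

TOKENS = [
    ("spaces", rx(r"[\r\n \t\x0c]+")),
    ("comment", seq(lit("%"), eat_line)),
    ("ATOM", rx(r"[a-z][a-zA-Z0-9_@]*")),
    ("VAR", rx(r"[_A-Z][a-zA-Z0-9_]*")),
    ("INTEGER", choice(rx(r"[0-9]+#[0-9a-fA-F]+"), rx(r"[0-9]+"))),
]


def tokenize(text):
    """Split text into (rule name, lexeme) pairs, trying rules in order."""
    pos, out = 0, []
    while pos < len(text):
        for name, parser in TOKENS:
            end = parser(text, pos)
            if end is not None and end > pos:
                out.append((name, text[pos:end]))
                pos = end
                break
        else:
            raise ValueError("no rule matches at offset %d" % pos)
    return out
```

The order of TOKENS plays the role of the ordered choice in the start rule, so a rule listed earlier shadows a later one that would match the same prefix.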