[erlang-questions] Leex scanners and default token matching

Sun Jul 1 04:12:01 CEST 2012

The reason for the infinite loop is the macro definition:

AOP     = (\\+|-|\\*|/)

The double \\ means you are quoting the '\', not the '+' and '*'. So that regex means:

match one-or-more '\'
or
match '-'
or
match zero-or-more '\'    <<<===
or
match '/'

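The third alternative matches zero characters at any position. So for a character that no other rule matches you still get a match, but no character is consumed and the scanner loops over the same character again. For characters that some rule does match this is not a problem, as you always get the longest match, which is always longer than the empty match. That is also what happens when you add your ILLEGAL rule: the characters it covers now get a non-empty match, so the empty match is never chosen for them.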
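
A rough way to see the empty match is to try both spellings of the macro with Erlang's re module. This is only an approximation of what leex does: re is PCRE and takes the first alternative that matches, whereas leex takes the longest match, but both agree that the double-backslash version matches the empty string. The module name aop_demo and its two functions are made up for this sketch.

```erlang
-module(aop_demo).
-export([first_match/2, run/0]).

%% Anchored match at the start of Input.
%% Returns {Offset, Length} of the match, or nomatch.
first_match(Input, Regex) ->
    case re:run(Input, Regex, [anchored, {capture, first, index}]) of
        {match, [Span]} -> Span;
        nomatch -> nomatch
    end.

run() ->
    %% Erlang string escaping doubles every backslash once more.
    Broken = "(\\\\+|-|\\\\*|/)",   % regex text: (\\+|-|\\*|/)
    Fixed  = "(\\+|-|\\*|/)",       % regex text: (\+|-|\*|/)
    %% Broken: zero-length match on '!', so nothing is consumed.
    {0, 0} = first_match("!", Broken),
    %% Broken: even '+' is not matched as an operator.
    {0, 0} = first_match("+", Broken),
    %% Broken: what it really matches is a run of backslashes.
    {0, 2} = first_match("\\\\", Broken),
    %% Fixed: '!' does not match at all, '+' and '*' match one character.
    nomatch = first_match("!", Fixed),
    {0, 1} = first_match("+", Fixed),
    {0, 1} = first_match("*", Fixed),
    ok.
```

Calling aop_demo:run() from the shell checks all six cases by pattern matching and fails with a badmatch if any of them differs.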

Rules whose regex consists only of '*'-qualified parts are very dangerous, as they can match the empty string and so create a loop. I know of no good way to handle this, as it is not a bug; the scanner is doing what you told it to do. The only way would be to disallow empty matches.
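
As a sketch of how the .xrl file could avoid the empty match, assuming the intent of AOP was the four arithmetic operators: use a single backslash so that '+' and '*' are the quoted characters. The catch-all '.' rule at the end is my assumption about what a "default rule" could look like; it reuses the {error, String} return from your ILLEGAL rule and has to come after all other rules so that it only wins when nothing else matches.

```erlang
Definitions.
%% Single backslash: quotes the '+' and the '*', no alternative can be empty.
AOP     = (\+|-|\*|/)

Rules.
{AOP}   : {token, {op_arith, TokenLine, atomize(TokenChars)}}.
%% ... the other rules go here ...
%% Catch-all for any single character no earlier rule matched.
.       : {error, "Illegal character: " ++ TokenChars}.
```

Only the AOP line is needed to stop the loop; the catch-all just changes how unmatched characters are reported.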

Your string regex '([^''])*' looks a little strange.
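
For comparison, [^''] is a character class that lists the quote twice, so the rule accepts the same strings as the plainer form below. If the doubled quote was meant as an escaped quote inside the string, which is only a guess at the intent, the commented-out variant is one way to write that.

```erlang
Rules.
%% Same strings as '([^''])*': a quote, any non-quote characters, a quote.
'[^']*'          : {token, {lit_string, TokenLine, strip(TokenChars,TokenLen)}}.
%% Hypothetical alternative, if '' inside a string should stand for a quote:
%% '([^']|'')*'  : {token, {lit_string, TokenLine, strip(TokenChars,TokenLen)}}.
```

With the alternative, strip/2 would still leave the doubled quotes in the token value, so they would need to be collapsed separately.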

Robert

----- Original Message -----
> Well the 'solution' is probably as simple as filtering them out - not
> quite as onerous as I'd imagined once I realised that any character
> that isn't single quoted falls into a pretty small range. An extra
> macro:
> 
> ILLEGAL = ([^\s\w=><*/])
> 
> and then an extra Rule:
> 
> {ILLEGAL}+          : {error, "Illegal Character(s): " ++ TokenChars}.
> 
> Still, it should be documented that Leex *can* produce a scanner that
> is non-terminating. Even though there's nothing to say that it
> shouldn't, whilst the Yecc documentation explains something about
> the implementation that helps you understand how it works (LALR-1)
> there is nothing in the leex docs to suggest that I should *beware*
> of this. Luckily I decided to randomly generate input strings using
> PropEr and quickly came across the issue in my implementation.
> 
> What do others think about this? Should there be a caution in the
> documentation explaining this potential situation?
> 
> On 30 Jun 2012, at 12:31, Tim Watson wrote:
> 
> > Hi all,
> > 
> > I've got a simple Leex scanner, which appears to go into a
> > non-terminating state for certain inputs, consuming 100% CPU and
> > quickly eating up all available memory. I found this *very*
> > surprising - should the generated scanner really be able to get
> > itself into this state? Is there some way for me to provide a
> > default rule that will execute if no other regex matches, so I can
> > return {error, Reason} for this???
> > 
> > Below is a copy of the xrl file - running scanner:string("aa = !")
> > will cause it to hang. Am I missing some obvious way of preventing
> > this?
> > 
> > Definitions.
> > COMMA   = [,]
> > PARENS  = [\(\)]
> > L       = [A-Za-z_\$]
> > D       = [0-9-]
> > F       = (\+|-)?[0-9]+\.[0-9]+((E|e)(\+|-)?[0-9]+)?
> > HEX     = 0x[0-9]+
> > WS      = ([\000-\s]|%.*)
> > S       = ({COMMA}|{PARENS})
> > CMP     = (=|>|>=|<|<=|<>)
> > AOP     = (\\+|-|\\*|/)
> > 
> > Rules.
> > 
> > LIKE                : {token, {op_like, TokenLine, like}}.
> > IN                  : {token, {op_in, TokenLine, in}}.
> > AND                 : {token, {op_and, TokenLine, conjunction}}.
> > OR                  : {token, {op_or, TokenLine, disjunction}}.
> > NOT                 : {token, {op_not, TokenLine, negation}}.
> > IS{WS}NULL          : {token, {op_null, TokenLine, is_null}}.
> > IS{WS}NOT{WS}NULL   : {token, {op_null, TokenLine, not_null}}.
> > BETWEEN             : {token, {op_between, TokenLine, range}}.
> > ESCAPE              : {token, {escape, TokenLine, escape}}.
> > {CMP}               : {token, {op_cmp, TokenLine, atomize(TokenChars)}}.
> > {AOP}               : {token, {op_arith, TokenLine, atomize(TokenChars)}}.
> > {L}({L}|{D})*       : {token, {ident, TokenLine, TokenChars}}.
> > '([^''])*'          : {token, {lit_string, TokenLine, strip(TokenChars,TokenLen)}}.
> > {S}                 : {token, {list_to_atom(TokenChars),TokenLine}}.
> > {D}+                : {token, {lit_int, TokenLine, list_to_integer(TokenChars)}}.
> > {F}                 : {token, {lit_flt, TokenLine, list_to_float(TokenChars)}}.
> > {HEX}               : {token, {lit_hex, TokenLine, hex_to_int(TokenChars)}}.
> > {WS}+               : skip_token.
> > 
> > Erlang code.
> > 
> > strip(TokenChars,TokenLen) ->
> >    lists:sublist(TokenChars, 2, TokenLen - 2).
> > 
> > hex_to_int([_,_|R]) ->
> >    {ok,[Int],[]} = io_lib:fread("~16u", R),
> >    Int.
> > 
> > atomize(TokenChars) ->
> >    list_to_atom(TokenChars).
> > 