[erlang-questions] Leex scanners and default token matching

Sat Jun 30 13:47:00 CEST 2012

Well, the 'solution' is probably as simple as filtering them out. That is not quite as onerous as I'd imagined, once I realised that any character that isn't inside a single-quoted string falls into a pretty small range. An extra macro:

ILLEGAL = ([^\s\w=><*/])

and then an extra Rule:

{ILLEGAL}+          : {error, "Illegal Character(s): " ++ TokenChars}. 
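
For comparison, a more general way to get a default rule is a single-character catch-all placed at the very end of the Rules section. This is only a sketch: the error text is illustrative, and it assumes the rule is last, so that it only wins when no earlier rule matches at least one character.

```erlang
%% Hypothetical final rule. `.` matches any single character except
%% newline; newlines are already covered by {WS} in this scanner.
.                   : {error, "Illegal Character: " ++ TokenChars}.
```

Since leex prefers the longest match and breaks ties in favour of the earliest rule, a trailing one-character rule like this should not shadow any of the real tokens.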

Still, it should be documented that Leex *can* produce a non-terminating scanner. Nothing says that it shouldn't, but whereas the Yecc documentation explains enough about the implementation (LALR(1)) to help you understand how it works, there is nothing in the Leex docs to suggest that I should *beware* of this. Luckily I decided to randomly generate input strings using PropEr and quickly came across the issue in my implementation.
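
A property along these lines is enough to surface that kind of hang. It is a minimal sketch rather than my actual test: the module name `scanner_props` and the one-second limit are made up for illustration, and it assumes the generated module is called `scanner` and that PropEr is on the code path.

```erlang
-module(scanner_props).
-include_lib("proper/include/proper.hrl").
-export([prop_scanner_terminates/0]).

%% For any random string, scanner:string/1 must return within the time
%% limit, either with tokens or with an error tuple. A scanner that loops
%% forever fails the ?TIMEOUT wrapper instead of hanging the test run.
prop_scanner_terminates() ->
    ?FORALL(Input, string(),
            ?TIMEOUT(1000,
                     case scanner:string(Input) of
                         {ok, _Tokens, _EndLine} -> true;
                         {error, _ErrorInfo, _EndLine} -> true
                     end)).
```

It can be run from the shell with `proper:quickcheck(scanner_props:prop_scanner_terminates()).`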

What do others think about this? Should there be a caution in the documentation explaining this potential situation?

On 30 Jun 2012, at 12:31, Tim Watson wrote:

> Hi all,
> 
> I've got a simple Leex scanner, which appears to go into a non-terminating state for certain inputs, consuming 100% CPU and quickly eating up all available memory. I found this *very* surprising - should the generated scanner really be able to get itself into this state? Is there some way for me to provide a default rule that will execute if no other regex matches, so I can return {error, Reason} for this??? 
> 
> Below is a copy of the xrl file - running scanner:string("aa = !") will cause it to hang. Am I missing some obvious way of preventing this?
> 
> Definitions.
> COMMA   = [,]
> PARENS  = [\(\)]
> L       = [A-Za-z_\$]
> D       = [0-9-]
> F       = (\+|-)?[0-9]+\.[0-9]+((E|e)(\+|-)?[0-9]+)?
> HEX     = 0x[0-9]+
> WS      = ([\000-\s]|%.*)
> S       = ({COMMA}|{PARENS})
> CMP     = (=|>|>=|<|<=|<>)
> AOP     = (\\+|-|\\*|/)
> 
> Rules.
> 
> LIKE                : {token, {op_like, TokenLine, like}}.
> IN                  : {token, {op_in, TokenLine, in}}.
> AND                 : {token, {op_and, TokenLine, conjunction}}.
> OR                  : {token, {op_or, TokenLine, disjunction}}.
> NOT                 : {token, {op_not, TokenLine, negation}}.
> IS{WS}NULL          : {token, {op_null, TokenLine, is_null}}.
> IS{WS}NOT{WS}NULL   : {token, {op_null, TokenLine, not_null}}.
> BETWEEN             : {token, {op_between, TokenLine, range}}.
> ESCAPE              : {token, {escape, TokenLine, escape}}.
> {CMP}               : {token, {op_cmp, TokenLine, atomize(TokenChars)}}.
> {AOP}               : {token, {op_arith, TokenLine, atomize(TokenChars)}}.
> {L}({L}|{D})*       : {token, {ident, TokenLine, TokenChars}}.
> '([^''])*'          : {token, {lit_string, TokenLine, strip(TokenChars,TokenLen)}}.
> {S}                 : {token, {list_to_atom(TokenChars),TokenLine}}.
> {D}+                : {token, {lit_int, TokenLine, list_to_integer(TokenChars)}}.
> {F}                 : {token, {lit_flt, TokenLine, list_to_float(TokenChars)}}.
> {HEX}               : {token, {lit_hex, TokenLine, hex_to_int(TokenChars)}}.
> {WS}+               : skip_token.
> 
> Erlang code.
> 
> strip(TokenChars,TokenLen) ->
>    lists:sublist(TokenChars, 2, TokenLen - 2).
> 
> hex_to_int([_,_|R]) ->
>    {ok,[Int],[]} = io_lib:fread("~16u", R),
>    Int.
> 
> atomize(TokenChars) ->
>    list_to_atom(TokenChars).
>
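
One thing worth checking in the definitions above, offered as an assumption rather than a confirmed diagnosis: if leex reads `\\` as an escaped literal backslash, then `\\*` in AOP means "zero or more backslashes", which can match the empty string. A scanner that keeps accepting a zero-length token never advances through the input, which would fit the symptoms. A sketch of the macro with single backslashes, as already used for `\+` in F:

```erlang
%% Assumed intent: match one literal +, -, * or /.
%% With single backslashes the + and * are escaped literals, so every
%% alternative consumes exactly one character and none matches "".
AOP     = (\+|-|\*|/)
```

With a definition like this, input that no rule matches should come back as an error from scanner:string/1 instead of looping.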