[erlang-questions] Leex scanners and default token matching

Sat Jun 30 13:31:04 CEST 2012

Hi all,

I've got a simple Leex scanner that appears to go into a non-terminating state for certain inputs, consuming 100% CPU and quickly eating up all available memory. I found this *very* surprising: should the generated scanner really be able to get itself into this state? Is there some way for me to provide a default rule that will execute if no other regex matches, so that I can return {error, Reason} in that case?

Below is a copy of the .xrl file. Running scanner:string("aa = !") causes it to hang. Am I missing some obvious way of preventing this?

Definitions.
COMMA   = [,]
PARENS  = [\(\)]
L       = [A-Za-z_\$]
D       = [0-9-]
F       = (\+|-)?[0-9]+\.[0-9]+((E|e)(\+|-)?[0-9]+)?
HEX     = 0x[0-9A-Fa-f]+
WS      = ([\000-\s]|%.*)
S       = ({COMMA}|{PARENS})
CMP     = (=|>|>=|<|<=|<>)
AOP     = (\\+|-|\\*|/)

Rules.

LIKE                : {token, {op_like, TokenLine, like}}.
IN                  : {token, {op_in, TokenLine, in}}.
AND                 : {token, {op_and, TokenLine, conjunction}}.
OR                  : {token, {op_or, TokenLine, disjunction}}.
NOT                 : {token, {op_not, TokenLine, negation}}.
IS{WS}NULL          : {token, {op_null, TokenLine, is_null}}.
IS{WS}NOT{WS}NULL   : {token, {op_null, TokenLine, not_null}}.
BETWEEN             : {token, {op_between, TokenLine, range}}.
ESCAPE              : {token, {escape, TokenLine, escape}}.
{CMP}               : {token, {op_cmp, TokenLine, atomize(TokenChars)}}.
{AOP}               : {token, {op_arith, TokenLine, atomize(TokenChars)}}.
{L}({L}|{D})*       : {token, {ident, TokenLine, TokenChars}}.
'([^''])*'          : {token, {lit_string, TokenLine, strip(TokenChars,TokenLen)}}.
{S}                 : {token, {list_to_atom(TokenChars),TokenLine}}.
{D}+                : {token, {lit_int, TokenLine, list_to_integer(TokenChars)}}.
{F}                 : {token, {lit_flt, TokenLine, list_to_float(TokenChars)}}.
{HEX}               : {token, {lit_hex, TokenLine, hex_to_int(TokenChars)}}.
{WS}+               : skip_token.

Erlang code.

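% Drop the surrounding single quotes from a string literal.
strip(TokenChars,TokenLen) ->
    lists:sublist(TokenChars, 2, TokenLen - 2).

% Skip the "0x" prefix and read the remaining digits as a base-16 integer.
hex_to_int([_,_|R]) ->
    {ok,[Int],[]} = io_lib:fread("~16u", R),
    Int.

% Turn the matched operator text into an atom.
atomize(TokenChars) ->
    list_to_atom(TokenChars).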
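
For reference, here is a minimal sketch of the kind of default rule I am asking about. It assumes that leex prefers the longest match and, on a tie, the rule listed first, so a single-character rule placed last in the Rules section would only fire when nothing else matches a non-empty token. The error text is made up for illustration, and I have not confirmed that this avoids the hang:

{WS}+               : skip_token.
.                   : {error, "unexpected character: " ++ TokenChars}.

As far as I can tell from the leex documentation, an {error, ErrString} action should make string/1 return an {error, ErrorInfo, EndLine} tuple instead of a token list. It may also be worth checking AOP: if leex reads \\ as an escaped backslash, then \\* could match the empty string, and a scanner that accepts an empty token would never advance past a character such as "!".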