[erlang-questions] Leex scanners and default token matching
Tim Watson
watson.timothy@REDACTED
Sat Jun 30 13:47:00 CEST 2012
Well, the 'solution' is probably as simple as filtering out the illegal characters - not quite as onerous as I'd imagined once I realised that any character that isn't single quoted falls into a pretty small range. An extra macro:
ILLEGAL = ([^\s\w=><*/])
and then an extra Rule:
{ILLEGAL}+ : {error, "Illegal Character(s): " ++ TokenChars}.
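A catch-all rule is another way to get the "default rule" asked about in the quoted message. The fragment below is only a sketch, assuming it is placed as the very last rule in the Rules section. It relies on leex preferring the longest match and, between matches of equal length, the rule that appears first, so it should only fire for a single character that no other rule accepts:

```erlang
%% Sketch only: keep this as the last rule so every other rule takes priority.
. : {error, "Illegal character: " ++ TokenChars}.
```

With either rule in place, scanner:string("aa = !") should come back as an {error, ErrorInfo, EndLine} tuple rather than a token list, so the caller can pattern match on it.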
Still, it should be documented that Leex *can* produce a non-terminating scanner. Nothing says that it shouldn't, but whereas the Yecc documentation explains enough about the implementation (LALR-1) to help you understand how it works, nothing in the Leex docs suggests that I should *beware* of this. Luckily I decided to randomly generate input strings using PropEr and quickly came across the issue in my implementation.
What do others think about this? Should there be a caution in the documentation explaining this potential situation?
On 30 Jun 2012, at 12:31, Tim Watson wrote:
> Hi all,
>
> I've got a simple Leex scanner, which appears to go into a non-terminating state for certain inputs, consuming 100% CPU and quickly eating up all available memory. I found this *very* surprising - should the generated scanner really be able to get itself into this state? Is there some way for me to provide a default rule that will execute if no other regex matches, so I can return {error, Reason} for this???
>
> Below is a copy of the xrl file - running scanner:string("aa = !") will cause it to hang. Am I missing some obvious way of preventing this?
>
> Definitions.
> COMMA = [,]
> PARENS = [\(\)]
> L = [A-Za-z_\$]
> D = [0-9-]
> F = (\+|-)?[0-9]+\.[0-9]+((E|e)(\+|-)?[0-9]+)?
> HEX = 0x[0-9]+
> WS = ([\000-\s]|%.*)
> S = ({COMMA}|{PARENS})
> CMP = (=|>|>=|<|<=|<>)
> AOP = (\\+|-|\\*|/)
>
> Rules.
>
> LIKE : {token, {op_like, TokenLine, like}}.
> IN : {token, {op_in, TokenLine, in}}.
> AND : {token, {op_and, TokenLine, conjunction}}.
> OR : {token, {op_or, TokenLine, disjunction}}.
> NOT : {token, {op_not, TokenLine, negation}}.
> IS{WS}NULL : {token, {op_null, TokenLine, is_null}}.
> IS{WS}NOT{WS}NULL : {token, {op_null, TokenLine, not_null}}.
> BETWEEN : {token, {op_between, TokenLine, range}}.
> ESCAPE : {token, {escape, TokenLine, escape}}.
> {CMP} : {token, {op_cmp, TokenLine, atomize(TokenChars)}}.
> {AOP} : {token, {op_arith, TokenLine, atomize(TokenChars)}}.
> {L}({L}|{D})* : {token, {ident, TokenLine, TokenChars}}.
> '([^''])*' : {token, {lit_string, TokenLine, strip(TokenChars,TokenLen)}}.
> {S} : {token, {list_to_atom(TokenChars),TokenLine}}.
> {D}+ : {token, {lit_int, TokenLine, list_to_integer(TokenChars)}}.
> {F} : {token, {lit_flt, TokenLine, list_to_float(TokenChars)}}.
> {HEX} : {token, {lit_hex, TokenLine, hex_to_int(TokenChars)}}.
> {WS}+ : skip_token.
>
> Erlang code.
>
> strip(TokenChars,TokenLen) ->
> lists:sublist(TokenChars, 2, TokenLen - 2).
>
> hex_to_int([_,_|R]) ->
> {ok,[Int],[]} = io_lib:fread("~16u", R),
> Int.
>
> atomize(TokenChars) ->
> list_to_atom(TokenChars).
>