ANN: MicroLex - simple lexical scanner

Mon Sep 2 22:02:47 CEST 2002

Good day,

MicroLex is a simple DFA-based lexical scanner. I decided to post it
to the list in the hope of getting suggestions and comments. If anybody
finds it useful, I'm ready to post it to the Erlang user contributions list.

MicroLex supports most of the frequently used lex regexps, the predictive
operator, and longest-match (default) and shortest-match regexps.

* Grammar

A MicroLex grammar is a list of rules. Order is significant: if the input
matches several rules, the first one in the list is chosen.
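
To make the ordering rule concrete, here is a minimal sketch of "first rule in the list wins". It does not use MicroLex itself: the rules are hypothetical {Class, Pattern} pairs and matching is done with the standard `re` module, purely as an assumption to keep the example self-contained.

```erlang
-module(rule_order_sketch).
-export([classify/2, demo/0]).

%% Rules are {Class, Pattern} pairs. Pattern is an ordinary `re` pattern,
%% NOT a MicroLex regexp (assumption made for this sketch only).
%% Returns the class of the first rule, in list order, matching the whole string.
classify(String, Rules) ->
    Matching = [Class || {Class, Pattern} <- Rules, full_match(String, Pattern)],
    case Matching of
        [First | _] -> {ok, First};
        []          -> nomatch
    end.

%% True when Pattern matches String from start to end.
full_match(String, Pattern) ->
    re:run(String, "^(?:" ++ Pattern ++ ")$", [{capture, none}]) =:= match.

%% "if" matches both rules, so the earlier one (keyword) is chosen;
%% "iffy" matches only the ident rule.
demo() ->
    Rules = [{keyword, "if|else"}, {ident, "[a-z]+"}],
    {classify("if", Rules), classify("iffy", Rules)}.
```

Swapping the two rules in the list would make "if" come out as ident, which is why more specific rules should be listed first.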

* Rules

Rules have three forms:

** {Class, Regexp, FormatFun}

   Class - token class
   Regexp - regular expression
   FormatFun = Fun(Class, Line, String)
       Line - current line in the input stream
       String - the longest string matching Regexp
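
As an illustration of the FormatFun signature above, here is a small sketch of two format functions. The module and function names are hypothetical, and it assumes String arrives as a plain Erlang string (a list of characters). The result is the three-element {Class, Line, Value} tuple that yecc-generated parsers accept as a token.

```erlang
-module(format_fun_sketch).
-export([token/3, int_token/3]).

%% Keep the matched text as the token value.
token(Class, Line, String) ->
    {Class, Line, String}.

%% Convert the matched text to an integer; an optional leading
%% "+" or "-" is accepted by list_to_integer/1.
int_token(Class, Line, String) ->
    {Class, Line, list_to_integer(String)}.
```

In a rule such a function would be passed as, for example, fun format_fun_sketch:int_token/3 in the FormatFun position.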

** {Class, Regexp, FormatFun, short}

The same as above, but the shortest string matching Regexp is chosen.

** {Class, Regexp1, '/', Regexp2, FormatFun}

Predictive operator. The input must match Regexp1 followed by Regexp2, but
only the part matching Regexp1 is taken as the token string, and the buffer
position points to the next character after it.
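
The effect of the predictive operator can be approximated with a lookahead in the standard `re` module. The sketch below is only an analogy under that assumption (it is not how MicroLex implements it, and the patterns are `re` patterns, not MicroLex regexps): the token is the text matching Regexp1, and the text matching Regexp2 stays in the remaining input.

```erlang
-module(lookahead_sketch).
-export([token/3, demo/0]).

%% Match Regexp1 at the start of Input only if Regexp2 follows it.
%% Returns {Token, Rest}; the part matched by Regexp2 is left in Rest.
token(Input, Regexp1, Regexp2) ->
    Pattern = "^(" ++ Regexp1 ++ ")(?=" ++ Regexp2 ++ ")",
    case re:run(Input, Pattern, [{capture, [1], list}]) of
        {match, [Token]} ->
            {Token, lists:nthtail(length(Token), Input)};
        nomatch ->
            nomatch
    end.

%% "if" is a token only when an opening parenthesis follows it.
demo() ->
    {token("if(x)", "if", "\\("), token("iffy", "if", "\\(")}.
```

Here demo/0 yields {"if", "(x)"} for the first input, so scanning resumes at the parenthesis, and nomatch for the second.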

* Grammar Example

The following simple grammar recognizes integers and floats:

%% Grammar

grammar() ->
 [{ws,             ws(),             ?skip},
  {float_num,      float_num(),      fun yeec_token/3},
  {integer_num,    integer_num(),    fun yeec_token/3}].

ws() ->	ci(" \t\f").

%%Float
%%(+|-)?[0-9]+\.[0-9]+((E|e)(+|-)?[0-9]+)?
float_num() ->
  '@'([integer_num(), c($.),'+'(digit()),'?'('@'([ci("Ee"), integer_num()]))]).

integer_num() ->
  '@'(['?'(ci("+-")), '+'(digit())]).

digit() ->
  ci($0, $9).

%% End of Grammar

* Regexps

MicroLex regexp		Lex analog

'@'([R1, R2, ...])	R1R2...

'|'([R1, R2, ...])	R1|R2|...

'*'(R)			R*

'+'(R)			R+

'?'(R)			R?

'.'()			.

sol(R)			^R

eol(R)			R$

btw(From, To, R)	R{From, To}

c($a)			a

nc($a)			[^a]

ci($a, $z)		[a-z]

ci("abc")		[abc]

cni($a, $z)		[^a-z]

cni("abc")		[^abc]

str("abba")		abba
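
One way to read the table is to treat each constructor as a function that prints its lex analog. The sketch below does exactly that for a few of them. It is a toy re-implementation written for this illustration (the real MicroLex constructors build regexp terms, not strings), and c/1 emits a quoted character so that "." stays literal in lex syntax.

```erlang
-module(lex_analog_sketch).
-export(['@'/1, '|'/1, '*'/1, '+'/1, '?'/1, c/1, ci/1, ci/2,
         digit/0, integer_num/0, float_num/0]).

%% Toy constructors: each returns the lex-syntax string from the table.
'@'(Rs) -> lists:append(Rs).                                 % R1R2...
'|'(Rs) -> "(" ++ lists:append(lists:join("|", Rs)) ++ ")".  % R1|R2|...
'*'(R)  -> "(" ++ R ++ ")*".
'+'(R)  -> "(" ++ R ++ ")+".
'?'(R)  -> "(" ++ R ++ ")?".
c(C)    -> [$", C, $"].                                      % quoted literal
ci(Chars) -> "[" ++ Chars ++ "]".
ci(From, To) -> [$[, From, $-, To, $]].

%% The definitions from the grammar example, unchanged.
digit() ->
    ci($0, $9).

integer_num() ->
    '@'(['?'(ci("+-")), '+'(digit())]).

float_num() ->
    '@'([integer_num(), c($.), '+'(digit()),
         '?'('@'([ci("Ee"), integer_num()]))]).
```

With these definitions integer_num() evaluates to "([+-])?([0-9])+", which shows how nested constructor calls compose into a single expression.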

* Scanner

The scanner can be used in batch mode, where the whole input buffer is
processed and a list of tokens is returned, or in continuation style.

If a rule's format function returns an empty list, the token is not
included in the output.
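
The empty-list convention can be sketched as follows. This is not the MicroLex scanner loop: the input is a hypothetical list of already matched {FormatFun, Class, Line, String} tuples, used only to show how results equal to [] drop out of the token list.

```erlang
-module(skip_sketch).
-export([format_all/1, demo/0]).

%% Apply each format function and keep every result except [].
format_all(Matches) ->
    [Tok || {Fun, Class, Line, String} <- Matches,
            Tok <- [Fun(Class, Line, String)],
            Tok =/= []].

%% Whitespace is formatted with a fun returning [], so it disappears.
demo() ->
    Token = fun(Class, Line, String) -> {Class, Line, String} end,
    Skip  = fun(_Class, _Line, _String) -> [] end,
    format_all([{Token, integer_num, 1, "1"},
                {Skip,  ws,          1, " "},
                {Token, integer_num, 1, "2"}]).
```

This is presumably the role of the ?skip macro in the grammar example: the ws rule matches input but produces no token.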

** Errors

On a syntax error the scanner returns the tuple {error, Error}, where
Error = {scan, LineNum, Char, Expect, Str}
    LineNum - line number
    Char - first unmatched character
    Expect - character classes the grammar expects at that point
    Str - last recognized characters

The error can be formatted into a user-friendly string with format_error/1.
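
A caller can also pattern-match on the error tuple directly. The sketch below is a hypothetical stand-in, not the package's format_error/1. It assumes Char is a character code and Str is a string, and it ignores Expect because its exact shape is not described here.

```erlang
-module(scan_error_sketch).
-export([describe/1]).

%% Turn a scan error into a one-line message; pass anything else through.
describe({error, {scan, LineNum, Char, _Expect, Str}}) ->
    lists:flatten(
      io_lib:format("line ~w: unexpected '~c' after \"~s\"",
                    [LineNum, Char, Str]));
describe(Other) ->
    Other.
```

For example, describe({error, {scan, 3, $#, [], "12"}}) gives the string: line 3: unexpected '#' after "12".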

* Examples

There are two example grammars:

mlex_asn1.erl - subset of the ASN.1 grammar
mlex_freeradius_conf.erl - C-shell-like configuration file grammar of
the FreeRadius package

Test files are in the 'priv/test' directory.

See also mlex_test.erl.

Best Regards,
Vladimir Sekissov
-------------- next part --------------
A non-text attachment was scrubbed...
Name: mlex.tar.gz
Type: application/octet-stream
Size: 15182 bytes
Desc: not available
URL: <http://erlang.org/pipermail/erlang-questions/attachments/20020903/4e961eb7/attachment.obj>