Module mlex

Description

MicroLex is a simple DFA-based lexical scanner. It supports most of the commonly used lex regular expressions, the predictive operator, and both longest-match (default) and shortest-match rules. It works with Unix, DOS, and Mac files.

Grammar

A MicroLex grammar is a list of rules. Order is significant: if the input matches several rules, the first one in the list is chosen.
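As a sketch of why rule order matters, the list below places a keyword rule ahead of a general identifier rule, so that the input "if", which both rules match, is reported as a keyword. The module name, the token classes and the token/3 helper are hypothetical; the rules use the three-element form described under Rules, and the example assumes the mlex module is on the code path.

```erlang
-module(order_demo).
-export([rules/0, token/3]).

%% Hypothetical format function: wrap the match in a {Class, Line, String} tuple.
token(Class, Line, String) ->
    {Class, Line, String}.

%% The input "if" matches both rules. The first rule in the list is chosen,
%% so the keyword rule has to come before the identifier rule.
rules() ->
    [{'if',  mlex:str("if"),            fun token/3},
     {ident, mlex:'+'(mlex:ci($a, $z)), fun token/3}].
```

Swapping the two rules would make the identifier rule shadow the keyword rule for the input "if".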

Rules

Rules have three forms:

{Class, Regexp, FormatFun}
    The longest string that matches Regexp is chosen as the token string.
{Class, Regexp, FormatFun, short}
    The same as above, but the shortest string that matches Regexp is chosen.
{Class, Regexp1, '/', Regexp2, FormatFun}
    Predictive operator. The input must match Regexp1 followed by Regexp2, but only the part that matches Regexp1 is chosen as the token string, and the buffer position points to the character right after it.

The elements are:

Class
    token class
Regexp
    regular expression
FormatFun
    Fun(Class, Line, String) -> {error, Error} | Token
Line
    current line in the input stream
String
    string that matched Regexp
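The three rule forms and a format function can be sketched as follows. This is an illustration under assumptions, not code from MicroLex: the module name, the token classes, the shape of the tokens and the {bad_integer, ...} error term are all hypothetical, and only the mlex regexp constructors listed in this document are used.

```erlang
-module(rules_demo).
-export([rules/0, int_token/3, token/3]).

%% Format function with the documented shape
%% Fun(Class, Line, String) -> {error, Error} | Token.
%% Converts the matched digits into an integer token.
int_token(Class, Line, String) ->
    try list_to_integer(String) of
        N -> {Class, Line, N}
    catch
        %% Hypothetical error term; the document does not prescribe its shape.
        error:badarg -> {error, {bad_integer, Line, String}}
    end.

%% Plain format function: keep the matched string as it is.
token(Class, Line, String) ->
    {Class, Line, String}.

ident() ->
    mlex:'+'(mlex:ci($a, $z)).

%% "/*", then any characters (including newlines), then "*/".
comment() ->
    Any = mlex:'|'([mlex:'.'(), mlex:c($\n)]),
    mlex:'@'([mlex:str("/*"), mlex:'*'(Any), mlex:str("*/")]).

rules() ->
    [%% Shortest match: the comment ends at the first "*/".
     {comment, comment(), fun token/3, short},
     %% Predictive operator: an identifier followed by "(";
     %% only the identifier becomes the token string.
     {call, ident(), '/', mlex:str("("), fun token/3},
     %% Longest match (default).
     {integer, mlex:'+'(mlex:ci($0, $9)), fun int_token/3},
     {ident, ident(), fun token/3}].
```

The call rule is listed before the ident rule because both can match the same identifier and the first matching rule in the list is chosen.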

Grammar Example

The following simple grammar recognizes integers and floats:

  %% Grammar
 
  grammar() ->
   [{ws,             ws(),             ?skip},
    {float_num,      float_num(),      fun yeec_token/3},
    {integer_num,    integer_num(),    fun yeec_token/3}].
 
  ws() -> ci(" \t\f").
 
  %%Float
  %%(+|-)?[0-9]+\.[0-9]+((E|e)(+|-)?[0-9]+)?
  float_num() ->
    '@'([integer_num(), c($.),'+'(digit()),'?'('@'([ci("Ee"), integer_num()]))]).
 
  integer_num() ->
    '@'(['?'(ci("+-")), '+'(digit())]).
 
  digit() ->
    ci($0, $9).
 
  %% End of Grammar
  

Regexps

MicroLex regexp        Lex analog
'@'([R1, R2, ...])     R1R2...
'|'([R1, R2, ...])     R1|R2|...
'*'(R)                 R*
'+'(R)                 R+
'?'(R)                 R?
'.'()                  .
sol(R)                 ^R
eol(R)                 R$
btw(From, To, R)       R{From,To}
c($a)                  a
nc($a)                 [^a]
ci($a, $z)             [a-z]
ci("abc")              [abc]
cni($a, $z)            [^a-z]
cni("abc")             [^abc]
str("abba")            abba
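To show how the constructors compose, here is a hedged sketch that translates three familiar lex patterns into MicroLex regexps. The module and function names are hypothetical, and the example assumes the mlex module is on the code path; each comment gives the lex pattern the function is meant to correspond to.

```erlang
-module(regexps_demo).
-export([identifier/0, hex_number/0, short_number/0]).

%% lex: [a-zA-Z_][a-zA-Z0-9_]*
identifier() ->
    Letter = mlex:'|'([mlex:ci($a, $z), mlex:ci($A, $Z), mlex:c($_)]),
    mlex:'@'([Letter, mlex:'*'(mlex:'|'([Letter, mlex:ci($0, $9)]))]).

%% lex: 0[xX][0-9a-fA-F]+
hex_number() ->
    HexDigit = mlex:'|'([mlex:ci($0, $9), mlex:ci($a, $f), mlex:ci($A, $F)]),
    mlex:'@'([mlex:c($0), mlex:ci("xX"), mlex:'+'(HexDigit)]).

%% lex: [0-9]{2,4}
short_number() ->
    mlex:btw(2, 4, mlex:ci($0, $9)).
```

Because ci/2 takes a single range, a class such as [a-zA-Z_] is built here as an alternation of ranges with '|'/1.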

Scanner

The scanner output is a list of tokens. The list ends with a user-defined end token, or with '$end' for yecc compatibility.

The scanner can be used in batch mode, where the whole input buffer is processed and a list of tokens is returned, or in continuation style.

If a rule's format function returns a list, that list is appended to the output list. Any other result is added to the output list as a single element. A format function can return the empty list [] if you don't want the rule's result to appear in the output.
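The handling of format function results can be modelled in a few lines. The function below is only a model of the rule stated above, with a hypothetical module and function name; it is not taken from the MicroLex source.

```erlang
-module(output_model).
-export([add_result/2]).

%% A list result is appended to the output list;
%% any other result is added as a single element.
add_result(Result, Output) when is_list(Result) ->
    Output ++ Result;
add_result(Result, Output) ->
    Output ++ [Result].
```

Under this model, returning [] leaves the output unchanged, which is how a rule such as whitespace can be kept out of the token list.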

Errors

On a syntax error the scanner returns the tuple {error, Error}.

Error
scanError()

Function Index

Exported Functions
'*'/1             Match zero or more appearances of regexp.
'+'/1             Match one or more appearances of regexp.
'.'/0             Match any character excluding new line.
'?'/1             Match one or zero appearances of regexp.
'@'/1             Regexps concatenation.
btw/3             Match from From to To appearances of regexp.
c/1               Match character C.
ci/1              Match any character in list.
ci/2              Match any character in range From-To.
cni/1             Match any character excluding chars in list.
cni/2             Match any character excluding chars in range From-To.
eol/1             Match regexp at the end of line.
format_error/1
grammar/1         The same as grammar/2, but uses the default terminating token '$end'.
grammar/2         Compile a list of Rules into the internal grammar representation.
match/1
match/2
nc/1              Match any character excluding C.
nmatch/1
nmatch/2
scan/3            Scans the whole Buffer and returns a list of tokens or an error.
scan_token/2      The same as scan_token/3, but uses the buffer returned by scan_token/2 or scan_token/3.
scan_token/3      Scans Buffer for the first recognized token. Must be called first.
sol/1             Match regexp at the start of line.
str/1             Match string.
'|'/1             Match any regexp from list.

Data Types

scanError(LineNum, Char, Expect, Str) = {scan, LineNum, Char, Expect, Str}

LineNum
line number
Char
first unmatched character
Expect
character classes the grammar expects at that point
Str
last recognized characters

It can be formatted into a user-friendly string with format_error/1.
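To make the fields of the error tuple concrete, the sketch below pattern-matches on {scan, LineNum, Char, Expect, Str} and builds a message by hand. It is a stand-in for illustration only: the module name, the function name and the message wording are hypothetical, the exact representation of Expect is not specified here (so it is printed with ~p), and real code would normally call format_error/1 instead.

```erlang
-module(scan_error_demo).
-export([describe/1]).

%% Pull the documented fields out of a scanError() tuple.
describe({scan, LineNum, Char, Expect, Str}) ->
    lists:flatten(
      io_lib:format("line ~p: unexpected ~p after ~p, expected ~p",
                    [LineNum, Char, Str, Expect]));
%% Anything else, for example an error term returned by a format function.
describe(Other) ->
    lists:flatten(io_lib:format("unknown error: ~p", [Other])).
```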

Exported Functions

'*'/1

'*'(Node::function()) -> function()

Match zero or more appearances of regexp

'+'/1

'+'(Node::function()) -> function()

Match one or more appearances of regexp

'.'/0

'.'() -> function()

Match any character excluding new line.

'?'/1

'?'(Node::function()) -> function()

Match one or zero appearances of regexp

'@'/1

'@'(Nodes::list()) -> function()

Regexps concatenation

btw/3

btw(From, To, Node::function()) -> function()

Match from From to To appearances of regexp

c/1

c(C::char()) -> function()

Match character C

ci/1

ci(Str::string()) -> function()

Match any character in list

ci/2

ci(From::char(), To::char()) -> function()

Match any character in range From-To

cni/1

cni(Str::string()) -> function()

Match any character excluding chars in list

cni/2

cni(From::char(), To::char()) -> function()

Match any character excluding chars in range From-To

eol/1

eol(Node::function()) -> function()

Match regexp at the end of line

format_error/1

format_error(Arg1) -> term()

grammar/1

grammar(Rules::list()) -> grammar()

The same as grammar/2, but uses the default terminating token '$end'.

See also: grammar/2.

grammar/2

grammar(Rules::list(), EndToken::term()) -> grammar()

Compiles a list of Rules into the internal grammar representation.

match/1

match(Arg1) -> term()

match/2

match(Arg1, Arg2) -> term()

nc/1

nc(C::char()) -> function()

Match any character excluding C

nmatch/1

nmatch(Arg1) -> term()

nmatch/2

nmatch(Arg1, Arg2) -> term()

scan/3

scan(ModBuffer::atom(), Buffer::term(), Grammar::grammar()) -> TokenList | {error, Error}

Scans the whole Buffer and returns a list of tokens or an error. ModBuffer is the name of a buffer module that operates on Buffer; see mlex_str_buf.erl for an example buffer module.

See also: grammar/1, grammar/2.
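A batch-mode call might be wrapped as in the sketch below, which compiles a rule list with grammar/1 and tags the two documented outcomes of scan/3. The module and function names are hypothetical. ModBuffer and Buffer are passed through unchanged because this document does not describe how a buffer is constructed; the example assumes the mlex module is on the code path.

```erlang
-module(batch_demo).
-export([tokens/3]).

%% Compile Rules with the default '$end' terminator and scan the whole buffer.
tokens(ModBuffer, Buffer, Rules) ->
    Grammar = mlex:grammar(Rules),
    case mlex:scan(ModBuffer, Buffer, Grammar) of
        {error, Error}                -> {error, Error};
        Tokens when is_list(Tokens)   -> {ok, Tokens}
    end.
```

To end the token list with a different terminator, the same wrapper could call mlex:grammar(Rules, EndToken) instead of mlex:grammar(Rules).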

scan_token/2

scan_token(Buf::term(), Grammar::grammar()) -> {eof, Cont} | {ok, Cont} | {error, Error}

The same as scan_token/3, but uses the buffer returned by a previous call to scan_token/2 or scan_token/3.

See also: scan/3, scan_token/3.

scan_token/3

scan_token(ModBuffer::atom(), Buffer::term(), Grammar::grammar()) -> {eof, Cont} | {ok, Cont} | {error, Error}

Scans Buffer for the first recognized token. Must be called first; subsequent calls must use scan_token/2.

See also: scan/3, scan_token/2.
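The calling sequence for continuation-style scanning (scan_token/3 first, then scan_token/2 until the end of input) can be sketched as below. The sketch assumes that the Cont value returned by one call is what the next scan_token/2 call expects as its Buf argument. How the recognized token is carried inside Cont is not described here, so the loop only counts successful steps. The module and function names are hypothetical.

```erlang
-module(cont_demo).
-export([drain/3, loop/3]).

%% Start with scan_token/3, then continue with scan_token/2.
drain(ModBuffer, Buffer, Grammar) ->
    Next = fun(Cont) -> mlex:scan_token(Cont, Grammar) end,
    loop(mlex:scan_token(ModBuffer, Buffer, Grammar), Next, 0).

%% Next is a fun that takes the previous Cont and returns the next result.
loop({ok, Cont}, Next, N) ->
    loop(Next(Cont), Next, N + 1);
loop({eof, _Cont}, _Next, N) ->
    {eof, N};
loop({error, Error}, _Next, _N) ->
    {error, Error}.
```

Passing the step as a fun keeps loop/3 independent of mlex, so the control flow can be exercised with a stub in place of the scanner.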

sol/1

sol(Node::function()) -> function()

Match regexp at the start of line

str/1

str(Str::string()) -> function()

Match string

'|'/1

'|'(Nodes::list()) -> function()

Match any regexp from list