[erlang-questions] extending the scanner and parser

Fri Oct 3 13:13:04 CEST 2008

Hi,

Tools that process Erlang source code need more detailed information
about the lexical tokens than the scanner in the standard distribution
provides.  For example, ErlIDE needs the character offset of each
token in the file and also the token's textual representation.  As an
optimization, it also stores the length of that text.  Wrangler uses
column information.

These tools also need to be able to reconstruct the source code from
the token list.  This requires that the scanner can be asked to return
whitespace and comments as well.  The exact textual representation of
each token must also be available: for example, an integer with the
value 42 may have been written as "42" or as "16#2A" in the source
file.

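To make the requirements above a bit more concrete, here is a rough
sketch of what such tokens could look like.  The tuple shape and the
attribute keys (offset, length, text) are only assumptions for the
sake of illustration; they are not what erl_scan returns, and the
module and function names are made up.

```erlang
-module(token_sketch).
-export([example_tokens/0, source_text/1, token_at/2]).

%% Hypothetical token shape: {Category, Attrs, Value}, where Attrs is a
%% property list.  The keys offset, length and text are assumptions made
%% for this sketch; the standard erl_scan does not return them.

%% Hand-written tokens for the source text "X = 16#2A. % answer\n".
example_tokens() ->
    [{var,         [{offset, 0},  {length, 1}, {text, "X"}],        'X'},
     {white_space, [{offset, 1},  {length, 1}, {text, " "}],        " "},
     {'=',         [{offset, 2},  {length, 1}, {text, "="}],        '='},
     {white_space, [{offset, 3},  {length, 1}, {text, " "}],        " "},
     {integer,     [{offset, 4},  {length, 5}, {text, "16#2A"}],    42},
     {dot,         [{offset, 9},  {length, 1}, {text, "."}],        '.'},
     {white_space, [{offset, 10}, {length, 1}, {text, " "}],        " "},
     {comment,     [{offset, 11}, {length, 8}, {text, "% answer"}], "% answer"},
     {white_space, [{offset, 19}, {length, 1}, {text, "\n"}],       "\n"}].

%% Rebuild the source by concatenating the text of every token.
%% This only works because whitespace and comments are tokens too.
source_text(Tokens) ->
    lists:append([proplists:get_value(text, Attrs) || {_, Attrs, _} <- Tokens]).

%% Find the token covering a character offset (e.g. the editor caret).
token_at(Offset, Tokens) ->
    Hits = [Tok || {_, Attrs, _} = Tok <- Tokens, covers(Offset, Attrs)],
    case Hits of
        [First | _] -> {ok, First};
        [] -> none
    end.

%% True if Offset lies inside the half-open range [Start, Start + Len).
covers(Offset, Attrs) ->
    Start = proplists:get_value(offset, Attrs),
    Len = proplists:get_value(length, Attrs),
    Offset >= Start andalso Offset < Start + Len.
```

With these tokens, token_sketch:source_text/1 gives back the original
text, including "16#2A", although the value of the integer token is
42, and token_sketch:token_at(6, Tokens) finds that integer token.
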
In a similar way, tools may need to know, for each construct in the
parse tree, where it starts and where it ends in the source, so the
tree needs to keep track of the underlying tokens.
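
As an illustration only, a parse tree node could refer to its first
and last token, and the source range would follow from them.  The
node shape below is hypothetical (it is not what erl_parse produces),
and the tokens are assumed to carry offset and length attributes as
described above.

```erlang
-module(span_sketch).
-export([span/1, slice/2, example/0]).

%% Hypothetical node shape: {node, Type, FirstToken, LastToken, Children}.
%% Tokens are {Category, Attrs, Value} with offset and length keys in
%% Attrs.  Both shapes are assumptions made for this sketch.

%% Source range of a node as {Start, End}, with End exclusive.
span({node, _Type, FirstToken, LastToken, _Children}) ->
    {_, FirstAttrs, _} = FirstToken,
    {_, LastAttrs, _} = LastToken,
    Start = proplists:get_value(offset, FirstAttrs),
    End = proplists:get_value(offset, LastAttrs)
        + proplists:get_value(length, LastAttrs),
    {Start, End}.

%% Cut the text of a range out of the source string.
%% Offsets are 0-based, lists:sublist/3 is 1-based.
slice(Source, {Start, End}) ->
    lists:sublist(Source, Start + 1, End - Start).

%% A match expression spanning from the variable to the integer.
%% The child list is left empty to keep the example short.
example() ->
    Source = "X = 16#2A. % answer\n",
    Var = {var, [{offset, 0}, {length, 1}], 'X'},
    Int = {integer, [{offset, 4}, {length, 5}], 42},
    Match = {node, match, Var, Int, []},
    slice(Source, span(Match)).
```

Here span_sketch:example() returns "X = 16#2A", i.e. the text of the
match expression exactly as it was written.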

The way these tools get the extra information is by making a copy of
erl_scan and erl_parse and modifying them to suit their purposes.
This is a big maintenance headache and it limits interoperability
between these tools, but it is currently the only way to solve the
problem.

Therefore I would like to suggest extending the standard scanner and
parser to support such extra information.  If there is consensus that
this is a Good Thing, I will happily provide a reference
implementation in cooperation with any interested parties.

If there is not enough general interest, I would still like to call
out to all those who develop source-handling tools and see if we
could create a common library to share.

best regards,
Vlad