"Fast" text parsing

Richard Carlsson richardc@REDACTED
Mon Dec 10 11:29:50 CET 2001


On Mon, 10 Dec 2001, Shawn Pearce wrote:

> I have tried a naive approach in the past of loading the entire file into
> a list, one character per cell, and walking that list with a simple hand
> rolled lexer which created a list of tokens to be handed to the real
> grammar parser.  The lexer required 12 seconds to tokenize even the
> smallest document (<1000 characters), but my average document size is
> closer to 16,000 characters.  The C++ tool rips through 10 16k documents
> in less time than the Erlang lexer could get through a single 1,000
> character document.

I don't know what you're doing, but the normal Erlang tokeniser
(erl_scan) only needs about 100 ms (on my local crusty hardware) to
tokenise a 16 k char file, and it is written purely in Erlang. A big
source file of about 150 k char took 300 ms.

Reading the file as a binary and converting it to a string took 30-40 ms
for the larger file, and less than 10 ms for the smaller one.
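
For what it's worth, figures like these can be reproduced with something
like the following rough sketch (the module name and the file name are
just placeholders, and timer:tc/3 reports microseconds):

    %% scan_timing.erl -- time reading, converting and tokenising a file.
    -module(scan_timing).
    -export([run/1]).

    run(FileName) ->
        %% Read the whole file as a binary, then convert it to a string.
        {ReadUs, {ok, Bin}} = timer:tc(file, read_file, [FileName]),
        {ConvUs, String}    = timer:tc(erlang, binary_to_list, [Bin]),
        %% Tokenise the string with the standard scanner (pure Erlang).
        {ScanUs, {ok, Tokens, _End}} = timer:tc(erl_scan, string, [String]),
        io:format("read ~p ms, convert ~p ms, scan ~p ms (~p tokens)~n",
                  [ReadUs div 1000, ConvUs div 1000, ScanUs div 1000,
                   length(Tokens)]).

Calling scan_timing:run("big.erl") from the shell then prints the three
times in milliseconds; "big.erl" is only an example file name.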

The call `epp:parse_file/3', which reads, tokenises, preprocesses and
parses the code (and does not read the whole file as a binary), takes
250-300 ms for the smaller file and 650-700 ms for the larger.
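
The same kind of sketch works for the full parse; the two expressions
below can be pasted into the shell, the empty lists mean no extra
include directories and no predefined macros, and the file name is
again just an example:

    %% Time the whole read/tokenise/preprocess/parse pipeline.
    {ParseUs, {ok, Forms}} = timer:tc(epp, parse_file, ["big.erl", [], []]),
    io:format("epp:parse_file/3: ~p ms (~p forms)~n",
              [ParseUs div 1000, length(Forms)]).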

Of course, you can probably still get a tenfold speedup over this by
using a parser driver written in C, but don't get the idea that a
tokeniser/parser written in Erlang cannot be reasonably fast.

	/Richard


Richard Carlsson (richardc@REDACTED)   (This space intentionally left blank.)
E-mail: Richard.Carlsson@REDACTED	WWW: http://www.csd.uu.se/~richardc/



