"Fast" text parsing

Mon Dec 10 06:23:20 CET 2001

I have something of a problem, so I thought I would ask the list.
I want to do "fast" text parsing in Erlang.  I'm considering porting
an application from C++ to Erlang, but it does a considerable amount
of text processing.  Currently it contains two separate text parsers:
one parses HTML using a very simple mmap'd character buffer, which it
iterates over character by character; the other is a flex/bison
(lex/yacc) grammar, abstract syntax tree and the whole nine yards.

With this C++ tool, average parse times are currently under a second
for even the most complex documents.

My problem is, how do I construct Erlang parsers for both the HTML and
flex/bison components in such a way that I don't increase parse times
dramatically?

I have tried a naive approach in the past: loading the entire file into
a list, one character per cell, and walking that list with a simple
hand-rolled lexer that created a list of tokens to be handed to the real
grammar parser.  The lexer required 12 seconds to tokenize even the
smallest document (<1000 characters), but my average document size is
closer to 16,000 characters.  The C++ tool rips through ten 16k documents
in less time than the Erlang lexer could get through a single
1,000-character document.
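
For illustration, a list-walking lexer of the kind described above might
look like the sketch below.  It is not the original code: the module
name naive_lexer, the token shapes ({word, _}, {integer, _}, {punct, _})
and the character classes are all assumptions made for the example.

```erlang
-module(naive_lexer).
-export([tokenize_file/1, tokenize/1]).

%% Hypothetical sketch: read the whole file, turn it into a list with
%% one character per cell, and hand that list to the lexer.
tokenize_file(Path) ->
    {ok, Bin} = file:read_file(Path),
    tokenize(binary_to_list(Bin)).

%% Walk the character list and return the tokens in input order.
tokenize(Chars) ->
    tokenize(Chars, []).

%% End of input: tokens were prepended, so reverse them once.
tokenize([], Acc) ->
    lists:reverse(Acc);
%% Skip whitespace.
tokenize([C | Rest], Acc) when C =:= $\s; C =:= $\t; C =:= $\n; C =:= $\r ->
    tokenize(Rest, Acc);
%% A run of digits becomes one integer token.
tokenize([C | _] = Chars, Acc) when C >= $0, C =< $9 ->
    {Digits, Rest} = lists:splitwith(fun is_digit/1, Chars),
    tokenize(Rest, [{integer, list_to_integer(Digits)} | Acc]);
%% A letter followed by letters or digits becomes one word token.
tokenize([C | _] = Chars, Acc) when C >= $a, C =< $z; C >= $A, C =< $Z ->
    {Word, Rest} = lists:splitwith(fun is_word_char/1, Chars),
    tokenize(Rest, [{word, Word} | Acc]);
%% Anything else is a single-character punctuation token.
tokenize([C | Rest], Acc) ->
    tokenize(Rest, [{punct, C} | Acc]).

is_digit(C) ->
    C >= $0 andalso C =< $9.

is_word_char(C) ->
    (C >= $a andalso C =< $z) orelse
    (C >= $A andalso C =< $Z) orelse
    is_digit(C).
```

With this sketch, naive_lexer:tokenize("x = 42;") returns
[{word,"x"},{punct,$=},{integer,42},{punct,$;}].  Tokens are prepended
to an accumulator and reversed once at the end, rather than appended
with ++ on every step.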

Is the trick to use binaries?  Or is there no trick, and Erlang text
processing in general is just slower than what can be constructed in
lower-level languages such as C++?

I guess I'm really interested in Erlang for two reasons:  one, I can
quickly make the tool distributed and take advantage of spare CPU
cycles on other nodes to perform parsing; two, it has really nice
pattern matching on function heads, especially with records, which may
be helpful for working with the abstract syntax trees I need to deal with.

Does anyone have experience with building "fast" parsers?  I'd love to
hear some suggestions...

--
Shawn.

  ``If this had been a real
    life, you would have
    received instructions
    on where to go and what
    to do.''