[erlang-questions] Rant: I hate parsing XML with Erlang

Thu Oct 25 03:09:26 CEST 2007

Hi Bob,

Bob Ippolito wrote:
 > http://tidy.sourceforge.net/ is the typical library I've seen used to
 > transform arbitrary HTML into a valid document quickly and without
 > re-inventing the wheel. Much easier than trying to integrate with
 > Mozilla.

Tidy is good, maybe even the only workable solution, depending on your
needs.  It tries to convert malformed HTML into well-formed HTML that
you can parse with something that expects well-formed markup.  In
practice, it tends to be rather slow.  A few years ago I needed to
parse arbitrary HTML.  I didn't care about making it well-formed; I
just needed something that could start at the beginning and raise
SAX-like events as it encountered tags and text.  For example, the
following malformed document:

<html>
   <body>
     <p>I am an unclosed tag
   </body>
</html>

would result in:

start_tag: html
start_tag: body
start_tag: p
element_content: I am an unclosed tag
end_tag: body
end_tag: html

If this is good enough it might be worth looking at.  The code is part
of a C++ application framework I work on in my spare time.  You can
get the latest code at:

https://launchpad.net/framework

The relevant bits for HTML parsing are here:

http://codebrowse.launchpad.net/~jkakar/framework/57186-release-0.2/files/jkakar%40starla-20070107232638-khamuslq9vz15z6a?file_id=text-20060821192009-a8646b9047718e3d

I've started porting the HTML parsing code to Python, but it's not
ready yet.  Maybe using the C++ code would be helpful?  It has been
used in a production environment, where it works and performs fairly
well.
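
To give a rough idea of the event model in Python, here is a minimal
sketch built on the standard library's html.parser module.  It is not
the framework code or the port mentioned above; the EventCollector
class and the parse_events function are names made up for
illustration.

```python
from html.parser import HTMLParser


class EventCollector(HTMLParser):
    """Hypothetical helper: records SAX-like events without fixing the markup."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # Called for every opening tag, whether or not it is ever closed.
        self.events.append(("start_tag", tag))

    def handle_endtag(self, tag):
        # Called for every closing tag; no attempt is made to balance tags.
        self.events.append(("end_tag", tag))

    def handle_data(self, data):
        # Skip whitespace-only runs between tags (indentation, newlines).
        text = data.strip()
        if text:
            self.events.append(("element_content", text))


def parse_events(markup):
    """Return the list of (event, value) pairs seen in the markup, in order."""
    parser = EventCollector()
    parser.feed(markup)
    parser.close()
    return parser.events


if __name__ == "__main__":
    malformed = """<html>
   <body>
     <p>I am an unclosed tag
   </body>
</html>
"""
    for event, value in parse_events(malformed):
        print(f"{event}: {value}")
```

Fed the malformed document above, this reports the same six events,
because html.parser announces tags as it meets them and does not try
to insert the missing </p>.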

Thanks,
J.