[erlang-questions] Rant: I hate parsing XML with Erlang

Tue Oct 23 16:13:27 CEST 2007

I have written several XML parsers in various states of completeness,
and at the moment I'm trying to put together yet another XML toolkit
(I might actually release this one).

The problem with "parsing" XML is not so much the parsing as what you
want to do with the parse tree. Do you need validation? How are the
DTDs defined? And so on.

Here are some of the questions that occur to me in the design of an
XML parser.

1) Is the input small or large? Small means it will fit into memory.

    In the case of small input everything fits into memory - I can
    happily parse input streams of 200K lines.

    The only large XML files I've found are tens of gigabytes of
    data - is it important to be able to parse and validate these, or
    do you just want SAX-like processing?

2) Is the input stream "framed" - i.e. we have a framing protocol so we
    know we have an entire XML document - or do you want a re-entrant
    parser?

3) Do you need to handle streams of XML terms? This is often problematic
    since nobody can agree on the framing protocol - does each term begin
    with a new <?xml ...?> header?

4) Do you want validation? In which case how do you find the DTD/schema -
    do you have to comply with OASIS catalogues?
    Do you want DTDs, Schema, RNC (or an Erlang ad hoc grammar)?

    (My solution is to parameterize the parser with a fun F - F(URI) is
     a function that knows how to find the DTD in URI. The OASIS
     catalogue structure is not something that I want to be concerned
     with - most applications seem to totally ignore this.)

5) Do you want the parser to try to correct errors and recover, or
    bail out early?

6) Do you want Unicode support, or just ASCII?

7) Does the data come from files, sockets, binaries?

8) Do you want strict or lazy parsing? If you don't look at an attribute
    you might like to defer parsing the content until you actually
    need it.

9) Do you want to check all IDs and IDREFs and correctly handle
    NOTATIONs and so on, i.e. all the weird things in the XML spec
    that 99.9% of programmers have never used?

10) In the case of XML without a DTD do you want a heuristic to throw away
      non-significant white space? (I often use a simple heuristic: if
      all the PCDATA children of a tag are white space then throw all
      this white space away.)

...

It's very difficult to write a parser that correctly handles *all* of
these and is fast, small etc.
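
To illustrate question 4, here is a minimal sketch of handing a
parser a fun that knows how to find a DTD. The module name,
fetch_dtd/2, map_resolver/1, file_resolver/1 and the
{ok, Binary} | {error, Reason} return convention are all assumptions
made up for this example; they are not taken from any existing
toolkit.

```erlang
-module(dtd_lookup_sketch).
-export([fetch_dtd/2, map_resolver/1, file_resolver/1]).

%% Ask the caller-supplied fun for the DTD named by URI.
%% Assumed contract: Resolver(URI) -> {ok, binary()} | {error, Reason}.
fetch_dtd(Resolver, URI) when is_function(Resolver, 1) ->
    case Resolver(URI) of
        {ok, Bin} when is_binary(Bin) -> {ok, Bin};
        {error, Reason}               -> {error, {dtd_not_found, URI, Reason}}
    end.

%% Resolver backed by an in-memory map of URI => DTD text.
map_resolver(Table) when is_map(Table) ->
    fun(URI) ->
        case maps:find(URI, Table) of
            {ok, Bin} -> {ok, Bin};
            error     -> {error, unknown_uri}
        end
    end.

%% Resolver that ignores the host and path part of the URI and looks
%% for a file with the same base name in a local directory.
file_resolver(Dir) ->
    fun(URI) ->
        file:read_file(filename:join(Dir, filename:basename(URI)))
    end.
```

The point of the design is that the parser only ever calls
Resolver(URI); whether that consults a catalogue, a directory or a
hard-wired table is the caller's business.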

I have made a set of compromises and built a toolkit that provides the
following.

   1) A tag-level interface to the system. This has a file-type
      interface.

        Pid = open_xml_token_stream(Descriptor)

      makes a re-entrant token scanner

        get_next_token(Pid) -> Token | eof

      A lot of things can be done with this alone, for example a
      SAX-like processor.

   2) A simple parser. This takes a token stream and parses it, just
      checking for well-formedness.

   3) A validator that runs on the output of 2) - this only
      understands DTDs (not schemas or RNC).

    There are also diverse routines to parse files etc, based on these.
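
    As a rough illustration of the SAX-like use of a token stream,
    here is a minimal sketch. Everything in it is an assumption: the
    token shapes ({start_tag,Name,Attrs}, {end_tag,Name}, {text,Bin})
    are invented, and open_token_stream/1 is a stand-in that serves a
    prepared list of tokens rather than scanning real XML. Only the
    shape of the get_next_token(Pid) -> Token | eof loop is the point.

```erlang
-module(token_stream_sketch).
-export([open_token_stream/1, get_next_token/1, count_tags/1]).

%% Stand-in scanner: a process that hands out a prepared token list,
%% one token per request, and answers eof once the list is used up.
open_token_stream(Tokens) when is_list(Tokens) ->
    spawn(fun() -> serve(Tokens) end).

serve(Tokens) ->
    receive
        {next, From, Ref} ->
            case Tokens of
                [T | Rest] -> From ! {Ref, T},   serve(Rest);
                []         -> From ! {Ref, eof}, serve([])
            end
    end.

%% Token | eof, as in the interface described above.
get_next_token(Pid) ->
    Ref = make_ref(),
    Pid ! {next, self(), Ref},
    receive
        {Ref, Token} -> Token
    end.

%% SAX-like processing: count start tags by name without ever
%% building a parse tree.
count_tags(Pid) ->
    count_tags(Pid, #{}).

count_tags(Pid, Acc) ->
    case get_next_token(Pid) of
        eof ->
            Acc;
        {start_tag, Name, _Attrs} ->
            Acc1 = maps:update_with(Name, fun(N) -> N + 1 end, 1, Acc),
            count_tags(Pid, Acc1);
        _Other ->
            count_tags(Pid, Acc)
    end.
```

    Because the consumer pulls one token at a time, memory use does
    not depend on the size of the input, which is what makes this
    style attractive for very large documents.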

    I've also written an XSLT type thing that takes the output of 2)
    or 3) and transforms it.

     This is just ASCII.

     I've talked to a lot of people about XML - most people want
     something to parse a small ASCII file containing a single XML
     data structure.

     The data structure is well formed - there is no DTD - and they
     don't care about integrity constraints on the attributes. They
     don't care about entity expansion, NOTATIONs, CDATA etc.

     The kind of Ruby code shown in an earlier posting is easy given a
     simple parse tree.

     My experimental parser turns an XML data structure into a

      @type xml() = {node,Line,Tag,attrs(),[xml()]}
                  | {raw,Ln,AllBlack:bool(),bin()}

     It's easy to write a fold function that applies a Fun to each node

%% Visit a node, then fold over its children.
fold_over_nodes(Fun, Env, {node, _, _, _, C} = Node) ->
    Env1 = Fun(Node, Env),
    fold_over_nodes(Fun, Env1, C);
%% Raw text has no children.
fold_over_nodes(Fun, Env, {raw,_,_,_} = Raw) ->
    Fun(Raw, Env);
%% Child lists are folded left to right.
fold_over_nodes(Fun, Env, [H|T]) ->
    Env1 = fold_over_nodes(Fun, Env, H),
    fold_over_nodes(Fun, Env1, T);
fold_over_nodes(_Fun, Env, []) ->
    Env.

   (like foldl) - this can be used to extract tags

     F = fun({node,_,Stag,_,_}=N, E) ->
                 case lists:member(Stag, [a,b,c]) of
                     true  -> [N|E];
                     false -> E
                 end;
            (_Raw, E) ->
                 %% raw text is passed over unchanged
                 E
         end,
     Tags = fold_over_nodes(F, [], Tree)
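
A self-contained version of the same idea, run against a small
hand-written tree, might look like the sketch below. The module name,
sample_tree/0, tags/2 and text/1 are invented for the example, and it
assumes that the boolean in a raw node is true when the text is all
white space.

```erlang
-module(fold_sketch).
-export([fold_over_nodes/3, sample_tree/0, tags/2, text/1]).

%% Pre-order fold over a tree of {node,...} and {raw,...} terms.
fold_over_nodes(Fun, Env, {node, _, _, _, C} = Node) ->
    fold_over_nodes(Fun, Fun(Node, Env), C);
fold_over_nodes(Fun, Env, {raw, _, _, _} = Raw) ->
    Fun(Raw, Env);
fold_over_nodes(Fun, Env, [H | T]) ->
    fold_over_nodes(Fun, fold_over_nodes(Fun, Env, H), T);
fold_over_nodes(_Fun, Env, []) ->
    Env.

%% Made-up tree for <doc><a id="1">hello</a> <d><b/></d></doc>.
%% Assumption: the boolean in a raw node means "all white space".
sample_tree() ->
    {node, 1, doc, [],
     [{node, 1, a, [{id, "1"}], [{raw, 1, false, <<"hello">>}]},
      {raw, 1, true, <<" ">>},
      {node, 1, d, [], [{node, 1, b, [], []}]}]}.

%% Names of the nodes whose tag is in Wanted, in document order.
tags(Wanted, Tree) ->
    F = fun({node, _, Tag, _, _}, Acc) ->
                case lists:member(Tag, Wanted) of
                    true  -> [Tag | Acc];
                    false -> Acc
                end;
           (_Raw, Acc) ->
                Acc
        end,
    lists:reverse(fold_over_nodes(F, [], Tree)).

%% All text in the tree, concatenated in document order.
text(Tree) ->
    F = fun({raw, _, _, Bin}, Acc) -> [Bin | Acc];
           (_Node, Acc)            -> Acc
        end,
    iolist_to_binary(lists:reverse(fold_over_nodes(F, [], Tree))).
```

With this tree, fold_sketch:tags([a,b,c], fold_sketch:sample_tree())
gives [a,b] and fold_sketch:text(fold_sketch:sample_tree()) gives
<<"hello ">>.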

For very simple applications I could put together a parser that does
the following:

   1) Small files only.
   2) no DTDs or grammar checking
   3) white space normalisation according to the following:
       if a tag has MIXED content (i.e. tags and PCDATA) and all the
       PCDATA bits are blank then remove all the PCDATA
   4) ASCII
   5) simple "obvious" parse tree {tag,Line,Name,[Attr], [Children]}
       Attr is a sorted [{Key,Val}] list
   6) Simple SAX library, foldtree functions, find object in tree (like
       XPath, only in Erlang)
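
The white space rule in 3) can be sketched as follows, on the simple
tree from 5). The module and function names are invented, and the
sketch assumes that PCDATA children appear in the child list as plain
binaries, which the description above does not specify.

```erlang
-module(ws_sketch).
-export([strip_blank/1]).

%% If a tag has both tag children and PCDATA children, and every
%% PCDATA child is blank, drop the PCDATA. Otherwise keep the children
%% as they are. Applied recursively to the whole tree.
strip_blank({tag, Line, Name, Attrs, Children}) ->
    {Texts, Tags} = lists:partition(fun is_binary/1, Children),
    Kept = case Tags =/= [] andalso lists:all(fun is_blank/1, Texts) of
               true  -> Tags;
               false -> Children
           end,
    {tag, Line, Name, Attrs, [strip_blank(C) || C <- Kept]};
strip_blank(Text) when is_binary(Text) ->
    Text.

%% ASCII only: blank means nothing but space, tab, CR and LF.
is_blank(Bin) ->
    lists:all(fun(C) -> lists:member(C, " \t\r\n") end,
              binary_to_list(Bin)).
```

Note that a tag containing only blank text, such as <b> </b>, is left
alone, since its content is not mixed.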

I have a 6-pack of parsers that almost do this - they are all
specialised in different ways for infinite streams and so on ...

I'm not sure if a general purpose toolkit (that allows you to build
the above) or a set of completely different parsers with different
properties is desirable.

/Joe Armstrong

On 10/23/07, Joel Reymont <joelr1@REDACTED> wrote:
>
>
> On Oct 23, 2007, at 2:02 PM, Vlad Dumitrescu wrote:
> > Do you try to scrape arbitrary HTML? I don't think a XML parser will
> > help that much in such a case, because HTML is only a distant cousin
> > of XML...
>
> Completely arbitrary HTML. Any web page out there. The syntax and
> approach won't be much different for HTML, assuming you had a robust
> parser. My rant is about the syntax.
>
> --
> http://wagerlabs.com