[erlang-questions] Rant: I hate parsing XML with Erlang

Tue Oct 23 22:45:59 CEST 2007

Joe Armstrong-2 wrote:
> 
> 3) Do you need to handle streams of xml terms. This is often problematic
>     since nobody can agree on the framing protocol - does each term begin
>     with a new <?xml ...?> header?
> 

Many applications use custom and/or non-conformant XML processing tools
that emit XML without the <?xml ...?> processing instruction, so it's better
to have a non-strict mode that accepts XML like this.
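
A minimal sketch of what such a non-strict front end could do, assuming the
input is already a binary: drop the leading header when it is there and pass
everything else through untouched. The module and function names are made up
for illustration and do not come from any existing parser.

```erlang
-module(xml_prolog).
-export([strip_decl/1]).

%% Hypothetical helper: remove a leading <?xml ...?> header if present,
%% return the input unchanged if it is absent.
%% The whitespace check keeps <?xml-stylesheet ...?> from being mistaken
%% for the header.
strip_decl(<<"<?xml", C, Rest/binary>> = Bin)
  when C =:= $\s; C =:= $\t; C =:= $\r; C =:= $\n ->
    case binary:split(Rest, <<"?>">>) of
        [_Header, Body] -> Body;
        [_NoClose]      -> Bin   % unterminated header: let the parser reject it
    end;
strip_decl(Bin) when is_binary(Bin) ->
    Bin.
```

The rest of the parser can then stay strict, since it never sees the header
either way.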

Joe Armstrong-2 wrote:
> 
> 4) Do you want validation. In which case how do you find the DTD/schema -
>     do you have to comply with OASIS catalogues?
>     Do you want DTDs, Schema, RNC, (or an Erlang ad hoc grammar)
> 
>     (My solution is to parameterize the parser with a fun F - F(URI) is a
>      function that knows how to find the DTD in URI - the OASIS catalogue
>      structure is not something that I want to be concerned with - most
>      applications seem to totally ignore this)
> 

From my experience, DTDs are largely ignored today; I can easily live with
an XML parser that ignores DTDs.
The standard XML schema language is W3C XSD, which is very useful,
especially for XML Data Binding and Web Services.
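
The "fun F(URI)" idea from the quoted text could look roughly like the sketch
below. The module name, the return shapes and the lookup table are all
assumptions made for illustration; nothing here is taken from Joe's actual
toolkit. A parser that takes such a fun never needs to know about OASIS
catalogues, and an application that ignores DTDs simply passes none().

```erlang
-module(dtd_lookup).
-export([from_table/1, none/0]).

%% Build a resolver fun from a list of {URI, LocalPath} pairs.
%% The fun returns {ok, Binary} with the DTD text, or {error, Reason}.
from_table(Table) when is_list(Table) ->
    fun(URI) ->
        case lists:keyfind(URI, 1, Table) of
            {URI, LocalPath} -> file:read_file(LocalPath);
            false            -> {error, {unknown_dtd, URI}}
        end
    end.

%% Resolver for applications that ignore DTDs: every URI maps to an
%% empty DTD, so nothing is ever fetched or validated.
none() ->
    fun(_URI) -> {ok, <<>>} end.
```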

Joe Armstrong-2 wrote:
> 
> 5) Do you want the parser to try and correct errors and recover, or
> bail out early.
> 

The XML parser should be a strict parser (the only exception, maybe, being
the leading <?xml ...?> PI).
The HTML and RDF/RSS parsers can be more forgiving, because of the legacy of
millions of bad HTML pages.
For HTML the solution is Tidy; maybe we need an Erlang port of Tidy, or we
can just call it via os:cmd.
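
A rough sketch of the os:cmd route, assuming the HTML Tidy command-line tool
is installed as "tidy", that it accepts the -q, -asxml and -utf8 flags, and
that a Unix shell is available. The module name and the quoting rule are
illustrative only.

```erlang
-module(tidy_cmd).
-export([command/1, to_xhtml/1]).

%% Build the shell command line for one HTML file.
%% Assumed flags: -q (quiet), -asxml (emit well-formed XHTML), -utf8.
command(Path) ->
    false = lists:member($', Path),   % crash on names we cannot quote safely
    "tidy -q -asxml -utf8 '" ++ Path ++ "' 2>/dev/null".

%% Run Tidy and return its standard output as a string, which can then be
%% handed to the strict XML parser.
to_xhtml(Path) ->
    os:cmd(command(Path)).
```

os:cmd does not report the exit status, so a real port program would be the
better choice if Tidy's errors need to be told apart from its output.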

Joe Armstrong-2 wrote:
> 
> 6) Do you want unicode support? or just ASCII.
> 
If you are writing a parser for XML configuration files, then ASCII is fine,
but let's agree that this is stupid: it's much easier to store configuration
in Plain Old Erlang Term (POET) format and just call file:consult.
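
For comparison, this is about all the code a POET configuration needs. The
module name, the keys and the file layout are examples only; file:consult/1
and proplists:get_value/3 are the standard library calls.

```erlang
-module(poet_config).
-export([load/1, lookup/3]).

%% The file holds one Erlang term per entry, each terminated by a dot, e.g.
%%
%%   {port, 8080}.
%%   {hosts, ["alpha", "beta"]}.
%%
%% load/1 returns {ok, Terms} or {error, {Path, Reason}}.
load(Path) ->
    case file:consult(Path) of
        {ok, Terms}     -> {ok, Terms};
        {error, Reason} -> {error, {Path, Reason}}
    end.

%% Fetch one setting, falling back to Default when the key is missing.
lookup(Key, Terms, Default) ->
    proplists:get_value(Key, Terms, Default).
```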

For the webapp and web syndication domain the UTF-8 encoding is a MUST. Most
RSS files are UTF-8.

Joe Armstrong-2 wrote:
> 
> 7) Does the data come from files, sockets, binaries?
> 

In the most generic case the data comes in chunks of UTF-8 encoded binaries.
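
One practical consequence: a multi-byte character can be cut in half at a
chunk boundary, so the decoder has to carry the leftover bytes over to the
next chunk. A minimal sketch, assuming the unicode module from OTP is
available; the module and function names are made up.

```erlang
-module(utf8_chunks).
-export([decode_chunk/2, decode_all/1]).

%% Decode one chunk, given the undecoded bytes left from the previous chunk.
%% Returns {ok, Chars, Pending} or {error, Chars, BadRest}.
decode_chunk(Pending, Chunk) when is_binary(Pending), is_binary(Chunk) ->
    case unicode:characters_to_list(<<Pending/binary, Chunk/binary>>, utf8) of
        Chars when is_list(Chars) -> {ok, Chars, <<>>};
        {incomplete, Chars, Rest} -> {ok, Chars, Rest};   % split character
        {error, Chars, BadRest}   -> {error, Chars, BadRest}
    end.

%% Decode a whole list of chunks into one list of code points.
decode_all(Chunks) ->
    decode_all(Chunks, <<>>, []).

decode_all([], <<>>, Acc) ->
    {ok, lists:append(lists:reverse(Acc))};
decode_all([], Pending, _Acc) ->
    {error, {truncated, Pending}};
decode_all([Chunk | Rest], Pending, Acc) ->
    case decode_chunk(Pending, Chunk) of
        {ok, Chars, Pending1} -> decode_all(Rest, Pending1, [Chars | Acc]);
        {error, _Chars, Bad}  -> {error, {invalid, Bad}}
    end.
```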

Joe Armstrong-2 wrote:
> 
> 8) Do you want strict or lazy parsing. If you don't look at an attribute
>     you might like to defer parsing the content until you actually need it.
> 

I think a lazy parser just gives an illusion of fast processing; it still
reads the raw XML into memory.
The native MSXML DOM parser was a lazy parser from the beginning, and it
showed super-fast benchmarks for just "parsing" (read "opening") XML files,
but when actual processing of XML nodes was involved it was much slower than
non-lazy Java parsers. Whoever wants to process just parts of an XML
document had better use SAX than a lazy tree-building parser.
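
As an illustration of the SAX style, the sketch below counts the elements
with a given name without ever building a tree. It assumes the
xmerl_sax_parser module that ships with OTP, with its event_fun and
event_state options; the wrapper module and its names are hypothetical.

```erlang
-module(sax_count).
-export([count/2]).

%% Count start tags whose local name equals Name (a string).
%% Only a counter is kept as state; no tree is built.
count(Xml, Name) when is_binary(Xml), is_list(Name) ->
    Handler =
        fun({startElement, _Uri, Local, _QName, _Attrs}, _Loc, N)
              when Local =:= Name ->
                N + 1;
           (_Event, _Loc, N) ->
                N
        end,
    {ok, Count, _Rest} =
        xmerl_sax_parser:stream(Xml, [{event_fun, Handler},
                                      {event_state, 0}]),
    Count.
```

For example, sax_count:count(Bin, "item") walks the whole document but only
ever holds an integer, which is the point of using SAX for partial
processing.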

Joe Armstrong-2 wrote:
> 
> 9) Do you want to check all ids and idrefs and correctly handle
> NOTATION's  and so on
>     i.e. all the weird things in the XML spec that 99.9% of
> programmers have never used.
> 

No, the only important parts are namespaces, processing instructions and
CDATA.

Joe Armstrong-2 wrote:
> 
> 10) In the case of XML without a DTD do you want a heuristic to throw away
>       non-significant white space (I often use a simple heuristic: if all
>       the PCDATA children of a tag are all white space then throw all this
>       white space away)
> 

good enough for me
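
A sketch of that heuristic, borrowing the {node,Line,Tag,Attrs,Children} and
{raw,Line,Flag,Bin} tuple shapes quoted further down in this message. The
module name is made up, and the blank test is recomputed from the binary
rather than trusting the flag in the raw tuple, since its exact meaning is
not spelled out.

```erlang
-module(ws_heuristic).
-export([strip/1]).

%% If every PCDATA child of a tag is white space only, drop all of them;
%% otherwise keep the children exactly as they are.
strip({node, Line, Tag, Attrs, Children0}) ->
    Children1 = [strip(C) || C <- Children0],
    Raws = [R || {raw, _, _, _} = R <- Children1],
    AllBlank = lists:all(fun({raw, _, _, Bin}) -> is_blank(Bin) end, Raws),
    Children = case AllBlank of
                   true  -> [C || {node, _, _, _, _} = C <- Children1];
                   false -> Children1
               end,
    {node, Line, Tag, Attrs, Children};
strip({raw, _, _, _} = Raw) ->
    Raw.

%% True when the binary holds nothing but space, tab, CR or LF.
is_blank(<<C, Rest/binary>>)
  when C =:= $\s; C =:= $\t; C =:= $\r; C =:= $\n ->
    is_blank(Rest);
is_blank(<<>>) ->
    true;
is_blank(_) ->
    false.
```

Note that this version also empties a tag whose only content is white space;
restricting it to tags with mixed content, as in the list near the end of
Joe's message, needs one extra check that at least one child is a node.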

Joe Armstrong-2 wrote:
> 
> ...
> 
> It's very difficult to write a parser that correctly handles *all* of
> these and is fast, small etc.
> 
> I have made a set of compromises and a toolkit that provides the
> following.
> 
>    1) A tag level interface to the system
>        this has a file type interface.
> 
>         Pid = open_xml_token_stream(Descriptor)
> 
>       makes a re-entrant token scanner
> 
>        get_next_token(Pid) -> Token | eof
> 
>       A lot of things can be done with this alone, for example a SAX
> like processor
> 
>     2) A simple parser this takes a token stream and parses it just
> checking for
>         well formed-ness.
> 
>     3) A validator that runs on the output of 2 - this only
> understands DTDs (not schemas,
>        or rnc)
> 
>     There are also diverse routines to parse files etc, based on these.
> 
>     I've also written an XSLT type thing that takes the output of 2)
> or 3) and transforms
> it.
> 
>      This is just ASCII.
> 
>      I've talked to a lot of people about XML - most people (the majority)
> want
> something to parse a small ASCII file containing a single XML data
> structure.
> 
>       The data structure is well formed - there is no DTD - and they don't
> care
> about integrity constraints on the attributes- They don't care about
> entity expansion,
> NOTATIONs CDATA etc.
> 

There are two major types of XML:

1. Document-oriented (also Unstructured)
2. Data-oriented (also Structured)

To support document-oriented XML, you need a generic XML parser with
namespaces, CDATA, PI, UTF-8 and XPath/XQuery support.

For data-oriented XML, you do not need a generic parser, but an XML Data
Binding tool, such as an XSD-to-Erlang compiler (xsd2erl), similar to
XMLBeans for Java or RogueWave LEIF for C++. That is, you give it
employee.xsd and it compiles it into an employee.erl module, which you then
use in your application to parse and generate XML instances conforming to
the employee.xsd schema.

Joe Armstrong-2 wrote:
> 
> 
>      The kind of Ruby code shown in an earlier posting is easy given a
> simple parse tree
> 
>      My experimental parser turns an XML data structure in a
> 
>       @type xml() = {node,Line,Tag,attrs(),[xml()]} |
> {raw,Ln,AllBlack:bool(), bin()}
> 
>      It's easy to write a fold function that applies a Fun to each node
> 
> fold_over_nodes(Fun, Env, {node, _, _, _, C} = Node) ->
>     Env1 = Fun(Node, Env),
>     fold_over_nodes(Fun, Env1, C);
> fold_over_nodes(Fun, Env, {raw,_,_,_} = Raw) ->
>     Fun(Raw, Env);
> fold_over_nodes(Fun, Env, [H|T]) ->
>     Env1 = fold_over_nodes(Fun, Env, H),
>     fold_over_nodes(Fun, Env1, T);
> fold_over_nodes(Fun, Env, []) ->
>     Env.
> 
>    (like foldl) - this can be used to extract tags
> 
>      F = fun({node,_,Stag,_,C}=N, E) ->
>             case member(Stag, [a,b,c]) of
>                 true -> [N|E];
>                 false -> E
>             end,
>      Tags = fold_over_nodes(Tree, [], F)
> 

I think it should be:
    Tags = fold_over_nodes(F, [], Tree).
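
Apart from the argument order, the quoted fun also needs its own closing
"end" and a clause for raw nodes, because fold_over_nodes calls it on those
too. A complete version that compiles might look like this; the module name
and tags_named/2 are names chosen here for illustration, while
fold_over_nodes/3 is the function quoted above.

```erlang
-module(xml_fold).
-export([fold_over_nodes/3, tags_named/2]).

%% Apply Fun to every node and raw item of the tree, threading Env through
%% (the same idea as lists:foldl, but over the xml() tree).
fold_over_nodes(Fun, Env, {node, _, _, _, Children} = Node) ->
    Env1 = Fun(Node, Env),
    fold_over_nodes(Fun, Env1, Children);
fold_over_nodes(Fun, Env, {raw, _, _, _} = Raw) ->
    Fun(Raw, Env);
fold_over_nodes(Fun, Env, [H | T]) ->
    Env1 = fold_over_nodes(Fun, Env, H),
    fold_over_nodes(Fun, Env1, T);
fold_over_nodes(_Fun, Env, []) ->
    Env.

%% Collect, in document order, every node whose tag is in Names.
tags_named(Names, Tree) ->
    F = fun({node, _, Tag, _, _} = N, Acc) ->
                case lists:member(Tag, Names) of
                    true  -> [N | Acc];
                    false -> Acc
                end;
           ({raw, _, _, _}, Acc) ->
                Acc                      % raw text is never a match
        end,
    lists:reverse(fold_over_nodes(F, [], Tree)).
```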

Joe Armstrong-2 wrote:
> 
> 
> For very simple applications I could put together a parser that does
> the following:
> 
>    1) Small files only.
>    2) no DTDs or grammar checking
>    3) white space normalisation according to the following.
>        If a tag has MIXED content (ie tags and PCDATA) and all the
> PCDATA bits are all
>        blank then remove all the PCDATA
>    4) ASCII
>    5) simple "obvious" parse tree {tag,Line,Name,[Attr], [Children]}
>        Attr is a sorted [{Key,Val}] list
>    6) Simple SAX library, foldtree functions, find object in tree (like
> Xpath
>        only in Erlang)
> 
> I have a 6-pack of parsers that almost do this - they are all specialised
> in different ways for infinite streams and so on ...
> 
> I'm not sure if a general purpose toolkit (that allows you to build
> the above) or a set
> of completely different parsers with different properties is desirable.
> 
> /Joe Armstrong
> 

Zvi
