[erlang-questions] Rant: I hate parsing XML with Erlang

Tue Oct 23 17:05:32 CEST 2007

Kevin A. Smith wrote:
> If you can get by with a SAX-based approach erlsom can work without  
> schema.
> 
> Atomizer (http://code.google.com/p/atomizer/) uses this approach to  
> parse Atom feeds.

Just to be clear, you could "easily" write a wrapper around
xmerl_scan that fits more or less exactly into the atom_parser
structure. This was pretty much the idea with xmerl, but the
documentation is terse enough that most people have missed
that.

Also, while there are some wrappers provided with xmerl,
there is no SAX wrapper.

Just to illustrate with an (incomplete) wrapper:

-module(xmerl_sax).

-export([file/2]).

-include_lib("xmerl/include/xmerl.hrl").

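%% Scan file F, calling CB(Event, Info, State) for each scanner event.
%% CB must return the new state.
file(F, CB) when is_function(CB, 3) ->
    xmerl_scan:file(F, [{event_fun, fun(E, S) -> event(E, CB, S) end},
                        %% Drop each parsed entity instead of accumulating it.
                        {acc_fun, fun(_, Acc, S) -> {Acc, S} end}]).

%% Skip processing instructions, comments and the XML declaration;
%% pass everything else to the callback and store the state it returns.
event(#xmerl_event{event = E, data = D}, CB, S) ->
    case D of
        #xmlPI{} -> S;
        #xmlComment{} -> S;
        #xmlDecl{} -> S;
        _ ->
            ES = xmerl_scan:event_state(S),
            ES1 = CB(E, data(D), ES),
            xmerl_scan:event_state(ES1, S)
    end.

%% Translate xmerl records into plain tuples for the callback.
data(#xmlAttribute{name = N, value = V}) ->
    {attribute, N, V};
data(#xmlElement{name = N, attributes = As, content = C}) ->
    {element, N, [{K,V} || #xmlAttribute{name = K, value = V} <- As], C};
data(document) -> document;
data(#xmlText{value = V}) ->
    {text, V}.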

11> xmerl_sax:file("/home/etxuwig/contribs/xmerl-0.18.1/priv/testdata/test3.xml",
        fun(E,Info,S) -> io:format("E = ~p, Info = ~p~n", [E,Info]), S end).
E = started, Info = document
E = ended, Info = {attribute,encoding,"iso-8859-1"}
E = started, Info = {element,'People',[],[]}
E = started, Info = {text,undefined}
E = ended, Info = {text,"\n  "}
E = started, Info = {element,comment,[],[]}
E = started, Info = {text,undefined}
E = ended, Info = {text,"This is a comment"}
E = ended, Info = {element,comment,[],[]}
E = started, Info = {text,undefined}
E = ended, Info = {text,"\n  "}
E = ended, Info = {attribute,'Type',"Personal"}
E = started, Info = {element,'Person',[{'Type',"Personal"}],[]}
E = started, Info = {text,undefined}
E = ended, Info = {text,"\n  "}
E = ended, Info = {element,'Person',[{'Type',"Personal"}],[]}
E = started, Info = {text,undefined}
E = ended, Info = {text,"\n"}
E = ended, Info = {element,'People',[],[]}
E = ended, Info = document

...after which xmerl still spits out an xmlElement record as the
return value, which is a bug, IMO.

Another bug is that you can't tell xmerl which accumulator
to use as the return value. This would be easily fixed.
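
Until that is fixed, one way around it is to let the callback mail
its final state back to the caller when the document ends. A rough
sketch, not tested against every xmerl version: sax_count is a
made-up module name, it relies on the xmerl_sax wrapper above, and
it assumes the event state starts out as 'undefined' when no initial
state is given.

```erlang
-module(sax_count).

-export([count/1, handle/3]).

%% Count the elements in File. The result travels back as a message,
%% since the return value of xmerl_sax:file/2 cannot carry it.
count(File) ->
    Parent = self(),
    Ref = make_ref(),
    CB = fun(ended, document, N) ->
                 %% Last event: hand the final state to the caller.
                 Parent ! {Ref, N},
                 N;
            (Event, Info, N) ->
                 handle(Event, Info, N)
         end,
    _ = xmerl_sax:file(File, CB),
    %% The scan is synchronous, so the message is already queued.
    receive
        {Ref, Count} -> Count
    after 0 ->
        undefined
    end.

%% Callback: add one for every element start event.
%% Assumption: the very first call sees 'undefined' as the state.
handle(Event, Info, undefined) ->
    handle(Event, Info, 0);
handle(started, {element, _Name, _Attrs, _Content}, N) ->
    N + 1;
handle(_Event, _Info, N) ->
    N.
```

The counting logic lives in handle/3, so it can be tried out without
a file, e.g. sax_count:handle(started, {element,'People',[],[]}, undefined).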

I agree with Joe: it's pretty easy to write a limited
XML parser that handles > 90% of all XML out there and
returns something that's visually appealing.

Writing a few front-ends to xmerl that are leagues more
user-friendly than the generic back-end is not rocket
science, but I have no problem accepting that people
don't think they should have to do that.
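
To give an idea of what such a front-end could look like, here is a
minimal sketch that turns the xmerl records into plain
{Tag, Attributes, Children} tuples. The module name xml_simple and
the choice to drop whitespace-only text, comments and processing
instructions are just assumptions for the illustration; it is not
part of xmerl.

```erlang
-module(xml_simple).

-export([string/1, simplify/1]).

-include_lib("xmerl/include/xmerl.hrl").

%% Parse an XML string and return {Tag, Attributes, Children}.
string(Str) ->
    {Root, _Misc} = xmerl_scan:string(Str),
    simplify(Root).

%% Elements become {Name, [{AttrName, Value}], Children}.
simplify(#xmlElement{name = N, attributes = As, content = C}) ->
    {N,
     [{K, V} || #xmlAttribute{name = K, value = V} <- As],
     [S || X <- C, S <- [simplify(X)], S =/= skip]};
%% Text is kept as a flat string unless it is only whitespace.
simplify(#xmlText{value = V}) ->
    Text = lists:flatten(V),
    case lists:all(fun is_space/1, Text) of
        true  -> skip;
        false -> Text
    end;
%% Comments, processing instructions etc. are dropped.
simplify(_Other) ->
    skip.

is_space(Ch) ->
    lists:member(Ch, " \t\r\n").
```

With this, xml_simple:string("<a>hi</a>") should give something like
{a,[],["hi"]}, which is a lot easier on the eyes than the records.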

BR,
Ulf W