[erlang-questions] How to best parse XML in erlang?

Ulf Wiger ulf@REDACTED
Mon Jul 29 00:07:23 CEST 2013


Note that xmerl nowadays has a number of token scanners and SAX parser variants that don't blow up the atom table and are as fast as erlsom. However, they aren't wired into the main xmerl utilities, such as validation and XPath.

Here's one very basic parser based on xmerl_sax_parser:

-module(xmerl_simple).
-export([file/1, stream/1]).

file(F) ->
    xmerl_sax_parser:file(F, options()).

stream(S) ->
    xmerl_sax_parser:stream(S, options()).

options() ->
    [{event_state, []},
     {event_fun, fun event/3}].

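%% The state is either a stack of open elements,
%% [{Name, Attrs, Children}|Rest], or {Chars, Stack} while character
%% data is being collected for the innermost open element.
%% Mixed content (text followed by a child element) is not handled.
event({startElement, _, _LocalName, QName, Attrs}, _, Acc) ->
    %% Each attribute arrives as {Uri, Prefix, Name, Value}.
    Attrs1 = [{qname({Pfx,N}), list_to_binary(V)} || {_, Pfx, N, V} <- Attrs],
    [{qname(QName), Attrs1, []}|Acc];
event({endElement, _, _LocalName, QName}, _, S) ->
    Name = qname(QName),
    case S of
        {Cs, [{Name, As, C}|Acc]} ->
            %% Text content: flatten the collected characters to a binary.
            end_element(Name, As, iolist_to_binary([C,Cs]), Acc);
        [{Name, Attrs, C}|Acc] ->
            %% Children were pushed in reverse order; restore document order.
            end_element(Name, Attrs, lists:reverse(C), Acc)
    end;
event({ignorableWhitespace, _}, _, {C, Acc}) ->
    %% Collapse ignorable whitespace inside text to a single space.
    {[C, " "], Acc};
event({characters, Cs}, _, {C, Acc}) ->
    {[C,Cs], Acc};
event({characters, Cs}, _, Acc) ->
    {Cs, Acc};
event(_, _, S) ->
    %% All other events leave the state unchanged.
    S.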

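%% A finished element with no parent is the root and becomes the final
%% state; otherwise it is added to its parent's list of children.
end_element(Name, As, C, []) ->
    {Name, As, C};
end_element(Name, As, C, [{PName, PAs, PC}|Acc]) ->
    [{PName, PAs, [{Name, As, C}|PC]}|Acc].

%% Convert the {Prefix, LocalName} strings to atoms. Atoms are never
%% garbage collected, so only do this with trusted input.
qname({NS,Name}) ->
    {list_to_atom(NS),list_to_atom(Name)}.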

* * * Simple XML file:

<example>
<rpc message-id="101"
     xmlns="urn:ietf:params:xml:ns:netconf:base:1.0"
     xmlns:ex="http://example.net/content/1.0"
     ex:user-id="fred">
  <get/>
</rpc>
<rpc-reply message-id="101"
	   xmlns="urn:ietf:params:xml:ns:netconf:base:1.0"
	   xmlns:ex="http://example.net/content/1.0"
	   ex:user-id="fred">
  <data>
    <!-- contents here... -->
  </data>
</rpc-reply>
</example>

* * * Parsing:

Eshell V5.9.2  (abort with ^G)
1> xmerl_simple:file("examples/ncrfc_rpc_example_1.xml").
{ok,{{'',example},
     [],
     [{{'',rpc},
       [{{'','message-id'},<<"101">>},{{ex,'user-id'},<<"fred">>}],
       [{{'',get},[],[]}]},
      {{'','rpc-reply'},
       [{{'','message-id'},<<"101">>},{{ex,'user-id'},<<"fred">>}],
       [{{'',data},[],[]}]}]},
    <<>>}

(Note that while this parse emitted atoms, that was *my* doing, via list_to_atom/1 in qname/1, not the parser's.)
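
If the input is not trusted, one way to avoid creating atoms at all is to have qname/1 return binaries instead. This is only a minimal sketch: the module name xmerl_qname_bin is made up for illustration, and it assumes that whatever consumes the resulting tree can match on binaries rather than atoms.

```erlang
-module(xmerl_qname_bin).
-export([qname/1]).

%% Hypothetical replacement for qname/1 in xmerl_simple.
%% Converts the {Prefix, LocalName} strings delivered by
%% xmerl_sax_parser into binaries, so that no atoms are created
%% from document content.
qname({Prefix, Name}) ->
    {unicode:characters_to_binary(Prefix),
     unicode:characters_to_binary(Name)}.
```

With this variant, the attribute name that appears above as {ex,'user-id'} would come out as {<<"ex">>,<<"user-id">>}, and an empty prefix becomes <<>> instead of ''.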

BR,
Ulf

On 28 Jul 2013, at 21:51, Ignas Vyšniauskas <baliulia@REDACTED> wrote:

> Hi,
> 
> I was actually researching available XML parsers last week, so I'm just
> going to list all the options I found:
> 
> * xmerl (the standard)
> * erlsom (the "better")
> * exml (https://github.com/paulgray/exml) (expat based)
> * parsexml (KISS) (https://github.com/maxlapshin/parsexml)
> * (shameless plug) my even more simplified fork of parsexml
> https://github.com/yfyf/parsexml
> 
> It all depends on what kind of XML you want to parse/support, i.e.
> answering these questions for yourself:
> * how complex is your XML?
> * can you keep holding the XMLs in memory?
> * how much do you care about speed?
> 
> --
> Ignas

Ulf Wiger, Co-founder & Developer Advocate, Feuerlabs Inc.
http://feuerlabs.com





