[erlang-questions] Rant: I hate parsing XML with Erlang

Willem de Jong w.a.de.jong@REDACTED
Tue Oct 23 21:35:13 CEST 2007


Using erlsom, you can write:

-module(product).
-compile(export_all).

parse(File) ->
  {ok, Model} = erlsom:compile_xsd_file("product.xsd"),
  {ok, Result, _} = erlsom:scan_file(File, Model),
  Result.

Then you can do:
1> erlsom:write_xsd_hrl_file("product.xsd", "product.hrl", []).
2> rr("product.hrl").
3> product:parse("export.xml").
#'Export'{anyAttribs = [],
          'Product' = #'Product'{anyAttribs = [],
                                 'SKU' = "403276",
                                 'ItemName' = "Trivet",
                                 'CollectionNo' = 0,
                                 'Pages' = 0}}

Very different from the example, but also nice :) And maybe more useful,
depending on what you want to do with it.
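As a rough sketch of what "doing something with it" could look like: the
result is an ordinary Erlang record, so fields can be picked out by pattern
matching, and the integer fields arrive already converted. The record
definitions below are an assumption, reconstructed by hand from the shell
output above so that the snippet compiles on its own; in practice they would
come from the generated product.hrl. The module name product_fields and the
function to_proplist/1 are made up for illustration.

```erlang
-module(product_fields).
-export([to_proplist/1, demo/0]).

%% ASSUMPTION: these definitions mirror the shell output shown above.
%% Real code would use -include("product.hrl") instead.
-record('Product', {anyAttribs = [], 'SKU', 'ItemName', 'CollectionNo', 'Pages'}).
-record('Export', {anyAttribs = [], 'Product'}).

%% Flatten the nested records into a list of {Key, Value} pairs.
to_proplist(#'Export'{'Product' = #'Product'{'SKU' = Sku,
                                             'ItemName' = Name,
                                             'CollectionNo' = CollNo,
                                             'Pages' = Pages}}) ->
    [{sku, Sku}, {item_name, Name}, {collection_no, CollNo}, {pages, Pages}].

%% Build the same value the shell printed and convert it.
demo() ->
    Export = #'Export'{'Product' = #'Product'{'SKU' = "403276",
                                              'ItemName' = "Trivet",
                                              'CollectionNo' = 0,
                                              'Pages' = 0}},
    to_proplist(Export).
```

With the real parser this would be called as
product_fields:to_proplist(product:parse("export.xml")).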

You need to provide a schema, of course. I am pasting an example schema for
this XML at the end of this email. Using the schema has the advantage that
the XML document will be validated. A schema is also useful as documentation
of the interface. Even if you don't like XML schemas (I still have
problems with them, even after writing the parser), having a specification
should be a good thing. Isn't this a bit like ASN.1, actually?

If you don't like the approach with the schema, you can also do this:

-module(product_sax).
-compile(export_all).

parse(File) ->
  {ok, Bin}  = file:read_file(File),
  {R, _} = erlsom_sax:parseDocument(binary_to_list(Bin), {s1, []},
                                    fun callback/2),
  lists:reverse(R).

callback({startElement, _, "Product", _, _}, {s1, S}) ->
  {s2, S};
callback({startElement, _, Tag, _, _}, {s2, S}) ->
  {s3, {Tag, S}};
callback({characters, Value}, {s3, {Tag, List}}) ->
  {s2, [{Tag, Value} | List]};
callback({endElement, _, "Product", _}, {_, S}) -> S;
callback(_, S) -> S.

4> product_sax:parse("export.xml").
[{"SKU","403276"},{"ItemName","Trivet"},{"CollectionNo","0"},{"Pages","0"}]

This uses a callback and a simple state machine: also nice, and very
efficient.
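To see how the state machine moves through s1, s2 and s3, the callback can be
exercised without the parser by folding it over a list of events. This is
only a sketch: the event list is written by hand in the tuple shapes that the
callback above matches on (it is not captured from a real erlsom run), and the
module name product_fold is made up for illustration.

```erlang
-module(product_fold).
-export([demo/0]).

%% Same clauses as the callback in product_sax.
callback({startElement, _, "Product", _, _}, {s1, S}) ->
    {s2, S};                          % s1: waiting for <Product>
callback({startElement, _, Tag, _, _}, {s2, S}) ->
    {s3, {Tag, S}};                   % s2: a field element opens
callback({characters, Value}, {s3, {Tag, List}}) ->
    {s2, [{Tag, Value} | List]};      % s3: the field's text arrives
callback({endElement, _, "Product", _}, {_, S}) -> S;
callback(_, S) -> S.

%% Hand-written events for a two-field document (ASSUMPTION: illustrative only).
demo() ->
    Events = [startDocument,
              {startElement, [], "Export", [], []},
              {startElement, [], "Product", [], []},
              {startElement, [], "SKU", [], []},
              {characters, "403276"},
              {endElement, [], "SKU", []},
              {startElement, [], "ItemName", [], []},
              {characters, "Trivet"},
              {endElement, [], "ItemName", []},
              {endElement, [], "Product", []},
              {endElement, [], "Export", []},
              endDocument],
    lists:reverse(lists:foldl(fun callback/2, {s1, []}, Events)).
```

Note that the machine only returns from s3 to s2 on a characters event, so an
empty field element would leave it in s3 and the next field's text would be
stored under the empty element's tag. For the example document that case does
not occur.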

I have written a new version of the sax parser that can parse a file in
blocks, so that you can use it to parse very big files or streams of data.
At the moment I am doing some final testing, finishing the documentation,
etc. Not the kind of work I like, so it is likely to take a while (1 - 2
weeks). The new release also fixes some bugs in the XML Schema related code,
and it has some features that should make erlsom better suited for SOAP.

Regards,
Willem

--------------------------
The schema:

<xsd:schema xmlns:xsd='http://www.w3.org/2001/XMLSchema'>

<xsd:element name='Export'>
  <xsd:complexType>
    <xsd:sequence>
      <xsd:element name='Product' type='Product'/>
    </xsd:sequence>
  </xsd:complexType>
</xsd:element>

<xsd:complexType name='Product'>
  <xsd:sequence>
    <xsd:element name='SKU' type='xsd:string'/>
    <xsd:element name='ItemName' type='xsd:string'/>
    <xsd:element name='CollectionNo' type='xsd:integer'/>
    <xsd:element name='Pages' type='xsd:integer'/>
  </xsd:sequence>
</xsd:complexType>

</xsd:schema>



On 10/23/07, Hakan Mattsson <hakan@REDACTED> wrote:
>
> On Tue, 23 Oct 2007, Joel Reymont wrote:
>
> JR> Take a look at the following [1] and try to visualize an
> JR> implementation in Erlang. More thoughts after the example.
> JR>
> JR> The data:
> JR>
> JR> <Export>
> JR>    <Product>
> JR>      <SKU>403276</SKU>
> JR>      <ItemName>Trivet</ItemName>
> JR>      <CollectionNo>0</CollectionNo>
> JR>      <Pages>0</Pages>
> JR>    </Product>
> JR> </Export>
> JR>
> JR> The Ruby hPricot code:
> JR>
> JR> FIELDS = %w[SKU ItemName CollectionNo Pages]
> JR>
> JR> doc = Hpricot.parse(File.read("my.xml"))
> JR> (doc/:product).each do |xml_product|
> JR>    product = Product.new
> JR>    for field in FIELDS
> JR>      product[field] = (xml_product/field.intern).first.innerHTML
> JR>    end
> JR>    product.save
> JR> end
>
> At a first glance your Ruby code looks impressively
> compact.  But the corresponding implementation in
> Erlang is about the same size. What's the point in
> adding some syntactic sugar in order to make it even
> more compact? It is just a matter of taste.
>
>    % cat product.erl
>    -module(product).
>    -compile(export_all).
>    -include_lib("xmerl/include/xmerl.hrl").
>
>    parse(File) ->
>        {#xmlElement{content = Exports}, _} = xmerl_scan:file(File),
>        [{Tag, Val} || #xmlElement{content = Products} <- Exports,
>                       #xmlElement{content = Fields} <- Products,
>                       #xmlText{parents = [{Tag, _} | _], value = Val} <- Fields].
>
>    % erl
>    Erlang (BEAM) emulator version 5.5.5 [source] [async-threads:0] [kernel-poll:false]
>
>    Eshell V5.5.5  (abort with ^G)
>    1> product:parse("my.xml").
>    [{'SKU',"403276"},{'ItemName',"Trivet"},{'CollectionNo',"0"},{'Pages',"0"}]
>    2>
>
> /Håkan
> _______________________________________________
> erlang-questions mailing list
> erlang-questions@REDACTED
> http://www.erlang.org/mailman/listinfo/erlang-questions
>