[erlang-questions] Rant: I hate parsing XML with Erlang
Willem de Jong
w.a.de.jong@REDACTED
Tue Oct 23 21:35:13 CEST 2007
Using erlsom, you can write:
-module(product).
-compile(export_all).

parse(File) ->
    {ok, Model} = erlsom:compile_xsd_file("product.xsd"),
    {ok, Result, _} = erlsom:scan_file(File, Model),
    Result.
Then you can do:
1> erlsom:write_xsd_hrl_file("product.xsd", "product.hrl", []).
2> rr("product.hrl").
3> product:parse("export.xml").
#'Export'{anyAttribs = [],
          'Product' = #'Product'{anyAttribs = [],
                                 'SKU' = "403276",
                                 'ItemName' = "Trivet",
                                 'CollectionNo' = 0,
                                 'Pages' = 0}}
The result is very different from the example, but also nice :) And it may
be more useful, depending on what you want to do with it.
You need to provide a schema, of course. I am pasting an example schema for
this XML at the end of this email. Using the schema has the advantage that
the XML document will be validated. A schema is also useful as
documentation of the interface. Even if you don't like XML schemas (I still
have problems with them, even after writing the parser), having a
specification should be a good thing. Isn't this a bit like ASN.1, actually?
If you don't like the approach with the schema, you can also do this:
-module(product_sax).
-compile(export_all).

parse(File) ->
    {ok, Bin} = file:read_file(File),
    {R, _} = erlsom_sax:parseDocument(binary_to_list(Bin), {s1, []},
                                      fun callback/2),
    lists:reverse(R).

callback({startElement, _, "Product", _, _}, {s1, S}) ->
    {s2, S};
callback({startElement, _, Tag, _, _}, {s2, S}) ->
    {s3, {Tag, S}};
callback({characters, Value}, {s3, {Tag, List}}) ->
    {s2, [{Tag, Value} | List]};
callback({endElement, _, "Product", _}, {_, S}) -> S;
callback(_, S) -> S.
4> product_sax:parse("export.xml").
[{"SKU","403276"},{"ItemName","Trivet"},{"CollectionNo","0"},{"Pages","0"}]
This uses a callback and a simple state machine. Also nice, and very
efficient.
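To see how the states move without installing erlsom, here is a minimal
sketch that replays a hand-written list of events through the same
callback with lists:foldl/3. The module name product_fsm, the functions
run/1 and sample_events/0, and the event list itself are made up for
illustration; the events only approximate what the parser would emit for
the first two fields of the example document.

```erlang
-module(product_fsm).
-export([run/1, sample_events/0]).

%% Replay a list of SAX-style events through the state machine.
%% Illustration only: the real parser produces the events itself.
run(Events) ->
    lists:reverse(lists:foldl(fun callback/2, {s1, []}, Events)).

%% s1: waiting for <Product>
%% s2: inside Product, between fields
%% s3: inside a field, waiting for its text
callback({startElement, _, "Product", _, _}, {s1, S}) ->
    {s2, S};
callback({startElement, _, Tag, _, _}, {s2, S}) ->
    {s3, {Tag, S}};
callback({characters, Value}, {s3, {Tag, List}}) ->
    {s2, [{Tag, Value} | List]};
callback({endElement, _, "Product", _}, {_, S}) ->
    S;
callback(_, S) ->
    S.

%% Hand-written events (an assumption, not captured parser output).
sample_events() ->
    [startDocument,
     {startElement, [], "Export", [], []},
     {startElement, [], "Product", [], []},
     {startElement, [], "SKU", [], []},
     {characters, "403276"},
     {endElement, [], "SKU", []},
     {startElement, [], "ItemName", [], []},
     {characters, "Trivet"},
     {endElement, [], "ItemName", []},
     {endElement, [], "Product", []},
     {endElement, [], "Export", []},
     endDocument].
```

Calling product_fsm:run(product_fsm:sample_events()) walks s1, s2, s3 and
back to s2 for each field, and the end tag of Product drops the state tag
so that only the accumulated list remains.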
I have written a new version of the SAX parser that can parse a file in
blocks, so that you can use it to parse very big files or streams of data.
At the moment I am doing some final testing, finishing the documentation,
etc. This is not the kind of work I like, so it is likely to take a while
(1 to 2 weeks). The new release also fixes some bugs in the XML Schema
related code, and it has some features that should make erlsom easier to
use for SOAP.
Regards,
Willem
--------------------------
The schema:
<xsd:schema xmlns:xsd='http://www.w3.org/2001/XMLSchema'>
  <xsd:element name='Export'>
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element name='Product' type='Product'/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
  <xsd:complexType name='Product'>
    <xsd:sequence>
      <xsd:element name='SKU' type='xsd:string'/>
      <xsd:element name='ItemName' type='xsd:string'/>
      <xsd:element name='CollectionNo' type='xsd:integer'/>
      <xsd:element name='Pages' type='xsd:integer'/>
    </xsd:sequence>
  </xsd:complexType>
</xsd:schema>
On 10/23/07, Hakan Mattsson <hakan@REDACTED> wrote:
>
> On Tue, 23 Oct 2007, Joel Reymont wrote:
>
> JR> Take a look at the following [1] and try to visualize an
> JR> implementation in Erlang. More thoughts after the example.
> JR>
> JR> The data:
> JR>
> JR> <Export>
> JR>   <Product>
> JR>     <SKU>403276</SKU>
> JR>     <ItemName>Trivet</ItemName>
> JR>     <CollectionNo>0</CollectionNo>
> JR>     <Pages>0</Pages>
> JR>   </Product>
> JR> </Export>
> JR>
> JR> The Ruby hPricot code:
> JR>
> JR> FIELDS = %w[SKU ItemName CollectionNo Pages]
> JR>
> JR> doc = Hpricot.parse(File.read("my.xml"))
> JR> (doc/:product).each do |xml_product|
> JR>   product = Product.new
> JR>   for field in FIELDS
> JR>     product[field] = (xml_product/field.intern).first.innerHTML
> JR>   end
> JR>   product.save
> JR> end
>
> At a first glance your Ruby code looks impressively
> compact. But the corresponding implementation in
> Erlang is about the same size. What's the point in
> adding some syntactic sugar in order to make it even
> more compact? It is just a matter of taste.
>
> % cat product.erl
> -module(product).
> -compile(export_all).
> -include_lib("xmerl/include/xmerl.hrl").
>
> parse(File) ->
>     {#xmlElement{content = Exports}, _} = xmerl_scan:file(File),
>     [{Tag, Val} || #xmlElement{content = Products} <- Exports,
>                    #xmlElement{content = Fields} <- Products,
>                    #xmlText{parents = [{Tag, _} | _], value = Val} <- Fields].
>
> % erl
> Erlang (BEAM) emulator version 5.5.5 [source] [async-threads:0] [kernel-poll:false]
>
> Eshell V5.5.5 (abort with ^G)
> 1> product:parse("my.xml").
> [{'SKU',"403276"},{'ItemName',"Trivet"},{'CollectionNo',"0"},{'Pages',"0"}]
> 2>
>
> /Håkan