[erlang-questions] Rant: I hate parsing XML with Erlang

Richard A. O'Keefe ok@REDACTED
Wed Oct 24 23:57:58 CEST 2007


>> On Tue, 23 Oct 2007, Joel Reymont wrote:
>>
>> JR> Take a look at the following [1] and try to visualize an
>> JR> implementation in Erlang. More thoughts after the example.
>> JR>
>> JR> The data:
>> JR>
>> JR> <Export>
>> JR>    <Product>
>> JR>      <SKU>403276</SKU>
>> JR>      <ItemName>Trivet</ItemName>
>> JR>      <CollectionNo>0</CollectionNo>
>> JR>      <Pages>0</Pages>
>> JR>    </Product>
>> JR> </Export>

The first thing I note here is that this is not what I would call
well designed XML.  A good rule of design in general is
"if data are naturally unordered, do not impose any more order on them
than you can help".  Here we have four string properties identified by
name.  XML has a mechanism that is specifically designed for that job:
attributes.  The design rule has an exception "it is sometimes OK to
represent a set or bag as a sequence, provided you ensure that the
results are invariant under permutation."  So in *good* XML this
example is
   <Export>
     <Product SKU="403276" Item="Trivet" Collection="0" Pages="0"/>
   </Export>
There are two more benefits here.  One is that long-winded element
names like ItemName and CollectionNo can be replaced by shorter
attribute names like Item and Collection because attributes are
inherently contextual: it's really Product@Item and Product@Collection.
The second is that the attribute version is 83 bytes,
while the element version is 144 bytes, both using 1 space per level
indentation.  That reduces space by a factor of more than 1.7, and as we all know,
data *space* = I/O *time*.  In fact the better XML is even better than
that: when extended to a million products the space saving is better
than 1.95 and so is the time saving when parsing.
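
The byte counts above can be checked mechanically.  The following is
a minimal sketch, assuming one space per indentation level and a
newline after every line; the module name xml_size and the function
sizes/1 are made up for illustration.

```erlang
-module(xml_size).
-export([sizes/1]).

%% One product in the attribute layout (1 space of indentation).
attr_product() ->
    " <Product SKU=\"403276\" Item=\"Trivet\" Collection=\"0\" Pages=\"0\"/>\n".

%% One product in the element layout (1 space per nesting level).
elem_product() ->
    [" <Product>\n",
     "  <SKU>403276</SKU>\n",
     "  <ItemName>Trivet</ItemName>\n",
     "  <CollectionNo>0</CollectionNo>\n",
     "  <Pages>0</Pages>\n",
     " </Product>\n"].

%% Bytes for a file of N products in each layout, plus the ratio.
%% The <Export> wrapper is counted once; each product is counted N times.
sizes(N) ->
    Wrapper = iolist_size(["<Export>\n", "</Export>\n"]),
    Attr = Wrapper + N * iolist_size(attr_product()),
    Elem = Wrapper + N * iolist_size(elem_product()),
    {Attr, Elem, Elem / Attr}.
```

xml_size:sizes(1) gives 83 and 144 bytes, and for a million products
the ratio settles at roughly 125/64, which is about 1.953.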

Does anyone remember that I proposed that Erlang could be
extended quite simply with XML expressions and XML patterns?

f() ->
     [  #product{sku=S,item=I,collection=C,pages=P}
     || <'Export'>L</> <- xml:parse_file("my.xml")
      , <'Product' 'SKU'=S 'Item'=I 'Collection'=C 'Pages'=P/> <- L].

Without that extension, it would have to be something like

f() ->
     [  #product{sku=S,item=I,collection=C,pages=P}
     || {'Export',_,L} <- xml:parse_file("my.xml")
      , {'Product',[{'Collection',C},{'Item',I},{'Pages',P},{'SKU',S}],
		  []} <- L].

except that this wouldn't work with additional attributes, and the
previous version would.
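
One way to tolerate additional attributes in plain Erlang is to look
each attribute up by name instead of matching the whole attribute
list.  This is only a sketch: it assumes the {Name, Attributes,
Children} tuple shape used in the patterns above and takes the parsed
terms as an argument, and the module name xml_attrs is made up.
lists:keyfind/3 is the standard library function.

```erlang
-module(xml_attrs).
-export([products/1]).

-record(product, {sku, item, collection, pages}).

%% Docs is assumed to be a list of {Name, [{AttrName, Value}], Children}.
%% lists:keyfind/3 returns the {AttrName, Value} pair or false, so a
%% Product that lacks one of the four attributes is silently skipped,
%% while attributes we do not ask for are simply ignored.
products(Docs) ->
    [  #product{sku=S, item=I, collection=C, pages=P}
    || {'Export', _, L}   <- Docs
     , {'Product', As, _} <- L
     , {_, S} <- [lists:keyfind('SKU', 1, As)]
     , {_, I} <- [lists:keyfind('Item', 1, As)]
     , {_, C} <- [lists:keyfind('Collection', 1, As)]
     , {_, P} <- [lists:keyfind('Pages', 1, As)]
    ].
```

The order of the attributes in the list no longer matters either,
which suits data that are naturally unordered.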

I also happen to think that good XML style imitates XHTML, SVG, MathML,
and other such standards in preferring lower case starts or even
avoiding upper case starts entirely, rather than imitating Visual Basic
style.
Coincidentally that means rather less quoting in Erlang, so

f() ->
     [  #product{sku=S,item=I,collection=C,pages=P}
     || {export,_,L} <- xml:parse_file("my.xml")
      , {product,[{'SKU',S},{collection,C},{item,I},{pages,P}], []}
          <- L].

This is doable *now* in Erlang, using an XML parser I wrote back in
June 2001.  I honestly cannot see this as inferior to the Ruby version.
So now let's revert to the inferior application of XML and see what
that looks like:

f() ->
     [  #product{sku=S,item=I,collection=C,pages=P}
     || {'Export',_,L} <- xml:parse_file("my.xml")
      , {'Product',_,D} <- L
      , {'SKU',_,[S]} <- D
      , {'ItemName',_,[I]} <- D
      , {'CollectionNo',_,[C]} <- D
      , {'Pages',_,[P]} <- D
     ].

That wasn't *too* bad, was it?  Shorter than the Ruby version...
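
To make the element version concrete without a parser, here is a
sketch that runs the same comprehension over a hand-written term.
The term in sample/0 is an assumption about what a parser with this
tuple shape might return for the quoted data (whitespace kept as
plain strings); the module name xml_elems is made up.

```erlang
-module(xml_elems).
-export([sample/0, products/1]).

-record(product, {sku, item, collection, pages}).

%% Hypothetical parse result for the element-style document.
%% Each element is {Name, Attributes, Children}; text is a string.
sample() ->
    [{'Export', [],
      ["\n ",
       {'Product', [],
        ["\n  ", {'SKU', [], ["403276"]},
         "\n  ", {'ItemName', [], ["Trivet"]},
         "\n  ", {'CollectionNo', [], ["0"]},
         "\n  ", {'Pages', [], ["0"]},
         "\n "]},
       "\n"]}].

%% Generator patterns act as filters: children that do not match,
%% such as the whitespace strings, are skipped without error.
products(Docs) ->
    [  #product{sku=S, item=I, collection=C, pages=P}
    || {'Export', _, L}          <- Docs
     , {'Product', _, D}         <- L
     , {'SKU', _, [S]}           <- D
     , {'ItemName', _, [I]}      <- D
     , {'CollectionNo', _, [C]}  <- D
     , {'Pages', _, [P]}         <- D
    ].
```

xml_elems:products(xml_elems:sample()) yields a single product record
with the four strings from the example.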

Handling namespaces is trickier.  Erlang does let us include *bound*
variables in patterns, but not in the generators of a list
comprehension: there a pattern variable is always fresh and shadows
the outer binding (the compiler warns about it).  So the names are
compared in filters, and we could do something like this:

f() ->
     Main_NS = "http://www.example.org/silly/main",
     Attr_NS = "http://www.example.org/silly/attr",
     Export       = xml:name('Export',       Main_NS),
     Product      = xml:name('Product',      Main_NS),
     SKU          = xml:name('SKU',          Attr_NS),
     ItemName     = xml:name('ItemName',     Attr_NS),
     CollectionNo = xml:name('CollectionNo', Attr_NS),
     Pages        = xml:name('Pages',        Attr_NS),

     [  #product{sku=S,item=I,collection=C,pages=P}
     || {N1,_,L}   <- xml:parse_file("my.xml"), N1 =:= Export
      , {N2,_,D}   <- L, N2 =:= Product
      , {N3,_,[S]} <- D, N3 =:= SKU
      , {N4,_,[I]} <- D, N4 =:= ItemName
      , {N5,_,[C]} <- D, N5 =:= CollectionNo
      , {N6,_,[P]} <- D, N6 =:= Pages
     ].

It doesn't get much easier than this, anywhere.  It so happens that
xml:name/2 doesn't exist.  It would be
	name(Name, NS) -> {Name,NS}.
but my old parser doesn't do namespaces.  Not much point in changing it
when there are more capable parsers around.
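
For what it is worth, the namespace idea can be tried without any
parser.  The sketch below assumes element names arrive as the
{Name,NS} pairs built by name/2 and takes the parsed terms as an
argument.  The helper contents/2 and the module name xml_ns are made
up; the helper relies on a variable repeated in a function head
meaning "equal", since a variable in a comprehension generator would
shadow the outer binding rather than be compared with it.

```erlang
-module(xml_ns).
-export([name/2, products/1]).

-record(product, {sku, item, collection, pages}).

%% Expanded name as suggested in the text: {LocalName, NamespaceURI}.
name(Name, NS) -> {Name, NS}.

%% Children lists of every element in the list whose name equals Name.
%% Name occurs twice in the first clause head, so that clause only
%% matches when the element's name is equal to the requested one.
contents(Name, [{Name, _, Kids} | Rest]) -> [Kids | contents(Name, Rest)];
contents(Name, [_ | Rest])               -> contents(Name, Rest);
contents(_, [])                          -> [].

products(Docs) ->
    Main = "http://www.example.org/silly/main",
    Attr = "http://www.example.org/silly/attr",
    [  #product{sku=S, item=I, collection=C, pages=P}
    || L   <- contents(name('Export', Main), Docs)
     , D   <- contents(name('Product', Main), L)
     , [S] <- contents(name('SKU', Attr), D)
     , [I] <- contents(name('ItemName', Attr), D)
     , [C] <- contents(name('CollectionNo', Attr), D)
     , [P] <- contents(name('Pages', Attr), D)
    ].
```

An element called Pages in some other namespace is ignored, because
the whole {Name,NS} pair has to be equal.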

ML: doesn't allow bound variables in patterns, doesn't have list
     comprehension
Haskell and Clean: don't allow bound variables in patterns,
     do have list comprehension
Erlang: does allow bound variables in patterns (in = and case
     patterns; generator patterns shadow them), does have list
     comprehension, making this way of picking XML apart dead easy.





