XML and Erlang

Mon Jun 20 17:32:51 CEST 2005

    It seems like XML is not going to go away - I closed my eyes, screwed them up, put my fingers in
my ears and waited for years - but it's still here, so can we do anything with it?

    Possibly :-)

    A few days ago I stumbled upon the "RELAX NG compact notation" - this is more or less "XML DTDs as they
should have been" - James Clark has done a wonderful job here and invented a notation that is very
similar to the informal type notation I have been using to define Erlang types for the last hundred years or so.

    This is so near that with a slight tweak Erlang terms can become XML data structures.

    Suddenly I realised that all the XML parsers I'd ever written were wrong - sorry guys ...

    What was wrong and how can it be fixed?

    What's wrong with my XML parsers was the representation.

    I represented

	<tag Attributes> Children </tag>

    as

	{tag, AssociationListOfAttributes, [ Children ]}

    At first sight this looks ok - we don't know how many children an element has so we represent the
children as a list of elements - this is WRONG.

    So what should it have been?

     Remember the old DTDs (the nice easy ones *before* the standardisation committees got their hands on them)?

    Suppose I write

	<!ELEMENT a (b,c,d)>

     This means that a is a fixed length sequence of three items. How do we do fixed length sequences in
Erlang? - yes, that's right, tuples. Thus the parse tree of an a should be:

	{a, Attrs, {p(b), p(c), p(d)}}   where p(X) is the parse tree of X

     and NOT

	{a, Attrs, [p(b), p(c), p(d) ] }

     With this small twist it's easy to see the relationship between DTDs and Erlang terms:

	<!ELEMENT a (b, c*, d)> is thus represented as {a, Attrs, { p(b), [p(c)], p(d) } }
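
     To make the difference concrete, here is a minimal sketch of what the tuple form buys you. The
module name dtd_shape and the functions example/0 and count_c/1 are made up for illustration, and the
leaf bodies use the {value, Text} form that shows up further down.

```erlang
-module(dtd_shape).
-export([example/0, count_c/1]).

%% A parse tree for <!ELEMENT a (b, c*, d)>.
%% The fixed sequence (b, c*, d) is a 3-tuple; the repeated c* is a list.
example() ->
    {a, [], {{b, [], {value, "x"}},
             [{c, [], {value, "1"}}, {c, [], {value, "2"}}],
             {d, [], {value, "y"}}}}.

%% Because the body is a fixed-size tuple, a single pattern names each part.
%% With a flat list of children we would have to scan for the c elements.
count_c({a, _Attrs, {_B, Cs, _D}}) when is_list(Cs) ->
    length(Cs).
```

     The pattern in count_c/1 only matches bodies with exactly three parts, so the shape demanded by
the DTD is visible in the function head.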

     What's this got to do with RELAX NG compact syntax?

      In the compact syntax you can say things like

	town = element town {
		attribute name { text },
		street*
	}

       This represents an XML data type like

	<town name="Stockholm">
	   <street>...</street>
	   <street>...</street>
	</town>

       Or an Erlang term

	{town, [{name, "Stockholm"}], [{street,...},...]}

       And the transformation between the two is pretty straightforward.

       Now town involved a data constructor - it is a wrapper for a sequence of streets.

      The base case might be something like

	name = element name { text }

      an XML instance might be 

	<name>Joe</name>

     or in Erlang

	{name, [], {value, "Joe"}}
     
     Note that {value, "Joe"} can never be interpreted as the body of an element containing two sub-elements
since in this case each of the sub-elements would have to be a 3-tuple.
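
     Going from the Erlang term back to XML might look something like the sketch below. The module
name xml_term and the function to_xml/1 are hypothetical, attribute values are assumed to be strings,
and escaping of characters like < and & is left out.

```erlang
-module(xml_term).
-export([to_xml/1]).

%% to_xml(Element) -> string()
%% Element is a 3-tuple {Tag, Attrs, Body}.
to_xml({Tag, Attrs, Body}) when is_atom(Tag), is_list(Attrs) ->
    Name = atom_to_list(Tag),
    lists:flatten(["<", Name, attrs(Attrs), ">", body(Body), "</", Name, ">"]).

%% Attributes are an association list of {Name, Value} pairs.
attrs(Attrs) ->
    [[" ", atom_to_list(K), "=\"", V, "\""] || {K, V} <- Attrs].

%% A tagged 2-tuple is text.
body({value, Text}) ->
    Text;
%% A 3-tuple starting with an atom is a single child element.
body({Tag, Attrs, _} = Element) when is_atom(Tag), is_list(Attrs) ->
    to_xml(Element);
%% A list is a repeated item (the c* case).
body(Items) when is_list(Items) ->
    [body(I) || I <- Items];
%% Any other tuple is a fixed length sequence.
body(Seq) when is_tuple(Seq) ->
    [body(I) || I <- tuple_to_list(Seq)].
```

     With this, xml_term:to_xml({name, [], {value, "Joe"}}) gives back "<name>Joe</name>". The clauses
never clash because a sequence always starts with an element, a list or a value, never with a bare atom.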

     This gives us an "almost" canonical Erlang representation of XML (3-tuples for elements, 2-tuples for values)
- it's "almost" because of "white space issues" - disregarding white space issues I think this is a canonical
representation.

     If it is then the following things are possible.

	- Compact encoding/decoding of XML (ie turn it into the canonical representation then
	  use term_to_binary etc.)
	- Verifying that a data instance conforms to a particular DTD (easy)
	- Type inference of XML producing functions (ie run the dialyzer, and convert the
	   inferred types back to XML)
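
    The first item is little more than a pair of BIF calls. A minimal sketch - the module name
xml_pack and the functions pack/1 and unpack/1 are made up here, and the compressed option is just
one possible choice:

```erlang
-module(xml_pack).
-export([pack/1, unpack/1]).

%% Encode a canonical XML term using the external term format.
%% The compressed option asks term_to_binary/2 to compress the result.
pack(Term) ->
    term_to_binary(Term, [compressed]).

%% Decode it again. Only use this on binaries from a trusted source.
unpack(Bin) when is_binary(Bin) ->
    binary_to_term(Bin).
```

    Round-tripping a term through pack/1 and unpack/1 should give back exactly the term we started with.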
          
    It also means that dynamic type checking in Erlang is almost identical to checking
an instance of an XML datatype against a schema.
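
    Here is a rough sketch of what such a check could look like. The module xml_check, the function
valid/2 and the way the content model is written as an Erlang term (text, {elem, Tag, Model},
{star, Model}, {seq, Models}) are all my own invention for this example, and attributes are not
checked at all.

```erlang
-module(xml_check).
-export([valid/2]).

%% valid(Model, Term) -> boolean()
%% text matches a {value, String} leaf.
valid(text, {value, Text}) ->
    is_list(Text);
%% {elem, Tag, Model} matches a 3-tuple with the same tag whose body fits Model.
valid({elem, Tag, Model}, {Tag, Attrs, Body}) when is_list(Attrs) ->
    valid(Model, Body);
%% {star, Model} matches a list in which every item fits Model.
valid({star, Model}, Items) when is_list(Items) ->
    lists:all(fun(I) -> valid(Model, I) end, Items);
%% {seq, Models} matches a tuple of the same size, item by item.
valid({seq, Models}, Body) when is_tuple(Body), length(Models) =:= tuple_size(Body) ->
    lists:all(fun({M, B}) -> valid(M, B) end,
              lists:zip(Models, tuple_to_list(Body)));
%% Anything else does not conform.
valid(_, _) ->
    false.
```

    The model for <!ELEMENT a (b, c*, d)> would then be
{elem, a, {seq, [{elem, b, text}, {star, {elem, c, text}}, {elem, d, text}]}}, and a term that keeps
the children of a in a plain list is rejected because its body is not a 3-tuple.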
 
    Now all I have to do is re-write my latest and greatest XML parser, and write a RELAX NG compact syntax
parser :-)

  /Joe