XML and Erlang

Wed Jun 22 04:56:30 CEST 2005

Tony Rogvall <tony@REDACTED> suggested:
    <!ELEMENT a ((b & c & d)* | (c, d))>
which I pointed out was not legal XML.
He replied:
	Suppose you remove the & and replace it with the expansion?

	Please let me try to explain.

	(a & b) == (a, b)  | (b,a)
	(a & b & c) == (a,b,c) | (a,c,b) | (b,a,c) | (b,c,a) | (c,a,b) | (c,b,a)

This kind of combinatorial explosion (n items have n! orderings) is part
of _why_ it was left out of XML.  Things like (b & c & d)* are more than
slightly weird.
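
To make the growth concrete, here is a minimal sketch of the expansion
Tony describes.  The module name and the {seq, Names} representation of
a "," group are my own assumptions for illustration, and the names in a
group are assumed to be distinct.

```erlang
-module(and_expand).
-export([perms/1, expand/1]).

%% All orderings of a list of distinct element names.
perms([]) -> [[]];
perms(L)  -> [[H | T] || H <- L, T <- perms(L -- [H])].

%% Expand an "&" group into the equivalent list of "," sequences.
%% {seq, Names} is a hypothetical representation of (n1,n2,...).
expand(Names) -> [{seq, P} || P <- perms(Names)].
```

and_expand:perms([a,b,c]) yields the six orderings listed above, and a
group of five names already expands to 120 alternatives.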

However, whenever we have (x1 & ... & xn) we can treat this as
insignificant variation in the input and always map it to a canonical
order.  So

    <!ELEMENT a - - ((b & c & d)* | (c,d))>
	<!--    ^^^ this must be SGML -->

    <!ELEMENT a - - ((c & d & b)* | (c,d))>
    <!ELEMENT a - - ((d & c & b)* | (c,d))>

could all be given the same internal form, as they all recognise the same
sequences.  The internal form could be the same as the internal form for

    <!ELEMENT a ((b, c, d)* | (c,d))>
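
As a sketch of that canonicalisation, assume a content model has
already been parsed into a small term representation of my own
invention: atoms for tags, {amp,L} for "&", {seq,L} for ",", {alt,L}
for "|", and {star,X}, {plus,X}, {opt,X} for the occurrence
indicators.  Rewriting every "&" group as a sorted "," group then gives
all three declarations the same internal form.  This only
canonicalises the model; reordering the children of an actual document
to match is a separate step not shown here.

```erlang
-module(cm_canon).
-export([canon/1]).

%% Replace each "&" group by a "," group in sorted (canonical) order.
canon({amp, Items})  -> {seq, lists:sort([canon(I) || I <- Items])};
canon({seq, Items})  -> {seq, [canon(I) || I <- Items]};
canon({alt, Items})  -> {alt, [canon(I) || I <- Items]};
canon({star, X})     -> {star, canon(X)};
canon({plus, X})     -> {plus, canon(X)};
canon({opt, X})      -> {opt, canon(X)};
canon(Tag) when is_atom(Tag) -> Tag.
```

With this, {amp,[b,c,d]}, {amp,[c,d,b]} and {amp,[d,c,b]} all come out
as {seq,[b,c,d]}.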

Now any mix of | and , can be put into a standard form where
"|" is the top level operator; let's agree to treat x? as (x|EMPTY),
which is not actually legal.  Sequences like x* and x+ can be mapped
to lists.

In this case, we'd get
 /* Prolog */
    a([{}(B1,C1,D1),...,{}(BN,CN,DN)]) | a(C,D)
 /* Erlang */
    {a,[{B1,C1,D1},...,{BN,CN,DN}]} | {a,C,D}

However, let's consider a more interesting example, interesting to me
because it's something I'm using at the moment (documentation for
Smalltalk) and because it was not contrived for the purpose.

<!ELEMENT st O O (class+)>

Clearly maps to
    Pro:	st([Class,...])
    Erl:	[Class,...]

<!ELEMENT class - O (p,(p|example)*,cat*,ccat*)>

Tricky!  The construction "p,(p|example)*" crops up in several places.
The intention is that every description should begin with a paragraph,
not an example.  But processing doesn't actually care, and I would
like that combination consistently treated as (p|example)+.
Let's assume that that's done through some kind of magic.

<!ATTLIST class
    name	ID	#REQUIRED
    parent	IDREF	#IMPLIED
    ansi	NMTOKEN #IMPLIED
    abstract    (abstract|concrete|value) "concrete"
    elements    (none|objects|bytes|chars) "none"
>

The mapping we want here is
    Pro:	class(Name,Parent,Ansi,Abstract,Elements,
		      [P or Example,...], [Cat,...], [Ccat,...])
    Erl:	{class,Name,Parent,Ansi,Abstract,Elements,
		      [P or Example,...], [Cat,...], [Ccat,...]}
    where Parent is an atom or 0 and Ansi is an atom or 0.
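
A minimal sketch of building that tuple, assuming (my assumptions, not
part of any real parser) that the attributes arrive as a property list
of {Name,Value} pairs and the children as already-mapped tuples whose
first element is the tag.  The "magic" for p,(p|example)* is reduced
here to collecting every p and example child into one list.

```erlang
-module(class_map).
-export([class/2]).

%% Attrs: property list such as [{name,'Object'}].
%% Children: already-mapped child terms, tag in the first position.
class(Attrs, Children) ->
    Name     = proplists:get_value(name, Attrs),
    Parent   = proplists:get_value(parent, Attrs, 0),    % #IMPLIED -> 0
    Ansi     = proplists:get_value(ansi, Attrs, 0),      % #IMPLIED -> 0
    Abstract = proplists:get_value(abstract, Attrs, concrete),
    Elements = proplists:get_value(elements, Attrs, none),
    Text  = [C || C <- Children,
                  lists:member(element(1, C), [p, example])],
    Cats  = [C || C <- Children, element(1, C) =:= cat],
    Ccats = [C || C <- Children, element(1, C) =:= ccat],
    {class, Name, Parent, Ansi, Abstract, Elements, Text, Cats, Ccats}.
```

The declared defaults ("concrete", "none") are filled in by the
mapping, so code that consumes the tuple never has to ask whether an
attribute was present.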

<!ELEMENT example - O (#PCDATA)>

No problem here, just some wrapper around a string.

<!ELEMENT p  - O (#PCDATA|c|m|v|em|x|protocol)*>

No problem here.
    Pro:	p([String or C or M or V or Em or X or Protocol, ...])
    Erl:	{p,[String or C or M or V or Em or X or Protocol, ...]}

<!ELEMENT em - O (#PCDATA|c|m|v|em)*>

Like p.

<!ELEMENT protocol - - (#PCDATA)>

Like example.

<!ELEMENT c - O EMPTY>
<!ATTLIST c n IDREF #REQUIRED>

No problem here either:
    Pro:	c(N)
    Erl:	{c,N}

<!ELEMENT m - - (#PCDATA)>
<!ELEMENT v - - (#PCDATA)>
<!ELEMENT x - - (#PCDATA)>

All like <example>.

<!ELEMENT cat - O (method+)>
<!ATTLIST cat for NMTOKEN "unknown">

No problem here:
    Pro:	cat(For, [Method,...])
    Erl:	{cat,For,[Method,...]}

An additional optimisation can be applied for Erlang:
in a context where the tag is predictable (as it is here for cat,
and likewise for argument), the tag may be omitted, so

    Erl':	{For,[Method,...]}

<!ELEMENT method - O (header,argument*,result?,p,(p|example)*)>
<!ATTLIST method ansi NMTOKEN #IMPLIED>

Here we have two problems.
One is the p,(p|example)* problem mentioned above.
I'll assume the same magical answer.
The other is how to treat x?.
One way is to treat x? as (x|EMPTY) and expand out,
which would result in
    Pro:	method(Header,[Argument,...],Result,[P or Example,...])
		method(Header,[Argument,...],[P or Example,...])
and the other is to treat it as (x|MISSING), where MISSING maps to
something (such as 0) which cannot otherwise occur.  The latter seems
preferable.  This also handles #IMPLIED attributes.

    Pro:	method(Header,Ansi or 0,[Argument,...],
			Result or 0,[P or Example,...])
    Erl:	{method,Header,Ansi or 0,[Argument,...],
			Result or 0,[P or Example,...]}
    Erl':	{Header,Ansi or 0,[Argument,...],
			Result or 0,[P or Example,...]}
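
Here is a sketch of the (x|MISSING) treatment for method, under the
same assumptions as before: attributes as a property list, children as
already-mapped tuples with the tag first, in document order.  The
function name and argument layout are hypothetical.

```erlang
-module(method_map).
-export([method/2]).

%% Children arrive as header, argument*, result?, then p/example items.
method(Attrs, [{header, _} = Header | Rest]) ->
    Ansi = proplists:get_value(ansi, Attrs, 0),        % #IMPLIED -> 0
    {Args, Rest1} =
        lists:splitwith(fun(C) -> element(1, C) =:= argument end, Rest),
    {Result, Text} =
        case Rest1 of
            [R | T] when element(1, R) =:= result -> {R, T};
            _ -> {0, Rest1}                            % result? missing -> 0
        end,
    {method, Header, Ansi, Args, Result, Text}.
```

Every method tuple has the same arity whether or not the result element
or the ansi attribute was present, which is the point of preferring
MISSING over expanding x? into two alternatives.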

<!ELEMENT header O O (#PCDATA)>

Like <example>.

<!ELEMENT argument - O EMPTY>
<!ATTLIST argument name	NMTOKEN #REQUIRED type CDATA #IMPLIED
    captured (y|n|u) "u">

No problem here:
    Pro:	argument(Name,Type or 0,Captured)
    Erl:	{argument,Name,Type or 0,Captured}
    Erl':	{Name,Type or 0,Captured}

<!ELEMENT result - O EMPTY>
<!ATTLIST result type CDATA #IMPLIED source (s|n|u) "u">

No problem here:
    Pro:	result(Type or 0,Source)
    Erl:	{result,Type or 0,Source}
    Erl':	{Type or 0,Source}

<!ELEMENT ccat - O (method+)>
<!ATTLIST ccat for NMTOKEN "unknown">

No problem here either:
    Pro:	ccat(For,[Method,...])
    Erl:	{ccat,For,[Method,...]}
    Erl':	{For,[Method,...]}

There's a guideline which I've never explicitly formulated,
but which is often followed, and which explains why the mapping
from XML to Prolog or Erlang is so easy in this case:

    Never nest ','.

That is,
    model --> (#PCDATA|tag1|...|tagn)*
           |  (item1,...,itemn)
    item --> tag_or_choice [?|+|*]
    tag_or_choice --> tag | (tag1|...|tagn)
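
That grammar is easy to check mechanically.  The sketch below assumes a
hypothetical term representation for parsed content models: atoms for
tags, pcdata for #PCDATA, {seq,L} for ",", {alt,L} for "|", and
{opt,X}, {plus,X}, {star,X} for "?", "+", "*".  EMPTY and other
declared-content keywords are outside the grammar as written and are
simply rejected here.

```erlang
-module(cm_simple).
-export([simple/1]).

%% model --> (#PCDATA|tag1|...|tagn)*  |  (item1,...,itemn)
simple({star, {alt, [pcdata | Tags]}}) ->
    lists:all(fun erlang:is_atom/1, Tags);
simple({seq, Items}) ->
    lists:all(fun item/1, Items);
simple(_) ->
    false.

%% item --> tag_or_choice [?|+|*]
item({Occ, X}) when Occ =:= opt; Occ =:= plus; Occ =:= star ->
    tag_or_choice(X);
item(X) ->
    tag_or_choice(X).

%% tag_or_choice --> tag | (tag1|...|tagn)
tag_or_choice({alt, Tags}) -> lists:all(fun erlang:is_atom/1, Tags);
tag_or_choice(X)           -> is_atom(X).
```

The class model, written as
{seq,[p,{star,{alt,[p,example]}},{star,cat},{star,ccat}]}, passes;
a repeated sequence such as {plus,{seq,[dt,dd]}} does not.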

The only common exception I can call to mind is
    <!ELEMENT DL - - (DT,DD)+>
except that what the HTML specifications *really* say is
    <!ELEMENT DL - - (DT|DD)+>

In short, I don't really believe that a direct XML->Erlang mapping
*is* as straightforward as it might seem, but for the kinds of DTDs
that people actually write, characterised by that simplified content
model grammar above, it *is* fairly straightforward.

One difference between SGML and XML is that XML always allows "any other
attribute", but if an application *cared* about other attributes, they'd
have been mentioned in the DTD or Schema, so it's OK to strip other
attributes off in the mapping to Erlang (or Prolog).
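
A sketch of that stripping, assuming the declared attributes of an
element are available as a list of {Name,Default} pairs in declaration
order (with 0 as the default for #IMPLIED) and the attributes actually
present as a property list.  Both layouts are assumptions made for the
example.

```erlang
-module(attr_strip).
-export([attr_values/2]).

%% Return one value per declared attribute, in declaration order.
%% Attributes not mentioned in Declared are silently dropped.
attr_values(Declared, Attrs) ->
    [proplists:get_value(Name, Attrs, Default)
     || {Name, Default} <- Declared].
```

For argument, attr_values([{name,0},{type,0},{captured,u}],
[{name,index},{colour,red}]) gives [index,0,u]: the undeclared colour
attribute vanishes and the missing ones take their defaults.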

Oh, that optimisation I mentioned above applies to a tag whenever it
is never part of the "choice" alternative of a tag_or_choice.
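
That condition can be computed from the DTD.  The sketch below assumes
the DTD has been reduced to a list of {Tag,Model} pairs, where a model
is {mixed,Tags} for (#PCDATA|t1|...|tn)*, {seq,Items} for a sequence
whose items are {TagOrChoice,Occurrence} pairs, or the atom empty.
This representation is hypothetical, chosen to mirror the simplified
grammar.

```erlang
-module(omit_tag).
-export([omittable/1]).

%% Tags that never occur inside a choice, so their tuple tag is
%% predictable from context and may be dropped.
omittable(Dtd) ->
    InChoice = lists:usort(lists:append([choice_tags(M) || {_, M} <- Dtd])),
    [T || {T, _} <- Dtd, not lists:member(T, InChoice)].

%% Tags named in any choice within one content model.
choice_tags({mixed, Tags}) -> Tags;
choice_tags({seq, Items})  -> lists:append([Ts || {{alt, Ts}, _} <- Items]);
choice_tags(empty)         -> [].
```

Applied to a fragment of the DTD above, cat and argument come out as
omittable, while p and example do not, because they share the
(p|example) choice.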