# XML and Erlang

Richard A. O'Keefe <>
Wed Jun 22 04:56:30 CEST 2005

```
Tony Rogvall <> suggested:
<!ELEMENT a ((b & c & d)* | (c, d))>
which I pointed out was not legal XML.
He replied:
Suppose you remove the & and replace it with the expansion?

Please let me try to explain.

(a & b) == (a,b) | (b,a)
(a & b & c) == (a,b,c) | (a,c,b) | (b,a,c) | (b,c,a) | (c,a,b) | (c,b,a)

This kind of exponential explosion is part of _why_ it was left out
of XML.  Things like (b & c & d)* are more than slightly weird.

However, whenever we have (x1 & ... & xn) we can treat this as
insignificant variation in the input and always map it to a canonical
order.  So

<!ELEMENT a - - ((b & c & d)* | (c,d))>
<!--    ^^^ this must be SGML -->

<!ELEMENT a - - ((c & d & b)* | (c,d))>
<!ELEMENT a - - ((d & c & b)* | (c,d))>

could all be given the same internal form, as they all recognise the same
sequences.  The internal form could be the same as the internal form for

<!ELEMENT a ((b, c, d)* | (c,d))>

Now any mix of | and , can be put into a standard form where
"|" is the top level operator; let's agree to treat x? as (x|EMPTY),
which is not actually legal.  Sequences like x* and x+ can be mapped
to lists.

In this case, we'd get
/* Prolog */
a([{}(B1,C1,D1),...,{}(BN,CN,DN)]) | a(C,D)
/* Erlang */
{a,[{B1,C1,D1},...,{BN,CN,DN}]} | {a,C,D}

However, let's consider a more interesting example, interesting to me
because it's something I'm using at the moment (documentation for
Smalltalk) and because it was not contrived for the purpose.

<!ELEMENT st O O (class+)>

Clearly maps to
Pro:	st([Class,...])
Erl:	[Class,...]

<!ELEMENT class - O (p,(p|example)*,cat*,ccat*)>

Tricky!  The construction "p,(p|example)*" crops up in several places.
The intention is that every description should begin with a paragraph,
not an example.  But processing doesn't actually care, and I would
like that combination consistently treated as (p|example)+.
Let's assume that that's done through some kind of magic.

<!ATTLIST class
name	ID	#REQUIRED
parent	IDREF	#IMPLIED
ansi	NMTOKEN #IMPLIED
abstract    (abstract|concrete|value) "concrete"
elements    (none|objects|bytes|chars) "none"
>

The mapping we want here is
Pro:	class(Name,Parent,Ansi,Abstract,Elements,
		[P or Example,...], [Cat,...], [Ccat,...])
Erl:	{class,Name,Parent,Ansi,Abstract,Elements,
		[P or Example,...], [Cat,...], [Ccat,...]}
where Parent is an atom or 0 and Ansi is an atom or 0.

<!ELEMENT example - O (#PCDATA)>

No problem here, just some wrapper around a string.

<!ELEMENT p  - O (#PCDATA|c|m|v|em|x|protocol)*>

No problem here.
Pro:	p([String or C or M or V or Em or X or Protocol, ...])
Erl:	{p,[String or C or M or V or Em or X or Protocol, ...]}

<!ELEMENT em - O (#PCDATA|c|m|v|em)*>

Like p.

<!ELEMENT protocol - - (#PCDATA)>

Like example.

<!ELEMENT c - O EMPTY>
<!ATTLIST c n IDREF #REQUIRED>

No problem here either:
Pro:	c(N)
Erl:	{c,N}

<!ELEMENT m - - (#PCDATA)>
<!ELEMENT v - - (#PCDATA)>
<!ELEMENT x - - (#PCDATA)>

All like <example>.

<!ELEMENT cat - O (method+)>
<!ATTLIST cat for NMTOKEN "unknown">

No problem here:
Pro:	cat(For, [Method,...])
Erl:	{cat,For,[Method,...]}

An additional optimisation can be applied for Erlang:
in a context where the tag is predictable (as it is for argument),
the tag may be omitted, so

Erl':	{For,[Method,...]}

<!ATTLIST method ansi NMTOKEN #IMPLIED>

Here we have two problems.
One is the p,(p|example)* problem mentioned above.
I'll assume the same magical answer.
The other is how to treat x?.
One way is to treat x? as (x|EMPTY) and expand out,
which would result in separate alternatives with and without x,
and the other is to treat it as (x|MISSING), where MISSING maps to
something (such as 0) which cannot otherwise occur.  The latter seems
preferable.  This also handles #IMPLIED attributes.

Pro:	...Result or 0,[P or Example,...])
Erl:	...Result or 0,[P or Example,...]}
Erl':	...Result or 0,[P or Example,...]}

Like <example>

<!ELEMENT argument - O EMPTY>
<!ATTLIST argument name	NMTOKEN #REQUIRED type CDATA #IMPLIED
captured (y|n|u) "u">

No problem here:
Pro:	argument(Name,Type or 0,Captured)
Erl:	{argument,Name,Type or 0,Captured}
Erl':	{Name,Type or 0,Captured}

<!ELEMENT result - O EMPTY>
<!ATTLIST result type CDATA #IMPLIED source (s|n|u) "u">

No problem here:
Pro:	result(Type or 0,Source)
Erl:	{result,Type or 0,Source}
Erl':	{Type or 0,Source}

<!ELEMENT ccat - O (method+)>
<!ATTLIST ccat for NMTOKEN "unknown">

No problem here either:
Pro:	ccat(For,[Method,...])
Erl:	{ccat,For,[Method,...]}
Erl':	{For,[Method,...]}

There's a guide-line which I've never explicitly formulated,
but which is often followed, and which explains why the mapping
from XML to Prolog or Erlang is so easy in this case:

Never nest ','.

That is,
    model         --> (#PCDATA|tag1|...|tagn)*
                   |  (item1,...,itemn)
    item          --> tag_or_choice [?|+|*]
    tag_or_choice --> tag | (tag1|...|tagn)

The only common exception I can call to mind is
<!ELEMENT DL - - (DT,DD)+>
except that what the HTML specifications *really* say is
<!ELEMENT DL - - (DT|DD)+>

In short, I don't really believe that a direct XML->Erlang mapping
*is* as straightforward as it might seem, but for the kinds of DTDs
that people actually write, characterised by that simplified content
model grammar above, it *is* fairly straightforward.

One difference between SGML and XML is that XML always allows "any other
attribute", but if an application *cared* about other attributes, they'd
have been mentioned in the DTD or Schema, so it's OK to strip other
attributes off in the mapping to Erlang (or Prolog).

Oh, that optimisation I mentioned above applies to a tag whenever it
is never part of the "choice" alternative of a tag_or_choice.

```
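
Three of the ideas in the post (expanding an `&` group versus mapping it to a canonical order, representing a missing #IMPLIED attribute as `0`, and dropping a predictable tag) can be illustrated with a small Erlang sketch. This is not code from the post. The module name `xml_map_sketch`, the function names, and the `{Tag, Attrs, Children}` input shape (attributes as a list of `{Name, Value}` pairs with atom names) are assumptions made for the example, and no particular XML or SGML parser is implied.

```erlang
-module(xml_map_sketch).
-export([perms/1, canonical/1, map_argument/1, drop_tag/1]).

%% Expand an &-group into every ordering: (a & b) == (a,b) | (b,a).
%% Assumes the names are distinct; n names give n! orderings,
%% which is the explosion the post warns about.
perms([]) -> [[]];
perms(Names) ->
    [[N | Rest] || N <- Names, Rest <- perms(Names -- [N])].

%% Instead of expanding, pick one canonical ordering (here: sorted),
%% so (b & c & d), (c & d & b) and (d & c & b) share one internal form.
canonical(Names) -> lists:sort(Names).

%% Map a parsed <argument> element to {argument,Name,Type or 0,Captured}.
%% The {argument, Attrs, []} input shape is hypothetical.
map_argument({argument, Attrs, []}) ->
    %% name is #REQUIRED: the match fails if it is absent.
    {name, Name} = lists:keyfind(name, 1, Attrs),
    %% type is #IMPLIED: a missing value maps to 0 (the MISSING marker).
    Type = proplists:get_value(type, Attrs, 0),
    %% captured has the declared default "u".
    Captured = proplists:get_value(captured, Attrs, u),
    {argument, Name, Type, Captured}.

%% The Erl' form: drop the tag where the context makes it predictable.
drop_tag(Tuple) when is_tuple(Tuple), tuple_size(Tuple) > 1 ->
    list_to_tuple(tl(tuple_to_list(Tuple))).
```

With these definitions, `xml_map_sketch:perms([a,b,c])` produces the six orderings of `(a & b & c)`, and `xml_map_sketch:canonical/1` returns `[b,c,d]` for each of the three `&` declarations in the post. Using `0` for a missing value follows the post; `undefined` would be the more usual Erlang choice, at the cost of no longer being distinct from a possible atom value.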