XML and Erlang
Richard A. O'Keefe
ok@REDACTED
Wed Jun 22 04:56:30 CEST 2005
Tony Rogvall <tony@REDACTED> suggested:
<!ELEMENT a ((b & c & d)* | (c, d))>
which I pointed out was not legal XML.
He replied:
Suppose you remove the & and replace it with the expansion?
Please let me try to explain.
(a & b) == (a, b) | (b,a)
(a & b & c) == (a,b,c) | (a,c,b) | (b,a,c) | (b,c,a) | (c,a,b) | (c,b,a)
This kind of exponential explosion is part of _why_ it was left out
of XML. Things like (b & c & d)* are more than slightly weird.
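To make the growth concrete, here is a minimal Erlang sketch that expands an & group into its orderings. The module and function names are made up for illustration, and the items are assumed to be distinct.

```erlang
-module(and_expand).
-export([perms/1]).

%% perms(Items) -> every ordering of Items (items assumed distinct).
%% An & group of N items expands into N! alternative sequences.
perms([]) -> [[]];
perms(Items) ->
    [[X | Rest] || X <- Items, Rest <- perms(Items -- [X])].
```

Three items give the six orderings listed above; four items already give 24 and five give 120.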
However, whenever we have (x1 & ... & xn) we can treat this as
insignificant variation in the input and always map it to a canonical
order. So
<!ELEMENT a - - ((b & c & d)* | (c,d))>
<!--        ^^^ this must be SGML -->
<!ELEMENT a - - ((c & d & b)* | (c,d))>
<!ELEMENT a - - ((d & c & b)* | (c,d))>
could all be given the same internal form, as they all recognise the same
sequences. The internal form could be the same as the internal form for
<!ELEMENT a ((b, c, d)* | (c,d))>
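A rough sketch of that canonicalisation on the content model side, assuming the model has already been parsed into terms. The term shapes ({all,L}, {seq,L}, {alt,L}, {star,X}) and the module name are my own invention here, not anything a real parser is known to produce.

```erlang
-module(cm_canon).
-export([canon/1]).

%% Hypothetical content-model terms:
%%   {all, Items}  for (x1 & ... & xn)
%%   {seq, Items}  for (x1, ..., xn)
%%   {alt, Items}  for (x1 | ... | xn)
%%   {star, X}     for x*
%%   Tag           an atom naming an element
%% An & group becomes a sequence in sorted (canonical) order.
canon({all, Items}) -> {seq, lists:sort([canon(I) || I <- Items])};
canon({seq, Items}) -> {seq, [canon(I) || I <- Items]};
canon({alt, Items}) -> {alt, [canon(I) || I <- Items]};
canon({star, X})    -> {star, canon(X)};
canon(Tag) when is_atom(Tag) -> Tag.
```

All three SGML declarations above come out as the same term. The children of each matched group in the input would have to be reordered the same way before the term is built.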
Now any mix of | and , can be put into a standard form where
"|" is the top level operator; let's agree to treat x? as (x|EMPTY),
which is not actually legal. Sequences like x* and x+ can be mapped
to lists.
In this case, we'd get
/* Prolog */
a([{}(B1,C1,D1),...,{}(BN,CN,DN)]) | a(C,D)
/* Erlang */
{a,[{B1,C1,D1},...,{BN,CN,DN}]} | {a,C,D}
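Because "|" ends up at the top level, the two alternatives have different shapes, so ordinary clause matching tells them apart. A small sketch, with a made-up module name and made-up return values:

```erlang
-module(a_shape).
-export([kind/1]).

%% {a, [{B,C,D}, ...]} comes from the (b,c,d)* alternative,
%% {a, C, D}           comes from the (c,d) alternative.
%% The return values are purely illustrative.
kind({a, Triples}) when is_list(Triples) -> {repeated, length(Triples)};
kind({a, _C, _D}) -> pair.
```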
However, let's consider a more interesting example, interesting to me
because it's something I'm using at the moment (documentation for
Smalltalk) and because it was not contrived for the purpose.
<!ELEMENT st O O (class+)>
Clearly maps to
Pro: st([Class,...])
Erl: [Class,...]
<!ELEMENT class - O (p,(p|example)*,cat*,ccat*)>
Tricky! The construction "p,(p|example)*" crops up in several places.
The intention is that every description should begin with a paragraph,
not an example. But processing doesn't actually care, and I would
like that combination consistently treated as (p|example)+.
Let's assume that that's done through some kind of magic.
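One way the magic might be done is as a rewrite on the parsed content model: wherever a tag x is immediately followed by (x|...)*, replace the pair by (x|...)+. This is only a sketch; the term shapes ({seq,L}, {alt,L}, {star,X}, {plus,X}) and all names are hypothetical.

```erlang
-module(cm_rewrite).
-export([merge_head/1]).

%% merge_head rewrites  x,(x|y|...)*  inside a sequence into  (x|y|...)+ .
%% Hypothetical term shapes: {seq,L}, {alt,L}, {star,X}, {plus,X}.
merge_head({seq, Items}) -> {seq, merge(Items)};
merge_head(Other) -> Other.

merge([X, {star, {alt, Alts}} | Rest]) when is_atom(X) ->
    case lists:member(X, Alts) of
        true  -> [{plus, {alt, Alts}} | merge(Rest)];
        false -> [X | merge([{star, {alt, Alts}} | Rest])]
    end;
merge([X | Rest]) -> [X | merge(Rest)];
merge([]) -> [].
```

Applied to the class model, p,(p|example)*,cat*,ccat* becomes (p|example)+,cat*,ccat*.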
<!ATTLIST class
name ID #REQUIRED
parent IDREF #IMPLIED
ansi NMTOKEN #IMPLIED
abstract (abstract|concrete|value) "concrete"
elements (none|objects|bytes|chars) "none"
>
The mapping we want here is
Pro: class(Name,Parent,Ansi,Abstract,Elements,
           [P or Example,...], [Cat,...], [Ccat,...])
Erl: {class,Name,Parent,Ansi,Abstract,Elements,
      [P or Example,...], [Cat,...], [Ccat,...]}
where Parent is an atom or 0 and Ansi is an atom or 0.
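As a sketch of how such a term might be built, suppose the parser hands over the attributes as a list of {Name,Value} pairs (that shape is an assumption, as are the module and function names). The DTD defaults and the 0 for a missing #IMPLIED attribute then fall out of proplists:get_value/3:

```erlang
-module(class_term).
-export([make/4]).

%% Attrs is assumed to be a list of {AttrName, Value} pairs.
%% Body, Cats and Ccats are the already-converted child lists.
make(Attrs, Body, Cats, Ccats) ->
    {class,
     proplists:get_value(name, Attrs),                % #REQUIRED
     proplists:get_value(parent, Attrs, 0),           % #IMPLIED, 0 if absent
     proplists:get_value(ansi, Attrs, 0),             % #IMPLIED, 0 if absent
     proplists:get_value(abstract, Attrs, concrete),  % default from the DTD
     proplists:get_value(elements, Attrs, none),      % default from the DTD
     Body, Cats, Ccats}.
```

A validating parser would normally have filled in the defaults already; they are repeated here only to keep the sketch self-contained.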
<!ELEMENT example - O (#PCDATA)>
No problem here, just some wrapper around a string.
<!ELEMENT p - O (#PCDATA|c|m|v|em|x|protocol)*>
No problem here.
Pro: p([String or C or M or V or Em or X or Protocol, ...])
Erl: {p,[String or C or M or V or Em or X or Protocol, ...]}
<!ELEMENT em - O (#PCDATA|c|m|v|em)*>
Like p.
<!ELEMENT protocol - - (#PCDATA)>
Like example.
<!ELEMENT c - O EMPTY>
<!ATTLIST c n IDREF #REQUIRED>
No problem here either:
Pro: c(N)
Erl: {c,N}
<!ELEMENT m - - (#PCDATA)>
<!ELEMENT v - - (#PCDATA)>
<!ELEMENT x - - (#PCDATA)>
All like <example>.
<!ELEMENT cat - O (method+)>
<!ATTLIST cat for NMTOKEN "unknown">
No problem here:
Pro: cat(For, [Method,...])
Erl: {cat,For,[Method,...]}
An additional optimisation can be applied for Erlang:
in a context where the tag is predictable (as it is for argument),
the tag may be omitted, so
Erl': {For,[Method,...]}
<!ELEMENT method - O (header,argument*,result?,p,(p|example)*)>
<!ATTLIST method ansi NMTOKEN #IMPLIED>
Here we have two problems.
One is the p,(p|example)* problem mentioned above.
I'll assume the same magical answer.
The other is how to treat x?.
One way is to treat x? as (x|EMPTY) and expand out,
which would result in
Pro: method(Header,[Argument,...],Result,[P or Example,...])
     method(Header,[Argument,...],[P or Example,...])
and the other is to treat it as (x|MISSING), where MISSING maps to
something (such as 0) which cannot otherwise occur. The latter seems
preferable. This also handles #IMPLIED attributes.
Pro: method(Header,Ansi or 0,[Argument,...],
            Result or 0,[P or Example,...])
Erl: {method,Header,Ansi or 0,[Argument,...],
      Result or 0,[P or Example,...]}
Erl': {Header,Ansi or 0,[Argument,...],
       Result or 0,[P or Example,...]}
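With MISSING represented by 0, a consumer can handle "no result element" and "result element without a type" by plain pattern matching on the tagged Erl: form. A minimal sketch; the function name and the atoms it returns are made up for illustration:

```erlang
-module(method_term).
-export([result_type/1]).

%% Works on the tagged form {method,Header,Ansi,Args,Result,Body},
%% where Result is 0 or {result,Type,Source} and Type may itself be 0.
result_type({method, _Header, _Ansi, _Args, 0, _Body}) ->
    unknown;                                   % no <result> element at all
result_type({method, _Header, _Ansi, _Args, {result, 0, _Source}, _Body}) ->
    untyped;                                   % <result> without a type attribute
result_type({method, _Header, _Ansi, _Args, {result, Type, _Source}, _Body}) ->
    Type.
```

Since 0 can never be a legitimate element or attribute value in this mapping, the three cases cannot be confused.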
<!ELEMENT header O O (#PCDATA)>
Like <example>
<!ELEMENT argument - O EMPTY>
<!ATTLIST argument name NMTOKEN #REQUIRED type CDATA #IMPLIED
captured (y|n|u) "u">
No problem here:
Pro: argument(Name,Type or 0,Captured)
Erl: {argument,Name,Type or 0,Captured}
Erl': {Name,Type or 0,Captured}
<!ELEMENT result - O EMPTY>
<!ATTLIST result type CDATA #IMPLIED source (s|n|u) "u">
No problem here:
Pro: result(Type or 0,Source)
Erl: {result,Type or 0,Source}
Erl': {Type or 0,Source}
<!ELEMENT ccat - O (method+)>
<!ATTLIST ccat for NMTOKEN "unknown">
No problem here either:
Pro: ccat(For,[Method,...])
Erl: {ccat,For,[Method,...]}
Erl': {For,[Method,...]}
There's a guide-line which I've never explicitly formulated,
but which is often followed, and which explains why the mapping
from XML to Prolog or Erlang is so easy in this case:
Never nest ','.
That is,
model         --> (#PCDATA|tag1|...|tagn)*
               |  (item1,...,itemn)
item          --> tag_or_choice [?|+|*]
tag_or_choice --> tag | (tag1|...|tagn)
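The guide-line is easy to check mechanically. Here is a sketch over parsed content models, using term shapes of my own choosing ({seq,L}, {alt,L}, {star,X}, {plus,X}, {opt,X}, atoms for tags, and the atom pcdata for #PCDATA); none of this comes from an existing library.

```erlang
-module(flat_model).
-export([ok/1]).

%% ok(Model) -> true if Model fits the simplified grammar, else false.
ok({star, {alt, Tags}}) -> all_atoms(Tags);            % (#PCDATA|tag1|...)*
ok({seq, Items})        -> lists:all(fun item/1, Items);
ok(_)                   -> false.

%% item --> tag_or_choice [?|+|*]
item({Rep, X}) when Rep =:= opt; Rep =:= plus; Rep =:= star ->
    tag_or_choice(X);
item(X) ->
    tag_or_choice(X).

%% tag_or_choice --> tag | (tag1|...|tagn)
tag_or_choice({alt, Tags}) -> all_atoms(Tags);
tag_or_choice(T)           -> is_atom(T).

all_atoms(L) -> lists:all(fun erlang:is_atom/1, L).
```

The method model passes this check, while a model that puts a repetition around a sequence, such as (DT,DD)+, does not.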
The only common exception I can call to mind is
<!ELEMENT DL - - (DT,DD)+>
except that what the HTML specifications *really* say is
<!ELEMENT DL - - (DT|DD)*>
In short, I don't really believe that a direct XML->Erlang mapping
*is* as straightforward as it might seem, but for the kinds of DTDs
that people actually write, characterised by that simplified content
model grammar above, it *is* fairly straightforward.
One difference between SGML and XML is that XML always allows "any other
attribute", but if an application *cared* about other attributes, they'd
have been mentioned in the DTD or Schema, so it's OK to strip other
attributes off in the mapping to Erlang (or Prolog).
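Stripping is a one-liner if the attributes arrive as {Name,Value} pairs and the declared names are known from the DTD. Both of those are assumptions of this sketch, as are the names used:

```erlang
-module(attr_filter).
-export([declared_only/2]).

%% Keep only the attributes whose names were declared in the DTD.
%% Declared is a list of attribute names; Attrs is a list of {Name, Value}.
declared_only(Declared, Attrs) ->
    [{K, V} || {K, V} <- Attrs, lists:member(K, Declared)].
```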
Oh, that optimisation I mentioned above applies to a tag whenever it
is never part of the "choice" alternative of a tag_or_choice.
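Which tags qualify can be computed from the DTD itself. A sketch, again over hypothetical parsed content models ({seq,L}, {alt,L}, {star,X}, {plus,X}, {opt,X}, atoms for tags, with pcdata and empty as plain atoms):

```erlang
-module(tag_omit).
-export([omittable/1]).

%% Dtd is a list of {ElementName, ContentModel}.
%% Returns the element names that never occur inside a choice,
%% in the order they appear in Dtd.
omittable(Dtd) ->
    InChoice = lists:usort(lists:append([choice_tags(M) || {_, M} <- Dtd])),
    [Name || {Name, _} <- Dtd, not lists:member(Name, InChoice)].

%% Collect every tag that appears as a direct member of some (x|y|...).
choice_tags({alt, L}) ->
    [T || T <- L, is_atom(T)] ++ lists:append([choice_tags(X) || X <- L]);
choice_tags({seq, L}) ->
    lists:append([choice_tags(X) || X <- L]);
choice_tags({Rep, X}) when Rep =:= star; Rep =:= plus; Rep =:= opt ->
    choice_tags(X);
choice_tags(_) ->
    [].
```

For the Smalltalk DTD this keeps the tags on p and example, which occur in (p|example), and allows them to be dropped for cat, method, header, argument and result.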