Build an erlang computer (was:Computers are fast)

Richard A. O'Keefe ok@REDACTED
Fri Jan 27 05:31:19 CET 2006


"Joe Armstrong \(AL/EAB\)" <joe.armstrong@REDACTED> wrote:
	[Using XML for voluminous data] is pure lunacy

There's a student records protocol designed by someone in a goverment
department here; I have a copy of the spec.  They use XML.  They use
XML badly.  (With just a little redesign, I was able to reduce the
size of the average message by about a factor of two, not that anyone
was interested, of course.  People just don't use attributes enough.)
And they specify badly, no schema information of any kind (not just no
Schema and no DTD, but not even an *informal* schema).

	Idea - grade moderately difficult - XML should compress very
	nicely - since the same tags get repeated over and over again,
	thus in LZSS compression duplicated tags will appear as
	pointers.

It does.  See XMill, by
  Dan Suciu, AT&T Labs Research, Florham Park, suciu@REDACTED
  Hartmut Liefke, Univ. of Pennsylvania, Philadelphia, liefke@REDACTED

Shakespeare's "Timon of Athens"
 177 kB in XML   (Bosak's XML version)
 150 kB in SGML  (Bosak's original SGML version)
 121 kB in SGML  (same structure, with </SPEECH>, </STAGEDIR>, and
                  <SPEAKER> implied, <LINE> shortened to <L>, and
                  <SPEECH> shortened to <S>)
 110 kB as TEXT  (plain text stripped out of XML version)
  47 kB in XMill (compression of 3.74 relative to XML)

  49 kB XML   + gzip -9  (Bosak's)
  48 kB SGML  + gzip -9  (Bosak's)
  48 kB XMill + gzip -9  (yep, gzip made this one bigger)
  46 kB SGML  + gzip -9  (my briefer SGML)
  45 kB TEXT  + gzip -9  (plain text)

  37 kB XML   + bzip2 -9 (Bosak's,         a factor of 4.8)
  37 kB SGML  + bzip2 -9 (Bosak's,         a factor of 4.1)
  36 kB SGML  + bzip2 -9 (my briefer SGML, a factor of 3.3)
  36 kB TEXT  + bzip2 -9 (plain text,      a factor of 3.0)

  40 kB Xbmill (Bosak's XML + xbmill,      a factor of 4.4)

This is an example which is pretty heavy on text:
 Bosak's XML  markup adds 61%;
 Bosak's SGML markup adds 37%;
 briefer SGML markup adds 10%
(where "briefer" does NOT mean "expressing less structure"; the XML, SML,
and briefer SGML versions all have the *same* logical structure).
Even so, we see that it does compress pretty well.  In fact, using
bzip2, the compressed XML is very little larger than the compressed
plain text!

XMill can be told to use multiple "containers", which elements (via a path)
to put in which containers, and what compression scheme to use for each
container, and you can apparently plug your own compressors in, although
I've never done that.

	How about writing an XML parser that works directly from an LSZZ
	compressed XML stream
	*without* doing the decompression first.

This is effectively what xdemill (xbdemill) does.  The compression phase
stores a (compressed) representation of the tree structure, separate from
the compressed representation of the text.

All of Bosak's XML Shakespeare in one file with <PLAYS>...</PLAYS>
wrapped around them:

    parse using nsgmls with ESIS output to /dev/null		3.3 sec
    parse using SWI's sgml, output ditto			3.6 sec
    parse using my XML parser 'qh', output ditto		0.6 sec
    decompress plays.xmi using xdemill, xml out to /dev/null	0.6 sec

qh doesn't recognise DTDs; with that deficiency it's the fastest XML
parser I've ever come across.  So xdemill is doing pretty well.

	If you're clever and cache the result of the parses of the
	things that the LZSS pointer point to you might be able to write
	a very fast and compact parser.
	
The really clever thing to do is to do the parsing as part of compression
so that the decompressor has to do very little work to build the structure.
In a general context, this is particularly nice, because you can recover
the structure, then decompress the text parts *lazily*, so the parts you
are interested in never do get decompressed.

Someone else originally asked:
	> How about parsing a 300+ megabyte XML file?  I have a lean 
	> and mean XML parser in Erlang, one that only deals with a 
	> strict subset of XML, and operates entirely on binaries.  On 
	> 8 or 10 megabyte files, it's great, but on this monster--heh. 
	>  After about 30 minutes the emulator dies with "Abnormal 
	> Termination."  I suspect it's running out of memory.

On the same 500MHz 500MB machine where I did the timing tests above,
qh parses a 300MB XML file in 21 seconds, about 1/3 of which is I/O.
For things like this an Erlang parser is *never* going to match C, so
there really is a very strong argument for parsing the XML in another
process and just sending the interesting bits or summaries to Erlang.




More information about the erlang-questions mailing list