XML and Erlang

Richard A. O'Keefe ok@REDACTED
Thu Jun 23 03:35:13 CEST 2005


I've just written some code to measure how big a parsed document
would be using several different representations.  This is the same
DTD I used as an example yesterday, except that a couple of attributes
are really (large) enumerations and I didn't show you those.

There's a factor of about 5.4 between best and worst overall.
But the really important factor is NOT whether you use a general-purpose
XML representation or one tailored to your particular problem,
it's HOW YOU STORE STRINGS.

Now when we are processing XML, it is quite true that much of the
time we actually ignore the strings and just transform one structure
to another structure, so it's the amount of memory we TOUCH that matters,
not the amount of memory we HOLD.  But we do have to allocate and fill
in all that memory, and it does have to be reclaimed.

It looks as though the biggest space win for Erlang might be representing
parsed character data and attribute values other than enumeration values
as binaries rather than lists.

The original document was 31707 bytes, excluding the DTD.
Size		is reported in 32-bit words.
Language	is Erlang (cost model: [_|_] = 3 words, {X1,...,Xn} =
		n+2 words), Prolog (WAM cost model), or Smalltalk (a
		non-interactive Smalltalk dialect with cost model
		unindexed object = 1 + #slots words,
		indexed object = 2 + #slots words + element space).
		C is just C.
Elem.rep	is generic, meaning that it's like the current Erlang
		XML representation in working for _any_ XML with or without
		a DTD or schema, or specific, meaning that it is tailored
		to this particular DTD.  Erlang,specific is basically the
		tightly packed "Erl'" version I outlined yesterday.
String rep	is string=atom for Erlang and Prolog, string=list (of
		integers) for Erlang and Prolog, char=byte (1 byte per
		Latin-1 character) or char=word (4 bytes per 21-bit
		Unicode character) for Smalltalk, or "my DVM2 library"
		which uses UTF8 + unique storage.  The string=atom case
		is a useful approximation to what a string=binary
		representation would cost.
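
To make the cost models above concrete, here is a minimal sketch that
applies them to a single string.  It is plain arithmetic, written in
Python only for convenience; the function names are hypothetical, and
rounding the Smalltalk element space up to whole 32-bit words is an
assumption, not something stated above.

```python
import math

WORD_BYTES = 4  # sizes in this post are reported in 32-bit words


def erlang_list_words(n_chars):
    """String as a list of integers: one [_|_] cell (3 words) per character."""
    return 3 * n_chars


def erlang_tuple_words(arity):
    """{X1,...,Xn} costs n+2 words under the stated model."""
    return arity + 2


def smalltalk_string_words(n_chars, bytes_per_char):
    """Indexed object with no named slots: 2 words + element space.

    Assumption: element space is rounded up to a whole number of words.
    """
    element_words = math.ceil(n_chars * bytes_per_char / WORD_BYTES)
    return 2 + element_words


if __name__ == "__main__":
    n = len("hello world")  # 11 characters
    print("Erlang string=list :", erlang_list_words(n), "words")
    print("Smalltalk char=byte:", smalltalk_string_words(n, 1), "words")
    print("Smalltalk char=word:", smalltalk_string_words(n, 4), "words")
```

Under these assumptions the list form grows by 3 words per character,
while a packed byte string grows by 1 word per 4 characters plus a
small fixed header.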

    Size                Language   Elem.rep  String rep

    11779 words         Smalltalk, specific, char=byte  
    13384 words         Prolog,    specific, string=atom
    15912 words         Smalltalk, generic,  char=byte  
    16151 words         Erlang,    specific, string=atom
    17820 words         Prolog,    generic,  string=atom
    18673 words         C,         generic,  my DVM2 library
    22735 words         Erlang,    generic,  string=atom
    29343 words         Smalltalk, specific, char=word  
    34583 words         Smalltalk, generic,  char=word  
    53101 words         Prolog,    specific, string=list
    54918 words         Erlang,    specific, string=list
    59752 words         Prolog,    generic,  string=list
    63441 words         Erlang,    generic,  string=list
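
The ratios behind the claim that string storage matters more than the
element representation can be checked directly from the table.  The
following is a small sketch in Python (used only for the arithmetic);
the numbers are copied from the rows above, and the helper names are
hypothetical.

```python
# (size in 32-bit words, language, element rep, string rep), from the table
ROWS = [
    (11779, "Smalltalk", "specific", "char=byte"),
    (13384, "Prolog", "specific", "string=atom"),
    (15912, "Smalltalk", "generic", "char=byte"),
    (16151, "Erlang", "specific", "string=atom"),
    (17820, "Prolog", "generic", "string=atom"),
    (18673, "C", "generic", "my DVM2 library"),
    (22735, "Erlang", "generic", "string=atom"),
    (29343, "Smalltalk", "specific", "char=word"),
    (34583, "Smalltalk", "generic", "char=word"),
    (53101, "Prolog", "specific", "string=list"),
    (54918, "Erlang", "specific", "string=list"),
    (59752, "Prolog", "generic", "string=list"),
    (63441, "Erlang", "generic", "string=list"),
]


def size_of(language, elem_rep, string_rep):
    """Look up one row of the table; raise KeyError if it is not there."""
    for words, lang, elem, string in ROWS:
        if (lang, elem, string) == (language, elem_rep, string_rep):
            return words
    raise KeyError((language, elem_rep, string_rep))


def best_to_worst_ratio():
    """Largest size divided by smallest size over the whole table."""
    sizes = [row[0] for row in ROWS]
    return max(sizes) / min(sizes)


def string_rep_ratio(language, elem_rep):
    """Cost of string=list relative to string=atom, element rep held fixed."""
    return (size_of(language, elem_rep, "string=list")
            / size_of(language, elem_rep, "string=atom"))


def elem_rep_ratio(language, string_rep):
    """Cost of generic relative to specific, string rep held fixed."""
    return (size_of(language, "generic", string_rep)
            / size_of(language, "specific", string_rep))


if __name__ == "__main__":
    print(f"best to worst:        {best_to_worst_ratio():.2f}")
    for elem in ("specific", "generic"):
        print(f"Erlang list/atom, {elem}: {string_rep_ratio('Erlang', elem):.2f}")
    for string in ("string=atom", "string=list"):
        print(f"Erlang generic/specific, {string}: "
              f"{elem_rep_ratio('Erlang', string):.2f}")
```

For Erlang, switching from string=list to string=atom saves a factor of
roughly 3.4 (specific) or 2.8 (generic), whereas switching from generic
to specific saves only about 1.4 (string=atom) or 1.2 (string=list).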
  
"Honesty is praised and starves." -- Juvenal


