are Mnesia tables immutable?

Thu Jun 22 05:55:43 CEST 2006

	Interesting.  Do you think it would be possible to write an API
	similar to Lines (or on top of Lines), that would turn any
	string into a tree for easy manipulation and pseudo-random
	access, without understanding its syntax?  The tradeoff would be
	speed in exchage for space.

If "Lines" refers to Ulf Wiger's 'lines' module, that defines an
indexable sequence data type which stores its contents as a balanced
binary tree of "blocks" (where a "block" may have up to 10 elements).
There is nothing in that module that knows or cares what the elements
are.  As it happens, 'lines' itself _will_ turn any stirng into a tree.
Just use

	Array = lines:convert_from_list(String)

This will take even more space than an ordinary list.  That could be
fixed quite easily by rewriting the 'lines' module (amongst other things,
the operation 'lines' calls 'append' is in fact 'snoc' and there isn't
any true 'append' operation for 'lines').  The main problem with this
is that it concentrates on operations
	- fetch one element
	- replace one element
	- insert one element
	- delete one element
that are simply at the wrong level of abstraction for anything you are
likely to want to do with a "string" these days.

There are quite a lot of data structures _like_ the 'lines' module around,
starting with the old 'AVL DAGs' and including some very clever data
structures indeed.

Here is a very simple illustration of how choosing a slightly different
data structure can give useful savings.

I have an XML file with 1,322,807 characters.
(It's part of the TREC Wall Street Journal collection.)
Reading it into SWI Prolog as a list of character codes,
just the same as the normal Erlang representation of a string,
required 15,874,896 bytes, from which we conclude that SWI
Prolog uses 12 bytes per list cell.  (Quintus Prolog used 8.)

A common trick in Information Retrieval is to represent a text
as a sequence of *words*.  So I wrote a small amount of Prolog code
to convert a sequence of alphanumerics possibly followed by a space
to an atom, while any other character remained as a character code
(and cancelled the implied space at the end of the preceding word).
So "'Twas the night before Christmas, and all through the house,"
would be stored as [39,'Twas',the,night,before,'Christmas',44,
and,all,through,the,house,44].

Using that representation, the same text required under 3.5 bytes
per character.

The great thing about that representation is that you can search for
words and phrases easily, in fact _easier_ than you can on a string.

We can also represent a list of items as a list of blocks (in much the
same way that 'lines' uses a tree of blocks); a block could be a tuple.
>From actual measurements in SWI Prolog:
	list of characters		3.0  words = 12.0  bytes per character
	list of blocks of characters	1.4  words =  5.6  bytes per character
	list of words			0.86 words =  3.43 bytes per character
	list of blocks of words 	0.40 words =  1.6  bytes per character

Now the list of blocks of words representation doesn't give you constant
time index, but that's not a useful operation on strings.  It _does_ give
you efficient word and phrase search.

But my whole point is that the real way to get efficient text representation
is *NOT* to use a one-size-fits-everyone-badly representation that operates
'without understanding the syntax' but to use something that _does_ exploit
the structures that are relevant to the processing you intend to do.

Erlang's binaries are a perfectly good space-efficient representation for
textual data that you are just accepting, storing, and passing on.

For more general textual processing, you want
 - O(1) slicing
 - somewhere between O(1) and O(lg n) concatenation
 - O(n) traversal
 - O(n) search
The conventional packed-array-of-bytes approach gives you the last two,
but not the first two.
	Well, Hokan said it in this thread :)

	"Erlang is somewhat weak in the area of strings (strings where not a main
	concern when erlang was begin designed to be used in Ericssons telecom
	products - switches ... etc)"

He was wrong.  Strings may not have been a major concern, but that doesn't
mean that Erlang is weak in that area.  It's rather like the contrast between
PL/I -- where strings *are* a special data type and a lot of thought went
into them -- and C -- where they *aren't*.  Way back in 1979, a friend of
mine, who was an experienced PL/I programmer, wrote a 'stream editor' not
entirely unlike sed for IBM mainframes.  It took him a *lot* of code, and
he was fighting PL/I every step of the way, because the operations *it*
wanted to provide for strings were not the operations he needed to use, and
to get the operations he needed required a great deal of rather clumsy code.
A month after meeting C for the first time, I decided to reimplement his
editor.  It took me about 1/8th as much code, and one of the reasons for
that was that I was able to implement *precisely* the operations I needed
using the array primitives provided by C and then write the rest of the
editor using them.  That was my first inkling that strings were wrong, and
good *sequence* data types were right.

Like Prolog and Haskell, Erlang has excellent list processing support.
So Erlang is in fact a very *good* language for working on strings,
not because of special support for strings but because special support is
not needed.

And as a consequence of that, Erlang offers *equally* good support for
sequences of words, sequences of tokens, sequences of lines, &c; it's
not like AWK which is _just_ good at strings...

	> It's the same with Erlang.  *Building* strings using [This|That] instead
	> of This ++ That is an O(1) operation no matter how big This is.  Yes, you
	> can beat C that way -- unless the C programmer knows that strings are wrong
	> too.

	Coming from a C/C++ background, I would say that it's pretty easy to
	write an expanding buffer in those languages to which you can append
	characters at the end for O(1) cost, with an occasional realloc()
	call.

Yes of course.  Everyone knows that.  BUT it's not the way most C code is
actually written, and that's *not* the data type that is directly supported
by strcpy(), strstr(), sprintf(), sscanf(), and so on.  Above all, you have
completely missed the point of 'this is an O(1) operation NO MATTER HOW BIG
This IS'.  That is, you can concatenate two non-flat strings in Erlang
in CONSTANT time regardless of how big either string is.  Using the
flex array technique you mention in C, the cost is LINEAR IN THE SIZE OF
AT LEAST ONE OF THE STRINGS.

Now there _is_ a way around that in C, and it is to use a completely
diferent representation for strings, which is *not* like the flex array
approach.  It's a tree.  (Yes, we're back at AVL DAGs and their kin.)

	To the outside observer, it seems like Google's lightbulb hasn't gone
	off yet when it comes to functional languages :)

The language they designed for querying their logs has as an *essential*
feature that it does not and cannot change any persistent data structure.
It looks imperative, and it's presumably implemented in an imperative way,
but it could be regarded as syntactic sugar for a functional language.

Google are *definitely* looking for people with XSLT experience.
Weird, ugly, misbegotten freak that it is, XSLT is none-the-less
a functional language.  It's not just a _declarative_ language, like
SQL, it's an honest-to-goodness functional language.

	Yes, unicode does make things different. In Java, all strings are
	unicode (like Erlang),

Actually, no.
Unicode is 17 planes of 65566 positions each,
so a Unicode character may require 21 bits.
There are already about 100,000 different characters.
Alas, Java strings are sequences of *16-bit* codes,
which means that it doesn't fit Unicode all that well.
There are tens of thousands of Unicode characters which do NOT
fit in a Java character, and which take 2 positions in a Java string.
So the O(1) indexing you get in Java strings is *not* O(1) access
to the nth Unicode character, but O(1) access to the nth fragment of
some Unicode character, who knows which.  Accessing the nth Unicode
character in a Java string is an O(n) operation.

	but there isn't the extra 32 bit overhead per
	characters for building the linked list.

The overheads for building a string are so high that they require
another data type (which _should_ have been a kind of output stream,
but in a typical Java design botch isn't, although Java 1.5 has a
bodged-on half-solution) for that.

The space overheads for Java strings are much worse than you think,
because strings (other than the results of String.intern()) do not
share storage with other strings.  Erlang makes trees easy, trees
make sharing easy, and sharing cuts space costs a LOT.

	Is it always 32, by the way, even on 64 bit machines?

It's a pointer-sized word.

	Or do Erlang "strings" double in size on 64 bit architectures?

Yes, they do.

Perhaps I can put it this way:
    The _simplest_ way to represent and process strings in Erlang is
    neither compact nor particularly fast (although as speed goes it's
    not bad).

    However, _problem-appropriate_ data structures can make text
    processing in Erlang both space efficient and fast.  But you have
    to _think_ about the data you have and what you want to do with it.