Strings (was: Re: are Mnesia tables immutable?)

Mon Jun 26 05:26:55 CEST 2006

What I wrote about Java strings turns out not to be true.
Substrings *do* share their character array with their parent,
and string.substring(begin, end) *is* an O(1) operation,
although this doesn't seem to be documented in the official
Java documentation, so one presumably cannot rely on it.
That's the good news for Java.
The bad news is that the space overheads are high,
there is a space leak,
and while the performance is better than Erlang, it isn't _that_ much better,
and the difference has very little to do with using lists.

It turns out that Peter Norvig has measured the space overheads,
and a Java string of N characters requires 40 + 4*ceil(N/2) bytes on a 32-bit
machine.  (On a 64-bit machine, I'm guessing it would be 64 + 8*ceil(N/4).)
An Erlang string requires 8N bytes on a 32-bit machine (16N on a 64-bit one).

So on a 32-bit machine, strings of 0..7 characters held as Erlang lists
are smaller than they would be in Java, and strings of 8 or more characters
are larger.

This means that with a little care, a collection of words held in Erlang
can take _less_ space than a collection of words in Java.

What about the space leak?  Suppose we construct a Java String of one
million characters, call it x.  Then we do
    x = x.substring(0, 1);
Now there is only one useful character, BUT all million characters will
be retained.  That's a space leak.  A sufficiently smart garbage collector
could work around that.  In fact my very first publication described just
such a garbage collector.  But I have seen no evidence in any Sun publication
that they have one.  So when you are splitting a string into pieces in Java,
it is up to you to either
 - ensure that all the pieces are dead when the original string is dead, or
 - do x = new String(x) for each non-dead piece.

With strings implemented just as lists, the very common case of shared tails
requires no manual intervention.  If you implement strings as trees, there
is still no large scale leakage.  It's _only_ when strings are handled as
array slices that this can happen.  (ML has "substring" as a type that is
distinct from "string" to help the programmer keep track.)

What about time?

In Erlang, we can get at the first character of "fred" in *one* memory
reference.  In Java, "fred".charAt(0) requires
 - NO dynamic dispatch (String is a final class)
   unless you go through the CharSequence interface
 - check 0 against the string length (memory reference to pick it up)
 - add the offset (another memory reference)
 - fetch the array (another memory reference)
 - check 0+offset against array size (another memory reference)
 - fetch the code (the last reference)
for a total of 5 memory references.

So, given a sufficiently smart Erlang compiler, we'd expect access to
any of the first few characters of a string to be faster in Erlang than
in Java.

Alas, current Erlang compilers aren't that smart.

I wrote a little test program in several languages to see how fast
string access was.  Here's the Java code:

    private static int hash(String s) {
        int h = 0;
        for (int i = s.length(); --i >= 0; )
            h = (h*31 + (int)s.charAt(i)) % 1645499;
        return h;
    }

Here's the Erlang equivalent:

    hash(Cs) ->
	hash(Cs, 0).

    hash([C|Cs], H) ->
	hash(Cs, (H*31 + C) rem 1645499);
    hash([], H) ->
	H.

(compiled with the 'inline' option, hash/1 goes away).

Here's the Haskell equivalent:

    hash :: [Int] -> Int

    hash cs = aux cs 0
        where aux (c:cs) h = aux cs ((h*31 + c) `mod` 1645499)
	      aux []     h = h

How long does it take _per character_ to compute this hash function?

14.4   AWK      2.06  usec      (Mawk 1.4)
 4.8   Erlang	0.68  usec	(Erlang R11B)
 2.0   Haskell	0.29  usec	(GHC 6.2)
 1.4   Java	0.195 usec	(Sun Java 1.4.2 SDK)
 1.0   C        0.143 usec	(GCC 3.3.2, -O4)

all on a 500MHz UltraSPARC II but compiled for 32 bits.

We note that Haskell and Erlang are using the *same* representation for
sequences here; we would expect them to be going through the *same* steps
at run time.  (Haskell is lazy, but GHC is smart enough to figure out that
this particular function would benefit from being compiled as strict.)
It might perhaps be to do with Haskell being typeful and Erlang not.

The Erlang : Haskell ratio tells us that compiler quality accounts for
more (a factor of 2.4) than data representation (a factor of 2.0) for this
particular test.  The Java : C ratio tells us that Sun's Java compiler is
actually doing a pretty good job with this chunk of code; SPARCs have
never been very good at division, so a fair bit of the time is probably
going in the remainder.  (Which I stuck in to make this a reasonable test;
usually you are doing _something_ rather more interesting with the characters.)
AWK is usually regarded as being reasonably good at strings.

My initial C times were a lot smaller than that because I used Sun's C
compiler, which was bright enough to spot that hash(s, n) was constant...
So was GHC until I modified the code so it wasn't.

Whether you want to call this being "weak" at strings or not,
I don't see any evidence here that Erlang is weak*er* at strings
than at any other kind of data structure.