Strings (was: Re: are Mnesia tables immutable?)

Mon Jun 26 09:25:13 CEST 2006

Excellent, thank you for that write up.

There is a general perception that Erlang is no good at strings.  Part
of the issue is the whole 'lists of integers' and people freak on the
memory requirements.  The other part I think is the Unicode support.
Care to speak to that?

-ryan

On 6/25/06, Richard A. O'Keefe <ok@REDACTED> wrote:
> What I wrote about Java strings turns out not to be true.
> Substrings *do* share their character array with their parent,
> and string.substring(begin, end) *is* an O(1) operation,
> although this doesn't seem to be documented in the official
> Java documentation, so one presumably cannot rely on it.
> That's the good news for Java.
> The bad news is that the space overheads are high,
> there is a space leak,
> and while the performance is better than Erlang, it isn't _that_ much better,
> and the difference has very little to do with using lists.
>
> It turns out that Peter Norvig has measured the space overheads,
> and a Java string of N characters requires 40 + 4*ceil(N/2) bytes on a 32-bit
> machine.  (On a 64-bit machine, I'm guessing it would be 64 + 8*ceil(N/4).)
> An Erlang string requires 8N bytes on a 32-bit machine (16N on a 64-bit one).
>
> So on a 32-bit machine, strings of 0..7 characters held as Erlang lists
> are smaller than they would be in Java, and strings of 8 or more characters
> are larger.
>
> This means that with a little care, a collection of words held in Erlang
> can take _less_ space than a collection of words in Java.
>
> What about the space leak?  Suppose we construct a Java String of one
> million characters, call it x.  Then we do
>     x = x.substring(0, 1);
> Now there is only one useful character, BUT all million characters will
> be retained.  That's a space leak.  A sufficiently smart garbage collector
> could work around that.  In fact my very first publication described just
> such a garbage collector.  But I have seen no evidence in any Sun publication
> that they have one.  So when you are splitting a string into pieces in Java,
> it is up to you to either
>  - ensure that all the pieces are dead when the original string is dead, or
>  - do x = new String(x) for each non-dead piece.
>
> With strings implemented just as lists, the very common case of shared tails
> requires no manual intervention.  If you implement strings as trees, there
> is still no large scale leakage.  It's _only_ when strings are handled as
> array slices that this can happen.  (ML has "substring" as a type that is
> distinct from "string" to help the programmer keep track.)
>
> What about time?
>
> In Erlang, we can get at the first character of "fred" in *one* memory
> reference.  In Java, "fred".charAt(0) requires
>  - NO dynamic dispatch (String is a final class)
>    unless you go through the CharSequence interface
>  - check 0 against the string length (memory reference to pick it up)
>  - add the offset (another memory reference)
>  - fetch the array (another memory reference)
>  - check 0+offset against array size (another memory reference)
>  - fetch the code (the last reference)
> for a total of 5 memory references.
>
> So, given a sufficiently smart Erlang compiler, we'd expect access to
> any of the first few characters of a string to be faster in Erlang than
> in Java.
>
> Alas, current Erlang compilers aren't that smart.
>
> I wrote a little test program in several languages to see how fast
> string access was.  Here's the Java code:
>
>     private static int hash(String s) {
>         int h = 0;
>         for (int i = s.length(); --i >= 0; )
>             h = (h*31 + (int)s.charAt(i)) % 1645499;
>         return h;
>     }
>
> Here's the Erlang equivalent:
>
>     hash(Cs) ->
>         hash(Cs, 0).
>
>     hash([C|Cs], H) ->
>         hash(Cs, (H*31 + C) rem 1645499);
>     hash([], H) ->
>         H.
>
> (compiled with the 'inline' option, hash/1 goes away).
>
> Here's the Haskell equivalent:
>
>     hash :: [Int] -> Int
>
>     hash cs = aux cs 0
>         where aux (c:cs) h = aux cs ((h*31 + c) `mod` 1645499)
>               aux []     h = h
>
> How long does it take _per character_ to compute this hash function?
>
> 14.4   AWK      2.06  usec      (Mawk 1.4)
>  4.8   Erlang   0.68  usec      (Erlang R11B)
>  2.0   Haskell  0.29  usec      (GHC 6.2)
>  1.4   Java     0.195 usec      (Sun Java 1.4.2 SDK)
>  1.0   C        0.143 usec      (GCC 3.3.2, -O4)
>
> all on a 500MHz UltraSPARC II but compiled for 32 bits.
>
> We note that Haskell and Erlang are using the *same* representation for
> sequences here; we would expect them to be going through the *same* steps
> at run time.  (Haskell is lazy, but GHC is smart enough to figure out that
> this particular function would benefit from being compiled as strict.)
> It might perhaps be to do with Haskell being typeful and Erlang not.
>
> The Erlang : Haskell ratio tells us that compiler quality accounts for
> more (a factor of 2.4) than data representation (a factor of 2.0) for this
> particular test.  The Java : C ratio tells us that Sun's Java compiler is
> actually doing a pretty good job with this chunk of code; SPARCs have
> never been very good at division, so a fair bit of the time is probably
> going in the remainder.  (Which I stuck in to make this a reasonable test;
> usually you are doing _something_ rather more interesting with the characters.)
> AWK is usually regarded as being reasonably good at strings.
>
> My initial C times were a lot smaller than that because I used Sun's C
> compiler, which was bright enough to spot that hash(s, n) was constant...
> So was GHC until I modified the code so it wasn't.
>
> Whether you want to call this being "weak" at strings or not,
> I don't see any evidence here that Erlang is weak*er* at strings
> than at any other kind of data structure.
>
>