Strings (was: Re: are Mnesia tables immutable?)
Ryan Rawson
ryanobjc@REDACTED
Mon Jun 26 09:25:13 CEST 2006
Excellent, thank you for that write up.
There is a general perception that Erlang is no good at strings. Part
of the issue is the whole 'lists of integers' and people freak on the
memory requirements. The other part I think is the Unicode support.
Care to speak to that?
-ryan
On 6/25/06, Richard A. O'Keefe <ok@REDACTED> wrote:
> What I wrote about Java strings turns out not to be true.
> Substrings *do* share their character array with their parent,
> and string.substring(begin, end) *is* an O(1) operation,
> although this doesn't seem to be documented in the official
> Java documentation, so one presumably cannot rely on it.
> That's the good news for Java.
> The bad news is that the space overheads are high,
> there is a space leak,
> and while the performance is better than Erlang, it isn't _that_ much better,
> and the difference has very little to do with using lists.
>
> It turns out that Peter Norvig has measured the space overheads,
> and a Java string of N characters requires 40 + 4*ceil(N/2) bytes on a 32-bit
> machine. (On a 64-bit machine, I'm guessing it would be 64 + 8*ceil(N/4).)
> An Erlang string requires 8N bytes on a 32-bit machine (16N on a 64-bit one).
>
> So on a 32-bit machine, strings of 0..7 characters held as Erlang lists
> are smaller than they would be in Java, and strings of 8 or more characters
> are larger.
>
> This means that with a little care, a collection of words held in Erlang
> can take _less_ space than a collection of words in Java.
>
> What about the space leak? Suppose we construct a Java String of one
> million characters, call it x. Then we do
> x = x.substring(0, 1);
> Now there is only one useful character, BUT all million characters will
> be retained. That's a space leak. A sufficiently smart garbage collector
> could work around that. In fact my very first publication described just
> such a garbage collector. But I have seen no evidence in any Sun publication
> that they have one. So when you are splitting a string into pieces in Java,
> it is up to you to either
> - ensure that all the pieces are dead when the original string is dead, or
> - do x = new String(x) for each non-dead piece.
>
> With strings implemented just as lists, the very common case of shared tails
> requires no manual intervention. If you implement strings as trees, there
> is still no large scale leakage. It's _only_ when strings are handled as
> array slices that this can happen. (ML has "substring" as a type that is
> distinct from "string" to help the programmer keep track.)
>
> What about time?
>
> In Erlang, we can get at the first character of "fred" in *one* memory
> reference. In Java, "fred".charAt(0) requires
> - NO dynamic dispatch (String is a final class)
> unless you go through the CharSequence interface
> - check 0 against the string length (memory reference to pick it up)
> - add the offset (another memory reference)
> - fetch the array (another memory reference)
> - check 0+offset against array size (another memory reference)
> - fetch the code (the last reference)
> for a total of 5 memory references.
>
> So, given a sufficiently smart Erlang compiler, we'd expect access to
> any of the first few characters of a string to be faster in Erlang than
> in Java.
>
> Alas, current Erlang compilers aren't that smart.
>
> I wrote a little test program in several languages to see how fast
> string access was. Here's the Java code:
>
> private static int hash(String s) {
> int h = 0;
> for (int i = s.length(); --i >= 0; )
> h = (h*31 + (int)s.charAt(i)) % 1645499;
> return h;
> }
>
> Here's the Erlang equivalent:
>
> hash(Cs) ->
> hash(Cs, 0).
>
> hash([C|Cs], H) ->
> hash(Cs, (H*31 + C) rem 1645499);
> hash([], H) ->
> H.
>
> (compiled with the 'inline' option, hash/1 goes away).
>
> Here's the Haskell equivalent:
>
> hash :: [Int] -> Int
>
> hash cs = aux cs 0
> where aux (c:cs) h = aux cs ((h*31 + c) `mod` 1645499)
> aux [] h = h
>
> How long does it take _per character_ to compute this hash function?
>
> 14.4 AWK 2.06 usec (Mawk 1.4)
> 4.8 Erlang 0.68 usec (Erlang R11B)
> 2.0 Haskell 0.29 usec (GHC 6.2)
> 1.4 Java 0.195 usec (Sun Java 1.4.2 SDK)
> 1.0 C 0.143 usec (GCC 3.3.2, -O4)
>
> all on a 500MHz UltraSPARC II but compiled for 32 bits.
>
> We note that Haskell and Erlang are using the *same* representation for
> sequences here; we would expect them to be going through the *same* steps
> at run time. (Haskell is lazy, but GHC is smart enough to figure out that
> this particular function would benefit from being compiled as strict.)
> It might perhaps be to do with Haskell being typeful and Erlang not.
>
> The Erlang : Haskell ratio tells us that compiler quality accounts for
> more (a factor of 2.4) than data representation (a factor of 2.0) for this
> particular test. The Java : C ratio tells us that Sun's Java compiler is
> actually doing a pretty good job with this chunk of code; SPARCs have
> never been very good at division, so a fair bit of the time is probably
> going in the remainder. (Which I stuck in to make this a reasonable test;
> usually you are doing _something_ rather more interesting with the characters.)
> AWK is usually regarded as being reasonably good at strings.
>
> My initial C times were a lot smaller than that because I used Sun's C
> compiler, which was bright enough to spot that hash(s, n) was constant...
> So was GHC until I modified the code so it wasn't.
>
> Whether you want to call this being "weak" at strings or not,
> I don't see any evidence here that Erlang is weak*er* at strings
> than at any other kind of data structure.
>
>
More information about the erlang-questions
mailing list