Strings (was: Re: are Mnesia tables immutable?)

Mon Jun 26 12:02:35 CEST 2006

I do appreciate these conversations on erlang "strings".  As a  
practitioner, I have learned lots from Richard and others  
clarification of how erlang handles strings and how it compares to  
other languages.
Now what I'd like to see is discussion on "best practices" for using  
strings (not assuming they are lists of integers but binaries or both  
lists and binaries).
Here is the scenario around which I'd like to shape a best practices  
thread of conversation:

(1) Given: a web based erlang+yaws+mnesia application.  This implies:
	all semantic data and OLTP go into mnesia, yaws files are used for  
control and presentation of all data.
(2) Given: all strings are in utf-8.  why, because its a good "one  
size fits almost all" format.  This mean all strings in mnesia, yaws  
files, templates, "foreign language lookup strings" etc.. are in utf-8.

Here is a thread of behavior for the app:
Thread A: "User requests yaws page from browser".
This triggers creation of controller and model objects (processes or  
lightweight "record models").
These models and controllers are used in binding attributes to or  
injecting into the template of the yaws files to output the HTML stream.

assume the file requested is personForm.yaws which contains HTML  
inputs firstName and lastName, and select country which contains a  
list of countries and the selected item in the list.

The list of countries come from the countryManager process which is  
accessed via an attribute of the personFormController process.  The  
countryManager process has a life cycle matching the VM and does not  
die when the personFormController or its related model objects die.   
countryManager returns a list of country names (we'll skip bits about  
mapping this to an id and assume we store just country name in mnesia).
The person object (just a record owned by the personFormController  
contains attributes firstName, lastName and country (not the whole  
list of countries, just the selected country in the list).

several problem come to mind:
1 - handling the HTML "decoration" strings.  This would be things  
like field labels, page title etc which would be looked up based on  
id from some.  I think this has been address with macros like ?TXT,  
but want to reflect on this.
2 - injecting attributes into yaws (either via yaws bind or other  
higher level template methods).  How to inject these values without  
going from a binary format to a list back to a binary for streaming  
out to the browser.
3 - retrieving the list of countries from countryManager to be put in  
the select list.  Each time I want to write out personForm.yaws, I  
end up passing this list of countries between processes (from  
countryManager to personFormController to yaws process handling the  
page output).  I _do not_ want this list to be copied each time.  Nor  
do I want this list to go from a utf8 binary to a list back to a  
binary for streaming out to the browser.

As you can see, although strings in erlang may not be grossly under- 
provided for, needing to store them in a compact form and passing  
lists of strings between processes (utf8 binary or lists of integers)  
compounds the memory issues....when all you want to do is output a  
page of concatenated utf8 binaries pulled from different sources.

So, I'm looking for some tips on establishing best practices for  
minimizing the copying and the short and long term memory structure.   
A goal would be to have a set of practices for a yaws+mnesia app  
combined with macros and enhanced libraries which makes it easy for a  
beginner erlang programmer to build an app which by default is fairly  
efficient.

Any ideas??

thanks, ke han

On Jun 26, 2006, at 3:25 PM, Ryan Rawson wrote:

> Excellent, thank you for that write up.
>
> There is a general perception that Erlang is no good at strings.  Part
> of the issue is the whole 'lists of integers' and people freak on the
> memory requirements.  The other part I think is the Unicode support.
> Care to speak to that?
>
> -ryan
>
> On 6/25/06, Richard A. O'Keefe <ok@REDACTED> wrote:
>> What I wrote about Java strings turns out not to be true.
>> Substrings *do* share their character array with their parent,
>> and string.substring(begin, end) *is* an O(1) operation,
>> although this doesn't seem to be documented in the official
>> Java documentation, so one presumably cannot rely on it.
>> That's the good news for Java.
>> The bad news is that the space overheads are high,
>> there is a space leak,
>> and while the performance is better than Erlang, it isn't _that_  
>> much better,
>> and the difference has very little to do with using lists.
>>
>> It turns out that Peter Norvig has measured the space overheads,
>> and a Java string of N characters requires 40 + 4*ceil(N/2) bytes  
>> on a 32-bit
>> machine.  (On a 64-bit machine, I'm guessing it would be 64 +  
>> 8*ceil(N/4).)
>> An Erlang string requires 8N bytes on a 32-bit machine (16N on a  
>> 64-bit one).
>>
>> So on a 32-bit machine, strings of 0..7 characters held as Erlang  
>> lists
>> are smaller than they would be in Java, and strings of 8 or more  
>> characters
>> are larger.
>>
>> This means that with a little care, a collection of words held in  
>> Erlang
>> can take _less_ space than a collection of words in Java.
>>
>> What about the space leak?  Suppose we construct a Java String of one
>> million characters, call it x.  Then we do
>>     x = x.substring(0, 1);
>> Now there is only one useful character, BUT all million characters  
>> will
>> be retained.  That's a space leak.  A sufficiently smart garbage  
>> collector
>> could work around that.  In fact my very first publication  
>> described just
>> such a garbage collector.  But I have seen no evidence in any Sun  
>> publication
>> that they have one.  So when you are splitting a string into  
>> pieces in Java,
>> it is up to you to either
>>  - ensure that all the pieces are dead when the original string is  
>> dead, or
>>  - do x = new String(x) for each non-dead piece.
>>
>> With strings implemented just as lists, the very common case of  
>> shared tails
>> requires no manual intervention.  If you implement strings as  
>> trees, there
>> is still no large scale leakage.  It's _only_ when strings are  
>> handled as
>> array slices that this can happen.  (ML has "substring" as a type  
>> that is
>> distinct from "string" to help the programmer keep track.)
>>
>> What about time?
>>
>> In Erlang, we can get at the first character of "fred" in *one*  
>> memory
>> reference.  In Java, "fred".charAt(0) requires
>>  - NO dynamic dispatch (String is a final class)
>>    unless you go through the CharSequence interface
>>  - check 0 against the string length (memory reference to pick it up)
>>  - add the offset (another memory reference)
>>  - fetch the array (another memory reference)
>>  - check 0+offset against array size (another memory reference)
>>  - fetch the code (the last reference)
>> for a total of 5 memory references.
>>
>> So, given a sufficiently smart Erlang compiler, we'd expect access to
>> any of the first few characters of a string to be faster in Erlang  
>> than
>> in Java.
>>
>> Alas, current Erlang compilers aren't that smart.
>>
>> I wrote a little test program in several languages to see how fast
>> string access was.  Here's the Java code:
>>
>>     private static int hash(String s) {
>>         int h = 0;
>>         for (int i = s.length(); --i >= 0; )
>>             h = (h*31 + (int)s.charAt(i)) % 1645499;
>>         return h;
>>     }
>>
>> Here's the Erlang equivalent:
>>
>>     hash(Cs) ->
>>         hash(Cs, 0).
>>
>>     hash([C|Cs], H) ->
>>         hash(Cs, (H*31 + C) rem 1645499);
>>     hash([], H) ->
>>         H.
>>
>> (compiled with the 'inline' option, hash/1 goes away).
>>
>> Here's the Haskell equivalent:
>>
>>     hash :: [Int] -> Int
>>
>>     hash cs = aux cs 0
>>         where aux (c:cs) h = aux cs ((h*31 + c) `mod` 1645499)
>>               aux []     h = h
>>
>> How long does it take _per character_ to compute this hash function?
>>
>> 14.4   AWK      2.06  usec      (Mawk 1.4)
>>  4.8   Erlang   0.68  usec      (Erlang R11B)
>>  2.0   Haskell  0.29  usec      (GHC 6.2)
>>  1.4   Java     0.195 usec      (Sun Java 1.4.2 SDK)
>>  1.0   C        0.143 usec      (GCC 3.3.2, -O4)
>>
>> all on a 500MHz UltraSPARC II but compiled for 32 bits.
>>
>> We note that Haskell and Erlang are using the *same*  
>> representation for
>> sequences here; we would expect them to be going through the  
>> *same* steps
>> at run time.  (Haskell is lazy, but GHC is smart enough to figure  
>> out that
>> this particular function would benefit from being compiled as  
>> strict.)
>> It might perhaps be to do with Haskell being typeful and Erlang not.
>>
>> The Erlang : Haskell ratio tells us that compiler quality accounts  
>> for
>> more (a factor of 2.4) than data representation (a factor of 2.0)  
>> for this
>> particular test.  The Java : C ratio tells us that Sun's Java  
>> compiler is
>> actually doing a pretty good job with this chunk of code; SPARCs have
>> never been very good at division, so a fair bit of the time is  
>> probably
>> going in the remainder.  (Which I stuck in to make this a  
>> reasonable test;
>> usually you are doing _something_ rather more interesting with the  
>> characters.)
>> AWK is usually regarded as being reasonably good at strings.
>>
>> My initial C times were a lot smaller than that because I used  
>> Sun's C
>> compiler, which was bright enough to spot that hash(s, n) was  
>> constant...
>> So was GHC until I modified the code so it wasn't.
>>
>> Whether you want to call this being "weak" at strings or not,
>> I don't see any evidence here that Erlang is weak*er* at strings
>> than at any other kind of data structure.
>>
>>