String representation in Erlang

Richard A. O'Keefe ok@REDACTED
Wed Sep 14 04:27:32 CEST 2005


Thinus Pollard <thinus@REDACTED> wrote:
	According to the Erlang efficiency guide a string is internally
	represented as a list of integers, thus consuming 2 words
	(8 bytes on a 32bit platform) of memory *per* character.
	
Unicode/ISO 10646 characters require in general 21 bits (code points
run up to U+10FFFF), NOT 16 bits.  Java claims to support Unicode, but
there are already quite a lot of Unicode characters which don't fit in
16 bits, so many Unicode characters require two Java "chars".  Needless
to say, this stuffs up Java string indexing something wonderful.
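
The arithmetic behind the 21-bit figure can be checked with a small
sketch.  It assumes a present-day Erlang/OTP that has the utf16
bit-syntax type, which postdates this message; the module and function
names are made up for illustration.

```erlang
-module(codepoint_bits).
-export([bits_needed/1, utf16_units/1]).

%% Number of bits needed to write a non-negative integer in binary.
bits_needed(0) -> 1;
bits_needed(N) when is_integer(N), N > 0 -> bits_needed(N, 0).

bits_needed(0, Acc) -> Acc;
bits_needed(N, Acc) -> bits_needed(N bsr 1, Acc + 1).

%% Number of 16-bit units a code point occupies when encoded as UTF-16.
%% This is the encoding behind Java's "char"; a result of 2 means a
%% surrogate pair.
utf16_units(CodePoint) ->
    byte_size(<<CodePoint/utf16>>) div 2.
```

With these definitions, bits_needed(16#10FFFF) gives 21, and
utf16_units(16#1D11E) gives 2, where utf16_units($A) gives 1.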

There's an easy way to save half of the space of a string that
still leaves them able (in principle) to cope with the full range
of Unicode characters:  represent a string as a *tuple* of integer
codes.  Conversion functions:
    list_to_tuple(String) -> Tuple
    tuple_to_list(Tuple) -> String
    
(Assumption: list elements cost 2 cells each, tuples with N elements
cost N+c for c something like 1 or 2, so for large N, the space is
about 1/2 that of a list.)
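
A rough way to see the difference is sketched below.  It assumes that
erts_debug:flat_size/1, an undocumented debugging helper, is available
and reports term sizes in words; the module name str_tuple is
hypothetical, and the exact constants may vary between releases.

```erlang
-module(str_tuple).
-export([demo/0]).

%% Convert a 1000-character string to a tuple and back, then report
%% the size of each representation in words.
demo() ->
    String = lists:duplicate(1000, $a),
    Tuple  = list_to_tuple(String),
    String = tuple_to_list(Tuple),      % round trip returns the same list
    $a     = element(500, Tuple),       % constant-time indexing, 1-based
    %% Assumption: flat_size/1 counts heap words for the term.
    ListWords  = erts_debug:flat_size(String),
    TupleWords = erts_debug:flat_size(Tuple),
    {ListWords, TupleWords}.
```

The trade-off is that a tuple cannot be extended or shared
tail-wise the way a list can: changing one element with setelement/3
copies the whole tuple.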

The other alternative, of course, is to use binaries.  Joe's proposal
for <<"string">> notation makes them particularly attractive.
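
A minimal sketch of the binary alternative, using the existing
list_to_binary/1 and binary_to_list/1 BIFs.  The UTF-8 part assumes
the unicode module from later Erlang/OTP releases, which did not exist
when this message was written; the module name str_bin is hypothetical.

```erlang
-module(str_bin).
-export([demo/0]).

demo() ->
    %% One byte per character for codes 0..255.
    Latin = list_to_binary("string"),
    6 = byte_size(Latin),
    "string" = binary_to_list(Latin),
    %% Code points above 255 need an encoding such as UTF-8;
    %% U+1D11E takes four bytes.
    Utf8 = unicode:characters_to_binary([16#1D11E]),
    4 = byte_size(Utf8),
    [16#1D11E] = unicode:characters_to_list(Utf8),
    ok.
```

Note that with a variable-width encoding such as UTF-8, byte offsets
no longer coincide with character positions.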



