[erlang-questions] Erlang string datatype [cookbook entry #1 - unicode/UTF-8 strings]

Tue Oct 25 06:57:35 CEST 2011

On 22/10/2011, at 10:12 AM, Jon Watte wrote:
> 2) Strings probably should be of "code points" not "bytes," so they would use 2-byte or 4-byte characters. Note that Windows uses sizeof(wchar_t) == 2, and Linux uses sizeof(wchar_t) == 4, so there's no unambiguously "right" choice.

This claim is inconsistent with claim (4).
> 
> 3) Operations include (erlang-style) matching, (grep-style) finding, converting, splitting, joining and trimming.
> 
> 4) Random access is important! As is slicing. string:substr(from,to) should be O(1).

If you think random access is important, then 2-byte elements are just wrong.
There are currently more than 109,000 graphic characters in Unicode, and the
number keeps growing (Unicode 6.0 added over 2,000 characters).  With 2-byte
elements, some characters take 1 element and some take 2, and you cannot tell how
many Unicode code points there are in a sequence of 16-bit elements without
checking them all.
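
To make the point concrete, here is a minimal sketch in Python (not Erlang; the
helper names utf16_units and count_code_points are made up for illustration, and
well-formed UTF-16 is assumed).  It models a string as a list of 16-bit elements
and shows that counting code points means looking at every element.

```python
def utf16_units(text):
    """Return the UTF-16 code units of text as a list of ints."""
    raw = text.encode("utf-16-le")
    return [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]


def count_code_points(units):
    """Count code points by scanning every 16-bit element: O(n)."""
    count = 0
    for unit in units:
        # A low (trailing) surrogate is the second half of a pair whose
        # first half has already been counted, so skip it.
        if not 0xDC00 <= unit <= 0xDFFF:
            count += 1
    return count


sample = "a\U0001D11Eb"        # 'a', MUSICAL SYMBOL G CLEF (U+1D11E), 'b'
units = utf16_units(sample)    # four 16-bit elements
n = count_code_points(units)   # three code points
```

The element count and the code point count differ as soon as one character
outside the Basic Multilingual Plane appears, and nothing short of a scan tells
you by how much.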

Note that *slicing* (to positions already determined) could be fast without
*indexing* being fast.  Indeed, this is the case in Java, where getting an
answer to the question "where does character k begin in this String" takes
O(k) time, whereas slicing from already determined positions is O(1).
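
The distinction can be sketched as follows, again in Python rather than Java and
under the assumption of well-formed UTF-16 with k in range; offset_of_code_point
is a hypothetical helper, not a library function.  Finding where code point k
begins walks k code points, but once the positions are known, taking the slice
through a memoryview copies nothing.

```python
from array import array


def offset_of_code_point(units, k):
    """Return the element index where code point k (0-based) begins: O(k)."""
    index = 0
    for _ in range(k):
        # A high (leading) surrogate means this code point occupies
        # two 16-bit elements; anything else occupies one.
        index += 2 if 0xD800 <= units[index] <= 0xDBFF else 1
    return index


# 'a', U+1D11E as a surrogate pair, 'b'
units = memoryview(array("H", [0x0061, 0xD834, 0xDD1E, 0x0062]))

start = offset_of_code_point(units, 1)   # O(k) search for the position
end = offset_of_code_point(units, 2)
piece = units[start:end]                 # slice shares storage, no copy
```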
> 
> 6) Ideally, reference count immutable string data so that substring extraction is cheap.

Java has cheap substring extraction without reference counting.
However, be aware that doing this can lead to *huge* space leaks with very
large source strings being retained for the sake of comparatively small
substrings; there is a *reason* why Java's immutable String class comes with a
copying constructor, new String(String)!
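
The same trade-off can be observed in any language with shared slices.  A small
Python sketch, using memoryview as a stand-in for a shared substring (this is an
analogy, not a description of how Java is implemented):

```python
big = bytes(1_000_000)           # stands in for a very large source string
shared = memoryview(big)[:5]     # cheap slice that shares big's storage
copied = bytes(shared)           # independent 5-byte copy

# The shared slice still refers to the entire source buffer, so the
# buffer cannot be reclaimed while the slice is alive.
retains_source = shared.obj is big

# The copy owns only its own 5 bytes and does not pin the source.
copy_size = len(copied)
```

Holding on to shared keeps the whole megabyte reachable; holding on to copied
does not.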

> 9) Do not intern strings, ever, for any reason.

This is surely the programmer's choice.  My XML library code in C interns everything
all the time, and it pays off very nicely indeed.  My Smalltalk compiler interns
everything (file names, identifiers, string literals, number literals, although
number literals are interned _as_ number records, not as strings) and again it pays
off very nicely.  It seems truly bizarre to want string _indexing_ (something I never
find useful, given the high strangeness of Unicode) to be O(1) but not to want
string _equality_ (something I do a lot) to be O(1).
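
For readers unfamiliar with the technique, here is a minimal sketch of an intern
table in Python.  The InternTable class is hypothetical and is not the code from
either of the programs mentioned above.  Each distinct string is stored once, so
two interned strings are equal exactly when they are the same object, and that
check is a pointer comparison.

```python
class InternTable:
    """Map each distinct string to a single canonical object."""

    def __init__(self):
        self._table = {}

    def intern(self, text):
        # setdefault returns the already stored object if an equal
        # string was interned before, otherwise it stores this one.
        return self._table.setdefault(text, text)


table = InternTable()
a = table.intern("element-name")
b = table.intern("".join(["element", "-name"]))   # equal text, built separately

same = a is b   # identity test, independent of string length
```

The cost moves to the intern call itself, which hashes the string once; after
that every equality test is O(1).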