[erlang-questions] Erlang string datatype [cookbook entry #1 - unicode/UTF-8 strings]

Sun Oct 30 11:19:55 CET 2011

On 2011-10-30, at 02:20 , Jon Watte wrote:
> 
> Life is too short to not normalize data on input, in my opinion.
"Normalization" does not mean "NFC". And NFC is not the best
normalization form for all situations. Its only advantage really
is in codepoint count.
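
A minimal sketch of that point, in Python with its standard `unicodedata` module (the variable names are illustrative only): the same "é" in composed and decomposed form, and what the different normalization forms do to it.

```python
import unicodedata

# "é" written two ways: precomposed, and base letter plus combining accent.
composed = "\u00e9"      # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # U+0065 followed by U+0301 COMBINING ACUTE ACCENT

# Unnormalized, the two spellings are different codepoint sequences.
equal_before = composed == decomposed

# NFC composes, NFD decomposes; both are valid canonical normalizations.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", composed)

# NFC gives the shorter sequence, which is the codepoint-count advantage.
nfc_len = len(nfc)
nfd_len = len(nfd)

# Compatibility forms go further: NFKC folds the "fi" ligature into "f" + "i",
# while NFC leaves it alone.
ligature = "\ufb01"      # U+FB01 LATIN SMALL LIGATURE FI
nfc_ligature = unicodedata.normalize("NFC", ligature)
nfkc_ligature = unicodedata.normalize("NFKC", ligature)
```

Which form is appropriate depends on what the data is used for afterwards; NFD, for instance, makes it easy to strip or inspect combining marks.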

> However,
> the specific examples I care about are all about parsing existing
> protocols, where all the "interesting" bits are defined as ASCII subset,
> and anything outside that can be lumped into a single "string of other
> data" class without loss of generality. This includes things like HTTP or
> MIME. Your applications may vary.
> 
I think that is *by far* the biggest issue with many string datatypes:
they double up as both "lightweight" structures in many ASCII-based
protocols, with those structures being either fixed-size, allowing O(1)
access (e.g. many logfile formats), or simple character-separated
structures; and as the actual encoding of *human text*, which is what
Unicode was built for.

These usages are completely at odds with one another: a byte/ASCII-based
structure is an array of bytes, some of which are visibly representable,
but a Unicode string is a *stream*. It is an array of codepoints, but
codepoints are useless for manipulating text; the UTFs also map it to
arrays of bytes, but those byte arrays are just as useless for
manipulating text.
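
To make that concrete, here is a small Python sketch (the helper name `measure` is hypothetical) that counts the same text at the byte, code unit and codepoint levels. None of the three numbers matches what a reader would call the number of characters.

```python
def measure(text):
    """Return the length of `text` at three representation levels."""
    return {
        "utf8_bytes": len(text.encode("utf-8")),
        # UTF-16 code units are 2 bytes each; "-le" avoids a BOM prefix.
        "utf16_units": len(text.encode("utf-16-le")) // 2,
        "code_points": len(text),
    }

# One perceived character: "e" followed by COMBINING ACUTE ACCENT.
accented = "e\u0301"
accented_sizes = measure(accented)

# One perceived character: a flag made of two regional indicator symbols.
flag = "\U0001F1EB\U0001F1F7"
flag_sizes = measure(flag)

# Indexing by codepoint silently cuts the accent off its base letter.
truncated = accented[:1]

# Indexing by byte can leave a sequence that is not even valid UTF-8.
broken = accented.encode("utf-8")[:2]
try:
    broken.decode("utf-8")
    byte_slice_valid = True
except UnicodeDecodeError:
    byte_slice_valid = False
```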

To correctly manipulate text, the primary code interface should be the
grapheme cluster (what most people think of as a "character"). You *still*
encounter the issue that a given codepoint sequence can be seen as one or
several "characters" depending on the culture, but I think *that* issue is
much rarer than the problems caused by manipulating grapheme clusters
through codepoint-based interfaces. Furthermore, this manipulation should
be done completely independently of the underlying physical representation
(a grapheme cluster is expressed in terms of code points, not in terms of
whatever code units the UTF uses).
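
A rough sketch of what a cluster-level interface could look like, in Python. This is a deliberate simplification, not the real segmentation algorithm: the hypothetical `clusters` function only attaches combining marks to the preceding base codepoint, whereas the full rules in Unicode Standard Annex #29 also cover Hangul syllables, emoji sequences, regional indicators and more.

```python
import unicodedata

def clusters(text):
    """Split text into approximate grapheme clusters.

    Simplification: a cluster is one codepoint plus any combining marks
    (general categories Mn, Mc, Me) that directly follow it.
    """
    result = []
    for ch in text:
        if result and unicodedata.category(ch).startswith("M"):
            result[-1] += ch      # mark belongs to the previous cluster
        else:
            result.append(ch)     # start a new cluster
    return result

word = "cafe\u0301"               # "café" with a decomposed "é"

# Reversing by codepoint moves the accent onto the wrong letter;
# reversing by cluster keeps "é" intact.
by_code_point = word[::-1]
by_cluster = "".join(reversed(clusters(word)))

# Clusters are defined over codepoints, so the result does not depend on
# which UTF the text was stored in.
from_utf8 = clusters(word.encode("utf-8").decode("utf-8"))
from_utf16 = clusters(word.encode("utf-16-le").decode("utf-16-le"))
```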

NSString is one of the very few string datatypes I've seen which makes
treating text correctly easy (although it also includes the bytes/codepoint
array stuff) because, while its primary interface is not grapheme clusters,
it provides a very extensive interface to manipulate text in terms of
grapheme clusters. I think its only issue is that it still allows
strings to be manipulated in terms of codepoints and code units at all.

Apple even has a document on that very subject:
http://developer.apple.com/library/ios/#documentation/Cocoa/Conceptual/Strings/Articles/stringsClusters.html#//apple_ref/doc/uid/TP40008025
> It's common to think of a string as a sequence of characters,
> but when working with NSString objects, or with Unicode strings in
> general, in most cases it is better to deal with substrings rather
> than with individual characters. The reason for this is that what
> the user perceives as a character in text may in many cases be
> represented by multiple characters in the string.

>>> off very nicely.  It seems truly bizarre to want string _indexing_
>>> (something I never find useful, given the high strangeness of Unicode)
>>> to be O(1) but not to want string _equality_ (something I do a lot)
>>> to be O(1).
>> 
> 
> It seems like you never do network protocol parsing
Network protocol parsing is the manipulation of a byte stream or a byte
array. By trying to fit this use case into a string datatype, you only
ensure that the datatype will be garbage for text manipulation.

You already have an Erlang datatype to do network protocol parsing: binaries.
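
The same idea sketched in Python, with `bytes` standing in for Erlang binaries (the function name `parse_header` is made up for illustration): the ASCII-defined parts of an HTTP-style header line are handled directly on bytes, and everything else stays an opaque blob that is never decoded as text.

```python
def parse_header(line: bytes):
    """Split one HTTP-style header line into (name, value), both as bytes."""
    name, sep, value = line.partition(b":")
    if not sep:
        raise ValueError("missing ':' separator")
    # bytes.lower() and bytes.strip() only touch ASCII letters and ASCII
    # whitespace, so non-ASCII bytes in the value pass through unchanged.
    return name.strip().lower(), value.strip()

content_type = parse_header(b"Content-Type: text/plain; charset=utf-8\r\n")

# The value contains UTF-8 encoded "é"; the parser neither knows nor cares.
custom = parse_header(b"X-Name: caf\xc3\xa9\r\n")
```

No string type is involved at any point, which is why this use case says nothing about what a text datatype needs.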

> , or systems with very
> long uptimes that process arbitrary user-supplied data.
What relevance does this have to O(1) string indexing?

> I am coming at this from "I use binaries as strings now, and want something
> even better" point of view.
Then you should ask for improvements to binaries, not for making a useless
string type.

I have the same view on this as Richard: for text (which is what Unicode was
built for), O(1) indexing of code units and code points is useless.

I also think the implementation details and performance characteristics of
a string datatype should not be considered at all before it is
implemented; the first question should be what its interface looks like.