[erlang-questions] Erlang string datatype [cookbook entry #1 - unicode/UTF-8 strings]

Mon Oct 31 04:28:51 CET 2011

> > , or systems with very
> > long uptimes that process arbitrary user-supplied data.
>
> What relevance does this have to O(1) string indexing?

Clearly, that comment was about the string interning discussion, not
O(1) indexing.

Sincerely,

jw

--
Americans might object: there is no way we would sacrifice our living
standards for the benefit of people in the rest of the world. Nevertheless,
whether we get there willingly or not, we shall soon have lower consumption
rates, because our present rates are unsustainable.

On Sun, Oct 30, 2011 at 3:19 AM, Masklinn <masklinn@REDACTED> wrote:

> On 2011-10-30, at 02:20 , Jon Watte wrote:
> >
> > Life is too short to not normalize data on input, in my opinion.
> "Normalization" does not mean "NFC". And NFC is not the best
> normalization form for all situations. Its only advantage really
> is in codepoint count.
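
A minimal sketch of the codepoint-count point above, assuming Erlang/OTP 20 or later (the normalization functions in the `unicode` module postdate this thread). The module name `norm_forms` and the function `lengths/1` are hypothetical, chosen only for illustration.

```erlang
-module(norm_forms).
-export([lengths/1]).

%% Code point counts of the same text under the four Unicode
%% normalization forms. Chars is any valid unicode:chardata().
lengths(Chars) ->
    [{nfc,  length(unicode:characters_to_nfc_list(Chars))},
     {nfd,  length(unicode:characters_to_nfd_list(Chars))},
     {nfkc, length(unicode:characters_to_nfkc_list(Chars))},
     {nfkd, length(unicode:characters_to_nfkd_list(Chars))}].
```

For example, `norm_forms:lengths([16#E9])` (a precomposed "é") gives `[{nfc,1},{nfd,2},{nfkc,1},{nfkd,2}]`: all four forms denote the same text, and the composed forms are simply shorter.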
>
> > However,
> > the specific examples I care about are all about parsing existing
> > protocols, where all the "interesting" bits are defined as ASCII subset,
> > and anything outside that can be lumped into a single "string of other
> > data" class without loss of generality. This includes things like HTTP or
> > MIME. Your applications may vary.
> >
> I think that is *by far* the biggest issue with many string datatypes:
> they double up as both "lightweight" structures in many ascii-based
> protocols, with those structures being either fixed-size allowing O(1)
> access (e.g. many logfile formats) or simple character-separated
> structures; and as actual encoding of *human text*, which is what Unicode
> was built for.
>
> These usages are completely at odds with one another: a byte/ASCII-based
> structure is an array of bytes, some of which are visibly representable,
> but a Unicode string is a *stream*. It is an array of codepoints, but
> codepoints are useless for manipulating text; UTFs also map it to arrays
> of bytes, but these byte arrays are just as useless for manipulating text.
>
> To manipulate text correctly, the primary code interface should be the
> grapheme cluster (what most people think of as a "character"). You
> *still* encounter the issue that a given codepoint sequence can be seen
> as one or several "characters" depending on the culture, but I think
> *that* issue is much rarer than trying to manipulate grapheme clusters
> with codepoint-based interfaces. Furthermore, this manipulation should
> be done completely independently of the underlying physical
> representation (a grapheme cluster is expressed in terms of code points,
> not in terms of whatever code units the UTF uses).
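
The three levels named above (code units, code points, grapheme clusters) can be sketched as follows. This assumes Erlang/OTP 20 or later, where `string:length/1` counts grapheme clusters; that behaviour postdates this thread. The module name `text_units` and the function `units/1` are hypothetical.

```erlang
-module(text_units).
-export([units/1]).

%% Count one piece of text three ways.
%% The input is assumed to be a valid UTF-8 encoded binary.
units(Utf8) when is_binary(Utf8) ->
    CodeUnits  = byte_size(Utf8),                            % UTF-8 code units (bytes)
    CodePoints = length(unicode:characters_to_list(Utf8)),   % Unicode code points
    Graphemes  = string:length(Utf8),                        % grapheme clusters
    {CodeUnits, CodePoints, Graphemes}.
```

For "e" followed by U+0301 (combining acute accent), `text_units:units(<<"e", 16#301/utf8>>)` returns `{3,2,1}`: three bytes, two code points, one user-perceived character.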
>
> NSString is one of the very few string datatypes I've seen that makes
> treating text correctly easy (although it also includes the
> bytes/codepoint array stuff): while its primary interface is not
> grapheme clusters, it provides a very extensive interface for
> manipulating text in terms of grapheme clusters. I think its only issue
> is that it still allows strings to be manipulated in terms of codepoints
> and code units at all.
>
> Apple even has a document on that very subject:
>
> http://developer.apple.com/library/ios/#documentation/Cocoa/Conceptual/Strings/Articles/stringsClusters.html#//apple_ref/doc/uid/TP40008025
> > It's common to think of a string as a sequence of characters,
> > but when working with NSString objects, or with Unicode strings in
> > general, in most cases it is better to deal with substrings rather
> > than with individual characters. The reason for this is that what
> > the user perceives as a character in text may in many cases be
> > represented by multiple characters in the string.
>
> >>> off very nicely.  It seems truly bizarre to want string _indexing_
> >>> (something I never find useful, given the high strangeness of Unicode)
> >>> to be O(1) but not to want string _equality_ (something I do a lot)
> >>> to be O(1).
> >>
> >
> > It seems like you never do network protocol parsing
> Network protocol parsing is the manipulation of a byte stream or a byte
> array; by trying to fit this use case into a string datatype you're only
> ensuring this datatype will be garbage to use for text manipulation.
>
> You already have an Erlang datatype to do network protocol parsing:
> binaries.
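
As a rough sketch of that point, an HTTP-style request line can be split with byte-level operations on a binary, with no string datatype involved. The module name `req_line`, the function `parse/1`, and the return shapes are hypothetical; this is not a complete or validating HTTP parser.

```erlang
-module(req_line).
-export([parse/1]).

%% Split "METHOD SP TARGET SP HTTP/VERSION CRLF" using only byte-level
%% operations. Bytes outside the ASCII delimiters are passed through
%% untouched as opaque data.
parse(Bin) when is_binary(Bin) ->
    case binary:split(Bin, <<"\r\n">>) of
        [Line, Rest] ->
            case binary:split(Line, <<" ">>, [global]) of
                [Method, Target, <<"HTTP/", Version/binary>>] ->
                    {ok, {Method, Target, Version}, Rest};
                _ ->
                    {error, bad_request_line}
            end;
        [_] ->
            %% No CRLF yet: the caller should read more bytes and retry.
            {more, Bin}
    end.
```

Calling `req_line:parse(<<"GET /index.html HTTP/1.1\r\nHost: x\r\n\r\n">>)` yields `{ok, {<<"GET">>, <<"/index.html">>, <<"1.1">>}, <<"Host: x\r\n\r\n">>}`. The interesting parts are matched as ASCII bytes and everything else stays an uninterpreted binary.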
>
> > , or systems with very
> > long uptimes that process arbitrary user-supplied data.
> What relevance does this have to O(1) string indexing?
>
> > I am coming at this from "I use binaries as strings now, and want
> > something even better" point of view.
> Then you should ask for improvements of binaries, not for making a useless
> string type.
>
> I have the same view on this as Richard: for text (which is what Unicode
> was built for), O(1) indexing of code units and code points is useless.
>
> I also think the implementation details and performance characteristics of
> a string datatype should not be considered at all before it is
> implemented; the first question should be what its interface looks like.