[erlang-questions] Erlang string datatype [cookbook entry #1 - unicode/UTF-8 strings]

Richard O'Keefe ok@REDACTED
Fri Oct 28 02:52:40 CEST 2011


On 28/10/2011, at 11:40 AM, Jon Watte wrote:
> > 4) Random access is important! As is slicing. string:substr(from,to) should be O(1).
> 
> If you think random access is important, then 2-byte elements are just wrong.
> 
> I would be wrong in the same way that Windows is wrong. It uses 2-byte wchar_t, and seems to be enjoying a pretty good adoption in the market.

Windows adopted 16 bit characters back when Unicode was a 16-bit character set and
people hoped it would stay that way.  The point is not that it is wrong to use
UTF-16, but that IF YOU WANT O(1) INDEXING OF UNICODE CODE POINTS it is wrong to
use UTF-16.  Take Java as an example.

	s.charAt(i)		O(1), goes horribly wrong if there are
				any characters >U+FFFF in the string.
	s.codePointAt(j)	O(1), but takes a *char* index; finding
				the index of the ith code point (via
				offsetByCodePoints) is O(i), not O(1).

	s.length()		O(1), goes horribly wrong if there are
				any characters >U+FFFF in the string.
	s.codePointCount(0, s.length())
				returns the number of Unicode characters
				correctly, but is O(s.length()), not O(1).

If you think random access is important, and you think 16-bit characters are
a good idea, then you have to *ensure* that there are no wider characters in
your input.
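To make this concrete, here is a small Erlang sketch using the standard
unicode module (the particular code point is just an example) showing that
a code point above U+FFFF occupies two 16-bit units in UTF-16:

```erlang
%% U+1D11E (MUSICAL SYMBOL G CLEF) lies outside the Basic Multilingual Plane.
Utf16 = unicode:characters_to_binary([16#1D11E], unicode, utf16),
%% One code point, but four bytes -- two 16-bit units (a surrogate pair) --
%% so indexing 16-bit units is not the same as indexing code points.
4 = byte_size(Utf16).
```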

Now as it happens I *don't* think random access is important.  Here's why.
Consider Ṏ.  There are three ways to encode that:
	O +  ̃ +  ̈
	Õ +  ̈
	Ṏ
This one "user character" may be one, two, or three code points, and
unless you are religious about normalising all inputs, it isn't YOUR choice
which.  (By the way, all the code points in this example fit in 16 bits.)
So you pick up the codepoint at index 137.
Big deal.  What does _that_ tell you?
Unless you check the surrounding code points as well,
you don't know whether you have *A* character or *PART OF* a character.
There are even some "named character sequences" that can be represented
ONLY as a non-trivial sequence of code points.
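The three encodings above can be written out as Erlang code point lists; a
sketch, assuming OTP 20 or later for unicode:characters_to_nfc_list/1:

```erlang
A = [$O, 16#0303, 16#0308],  %% O + COMBINING TILDE + COMBINING DIAERESIS
B = [16#00D5, 16#0308],      %% Õ + COMBINING DIAERESIS
C = [16#1E4E],               %% precomposed Ṏ
%% Three different code point sequences for one "user character";
%% they only compare equal after normalising, here to NFC:
C = unicode:characters_to_nfc_list(A),
C = unicode:characters_to_nfc_list(B).
```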

Unicode really is designed to be processed as a *STREAM* of codepoints
and there is a fairly elaborate state (including for example the
directionality stack, variation selectors, and language tags) which
needs to be maintained for full interpretation.  Again, your problem may
not need any of that stuff, but it is then YOUR responsibility to CHECK
that none of it appears.  The Japanese in particular have a number of
characters that they would like to regard as distinct but are unified
in Unicode CJK, so you have to keep track of which is which using
variation selectors.  This means that s[i] == s[j] does not mean that
s[i] and s[j] represent the same character.
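A sketch of that last point (U+8FBB is one of the CJK ideographs with
registered variation sequences; the list here is just an illustration):

```erlang
%% U+FE00 is VARIATION SELECTOR-1; it modifies the PRECEDING code point.
S = [16#8FBB, 16#FE00, $., 16#8FBB],
%% The 1st and 4th elements are the same code point ...
true = (lists:nth(1, S) =:= lists:nth(4, S)),
%% ... but only the first occurrence carries a variation selector, so the
%% two positions need not represent the same character as the user sees it.
true = (lists:nth(2, S) =:= 16#FE00).
```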

Unicode is a far worse nightmare than most programmers seem to realise.

> > 9) Do not intern strings, ever, for any reason.
> 
> This is surely the programmer's choice.  My XML library code in C interns everything
> all the time, and it pays off very nicely indeed.  My Smalltalk compiler interns
> everything (file names, identifiers, string literals, number literals, although
> number literals are interned _as_ number records, not as strings) and again it pays
> off very nicely.  It seems truly bizarre to want string _indexing_ (something I never
> find useful, given the high strangeness of Unicode) to be O(1) but not to want
> string _equality_ (something I do a lot) to be O(1).
> 
> It would be GREAT if string equality could be O(1). However, the runtime cost for that is too high in my opinion.

I repeat: this is surely the PROGRAMMER'S CHOICE.  If I want certain strings to be interned,
I don't see why "do not intern strings, ever, for any reason" should forbid me doing so.

> You basically could not have 100% uptime, ever, if you allow strings to be interned,

That's different.  "Do not intern strings ever for any reason" and "do not automatically
intern all strings" are different.  I can agree with you that "do not automatically
intern all strings" is a bad idea, although SNOBOL 4 did exactly that and it worked
pretty darned well.  I'm still waiting for a reason why the PROGRAMMER should not be
allowed to intern SOME strings SOME of the time for specific purposes.

> and send things like log file formatting through that system.

> Similarly, interning strings, and using that for equality, would mean that the interning system would have to work cross-process both for short strings and long strings, assuming a shared heap approach similar to binaries is used for long strings, which may end up requiring a lot more locking than would be healthy on most modern MP systems.

You are now talking about interning ALL strings ALL the time for NO specific reason.

You are also ignoring the incremental interning scheme described in one of the EEPs I wrote.

There is a lapse in logic.  "Interning strings, and using that for equality" would *NOT*
"mean that the interning system would have to work cross-process".  Each process could
have its OWN string table.  It is never possible to compare a string in one process with
a string in another process:  in order to do the comparison one of the strings has to be
sent to the other process, and the reception machinery can re-intern the string in the
new owner's table.  And that means that the claimed extra locking does not happen.

This is why interned strings would be BETTER than using Erlang atoms.
The Erlang atom table IS a shared resource and using it DOES require locking.
But per-process interning does not.

Right now we CAN program per-process interning using the process dictionary:

	intern(S) ->
	    case get({interned, S}) of
	        undefined ->
	            V = list_to_binary(S),
	            put({interned, S}, V),
	            V;
	        V when is_binary(V) ->   %% interned values are binaries,
	            V                    %% so the guard must be is_binary/1
	    end.

Having an interned-string data type would allow interned strings to be garbage collected
easily.

I'm not saying that this is particularly important; but I am saying that
we've yet to see any reason why it's a bad idea.

> Btw: the reason I want indexing is that the vast majority of string operations that actually care about what the data represents comes in parsing text file formats and text protocols -- anything from XML to JSON to HTTP to SMTP to MIME. Those specifications work just fine without considering express ligatures, composed diacriticals, or any other messy people-language details, because they are all defined in terms of easily indexable operations, hence why I'd like that support.

These notations were designed to be parsed left to right one character at a time,
NOT to be processed by random access.  I've written XML parsers in C, Erlang, Haskell,
Smalltalk, and a couple of other languages, and random access was *never* useful.
I've also written JSON parsers and unparsers in a couple of languages, and once again,
random access was *never* useful.  I haven't done MIME, but I once did SMTP, and there
again, random access was not useful.
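A minimal sketch of what such left-to-right parsing looks like over an
Erlang code point list -- just whitespace skipping and reading a decimal
number, with the hypothetical helpers skip_ws/1 and number/2 standing in
for a real tokeniser:

```erlang
%% Consume leading ASCII whitespace from the head of the list.
skip_ws([C | Rest]) when C =:= $\s; C =:= $\t; C =:= $\r; C =:= $\n ->
    skip_ws(Rest);
skip_ws(L) ->
    L.

%% Accumulate decimal digits from the head of the list; return the value
%% and the unconsumed tail.  No index arithmetic anywhere.
number([C | Rest], Acc) when C >= $0, C =< $9 ->
    number(Rest, Acc * 10 + (C - $0));
number(Rest, Acc) ->
    {Acc, Rest}.
```

For example, number(skip_ws("  42, true"), 0) yields {42, ", true"}: the
parser only ever looks at the head of the list and hands the tail on.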
