Strings (was: Re: are Mnesia tables immutable?)

Richard A. O'Keefe ok@REDACTED
Thu Jun 29 08:41:22 CEST 2006


	Richard A. O'Keefe wrote:
	
	> So we have two possible approaches here:
	
Andrew Lentvorski <bsder@REDACTED> replied:
	We have more than that, but how about choice 0:
	
	0) We leave strings alone and simply declare them by fiat to be
	lists of integers and encoded as UTF-8.
	
But that is an incompatible change.

	This has the advantage that strings survive very nicely inside BEAM 
	files without making any code changes to the Erlang system.

Suppose you have string data held in dets/mnesia in R11.
Then along comes R12 and says that strings use UTF-8.
BOOM!  A whole lot of data breaks badly.

_NOT_ a good idea.
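
To make the breakage concrete, here is a minimal sketch in Python rather than Erlang, with an Erlang string modelled as a plain list of integers. The stored value and the helper `read_as_utf8` are made up for illustration; nothing here is an actual dets/mnesia API.

```python
# An Erlang-style string: a list of integers, one per Latin 1 character.
# This is "café" as it might already sit in a table today (hypothetical data).
stored = [99, 97, 102, 233]

def read_as_utf8(elements):
    """Treat each list element as one UTF-8 byte and decode (hypothetical helper)."""
    return bytes(elements).decode("utf-8")

try:
    read_as_utf8(stored)
    outcome = "decoded"
except UnicodeDecodeError:
    # 233 (0xE9) announces a three-byte sequence, but nothing follows it.
    outcome = "invalid UTF-8"

print(outcome)  # invalid UTF-8
```

The old data did not change; only the rule for reading it did, and that is enough to make it unreadable.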

Also, a whole lot of code that assumes one-list-element-equals-one-codepoint
breaks badly too.

_NOT_ a good idea.
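
A small sketch of what goes wrong for such code, again in Python with lists of integers standing in for Erlang strings. Both helper names are hypothetical.

```python
def to_utf8_elements(text):
    """Model 'choice 0': one list element per UTF-8 byte."""
    return list(text.encode("utf-8"))

def to_codepoint_elements(text):
    """Model the existing convention: one list element per code point."""
    return [ord(ch) for ch in text]

word = "na\u00efve"  # "naïve", five characters

as_points = to_codepoint_elements(word)
as_bytes = to_utf8_elements(word)

print(len(as_points))  # 5
print(len(as_bytes))   # 6, because the i-diaeresis takes two bytes
print(as_bytes[2:4])   # [195, 175]
```

Anything that takes the length, indexes the Nth character, or reverses the list is now working on bytes, not characters.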

Keeping one-list-element-equals-one-codepoint means that existing
string data encoded as Latin 1 RETAINS ITS VALUE when interpreted
as Unicode, and existing code that assumes
one-list-element-equals-one-codepoint KEEPS ON WORKING.
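
The reason this works is that the first 256 Unicode code points are exactly the Latin 1 characters. A quick check in Python (a sketch only; the variable names are mine):

```python
latin1_elements = [99, 97, 102, 233]  # "café" as Latin 1 values

# Decode as Latin 1, then look at the resulting Unicode code points.
text = bytes(latin1_elements).decode("latin-1")
codepoints = [ord(ch) for ch in text]

print(codepoints == latin1_elements)  # True

# The same holds for every possible Latin 1 value, not just this string.
all_match = all(ord(bytes([b]).decode("latin-1")) == b for b in range(256))
print(all_match)  # True
```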

	It also means that the current term-to-binary stuff works just
	fine if a bit verbose.

As has already been noted, the current term-to-binary stuff doesn't
work as well as it could right now.  

	[UTF-8] is also very identifiable as it looks like ASCII or it
	looks like nothing else.

Not true.  I can find you a string as long as you want that could be
interpreted as UTF-8 or as Latin 1.
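
Here is one way to build such a string, sketched in Python. The function name `ambiguous` is hypothetical, and I am relying on Python's `latin-1` codec, which accepts every byte value.

```python
def ambiguous(n):
    """Return n bytes (n even) that decode cleanly under both encodings."""
    # 0xC3 0xA9 is "é" in UTF-8 and "Ã©" in Latin 1.
    return bytes([0xC3, 0xA9]) * (n // 2)

data = ambiguous(10)
as_utf8 = data.decode("utf-8")
as_latin1 = data.decode("latin-1")

print(as_utf8)    # ééééé
print(as_latin1)  # Ã©Ã©Ã©Ã©Ã©
```

Both readings are legal text, so nothing in the bytes themselves tells you which one was meant.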

True, there are all sorts of good things about UTF-8.  It's really cool
that modern systems come with UTF-8 locales set by default so I can type
practically _anything_ in TextEdit.  BUT it's a *Transformation* format,
that's what the "T" and "F" stand for: a way to encode text for storage
and interchange.  It was never designed to be used for serious *processing*.

UTF-8 is a great representation for C, but for a language where characters
never were stored as bytes in the first place it is pretty pointless.

Apparently people are already saying bad things about Erlang string
handling; what do you think they'll say when they hear that a single
character might require 8 words (32 bytes on a 32-bit machine), one
two-word list cell for each of up to four UTF-8 bytes?
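
The arithmetic behind that figure, as a Python sketch. The two constants are assumptions: a list cell of two words (head and tail) and a 32-bit machine with 4-byte words. `list_cost` is a hypothetical helper, not a measurement of a real Erlang heap.

```python
WORDS_PER_LIST_CELL = 2  # head + tail (assumed)
BYTES_PER_WORD = 4       # 32-bit machine (assumed)

def list_cost(text, per_byte):
    """Return (words, bytes) needed to hold text as a list of integers."""
    elements = len(text.encode("utf-8")) if per_byte else len(text)
    words = elements * WORDS_PER_LIST_CELL
    return words, words * BYTES_PER_WORD

ch = "\U0001D11E"  # MUSICAL SYMBOL G CLEF, four bytes in UTF-8

print(list_cost(ch, per_byte=True))   # (8, 32)
print(list_cost(ch, per_byte=False))  # (2, 8)
```

With one element per code point the same character costs a single cell.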



