[erlang-questions] Erlang string datatype [cookbook entry #1 - unicode/UTF-8 strings]

Sun Oct 30 02:20:24 CET 2011

> people hoped it would stay that way.  The point is not that it is wrong to use
> UTF-16, but that IF YOU WANT O(1) INDEXING OF UNICODE CODE POINTS it is wrong to
> use UTF-16.  Take Java as an example.
>
I would be perfectly fine with a proposal that said "we use 4-byte
characters, just like Linux wchar_t."
I would also be OK with a proposal that said "we use 2-byte characters,
just like Windows, and only support the 65,536-code-point subset (the Basic
Multilingual Plane)." That would give significantly better performance at
the cost of slightly worse coverage of ISO 10646.
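
To make the size trade-off concrete, here is a minimal sketch using the
standard unicode module. The module name cp_width and both function names are
made up for illustration; only unicode:characters_to_binary/3 is a real
library call.

```erlang
-module(cp_width).
-export([utf16_bytes/1, utf32_bytes/1]).

%% Bytes needed to store a single code point as big-endian UTF-16.
%% Code points above 16#FFFF need a surrogate pair, i.e. two 16-bit units.
utf16_bytes(CodePoint) ->
    byte_size(unicode:characters_to_binary([CodePoint], unicode, {utf16, big})).

%% Bytes needed to store a single code point as big-endian UTF-32.
%% Every code point takes one 32-bit unit, which is what makes O(1)
%% indexing by code point possible.
utf32_bytes(CodePoint) ->
    byte_size(unicode:characters_to_binary([CodePoint], unicode, {utf32, big})).
```

With this, cp_width:utf16_bytes($A) is 2 while cp_width:utf16_bytes(16#1D11E)
is 4, whereas cp_width:utf32_bytes/1 gives 4 for both. A fixed 2-byte
representation only indexes in O(1) if it refuses everything above 16#FFFF.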

> This one "user character" may be one, two, or three code points, and
> unless you are religious about normalising all inputs, it isn't YOUR choice
> which.  (By the way, all the code points in this example fit in 16 bits.)
>

Life is too short to not normalize data on input, in my opinion. However,
the specific examples I care about are all about parsing existing
protocols, where all the "interesting" bits are defined within the ASCII
subset, and anything outside that can be lumped into a single "string of
other data" class without loss of generality. This includes things like HTTP
or MIME. Your applications may vary.
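
Both halves of that paragraph can be sketched in a few lines. This is only an
illustration under assumptions: the module name proto_input and its functions
are hypothetical, and unicode:characters_to_nfc_binary/1 requires OTP 20 or
later (it did not exist when this thread was written). The header splitter
interprets only ASCII and passes the value through as opaque bytes.

```erlang
-module(proto_input).
-export([normalize/1, split_header/1]).

%% Normalize text once, at the input boundary, to NFC encoded as UTF-8.
normalize(Input) ->
    unicode:characters_to_nfc_binary(Input).

%% Split a header line such as <<"Content-Type: text/plain">> at the first
%% colon. Only the name is interpreted (ASCII-lowercased); the value is
%% returned untouched, whatever bytes it contains.
split_header(Line) when is_binary(Line) ->
    case binary:split(Line, <<":">>) of
        [Name, Value] -> {ok, ascii_lowercase(Name), Value};
        [_NoColon]    -> error
    end.

%% Lowercase the bytes A..Z only; every other byte is copied as-is.
ascii_lowercase(Bin) ->
    << <<(ascii_lower(C))>> || <<C>> <= Bin >>.

ascii_lower(C) when C >= $A, C =< $Z -> C + 32;
ascii_lower(C) -> C.
```

For example, proto_input:normalize([$e, 16#301]) and
proto_input:normalize([16#E9]) produce the same binary, so later comparisons
do not have to care which form the sender chose.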

> > > 9) Do not intern strings, ever, for any reason.
> >
> > This is surely the programmer's choice.  My XML library code C interns everything
> > all the time, and it pays off very nicely indeed.  My Smalltalk compiler interns
>

As long as you do not allow users to feed data into your library, perhaps,
and/or create a new _OS_ process for each document. For systems with uptime
requirements, interned strings are one of the worst offenders for "easy to
miss" bugs.

But Erlang already has literals: they're called atoms. Let's not re-invent
them. string_to_atom() would be a fine function for those who want to do
that. string_to_interned_string() would not. (Here, I think systems like C#
get it wrong.)

> > off very nicely.  It seems truly bizarre to want string _indexing_ (something I never
> > find useful, given the high strangeness of Unicode) to be O(1) but not to want
> > string _equality_ (something I do a lot) to be O(1).
>

It seems like you never do network protocol parsing, or systems with very
long uptimes that process arbitrary user-supplied data.

> I repeat: this is surely the PROGRAMMER'S CHOICE.  If I want certain strings to be interned,
> I don't see why "do not intern strings, ever, for any reason" should forbid me doing so.
>

Turn the string to an atom. Done! Then you know it's interned, and it is
type-distinct from "string." That's all I want, and I want this because
strings that can be interned or not have turned out to be a liability in
practice, and strings that are always interned are only useful in
short-running systems.
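
A rough sketch of what "turn the string to an atom" looks like with today's
BIFs. The module and function names are hypothetical stand-ins for the
string_to_atom() idea; list_to_atom/1 and list_to_existing_atom/1 are the real
calls. Note that atoms are never garbage collected and the atom table has a
fixed limit, so the first function should only ever see trusted input.

```erlang
-module(intern_demo).
-export([intern/1, intern_existing/1]).

%% "Intern" a string by converting it to an atom. The result is
%% type-distinct from a string, and comparing two atoms is a cheap
%% comparison of immediate values. Only safe for trusted input.
intern(Str) when is_list(Str) ->
    list_to_atom(Str).

%% Variant for user-supplied data: succeeds only if the atom already
%% exists, so arbitrary input cannot grow the atom table.
intern_existing(Str) when is_list(Str) ->
    try list_to_existing_atom(Str) of
        Atom -> {ok, Atom}
    catch
        error:badarg -> error
    end.
```

Because the caller gets an atom back, there is never any doubt about whether a
given value is interned, which is the property argued for above.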

> > Similarly, interning strings, and using that for equality, would mean that the
> > interning system would have to work cross-process both for short strings and long
> > strings, assuming a shared heap approach similar to binaries is used for long strings,
> > which may end up requiring a lot more locking than would be healthy on most modern
> > MP systems.
>
> You are now talking about interning ALL strings ALL the time for NO specific reason.
>
Nope. If you intern even a single string, and make the rule that all strings
with the same character sequence must have the same pointer value (pretty
common for interned string implementations -- think about it!), then all
string operations need to do global heap locking of one form or another. You
can shard your heaps/locks, you can do all kinds of tricks, but in the end,
what I said is true as long as you support interning a single string, and
let the type still remain "string." Interning a string, returning type
"atom," is much better, for this very reason (and others, IMO :-)

> "mean that the interning system would have to work cross-process".  Each process could
> have its OWN string table.  It is never possible to compare a string in one process with
>

I am coming at this from an "I use binaries as strings now, and want
something even better" point of view. Binaries are shared across processes,
because sending large binaries (or sub-binaries) across processes is common
-- again, for network/protocol systems -- and is optimized through this
implementation. Interning, however, adds a different level of locking and
complication.
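
For readers who have not looked at this before, here is a small sketch of the
binary behaviour being relied on. The module name bin_share and the sizes are
arbitrary choices for illustration; binary:copy/2, binary:part/3 and
binary:referenced_byte_size/1 are standard library calls. Exactly when the
runtime shares rather than copies is an implementation detail that may differ
between OTP releases, so the code reports what it observes instead of
promising a result.

```erlang
-module(bin_share).
-export([demo/0]).

%% Build a large binary, slice it, send the slice to another process and
%% report {SliceSize, ReferencedSize} as seen by the receiver.
demo() ->
    Big = binary:copy(<<"x">>, 1024),   % well above the 64-byte heap-binary limit
    Sub = binary:part(Big, 0, 512),     % a sub-binary pointing into Big
    Parent = self(),
    Pid = spawn(fun() ->
                    receive
                        {bin, B} ->
                            Parent ! {seen, self(), byte_size(B),
                                      binary:referenced_byte_size(B)}
                    end
                end),
    Pid ! {bin, Sub},
    receive
        {seen, Pid, Size, Referenced} -> {Size, Referenced}
    after 1000 ->
        timeout
    end.
```

If the second element of the result is larger than the first, the receiving
process is holding a reference into the original 1024-byte binary rather than
its own copy of the 512-byte slice.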

Anyway, that's about as far as I go with my defense of my particular
opinions. They clearly come from a different background than your opinions,
and if Erlang sprouted a string system that had 8 of my 10 requests, well,
that would be super-duper-sweet!

Sincerely,

jw