[erlang-questions] Erlang string datatype [cookbook entry #1 - unicode/UTF-8 strings]

Jon Watte jwatte@REDACTED
Fri Oct 28 00:40:13 CEST 2011

> On 22/10/2011, at 10:12 AM, Jon Watte wrote:
> > 2) Strings probably should be of "code points" not "bytes," so they would
> > use 2-byte or 4-byte characters. Note that Windows uses sizeof(wchar_t) ==
> > 2, and Linux uses sizeof(wchar_t) == 4, so there's no unambiguously "right"
> > choice.
>
> This claim is inconsistent with claim (4).

Sorry, when I say "2-byte or 4-byte characters," I mean *one* of those, for
all strings.

> > 4) Random access is important! As is slicing. string:substr(from,to)
> > should be O(1).
>
> If you think random access is important, then 2-byte elements are just
> wrong.

I would be wrong in the same way that Windows is wrong. It uses 2-byte
wchar_t, and seems to be enjoying a pretty good adoption in the market.
Sure, if you want the choice to be 4 bytes, that's a totally valid opinion.
My argument is "pick one code point size, based on the engineering
environment we're solving for, and stick with it" and then make indexed code
point random access O(1). Note that I'd count an expressed ligature as a
single code point at this point. I'm not saying "do magic based on
arbitrarily messy semantic meaning" -- that needs to have a different level
of library support.
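
To make the fixed-width idea concrete, here is a minimal sketch using what Erlang has today. It assumes the 4-byte choice (UTF-32, big-endian) stored in a binary; the module and function names are made up for illustration and are not an existing string API.

```erlang
-module(fixed_width).
-export([from_list/1, at/2, slice/3]).

%% Encode a list of code points as UTF-32 (big-endian): 4 bytes each.
from_list(CodePoints) ->
    unicode:characters_to_binary(CodePoints, unicode, {utf32, big}).

%% Code point at zero-based Index. The byte offset is computed
%% directly from the index, so no scan over earlier characters is needed.
at(Bin, Index) ->
    Offset = Index * 4,
    <<_:Offset/binary, CodePoint:32/big, _/binary>> = Bin,
    CodePoint.

%% Zero-based slice of Len code points, returned as a sub-binary.
slice(Bin, Start, Len) ->
    binary:part(Bin, Start * 4, Len * 4).
```

With a 2-byte representation the same arithmetic would index UTF-16 code units, so anything outside the Basic Multilingual Plane would occupy two slots.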

> > 6) Ideally, reference count immutable string data so that substring
> > extraction is cheap.
>
> Java has cheap substring extraction without reference counting.
> However, be aware that doing this can lead to *huge* space leaks with very
> large source strings being retained for the sake of comparatively small
> substrings; there is a *reason* why Java's immutable String class comes
> with a copy() method!

I already made a suggestion for how to avoid the worst parts of this large
memory consumption behavior, amortized over time, in another one of these
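
For reference, Erlang binaries already show the same trade-off. The sketch below only illustrates the retention problem and the explicit-copy workaround; it is not the amortized scheme mentioned above, and the module and function names are hypothetical.

```erlang
-module(substr_leak).
-export([shared_slice/3, detached_slice/3]).

%% Cheap slice: a sub-binary that still references the source binary,
%% which therefore stays alive as long as the slice does.
shared_slice(Bin, Start, Len) ->
    binary:part(Bin, Start, Len).

%% Detached slice: binary:copy/1 makes a fresh binary, so the large
%% source can be collected once nothing else refers to it.
detached_slice(Bin, Start, Len) ->
    binary:copy(binary:part(Bin, Start, Len)).
```

binary:referenced_byte_size/1 reports how much memory a slice is actually holding on to, which is one way to decide when the copy is worth paying for.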

> > 9) Do not intern strings, ever, for any reason.
>
> This is surely the programmer's choice.  My XML library code C interns
> everything all the time, and it pays off very nicely indeed.  My Smalltalk
> compiler interns everything (file names, identifiers, string literals,
> number literals, although number literals are interned _as_ number records,
> not as strings) and again it pays off very nicely.  It seems truly bizarre
> to want string _indexing_ (something I never find useful, given the high
> strangeness of Unicode) to be O(1) but not to want string _equality_
> (something I do a lot) to be O(1).

It would be GREAT if string equality could be O(1). However, the runtime
cost for that is too high in my opinion. You basically could not have 100%
uptime, ever, if you allow strings to be interned, and send things like log
file formatting through that system.

Similarly, interning strings and using that for equality would mean that the
interning system has to work across processes, for both short strings and
long strings (assuming long strings use a shared heap approach similar to
binaries). That may end up requiring a lot more locking than would be
healthy on most modern MP systems.

Erlang already has an interned atom system. If you want the benefits of
interning, I suggest you reuse that system. I have nothing against
"erlang:string_to_atom()" existing, but a string literal in the source, or
the output of string formatting, should never risk being interned as a
string. Or, to flip it around: If I use something that is a string, I should
not have to worry about interning eating my RAM forever. That's what atoms
are for.
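
As a sketch of opting in to interning through atoms without risking unbounded growth: binary_to_existing_atom/2 is a real BIF that only succeeds for atoms the system already knows, while the wrapper module and its return values are my own illustration.

```erlang
-module(intern_demo).
-export([intern_known/1]).

%% Map a binary to an atom only if that atom already exists.
%% Atoms are never garbage collected, so creating them from
%% arbitrary input would grow the atom table without bound.
intern_known(Bin) when is_binary(Bin) ->
    try
        {ok, binary_to_existing_atom(Bin, utf8)}
    catch
        error:badarg -> not_interned
    end.
```

Once two values are atoms, comparing them is a single word comparison, which is the O(1) equality being asked for. Everything that stays a binary is still collected normally.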

Btw: the reason I want indexing is that the vast majority of string
operations that actually care about what the data represents come up when
parsing text file formats and text protocols -- anything from XML to JSON to
HTTP to SMTP to MIME. Those specifications work just fine without
considering expressed ligatures, composed diacriticals, or any other messy
people-language details, because they are all defined in terms of easily
indexable operations, which is why I'd like that support.

Also, as I said initially: Binaries are almost good enough for most use
cases that I care about. It may be viable to extend the binary syntax and
capabilities, rather than introduce a new data type, and get close to the
same end goal.
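
To show what "almost good enough" looks like in practice, here is a rough sketch of parsing an HTTP-style request line using only binaries. The module name and the return shapes are assumptions for the example, and it ignores most of what a real HTTP parser has to handle.

```erlang
-module(request_line).
-export([parse/1]).

%% Parse "METHOD SP TARGET SP VERSION CRLF" from the front of a binary.
%% Returns the three fields plus the unparsed remainder, or {more, Bin}
%% when no complete line has arrived yet.
parse(Bin) when is_binary(Bin) ->
    case binary:split(Bin, <<"\r\n">>) of
        [Line, Rest] ->
            case binary:split(Line, <<" ">>, [global]) of
                [Method, Target, Version] ->
                    {ok, {Method, Target, Version}, Rest};
                _ ->
                    {error, bad_request_line}
            end;
        [_NoCrlf] ->
            {more, Bin}
    end.
```

Nothing here needs to know about ligatures or diacriticals; the delimiters are plain bytes and everything else is passed through untouched.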

If I did natural language processing, I might have a different set of goals
:-) Or, more likely, I'd just have a lot more requirements on the
higher-level library support, plus requirements related to making
implementing those libraries efficient.

