<div class="gmail_quote"><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">

people hoped it would stay that way.  The point is not that it is wrong to use<br>

UTF-16, but that IF YOU WANT O(1) INDEXING OF UNICODE CODE POINTS it is wrong to<br>

use UTF-16.  Take Java as an example.<br>

<br></blockquote><div><br></div><div>I would be perfectly fine with a proposal that said "we use 4-byte characters, just like Linux wchar_t."</div><div>I would also be OK with a proposal that said "we use 2-byte characters, just like Windows, and only support the 65,536 code point subset."</div>

<div>That would give significantly better performance, at the cost of slightly worse coverage of ISO 10646.</div><div> </div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">

This one "user character" may be one, two, or three code points, and<br>

unless you are religious about normalising all inputs, it isn't YOUR choice<br>

which.  (By the way, all the code points in this example fit in 16 bits.)<br></blockquote><div><br></div><div>Life is too short to not normalize data on input, in my opinion. However, the specific examples I care about are all about parsing existing protocols, where all the "interesting" bits are defined within the ASCII subset, and anything outside that can be lumped into a single "string of other data" class without loss of generality. This includes things like HTTP or MIME. Your applications may vary.</div>
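<div><br></div><div>A minimal sketch of the kind of parsing I mean, written in Python rather than Erlang purely for brevity. The function name split_header and the sample header line are made up for illustration. The point is only that the "interesting" part is checked as ASCII, and everything else is carried along as opaque bytes and never decoded.</div>
<pre>
```python
def split_header(line):
    """Split one header line into (name, value) without decoding it.

    The name must be pure ASCII; the value is kept as opaque bytes,
    so any non-ASCII content passes through untouched.
    """
    name, sep, value = line.partition(b":")
    name = name.strip()
    if not sep or not name or not name.isascii():
        raise ValueError("malformed header line")
    # bytes.lower() only touches ASCII letters, which is all a name may hold.
    return name.lower(), value.strip()


# The value contains UTF-8 encoded non-ASCII bytes; the parser never looks inside.
name, value = split_header(b"Subject: caf\xc3\xa9 menu")
```
</pre>
<div>No code point indexing is needed anywhere in this: the parser only ever looks for ASCII delimiters.</div>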

<div> </div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;"><div class="im">

> > 9) Do not intern strings, ever, for any reason.<br>

><br>

> This is surely the programmer's choice.  My XML library code C interns everything<br>

> all the time, and it pays off very nicely indeed.  My Smalltalk compiler interns<br>

</div></blockquote><div><br></div><div><br></div><div>That works as long as you do not allow users to feed data into your library, perhaps, and/or you create a new _OS_ process for each document. For systems with uptime requirements, interned strings are one of the worst offenders for "easy to miss" bugs, because interned data is typically never freed.</div>

<div><br></div><div>But Erlang already has interned literals: they're called atoms. Let's not re-invent them. string_to_atom() would be a fine function for those who want to do that. string_to_interned_string() would not. (Here, I think systems like C# get it wrong.)</div>

<div><br></div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;"><div class="im">> off very nicely.  It seems truly bizarre to want string _indexing_ (something I never<br>


> find useful, given the high strangeness of Unicode) to be O(1) but not to want<br>

> string _equality_ (something I do a lot) to be O(1).<br>

</div></blockquote><div><br></div><div><br></div><div>It seems like you never do network protocol parsing, or build systems with very long uptimes that process arbitrary user-supplied data.</div><div><br></div><div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">

I repeat: this is surely the PROGRAMMER'S CHOICE.  If I want certain strings to be interned,<br>

I don't see why "do not intern strings, ever, for any reason" should forbid me doing so.<br>

<div class="im"><br></div></blockquote><div><br></div><div><br></div><div>Turn the string into an atom. Done! Then you know it's interned, and it is type-distinct from "string." That's all I want, and I want this because strings that may or may not be interned have turned out to be a liability in practice, and strings that are always interned are only useful in short-running systems.</div>

<div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;"><div class="im">

<br>

> Similarly, interning strings, and using that for equality, would mean that the interning system would have to work cross-process both for short strings and long strings, assuming a shared heap approach similar to binaries is used for long strings, which may end up requiring a lot more locking than would be healthy on most modern MP systems.<br>


<br>

</div>You are now talking about interning ALL strings ALL the time for NO specific reason.<br>

<br></blockquote><div><br></div><div>Nope. If you intern even a single string, and make the rule that all strings with the same character sequence must have the same pointer value (pretty common for interned string implementations -- think about it!), then all string operations need to do global heap locking of one form or another. You can shard your heaps/locks, you can do all kinds of tricks, but in the end, what I said is true as long as you support interning a single string and let the type still remain "string." Interning a string and returning type "atom" is much better, for this very reason (and others, IMO :-)</div>
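<div><br></div><div>To make the locking point concrete, here is a rough sketch in Python (not Erlang, and not how any real VM does it). InternTable, intern and concat are hypothetical names. It assumes the rule described above, that equal contents must yield the same object, and shows that once interned strings keep the plain string type, every operation that produces a string has to go through the shared, locked table.</div>
<pre>
```python
import threading


class InternTable:
    """Hypothetical intern table: equal contents always map to one object."""

    def __init__(self):
        self._lock = threading.Lock()
        self._table = {}

    def intern(self, data):
        # Every caller, on every thread, has to pass through this one lock
        # to keep the "same contents, same object" rule intact.
        with self._lock:
            return self._table.setdefault(data, data)

    def concat(self, left, right):
        # If interned and ordinary strings share a type, even a plain
        # concatenation must consult the table before returning.
        return self.intern(left + right)


table = InternTable()
a = table.intern(bytes([104, 105]))        # b"hi", built one way
b = table.intern(bytes(bytearray(b"hi")))  # b"hi", built another way
same_object = a is b                       # identity, not just equality
```
</pre>
<div>Note that nothing ever leaves _table in this sketch, which is the uptime problem in miniature.</div>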

<div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">"mean that the interning system would have to work cross-process".  Each process could<br>

have its OWN string table.  It is never possible to compare a string in one process with<br>

</blockquote><div><br></div><div><br></div><div>I am coming at this from an "I use binaries as strings now, and want something even better" point of view. Binaries are shared across processes, because sending large binaries (or sub-binaries) between processes is common -- again, for network/protocol systems -- and this implementation optimizes for it. Interning, however, adds a different level of locking and complication.</div>

<div><br></div><div>Anyway, that's about as far as I go with my defense of my particular opinions. They clearly come from a different background than your opinions, and if Erlang sprouted a string system that had 8 of my 10 requests, well, that would be super-duper-sweet!</div>

<div><br></div><div>Sincerely,</div><div><br></div><div>jw</div><div><br></div></div>