<div class="gmail_quote"><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">
people hoped it would stay that way. The point is not that it is wrong to use<br>
UTF-16, but that IF YOU WANT O(1) INDEXING OF UNICODE CODE POINTS it is wrong to<br>
use UTF-16. Take Java as an example.<br>
I would be perfectly fine with a proposal that said "we use 4-byte characters, just like Linux wchar_t."

I would also be OK with a proposal that said "we use 2-byte characters, just like Windows, and only support the 65535-character subset." Significantly better performance, slightly worse coverage of 10646.

This one "user character" may be one, two, or three code points, and<br>
unless you are religious about normalising all inputs, it isn't YOUR choice<br>
which. (By the way, all the code points in this example fit in 16 bits.)<br></blockquote><div><br></div><div>Life is too short to not normalize data on input, in my opinion. However, the specific examples I care about are all about parsing existing protocols, where all the "interesting" bits are defined as ASCII subset, and anything outside that can be lumped into a single "string of other data" class without loss of generality. This includes things like HTTP or MIME. Your applications may vary.</div>
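To make concrete the kind of parsing I mean, here is a minimal sketch in Erlang (the module and function names are made up for illustration): the "interesting" part of a header line is matched as plain ASCII, and whatever follows the separator is carried along as an opaque binary that never has to be decoded.

    %% Sketch only: split "Name: Value" header lines with binary matching.
    %% The name is ASCII; the value stays an opaque binary in whatever
    %% encoding it happens to use.
    -module(hdr_sketch).
    -export([parse/1]).

    parse(Line) when is_binary(Line) ->
        case binary:split(Line, <<": ">>) of
            [Name, Value] -> {ok, Name, Value};
            _             -> {error, malformed}
        end.

For example, hdr_sketch:parse(<<"Content-Type: text/plain">>) gives {ok, <<"Content-Type">>, <<"text/plain">>} without ever inspecting the value's contents.
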
> > > 9) Do not intern strings, ever, for any reason.
> >
> > This is surely the programmer's choice. My XML library code in C interns everything
> > all the time, and it pays off very nicely indeed. My Smalltalk compiler interns

Perhaps, as long as you do not allow users to feed data into your library, and/or you create a new _OS_ process for each document. For systems with uptime requirements, interned strings are one of the worst offenders for "easy to miss" bugs.

But Erlang already has interned strings: they're called atoms. Let's not re-invent them. string_to_atom() would be a fine function for those who want to do that; string_to_interned_string() would not. (Here, I think systems like C# get it wrong.)

> > off very nicely. It seems truly bizarre to want string _indexing_ (something I never
> > find useful, given the high strangeness of Unicode) to be O(1) but not to want
> > string _equality_ (something I do a lot) to be O(1).

It sounds like you never do network protocol parsing, or work on systems with very long uptimes that process arbitrary user-supplied data.

> I repeat: this is surely the PROGRAMMER'S CHOICE. If I want certain strings to be interned,
> I don't see why "do not intern strings, ever, for any reason" should forbid me doing so.

Turn the string into an atom. Done! Then you know it's interned, and it is type-distinct from "string." That's all I want. I want it because strings that may or may not be interned have turned out to be a liability in practice, and strings that are always interned are only useful in short-running systems.

> > Similarly, interning strings, and using that for equality, would mean that the interning
> > system would have to work cross-process both for short strings and long strings, assuming
> > a shared heap approach similar to binaries is used for long strings, which may end up
> > requiring a lot more locking than would be healthy on most modern MP systems.
>
> You are now talking about interning ALL strings ALL the time for NO specific reason.

Nope. If you intern even a single string, and make the rule that all strings with the same character sequence must have the same pointer value (pretty common for interned-string implementations -- think about it!), then all string operations need to do global heap locking of one form or another. You can shard your heaps/locks, you can do all kinds of tricks, but in the end, what I said holds as long as you support interning a single string and let the type still remain "string." Interning a string and returning type "atom" is much better, for this very reason (and others, IMO :-)

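To illustrate the shape of the problem, here is a sketch (hypothetical module, and not how any real runtime does it -- ETS copies data, so this does not even give true pointer identity): the "same characters => same canonical value" rule forces every intern operation through one shared table.

    -module(intern_sketch).
    -export([init/0, intern/1]).

    init() ->
        %% One global table that every interned string funnels through.
        ets:new(intern_tab, [set, named_table, public,
                             {read_concurrency, true}]).

    intern(Bin) when is_binary(Bin) ->
        case ets:lookup(intern_tab, Bin) of
            [{_, Canonical}] ->
                Canonical;
            [] ->
                %% insert_new keeps the canonical value unique under races,
                %% but every miss is still a write against the shared table.
                ets:insert_new(intern_tab, {Bin, Bin}),
                [{_, Canonical}] = ets:lookup(intern_tab, Bin),
                Canonical
        end.

Sharding the table just multiplies the locks; the coordination cost never goes to zero, which is the point above.
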
> "mean that the interning system would have to work cross-process". Each process could
> have its OWN string table. It is never possible to compare a string in one process with

I am coming at this from the "I use binaries as strings now, and want something even better" point of view. Binaries are shared across processes, because sending large binaries (or sub-binaries) between processes is common -- again, in network/protocol systems -- and the implementation is optimized for exactly that. Interning, however, adds a different level of locking and complication.

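For what it's worth, the sharing is easy to see from the shell (the sizes here are only an example):

    %% Matching a sub-binary out of a large payload copies nothing; both
    %% values reference the same off-heap bytes, which is also what makes
    %% sending them between processes cheap.
    Payload = binary:copy(<<"x">>, 1 bsl 20),   %% ~1 MB reference-counted binary
    <<Head:16/binary, _/binary>> = Payload,     %% Head is a sub-binary of Payload
    binary:referenced_byte_size(Head).          %% 1048576: Head still refers to the whole payload
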
Anyway, that's about as far as I go in defending my particular opinions. They clearly come from a different background than yours, and if Erlang sprouted a string system that met 8 of my 10 requests, well, that would be super-duper-sweet!

Sincerely,

jw