[erlang-questions] Binary string literal syntax

Thu Jun 7 00:48:14 CEST 2018

On Wednesday, June 6, 2018, 14:00:20 JST, Vlad Dumitrescu wrote:

> - The new string functions work with strings as sequences of lexemes. The
> "list strings" are lists of characters, so for example calling length() on
> the two representations of the same text may not return the same value.
> Most notably, CRLF is a lexeme, but two characters.

To expand on this point (as I've done before): many lexemes used in CJK have multiple constructions that are considered equivalent. Korean hangul is almost a pure example of this, as input is typically done over aggregate lexemes that are often, but not always, re-masked as a single codepoint once the input phase is complete. Pinyin input follows the same principle but has far more complex aggregate lexemes. Even Japanese has a few examples of this (is ぷ a single character or is it [[ふ,゜]] this particular time we encounter it?).
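
A rough sketch of what "multiple equivalent constructions" means at the codepoint level. It is in Python only because its standard library exposes Unicode normalization directly. The helper name same_text is made up for illustration, and the example assumes the combining form of the semi-voiced mark (U+309A), not the spacing form:

```python
import unicodedata

# Precomposed hiragana "pu": one codepoint ...
precomposed = "\u3077"
# ... versus "fu" followed by the combining semi-voiced sound mark.
combining = "\u3075\u309a"

# Hangul syllable "han" as one codepoint versus its three conjoining jamo.
syllable = "\ud55c"
jamo = "\u1112\u1161\u11ab"


def same_text(a: str, b: str) -> bool:
    """Compare two strings after canonical composition (NFC)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


if __name__ == "__main__":
    for a, b in [(precomposed, combining), (syllable, jamo)]:
        # codepoint counts, raw equality, equality after normalization
        print(len(a), len(b), a == b, same_text(a, b))
```

Run as a script, this prints `1 2 False True` for the kana pair and `1 3 False True` for the hangul pair: the codepoint counts differ and a raw comparison fails, yet both forms are the same text once normalized.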

etc.

Saying "unicode is the standard now, and UTF-8 is The One True Way" is also saying "a fantastically complex world of codepoints and construction indicators that allow for multiple representations as equivalent is now the standard". That doesn't do anything to solve the question of whether there should be a separate string type. It also ignores that Windows is natively UTF-16, not UTF-8 (though it works a lot better with UTF-8 these days).
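
The UTF-8 versus UTF-16 split also changes what "length" means before lexemes even enter the picture. A small sketch, in Python only for its built-in codecs; the sample text and the helper name code_units are illustrative assumptions:

```python
def code_units(text: str, encoding: str) -> int:
    """Count code units: bytes for UTF-8, 16-bit units for UTF-16."""
    raw = text.encode(encoding)
    return len(raw) // 2 if encoding.startswith("utf-16") else len(raw)


hangul = "\ud55c\uae00"  # two codepoints, both in the Basic Multilingual Plane
emoji = "\U0001f600"     # one codepoint outside the BMP

if __name__ == "__main__":
    for sample in (hangul, emoji):
        # codepoints, UTF-8 code units, UTF-16 code units ("-le" avoids a BOM)
        print(len(sample),
              code_units(sample, "utf-8"),
              code_units(sample, "utf-16-le"))
```

This prints `2 6 2` and then `1 4 2`, so the same text has three different "lengths" depending on whether you count codepoints, UTF-8 bytes, or UTF-16 units.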

Go read the Unicode standard. It's... well, just have fun reading it. I don't know anyone who understands *all* of it 100% -- because for most people the first 10% or so "works for me" (for mainstream CJK use probably the first 30% or so).

I suppose all of this is to say that when you compare the enormous number of corner cases in string handling against MERE SYNTAX, the syntax is such a trivial issue that it isn't even worth the time to think about. The reason this subject comes up once a year is that we all wish there were a magical set of characters we could write into source files that mean "just make this string work, regardless of what it is, because this is hard". We want a syntax that represents the system abstracting away from the underlying data and instead forcing it to mean what *we* mean -- which is insufficiently low-level for much of the work we do as programmers (so we're leaving a lot undefined there, which is bad). At what level do you need to interpret a string? The answer is different for different programs (someone writing a Korean input interpreter is in a very different position than someone writing a chat server). This desire is a kissing cousin of the grand desire for a unified method of l10n and i18n -- which turns out to be just as hard to get right as string handling because every case is different, often in a way that conflicts with another case you have to handle.

Blah blah. Strings are hard, and about half of what we think of as "strings" are really serializations of non-string data anyway.

-Craig