[erlang-questions] Binary string literal syntax

Lloyd R. Prentice lloyd@REDACTED
Thu Jun 7 16:11:58 CEST 2018


> Saying "unicode is the standard now, and UTF-8 is The One True Way" is also saying 
> "a fantastically complex world of codepoints and construction indicators that allow 
> for multiple representations as equivalent is now the standard". That doesn't do 
> anything to solve the question of whether there should be a separate string type. It 
> also ignores that Windows is natively UTF-16, not UTF-8 (though it works a lot better 
> with UTF-8 these days).

Hi Craig,

I’d like to say “right on!,” but I probably shouldn’t participate in this debate. 

For one, I’m not a professional programmer. I’ve only, painfully, worked hard to learn Erlang to solve a very specific problem.

So I bring a pragmatic beginner’s mind to this discussion that all are free to discount. And, as an English speaker, I bring an ashamedly provincial bias. Indeed, after seventy some odd years I still struggle to express myself fluently in English.

My pain point is this: I cringe now every time I want to use an Erlang string function. Since my aging memory is not what it once was, I need to consult the reference manual frequently while I’m programming. And, I must admit, the new string functions baffle, frustrate and, unreasonably, enrage me. I totally lose flow and concentration when I need to do what was once the simplest string operation, spending many precious minutes trying to understand the new and improved way of going about it.

So, just a few observations for what they’re worth:

Seems to me that trying to find the one universal digital standard for representing all the wonderful organic and evolving natural languages in the world is an exercise in hubris.

There is no surer road to complexity and bloat than trying to be all things to all people.

But, yes, in this global world, we do need to communicate across natural language domains.

Esperanto is one not terribly successful attempt to do this in the non-digital world. How many among us speak Esperanto?

Yet we use translators and translation services quite effectively.

So, perhaps, the glyphs of each language should have their own most efficient and standardized digital representation. And serious intellectual capital should go into writing language-to-language translation packages.

All the best to all,

LRP


> On Jun 6, 2018, at 6:48 PM, zxq9@REDACTED wrote:
> 
>> On 2018年6月6日水曜日 14時00分20秒 JST Vlad Dumitrescu wrote:
>> 
>> - The new string functions work with strings as sequences of lexemes. The
>> "list strings" are lists of characters, so for example calling length() on
>> the two representations of the same text may not return the same value.
>> Most notably, CRLF is a lexeme, but two characters.
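
A minimal sketch of the difference described above, assuming OTP 20 or later (where `string:length/1` is available). The module name `length_demo` is hypothetical and only for illustration.

```erlang
-module(length_demo).
-export([demo/0]).

%% Count the same text two ways: as list elements (code points)
%% and as grapheme clusters.
demo() ->
    Text = "a\r\nb",
    %% length/1 counts list elements, so CR and LF are two of them.
    CodePoints = length(Text),
    %% string:length/1 counts grapheme clusters, so CRLF is one.
    Clusters = string:length(Text),
    {CodePoints, Clusters}.
```

Calling `length_demo:demo()` should return `{4, 3}`: four code points, three grapheme clusters.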
> 
> To expand on this point (as I've done before) many lexemes used in CJK have multiple constructions that are considered equivalent. Korean hangul is almost a pure example of this, as input is typically done over aggregate lexemes and often re-masked as a single codepoint once the input phase is complete, but not always. Pinyin input works in a similar way but has way more complex aggregate lexemes, though the principle is similar. Even Japanese has a few examples of this (is ぷ a single character or is it [[ふ,゜]] this particular time we encounter it?).
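
The ぷ question above can be sketched in Erlang, assuming OTP 20 or later for the normalization functions in the `unicode` module. The module name `nfc_demo` is hypothetical, and the decomposed form here is assumed to use the combining mark U+309A rather than the spacing mark.

```erlang
-module(nfc_demo).
-export([demo/0]).

%% Two constructions of the same text: one precomposed code point,
%% and a base character followed by a combining mark.
demo() ->
    Precomposed = [16#3077],            % ぷ as a single code point
    Decomposed  = [16#3075, 16#309A],   % ふ plus combining handakuten
    %% As plain lists the two are different terms.
    SameList = Precomposed =:= Decomposed,
    %% After normalizing to NFC they compare equal.
    SameNfc = unicode:characters_to_nfc_list(Decomposed) =:= Precomposed,
    %% The decomposed form is two code points but one grapheme cluster.
    {SameList, SameNfc, length(Decomposed), string:length(Decomposed)}.
```

Calling `nfc_demo:demo()` should return `{false, true, 2, 1}`. Whether a program should normalize before comparing depends on what level it needs to interpret the text at, which is the point being made.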
> 
> etc.
> 
> Saying "unicode is the standard now, and UTF-8 is The One True Way" is also saying "a fantastically complex world of codepoints and construction indicators that allow for multiple representations as equivalent is now the standard". That doesn't do anything to solve the question of whether there should be a separate string type. It also ignores that Windows is natively UTF-16, not UTF-8 (though it works a lot better with UTF-8 these days).
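
As a small illustration of the UTF-8 versus UTF-16 remark, the same code points take a different number of bytes in each encoding. This is only a sketch; the module name `encoding_demo` and function `sizes/1` are hypothetical.

```erlang
-module(encoding_demo).
-export([sizes/1]).

%% Return {code point count, UTF-8 byte size, UTF-16 byte size}
%% for a list of Unicode code points.
sizes(CodePoints) ->
    Utf8  = unicode:characters_to_binary(CodePoints, unicode, utf8),
    Utf16 = unicode:characters_to_binary(CodePoints, unicode, {utf16, little}),
    {length(CodePoints), byte_size(Utf8), byte_size(Utf16)}.
```

For example, `encoding_demo:sizes("abc")` should give `{3, 3, 6}`, while `encoding_demo:sizes([16#3077])` should give `{1, 3, 2}`, so neither encoding is uniformly smaller.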
> 
> Go read the unicode standard. It's... well, just have fun reading it. I don't know anyone who understands *all* of it 100% -- because for most people the first 10% or so "works for me" (for mainstream CJK use probably the first 30% or so).
> 
> I suppose all of this is to say that when you compare the enormous number of corner cases in string handling against MERE SYNTAX the syntax is such a trivial issue that it isn't even worth time thinking about. The reason this subject comes up once a year is because we all wish there were a magical set of characters we would write into source files that mean "just make this string work, regardless of what it is, because this is hard". We want a syntax that represents the system abstracting away from the underlying data and instead forcing it to mean what *we* mean -- which is insufficiently low-level for much of the work we do as programmers (so we're leaving a lot undefined there, which is bad). At what level do you need to interpret a string? The answer is different for different programs (someone writing a Korean input interpreter is in a very different case than someone writing a chat server). This desire is a kissing cousin of the grand desire for a unified method of l10n and i18n -- which turns out to be just as hard to get right as string handling because every case is different, often in a way that conflicts with another case you have to handle.
> 
> Blah blah. Strings are hard, and about half of what we think of as "strings" are really serializations of non-string data anyway.
> 
> -Craig
> _______________________________________________
> erlang-questions mailing list
> erlang-questions@REDACTED
> http://erlang.org/mailman/listinfo/erlang-questions