[erlang-questions] Binary string literal syntax

zxq9@REDACTED zxq9@REDACTED
Thu Jun 7 04:27:34 CEST 2018


On Thursday, 7 June 2018 00:56:29 JST you wrote:
> 
> > On 7 Jun 2018, at 00:21, zxq9@REDACTED wrote:
> > 
> > On Wednesday, 6 June 2018 11:41:01 JST Sean Hinde wrote:
> > 
> >> As a protocol wrangling language I would argue Erlang has no peers, but many more protocols are string based now than when the bit syntax was invented.
> > 
> > By count this is patently false. Most protocols are binary based, as the ad hoc binary protocols created for IoT vastly outnumber the handful of prolific string-based ones. Can you think of a better language for IoT protocol wrangling than Erlang?
> 
> No arguments from me on the suitability of Erlang for protocol wrangling. And these string based ones are definitely prolific. I spent today dealing with JSON in Erlang for some banking protocol.

...
> > * Binary protocols are alive and well
> > * The old encodings are far from dead.
> > * You have a good point about improvements being possible and desirable.
> > * The best way to proceed is not clear.
> > * The unicode-correctish improvements to the string and unicode modules are very encouraging.
>
> Nice summary. You have obviously thought about this a lot. Any thoughts on a better solution? What would you do?
> 
> Maybe a hypothetical new string literal type treated as unicode internally but with transparent conversion to utf-8 by default when sent to io (with the option to override)? I get Japan, but utf8 is a sane default.
> 
> Or maybe some new slick syntax to create a string literal in any encoding.
> 
> The bit syntax was designed for picking apart bit twiddling telecom protocols. It was clearly not designed with the primary goal of representing alternative forms of string literals. It’s just not what you would choose for that application.

The main problem I see with this particular example is that you feel you were dealing with a "string-based protocol" because you were dealing with JSON.

You weren't -- JSON is a list of trees. It is serialized as a string, and strings are used to represent things in JSON that JSON itself is *dramatically* unsuited for, so "everything is a string" seems reasonable to people who don't know anything about type systems or were hustled into pushing a "lipstick on a chicken" prototype into production.

That last case is so common that a lot of new coders haven't ever seen anything *but* JSON in practice. That doesn't mean we should optimize for wrongness.

The point of exacerbation is that you are using a JSON serializer that outputs lists of trees of pairs that contain binary snippets instead of lists as the string representations (Jiffy, I imagine). That isn't the best way to deal with strings in Erlang, imo.

So we have a conflation of issues here:
- Strings (or more broadly, io_data()) in Erlang can *actually* represent Unicode types because they can represent things as lexemes not just a flat array of codepoints. That's actually quite advanced.
- Binaries are just that: binaries. They were indeed never intended for advanced string processing.
- Binaries *can* represent strings, are more compact in memory and are easier to deal with in NIFs, which is why Jiffy uses them.
- Jiffy is the most common JSON serializer for Erlang.
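
To make the first point concrete, here is a minimal sketch, assuming OTP 20 or later (where the string module became grapheme-aware). The module name grapheme_sketch is made up for illustration. It shows one user-perceived character that is two codepoints and three UTF-8 bytes:

```erlang
-module(grapheme_sketch).
-export([demo/0]).

%% "e" followed by U+0301 COMBINING ACUTE ACCENT:
%% two codepoints, but one user-perceived character.
demo() ->
    Decomposed = [$e, 16#0301],
    #{codepoints => length(Decomposed),                 % flat list length
      graphemes  => string:length(Decomposed),          % grapheme clusters
      clusters   => string:to_graphemes(Decomposed),    % nested per cluster
      utf8_bytes => byte_size(unicode:characters_to_binary(Decomposed))}.
```

A flat binary only gives you the last number for free; the list form can carry the cluster structure directly.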

Not a single one of these issues is addressed or made easier to deal with by a new syntax that equates to <<"foo"/utf8>>. In fact, the /utf8 binary identifier has only been brought up a few times in this thread because it isn't the point.
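
For anyone following along who hasn't used the /utf8 type specifier, a small sketch of what it changes (the module and function names are hypothetical, and the character is written as an escape so the result does not depend on the source file encoding):

```erlang
-module(utf8_flag_sketch).
-export([plain/0, tagged/0]).

%% U+00E5 (a with ring above).
%% Without /utf8 each character becomes one 8-bit integer segment.
plain() -> <<"\x{E5}">>.

%% With /utf8 each character is encoded as its UTF-8 byte sequence.
tagged() -> <<"\x{E5}"/utf8>>.
```

For pure ASCII such as "foo" the two forms produce identical bytes, which is part of why the specifier alone doesn't settle anything about representation.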

What you *really* want, I think, is this:
1. A concrete decision about how Erlang represents UTF-8 in memory. A canonicalization.
2. A single io_data() -> utf8_string() IMPORT function.
3. Access to the canonical representation so that dealing with it in Rust/C NIFs and Erlang is not mind bending.
4. A single utf8_string() -> io_data() EXPORT function that has a default serialization rule.
5. A set of functions that allow me to pick which binary representation is output if the default is unsuitable (like when I really need to cast hangul characters to their equivalent broken-down lexemes, for example).
6. A special syntax that abstracts the concept of the underlying representation for utf8 in memory.
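
A rough sketch of what points 2, 4 and 5 could look like on top of the existing unicode module (OTP 20 or later). The module name ustr_sketch, the function names import/export, and the choice of NFC as the canonical form are all assumptions made for illustration, not anything that has been decided:

```erlang
-module(ustr_sketch).
-export([import/1, export/1, export/2]).

%% IMPORT: any chardata (lists, binaries, nested mixes) -> one canonical
%% UTF-8 binary. NFC as "canonical" is an assumption of this sketch.
import(Data) ->
    case unicode:characters_to_nfc_binary(Data) of
        Bin when is_binary(Bin)    -> Bin;
        {error, _Converted, _Rest} -> error(badarg)
    end.

%% EXPORT with a default serialization rule.
export(Str) -> export(Str, nfc).

%% EXPORT with an explicitly chosen normalization form.
export(Str, nfc)  -> unicode:characters_to_nfc_binary(Str);
export(Str, nfd)  -> unicode:characters_to_nfd_binary(Str);
export(Str, nfkc) -> unicode:characters_to_nfkc_binary(Str);
export(Str, nfkd) -> unicode:characters_to_nfkd_binary(Str).
```

With this, ustr_sketch:export(ustr_sketch:import([16#D55C]), nfd) takes the single precomposed hangul syllable U+D55C and emits the UTF-8 encoding of its three jamo (U+1112, U+1161, U+11AB), which is the "broken-down" case from point 5.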

None of these are trivial issues or should be messed about with lightly.

As for syntax, the quoting we have so far for the types we have so far is great. The <<"blahblah">> thing for direct access to binaries is great. The "foo" == [$f, $o, $o] sugar is also brilliant. The fact that io_data() is a nested list of stuff can very often make complex, large manipulation of io_data() way faster in Erlang than other languages that have to traverse binary strings to do their work, even if it looks ugly (but again, remembering that the *data* you're dealing with is trees merely represented by strings is key).
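
A minimal sketch of the nested-list point, with hypothetical names (iodata_sketch, wrap/1, render/1). Wrapping or joining output only builds new list cells around the existing pieces; nothing is copied or scanned until the data is flattened at the edge, and ports and sockets accept the nested form as-is:

```erlang
-module(iodata_sketch).
-export([wrap/1, render/1]).

%% Builds a three-element list around Body; Body itself is neither
%% copied nor traversed here, however large it is.
wrap(Body) -> [<<"<p>">>, Body, <<"</p>">>].

%% Flatten once, at the edge, only if a flat binary is really needed.
%% Note: iolist_to_binary/1 accepts bytes only; chardata containing
%% codepoints above 255 needs unicode:characters_to_binary/1 instead.
render(Parts) ->
    iolist_to_binary(lists:join(<<"\n">>, [wrap(P) || P <- Parts])).
```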

So I think Erlang has really gotten all of that right.

But we still SHOULD eventually have a canonical utf8 type.

As for syntax...

I HATE prefix-glyph syntax for quotes. Ugh. Better to just give me a single-letter function name and let me do u("blah") or whatever. Then I don't have to learn anything new, at least, and can use it in a list function or whatever.
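
Something like the following is all I mean, as a sketch. The module name and the choice to normalize to an NFC binary are assumptions; u/1 could just as well return some other canonical form:

```erlang
-module(u_sketch).
-export([u/1, labels/0]).

%% Single-letter conversion function: chardata in, NFC UTF-8 binary out.
u(Chardata) -> unicode:characters_to_nfc_binary(Chardata).

%% Because it is an ordinary function it composes with everything else,
%% which a prefix glyph on a literal cannot do.
labels() -> lists:map(fun u/1, ["blah", [$e, 16#0301], <<"foo">>]).
```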

I DOUBLY HATE it when new programmers get confused by prefix-glyph syntax. You don't have to teach anyone what a normal-looking quote mark is or how to use or type them.

So if we have to have a special syntax, instead, I would recommend backticks-as-quotes.

'an_atom'
"a listy string"
<<"a binary string">>
`a canonical utf8 string`

We have a million other kinds of quotes in Japanese that would 「suit」『me』【just】《fine》 but totally screw everyone else over, sort of like German quote angle thingies would were they to be made mandatory -- but I think backticks are universally available without any special input modes (correct me if I'm wrong).

The `utf8 string` version would be a strict, canonical equivalent to <<"utf8 string"/utf8>> in memory. I'm actually not sure whether the current binary /utf8 tag forces canonicalization (or if it does, *which* unicode form is canonical in Erlang right now). The canonical representation in memory issue has to be ironed out if you want your JSON situation to improve -- and for you I think this is really the rub (whereas I have very different concerns with unicode strings, and would be a bit annoyed if an optimization in the interest of JSON made dealing with things like client-side input or string forms commonly embedded in binary protocol traffic out on my half of the planet unduly complicated).
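
On the canonicalization question, a quick check one can run (module and function names are hypothetical; assumes OTP 20 or later for the normalization call). As far as I can tell the /utf8 specifier only encodes the codepoints it is given and does not normalize, so the precomposed and decomposed spellings of the same character come out as different binaries unless they are normalized explicitly:

```erlang
-module(utf8_norm_sketch).
-export([precomposed/0, decomposed/0, same_after_nfc/0]).

%% U+00E9, a single precomposed codepoint.
precomposed() -> <<"\x{E9}"/utf8>>.

%% U+0065 followed by U+0301 COMBINING ACUTE ACCENT.
decomposed() -> <<"e\x{301}"/utf8>>.

%% The two binaries differ byte for byte, but normalizing both to NFC
%% makes them compare equal.
same_after_nfc() ->
    unicode:characters_to_nfc_binary(precomposed()) =:=
        unicode:characters_to_nfc_binary(decomposed()).
```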

As far as what is happening in Erlang right now to clear some of these issues up, since R19 a LOT of unicode changes have been happening, and most of them are really headed in happy directions. I would say that we need to keep this in the back of our minds, but that implementing anything like unicode canonicalization (to the point that we are happy with whatever is decided forever and ever come-what-may-and-screw-the-corner-cases, amen) and especially implementing any special syntax to abstract it in code is premature.

Dan Gudmundsson has done a TON of excellent work in this area and continues to do so. He has gained a huge amount of knowledge and experience about unicode and how it interacts with current representations, and he would really be the one to ask about what "should be done" and where we are in terms of reaching a unicode string type that makes sense to deal with internally, in NIFs, exported data, etc.

What am I missing?

-Craig


