[erlang-questions] Binary string literal syntax

Thu Jun 7 13:33:58 CEST 2018

>> 
>> The bit syntax was designed for picking apart bit twiddling telecom protocols. It was clearly not designed with the primary goal of representing alternative forms of string literals. It’s just not what you would choose for that application.
> 
> The main problem I see with this particular example is that you feel you were dealing with a "string-based protocol" because you were dealing with JSON.
> 
> You weren't -- JSON is a list of trees. It is serialized as a string, and strings are used to represent things in JSON that JSON itself is *dramatically* unsuited for, so "everything is a string" seems reasonable to people who don't know anything about type systems or were hustled into pushing a "lipstick on a chicken" prototype into production.
> 
> That last case is so common that a lot of new coders haven't ever seen anything *but* JSON in practice. That doesn't mean we should optimize for wrongness.
> 
> The point of exacerbation is that you are using a JSON serializer that outputs lists of trees of pairs that contain binary snippets instead of lists as the string representations (Jiffy, I imagine). That isn't the best way to deal with strings in Erlang, imo.

That’s a fair summary of my current state of exacerbation :) And yes, Jiffy this week. Binary snippets as keys in JSON suit the use case well enough so long as the decoded representation can be sanely matched against a string literal. Atoms have their own problems, and strings as lists are just plain annoying for dealing with incoming data that can also take the form of lists of other objects (is_string? - https://stackoverflow.com/questions/2479713/determining-if-an-item-is-a-string-or-a-list-in-erlang?noredirect=1&lq=1)
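
To make the is_string ambiguity concrete, here is a minimal sketch. The module and function names (key_match, classify/1, name/1) are made up for illustration, and the {[{Key, Value}]} object shape is what I understand Jiffy to return by default; treat both as assumptions.

```erlang
-module(key_match).
-export([classify/1, name/1]).

%% A list of small integers and a string are the same term:
%% [104, 105] =:= "hi". The best we can do is guess.
classify(B) when is_binary(B) -> binary;
classify(L) when is_list(L) ->
    case io_lib:printable_unicode_list(L) of
        true  -> maybe_string;   % a heuristic, not a proof (also true for [])
        false -> list
    end;
classify(_) -> other.

%% With binary keys the decoded object can be matched against a
%% binary literal without any guessing.
name({Props}) when is_list(Props) ->
    case lists:keyfind(<<"name">>, 1, Props) of
        {<<"name">>, Name} when is_binary(Name) -> {ok, Name};
        _ -> error
    end.
```

classify/1 can only ever answer "maybe" for a list, whereas name/1 never has to ask the question.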

> 
> So we have a conflation of issues here:
> - Strings (or more broadly, io_data()) in Erlang can *actually* represent Unicode types because they can represent things as lexemes not just a flat array of codepoints. That's actually quite advanced.
> - Binaries are just that: binaries. They were indeed never intended for advanced string processing.
> - Binaries *can* represent strings, are more compact in memory and are easier to deal with in NIFs, which is why Jiffy uses them.
> - Jiffy is the most common JSON serializer for Erlang.
> 
> Not a single one of these issues is addressed or made easier to deal with by a new syntax that equates to <<"foo"/utf8>>. In fact, the /utf8 binary identifier has only been brought up a few times in this thread because it isn't the point.

I tend to believe that syntax is important, not for the “Wah Wah it looks too weird I can’t use that language” reason, but because it defines the UX of the system. And UX does drive behaviour. As much as we would like to think that everyone will think really hard about exactly which representation they need at each point in their program, the current kinds of strings provide a bit of a Hobson's choice.
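
One small illustration of how the literal syntax shapes behaviour: leaving off the /utf8 suffix is the path of least resistance, and it silently changes the bytes for anything outside ASCII. A sketch, with a made-up module name, using escapes so the result does not depend on the source file encoding:

```erlang
-module(utf8_literal).
-export([demo/0]).

demo() ->
    %% Without /utf8 every code point becomes one 8-bit segment,
    %% and larger values are truncated to their low 8 bits.
    <<233>> = <<"\x{e9}">>,          % U+00E9 stored as a single byte
    <<16#42>> = <<"\x{3042}">>,      % U+3042 truncated to 16#42, i.e. $B
    %% With /utf8 every code point is UTF-8 encoded.
    <<195, 169>> = <<"\x{e9}"/utf8>>,
    <<227, 129, 130>> = <<"\x{3042}"/utf8>>,
    ok.
```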

> What you *really* want, I think, is this:
> 1. A concrete decision about how Erlang represents UTF-8 in memory. A canonicalization.
> 2. A single io_data() -> utf8_string() IMPORT function.
> 3. Access to the canonical representation so that dealing with it in Rust/C NIFs and Erlang is not mind bending.
> 4. A single utf8_string() -> io_data() EXPORT function that has a default serialization rule.
> 5. A set of functions that allow me to pick which binary representation is output if the default is unsuitable (like when I really need to cast Hangul characters to their equivalent broken-down lexemes, for example).
> 6. A special syntax that abstracts the concept of the underlying representation for utf8 in memory.

If we can have all that without overhead of having to parse byte by byte all incoming data to be sure it’s valid utf8 (utf8_raw mode?) that looks like an excellent way forward.
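
For what it's worth, points 2, 4 and 5 can be approximated today with the unicode module. The sketch below is only that: import/1, export/1 and export/2 are hypothetical names, and choosing NFC as the default export form is my assumption, not something anyone has decided. Note that import/1 does walk the whole input to validate it, which is exactly the overhead in question.

```erlang
-module(utf8_io).
-export([import/1, export/1, export/2]).

%% import/1: any chardata in, validated UTF-8 binary out.
import(Data) ->
    case unicode:characters_to_binary(Data) of
        Bin when is_binary(Bin)   -> {ok, Bin};
        {error, _Good, Bad}       -> {error, {invalid, Bad}};
        {incomplete, _Good, Rest} -> {error, {incomplete, Rest}}
    end.

%% export/1: default serialization (assumed here to be NFC).
export(Bin) -> export(Bin, nfc).

%% export/2: pick the normalization form explicitly.
export(Bin, nfc) when is_binary(Bin) ->
    unicode:characters_to_nfc_binary(Bin);
export(Bin, nfd) when is_binary(Bin) ->
    unicode:characters_to_nfd_binary(Bin).
```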

> None of these are trivial issues or should be messed about with lightly.

Agreed. The EEP process and culture of this community are well designed to weed out badly thought through proposals :)

> 
> As for syntax, the quoting we have so far for the types we have so far is great. The <<"blahblah">> thing for direct access to binaries is great. The "foo" == [$f, $o, $o] sugar is also brilliant. The fact that io_data() is a nested list of stuff can very often make complex, large manipulation of io_data() way faster in Erlang than other languages that have to traverse binary strings to do their work, even if it looks ugly (but again, remembering that the *data* you're dealing with is trees merely represented by strings is key).
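
A minimal sketch of the nesting described above, with a made-up module name and a toy JSON fragment as the data. The pieces are combined by building a list around them and are only flattened once, at the edge:

```erlang
-module(iodata_demo).
-export([demo/0]).

demo() ->
    %% "foo" is sugar for a list of code points.
    true = ("foo" == [$f, $o, $o]),
    %% Compose by nesting: binaries, strings and single characters
    %% are referenced, not copied, while the tree is being built.
    Name = "Sean",
    Pair = [<<"\"name\":\"">>, Name, $\"],
    Doc  = [${, Pair, $}],
    %% Flatten once, when the data leaves the system.
    15 = iolist_size(Doc),
    <<"{\"name\":\"Sean\"}">> = iolist_to_binary(Doc),
    ok.
```
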
> 
> So I think Erlang has really gotten all of that right.
> 
> But we still SHOULD eventually have a canonical utf8 type.
> 
> As for syntax...
> 
> I HATE prefix-glyph syntax for quotes. Ugh. Better to just give me a single-letter function name and let me do u("blah") or whatever. Then I don't have to learn anything new, at least, and can use it in a list function or whatever.
> 
> I DOUBLY HATE it when new programmers get confused by prefix-glyph syntax. You don't have to teach anyone what a normal-looking quote mark is or how to use or type them.
> 
> So if we have to have a special syntax, instead, I would recommend backticks-as-quotes.
> 
> 'an_atom'
> "a listy string"
> <<"a binary string">>
> `a canonical utf8 string`
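
The first three forms exist today; the backtick form does not. As a rough stand-in, the single-letter function idea can be sketched as below. u/1 is a hypothetical helper, not an OTP function, and this version only encodes to UTF-8 without any canonicalization.

```erlang
-module(quotes).
-export([u/1, demo/0]).

%% Hypothetical helper: chardata in, UTF-8 binary out.
u(Chardata) ->
    unicode:characters_to_binary(Chardata).

demo() ->
    true = is_atom('an_atom'),
    true = is_list("a listy string"),
    true = is_binary(<<"a binary string">>),
    %% Stand-in for the proposed backtick literal.
    <<"caf", 195, 169>> = u("caf\x{e9}"),
    %% Being an ordinary function, it can be passed to list functions.
    [<<"a">>, <<"b">>] = lists:map(fun u/1, ["a", "b"]),
    ok.
```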

Go made the backtick choice: https://golang.org/ref/spec#String_literals with some interesting semantics. \r chars are stripped out, and their backtick strings can span multiple lines.

I’ll do some digging to see how happy their community is with the choice.

My only concern would be how easy it could be to mistake ` for ' when reading code.

> We have a million other kinds of quotes in Japanese that would 「suit」『me』【just】《fine》 but totally screw everyone else over, sort of like German quote angle thingies would were they to be made mandatory -- but I think backticks are universally available without any special input modes (correct me if I'm wrong).

At least for programmers it ought to be available - shells have used it forever.

> The `utf8 string` version would be a strict, canonical equivalent to <<"utf8 string"/utf8>> in memory. I'm actually not sure whether the current binary /utf8 tag forces canonicalization (or if it does, *which* unicode form is canonical in Erlang right now). The canonical representation in memory issue has to be ironed out if you want your JSON situation to improve -- and for you I think this is really the rub (whereas I have very different concerns with unicode strings, and would be a bit annoyed if an optimization in the interest of JSON made dealing with things like client-side input or string forms commonly embedded in binary protocol traffic out on my half of the planet unduly complicated).
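
On the canonicalization question: as far as I can tell the /utf8 type only encodes code points and does not normalize anything, so two spellings of the same text stay distinct until the unicode module is asked to normalize them. A small sketch (module name made up) that also shows the Hangul decomposition case:

```erlang
-module(canon).
-export([demo/0]).

demo() ->
    Composed   = <<"\x{e9}"/utf8>>,      % U+00E9
    Decomposed = <<"e\x{301}"/utf8>>,    % U+0065 followed by U+0301
    %% /utf8 encodes what it is given; the binaries differ.
    false = (Composed =:= Decomposed),
    %% Explicit normalization makes them comparable.
    true = (unicode:characters_to_nfc_binary(Decomposed) =:= Composed),
    %% Under NFD the Hangul syllable U+AC00 breaks into its jamo.
    [16#1100, 16#1161] = unicode:characters_to_nfd_list([16#AC00]),
    ok.
```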

JSON handling ought not of course to be the determinant, it’s just this week’s random thing I happen to be working on. I wasn’t working on it last week when the thought came about whether we could steal some ideas from Elixir for string handling.

> As far as what is happening in Erlang right now to clear some of these issues up, since R19 a LOT of unicode changes have been happening, and most of them are really headed in happy directions. I would say that we need to keep this in the back of our minds, but that implementing anything like unicode canonicalization (to the point that we are happy with whatever is decided forever and ever come-come-what-may-and-screw-the-corner-cases, amen) and especially implementing any special syntax to abstract it in code is premature.

That’s *just* a matter of release planning :) 

> Dan Gudmundsson has done a TON of excellent work in this area and continues to do so. He has gained a huge amount of knowledge and experience about unicode and how it interacts with current representations, and he would really be the one to ask about what "should be done" and where we are in terms of reaching a unicode string type that makes sense to deal with internally, in NIFs, exported data, etc.

Darn, I should have grabbed Dan on this topic at Code BEAM STO last week!

> What am I missing?

Good question. It prompted a few more thoughts:

It would be nice to be able to use a new string format in more of the places we use strings. So some kind of interpolation for string construction: io_lib:format(`~p`, [atom]). to follow existing conventions, or something more modern:

`some \(fun() -> "interpolated" end) string` - Swift
or
`some #{"interpolated"} string` - Elixir

Though without “printable” protocols I guess these last two wouldn’t fly
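
For comparison, the closest thing available today follows the io_lib:format route. A sketch with a hypothetical greet/1, where ~ts takes chardata and ~p prints any term:

```erlang
-module(interp).
-export([greet/1]).

%% Format into chardata, then convert once to a UTF-8 binary.
greet(Name) ->
    unicode:characters_to_binary(
      io_lib:format("hello ~ts, tag=~p", [Name, atom])).
```

Calling interp:greet(<<"Sean">>) gives <<"hello Sean, tag=atom">>, so the ~p convention already covers the "any term" case that the Swift and Elixir forms would need a printable protocol for.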

Thanks for adding your well thought out ideas and views.

Sean