[erlang-questions] Binary string literal syntax

Mon Jun 11 10:22:43 CEST 2018

Seems like a good time for a summary. Yell if I have misunderstood or misrepresented anything.

 - There is little appetite for a new 'bandaid' syntax mapping to <<"utf8-string"/utf8>>
 - Erlang is already on a path towards much improved string handling (although at the cost of making the simple cases more complex?)
 - No-one is arguing we have the ideal solution for strings today
 - There is appetite for a new string type. This should have a distinct dynamic type - text() ?
 - This type ought to be represented internally in a carefully chosen canonical utf-8 format
 - It needs clean and efficient mechanisms for input and output to file / network / nif with utf-8 default but option for many other representations.
 - Backtick quotes have a few votes as new syntax for such a representation. Go uses backticks for its raw strings, with some interesting semantics.
 - A new text() type would allow io to print these strings as `utf8 string` rather than falling back to binary representation.
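
For anyone skimming, here is a minimal sketch of what the existing /utf8 notation does today, as I understand it. The module and function names are made up for illustration, and the literal uses escapes so the result does not depend on the encoding of the source file.

```erlang
-module(utf8_literal_demo).
-export([demo/0]).

%% Compare the same literal with and without the /utf8 type specifier.
%% "H\x{EA}ll\x{F6}" is "Hêllö" written with escapes.
demo() ->
    %% Default segment type: each code point becomes a single byte.
    Latin1 = <<"H\x{EA}ll\x{F6}">>,
    %% /utf8: each code point is UTF-8 encoded, so ê and ö take two bytes.
    Utf8 = <<"H\x{EA}ll\x{F6}"/utf8>>,
    {byte_size(Latin1), byte_size(Utf8), Latin1 =:= Utf8}.
```

Both values are plain binary() terms afterwards, so nothing in the term itself records which encoding was intended.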

Some issues:

 - Adding a new type raises all the usual questions about equality, ordering, conversion (implicit and via a many to many matrix of conversion functions), guards etc. It’s a much larger change than a simple syntax representation mapping to <<"some string"/utf8>>.
 - How do we concatenate them? `Hello` <> `World`?
 - How do we construct them? io_lib:format(`~p`, [atom]). `hello #{Name}` ?
 - How do we incorporate them in other string types? io:format("~s", [`text`]). <<"txt", `utf8 text`, 0>>
 - How do we extract them from binary data? <<T:4/text, Rest/binary>>. What is the meaning of the length parameter? The string module already has a clear definition.
 - What does matching a literal out of binary data mean? <<`Hêllö`, Rest/text>> == <<"Hêllö World"/utf8>>
 - Prefix matching in normal code? `Hello` <> World = `Hello world` 
 - Is there already a suitable internal canonical utf-8 format? OTP team?
 - Lots of other details in the mails from everyone
 - Everything else
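
As a point of comparison for the questions above, this is a rough sketch of how the same operations look today on UTF-8 binaries. It is not a proposal for text(); the module and function names are hypothetical, and take/2 assumes the string module as of OTP 20 or later.

```erlang
-module(text_today_demo).
-export([concat/2, strip_prefix/2, first_code_point/1, take/2]).

%% Concatenation: builds a new binary from the two inputs.
concat(A, B) when is_binary(A), is_binary(B) ->
    <<A/binary, B/binary>>.

%% Prefix matching works on bytes, so the prefix size must be known.
strip_prefix(Prefix, Bin) when is_binary(Prefix), is_binary(Bin) ->
    Size = byte_size(Prefix),
    case Bin of
        <<Prefix:Size/binary, Rest/binary>> -> {ok, Rest};
        _ -> nomatch
    end.

%% Extracting text from binary data: /utf8 takes exactly one code point.
first_code_point(<<C/utf8, Rest/binary>>) ->
    {C, Rest}.

%% The string module counts grapheme clusters, not bytes or code points.
take(N, Bin) when is_binary(Bin) ->
    string:slice(Bin, 0, N).
```

So a length of 4 can already mean four bytes in a binary pattern or four grapheme clusters in string:slice/3, which is the ambiguity a <<T:4/text, Rest/binary>> segment would have to settle.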

Sigils are orthogonal to this discussion (and one of the secondary benefits is pretty nicely realised by using backtick - the more common double quotes would not need to be escaped - yes, JSON).

So, the big question: Did this already reach a level of complexity the language will sink under?

Is it worth spending time fleshing any of this out to an EEP level of detail?

Do we come back in another year?

Would an EEP help the existing work of the OTP team in this area or is there already a clear plan and this would be a distraction?

Sean

Aside: 
We don’t have the BDFL question. Instead, Erlang/OTP has a process.

One old boss I respected explained a reason big companies like to buy from big companies even at many times the price - his reason was that small companies rely on people and big companies on process. The process will deliver (eventually) even if the people change three times in the middle!

> On 7 Jun 2018, at 14:29, Jesper Louis Andersen <jesper.louis.andersen@REDACTED> wrote:
> 
> On Tue, Jun 5, 2018 at 10:57 PM Sean Hinde <sean.hinde@REDACTED> wrote:
> My proposal would be to add an alternative notation for binary string literals in Erlang along the lines of:
> 
> ~s"Some binary string" mapping to <<"Some binary string">>
> 
> 
> The underlying problem is that Erlang is chromodynamic, for lack of a better term[0]. In a chromodynamic language, there is one type, term(), but data of that type has "color" insofar as data is used with different intent:
> 
> * ISO8859-15 strings
> * UTF-8 strings
> * Lists of integers, where each integer is a code point
> * binary() payloads
> * binary() data which has interpretation
> * bitstring()
> * integers used as sets of bits
> 
> And so on. Data is then mapped onto a given subset of term(), namely string(), [non_neg_integer()], [0..255], binary(), iolist(), iodata() etc.
> 
> Colors don't mix. We can't have green UTF-8 strings together with blue binary() data. But the onus of keeping the colors apart is on the programmer, not on the system.
> 
> Typed languages (that is, the nontrivially typed ones) keep data apart by means of a type system. So there, we can't mix a UTF-8 string with a binary() blob unless we explicitly convert between the types. However, in a chromodynamic language, we need another way to identify the colors, and this leads into the need for explicit syntactic notation to tell them apart.
> 
> Worse, our mapping of colorful data to term() is forgetful (or if I may: the mapping is desaturating). So once we have the underlying term(), we don't know from where it came.
> 
> History plays an important role of course. binary() was intended for binary() data which are just vectors of bytes. But over time, they've found other uses in Erlang systems:
> 
> * string() - mostly due to better packing of data, especially on 64-bit machines where list cons cells have considerable overhead.
> * utf8 encoded strings
> * dynamic atoms (because Richard O'Keefe's "Split the Atoms" proposal was never implemented). You can run out of atoms, but you cannot run out of binary() if you pay the price of more expensive equality checks.
> 
> Given their prominence, I think it would be good to open a discussion on a more succinct syntax for binary() data. Perhaps laced with a discussion about what utf8 strings should be in the system. Over the years, the ubiquity of binary() data has just slowly grown.
> 
> Were I the BDFL, I'd probably go with the following scheme:
> 
> string() - still a list of code points. The double quote is used: "example"
> binary() - still written as <<Binary>>
> atom() - still there, used when you need fast equality checks. I'd probably try to figure out how to GC them so they don't have the current limitation, which would open up their use for more cases where we currently use a binary()
> text() - A new type specifically for contiguous pieces of Unicode. Always encoded as UTF-8. They will have a new syntax, probably `example`. Or perhaps #"example" or ~"example". The latter two have the advantage that they can generalize: ~r".*" etc, but the former is awfully readable.
> 
> This introduces an honest-to-god type for textual data where the data is processed as a whole, and it would probably accept a fully backwards compatible representation. We need to discriminate between binary() and textual data at the lowest level anyway. Otherwise you run the risk of mixing color way too often. Conversion routines should verify and crash on conversions which are not allowed.
> 
> Rationale: I'd never create the string() type were I to create a new language. A string is not defined as a list of codepoints, but rather as a vector of codepoints (which also means they are immutable). They should support O(1) concatenation (by having the internal representation be either iodata() or a finger tree). But since we have so much legacy code, we are stuck with string(), much like Haskell is where String = [Char].
> 
> End of BDFL rant :)
> 
> 
> [0] In keeping with CS tradition, I'll take a term from physics and absolutely butcher it by using it in a different context where it doesn't belong. Bear with me, for I have sinned.
