<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;" class="">Seems like a good time for a summary. Yell if I have misunderstood or misrepresented anything.<div class=""><br class=""></div><div class=""> - There is little appetite for a new 'bandaid' syntax mapping to &lt;&lt;"utf8-string"/utf8&gt;&gt;</div><div class=""> - Erlang is already on a path towards much improved string handling (although at the cost of making the simple cases more complex?)<div class=""> - No one is arguing we have the ideal solution for strings today</div><div class=""> - There is appetite for a new string type. This should have a distinct dynamic type - text() ?</div><div class=""> - This type ought to be represented internally in a carefully chosen canonical UTF-8 format</div><div class=""> - It needs clean and efficient mechanisms for input and output to file / network / NIF, with UTF-8 as the default but with options for many other representations.</div><div class=""> - Backtick quotes have a few votes as new syntax for such a representation. Go uses backticks for its raw string literals, with some interesting semantics.</div><div class=""> - A new text() type would allow io to print these strings as `utf8 string` rather than falling back to binary representation.</div><div class=""><br class=""></div><div class="">Some issues:</div><div class=""><br class=""></div><div class=""> - Adding a new type raises all the usual questions about equality, ordering, conversion (implicit and via a many-to-many matrix of conversion functions), guards etc. It’s a much larger change than a simple syntax representation mapping to &lt;&lt;"some string"/utf8&gt;&gt;</div><div class=""> - How do we concatenate them? `Hello` &lt;&gt; `World`?</div><div class=""> - How do we construct them? io_lib:format(`~p`, [atom]). `hello #{Name}` ?</div><div class=""> - How do we incorporate them in other string types? 
io:format("~s", [`text`]). &lt;&lt;"txt", `utf8 text`, 0&gt;&gt;</div><div class=""> - How do we extract them from binary data? &lt;&lt;T:4/text, Rest/binary&gt;&gt;. What is the meaning of the length parameter? The string module already has a clear definition.</div><div class=""> - What does matching a literal out of binary data mean? &lt;&lt;`Hêllö`, Rest/text&gt;&gt; == &lt;&lt;"Hêllö World"/utf8&gt;&gt;</div><div class=""> - Prefix matching in normal code? `Hello` &lt;&gt; World = `Hello world` </div><div class=""> - Is there already a suitable internal canonical UTF-8 format? OTP team?</div><div class=""> - Lots of other details in the mails from everyone</div><div class=""> - Everything else</div><div class=""><br class=""></div><div class="">Sigils are orthogonal to this discussion (and one of the secondary benefits is pretty nicely realised by using backtick - the more common double quotes would not need to be escaped - yes, JSON).</div><div class=""><br class=""></div><div class="">So, the big question: Has this already reached a level of complexity that the language will sink under?</div><div class=""><br class=""></div><div class="">Is it worth spending time fleshing any of this out to an EEP level of detail?</div><div class=""><br class=""></div><div class="">Do we come back in another year?</div><div class=""><br class=""></div><div class="">Would an EEP help the existing work of the OTP team in this area, or is there already a clear plan and this would be a distraction?</div><div class=""><br class=""></div><div class="">Sean</div><div class=""><br class=""></div><div class=""><br class=""></div><div class="">Aside: </div><div class="">We don’t have the BDFL question. Instead, Erlang/OTP has a process.</div><div class=""><br class=""></div><div class="">One old boss I respected explained a reason big companies like to buy from big companies even at many times the price - his reason was that small companies rely on people and big companies on process. 
The process will deliver (eventually) even if the people change three times in the middle!</div><div class=""><br class=""></div><div class=""><br class=""></div><div class=""><br class=""></div><div class=""><br class=""></div><div class=""><div><br class=""><blockquote type="cite" class=""><div class="">On 7 Jun 2018, at 14:29, Jesper Louis Andersen <<a href="mailto:jesper.louis.andersen@gmail.com" class="">jesper.louis.andersen@gmail.com</a>> wrote:</div><br class="Apple-interchange-newline"><div class=""><div dir="ltr" class="">On Tue, Jun 5, 2018 at 10:57 PM Sean Hinde <<a href="mailto:sean.hinde@mac.com" class="">sean.hinde@mac.com</a>> wrote:<br class=""><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">My proposal would be to add an alternative notation for binary string literals in Erlang along the lines of:<br class="">
<br class="">
~s"Some binary string" mapping to &lt;&lt;"Some binary string"&gt;&gt;<br class="">
<br class=""></blockquote><div class=""><br class=""></div>The underlying problem is that Erlang is chromodynamic, for lack of a better term[0]. In a chromodynamic language, there is one type, term(), but data of that type has "color" insofar as data is used with different intent:</div><div class="gmail_quote"><br class=""></div><div class="gmail_quote">* ISO8859-15 strings</div><div class="gmail_quote">* UTF-8 strings</div><div class="gmail_quote">* Lists of integers, where each integer is a code point</div><div class="gmail_quote">* binary() payloads</div><div class="gmail_quote">* binary() data which has interpretation</div><div class="gmail_quote">* bitstring()</div><div class="gmail_quote">* integers used as sets of bits</div><div class="gmail_quote"><br class=""></div><div class="gmail_quote">And so on. Data is then mapped onto a given subset of term(), namely string(), [non_neg_integer()], [0..255], binary(), iolist(), iodata() etc.</div><div class="gmail_quote"><br class=""></div><div class="gmail_quote">Colors don't mix. We can't have green UTF-8 strings together with blue binary() data. But the onus of keeping the colors apart is on the programmer, not on the system.</div><div class="gmail_quote"><br class=""></div><div class="gmail_quote">Typed languages (that is, the nontrivially typed ones) keep data apart by means of a type system. So there, we can't mix a UTF-8 string with a binary() blob unless we explicitly convert between the types. However, in a chromodynamic language, we need another way to identify the colors, and this leads to the need for explicit syntactic notation to tell them apart.</div><div class="gmail_quote"><br class=""></div><div class="gmail_quote">Worse, our mapping of colorful data to term() is forgetful (or if I may: the mapping is desaturating). 
So once we have the underlying term(), we don't know where it came from.</div><div class="gmail_quote"><br class=""></div><div class="gmail_quote">History plays an important role, of course. binary() was intended for binary data, which is just a vector of bytes. But over time, binaries have found other uses in Erlang systems:</div><div class="gmail_quote"><br class=""></div><div class="gmail_quote">* strings - mostly due to better packing of data, especially on 64-bit machines where list cons cells have considerable overhead.</div><div class="gmail_quote">* utf8 encoded strings</div><div class="gmail_quote">* dynamic atoms (because Richard O'Keefe's "Split the Atoms" proposal was never implemented). You can run out of atoms, but you cannot run out of binaries, at the price of more expensive equality checks.</div><div class="gmail_quote"><br class=""></div><div class="gmail_quote">Given their prominence, I think it would be good to open a discussion on a more succinct syntax for binary() data, perhaps combined with a discussion about what utf8 strings should be in the system. Over the years, the ubiquity of binary() data has just slowly grown.</div><div class="gmail_quote"><br class=""></div><div class="gmail_quote">Were I the BDFL, I'd probably go with the following scheme:</div><div class="gmail_quote"><br class=""></div><div class="gmail_quote">string() - still a list of code points. The double quote is used: "example"</div><div class="gmail_quote">binary() - still written as &lt;&lt;Binary&gt;&gt;</div><div class="gmail_quote">atom() - still there, used when you need fast equality checks. I'd probably try to figure out how to GC them so they don't have the current limitation, which would open up their use for more cases where we currently use a binary()</div><div class="gmail_quote">text() - A new type specifically for contiguous pieces of unicode. Always encoded as UTF-8. They will have a new syntax, probably `example`. Or perhaps #"example" or ~"example". 
The latter two have the advantage that they can generalize: ~r".*" etc., but the former is awfully readable.</div><div class="gmail_quote"><br class=""></div><div class="gmail_quote">This introduces an honest-to-god type for textual data where the data is processed as a whole, and it would probably accept a fully backwards compatible representation. We need to discriminate between binary() and textual data at the lowest level anyway. Otherwise you run the risk of mixing colors way too often. Conversion routines should verify and crash on conversions which are not allowed.</div><div class="gmail_quote"><br class=""></div><div class="gmail_quote">Rationale: I'd never create the string() type were I to create a new language. A string is not defined as a list of codepoints, but rather as a vector of codepoints (which also means they are immutable). They should support O(1) concatenation (by having the internal representation be either iodata() or a finger tree). But since we have so much legacy code, we are stuck with string(), much like Haskell, where String = [Char].</div><div class=""><br class=""></div><div class="">End of BDFL rant :)<br class=""></div><div class=""><br class=""></div><div class="gmail_quote"><br class=""></div><div class="gmail_quote">[0] In keeping with CS tradition, I'll take a term from physics and absolutely butcher it by using it in a different context where it doesn't belong. Bear with me, for I have sinned.<br class=""></div></div>
</div></blockquote></div><br class=""></div></div></body></html>