[erlang-questions] Binary string literal syntax

Jesper Louis Andersen jesper.louis.andersen@REDACTED
Thu Jun 7 14:29:22 CEST 2018


On Tue, Jun 5, 2018 at 10:57 PM Sean Hinde <sean.hinde@REDACTED> wrote:

> My proposal would be to add an alternative notation for binary string
> literals in Erlang along the lines of:
>
> ~s"Some binary string" mapping to <<"Some binary string">>
>
>
The underlying problem is that Erlang is chromodynamic, for lack of a
better term[0]. In a chromodynamic language, there is one type, term(), but
data of that type has "color" insofar as data is used with different intent:

* ISO8859-15 strings
* UTF-8 strings
* Lists of integers, where each integer is a code point
* binary() payloads
* binary() data which has interpretation
* bitstring()
* integers used as sets of bits

And so on. Data is then mapped onto a given subset of term(), namely
string(), [non_neg_integer()], [0..255], binary(), iolist(), iodata() etc.

Colors don't mix. We can't have green UTF-8 strings together with blue
binary() data. But the onus of keeping the colors apart is on the
programmer, not on the system.

Typed languages (that is, the nontrivially typed ones) keep data apart by
means of a type system. So there, we can't mix a UTF-8 string with a
binary() blob unless we explicitly convert between the types. However, in a
chromodynamic language, we need another way to identify the colors, and
this leads to the need for explicit syntactic notation to tell them apart.

Worse, our mapping of colorful data to term() is forgetful (or if I may:
the mapping is desaturating). So once we have the underlying term(), we
don't know from where it came.
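
To make the desaturation concrete, here is a minimal sketch. The module and
function names are made up for illustration; only the BIFs and the unicode
module are real.

```erlang
-module(color_demo).
-export([same_term/0, encodings/0]).

%% A string literal and a list of integers are the same term:
%% nothing records which of the two the programmer meant.
same_term() ->
    "abc" =:= [97, 98, 99].

%% The same character (code point 233, e with acute accent) in two
%% encodings. Both results are plain binary() terms; the encoding
%% itself is not stored anywhere in the term.
encodings() ->
    Latin1 = <<233>>,
    Utf8 = unicode:characters_to_binary([233]),
    {is_binary(Latin1), is_binary(Utf8), Latin1 =:= Utf8, Utf8}.
```

Here same_term() is true, and encodings() gives two binaries that differ
byte for byte even though both pass is_binary/1. Which of them is "text" is
knowledge that lives only in the programmer's head.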

History plays an important role, of course. binary() was intended for
binary data, that is, plain vectors of bytes. But over time, binaries have
found other uses in Erlang systems:

* strings - mostly due to better packing of data, especially on 64-bit
machines where list cons cells have considerable overhead.
* UTF-8 encoded strings
* dynamic atoms (because Richard O'Keefe's "Split the Atoms" proposal was
never implemented). You can run out of atoms, but you cannot run out of
binaries, if you pay the price of more expensive equality checks.
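
The packing argument can be checked with a rough sketch like the one below.
It assumes erts_debug:flat_size/1, which is an internal debugging helper
rather than a supported API, and the module name is hypothetical.

```erlang
-module(packing_demo).
-export([sizes/1]).

%% Compare the heap footprint of N characters stored as a list with
%% the payload size of the same characters stored as a binary.
%% Returns {ListBytes, BinaryBytes}.
sizes(N) ->
    List = lists:duplicate(N, $a),
    Bin = list_to_binary(List),
    %% Word size in bytes (8 on a 64-bit VM).
    WordSize = erlang:system_info(wordsize),
    %% flat_size/1 reports the term size in machine words.
    ListBytes = erts_debug:flat_size(List) * WordSize,
    {ListBytes, byte_size(Bin)}.
```

The binary figure counts only the payload and ignores the small fixed header
a binary carries, so treat the comparison as an order-of-magnitude estimate.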

Given their prominence, I think it would be good to open a discussion on a
more succinct syntax for binary() data. Perhaps laced with a discussion
about what utf8 strings should be in the system. Over the years, the
ubiquity of binary() data has just slowly grown.

Were I the BDFL, I'd probably go with the following scheme:

string() - still a list of code points. The double quote is used: "example"
binary() - still written as <<Binary>>
atom() - still there, used when you need fast equality checks. I'd probably
try to figure out how to GC them so they don't have the current limitation,
which would open up their use for more cases where we currently use a
binary()
text() - A new type specifically for contiguous pieces of unicode. Always
encoded as UTF-8. They will have a new syntax, probably `example`. Or
perhaps #"example" or ~"example". The latter two have the advantage that
they can generalize: ~r".*" etc, but the former is awfully readable.

This introduces an honest-to-god type for textual data where the data is
processed as a whole, and it would probably accept a fully backwards
compatible representation. We need to discriminate between binary() and
textual data at the lowest level anyway. Otherwise you run the risk of
mixing colors way too often. Conversion routines should verify and crash on
conversions which are not allowed.
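
As a sketch of what "verify and crash" could look like in today's Erlang,
assume text() is emulated by a tagged tuple around a validated binary. The
module name, the {text, Bin} representation and both function names are
hypothetical; a real text() type would live in the runtime, not in a
library.

```erlang
-module(text_sketch).
-export([from_binary/1, to_binary/1]).
-export_type([text/0]).

%% Hypothetical representation: a tagged tuple wrapping a binary
%% that has been verified to be valid UTF-8.
-opaque text() :: {text, binary()}.

%% Verify and crash: input that is not valid UTF-8 raises badarg.
-spec from_binary(binary()) -> text().
from_binary(Bin) when is_binary(Bin) ->
    case unicode:characters_to_binary(Bin, utf8, utf8) of
        Bin -> {text, Bin};                 % round-trips unchanged: valid
        _ -> erlang:error(badarg, [Bin])    % error or incomplete tuple
    end.

%% Going back to binary() is always allowed, but has to be explicit.
-spec to_binary(text()) -> binary().
to_binary({text, Bin}) -> Bin.
```

With this, text_sketch:from_binary(<<233>>) crashes, because a lone Latin-1
byte is not valid UTF-8, while the UTF-8 encoding <<195,169>> is accepted.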

Rationale: I'd never create the string() type were I to create a new
language. A string is not defined as a list of code points, but rather as a
vector of code points (which also means they are immutable). They should
support O(1) concatenation (by having the internal representation be either
iodata() or a finger tree). But since we have so much legacy code, we are
stuck with string(), much like Haskell is, where String = [Char].
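
The iodata() variant of O(1) concatenation can be sketched as follows. The
module and function names are made up; note that iodata() holds bytes, so
for text the pieces would be UTF-8 encoded binaries.

```erlang
-module(concat_demo).
-export([concat/2, flatten/1]).

%% Joining two pieces of iodata() only builds a two-element list.
%% Neither argument is copied, so the cost is independent of their size.
-spec concat(iodata(), iodata()) -> iodata().
concat(A, B) -> [A, B].

%% The copy happens once, when a flat binary is finally needed.
-spec flatten(iodata()) -> binary().
flatten(IoData) -> iolist_to_binary(IoData).
```

For example, concat_demo:flatten(concat_demo:concat(<<"hello ">>, [<<"wor">>, "ld"]))
yields <<"hello world">>. Ports and sockets accept iodata() directly, so in
many cases the flattening step is never needed at all.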

End of BDFL rant :)


[0] In keeping with CS tradition, I'll take a term from physics and
absolutely butcher it by using it in a different context where it doesn't
belong. Bear with me, for I have sinned.