[erlang-questions] Binary string literal syntax

Wed Jun 6 15:51:19 CEST 2018

On Wed, Jun 6, 2018 at 8:56 AM, Sean Hinde <sean.hinde@REDACTED> wrote:

>
> "The default Unicode encoding in Erlang is in binaries UTF-8, which is
> also the format in which built-in functions and libraries in OTP expect to
> find binary Unicode data"
>
> There is also a strange example in the string.erl document where this
> binary <<"abc..åäö">> is not stored as UTF-8 but instead as latin-1. Having
> an unambiguous way to represent a UTF-8 string literal would also clear
> this up.
>
> That seems to point in a clear direction.
>
>
A clarification here. In Erlang, you have to be aware of the following
possible encodings:

   - "abcdef": a string, which is a plain list of Unicode codepoints. This
   means that if you write [16#1f914] you'll quite literally get "🤔" as a
   string, with no regard to encoding
   - <<"abcdef">> as a binary string, which is shorthand for <<$a, $b, $c,
   $d, $e, $f>>, that is, the old-style list of integers packed into a
   binary one byte per character. By default this format *does not*
   support Unicode encodings, and if you put a value that is too large in
   there (such as 16#1f914) by declaring a binary like <<"🤔">>, the value
   is silently truncated to its low 8 bits and the final binary is <<20>>.
   - <<"abcdef"/utf8>> as a binary string that is unicode encoded as utf-8.
   This one would work to support emojis
   - <<"abcdef"/utf16>> as a binary string that is unicode encoded as
   utf-16
   - <<"abcdef"/utf32>> as a binary string that is unicode encoded as utf-32
   - ["abcdef", <<"abcdef"/utf8>>]: iodata type list that can support
   multiple inputs. As far as I remember, your list can be codepoints as
   usual, but you'll want all the binaries to be the same encoding (ideally
   utf-8) to prevent issues where encodings get mixed
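
As a rough illustration of the list above, here is a small sketch of what
each type specifier does to the codepoint 16#1f914. The module and function
names are made up for this example; only the bit syntax itself comes from
Erlang.

```erlang
-module(literal_encodings).
-export([demo/0]).

%% Hypothetical helper: builds the same codepoint (U+1F914) in each of the
%% forms discussed above and returns them in a map for comparison.
demo() ->
    CP = 16#1F914,
    #{list    => [CP],          % plain list of codepoints, no encoding
      default => <<CP>>,        % default 8-bit integer segment: <<20>>
      utf8    => <<CP/utf8>>,   % <<240,159,164,148>>
      utf16   => <<CP/utf16>>,  % big-endian surrogate pair: <<216,62,221,20>>
      utf32   => <<CP/utf32>>}. % <<0,1,249,20>>
```

The `default` entry keeps only the low 8 bits of the integer (16#14 = 20),
which is the overflow described in the second bullet.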

When the standard library functions say they "expect to find utf-8 by
default", it means that when you call functions such as the new ones in the
string module, or those in the unicode module where parameters can be given
(e.g. unicode:characters_to_binary/1,2,3), if nothing is specified, then
utf-8 is assumed for binaries. But it does not mean that the literal binary
strings you write in code are assumed to be utf-8 by default. That's
confusing, but yeah.
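
To make the distinction concrete, here is a minimal sketch (the module and
function names are invented for illustration) of how
unicode:characters_to_binary treats its input when no encoding is given,
compared to when latin1 is declared explicitly. The bytes 229,98,99 are
what a latin-1 encoded "åbc" would contain.

```erlang
-module(default_utf8).
-export([from_codepoints/0, latin1_bytes_as_default/0,
         latin1_bytes_declared/0]).

%% A list of codepoints is encoded as UTF-8 when nothing is specified.
from_codepoints() ->
    unicode:characters_to_binary([16#E5, $b, $c]).   % <<195,165,98,99>>

%% The same text as latin-1 bytes, with no input encoding given: the
%% binary is assumed to be UTF-8, it is not valid UTF-8, and so the call
%% returns a tagged tuple instead of a binary.
latin1_bytes_as_default() ->
    unicode:characters_to_binary(<<229, $b, $c>>).

%% Declaring the input encoding lets the conversion to UTF-8 succeed.
latin1_bytes_declared() ->
    unicode:characters_to_binary(<<229, $b, $c>>, latin1).
```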

Aside from that, I would say that the choices Elixir made have one risky
side to them (the same is possible in Erlang but I'm calling it out because
I've seen it a few times in Elixir samples and Erlang has not historically
had as many examples of string handling). Because strings are utf8 binaries
by default in Elixir, whenever you feel like pattern matching iteratively,
you may do something like:

<<head::utf8, rest::binary>> which in Erlang would be <<Head/utf8,
Rest/binary>>. The risk of doing this is that this fetches text by
codepoint, whereas when doing text processing, it is often better to do it
by grapheme. The best example for that is the family emoji. By default, it
could be just a single codepoint, encoded on many bytes, giving: 👪

That's fine and good, but the problem comes from the fact that graphical
(and logical) representation is not equal to the underlying codes creating
the final character. Those exist for all kinds of possible ligatures and
assemblies of "character parts" in various languages, but for Emojis, you
can also make a family by combining individual people: 👩‍👩‍👦‍👦 is a
family composed of 4 components joined together: 👩 + 👩 + 👦 + 👦,
where + is a special joining character (a *zero width joiner*) between two
women and two boys. If you go ahead and consume that emoji using the /utf8
modifier, you'll break the family apart and change the semantic meaning of
the text.
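
A small sketch of that difference, with the family written as explicit
codepoints so it does not depend on source file encoding. The module and
function names are hypothetical, and the grapheme count assumes an OTP
release whose Unicode tables treat emoji joined by a zero width joiner as
one cluster.

```erlang
-module(family_split).
-export([family/0, first_codepoint/0, codepoint_count/0,
         grapheme_count/0]).

%% WOMAN, ZWJ, WOMAN, ZWJ, BOY, ZWJ, BOY as a UTF-8 binary.
family() ->
    unicode:characters_to_binary(
      [16#1F469, 16#200D, 16#1F469, 16#200D,
       16#1F466, 16#200D, 16#1F466]).

%% Matching with /utf8 takes only the first codepoint (one woman) and
%% leaves the rest of the family behind.
first_codepoint() ->
    <<Head/utf8, _Rest/binary>> = family(),
    Head.

%% Number of codepoints: 4 people plus 3 joiners.
codepoint_count() ->
    length(unicode:characters_to_list(family())).

%% string:length/1 counts grapheme clusters, not codepoints.
grapheme_count() ->
    string:length(family()).
```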

If you edit the text in a text editor that traditionally has good support
for locales and all kinds of per-language rules, such as Microsoft Word
(the only one I know to do a great job of automatically handling half-width
spaces and non-breakable spaces when language asks for it), pressing
backspace on 👩‍👩‍👦‍👦 will remove the whole family as one unit. If you
do it in Firefox or Chrome, deleting that one 'character' will take you 7
backspaces: one for each 'person' and one for each zero-width joining
character. Slack will consider them to be a single character and Visual
Studio Code behaves like the browsers (even though both are Electron apps),
and notepad.exe or many terminal emulators will instead expand them as 4
people and implicitly drop the zero-width joining marks.

If you want to deal with unicode strings, you really should use the string
functions from the string module (String in Elixir), and work on graphemes
or codepoints depending on the context. One interesting thing there is that
you can use these to return you strings re-built up as graphemes using the
to_graphemes functions in either language:

1> string:to_graphemes("ß↑e̊").
[223,8593,[101,778]]
2> string:to_graphemes(<<"ß↑e̊"/utf8>>).
[223,8593,[101,778]]

This lets you take any unicode string, and turn it into a list that is safe
to iterate using calls such as lists:map/2 or list comprehensions. This
can only be done through iodata(), and this might even be a better format
than what you'd get with just default UTF-8 binary strings. Pattern
matching is still risky there. Ideally you'd possibly want to do a round of
normalization first, so that characters that can be encoded in more than
one way (say â which can be a single codepoint or a + ^ as two codepoints) are
forced into a uniform representation.
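
One possible way to combine the two steps is sketched below: normalize to
NFC first, then split into graphemes. The module and function names are
made up for the example, and it assumes well-formed input (on invalid data
unicode:characters_to_nfc_list/1 returns an error tuple, which this sketch
does not handle).

```erlang
-module(normalize_then_iterate).
-export([graphemes/1, same_text/2]).

%% Normalize any chardata (list, UTF-8 binary, or a mix) to NFC, then
%% split it into a list of grapheme clusters that is safe to iterate.
graphemes(Chardata) ->
    string:to_graphemes(unicode:characters_to_nfc_list(Chardata)).

%% Two inputs are treated as the same text when their normalized
%% grapheme lists are equal.
same_text(A, B) ->
    graphemes(A) =:= graphemes(B).
```

With this, [$a, 16#302] (a followed by a combining circumflex) and [16#E2]
(the precomposed â) both come out as the single element 16#E2, whereas
string:to_graphemes/1 alone would keep the first one as the cluster
[97,770].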

The thing that I'm worried about is how could we make the richness (and
pitfalls!) of Unicode handling easier to deal with. So far I've been
pleasantly surprised that having no native string type and using codepoints
by default did force me to learn a bunch about Unicode and how to do it
right, but it's difficult to think that this is the optimal path for
everyone.

If I had to argue for something, it would be that a good "beginner" string
type would be an opaque one that inherently carries its own encoding, and
cannot be pattern-matched on unless you use a 'graphemed' + normalized
iodata structure. If you wanted to switch to codepoints for handling, then
you could convert it to a binary or to another type. But even then this
would have a weakness because you would necessarily be forced to convert
from say, a utf-8 byte stream coming from a socket, onto a different
format: this is exactly what is annoying people today when they just want
the damn strings to use "abc" because it's shorter to write.

I personally think that this is a clash between correctness and
convenience. Currently Erlang is not necessarily 'correct', but it at least
does not steer you entirely wrong through convenience since using utf8 (the
default many people want) is cumbersome. I'd personally go for a 'correct'
option (strongly typed strings that require conversions between formats
based on context and usage), but I fear that this thread and most
complaints about syntax worry first about convenience, and I don't know
that they're easy to reconcile.