[erlang-questions] Binary string literal syntax
Wed Jun 6 14:56:05 CEST 2018
> On 6 Jun 2018, at 14:00, Vlad Dumitrescu <vladdu55@REDACTED> wrote:
> I have a few thoughts about this. I would favor the proposed syntax, but not if things don't get simpler. What I mean is that there's more to consider.
I was aware of having missed a few details, and aware I was undoubtedly unaware of more :)
> - Some modules don't handle binary strings, but lists of chars; most notably erl_scan. If the syntaxes are too close, it might be even more confusing when to use which form.
Very true, though equally some modules don’t handle lists of chars. erl_scan is a big one, but I guess we are all used to the endless round of list to binary and vice versa in these cases.
> - The new string functions work with strings as sequences of grapheme clusters. The "list strings" are lists of characters, so for example calling length() on the two representations of the same text may not return the same value. Most notably, CRLF is one grapheme cluster, but two characters.
That is a big question. How should they be represented? I was happily assuming UTF-8, but maybe it would make more sense for them to be compatible with the new string module and be stored as sequences of grapheme clusters.
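To make the length mismatch concrete, here is a minimal sketch comparing length/1, string:length/1 (which counts what the string module calls grapheme clusters) and byte_size/1 on the UTF-8 encoding of the same text. The module and function names (len_demo, lengths/1, examples/0) are made up for illustration.

```erlang
-module(len_demo).
-export([lengths/1, examples/0]).

%% Report three notions of "length" for the same text (hypothetical helper).
lengths(Text) when is_list(Text) ->
    Bin = unicode:characters_to_binary(Text),   % UTF-8 encoded binary
    #{codepoints => length(Text),               % elements of the char list
      graphemes  => string:length(Text),        % grapheme clusters
      bytes      => byte_size(Bin)}.            % bytes of the UTF-8 form

examples() ->
    [lengths("a\r\nb"),        % CRLF: 4 code points, 3 clusters, 4 bytes
     lengths("e\x{301}")].     % e + combining acute: 2 code points, 1 cluster, 3 bytes
```

Whichever representation a new literal syntax picks, at least two of these three numbers will disagree for some inputs, so the choice mostly decides which conversions become implicit.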
Looking around, there seems to be a good range of sensible options. Elixir defaults to UTF-8 for string literals, while Swift uses Unicode scalars in its internal string representation, forcing a conversion to get a byte-based representation.
With my protocol hat on I think I would pick UTF-8, as that is the most likely external representation; in many cases we would never need to convert, which keeps things efficient. I can see arguments for this being poor design for a language, though.
> - When working with a textual protocol, it's still quite often that one would use <<"prefix"/utf8, Rest/binary>>, where the current syntax still has to be used. It might be confusing?
<<#"prefix", Rest/binary>> ?
Definitely room for deeper thought here.
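For comparison, this is roughly what prefix matching looks like with the syntax that exists today. It is only a sketch: the module name, the "GET " prefix and the non-ASCII greeting are hypothetical, chosen to show where the /utf8 annotation matters.

```erlang
-module(prefix_demo).
-export([parse/1]).

%% ASCII-only prefix: the plain literal and the /utf8 form are byte-identical.
parse(<<"GET ", Rest/binary>>) ->
    {get, Rest};
%% Non-ASCII prefix: /utf8 is needed so the literal matches UTF-8 input.
%% "\x{e9}" is e-acute, written as an escape to avoid source-encoding issues.
parse(<<"h\x{e9}llo "/utf8, Rest/binary>>) ->
    {hello, Rest};
parse(Other) when is_binary(Other) ->
    {unknown, Other}.
```

Any new literal form would have to coexist with clauses like these, since the Rest/binary tail can only be expressed inside <<...>>.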
> - The predefined type string() is still [char()], and for binary strings there is unicode:chardata(), which is not necessarily obvious (as these are handled by the string module).
There is a type for unicode_binary() in the unicode module which refers to a utf8 binary string. The unicode.erl docs go as far as saying:
"The default Unicode encoding in Erlang is in binaries UTF-8, which is also the format in which built-in functions and libraries in OTP expect to find binary Unicode data"
There is also a strange example in the string.erl documentation where the binary <<"abc..åäö">> is stored not as UTF-8 but as Latin-1. Having an unambiguous way to write a UTF-8 string literal would also clear this up.
That seems to point in a clear direction.
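A small sketch of the Latin-1 versus UTF-8 difference, using the same three characters (åäö) written as escapes so the result does not depend on the source file encoding. The module and function names are made up for illustration.

```erlang
-module(enc_demo).
-export([encodings/0]).

encodings() ->
    %% Without a type specifier each code point becomes one byte (Latin-1).
    Latin1 = <<"\x{e5}\x{e4}\x{f6}">>,
    %% With /utf8 each of these code points is encoded as two bytes.
    Utf8 = <<"\x{e5}\x{e4}\x{f6}"/utf8>>,
    %% Explicit conversion from the Latin-1 bytes to UTF-8.
    Converted = unicode:characters_to_binary(Latin1, latin1, utf8),
    {byte_size(Latin1), byte_size(Utf8), Converted =:= Utf8}.   % {3, 6, true}
```

The two literals differ only by the /utf8 suffix, which is easy to forget; that is the ambiguity a dedicated literal syntax would remove.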
Excellent input, thank you.