[erlang-questions] Erlang string datatype [cookbook entry #1 - unicode/UTF-8 strings]

Fri Oct 21 23:12:31 CEST 2011

My preferences:

1) Binaries are pretty good as they are. If nothing happens, and I have to
live with binaries, I won't complain.
However, binaries pretty much have to live with UTF-8 encoding to stay sane.

2) Strings probably should be of "code points" not "bytes," so they would
use 2-byte or 4-byte characters. Note that Windows uses sizeof(wchar_t) ==
2, and Linux uses sizeof(wchar_t) == 4, so there's no unambiguously "right"
choice.

3) Operations include (erlang-style) matching, (grep-style) finding,
converting, splitting, joining and trimming.

4) Random access is important! As is slicing. string:substr(from,to) should
be O(1).

5) Going to-string-from-binary, and to-binary-from-string is important. You
need to specify the encoding, too -- ascii/7-bit, iso-8859-1/8-bit, utf-8,
utf-16, 16-bit, 32-bit, etc. Both ways.

6) Ideally, reference count immutable string data so that substring
extraction is cheap. This should probably be within the current process --
there should not be overhead of locking external heaps to do most string
operations (say, for strings <= 256 code points). (This is more
implementation specifics, though)

7) All APIs that take strings-as-lists should take strings-as-strings.
Anything from file I/O to regex to json.

8) Easy conversion between binary, list (if needed at all), iolist, and
string. Strings should be "native" in iolists.

9) Do not intern strings, ever, for any reason.

10) If "stringbuilder" style semantics are transparently used behind the
scenes, that would be great. Specifically, adding a bunch of different
sub-strings up could just aggregate into a list of pieces, until such time
as an operation that needs "the whole string" is encountered, and at that
point, the pieces are transparently combined.

11) strings are first-class -- you can guard on them, pattern match on them,
etc. Ideally with binary-style or some kind of BIF syntax for primitive
operations.

Sub-question: Can you find a nicer syntax than just "string" module functino
calls?

my_func(X) when string:startswith(X, "foobar") ->
    do_a();
my_func(X) when is_string(X) ->
    do_b();
my_func(X) ->
    do_c().

List syntax?
my_func(X = "foobar" ++ _) ->
    do_a();
...

Binary syntax?
my_func(X = <<"foobar"/string,_/string>>) ->
    do_a();
...

Sincerely,

jw

--
Americans might object: there is no way we would sacrifice our living
standards for the benefit of people in the rest of the world. Nevertheless,
whether we get there willingly or not, we shall soon have lower consumption
rates, because our present rates are unsustainable.

On Thu, Oct 20, 2011 at 10:42 AM, Robert Virding <
robert.virding@REDACTED> wrote:

> This discussion of Erlang strings/unicode/string type/... crops up
> regularly. What I would like to discuss around this is: if we were to add a
> string datatype to Erlang what properties should it have? If we skip
> internal implementation details how should it behave? What type of
> operations could do on them? And why? For example would I want to be able to
> step over/create them in a similar manner as with lists? An when I access a
> "character" (whatever that may be) what do I see?
> Encoded/unencoded/codepoint or what?
>
> Once that has been agreed on, if we can, then it would be relatively easy
> to implement a string data type. If we want to.
>
> Robert
>
> ------------------------------
>
> Sorry, adding the ML back to the reply. Forgot about it :)
>
> On Thu, Oct 20, 2011 at 11:06 AM, Fred Hebert <mononcqc@REDACTED> wrote:
>
>>
>>
>> On Thu, Oct 20, 2011 at 9:26 AM, Joe Armstrong <erlang@REDACTED> wrote:
>>
>>> On Thu, Oct 20, 2011 at 2:54 PM, Fred Hebert <mononcqc@REDACTED> wrote:
>>> > No, list_to_binary and iolist_to_binary are not considered harmful.
>>>
>>> But they are dangerous. list_to_binary can fail when its argument is
>>> not list of 0.255 integers
>>> binary_to_list always works. term_to_binary and its inverse
>>> binary_to_term can never fail
>>>  (and I know about the danger of does this with Pids and Refs and funs
>>> ...)
>>>
>>
>> They are somewhat dangerous. binary_to_list -> list_to_binary should never
>> fail. The same way term_to_binary -> binary_to_term should never fail, but
>> binary_to_term -> term_to_binary can fail if you give it wrong input.
>>
>> It works the same. The problem is likely more in the rather
>> non-descriptive name of 'list_to_binary' compared to 'iolist_to_binary',
>> where you do expect the input to adhere to the iolist data type.
>>
>>
>>>
>>> >
>>> > The problem is that they both implicitly convert integers on the range
>>> > 0..255 inclusively. This works based on the definition that iolists
>>> only
>>> > contain bytes or binaries. list_to_binary will accept all iolists
>>> (deeply
>>> > nested lists of bytes (0..255) and binaries) and iolist_to_binary will
>>> > accept both iolists or flat binaries.
>>> > They were apparently meant to convert between lists and binaries of
>>> bytes,
>>> > not lists or binaries of arbitrary large (or negative) numbers. Because
>>> we
>>> > Erlang programmers have been so relying on the idea of lists as
>>> strings,
>>> > using ASCII and most Latin1 sequences of bytes made for fine
>>> conversions to
>>> > binaries and nobody ever had a problem.
>>> > Unicode standards (and their respective UTF encodings) break this
>>> assumption
>>> > that strings most of the time only contain ASCII and Latin1 integers
>>> for
>>> > their codepoints. This is why you need the unicode module's conversion
>>> > there.
>>> > The same is then true of binary_to_list. The trouble there is that the
>>> > binary representation (in bytes) of unicode strings doesn't match the
>>> list
>>> > representation of unicode as accepted by ~ts printing and whatnot. The
>>> raw
>>> > bytes representation isn't good enough, and that's what binary_to_list
>>> gives
>>> > you. Again, the unicode module is clever enough to handle that.
>>> > iolist_to_binary, list_to_binary and binary_to_list are fine when you
>>> know
>>> > they're meant to be used for bytes, not arbitrary data.
>>> > What's hurting Erlang more, I think, is the lack of Unicode algorithms
>>> as
>>> > described by Michael Uvarov. If we have a unicode string, we currently
>>> can't
>>> > get its length (in terms of graphemes, or 'characters to the human
>>> mind')
>>>
>>> I don't understand the terminology here. What is "a unicode string?"
>>
>>
>>> In Erlang "string" means "list of integers"
>>>
>>> So when you write "a unicode string" my brain interprets this as a "a
>>> list of unicode codepoints"
>>>
>>
>> I use string as a generic idea of a data type. Scheme has unicode strings,
>> Python has them, Javascript has them. Erlang doesn't properly have strings.
>> It has lists of integers that overloaded to be interpreted as a string. If
>> your list of integers (or binary) doesn't respect the encoding Erlang tries
>> to overload, then your string isn't properly unicode.
>>
>> So [49,48,8364] could be a unicode string (easy to figure out with that
>> 8364 sticking out, which is a valid unicode codepoint, and an invalid
>> Latin-1 character), but [49,48,226,130,172] contains no information
>> regarding whether it should be latin-1 (10â ¬) or unicode (utf-8, 16 or 32?)
>> (10€). In fact, it can have different unicode results depending on if you
>> treat the list as bytes or as a string. If you treat it as a string, it'll
>> see all integers as independent codepoints and won't be able to put
>> characters together into graphemes.
>>
>> The current way to do it is to use lists of codepoints for unicode strings
>> of one kind, and to use sequences of bytes in binaries for unicode strings
>> of another kind (with a more precise encoding, where you've got to pick
>> between utf8, utf16 and utf32) as far as I understand things.
>>
>> Therefore, I tend to treat 'valid unicode string' as something
>> unambiguous. Either a list of codepoints or a binary made out of bytes
>> respecting a given encoding. If you have a list of bytes, then it's not a
>> unicode string becaue it's very ambiguous and context-sensitive.
>>
>>
>>>
>>> So the "unicode string" for "10(euros)" is the erlang list [49,48,8364]
>>> and the length of this list is three (ie the number of 'characters in my
>>> mind')
>>>
>>> The problem is that if I just write X = [N1,N2,N3,....] where all the
>>> N's are in 0..255
>>> I cannot see at a glance if this is supposed to represent (say) an
>>> ascii string or
>>> a UTF-8 encoded string of unicode codepoints. There is no way of
>>> knowning. So I must add some
>>> wrapper, ie
>>>
>>>      X1 = {ascii, "abc"} or
>>>
>>>      X2 = {utf8encoded,unicode,[49,48,226,130,172]}
>>>
>>>      X3 = {unicode, "10\x{20ac}"} = {unicode, [49,48,8364]}
>>>
>>>
>>> So now I know that the string in X1 is a "asci string" (unencoded)
>>> and the string in X2 is the utf8 encoding of the unicode codepoints
>>> [49,48,8364]
>>> and the list in X3 is what I call a "unicode string"
>>>
>>
>> Yes, that's a way to do it. You're using tagged tuples to contain type
>> information on the data you carry around. This means, however, that you need
>> to change the definitions of iolists, how the unicode module works, etc. to
>> get something compatible everywhere. Then the question becomes why not
>> support them out of the box?
>>
>> Python's got things like raw strings (r"this is not escaped: \n"), normal
>> strings ("this is a linebreak: \n") or unicode strings (u"hey there"). I do
>> not like the way they did it because conversion is frankly terrible and the
>> unicode string vs. its encoding are unclear, but at least you've got native
>> support for the typing there.
>>
>> It's a tough nut to crack. In one case, we keep going with ad-hoc systems,
>> in the other, we must introduce new datatypes, change a lot of how the
>> language works, possibly break compatibility.
>>
>> If we keep going the way we are, we'll need some serious community effort
>> to get rid of the 'strings are just lists of integers' mentality, because
>> it's more complex than that, I think.
>>
>>
>>>
>>> Its just a matter of defining a convention and sticking to it - the
>>> alternative would be to make a
>>> new string type (a list with some invisible bits) but this would be
>>> rather complicated and
>>> probably not worth the effort.
>>>
>>
>> Yep. Sorry, I typed my response above before reading the end of your reply
>> :)
>>
>>>
>>> /Joe
>>>
>>>
>>> > in
>>> > any reliable way. We also can't specify locales, can't do casing and
>>> > whatnot, etc. Ideally those would be the next step for Erlang's unicode
>>> > support, I think.
>>> > On Thu, Oct 20, 2011 at 4:23 AM, Joe Armstrong <erlang@REDACTED>
>>> wrote:
>>> >>
>>> >> Interesting comment: this is almost where I could write an article
>>> with
>>> >> the
>>> >> title "list_to_binary considered harmful" - I guess if Erlang is
>>> >> serializing terms
>>> >> to be stored on disk etc. term_to_binary and its inverse should be
>>> used.
>>> >> list_to_binary seems to imply that you are going to send something to
>>> the
>>> >> outside world - and then you should stop and think hard, this is
>>> >> because there is
>>> >> no universal agreement in the outside world as to what an integer is
>>> >> (ie is it bounded or not)
>>> >> fixing a notion of an integer to something in the range 0..255 allows
>>> >> communication of
>>> >> integers, but requires a framing protocol (ie UTF8, or ASN.1) that
>>> >> tells how integers
>>> >> are encoded - but this is out of band.
>>> >>
>>> >> The problem is that I might write
>>> >>
>>> >>    X1 = "10$"    (10 dollars) or
>>> >>    X2 = "10\x{20ac}"  (10 euros)
>>> >>
>>> >> Now list_to_binary(X1) will succeed but list_to_binary(X2) will fail
>>> >>
>>> >> So maybe I should write
>>> >>
>>> >>    X1 = {ansii, "10$"}
>>> >>    X2 = {unicode,"10\x{20ac}"}
>>> >>
>>> >> If the libraries were written this way then life might be easier
>>> >>
>>> >> /Joe
>>> >>
>>> >
>>> >
>>>
>>
>>
>
> _______________________________________________
> erlang-questions mailing list
> erlang-questions@REDACTED
> http://erlang.org/mailman/listinfo/erlang-questions
>
>
>
> _______________________________________________
> erlang-questions mailing list
> erlang-questions@REDACTED
> http://erlang.org/mailman/listinfo/erlang-questions
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://erlang.org/pipermail/erlang-questions/attachments/20111021/4b736d74/attachment.htm>