[erlang-questions] Erlang string datatype [cookbook entry #1 - unicode/UTF-8 strings]

Thu Oct 20 19:42:32 CEST 2011

This discussion of Erlang strings/unicode/string type/... crops up regularly. What I would like to discuss around this is: if we were to add a string datatype to Erlang, what properties should it have? If we skip internal implementation details, how should it behave? What type of operations could we do on them? And why? For example, would I want to be able to step over/create them in a similar manner as with lists? And when I access a "character" (whatever that may be), what do I see? Encoded/unencoded/codepoint or what?

Once that has been agreed on, if we can, then it would be relatively easy to implement a string data type. If we want to. 

Robert 

----- Original Message -----

> Sorry, adding the ML back to the reply. Forgot about it :)

> On Thu, Oct 20, 2011 at 11:06 AM, Fred Hebert < mononcqc@REDACTED > wrote:

> > On Thu, Oct 20, 2011 at 9:26 AM, Joe Armstrong < erlang@REDACTED > wrote:

> > > On Thu, Oct 20, 2011 at 2:54 PM, Fred Hebert < mononcqc@REDACTED > wrote:

> > > > No, list_to_binary and iolist_to_binary are not considered harmful.

> > > But they are dangerous. list_to_binary can fail when its argument is
> > > not a list of 0..255 integers.
> > > binary_to_list always works. term_to_binary and its inverse
> > > binary_to_term can never fail
> > > (and I know about the danger of doing this with Pids and Refs and funs ...)

> > They are somewhat dangerous. binary_to_list -> list_to_binary
> > should
> > never fail. The same way term_to_binary -> binary_to_term should
> > never fail, but binary_to_term -> term_to_binary can fail if you
> > give it wrong input.
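
A small sketch of the round trips discussed above, in case it helps to see them side by side. The module name roundtrip_sketch and the helper raises_badarg/2 are made up for illustration; the BIFs themselves are the standard ones.

```erlang
-module(roundtrip_sketch).
-export([demo/0]).

%% Hypothetical helper: true if F(Arg) raises badarg, false if it returns.
raises_badarg(F, Arg) ->
    try F(Arg) of
        _ -> false
    catch
        error:badarg -> true
    end.

demo() ->
    Bytes = [49,48,36],                      % "10$", every element in 0..255
    <<"10$">> = list_to_binary(Bytes),
    %% binary -> list -> binary cannot fail: a binary only ever holds bytes.
    <<"10$">> = list_to_binary(binary_to_list(<<"10$">>)),
    %% 8364 (the euro sign) is not a byte, so list_to_binary raises badarg.
    true = raises_badarg(fun erlang:list_to_binary/1, [49,48,8364]),
    %% term -> binary -> term round-trips any term, codepoints included.
    Term = {unicode, [49,48,8364]},
    Term = binary_to_term(term_to_binary(Term)),
    %% The other direction starts from arbitrary bytes and may be rejected.
    true = raises_badarg(fun erlang:binary_to_term/1, <<1,2,3>>),
    ok.
```

Calling roundtrip_sketch:demo() should return ok if each of the claims holds on your VM.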
> 

> > It works the same. The problem is likely more in the rather
> > non-descriptive name of 'list_to_binary' compared to
> > 'iolist_to_binary', where you do expect the input to adhere to the
> > iolist data type.
> 

> > > >
> > 
> 
> > > > The problem is that they both implicitly convert integers on
> > > > the
> > > > range
> > 
> 
> > > > 0..255 inclusively. This works based on the definition that
> > > > iolists
> > > > only
> > 
> 
> > > > contain bytes or binaries. list_to_binary will accept all
> > > > iolists
> > > > (deeply
> > 
> 
> > > > nested lists of bytes (0..255) and binaries) and
> > > > iolist_to_binary
> > > > will
> > 
> 
> > > > accept both iolists or flat binaries.
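
To make the difference between the two BIFs concrete, here is a rough sketch. The module name iolist_sketch and the helper raises_badarg/2 are assumptions for the example, not anything from OTP.

```erlang
-module(iolist_sketch).
-export([demo/0]).

%% Hypothetical helper: true if F(Arg) raises badarg, false if it returns.
raises_badarg(F, Arg) ->
    try F(Arg) of
        _ -> false
    catch
        error:badarg -> true
    end.

demo() ->
    %% A deeply nested list of bytes and binaries.
    Deep = [49, [48, <<" ">>], [[<<"EUR">>]]],
    <<"10 EUR">> = list_to_binary(Deep),
    <<"10 EUR">> = iolist_to_binary(Deep),
    %% Only iolist_to_binary also takes a bare (flat) binary.
    <<"10">> = iolist_to_binary(<<"10">>),
    true = raises_badarg(fun erlang:list_to_binary/1, <<"10">>),
    %% Neither accepts integers outside 0..255.
    true = raises_badarg(fun erlang:iolist_to_binary/1, [49,48,8364]),
    true = raises_badarg(fun erlang:list_to_binary/1, [-1]),
    ok.
```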

> > > > They were apparently meant to convert between lists and binaries of bytes,
> > > > not lists or binaries of arbitrarily large (or negative) numbers. Because we
> > > > Erlang programmers have relied so much on the idea of lists as strings,
> > > > using ASCII and most Latin1 sequences of bytes made for fine conversions to
> > > > binaries and nobody ever had a problem.

> > > > Unicode standards (and their respective UTF encodings) break this assumption
> > > > that strings most of the time only contain ASCII and Latin1 integers for
> > > > their codepoints. This is why you need the unicode module's conversion
> > > > there.

> > > > The same is then true of binary_to_list. The trouble there is that the
> > > > binary representation (in bytes) of unicode strings doesn't match the list
> > > > representation of unicode as accepted by ~ts printing and whatnot. The raw
> > > > bytes representation isn't good enough, and that's what binary_to_list gives
> > > > you. Again, the unicode module is clever enough to handle that.
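
A minimal sketch of what the unicode module does here that binary_to_list does not, using the "10 euros" example from this thread. The module name unicode_sketch is made up; unicode:characters_to_binary/1 and unicode:characters_to_list/1 are the standard calls, with their default UTF-8 encoding.

```erlang
-module(unicode_sketch).
-export([demo/0]).

demo() ->
    Codepoints = [49,48,8364],                 % "10" followed by the euro sign
    %% Encode the codepoints as UTF-8; the euro sign becomes three bytes.
    Utf8 = unicode:characters_to_binary(Codepoints),
    <<49,48,226,130,172>> = Utf8,
    %% binary_to_list hands back the raw bytes, not the codepoints.
    [49,48,226,130,172] = binary_to_list(Utf8),
    %% characters_to_list decodes the UTF-8 back into codepoints.
    Codepoints = unicode:characters_to_list(Utf8),
    ok.
```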

> > > > iolist_to_binary, list_to_binary and binary_to_list are fine when you know
> > > > they're meant to be used for bytes, not arbitrary data.

> > > > What's hurting Erlang more, I think, is the lack of Unicode algorithms as
> > > > described by Michael Uvarov. If we have a unicode string, we currently can't
> > > > get its length (in terms of graphemes, or 'characters to the human mind')

> > > I don't understand the terminology here. What is "a unicode string"?
> > > In Erlang "string" means "list of integers"

> > > So when you write "a unicode string" my brain interprets this as
> > > "a list of unicode codepoints"

> > I use string as a generic idea of a data type. Scheme has unicode
> > strings, Python has them, Javascript has them. Erlang doesn't
> > properly have strings. It has lists of integers that are overloaded to
> > be interpreted as a string. If your list of integers (or binary)
> > doesn't respect the encoding Erlang tries to overload, then your
> > string isn't properly unicode.
> 

> > So [49,48,8364] could be a unicode string (easy to figure out with
> > that 8364 sticking out, which is a valid unicode codepoint, and an
> > invalid Latin-1 character), but [49,48,226,130,172] contains no
> > information regarding whether it should be latin-1 (10â ¬) or
> > unicode (utf-8, 16 or 32?) (10€). In fact, it can have different
> > unicode results depending on if you treat the list as bytes or as a
> > string. If you treat it as a string, it'll see all integers as
> > independent codepoints and won't be able to put characters together
> > into graphemes.
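
The ambiguity of [49,48,226,130,172] can be shown directly: the same five integers give three different results depending on what you assume they are. This is only a sketch; the module name ambiguity_sketch is invented for the example.

```erlang
-module(ambiguity_sketch).
-export([demo/0]).

demo() ->
    Ambiguous = [49,48,226,130,172],
    Bytes = list_to_binary(Ambiguous),
    %% Read as UTF-8 bytes: three codepoints, "10" plus the euro sign.
    [49,48,8364] = unicode:characters_to_list(Bytes, utf8),
    %% Read as Latin-1 bytes: five codepoints, one per byte.
    [49,48,226,130,172] = unicode:characters_to_list(Bytes, latin1),
    %% Read as a list of codepoints and encoded to UTF-8: each value above
    %% 127 takes two bytes, so the result is eight bytes long.
    <<49,48,195,162,194,130,194,172>> =
        unicode:characters_to_binary(Ambiguous),
    ok.
```

Nothing in the list itself says which of the three readings was intended, which is the point being made above.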
> 

> > The current way to do it is to use lists of codepoints for unicode
> > strings of one kind, and to use sequences of bytes in binaries for
> > unicode strings of another kind (with a more precise encoding,
> > where
> > you've got to pick between utf8, utf16 and utf32) as far as I
> > understand things.
> 

> > Therefore, I tend to treat 'valid unicode string' as something
> > unambiguous. Either a list of codepoints or a binary made out of
> > bytes respecting a given encoding. If you have a list of bytes, then
> > it's not a unicode string because it's very ambiguous and
> > context-sensitive.
> 

> > > So the "unicode string" for "10(euros)" is the erlang list [49,48,8364]
> > > and the length of this list is three (ie the number of 'characters in
> > > my mind')

> > > The problem is that if I just write X = [N1,N2,N3,....] where all the
> > > N's are in 0..255 I cannot see at a glance if this is supposed to
> > > represent (say) an ascii string or a UTF-8 encoded string of unicode
> > > codepoints. There is no way of knowing. So I must add some wrapper, ie

> > > X1 = {ascii, "abc"} or

> > > X2 = {utf8encoded,unicode,[49,48,226,130,172]}

> > > X3 = {unicode, "10\x{20ac}"} = {unicode, [49,48,8364]}

> > > So now I know that the string in X1 is an "ascii string" (unencoded)
> > > and the string in X2 is the utf8 encoding of the unicode codepoints
> > > [49,48,8364]
> > > and the list in X3 is what I call a "unicode string"
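
One way the tagged-tuple convention could be consumed, as a sketch only. The tags are the ones from X1..X3 above; the module tagged_sketch and the function to_codepoints/1 are hypothetical and not part of OTP.

```erlang
-module(tagged_sketch).
-export([to_codepoints/1, demo/0]).

%% Normalise any of the three tagged forms to a list of codepoints.
%% The tags are a convention assumed for this sketch only.
to_codepoints({ascii, L}) ->
    %% Crashes with badmatch if any element is outside 0..127.
    true = lists:all(fun(C) -> is_integer(C) andalso C >= 0 andalso C =< 127 end, L),
    L;
to_codepoints({utf8encoded, unicode, Bytes}) ->
    %% Bytes must be a list of 0..255 integers holding valid UTF-8.
    case unicode:characters_to_list(list_to_binary(Bytes), utf8) of
        L when is_list(L) -> L;
        _ -> error(badarg)                   % error or incomplete tuple
    end;
to_codepoints({unicode, L}) ->
    L.

demo() ->
    "abc" = to_codepoints({ascii, "abc"}),
    [49,48,8364] = to_codepoints({utf8encoded, unicode, [49,48,226,130,172]}),
    [49,48,8364] = to_codepoints({unicode, "10\x{20ac}"}),
    %% Once normalised, length/1 counts codepoints in every case.
    3 = length(to_codepoints({utf8encoded, unicode, [49,48,226,130,172]})),
    ok.
```

The cost is that every function taking a string has to know about the tags.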
> > 
> 

> > Yes, that's a way to do it. You're using tagged tuples to contain
> > type information on the data you carry around. This means, however,
> > that you need to change the definitions of iolists, how the unicode
> > module works, etc. to get something compatible everywhere. Then the
> > question becomes why not support them out of the box?
> 

> > Python's got things like raw strings (r"this is not escaped: \n"),
> > normal strings ("this is a linebreak: \n") or unicode strings
> > (u"hey there"). I do not like the way they did it because conversion is
> > frankly terrible and the unicode string vs. its encoding are
> > unclear, but at least you've got native support for the typing there.

> > It's a tough nut to crack. In one case, we keep going with ad-hoc
> > systems, in the other, we must introduce new datatypes, change a lot
> > of how the language works, possibly break compatibility.
> 

> > If we keep going the way we are, we'll need some serious community
> > effort to get rid of the 'strings are just lists of integers'
> > mentality, because it's more complex than that, I think.
> 

> > > It's just a matter of defining a convention and sticking to it - the
> > > alternative would be to make a new string type (a list with some
> > > invisible bits) but this would be rather complicated and probably not
> > > worth the effort.

> > Yep. Sorry, I typed my response above before reading the end of
> > your
> > reply :)
> 

> > > /Joe
> > 
> 

> > > > in
> > > > any reliable way. We also can't specify locales, can't do casing and
> > > > whatnot, etc. Ideally those would be the next step for Erlang's unicode
> > > > support, I think.
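
For anyone reading this later: as far as I know, OTP releases from 20 onwards added grapheme-aware functions to the string module, which cover the length part of this wish list. The sketch below assumes such a release; the module name grapheme_sketch is made up for the example.

```erlang
-module(grapheme_sketch).
-export([demo/0]).

%% Assumes OTP 20 or later, where string:length/1 and
%% string:to_graphemes/1 work on grapheme clusters.
demo() ->
    %% "e" followed by U+0301 COMBINING ACUTE ACCENT:
    %% two codepoints, but one 'character to the human mind'.
    S = [101, 769],
    2 = length(S),                           % counts list elements (codepoints)
    1 = string:length(S),                    % counts grapheme clusters
    [[101,769]] = string:to_graphemes(S),    % the cluster stays together
    ok.
```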

> > > > On Thu, Oct 20, 2011 at 4:23 AM, Joe Armstrong < erlang@REDACTED > wrote:
> > > >>
> > > >> Interesting comment: this is almost where I could write an article with
> > > >> the title "list_to_binary considered harmful" - I guess if Erlang is
> > > >> serializing terms to be stored on disk etc. term_to_binary and its
> > > >> inverse should be used.
> > > >> list_to_binary seems to imply that you are going to send something to the
> > > >> outside world - and then you should stop and think hard, this is
> > > >> because there is no universal agreement in the outside world as to what
> > > >> an integer is (ie is it bounded or not)
> > > >> fixing a notion of an integer to something in the range 0..255 allows
> > > >> communication of integers, but requires a framing protocol (ie UTF8, or
> > > >> ASN.1) that tells how integers are encoded - but this is out of band.
> > > >>
> > > >> The problem is that I might write
> > > >>
> > > >> X1 = "10$" (10 dollars) or
> > > >> X2 = "10\x{20ac}" (10 euros)
> > > >>
> > > >> Now list_to_binary(X1) will succeed but list_to_binary(X2) will fail
> > > >>
> > > >> So maybe I should write
> > > >>
> > > >> X1 = {ansii, "10$"}
> > > >> X2 = {unicode,"10\x{20ac}"}
> > > >>
> > > >> If the libraries were written this way then life might be easier
> > > >>
> > > >> /Joe

> _______________________________________________
> erlang-questions mailing list
> erlang-questions@REDACTED
> http://erlang.org/mailman/listinfo/erlang-questions