[erlang-questions] Erlang string datatype [cookbook entry #1 - unicode/UTF-8 strings]

David Mercer dmercer@REDACTED
Fri Oct 21 16:40:00 CEST 2011



From: erlang-questions-bounces@REDACTED [mailto:erlang-questions-bounces@REDACTED] On Behalf Of Björn-Egil Dahlberg
Sent: Friday, October 21, 2011 4:41 AM
To: Robert Virding
Cc: Erlang
Subject: Re: [erlang-questions] Erlang string datatype [cookbook entry #1 - unicode/UTF-8 strings]



Sent from my iPad

On 20 Oct 2011, at 19:42, Robert Virding <robert.virding@REDACTED> wrote:

This discussion of Erlang strings/unicode/string type/... crops up regularly. What I would like to discuss around this is: if we were to add a string datatype to Erlang, what properties should it have? If we skip internal implementation details, how should it behave? What type of operations could we do on them? And why? For example, would I want to be able to step over/create them in a similar manner as with lists? And when I access a "character" (whatever that may be), what do I see? Encoded/unencoded/codepoint or what?

Why? For simplicity and to reduce memory footprint.


I want to use them in iolists, but for that to work the string libraries need to understand iolists in addition to encodings.


My premise is that the string datatype understands encodings. At least utf8. 


Should we confine ourselves to cons notation? I think not.


// Björn-Egil

Once that has been agreed on, if we can, then it would be relatively easy to implement a string data type. If we want to.




Sorry, adding the ML back to the reply. Forgot about it :)

On Thu, Oct 20, 2011 at 11:06 AM, Fred Hebert <mononcqc@REDACTED> wrote:


On Thu, Oct 20, 2011 at 9:26 AM, Joe Armstrong <erlang@REDACTED> wrote:

On Thu, Oct 20, 2011 at 2:54 PM, Fred Hebert <mononcqc@REDACTED> wrote:
> No, list_to_binary and iolist_to_binary are not considered harmful.

But they are dangerous. list_to_binary can fail when its argument is
not a list of integers in 0..255.
binary_to_list always works. term_to_binary and its inverse
binary_to_term can never fail
 (and I know about the danger of doing this with Pids and Refs and funs ...)


They are somewhat dangerous. binary_to_list -> list_to_binary should never fail. The same way term_to_binary -> binary_to_term should never fail, but binary_to_term -> term_to_binary can fail if you give it wrong input. 


It works the same. The problem is likely more in the rather non-descriptive name of 'list_to_binary' compared to 'iolist_to_binary', where you do expect the input to adhere to the iolist data type.
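
As a minimal sketch of the point above (the module and function names are made up for illustration), the round trip from a binary through a list of bytes is safe, while list_to_binary raises badarg as soon as the list holds an integer outside 0..255:

```erlang
-module(bytes_demo).
-export([roundtrip/1, safe_list_to_binary/1]).

%% binary_to_list/1 followed by list_to_binary/1 gives back the same binary,
%% because every element of the intermediate list is a byte in 0..255.
roundtrip(Bin) when is_binary(Bin) ->
    list_to_binary(binary_to_list(Bin)) =:= Bin.

%% list_to_binary/1 raises badarg on anything that is not an iolist of bytes,
%% for example a list holding the codepoint 8364. This hypothetical wrapper
%% turns that exception into a tagged return value.
safe_list_to_binary(List) ->
    try list_to_binary(List) of
        Bin -> {ok, Bin}
    catch
        error:badarg -> {error, not_bytes}
    end.
```

With this, bytes_demo:safe_list_to_binary("10$") returns {ok,<<"10$">>}, while bytes_demo:safe_list_to_binary([49,48,8364]) returns {error,not_bytes}.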


> The problem is that they both implicitly convert integers on the range
> 0..255 inclusively. This works based on the definition that iolists only
> contain bytes or binaries. list_to_binary will accept all iolists (deeply
> nested lists of bytes (0..255) and binaries) and iolist_to_binary will
> accept both iolists or flat binaries.
> They were apparently meant to convert between lists and binaries of bytes,
> not lists or binaries of arbitrary large (or negative) numbers. Because we
> Erlang programmers have been so relying on the idea of lists as strings,
> using ASCII and most Latin1 sequences of bytes made for fine conversions to
> binaries and nobody ever had a problem.
> Unicode standards (and their respective UTF encodings) break this assumption
> that strings most of the time only contain ASCII and Latin1 integers for
> their codepoints. This is why you need the unicode module's conversion
> there.
> The same is then true of binary_to_list. The trouble there is that the
> binary representation (in bytes) of unicode strings doesn't match the list
> representation of unicode as accepted by ~ts printing and whatnot. The raw
> bytes representation isn't good enough, and that's what binary_to_list gives
> you. Again, the unicode module is clever enough to handle that.
> iolist_to_binary, list_to_binary and binary_to_list are fine when you know
> they're meant to be used for bytes, not arbitrary data.
> What's hurting Erlang more, I think, is the lack of Unicode algorithms as
> described by Michael Uvarov. If we have a unicode string, we currently can't
> get its length (in terms of graphemes, or 'characters to the human mind')

I don't understand the terminology here. What is "a unicode string?" 

In Erlang "string" means "list of integers"

So when you write "a unicode string" my brain interprets this as a "a
list of unicode codepoints"


I use string as a generic idea of a data type. Scheme has unicode strings, Python has them, JavaScript has them. Erlang doesn't properly have strings. It has lists of integers that are overloaded to be interpreted as strings. If your list of integers (or binary) doesn't respect the encoding Erlang tries to overload, then your string isn't properly unicode. 


So [49,48,8364] could be a unicode string (easy to figure out with that 8364 sticking out, which is a valid unicode codepoint, and an invalid Latin-1 character), but [49,48,226,130,172] contains no information regarding whether it should be latin-1 (10â ¬) or unicode (utf-8, 16 or 32?) (10€). In fact, it can have different unicode results depending on if you treat the list as bytes or as a string. If you treat it as a string, it'll see all integers as independent codepoints and won't be able to put characters together into graphemes.


The current way to do it is to use lists of codepoints for unicode strings of one kind, and to use sequences of bytes in binaries for unicode strings of another kind (with a more precise encoding, where you've got to pick between utf8, utf16 and utf32) as far as I understand things.


Therefore, I tend to treat 'valid unicode string' as something unambiguous: either a list of codepoints or a binary made out of bytes respecting a given encoding. If you have a list of bytes, then it's not a unicode string because it's very ambiguous and context-sensitive. 
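
The ambiguity can be seen with the unicode module. The following is only a sketch (the module name euro_demo and its function names are invented for this example); it uses the "10€" values from the text above and shows the same five integers giving different results depending on whether they are read as UTF-8 bytes, as Latin-1 bytes, or as codepoints:

```erlang
-module(euro_demo).
-export([encode/0, decode_utf8/0, decode_latin1/0, double_encode/0]).

%% A list of codepoints encoded to a UTF-8 binary: 3 codepoints, 5 bytes.
encode() ->
    unicode:characters_to_binary([49,48,8364]).

%% The same bytes decoded as UTF-8 give back the 3 codepoints.
decode_utf8() ->
    unicode:characters_to_list(<<49,48,226,130,172>>, utf8).

%% Decoded as Latin-1, every byte becomes its own character.
decode_latin1() ->
    unicode:characters_to_list(<<49,48,226,130,172>>, latin1).

%% If the list of bytes is mistaken for a list of codepoints, each
%% integer above 127 is encoded again and the result is 8 bytes long.
double_encode() ->
    unicode:characters_to_binary([49,48,226,130,172]).
```

Here encode() returns <<49,48,226,130,172>>, decode_utf8() returns [49,48,8364], decode_latin1() returns [49,48,226,130,172], and double_encode() returns <<49,48,195,162,194,130,194,172>>.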


So the "unicode string" for "10(euros)" is the erlang list [49,48,8364]
and the length of this list is three (ie the number of 'characters in my mind')

The problem is that if I just write X = [N1,N2,N3,....] where all the
N's are in 0..255
I cannot see at a glance if this is supposed to represent (say) an
ascii string or
a UTF-8 encoded string of unicode codepoints. There is no way of
knowing. So I must add some
wrapper, ie

     X1 = {ascii, "abc"} or

     X2 = {utf8encoded,unicode,[49,48,226,130,172]}

     X3 = {unicode, "10\x{20ac}"} = {unicode, [49,48,8364]}

So now I know that the string in X1 is an "ASCII string" (unencoded)
and the string in X2 is the utf8 encoding of the unicode codepoints
and the list in X3 is what I call a "unicode string"


Yes, that's a way to do it. You're using tagged tuples to contain type information on the data you carry around. This means, however, that you need to change the definitions of iolists, how the unicode module works, etc. to get something compatible everywhere. Then the question becomes why not support them out of the box?
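
To make the tagged-tuple convention concrete, here is a rough sketch of what a library function consuming such values might look like. The module tagged_string and the function to_utf8/1 are hypothetical and not part of OTP; only the three tags come from the message above. It assumes the payload of each tuple is a flat list of integers.

```erlang
-module(tagged_string).
-export([to_utf8/1]).

%% Hypothetical helper: turn any of the three tagged forms into a
%% UTF-8 binary, or report why the payload does not match its tag.

%% {ascii, Chars}: every element must be in 0..127.
to_utf8({ascii, Chars}) ->
    IsAscii = fun(C) -> is_integer(C) andalso C >= 0 andalso C =< 127 end,
    case lists:all(IsAscii, Chars) of
        true  -> {ok, list_to_binary(Chars)};
        false -> {error, not_ascii}
    end;
%% {utf8encoded, unicode, Bytes}: already encoded, so only validate
%% by attempting to decode. Assumes Bytes is a list of 0..255 integers.
to_utf8({utf8encoded, unicode, Bytes}) ->
    Bin = list_to_binary(Bytes),
    case unicode:characters_to_list(Bin, utf8) of
        L when is_list(L) -> {ok, Bin};
        _                 -> {error, invalid_utf8}
    end;
%% {unicode, Codepoints}: encode the codepoints as UTF-8.
to_utf8({unicode, Codepoints}) ->
    case unicode:characters_to_binary(Codepoints) of
        Bin when is_binary(Bin) -> {ok, Bin};
        _                       -> {error, invalid_codepoints}
    end.
```

Both tagged_string:to_utf8({utf8encoded,unicode,[49,48,226,130,172]}) and tagged_string:to_utf8({unicode,[49,48,8364]}) return {ok,<<49,48,226,130,172>>}, which is the point of carrying the tag: the caller no longer has to guess what the integers mean.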


Python's got things like raw strings (r"this is not escaped: \n"), normal strings ("this is a linebreak: \n") or unicode strings (u"hey there"). I do not like the way they did it because conversion is frankly terrible and the unicode string vs. its encoding are unclear, but at least you've got native support for the typing there.


It's a tough nut to crack. In one case, we keep going with ad-hoc systems, in the other, we must introduce new datatypes, change a lot of how the language works, possibly break compatibility.


If we keep going the way we are, we'll need some serious community effort to get rid of the 'strings are just lists of integers' mentality, because it's more complex than that, I think.


It's just a matter of defining a convention and sticking to it - the
alternative would be to make a
new string type (a list with some invisible bits) but this would be
rather complicated and
probably not worth the effort.


Yep. Sorry, I typed my response above before reading the end of your reply :) 


> in
> any reliable way. We also can't specify locales, can't do casing and
> whatnot, etc. Ideally those would be the next step for Erlang's unicode
> support, I think.
> On Thu, Oct 20, 2011 at 4:23 AM, Joe Armstrong <erlang@REDACTED> wrote:
>> Interesting comment: this is almost where I could write an article with
>> the
>> title "list_to_binary considered harmful" - I guess if Erlang is
>> serializing terms
>> to be stored on disk etc. term_to_binary and its inverse should be used.
>> list_to_binary seems to imply that you are going to send something to the
>> outside world - and then you should stop and think hard, this is
>> because there is
>> no universal agreement in the outside world as to what an integer is
>> (ie is it bounded or not)
>> fixing a notion of an integer to something in the range 0..255 allows
>> communication of
>> integers, but requires a framing protocol (ie UTF8, or ASN.1) that
>> tells how integers
>> are encoded - but this is out of band.
>> The problem is that I might write
>>    X1 = "10$"    (10 dollars) or
>>    X2 = "10\x{20ac}"  (10 euros)
>> Now list_to_binary(X1) will succeed but list_to_binary(X2) will fail
>> So maybe I should write
>>    X1 = {ansii, "10$"}
>>    X2 = {unicode,"10\x{20ac}"}
>> If the libraries were written this way then life might be easier
>> /Joe


erlang-questions mailing list