[erlang-questions] cookbook entry #1 - unicode/UTF-8 strings

Thu Oct 20 18:02:14 CEST 2011

Sorry, adding the ML back to the reply. Forgot about it :)

On Thu, Oct 20, 2011 at 11:06 AM, Fred Hebert <mononcqc@REDACTED> wrote:

>
>
> On Thu, Oct 20, 2011 at 9:26 AM, Joe Armstrong <erlang@REDACTED> wrote:
>
>> On Thu, Oct 20, 2011 at 2:54 PM, Fred Hebert <mononcqc@REDACTED> wrote:
>> > No, list_to_binary and iolist_to_binary are not considered harmful.
>>
>> But they are dangerous. list_to_binary can fail when its argument is
>> not a list of integers in 0..255.
>> binary_to_list always works. term_to_binary and its inverse
>> binary_to_term can never fail
>> (and I know about the danger of doing this with Pids and Refs and funs
>> ...)
>>
>
> They are somewhat dangerous. binary_to_list -> list_to_binary should never
> fail. The same way term_to_binary -> binary_to_term should never fail, but
> binary_to_term -> term_to_binary can fail if you give it wrong input.
>
> It works the same. The problem is likely more in the rather non-descriptive
> name of 'list_to_binary' compared to 'iolist_to_binary', where you do expect
> the input to adhere to the iolist data type.
>
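
To make the distinction above concrete, here is a minimal sketch of how these BIFs behave as far as I understand them. The module name bytes_demo is made up for illustration; every match in it is meant as an assertion, so demo/0 only returns ok if all of them hold.

```erlang
-module(bytes_demo).
-export([demo/0]).

demo() ->
    %% Any binary survives the trip through a list of bytes and back.
    Bin = <<49,48,226,130,172>>,
    Bin = list_to_binary(binary_to_list(Bin)),
    %% list_to_binary accepts nested lists of bytes and binaries.
    <<"abc">> = list_to_binary([$a, [<<"b">>], "c"]),
    %% iolist_to_binary additionally accepts a bare binary ...
    <<"abc">> = iolist_to_binary(<<"abc">>),
    %% ... which list_to_binary rejects, since a binary is not a list.
    {'EXIT', {badarg, _}} = (catch list_to_binary(<<"abc">>)),
    %% An integer above 255 is not a byte, so this fails as well.
    {'EXIT', {badarg, _}} = (catch list_to_binary([49,48,8364])),
    %% The unicode module encodes the same codepoints as UTF-8 instead.
    <<49,48,226,130,172>> = unicode:characters_to_binary([49,48,8364]),
    ok.
```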
>
>>
>> >
>> > The problem is that they both implicitly convert integers in the range
>> > 0..255 inclusive. This works based on the definition that iolists only
>> > contain bytes or binaries. list_to_binary will accept all iolists (deeply
>> > nested lists of bytes (0..255) and binaries) and iolist_to_binary will
>> > accept either iolists or flat binaries.
>> >
>> > They were apparently meant to convert between lists and binaries of bytes,
>> > not lists or binaries of arbitrarily large (or negative) numbers. Because we
>> > Erlang programmers have relied so heavily on the idea of lists as strings,
>> > using ASCII and most Latin1 sequences of bytes made for fine conversions to
>> > binaries and nobody ever had a problem.
>> >
>> > Unicode standards (and their respective UTF encodings) break the assumption
>> > that strings most of the time only contain ASCII and Latin1 integers for
>> > their codepoints. This is why you need the unicode module's conversion
>> > there.
>> >
>> > The same is then true of binary_to_list. The trouble there is that the
>> > binary representation (in bytes) of unicode strings doesn't match the list
>> > representation of unicode as accepted by ~ts printing and whatnot. The raw
>> > bytes representation isn't good enough, and that's what binary_to_list gives
>> > you. Again, the unicode module is clever enough to handle that.
>> >
>> > iolist_to_binary, list_to_binary and binary_to_list are fine when you know
>> > they're meant to be used for bytes, not arbitrary data.
>> >
>> > What's hurting Erlang more, I think, is the lack of Unicode algorithms as
>> > described by Michael Uvarov. If we have a unicode string, we currently can't
>> > get its length (in terms of graphemes, or 'characters to the human mind')
>>
>> I don't understand the terminology here. What is "a unicode string?"
>
>
>> In Erlang "string" means "list of integers"
>>
>> So when you write "a unicode string" my brain interprets this as a "a
>> list of unicode codepoints"
>>
>
> I use string as a generic idea of a data type. Scheme has unicode strings,
> Python has them, Javascript has them. Erlang doesn't properly have strings.
> It has lists of integers that are overloaded to be interpreted as a string. If
> your list of integers (or binary) doesn't respect the encoding Erlang tries
> to overload, then your string isn't properly unicode.
>
> So [49,48,8364] could be a unicode string (easy to figure out with that
> 8364 sticking out, which is a valid unicode codepoint, and an invalid
> Latin-1 character), but [49,48,226,130,172] contains no information
> regarding whether it should be latin-1 (10â ¬) or unicode (utf-8, 16 or 32?)
> (10€). In fact, it can have different unicode results depending on whether
> you treat the list as bytes or as a string. If you treat it as a string, it'll
> see all integers as independent codepoints and won't be able to put
> characters together into graphemes.
>
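
The ambiguity of that byte list can be shown with the unicode module. This is only a sketch under my reading of its documented behaviour; the module name ambiguity_demo is made up, and each match acts as an assertion.

```erlang
-module(ambiguity_demo).
-export([demo/0]).

demo() ->
    Bytes = [49,48,226,130,172],
    Bin = list_to_binary(Bytes),
    %% Reading the bytes as UTF-8 yields three codepoints: "10" and U+20AC.
    [49,48,8364] = unicode:characters_to_list(Bin, utf8),
    %% Reading the same bytes as Latin-1 yields five codepoints, one per byte.
    [49,48,226,130,172] = unicode:characters_to_list(Bin, latin1),
    %% Treating the list itself as codepoints and encoding it as UTF-8
    %% expands each value above 127 to two bytes, giving a different binary.
    <<49,48,195,162,194,130,194,172>> = unicode:characters_to_binary(Bytes),
    ok.
```

Nothing in the list itself says which of these readings was intended; the caller has to supply the encoding.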
> The current way to do it is to use lists of codepoints for unicode strings
> of one kind, and to use sequences of bytes in binaries for unicode strings
> of another kind (with a more precise encoding, where you've got to pick
> between utf8, utf16 and utf32) as far as I understand things.
>
> Therefore, I tend to treat 'valid unicode string' as something unambiguous.
> Either a list of codepoints or a binary made out of bytes respecting a given
> encoding. If you have a list of bytes, then it's not a unicode string because
> it's very ambiguous and context-sensitive.
>
>
>>
>> So the "unicode string" for "10(euros)" is the erlang list [49,48,8364]
>> and the length of this list is three (ie the number of 'characters in my
>> mind')
>>
>> The problem is that if I just write X = [N1,N2,N3,....] where all the
>> N's are in 0..255, I cannot see at a glance if this is supposed to
>> represent (say) an ascii string or a UTF-8 encoded string of unicode
>> codepoints. There is no way of knowing. So I must add some wrapper, ie
>>
>>      X1 = {ascii, "abc"} or
>>
>>      X2 = {utf8encoded,unicode,[49,48,226,130,172]}
>>
>>      X3 = {unicode, "10\x{20ac}"} = {unicode, [49,48,8364]}
>>
>>
>> So now I know that the string in X1 is an "ascii string" (unencoded)
>> and the string in X2 is the utf8 encoding of the unicode codepoints
>> [49,48,8364]
>> and the list in X3 is what I call a "unicode string"
>>
>
> Yes, that's a way to do it. You're using tagged tuples to contain type
> information on the data you carry around. This means, however, that you need
> to change the definitions of iolists, how the unicode module works, etc. to
> get something compatible everywhere. Then the question becomes why not
> support them out of the box?
>
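
For illustration only, a tagged-tuple convention like the one above could be consumed by a small helper along these lines. The module tagged_string and the function to_utf8/1 are hypothetical and not part of OTP; the tags are the ones from Joe's X1, X2 and X3.

```erlang
-module(tagged_string).
-export([to_utf8/1]).

%% Hypothetical helper: turn a tagged string into a UTF-8 binary.
to_utf8({ascii, Chars}) ->
    %% Crash with badmatch unless every element is a 7-bit integer.
    true = lists:all(fun(C) -> is_integer(C) andalso C >= 0 andalso C =< 127 end,
                     Chars),
    list_to_binary(Chars);
to_utf8({unicode, Codepoints}) ->
    %% characters_to_binary returns a tuple instead of a binary on bad input.
    case unicode:characters_to_binary(Codepoints, unicode, utf8) of
        Bin when is_binary(Bin) -> Bin;
        Error -> error({bad_unicode, Error})
    end;
to_utf8({utf8encoded, unicode, Bytes}) ->
    %% Already encoded: the bytes are taken as-is, without validation.
    list_to_binary(Bytes).
```

With this, {unicode, [49,48,8364]} and {utf8encoded, unicode, [49,48,226,130,172]} both come out as the same binary, which is the point of carrying the tag. It does not address the compatibility problem with iolists and the existing libraries.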
> Python's got things like raw strings (r"this is not escaped: \n"), normal
> strings ("this is a linebreak: \n") or unicode strings (u"hey there"). I do
> not like the way they did it because conversion is frankly terrible and the
> unicode string vs. its encoding are unclear, but at least you've got native
> support for the typing there.
>
> It's a tough nut to crack. In one case, we keep going with ad-hoc systems,
> in the other, we must introduce new datatypes, change a lot of how the
> language works, possibly break compatibility.
>
> If we keep going the way we are, we'll need some serious community effort
> to get rid of the 'strings are just lists of integers' mentality, because
> it's more complex than that, I think.
>
>
>>
>> It's just a matter of defining a convention and sticking to it. The
>> alternative would be to make a new string type (a list with some
>> invisible bits) but this would be rather complicated and probably not
>> worth the effort.
>>
>
> Yep. Sorry, I typed my response above before reading the end of your reply
> :)
>
>>
>> /Joe
>>
>>
>> > in any reliable way. We also can't specify locales, can't do casing and
>> > whatnot, etc. Ideally those would be the next step for Erlang's unicode
>> > support, I think.
>> > On Thu, Oct 20, 2011 at 4:23 AM, Joe Armstrong <erlang@REDACTED> wrote:
>> >>
>> >> Interesting comment: this is almost where I could write an article with
>> >> the title "list_to_binary considered harmful". I guess if Erlang is
>> >> serializing terms to be stored on disk etc., term_to_binary and its
>> >> inverse should be used. list_to_binary seems to imply that you are going
>> >> to send something to the outside world, and then you should stop and
>> >> think hard. This is because there is no universal agreement in the
>> >> outside world as to what an integer is (i.e. is it bounded or not).
>> >> Fixing a notion of an integer to something in the range 0..255 allows
>> >> communication of integers, but requires a framing protocol (e.g. UTF8 or
>> >> ASN.1) that tells how integers are encoded, but this is out of band.
>> >>
>> >> The problem is that I might write
>> >>
>> >>    X1 = "10$"    (10 dollars) or
>> >>    X2 = "10\x{20ac}"  (10 euros)
>> >>
>> >> Now list_to_binary(X1) will succeed but list_to_binary(X2) will fail
>> >>
>> >> So maybe I should write
>> >>
>> >>    X1 = {ansii, "10$"}
>> >>    X2 = {unicode,"10\x{20ac}"}
>> >>
>> >> If the libraries were written this way then life might be easier
>> >>
>> >> /Joe
>> >>
>> >
>> >
>>
>
>