[erlang-questions] Erlang string datatype [cookbook entry #1 - unicode/UTF-8 strings]

Michael Turner michael.eugene.turner@REDACTED
Sat Oct 22 13:38:10 CEST 2011


An intrinsic string type solves three problems:

 (1) stronger typing available where desired
 (2) better I/O formatting especially in cases where people want to go
beyond Latin-1
 (3) better performance (speed, space) by virtue of sequential
allocation of characters in memory

It seems to me that about 90% of what people want out of strings could
be done with a string record type containing one or more binaries.
That seems a good way forward for prototyping the behavior of the
desired intrinsic type.

On Sat, Oct 22, 2011 at 6:12 AM, Jon Watte <jwatte@REDACTED> wrote:
> My preferences:
> 1) Binaries are pretty good as they are. If nothing happens, and I have to
> live with binaries, I won't complain.

Depends on how important (1) and (2) are. For most people, maybe not very.

> However, binaries pretty much have to live with UTF-8 encoding to stay sane.

Yes, although a module that defined and manipulated a record-defined
string "type" might go a long way toward wrapping such insanity.

> 2) Strings probably should be of "code points" not "bytes," so they would
> use 2-byte or 4-byte characters. Note that Windows uses sizeof(wchar_t) ==
> 2, and Linux uses sizeof(wchar_t) == 4, so there's no unambiguously "right"
> choice.

Yes, the anachronism of talking about how many characters you need to
represent a single character still leaves me speechless.

> 3) Operations include (erlang-style) matching, (grep-style) finding,
> converting, splitting, joining and trimming.

Note that splitting is a tricky concept for languages like Japanese
that don't have wordspacing, and in which the analogous thing might be
splitting based on shifts between any of four distinct character sets
(plus punctuation) within the overall Japanese character set.
Othewise, +1.

> 4) Random access is important! As is slicing. string:substr(from,to) should
> be O(1).

Binaries get you most of this already, so it goes back to how
important (1) and (2) are to you. Erlang isn't used much for
string-intensive stuff in part because it's not really oriented that
way. But if it were more string-oriented, more people would use it for
that. I know I would. First-class representation of strings would help
a lot.

> 5) Going to-string-from-binary, and to-binary-from-string is important. You
> need to specify the encoding, too -- ascii/7-bit, iso-8859-1/8-bit, utf-8,
> utf-16, 16-bit, 32-bit, etc. Both ways.

You'd also want to be compatible with older Erlang code that does
strings with lists, so that's a whole added set of conversion
functions.

> 6) Ideally, reference count immutable string data so that substring
> extraction is cheap. This should probably be within the current process --
> there should not be overhead of locking external heaps to do most string
> operations (say, for strings <= 256 code points). (This is more
> implementation specifics, though)

Maybe binaries wrapped in a string record type could do this pretty
well already.

> 7) All APIs that take strings-as-lists should take strings-as-strings.
> Anything from file I/O to regex to json.

Nice idea in the long run, but it would be a lot of work. It couldn't
all be done at once. And it might never be finished before certain
APIs became obsolete anyway. The conversion functions in your (5)
could fill the gap in the meantime.

> 8) Easy conversion between binary, list (if needed at all), iolist, and
> string. Strings should be "native" in iolists.

I'm getting nervous about how big this job would be. ;-)

> 9) Do not intern strings, ever, for any reason.

By building up from binaries, which are reference counted, I think you
satisfy this one for free.

> 10) If "stringbuilder" style semantics are transparently used behind the
> scenes, that would be great. Specifically, adding a bunch of different
> sub-strings up could just aggregate into a list of pieces, until such time
> as an operation that needs "the whole string" is encountered, and at that
> point, the pieces are transparently combined.

A string record type in standard Erlang might be able to do this kind
of "lazy evaluation". Consolidation/compactification might be done
incrementally, in the interests of maintaining soft-real-time
performance.

> 11) strings are first-class -- you can guard on them, pattern match on them,
> etc. Ideally with binary-style or some kind of BIF syntax for primitive
> operations.

It would be interesting to see how much guard-syntax-compatible stuff
you could do with a standard-Erlang string record type. It might not
be very syntactically flavorful, but I think a lot could be done. If
you defined those guards in terms of macros, and equivalent macros
could be written if and when there's an intrinsic string type, you
wouldn't need to change much code.

> Sub-question: Can you find a nicer syntax than just "string" module functino
> calls?

"Functino"? You mean those functions they discovered recently that go
slightly faster than light?

Or am I thinking of something else?

> my_func(X) when string:startswith(X, "foobar") ->
>     do_a();

If you think of strings as a kind of structured hybrid of lists and
binaries, and delimit string literals in a way that reflects that dual
nature, like

   <[ ... ]>

then maybe the following?

  my_func(<[ "foobar" |_ ]> -> do_a();

Under the hood, Erlang might (for all we know) represent the matched
argument as a list of binaries (e.g., [<<"fo">>, <<"oba">>, <<"r">> |
X] ), and also take the opportunity of argument matching to combine
this list transparently into one binary (at least to make a
contiguously allocated "foobar") if there happens to be the right-size
chunk of space on the process heap at that point. Relatively light
refactorings of existing string-intensive Erlang code might then yield
significant time savings and dramatic space savings.

-michael turner


> On Thu, Oct 20, 2011 at 10:42 AM, Robert Virding
> <robert.virding@REDACTED> wrote:
>>
>> This discussion of Erlang strings/unicode/string type/... crops up
>> regularly. What I would like to discuss around this is: if we were to add a
>> string datatype to Erlang what properties should it have? If we skip
>> internal implementation details how should it behave? What type of
>> operations could do on them? And why? For example would I want to be able to
>> step over/create them in a similar manner as with lists? An when I access a
>> "character" (whatever that may be) what do I see?
>> Encoded/unencoded/codepoint or what?
>>
>> Once that has been agreed on, if we can, then it would be relatively easy
>> to implement a string data type. If we want to.
>>
>> Robert
>>
>> ________________________________
>>
>> Sorry, adding the ML back to the reply. Forgot about it :)
>>
>> On Thu, Oct 20, 2011 at 11:06 AM, Fred Hebert <mononcqc@REDACTED> wrote:
>>>
>>>
>>> On Thu, Oct 20, 2011 at 9:26 AM, Joe Armstrong <erlang@REDACTED> wrote:
>>>>
>>>> On Thu, Oct 20, 2011 at 2:54 PM, Fred Hebert <mononcqc@REDACTED> wrote:
>>>> > No, list_to_binary and iolist_to_binary are not considered harmful.
>>>>
>>>> But they are dangerous. list_to_binary can fail when its argument is
>>>> not list of 0.255 integers
>>>> binary_to_list always works. term_to_binary and its inverse
>>>> binary_to_term can never fail
>>>>  (and I know about the danger of does this with Pids and Refs and funs
>>>> ...)
>>>
>>> They are somewhat dangerous. binary_to_list -> list_to_binary should
>>> never fail. The same way term_to_binary -> binary_to_term should never fail,
>>> but binary_to_term -> term_to_binary can fail if you give it wrong input.
>>>
>>> It works the same. The problem is likely more in the rather
>>> non-descriptive name of 'list_to_binary' compared to 'iolist_to_binary',
>>> where you do expect the input to adhere to the iolist data type.
>>>
>>>>
>>>> >
>>>> > The problem is that they both implicitly convert integers on the range
>>>> > 0..255 inclusively. This works based on the definition that iolists
>>>> > only
>>>> > contain bytes or binaries. list_to_binary will accept all iolists
>>>> > (deeply
>>>> > nested lists of bytes (0..255) and binaries) and iolist_to_binary will
>>>> > accept both iolists or flat binaries.
>>>> > They were apparently meant to convert between lists and binaries of
>>>> > bytes,
>>>> > not lists or binaries of arbitrary large (or negative) numbers.
>>>> > Because we
>>>> > Erlang programmers have been so relying on the idea of lists as
>>>> > strings,
>>>> > using ASCII and most Latin1 sequences of bytes made for fine
>>>> > conversions to
>>>> > binaries and nobody ever had a problem.
>>>> > Unicode standards (and their respective UTF encodings) break this
>>>> > assumption
>>>> > that strings most of the time only contain ASCII and Latin1 integers
>>>> > for
>>>> > their codepoints. This is why you need the unicode module's conversion
>>>> > there.
>>>> > The same is then true of binary_to_list. The trouble there is that the
>>>> > binary representation (in bytes) of unicode strings doesn't match the
>>>> > list
>>>> > representation of unicode as accepted by ~ts printing and whatnot. The
>>>> > raw
>>>> > bytes representation isn't good enough, and that's what binary_to_list
>>>> > gives
>>>> > you. Again, the unicode module is clever enough to handle that.
>>>> > iolist_to_binary, list_to_binary and binary_to_list are fine when you
>>>> > know
>>>> > they're meant to be used for bytes, not arbitrary data.
>>>> > What's hurting Erlang more, I think, is the lack of Unicode algorithms
>>>> > as
>>>> > described by Michael Uvarov. If we have a unicode string, we currently
>>>> > can't
>>>> > get its length (in terms of graphemes, or 'characters to the human
>>>> > mind')
>>>>
>>>> I don't understand the terminology here. What is "a unicode string?"
>>>>
>>>> In Erlang "string" means "list of integers"
>>>>
>>>> So when you write "a unicode string" my brain interprets this as a "a
>>>> list of unicode codepoints"
>>>
>>> I use string as a generic idea of a data type. Scheme has unicode
>>> strings, Python has them, Javascript has them. Erlang doesn't properly have
>>> strings. It has lists of integers that overloaded to be interpreted as a
>>> string. If your list of integers (or binary) doesn't respect the encoding
>>> Erlang tries to overload, then your string isn't properly unicode.
>>> So [49,48,8364] could be a unicode string (easy to figure out with that
>>> 8364 sticking out, which is a valid unicode codepoint, and an invalid
>>> Latin-1 character), but [49,48,226,130,172] contains no information
>>> regarding whether it should be latin-1 (10â ¬) or unicode (utf-8, 16 or 32?)
>>> (10€). In fact, it can have different unicode results depending on if you
>>> treat the list as bytes or as a string. If you treat it as a string, it'll
>>> see all integers as independent codepoints and won't be able to put
>>> characters together into graphemes.
>>> The current way to do it is to use lists of codepoints for unicode
>>> strings of one kind, and to use sequences of bytes in binaries for unicode
>>> strings of another kind (with a more precise encoding, where you've got to
>>> pick between utf8, utf16 and utf32) as far as I understand things.
>>> Therefore, I tend to treat 'valid unicode string' as something
>>> unambiguous. Either a list of codepoints or a binary made out of bytes
>>> respecting a given encoding. If you have a list of bytes, then it's not a
>>> unicode string becaue it's very ambiguous and context-sensitive.
>>>
>>>>
>>>> So the "unicode string" for "10(euros)" is the erlang list [49,48,8364]
>>>> and the length of this list is three (ie the number of 'characters in my
>>>> mind')
>>>>
>>>> The problem is that if I just write X = [N1,N2,N3,....] where all the
>>>> N's are in 0..255
>>>> I cannot see at a glance if this is supposed to represent (say) an
>>>> ascii string or
>>>> a UTF-8 encoded string of unicode codepoints. There is no way of
>>>> knowning. So I must add some
>>>> wrapper, ie
>>>>
>>>>      X1 = {ascii, "abc"} or
>>>>
>>>>      X2 = {utf8encoded,unicode,[49,48,226,130,172]}
>>>>
>>>>      X3 = {unicode, "10\x{20ac}"} = {unicode, [49,48,8364]}
>>>>
>>>>
>>>> So now I know that the string in X1 is a "asci string" (unencoded)
>>>> and the string in X2 is the utf8 encoding of the unicode codepoints
>>>> [49,48,8364]
>>>> and the list in X3 is what I call a "unicode string"
>>>
>>> Yes, that's a way to do it. You're using tagged tuples to contain type
>>> information on the data you carry around. This means, however, that you need
>>> to change the definitions of iolists, how the unicode module works, etc. to
>>> get something compatible everywhere. Then the question becomes why not
>>> support them out of the box?
>>> Python's got things like raw strings (r"this is not escaped: \n"), normal
>>> strings ("this is a linebreak: \n") or unicode strings (u"hey there"). I do
>>> not like the way they did it because conversion is frankly terrible and the
>>> unicode string vs. its encoding are unclear, but at least you've got native
>>> support for the typing there.
>>> It's a tough nut to crack. In one case, we keep going with ad-hoc
>>> systems, in the other, we must introduce new datatypes, change a lot of how
>>> the language works, possibly break compatibility.
>>> If we keep going the way we are, we'll need some serious community effort
>>> to get rid of the 'strings are just lists of integers' mentality, because
>>> it's more complex than that, I think.
>>>
>>>>
>>>> Its just a matter of defining a convention and sticking to it - the
>>>> alternative would be to make a
>>>> new string type (a list with some invisible bits) but this would be
>>>> rather complicated and
>>>> probably not worth the effort.
>>>
>>> Yep. Sorry, I typed my response above before reading the end of your
>>> reply :)
>>>>
>>>> /Joe
>>>>
>>>>
>>>> > in
>>>> > any reliable way. We also can't specify locales, can't do casing and
>>>> > whatnot, etc. Ideally those would be the next step for Erlang's
>>>> > unicode
>>>> > support, I think.
>>>> > On Thu, Oct 20, 2011 at 4:23 AM, Joe Armstrong <erlang@REDACTED>
>>>> > wrote:
>>>> >>
>>>> >> Interesting comment: this is almost where I could write an article
>>>> >> with
>>>> >> the
>>>> >> title "list_to_binary considered harmful" - I guess if Erlang is
>>>> >> serializing terms
>>>> >> to be stored on disk etc. term_to_binary and its inverse should be
>>>> >> used.
>>>> >> list_to_binary seems to imply that you are going to send something to
>>>> >> the
>>>> >> outside world - and then you should stop and think hard, this is
>>>> >> because there is
>>>> >> no universal agreement in the outside world as to what an integer is
>>>> >> (ie is it bounded or not)
>>>> >> fixing a notion of an integer to something in the range 0..255 allows
>>>> >> communication of
>>>> >> integers, but requires a framing protocol (ie UTF8, or ASN.1) that
>>>> >> tells how integers
>>>> >> are encoded - but this is out of band.
>>>> >>
>>>> >> The problem is that I might write
>>>> >>
>>>> >>    X1 = "10$"    (10 dollars) or
>>>> >>    X2 = "10\x{20ac}"  (10 euros)
>>>> >>
>>>> >> Now list_to_binary(X1) will succeed but list_to_binary(X2) will fail
>>>> >>
>>>> >> So maybe I should write
>>>> >>
>>>> >>    X1 = {ansii, "10$"}
>>>> >>    X2 = {unicode,"10\x{20ac}"}
>>>> >>
>>>> >> If the libraries were written this way then life might be easier
>>>> >>
>>>> >> /Joe
>>>> >>
>>>> >
>>>> >
>>>
>>
>>
>> _______________________________________________
>> erlang-questions mailing list
>> erlang-questions@REDACTED
>> http://erlang.org/mailman/listinfo/erlang-questions
>>
>>
>> _______________________________________________
>> erlang-questions mailing list
>> erlang-questions@REDACTED
>> http://erlang.org/mailman/listinfo/erlang-questions
>>
>
>
> _______________________________________________
> erlang-questions mailing list
> erlang-questions@REDACTED
> http://erlang.org/mailman/listinfo/erlang-questions
>
>



More information about the erlang-questions mailing list