[erlang-questions] cookbook entry #1 - unicode/UTF-8 strings

Peer Stritzinger <>
Thu Oct 27 11:46:38 CEST 2011


> On Thu, Oct 20, 2011 at 11:06 AM, Fred Hebert <> wrote:
>>
>> On Thu, Oct 20, 2011 at 9:26 AM, Joe Armstrong <> wrote:
>>>
>>> On Thu, Oct 20, 2011 at 2:54 PM, Fred Hebert <> wrote:
>>> > No, list_to_binary and iolist_to_binary are not considered harmful.
>>>
>>> But they are dangerous. list_to_binary can fail when its argument is
>>> not a list of 0..255 integers

I wouldn't consider them dangerous.  When working with lists of *bytes*
and binaries, e.g. to handle low-level protocols, I want them to fail
if the 0..255 assumption does not hold.

I've always seen iolists as the way to get bytes into, out of, and
around the language (binaries and lists are not mentioned separately
because they are just specializations of iolists).

They are not even called "strings" --- except when used with the string module.
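To illustrate the "fail fast" behaviour I mean, here is a shell
transcript (the inputs are made up for illustration):

```erlang
%% list_to_binary/1 accepts deep lists of bytes (0..255) and binaries,
%% and raises badarg otherwise -- exactly what you want when handling
%% low-level protocols, where an out-of-range integer means a bug.
1> list_to_binary([72, 101, <<"llo">>]).
<<"Hello">>
2> list_to_binary([1000]).    % 1000 is not a byte
** exception error: bad argument
```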

>> Yes, that's a way to do it. You're using tagged tuples to contain type
>> information on the data you carry around. This means, however, that you need
>> to change the definitions of iolists, how the unicode module works, etc. to
>> get something compatible everywhere. Then the question becomes why not
>> support them out of the box?

I don't think the definition of iolists and all the code around them
should be changed.  They are for code that wants to handle bytes and
nothing else.  As long as we stay away from the notion that a list of
0..255 could also be seen as a UTF-8 encoded string, not much
ambiguity results:

If it's an iolist in the current definition, it can either be a bunch
of bytes, or you could also see it as a Latin-1 encoded string if you
want.  If there is an integer outside of 0..255, I would want it to
break, because at least in my code that would mean there is a bug.

If we add some form of tagged data type with the encoding included
somewhere (whether built in or just built out of what is already
there), there is not much collision with existing modules --- actually
it would only collide with the string module.

If we leave out the implementation detail and look at the semantics only:

Let's say we have agreed on a more or less opaque type for Unicode
text which includes its encoding.  We can't simply use the preexisting
io functions with this Unicode text data type.  To make it work really
smoothly, there has to be some notion of the encoding of the I/O
channel, so that the internal format can be converted to what is
expected in the outside world.

So, as a first step, we need some functions that convert to/from
iolists with a specified encoding.  If we take care that integers in
these lists always mean Latin-1, and convert everything UTF-* encoded
to binary parts, we get maximum compatibility with preexisting code.
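The existing unicode module already does this kind of conversion; a
to/from pair for the proposed text type could be a thin wrapper around
it.  A minimal sketch (the module and function names are hypothetical,
not existing APIs):

```erlang
-module(text_conv).  %% hypothetical module name
-export([to_bytes/2, from_bytes/2]).

%% Convert Unicode chardata (code-point lists mixed with UTF-8
%% binaries) into a plain binary in the target encoding,
%% e.g. utf8, utf16 or latin1 -- ready to feed to byte-oriented code.
to_bytes(Chardata, Encoding) ->
    unicode:characters_to_binary(Chardata, unicode, Encoding).

%% Interpret raw bytes in the given encoding as Unicode text
%% (a list of code points).
from_bytes(Bytes, Encoding) ->
    unicode:characters_to_list(Bytes, Encoding).
```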

On top of these functions one could always implement some intelligent
io-channel type with a notion of a *current* encoding of the channel,
with auto-conversion.  The encoding needs to be switchable to be
really useful (imagine writing a MIME-encoded mail, where you have to
switch from the 7-bit ASCII header to the encoding of whatever text
part you are writing, to raw bytes if you are attaching something
binary).

If you feed a standard iolist to this new io-channel type, it is
always interpreted as 0..255 Latin-1.  There could also be a new
iolist-like structure: deep lists of integers 0..255 and binaries
(both implicitly Latin-1) plus the new encoded text parts.
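Such a channel could be sketched as a small wrapper holding the
current encoding as state (everything here is hypothetical -- the
module name, the representation, and the idea of switching encodings
mid-stream are all assumptions, not existing APIs):

```erlang
-module(enc_chan).  %% hypothetical, not an existing module
-export([new/1, set_encoding/2, write/2]).

%% A channel is just a device plus its *current* encoding; plain
%% iolists written to it are taken as Latin-1, per the convention above.
new(Device) -> {Device, latin1}.

%% Switch the current encoding, e.g. from latin1 to utf8 when the
%% MIME part you are writing changes.
set_encoding({Device, _Old}, NewEnc) -> {Device, NewEnc}.

%% Interpret the iolist as Latin-1 and convert it to the channel's
%% current encoding before it reaches the device.
write({Device, Enc} = Chan, Iolist) ->
    ok = file:write(Device, unicode:characters_to_binary(Iolist, latin1, Enc)),
    Chan.
```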

The nice thing is that all this fancy automatic io-channel stuff can
be added as a second step, and only *if* the need arises.  I'm not
sure these fancy io-channels are needed, but some might like to have
them.

This leaves the *implementation* question of whether to add a built-in
new type or to build the new encoded text type out of tuples and
binaries.

Given the semantics above, I don't see much gained by implementing it
as a new internal type.  Binaries as storage have all the semantics
that are also used in most good string implementations.  And if
something is missing: extend the binary type with the needed feature,
which would probably help other uses as well.

To avoid mixing up "string" and the new thing, I'd suggest the name
"text" or "enc_text" for the new type.  Then there can be new modules
text and maybe also text_io handling these new things.  If it becomes
well established, the string module could even be deprecated in the
distant future.  Regarding how to store the encoding:

{enc_text, utf16, <<...>>}

but also:

{enc_text, bom_encoded, <<16#FE, 16#FF, ...>>}

forms should be supported.  Only having BOM encoding would defeat some
advantages of sub-binary referencing (a sub_text function could return
non-BOM-encoded subparts of the original binary, with explicit
encoding).
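For example, with an explicit encoding tag such a sub_text function
could return a sub-binary of the original without copying.  A minimal
sketch, assuming the {enc_text, Encoding, Binary} shape above (the
byte-offset arguments assume a fixed-width encoding like latin1; a
real implementation would index by character):

```erlang
%% Return a slice of the text as a reference into the original binary,
%% carrying the encoding explicitly instead of a BOM.
sub_text({enc_text, Enc, Bin}, Pos, Len) ->
    {enc_text, Enc, binary:part(Bin, Pos, Len)}.
```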

Of course, instead of the binary storage, deep lists of:

{enc_text, ...} tuples,

integers (now interpreted as code points), and

<<binaries>>, interpreted in the encoding of the enclosing {enc_text, ...},

should also be supported.  Conversion of normal iolists is then very easy:

iolist_to_text(Iolist) -> {enc_text, latin1, Iolist}.
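And to go the other way, a one-line sketch that flattens such a term
back to bytes in a target encoding (the function name is hypothetical,
and nested {enc_text, ...} parts are ignored here for brevity):

```erlang
%% Flatten an enc_text term to a binary in the requested encoding,
%% letting the unicode module do the actual transcoding.
text_to_binary({enc_text, Enc, Data}, OutEnc) ->
    unicode:characters_to_binary(Data, Enc, OutEnc).
```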

The rest of the implementation is left as an exercise for the reader ;-)

Cheers
-- Peer


