[erlang-questions] byte() vs. char() use in documentation
Raimo Niskanen
raimo+erlang-questions@REDACTED
Tue May 3 11:45:49 CEST 2011
On Mon, May 02, 2011 at 08:43:33PM +0100, James Churchman wrote:
> So just for my own understanding, and as it seems extremely important (
> strings are quite important these days!), as it stands now:
>
> iolists cant can only ( officially?) contain utf8? ( as no utf8 code point
> will exceed 255, like latin1 / asci, and are therefor are all byte() )
Richard O'Keefe explained this nicely, I'll just elaborate.
iolists were, when introduced, meant for handling byte sequences:
building them in nested form, either from individual bytes or from
binaries, without having to copy.
Back then characters were only latin-1, so they matched bytes nicely
and therefore fit in iolists. This is no longer true. Now you have
to translate a sequence of characters into the corresponding byte
sequence before putting it in an iolist. The preferred encoding in
Erlang is UTF-8, since it is the default for e.g. the unicode module
and for the ~t modifier in the io module when printing strings.
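A minimal sketch of that translation step, assuming the default UTF-8
output of unicode:characters_to_binary/1. The module and function names
(utf8_sketch, to_utf8/1, line/1) are made up for illustration only:

```erlang
-module(utf8_sketch).
-export([to_utf8/1, line/1]).

%% Translate a string (a list of Unicode code points) into UTF-8 bytes.
%% unicode:characters_to_binary/1 returns a binary on success and an
%% {error, ...} or {incomplete, ...} tuple otherwise.
to_utf8(Chars) ->
    case unicode:characters_to_binary(Chars) of
        Bin when is_binary(Bin) -> Bin;
        Other -> erlang:error({encoding_failed, Other})
    end.

%% Nest the encoded bytes in an iolist together with a newline byte.
%% The binary is referenced from the list, not copied into it.
line(Chars) ->
    [to_utf8(Chars), $\n].
```

With this, utf8_sketch:to_utf8([16#212B]) gives <<226,132,171>>, and
io:format("~ts~n", [[16#212B]]) is the way to print the string itself.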
>
> strings can be of utf8 utf16 or utf32, but only the utf8 version is allowed
The programmer should regard strings as a sequence of unicode code points.
As such they are just that and there is no encoding to bother about.
The code point number uniquely defines which unicode character it is.
> in an iolist? ( and therefore if you wanted an "iolist" ( eg a non flat list
UTF-8 is not the only encoding allowed in an iolist. You can use any
encoding you like. If you use the unicode module, the default format for
encoding and decoding binaries is utf8, but utf16 or utf32, big or little
endian, is just as easy to do. An iolist is just a sequence of bytes.
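As a sketch of how the other encodings can be selected, assuming
unicode:characters_to_binary/3 with a list of code points as input
(encoding_sketch and encode/2 are illustrative names, not existing API):

```erlang
-module(encoding_sketch).
-export([encode/2]).

%% Encode a list of Unicode code points into the requested byte encoding.
%% OutEncoding is e.g. utf8, {utf16, big}, {utf16, little}, {utf32, big}.
encode(Chars, OutEncoding) ->
    unicode:characters_to_binary(Chars, unicode, OutEncoding).
```

For the single code point 16#212B this gives <<16#21,16#2B>> for
{utf16, big}, <<16#2B,16#21>> for {utf16, little} and
<<0,0,16#21,16#2B>> for {utf32, big}. All of them are valid iolist
contents, since they are just bytes.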
> of chars) that contained utf 16 or 32 code points you would have to stick
An iolist is a non-flat list of bytes. Do not mix up bytes with characters.
> exclusively to lists ( strings) and not binaries and use lists:flatten
You cannot just use lists:flatten on a Unicode character string to get
an iolist. The Unicode code points > 255 are still there. You have
to encode the Unicode characters into a suitable byte representation,
e.g. using the unicode module.
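A small sketch of the difference, assuming iolist_size/1 is used as the
validity check (it raises badarg for anything that is not bytes and
binaries). The names flatten_sketch, is_iodata/1 and encode/1 are
hypothetical:

```erlang
-module(flatten_sketch).
-export([is_iodata/1, encode/1]).

%% True if Term consists only of bytes and binaries, possibly nested.
is_iodata(Term) ->
    try iolist_size(Term) of
        _Size -> true
    catch
        error:badarg -> false
    end.

%% Encode a possibly nested character list into UTF-8 bytes.
encode(DeepChars) ->
    unicode:characters_to_binary(DeepChars).
```

Here flatten_sketch:is_iodata(lists:flatten(["a", [16#212B]])) is false,
because flattening only removes the nesting and 16#212B is still not a
byte, while flatten_sketch:is_iodata(flatten_sketch:encode(["a", [16#212B]]))
is true.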
> before you finished with it, to remove all the nested lists )
>
> binaries can be of any unicode type..
Binaries are sequences of bytes. Period. You decide what they mean.
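To illustrate, a sketch that reads the same two bytes under two
different assumptions, using unicode:characters_to_list/2 (bytes_sketch
and its functions are illustrative names):

```erlang
-module(bytes_sketch).
-export([as_latin1/1, as_utf8/1]).

%% Interpret every byte as one latin-1 character.
as_latin1(Bin) when is_binary(Bin) ->
    unicode:characters_to_list(Bin, latin1).

%% Interpret the bytes as UTF-8 encoded characters.
as_utf8(Bin) when is_binary(Bin) ->
    unicode:characters_to_list(Bin, utf8).
```

The binary <<195,165>> is the two characters [195,165] when read as
latin-1, but the single character [229] when read as UTF-8. Nothing in
the binary itself says which reading is intended.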
>
> also there does seem to be a needed distinction between char() and byte() as
> they are not the same at all, but the documentation is wrong as at the
> moment iolists can infact only contain byte() not char()
Yes.
>
> the suggested direction is to repair the docs so that they specify only
> allowing 0~255 ints( byte() ) in iolists rather than allowing io-lists to
> contain any string as they did before the introduction of unicode / in the
> days of latin1 etc.. ?
Yes. That iolists could contain any string was by accident, since there
were no characters > 255 in the days of latin1. Since iolists are about
sequences of bytes, they cannot be fixed into being allowed to contain
any character. For that to be possible you would have to define the
byte encoding for iolists, or store the byte encoding with a particular
iolist. Since there are so many byte encodings in use, it is
better to make this visible to the programmer so he/she is forced to
understand the byte encoding problem and to handle it explicitly.
Therefore iolists are now, as they always were, sequences of bytes (8-bit).
And that is all.
>
>
> i think that that goes agents most ( even erlang implementers :-) ) opinion
> of what an iolist is ( that being a list of any valid string or binary) but
I think not.
An iolist is any valid byte or binary sequence. Binaries are sequences
of bytes. They are all about bytes.
Characters and strings are today vastly more complex beasts than they
were when US-ASCII and later ISO-LATIN-1 were the norm. This must
be visible to the programmer.
> maybe ( to raise a totally different problem) would prevent the possibility
> of an iolist having a mixed unicode type and still begin "valid" ( even tho
> i guess this is still possible as binaries can in fact be other utf
> representations)
I repeat: the programmer decides what the bytes mean. The list
[0,0,16#21,16#2b] would, e.g., mean "angstrom sign" if the encoding is
UTF-32 big endian. And that is a valid iolist.
But [16#212b] is not.
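This can be sketched as follows, assuming the bytes are known to be
UTF-32 big endian (angstrom_sketch and decode_utf32be/1 are made-up
names for the example):

```erlang
-module(angstrom_sketch).
-export([decode_utf32be/1]).

%% Flatten the iolist into a binary, then decode the bytes as
%% UTF-32 big endian into a list of Unicode code points.
decode_utf32be(IoList) ->
    unicode:characters_to_list(iolist_to_binary(IoList), {utf32, big}).
```

angstrom_sketch:decode_utf32be([0,0,16#21,16#2b]) gives [16#212B],
while iolist_to_binary([16#212b]) fails with badarg because 16#212b is
not a byte.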
>
>
>
> On 2 May 2011 11:13, Kostis Sagonas <kostis@REDACTED> wrote:
>
> > Raimo Niskanen wrote:
> >
> >>
> >> This became messy when char() was re-defined from latin-1 character
> >> to unicode character. That affected string() that affected iolist()
> >> and the latter was incorrect.
> >>
> >> We must clean up the mess.
> >>
> >
> > Right. The sooner it happens the better it is.
> >
> > ... Either by completing the notion of char()
> >>
> >> being unicode and hence rewriting iolist() to contain byte() and binary(),
> >> or by reverting to char() being latin-1 char and using unicode:char()
> >> and unicode:string() where that is correct...
> >>
> >
> > Please, by all means do the former. The latter will only cause havoc
> > everywhere. For starters, I do not see any need in having two different
> > basic types (byte() and char()) denoting (pretty much) the same thing. The
> > only thing this does is cause unnecessary confusion to newcomers (and
> > apparently to some old-timers too). Second, if you choose the latter you
> > will eventually have to change lots of type inference code, because I
> > promise you I will not do this, and believe me you don't want to go there...
> > (The Vietnam jungle is probably a friendlier place ;) )
> >
> > Cheers,
> > Kostis
> >
> > _______________________________________________
> > erlang-questions mailing list
> > erlang-questions@REDACTED
> > http://erlang.org/mailman/listinfo/erlang-questions
> >
--
/ Raimo Niskanen, Erlang/OTP, Ericsson AB