[erlang-questions] byte() vs. char() use in documentation
Fri May 6 10:47:48 CEST 2011
On Thu, May 05, 2011 at 09:12:26AM -0500, David Mercer wrote:
> In the past few days, various people wrote:
> > [Various stuff debating Unicode, characters, glyphs, codepoints, code
> > units, bits, bytes, strings, iolists, etc. etc.]
> I think most programmers are content treating each Unicode codepoint as a
> "character," regardless of whether that is strictly correct or not. It is
> the unit that is strung together to make strings. A list of Unicode
> codepoints seems the reasonable canonical way of representing strings.
> char() :: 0..16#10ffff
> string() :: [char()]
> The only issue is in the encoding of strings into binary formats, and this
> becomes relevant when you have iolists, which mix string pieces with encoded
So far all correct and very well.
But: iolists are only about bytes, not characters. They are for being
able to efficiently construct a byte sequence, e.g. to send to a driver.
> string pieces (binaries). If anyone ever wants to decode an iolist into a
> straight string (if this is ever done), it seems reasonable to let the user
> specify the encoding of the binaries within the iolist. On the other hand,
> if someone is converting an iolist into an encoded binary, we can either (1)
> assume the binaries in the iolist are already in the target encoding, (2)
> let the programmer specify the encoding of the source iolist binaries and
> have them automatically converted to the target encoding. The advantage of
> the former is that it allows iolists to represent more than just strings:
> they could represent strings with embedded binary components (where
> character encoding is irrelevant). There's nothing that says every binary
> in an iolist has to decode into a string. I'm going to put that in all-caps
> for emphasis: THERE'S NOTHING THAT SAYS EVERY BINARY IN AN IOLIST HAS TO
> DECODE INTO A STRING. Thus, I would assume iolists are defined as:
All that is also very well. There's nothing that says every binary in an iolist
has to decode into a string.
But: there is just raw binary data in an iolist.
> iolist() :: [char() | binary() | iolist()]
Not char(); byte()!
iolist() :: [byte() | binary() | iolist()]
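To illustrate the byte() point, here is a minimal sketch. The module name iolist_bytes and the function demo/0 are made up for this example; only erlang:iolist_to_binary/1 is an existing function. It shows that bytes, binaries and nested lists flatten fine, while an integer above 255 makes the term something other than an iolist:

```erlang
-module(iolist_bytes).
-export([demo/0]).

%% Hypothetical demo function, not part of OTP.
demo() ->
    %% Bytes (0..255), binaries and nested lists are all valid iolist parts.
    IoList = [$a, <<"bc">>, [16#E5, <<1,2,3>>]],
    Bin = erlang:iolist_to_binary(IoList),
    %% 16#20AC (the euro sign code point) is a char() but not a byte(),
    %% so this list is not an iolist and the call fails with badarg.
    NotIoList = [$a, 16#20AC],
    Result = try erlang:iolist_to_binary(NotIoList)
             catch error:badarg -> badarg
             end,
    {Bin, Result}.
```

Calling iolist_bytes:demo() should return the flattened binary together with the atom badarg for the second list.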
> and we have functions available in some library somewhere:
> iolist_to_string(iolist(), SourceEncoding) -> string()
> iolist_to_binary(iolist(), TargetEncoding) -> binary().
These exist (simplified):
unicode:characters_to_list(characters(), SourceEncoding) -> string()
unicode:characters_to_binary(characters(), SourceEncoding, DestEncoding) -> binary()
characters() :: [char() | binary()]
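A small sketch of how these two calls behave on mixed data. The module name unicode_demo and demo/0 are invented for the example; the sample text ("å" and the euro sign) is just an assumption chosen to have code points both below and above 255:

```erlang
-module(unicode_demo).
-export([demo/0]).

%% Hypothetical demo function, not part of OTP.
demo() ->
    %% "å" (U+00E5) and the euro sign (U+20AC) as UTF-8 bytes.
    Utf8 = <<16#C3,16#A5, 16#E2,16#82,16#AC>>,
    %% Mixed data: one code point followed by a UTF-8 encoded binary.
    Chars = [$x, Utf8],
    %% The encoding argument describes the binaries; integers in the
    %% list are taken as code points.
    List = unicode:characters_to_list(Chars, utf8),
    %% The other direction: a list of code points encoded as UTF-8.
    Bin = unicode:characters_to_binary([16#E5, 16#20AC], unicode, utf8),
    {List, Bin}.
```

Here List should come back as plain code points and Bin as the same five UTF-8 bytes that were used as input above.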
> (You'd probably also have iolist_to_string/1 and iolist_to_binary/1, which
> assume UTF-8.) We *might* also want to define an iolist_to_binary/3, which
unicode:characters_to_list/1 and unicode:characters_to_binary/1 actually
do that, i.e. assume utf8 for embedded binaries.
erlang:iolist_to_binary/1 assumes latin1, i.e. translates nothing.
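The difference between the two one-argument calls can be seen on the same input. This is only a sketch; the module name default_encodings and demo/0 are made up, and the input is an assumed example with one integer above 127:

```erlang
-module(default_encodings).
-export([demo/0]).

%% Hypothetical demo function, not part of OTP.
demo() ->
    %% 16#E5 is both a valid byte and the code point of "å".
    Data = [16#E5, <<"bc">>],
    %% iolist_to_binary/1 copies the integer through as a single byte.
    Raw = erlang:iolist_to_binary(Data),
    %% characters_to_binary/1 treats the integer as a code point and
    %% encodes it as UTF-8 (two bytes here).
    Utf8 = unicode:characters_to_binary(Data),
    {Raw, Utf8}.
```

So the same term gives a three byte binary from the first call and a four byte binary from the second.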
> does convert binaries in the iolist from one encoding to another, but I'd
> consider that optional.
unicode:characters_to_binary/3 does exactly that.
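For example, a sketch of converting latin1 data to utf8 (the module name latin1_to_utf8 and the function convert/0 are invented for illustration, and the sample word is an assumption):

```erlang
-module(latin1_to_utf8).
-export([convert/0]).

%% Hypothetical demo function, not part of OTP.
convert() ->
    %% "smörgås" as latin1: 16#F6 is "ö" and 16#E5 is "å".
    Latin1 = [<<"sm", 16#F6, "rg">>, 16#E5, <<"s">>],
    %% Embedded binaries are read as latin1, the result is UTF-8 encoded.
    unicode:characters_to_binary(Latin1, latin1, utf8).
```

Each of the two non-ASCII characters becomes a two byte UTF-8 sequence in the result, while the ASCII bytes pass through unchanged.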
> Anyone disagree?
Just that iolists should (and do) contain only bytes and binaries.
That your suggested functions already exist in the module unicode,
and that erlang:iolist_to_binary does (and can not be changed)
/ Raimo Niskanen, Erlang/OTP, Ericsson AB