[erlang-questions] byte() vs. char() use in documentation

Raimo Niskanen raimo+erlang-questions@REDACTED
Fri May 6 10:47:48 CEST 2011


On Thu, May 05, 2011 at 09:12:26AM -0500, David Mercer wrote:
> In the past few days, various people wrote:
> 
> > [Various stuff debating Unicode, characters, glyphs, codepoints, code
> > units, bits, bytes, strings, iolists, etc. etc.]
> 
> I think most programmers are content treating each Unicode codepoint as a
> "character," regardless of whether that is strictly correct or not.  It is
> the unit that is strung together to make strings.  A list of Unicode
> codepoints seems the reasonable canonical way of representing strings.
> Thus:
> 
> 	char() :: 0..16#10ffff
> 	string() :: [char()]
> 
> The only issue is in the encoding of strings into binary formats, and this
> becomes relevant when you have iolists, which mix string pieces with encoded

So far all correct and very well.

But: iolists are only about bytes, not characters. They exist so that a
byte sequence can be constructed efficiently, e.g. to send to a driver.

> string pieces (binaries).  If anyone ever wants to decode an iolist into a
> straight string (if this is ever done), it seems reasonable to let the user
> specify the encoding of the binaries within the iolist.  On the other hand,
> if someone is converting an iolist into an encoded binary, we can either (1)
> assume the binaries in the iolist are already in the target encoding, (2)
> let the programmer specify the encoding of the source iolist binaries and
> have them automatically converted to the target encoding.  The advantage of
> the former is that it allows iolists to represent more than just strings:
> they could represent strings with embedded binary components (where
> character encoding is irrelevant).  There's nothing that says every binary
> in an iolist has to decode into a string.  I'm going to put that in all-caps
> for emphasis: THERE'S NOTHING THAT SAYS EVERY BINARY IN AN IOLIST HAS TO
> DECODE INTO A STRING.  Thus, I would assume iolists are defined as:

All that is also very well. There's nothing that says every binary in an
iolist has to decode into a string.

But: there is just raw binary data in an iolist.

> 
> 	iolist() :: [char() | binary() | iolist()]

Not char(); byte()!
	iolist() :: [byte() | binary() | iolist()]
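A small sketch of what that definition means in practice, written as
Erlang shell expressions. The values are chosen only for illustration;
each match succeeds if the statement in the comment holds.

```erlang
%% Bytes, binaries and nested lists are flattened into one binary.
<<"hello world">> = iolist_to_binary([104, <<"ello">>, [" ", "world"]]).

%% The binaries are raw data; nothing requires them to be text.
<<255,254,0>> = iolist_to_binary([<<255,254>>, 0]).

%% An integer above 255 is not a byte, so this is not an iolist.
{'EXIT', {badarg, _}} = (catch iolist_to_binary([16#20AC])).
```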

> 
> and we have functions available in some library somewhere:
> 
> 	iolist_to_string(iolist(), SourceEncoding) -> string()
> 	iolist_to_binary(iolist(), TargetEncoding) -> binary().

These exist (simplified):

	unicode:characters_to_list(characters(), SourceEncoding) -> string()
	unicode:characters_to_binary(
		characters(), SourceEncoding, DestEncoding) -> binary()

	characters() :: [char() | binary()]
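For illustration, a few shell expressions using these functions. The
character U+00E5 is used as sample data because it is <<229>> in latin1
and <<195,165>> in utf8; the inputs are otherwise arbitrary.

```erlang
%% Embedded binaries are decoded using the given source encoding;
%% integers in the list are taken as code points.
[229, 8364] = unicode:characters_to_list([<<195,165>>, 16#20AC], utf8).
[229]       = unicode:characters_to_list(<<229>>, latin1).

%% Converting binaries from one encoding to another.
<<195,165>> = unicode:characters_to_binary(<<229>>, latin1, utf8).
<<229>>     = unicode:characters_to_binary(<<195,165>>, utf8, latin1).
```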

> 
> (You'd probably also have iolist_to_string/1 and iolist_to_binary/1, which
> assume UTF-8.)  We *might* also want to define an iolist_to_binary/3, which

unicode:characters_to_list/1 and unicode:characters_to_binary/1 actually
do that, i.e. they assume utf8 for embedded binaries.

erlang:iolist_to_binary/1 assumes latin1, i.e. it translates nothing.
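The difference shows up as soon as a value above 127 is involved. A
sketch in the shell, again with U+00E5 as an arbitrary sample:

```erlang
%% unicode:characters_to_binary/1 treats the integer as a code point
%% and encodes it as utf8.
<<195,165>> = unicode:characters_to_binary([16#E5]).

%% erlang:iolist_to_binary/1 treats the same integer as a plain byte.
<<229>> = iolist_to_binary([16#E5]).

%% Likewise for embedded binaries: decoded as utf8 in one case,
%% passed through byte for byte in the other.
[229]     = unicode:characters_to_list(<<195,165>>).
[195,165] = binary_to_list(iolist_to_binary(<<195,165>>)).
```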

> does convert binaries in the iolist from one encoding to another, but I'd
> consider that optional.

unicode:characters_to_binary/3 does exactly that.

> 
> Anyone disagree?

Just that iolists should (and do) contain only bytes and binaries,
that your suggested functions already exist in the module unicode,
and that erlang:iolist_to_binary does assume latin1 (and that cannot
be changed).

/ Raimo



> 
> Cheers,
> 
> DBM
> 

-- 

/ Raimo Niskanen, Erlang/OTP, Ericsson AB


