[erlang-questions] byte() vs. char() use in documentation

David Mercer dmercer@REDACTED
Thu May 5 16:12:26 CEST 2011


In the past few days, various people wrote:

> [Various stuff debating Unicode, characters, glyphs, codepoints, code
> units, bits, bytes, strings, iolists, etc. etc.]

I think most programmers are content treating each Unicode codepoint as a
"character," regardless of whether that is strictly correct or not.  It is
the unit that is strung together to make strings.  A list of Unicode
codepoints seems the reasonable canonical way of representing strings.
Thus:

	char() :: 0..16#10ffff
	string() :: [char()]
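
As a small illustration of the definitions above, here is a sketch of a
string held as a list of codepoints and round-tripped through UTF-8 with
the stdlib unicode module.  The module name cp_demo and the sample
string are made up for this example.

```erlang
-module(cp_demo).
-export([demo/0]).

%% A string as a plain list of Unicode codepoints (hypothetical sample).
demo() ->
    S = [16#48, 16#E9, 16#20AC],             % "H", e-acute, euro sign
    3 = length(S),                           % three codepoints
    Utf8 = unicode:characters_to_binary(S),  % encode as UTF-8
    6 = byte_size(Utf8),                     % 1 + 2 + 3 bytes
    S = unicode:characters_to_list(Utf8),    % decode back to codepoints
    ok.
```

The length of the list counts codepoints, while the size of the binary
depends on the encoding, which is why the two are kept apart here.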

The only issue is in the encoding of strings into binary formats, and this
becomes relevant when you have iolists, which mix string pieces with encoded
string pieces (binaries).  If anyone ever wants to decode an iolist into a
plain string, it seems reasonable to let the user
specify the encoding of the binaries within the iolist.  On the other hand,
if someone is converting an iolist into an encoded binary, we can either (1)
assume the binaries in the iolist are already in the target encoding, or (2)
let the programmer specify the encoding of the source iolist binaries and
have them automatically converted to the target encoding.  The advantage of
the former is that it allows iolists to represent more than just strings:
they could represent strings with embedded binary components (where
character encoding is irrelevant).  There's nothing that says every binary
in an iolist has to decode into a string.  I'm going to put that in all-caps
for emphasis: THERE'S NOTHING THAT SAYS EVERY BINARY IN AN IOLIST HAS TO
DECODE INTO A STRING.  Thus, I would assume iolists are defined as:

	iolist() :: [char() | binary() | iolist()]

and we have functions available in some library somewhere:

	iolist_to_string(iolist(), SourceEncoding) -> string()
	iolist_to_binary(iolist(), TargetEncoding) -> binary().

(You'd probably also have iolist_to_string/1 and iolist_to_binary/1, which
assume UTF-8.)  We *might* also want to define an iolist_to_binary/3, which
does convert binaries in the iolist from one encoding to another, but I'd
consider that optional.
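
A minimal sketch of how the two functions proposed above might look,
assuming option (1): binaries are copied through untouched when building
a binary.  The module name iolist_enc and the function names to_string/2
and to_binary/2 are hypothetical, as is the choice to crash on bad
input; only the unicode module calls are existing stdlib functions.

```erlang
-module(iolist_enc).
-export([to_string/2, to_binary/2]).

%% Decode to a list of codepoints.  Integers are taken to be codepoints
%% already; binaries are decoded from SourceEncoding (utf8, latin1, ...).
to_string(Bin, Enc) when is_binary(Bin) ->
    case unicode:characters_to_list(Bin, Enc) of
        Chars when is_list(Chars) -> Chars;
        _ErrorOrIncomplete -> erlang:error(badarg, [Bin, Enc])
    end;
to_string(Char, _Enc) when is_integer(Char) ->
    [Char];
to_string(List, Enc) when is_list(List) ->
    lists:flatmap(fun(E) -> to_string(E, Enc) end, List).

%% Encode to a binary.  Integers are encoded into TargetEncoding;
%% binaries are assumed to be in the target encoding already, or to be
%% raw non-text data, and are passed through unchanged (option 1).
to_binary(IoList, Enc) ->
    erlang:iolist_to_binary(encode(IoList, Enc)).

encode(Bin, _Enc) when is_binary(Bin) ->
    Bin;
encode(Char, Enc) when is_integer(Char) ->
    %% Returns an error tuple if Char cannot be represented in Enc,
    %% which makes iolist_to_binary/1 above fail with badarg.
    unicode:characters_to_binary([Char], unicode, Enc);
encode(List, Enc) when is_list(List) ->
    [encode(E, Enc) || E <- List].
```

For example, iolist_enc:to_binary([16#E9, <<0,255>>], utf8) encodes the
codepoint as two bytes and leaves the embedded binary alone, even though
<<0,255>> is not valid UTF-8.  For the optional three-argument variant,
the existing unicode:characters_to_binary/3 is close, though it treats
every binary in its input as text.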

Anyone disagree?

Cheers,

DBM



