[erlang-questions] strings vs binaries

Richard A. O'Keefe ok@REDACTED
Wed Aug 19 05:30:32 CEST 2015


Lists are sequences of whatever you want,
so you can represent a Unicode text as a list
of Unicode/ISO10646 code-points with no trouble.
This means that string = list of integer is quite
a convenient way to process Unicode.
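
A minimal sketch of what that looks like in practice. The module
name, function name, and sample text are made up for illustration;
only the standard list functions are real.

```erlang
-module(codepoint_list_demo).
-export([demo/0]).

%% Hypothetical demo module; the names are illustrative only.
demo() ->
    %% "naive" with a diaeresis, written as code points;
    %% 16#EF is U+00EF (LATIN SMALL LETTER I WITH DIAERESIS).
    Text = [$n, $a, 16#EF, $v, $e],
    %% One list element per code point, so length/1 counts code points.
    5 = length(Text),
    %% Ordinary list functions work unchanged.
    [$e, $v, 16#EF, $a, $n] = lists:reverse(Text),
    %% Select the code points outside ASCII with a comprehension.
    [16#EF] = [C || C <- Text, C > 127],
    ok.
```

Each match acts as an assertion, so demo() returns ok only if
every step behaves as the comments say.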

Binaries are a compact way to store Unicode
sequences encoded in UTF-8 (or other explicitly
specified encodings, thanks to the unicode module).
UTF-8 and UTF-16 are not noticeably convenient
representations for processing.
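
As a rough illustration, here is the same five-code-point text
stored as a binary. The module and function names are invented
for the example; the unicode module calls are the documented ones.

```erlang
-module(binary_text_demo).
-export([demo/0]).

%% Hypothetical demo module; the names are illustrative only.
demo() ->
    Text = [$n, $a, 16#EF, $v, $e],            % 5 code points
    %% characters_to_binary/1 encodes as UTF-8 by default.
    Bin = unicode:characters_to_binary(Text),
    %% U+00EF takes two bytes (C3 AF) in UTF-8, so 6 bytes in all.
    6 = byte_size(Bin),
    <<"na", 16#C3, 16#AF, "ve">> = Bin,
    %% Decoding gives back the original list of code points.
    Text = unicode:characters_to_list(Bin),
    %% Other encodings have to be asked for explicitly.
    Utf16 = unicode:characters_to_binary(Text, unicode, utf16),
    10 = byte_size(Utf16),                     % 5 code units of 2 bytes
    ok.
```

The byte count and the code-point count differ as soon as the
text leaves ASCII, which is one reason the encoded form is
awkward to process directly.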

In fact, strictly speaking, a Unicode text is
a sequence of [Base_Character|Floating_Diacriticals]
sequences (or possibly even something more complicated,
which I shall spare you), so for moving through a
Unicode text one "logical" character at a time, a
list of lists of code points may be even more
convenient.
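
One way to get such a list of lists, sketched under the
assumption that string:to_graphemes/1 is available (it arrived
in OTP 20, so it is newer than this message). The module name
and the clusters/1 helper are hypothetical.

```erlang
-module(grapheme_demo).
-export([clusters/1, demo/0]).

%% Hypothetical helper: split a text into a list of lists of code
%% points, one inner list per "logical" character.
%% string:to_graphemes/1 returns a lone code point as a bare
%% integer and a multi-code-point cluster as a list, so wrap the
%% integers to get a uniform shape.
clusters(Text) ->
    [case G of
         L when is_list(L) -> L;
         C when is_integer(C) -> [C]
     end || G <- string:to_graphemes(Text)].

demo() ->
    %% "e" followed by U+0301 COMBINING ACUTE ACCENT, then "a".
    Text = [$e, 16#301, $a],
    3 = length(Text),                   % three code points
    [[$e, 16#301], [$a]] = clusters(Text),  % two logical characters
    ok.
```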

The one thing you must NOT do in ANY programming
language is to believe for one instant that there
is ANY text representation that should be used
always and everywhere, still less that something
*called* "string" is the right tool for the job
(or that it isn't).

Erlang, for example, has quite a few functions that
support "iolists" or, these days, the unicode:chardata()
type; see http://erlang.org/doc/man/unicode.html#type-chardata

Now that representation is pretty awful for most
processing purposes, but is absolutely BRILLIANT for
concatenation and output.  And the unicode module
lets you convert it to a simple list or a binary if
and when you need to.
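
A small sketch of that trade-off. The module name, function
name, and sample data are invented; the conversions are the
documented unicode module functions.

```erlang
-module(chardata_demo).
-export([demo/0]).

%% Hypothetical demo module; the names are illustrative only.
demo() ->
    %% A name as a code-point list; 16#EB is U+00EB (e with diaeresis).
    Name = [$Z, $o, 16#EB],
    %% "Concatenation" is just building a nested term: UTF-8
    %% binaries, code points, and lists of either, with no copying.
    Greeting = [<<"Hello, ">>, Name, $!, [<<" Bye">>]],
    %% Flatten to a UTF-8 binary only when one is actually needed.
    <<"Hello, Zo", 16#C3, 16#AB, "! Bye">> =
        unicode:characters_to_binary(Greeting),
    %% Or flatten to a plain list of code points for processing.
    Flat = unicode:characters_to_list(Greeting),
    "Hello, Zo" ++ [16#EB | "! Bye"] = Flat,
    ok.
```

The nested term is cheap to build and cheap to write out, but
finding, say, its tenth character means walking the whole
structure, which is why it gets flattened before any real
processing.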

Always, you need to ask "What am I going to do with this?"



