[erlang-questions] correct terminology for referring to strings

Tue Jul 31 14:44:34 CEST 2012

On 2012-07-31, at 11:53 , Michael Turner wrote:

>> << An Erlang "string" is simply a list of integers.  Each integer can
>> represent any Unicode codepoint/character. >>
> 
> Except that Unicode codepoints represent characters, right?

Depends: the definition of "character" is quite ambiguous.

By "character", many people mean what Unicode calls a "grapheme" (what a
user thinks of as a single character, as opposed to a "glyph", the
concrete shape displayed on a medium[-1]). The meaning of the word may
also change across cultures: with diacritics, for instance, some cultures
consider base+diacritic(s) a single character, others several. And it
becomes very tough to define for e.g. hangul: is the character a hangul
block or the jamo composing it?[0]

The Unicode Standard itself lists 4 different and potentially
incompatible meanings for the word:

(1) The smallest component of written language that has semantic value;
    refers to the abstract meaning and/or shape, rather than a specific
    shape (see also glyph), though in code tables some form of visual
    representation is essential for the reader’s understanding.
(2) Synonym for abstract character.
(3) The basic unit of encoding for the Unicode character encoding.
(4) The English name for the ideographic written elements of Chinese origin

where "abstract character" is defined as:

A unit of information used for the organization, control, or
representation of textual data.
* When representing data, the nature of that data is generally symbolic
  as opposed to some other kind of data (for example, aural or visual).
  Examples of such symbolic data include letters, ideographs, digits,
  punctuation, technical symbols, and dingbats.
* An abstract character has no concrete form and should not be confused
  with a glyph.
* An abstract character does not necessarily correspond to what a user
  thinks of as a “character” and should not be confused with a grapheme.
* The abstract characters encoded by the Unicode Standard are known as
  Unicode abstract characters.
* Abstract characters not directly encoded by the Unicode Standard can
  often be represented by the use of combining character sequences.

In most meanings of the word "character", a character maps to a
*sequence* of Unicode code points (potentially of length one); there
isn't a 1:1 mapping.
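
To make the "sequence of code points" point concrete, here is a minimal
sketch. It is in Python rather than Erlang, purely for illustration; the
helper name code_points is made up for this example. The lists of integers
it prints are what a list-of-code-points string would hold for each
spelling of "é".

```python
import unicodedata

# Two encodings of what a reader sees as the single character "é".
precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE
combining = "e\u0301"    # "e" followed by COMBINING ACUTE ACCENT

def code_points(s):
    """Return the code points of s as a list of integers."""
    return [ord(c) for c in s]

print(code_points(precomposed))   # [233]
print(code_points(combining))     # [101, 769]

# The two strings differ code point by code point...
print(precomposed == combining)   # False

# ...but are canonically equivalent: NFC composes the pair into U+00E9.
print(unicodedata.normalize("NFC", combining) == precomposed)  # True
```

So counting integers in the list gives 1 or 2 for the same user-visible
character, depending on how the text happened to be encoded.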

[-1] don't take my word for it, I might have fucked up my recollection,
     I regularly get confused between the precise meanings of "glyph"
     and "grapheme"

[0] Modern hangul is written as syllabic blocks; each block is composed
    of three jamo (letters; technically 2 to 5 for ancient/historical
    texts). For instance 한 (han) is a block composed of three jamo: ㅎ,
    ㅏ, and ㄴ. Unicode allows encoding it either as HANGUL SYLLABLE
    HAN or as the sequence HANGUL CHOSEONG HIEUH, HANGUL JUNGSEONG A,
    HANGUL JONGSEONG NIEUN. But is it a character or not?
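
    The two encodings of 한 can be checked with a short sketch (again
    Python for illustration only, using the standard unicodedata module;
    the variable names are arbitrary):

```python
import unicodedata

syllable = "\ud55c"  # 한 as a single precomposed code point

# Canonical decomposition (NFD) splits the syllable into conjoining jamo.
jamo = unicodedata.normalize("NFD", syllable)

print(unicodedata.name(syllable))
# HANGUL SYLLABLE HAN
for c in jamo:
    print(f"U+{ord(c):04X}", unicodedata.name(c))
# U+1112 HANGUL CHOSEONG HIEUH
# U+1161 HANGUL JUNGSEONG A
# U+11AB HANGUL JONGSEONG NIEUN

# Canonical composition (NFC) goes back to the single code point.
print(unicodedata.normalize("NFC", jamo) == syllable)  # True
```

    One code point or three, the text displayed is the same block.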