[erlang-questions] correct terminology for referring to strings

Wed Aug 1 06:33:08 CEST 2012

On 31/07/2012, at 9:53 PM, Michael Turner wrote:

>> << An Erlang "string" is simply a list of integers.  Each integer can
>> represent any Unicode codepoint/character. >>
> 
> Except that Unicode codepoints represent characters, right?

Wrong.

One Unicode codepoint may represent what a particular language
views as two distinct graphemes.  (This occurs in encoding English,
for example: in 'belovéd' the diacritical mark is a stress accent
and so é counts as two separate graphemes.)
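
A small illustration of the 'é' case, written in Python rather than
Erlang purely for convenience (the language choice and variable names
are assumptions of this sketch, not part of the thread).  It only shows
that the single codepoint U+00E9 bundles a base letter and an accent,
which Unicode normalisation (NFD) can pull apart again:

```python
import unicodedata

word = "belov\u00e9d"                 # 'é' stored as the single codepoint U+00E9
decomposed = unicodedata.normalize("NFD", word)

composed_len = len(word)              # codepoints in the precomposed form
decomposed_len = len(decomposed)      # 'é' is now 'e' followed by U+0301

# Names of the pieces that U+00E9 decomposes into.
accent_parts = [unicodedata.name(c)
                for c in unicodedata.normalize("NFD", "\u00e9")]

print(composed_len, decomposed_len)
print(accent_parts)
```

Whether the letter and the stress accent count as one grapheme or two
is a question about the writing system, not something the codepoint
count can answer.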

One grapheme may require two or more Unicode codepoints.
For example, 26FD FE0E is a black-and-white picture
of a petrol pump, but 26FD FE0F is a colour version.  Either
of these is perceived by users as a single 'character'.
FE0E represents "select text style for the previous thingy";
FE0F represents "select emoji style for it".  You'd be hard
pressed to call either FE0E or FE0F a "character".

The majority of codepoints represent nothing at all (yet).
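
A rough way to check this claim, sketched in Python (an assumption of
this example; the thread itself has no code).  It counts codepoints
whose general category is "Cn" (unassigned, which also includes the
handful of noncharacters).  The exact count depends on the Unicode
version bundled with the Python build, so no specific figure is claimed:

```python
import sys
import unicodedata

total = sys.maxunicode + 1            # 0x110000 codepoints in all

# Category "Cn" means "not assigned" in the bundled Unicode database.
unassigned = sum(
    1 for cp in range(total)
    if unicodedata.category(chr(cp)) == "Cn"
)

print(f"{unassigned} of {total} codepoints are unassigned")
```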

The thing people *still* don't get about Unicode is that
with ASCII and EBCDIC and Latin-1 there really was such a
thing as a "character" that a string was a sequence of, but
in Unicode, a string is *not* a sequence of characters but
a *well-formed* sequence of codepoints.  You *can't* represent
the "emoji style FUEL PUMP" <<character>> by a single number,
only by a *sequence* of codepoints.
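
To make the petrol pump example concrete, here is a minimal sketch in
Python (chosen only for illustration; the list-of-integers layout
mirrors how an Erlang string is stored, and the variable names are
hypothetical).  Each "character" the user sees is a list of two
codepoints:

```python
import unicodedata

text_style = [0x26FD, 0xFE0E]    # FUEL PUMP + text-style selector
emoji_style = [0x26FD, 0xFE0F]   # FUEL PUMP + emoji-style selector

# Look up the official name of every codepoint in each sequence.
text_names = [unicodedata.name(chr(cp)) for cp in text_style]
emoji_names = [unicodedata.name(chr(cp)) for cp in emoji_style]

# Build the actual string; its length is counted in codepoints.
as_string = "".join(chr(cp) for cp in emoji_style)

print(text_names)
print(emoji_names)
print(len(as_string))
```

Neither selector has any appearance of its own; it only modifies how
the preceding codepoint is presented.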

I keep meaning to write a small book called "Strings Made Difficult."

From the Unicode FAQ:

Q: So is a combining character sequence the same as a “character”?

A: That depends. For a programmer, a Unicode code value represents a single character (for exceptions, see below). For an end user, it may not. The better word for what end-users think of as characters is grapheme (as defined in the Unicode glossary): a minimally distinctive unit of writing in the context of a particular writing system.

...
Q: How should characters (particularly composite characters) be counted, for the purposes of length, substrings, positions in a string, etc.?

A: In general, there are 3 different ways to count characters. Each is illustrated with the following sample string. 
“a” + umlaut + greek_alpha + \uE0000.
(the latter is a private use character)

1. Code Units: e.g. how many bytes are in the physical representation of the string. Example:
In UTF-8, the sample has 9 bytes. [61 CC 88 CE B1 F3 A0 80 80]
In UTF-16BE, it has 10 bytes. [00 61 03 08 03 B1 DB 40 DC 00]
In UTF-32BE, it has 16 bytes. [00 00 00 61 00 00 03 08 00 00 03 B1 00 0E 00 00]

2. Codepoints: how many code points are in the string.
The sample has 4 code points. This is equivalent to the UTF-32BE count divided by 4.

3. Graphemes: what end-users consider as characters.
A default grapheme cluster is specified in UAX #29, Unicode Text Segmentation, as well as in UTS #18, Unicode Regular Expressions.

The choice of which one to use depends on the tradeoffs between efficiency and comprehension. For example, Java, Windows and ICU use #1 with UTF-16 for all low-level string operations, and then also supply layers above that provide for #2 and #3 boundaries when circumstances require them. This approach allows for efficient processing, with allowance for higher-level usage. However, for a very high level application, such as word-processing macros, graphemes alone will probably be sufficient.
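
The three counts for the FAQ's sample string can be reproduced with a
short Python sketch (Python is an assumption of this example, not
something the FAQ uses).  The grapheme figure here is only a crude
approximation that skips combining marks; it is not an implementation
of the UAX #29 rules:

```python
import unicodedata

# "a" + combining umlaut (U+0308) + Greek alpha (U+03B1) + U+E0000
sample = "a\u0308\u03b1\U000E0000"

# 1. Code units: size of the encoded form, in bytes.
utf8 = sample.encode("utf-8")
utf16 = sample.encode("utf-16-be")
utf32 = sample.encode("utf-32-be")

# 2. Code points: Python's len() counts these directly.
code_points = len(sample)

# 3. Graphemes (approximation only): count code points that are not
#    combining marks, so "a" + umlaut is counted once.
approx_graphemes = sum(1 for c in sample if unicodedata.combining(c) == 0)

print(len(utf8), len(utf16), len(utf32))
print(code_points, approx_graphemes)
```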

> You can't
> have a representation of a representation.[*]
> 
> I suggest:
> 
> << In Erlang, strings are represented as lists of integers. These
> integers are Unicode codepoints, each representing a character. >>
> 
> That way, anybody who's unclear on what "codepoint" means gets a
> freebie definition of it. In the Unicode context, it's probably wrong,
> technically, but perhaps good enough for this purpose.
> 
> -michael turner
> 
> [*] Douglas Hofstadter might beg to differ, but he's not on this list.
> 
> 
> On Tue, Jul 31, 2012 at 6:41 PM, Paul Barry <paul.james.barry@REDACTED> wrote:
>> Hi Joe.
>> 
>> I think "string literal" is pretty widely understood (it even has a
>> Wikipedia entry, here: http://en.wikipedia.org/wiki/String_literal).
>> 
>> What threw me about your sentence was the use of the word 'codepoint',
>> which will be OK for those already familiar with Unicode, but might
>> confuse those who are not.  My feeling (and this might be a gross
>> over-simplification) is that most North-American programmers know
>> about Unicode but don't let it worry them too much, resulting in less
>> of a familiarity with it than might be necessary (and I apologize to
>> any North-American programmers that this comment rubs the wrong way).
>> Perhaps "unicode characters" might be easier to read/understand?
>> Although probably not totally technically correct...
>> 
>> Another thing that you might wish to consider is breaking the sentence
>> in two, as follows:
>> 
>> << An Erlang "string" is simply a list of integers.  Each integer can
>> represent any Unicode codepoint/character. >>
>> 
>> Just my 2 cents.
>> 
>> Paul.
>> 
>> On 31 July 2012 10:24, Joe Armstrong <erlang@REDACTED> wrote:
>>> I'm working on a 2nd edition of my book, and have got to strings :-)
>>> Strings confuse everybody, including me, so I have a few questions:
>>> 
>>> To start with, Erlang doesn't have strings - it has lists (not strings)
>>> and it has string literals.
>>> 
>>> I want to define a string - is this correct:
>>> 
>>> << A "string" is a list of integers where the integers
>>>      represent Unicode codepoints. >>
>>> 
>>> Questions:
>>>    Is the sentence inside << .. >> using the correct terminology?
>>>    If not what should it say?
>>> 
>>>    Is the sentence inside << ... >> widely understood, do you think this
>>>    would confuse a lot of people?
>>> 
>>>    Is the phrase "string literal" widely understood?
>>> 
>>> 
>>> Cheers
>>> 
>>> /Joe
>>> _______________________________________________
>>> erlang-questions mailing list
>>> erlang-questions@REDACTED
>>> http://erlang.org/mailman/listinfo/erlang-questions
>> 
>> 
>> 
>> --
>> Paul Barry, w: http://paulbarry.itcarlow.ie - e: paul.barry@REDACTED
>> Lecturer, Computer Networking: Institute of Technology, Carlow, Ireland.