[erlang-questions] correct terminology for referring to strings
Richard O'Keefe
ok@REDACTED
Wed Aug 1 06:33:08 CEST 2012
On 31/07/2012, at 9:53 PM, Michael Turner wrote:
>> << An Erlang "string" is simply a list of integers. Each integer can
>> represent any Unicode codepoint/character. >>
>
> Except that Unicode codepoints represents characters, right?
Wrong.
One Unicode codepoint may represent what a particular language
views as two distinct graphemes. (This occurs in encoding English,
for example: in 'belovéd' the diacritical mark is a stress accent
and so é counts as two separate graphemes.)
One grapheme may require two or more Unicode codepoints.
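A minimal sketch of that last point, in Python rather than Erlang purely for illustration: the same visible 'é' can be stored as one precomposed codepoint or as a base letter plus a combining accent. The variable names are made up for the example; `unicodedata.normalize` is the standard-library call.

```python
import unicodedata

# 'é' as a single precomposed codepoint, U+00E9.
composed = "\u00e9"
# Canonical decomposition (NFD) splits it into 'e' + U+0301 COMBINING ACUTE ACCENT.
decomposed = unicodedata.normalize("NFD", composed)

print([hex(ord(c)) for c in composed])    # ['0xe9']
print([hex(ord(c)) for c in decomposed])  # ['0x65', '0x301']

# The two codepoint sequences differ, although a reader sees the same thing.
print(composed == decomposed)                                # False
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```

So a codepoint-by-codepoint comparison of two strings can disagree with what the reader sees unless both are normalized first.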
Some characters, well, 26FD FE0E is a black-and-white picture
of a petrol pump, but 26FD FE0F is a colour version. Either
of these is perceived by users as a single 'character'.
FE0E represents "select text style for the previous thingy";
FE0F represents "select emoji style for it". You'd be hard
pressed to call either FE0E or FE0F a "character".
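To make the fuel pump case concrete, here is a small sketch (Python, for illustration only) that lists the codepoints in each sequence with their official Unicode names. The helper `describe` is hypothetical; `unicodedata.name` is the standard-library lookup.

```python
import unicodedata

text_style = "\u26fd\ufe0e"   # FUEL PUMP + "text style" selector
emoji_style = "\u26fd\ufe0f"  # FUEL PUMP + "emoji style" selector

def describe(s):
    """Return (codepoint, Unicode name) pairs for every codepoint in s."""
    return [(f"U+{ord(c):04X}", unicodedata.name(c)) for c in s]

print(describe(text_style))
# [('U+26FD', 'FUEL PUMP'), ('U+FE0E', 'VARIATION SELECTOR-15')]
print(describe(emoji_style))
# [('U+26FD', 'FUEL PUMP'), ('U+FE0F', 'VARIATION SELECTOR-16')]

# Each user-perceived 'character' here is a sequence of two codepoints.
print(len(text_style), len(emoji_style))  # 2 2
```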
The majority of codepoints represent nothing at all (yet).
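A rough way to check that claim, sketched in Python: count the codepoints whose general category is "Cn" (unassigned, which also covers the few dozen permanent noncharacters). The exact figure depends on the Unicode version bundled with the interpreter, so only the proportion is meaningful.

```python
import sys
import unicodedata

# The codespace runs from U+0000 to U+10FFFF.
total = sys.maxunicode + 1

# Category "Cn" means "not assigned" in the interpreter's Unicode database.
unassigned = sum(
    1 for cp in range(total) if unicodedata.category(chr(cp)) == "Cn"
)

print(total)                  # 1114112
print(unassigned > total / 2) # True
```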
The thing people *still* don't get about Unicode is that
with ASCII and EBCDIC and Latin-1 there really was such a
thing as a "character" that a string was a sequence of, but
in Unicode, a string is *not* a sequence of characters but
a *well-formed* sequence of codepoints. You *can't* represent
the "emoji style FUEL PUMP" <<character>> by a single number,
only by a *sequence* of codepoints.
I keep meaning to write a small book called "Strings Made Difficult."
From the Unicode FAQ:
Q: So is a combining character sequence the same as a “character”?
A: That depends. For a programmer, a Unicode code value represents a single character (for exceptions, see below). For an end user, it may not. The better word for what end-users think of as characters is grapheme (as defined in the Unicode glossary): a minimally distinctive unit of writing in the context of a particular writing system.
...
Q: How should characters (particularly composite characters) be counted, for the purposes of length, substrings, positions in a string, etc.?
A: In general, there are 3 different ways to count characters. Each is illustrated with the following sample string.
“a” + umlaut + greek_alpha + \uE0000.
(the latter is a private use character)
1. Code Units: e.g. how many bytes are in the physical representation of the string. Example:
In UTF-8, the sample has 9 bytes. [61 CC 88 CE B1 F3 A0 80 80]
In UTF-16BE, it has 10 bytes. [00 61 03 08 03 B1 DB 40 DC 00]
In UTF-32BE, it has 16 bytes. [00 00 00 61 00 00 03 08 00 00 03 B1 00 0E 00 00]
2. Codepoints: how many code points are in the string.
The sample has 4 code points. This is equivalent to the UTF-32BE count divided by 4.
3. Graphemes: what end-users consider as characters.
A default grapheme cluster is specified in UAX #29, Unicode Text Segmentation, as well as in UTS #18, Unicode Regular Expressions.
The choice of which one to use depends on the tradeoffs between efficiency and comprehension. For example, Java, Windows and ICU use #1 with UTF-16 for all low-level string operations, and then also supply layers above that provide for #2 and #3 boundaries when circumstances require them. This approach allows for efficient processing, with allowance for higher-level usage. However, for a very high level application, such as word-processing macros, graphemes alone will probably be sufficient.
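The three counts in the FAQ answer can be reproduced with a short sketch (Python, for illustration only). The code unit and codepoint counts use standard codecs. The grapheme count is only an approximation that attaches combining marks to the preceding base; it is not the UAX #29 algorithm, and `approx_grapheme_count` is a made-up helper name.

```python
import unicodedata

# "a" + combining umlaut (U+0308) + Greek alpha (U+03B1) + U+E0000
sample = "a\u0308\u03b1\U000E0000"

# 1. Code units, measured in bytes for each encoding form.
utf8 = sample.encode("utf-8")
utf16 = sample.encode("utf-16-be")
utf32 = sample.encode("utf-32-be")
print(len(utf8), len(utf16), len(utf32))  # 9 10 16

# 2. Codepoints: a Python str is indexed by codepoint.
print(len(sample))  # 4

# 3. Graphemes (approximation only): count every codepoint that is not
#    a combining mark, so each mark is folded into the preceding base.
def approx_grapheme_count(s):
    return sum(1 for c in s if unicodedata.combining(c) == 0)

print(approx_grapheme_count(sample))  # 3
```

For real text segmentation a library that implements UAX #29 is needed; the approximation above gets this sample right but will miscount sequences such as emoji joined with ZERO WIDTH JOINER.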
> You can't
> have a representation of a representation.[*]
>
> I suggest:
>
> << In Erlang, strings are represented as lists of integers. These
> integers are Unicode codepoints, each representing a character. >>
>
> That way, anybody who's unclear on what "codepoint" means gets a
> freebie definition of it. In the Unicode context, it's probably wrong,
> technically, but perhaps good enough for this purpose.
>
> -michael turner
>
> [*] Douglas Hofstadter might beg to differ, but he's not on this list.
>
>
> On Tue, Jul 31, 2012 at 6:41 PM, Paul Barry <paul.james.barry@REDACTED> wrote:
>> Hi Joe.
>>
>> I think "string literal" is pretty widely understood (it even has a
>> WikiPedia entry, here: http://en.wikipedia.org/wiki/String_literal).
>>
>> What threw me about your sentence was the use of the word 'codepoint',
>> which will be OK for those already familiar with Unicode, but might
>> confuse those who are not. My feeling (and this might be a gross
>> over-simplification) is that most North-American programmers know
>> about Unicode but don't let it worry them too much, resulting in less
>> of a familiarity with it than might be necessary (and I apologize to
>> any North-American programmers that this comment rubs the wrong way).
>> Perhaps "unicode characters" might be easier to read/understand?
>> Although not probably totally technically correct...
>>
>> Another thing that you might wish to consider is breaking the sentence
>> in two, as follows:
>>
>> << An Erlang "string" is simply a list of integers. Each integer can
>> represent any Unicode codepoint/character. >>
>>
>> Just my 2 cent.
>>
>> Paul.
>>
>> On 31 July 2012 10:24, Joe Armstrong <erlang@REDACTED> wrote:
>>> I'm working on a 2'nd edition of my book, and have got to strings :-)
>>> Strings confuse everybody, including me so I have a few questions:
>>>
>>> To start with Erlang doesn't have strings - it has lists (not strings)
>>> and it has string literals.
>>>
>>> I want to define a string - is this correct:
>>>
>>> << A "string" is a list of integers where the integers
>>> represent Unicode codepoints. >>
>>>
>>> Questions:
>>> Is the sentence inside << .. >> using the correct terminology?
>>> If not what should it say?
>>>
>>> Is the sentence inside << ... >> widely understood, do you think this
>>> would confuse a lot of people?
>>>
>>> Is the phrase "string literal" widely understood?
>>>
>>>
>>> Cheers
>>>
>>> /Joe
>>> _______________________________________________
>>> erlang-questions mailing list
>>> erlang-questions@REDACTED
>>> http://erlang.org/mailman/listinfo/erlang-questions
>>
>>
>>
>> --
>> Paul Barry, w: http://paulbarry.itcarlow.ie - e: paul.barry@REDACTED
>> Lecturer, Computer Networking: Institute of Technology, Carlow, Ireland.