[erlang-questions] correct terminology for referring to strings

Jan Burse janburse@REDACTED
Tue Jul 31 17:40:46 CEST 2012


Masklinn wrote:
> Say it's a sequence of code points (reified as integers)? That's exactly
> what it is. If people don't know what a code point is, they can look it
> up. In any case, this shouldn't bring along any undue semantic baggage
> and misconception.
>

If they are code points, there needs to be a reference to
the Unicode version (4.0, 6.0, etc.), a clarification of whether
platform-specific private-use codes are supported (e.g. the Apple
logo on the Mac), and a clarification of which planes are supported
(basic plane only, or supplementary planes also, or UCS, etc.).

	http://en.wikipedia.org/wiki/Universal_Character_Set
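
To make these three questions concrete, here is a small Python sketch
(Python only for illustration; describe_code_point is a made-up helper
name, not an API of Erlang or of any Prolog system). It reports which
plane a code point lies in and whether it falls into one of the
private-use ranges, and it prints the Unicode version of the character
database the interpreter ships with:

```python
import unicodedata

def describe_code_point(cp: int) -> dict:
    """Return the plane number and private-use status of a code point."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    plane = cp >> 16  # plane 0 is the Basic Multilingual Plane
    private_use = (
        0xE000 <= cp <= 0xF8FF          # private use area in the BMP
        or 0xF0000 <= cp <= 0xFFFFD     # plane 15, supplementary private use
        or 0x100000 <= cp <= 0x10FFFD   # plane 16, supplementary private use
    )
    return {"plane": plane, "private_use": private_use}

# Unicode version of the character database bundled with this interpreter.
print(unicodedata.unidata_version)

print(describe_code_point(0x41))      # LATIN CAPITAL LETTER A
print(describe_code_point(0xF8FF))    # private-use code Apple maps its logo to
print(describe_code_point(0x1F600))   # a supplementary-plane code point
```

A description that only says "sequence of code points" leaves open
which of these inputs a given system will accept.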

In the ISO Core Standard for Prolog (ISO/IEC 13211-1) the problem
is simply solved as follows:

	The processor character set PCS is an implementation
	defined character set. The members of PCS shall include
	each character defined by char (6.5).

	PCS may include additional members, known as extended
	characters. It shall be implementation defined for each
	extended character whether it is a graphic char, or an
	alphanumeric char, or a solo char, or a layout char, or a
	meta char.

	char (* 6.5 *)
	= graphic char (* 6.5.1 *)
	| alphanumeric char (* 6.5.2 *)
	| solo char (* 6.5.3 *)
	| layout char (* 6.5.4 *)
	| meta char (* 6.5.5 *) ;

This means the standard does not know about a Unicode extension. But
it requires that a Unicode extension can at least deal with the same
minimal subset unchanged; everything else is implementation
specific, i.e. Prolog system specific. Even for this subset the
exact coding is not specified; we only have:

     NOTE - These requirements on the collating sequence are
     satisfied by both ASCII and EBCDIC.

What the standard did not foresee was that there could be different
stream encodings on the same processor, even though it already
contains:

     NOTE - A character code may correspond to more than
     one byte in a stream. Thus, inputting a single character
     may consume several bytes from an input stream, and writing
     a single character may output several bytes to an output stream.

The current practice is that many Prolog systems offer an encoding/1
option in their stream handling, although no corrigendum has yet
picked that up. See for example SWI-Prolog:

 
http://www.swi-prolog.org/pldoc/doc_for?object=section%282,%272.18%27,swi%28%27/doc/Manual/widechars.html%27%29%29
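
As a rough illustration of why a per-stream encoding option matters,
the following Python sketch (again only for illustration;
encoded_lengths is a hypothetical helper, not part of any Prolog
system) shows that the same sequence of code points occupies a
different number of bytes depending on the stream encoding chosen:

```python
def encoded_lengths(s: str, encoding: str) -> list:
    """Number of bytes each character of s occupies in the given encoding."""
    return [len(ch.encode(encoding)) for ch in s]

text = "a\u00e9\u20ac"  # three code points: 'a', 'e' with acute, euro sign

print(encoded_lengths(text, "utf-8"))         # variable width per character
print(encoded_lengths(text, "utf-16-le"))     # two bytes each for these three
print(encoded_lengths("a\u00e9", "latin-1"))  # one byte each, no euro sign
```

So the number of bytes consumed per character is a property of the
stream, not of the character code itself.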

Bye


