[eeps] Commit: r54 - eeps/trunk
raimo+eeps@REDACTED
Wed Nov 12 15:31:44 CET 2008
Author: pan
Date: 2008-11-12 15:31:43 +0100 (Wed, 12 Nov 2008)
New Revision: 54
Modified:
eeps/trunk/eep-0010.txt
Log:
Spell checking and general improvement on readability
Modified: eeps/trunk/eep-0010.txt
===================================================================
--- eeps/trunk/eep-0010.txt 2008-11-11 10:22:48 UTC (rev 53)
+++ eeps/trunk/eep-0010.txt 2008-11-12 14:31:43 UTC (rev 54)
@@ -1,6 +1,6 @@
EEP: 10
Title: Representing Unicode characters in Erlang
-Version: $Id: unicode_in_erlang.txt,v 1.11 2008/11/11 10:12:28 pan Exp pan $
+Version: $Id: unicode_in_erlang.txt,v 1.12 2008/11/12 14:29:51 pan Exp $
Last-Modified: $Date$
Author: Patrik Nyblom
Status: Draft
@@ -39,7 +39,7 @@
system should provide. The Unicode support is by no means complete if
this EEP is implemented, but implementation will be feasible.
-The EEP also suggests library functions and bit sytax to deal with
+The EEP also suggests library functions and bit syntax to deal with
alternative encodings. However, one *standard* encoding is suggested,
which will be what library functions in Erlang are expected to
support, while other representations are supported only in terms of
@@ -235,8 +235,8 @@
unicode:utf8_to_list(Bin) -> UL
-Where Bin is a binary consisting of unicode characters encoded as
-UTF-8 and UL is a plain list of unicode characters.
+Where Bin is a binary consisting of Unicode characters encoded as
+UTF-8 and UL is a plain list of Unicode characters.
To allow for conversion to and from latin1 the functions::
@@ -244,10 +244,10 @@
and::
- unicode:latin1_list_to_list(LM) -> UL
+ unicode:latin1_list_to_unicode_list(LM) -> UL
would do the same job. Actually latin1_list_to_list is not necessary
-in this context, as it is more of a iolist-function, but should be
+in this context, as it is more of an iolist-function, but should be
present for completeness.
The fact that lists of integers representing latin1 characters are a
@@ -273,15 +273,15 @@
The word "characters" is used to denote a possibly complex
representation of characters in the encoding concerned, like a short
word for "a possibly mixed and deep list of characters and/or binaries
-in either latin1 representation or unicode".
+in either latin1 representation or Unicode".
Giving latin1 as the encoding would mean that all of ML should be
interpreted as latin1 characters, implying that integers > 255 in the
-list would be an error. Giving unicode as the encoding would mean that
+list would be an error. Giving Unicode as the encoding would mean that
all integers 0..16#10ffff are accepted and the binaries are expected
to already be UTF-8 coded.
-In the same way, conversion to lists of unicode characters could be done with a function::
+In the same way, conversion to lists of Unicode characters could be done with a function::
unicode:characters_to_list(ML, InEncoding) -> list()
ML := A mixed Unicode list or a mixed latin1 list
@@ -290,16 +290,16 @@
I think the approach of two simple conversion functions
characters_to_binary/2 and characters_to_list/2 is attractive, despite the fact
that certain combinations of in-data would be somewhat harder to
-convert (e.g. combinations of unicode characters > 255 in a list with
+convert (e.g. combinations of Unicode characters > 255 in a list with
binaries in latin1). Extending the bit syntax to cope with UTF-8 would
make it easy to write special conversion functions to handle those
rare situations where the above mentioned functions cannot do the job.
-To accomodate other encodings, the characters_to_binary functionality
+To accommodate other encodings, the characters_to_binary functionality
could be extended to handle other encodings as well. A more general
functionality could be provided with the following functions
-(preferrably placed in their own module, the module name 'unicode'
-beeing a good name-candidate):
+(preferably placed in their own module, the module name 'unicode'
+being a good name-candidate):
**characters_to_binary(ML) -> binary() | {error, Encoded, Rest} | {incomplete, Encoded, Rest}**
@@ -322,7 +322,7 @@
- Rest := Mixed list as specified for ML.
The option 'unicode' is an alias for utf8, as this is the
-preferred encoding for unicode characters in binaries. Error tuples
+preferred encoding for Unicode characters in binaries. Error tuples
are returned when the data cannot be encoded/decoded due to errors
in indata and incomplete tuples when the indata is possibly correct
but truncated.
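A minimal sketch of these return conventions, assuming the semantics
specified above (module and function names per this draft; the byte
values shown follow from the UTF-8 coding rules):

```erlang
-module(conv_sketch).
-export([demo/0]).

%% Sketch of the proposed error/incomplete return conventions for
%% unicode:characters_to_binary/2, assuming the draft's semantics.
demo() ->
    %% A latin1 list converts cleanly to a UTF-8 binary ("åäö"):
    <<195,165,195,164,195,182>> =
        unicode:characters_to_binary([229,228,246], latin1),
    %% <<195>> opens a two-byte UTF-8 sequence that is cut off, so the
    %% indata is possibly correct but truncated -> incomplete-tuple:
    {incomplete, <<>>, <<195>>} =
        unicode:characters_to_binary(<<195>>, utf8),
    %% The byte 255 can never occur in UTF-8, so this is an outright
    %% error; the part encoded before the failure rides in the tuple:
    {error, <<"abc">>, <<255>>} =
        unicode:characters_to_binary(<<97,98,99,255>>, utf8),
    ok.
```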
@@ -343,7 +343,7 @@
Here also the option 'unicode' denotes the default Erlang encoding
of utf8 in binaries and is therefore an alias for utf8. Error- and
-incomplete-tuples ase returned in the same way as for
+incomplete-tuples are returned in the same way as for
characters_to_binary.
Note that as the datatypes returned upon success are well defined,
@@ -351,7 +351,7 @@
returning the clunky {ok, Data} tuples even though the error and
incomplete tuples can be returned. This makes the functions simpler to
use when the encoding is known to be correct while return values can
-still be checked easilly.
+still be checked easily.
Bit syntax
@@ -372,7 +372,7 @@
As bit syntax is often used to interpret data from various external
sources, it would be useful to have corresponding utf16 and utf32
-types as well. While UTF-8 and UTF-16 and UTF-32 is easily interpreted
+types as well. While UTF-8, UTF-16 and UTF-32 are easily interpreted
with the current bit syntax implementation, the suggested specific
types would be convenient for the programmer. Also Unicode imposes
restrictions in terms of range and has some forbidden ranges which are best
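A sketch of how the suggested segment types could read, assuming a
Value/utf8-style syntax (hypothetical until the extension is
implemented):

```erlang
-module(utf_bits_sketch).
-export([demo/0]).

%% Sketch of the suggested utf8/utf16 segment types, using the euro
%% sign U+20AC as a sample code point.
demo() ->
    Euro = 16#20AC,
    %% Construct: the euro sign as three UTF-8 bytes ...
    <<226,130,172>> = <<Euro/utf8>>,
    %% ... and as a single big-endian UTF-16 unit:
    <<16#20,16#AC>> = <<Euro/utf16-big>>,
    %% Match: pick one code point off the front of a UTF-8 binary:
    <<Cp/utf8, Rest/binary>> = <<226,130,172,33>>,
    {16#20AC, <<"!">>} = {Cp, Rest},
    ok.
```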
@@ -405,7 +405,7 @@
implementations use binaries for communication, which in practice has
made the io-protocol contain bytes, not general characters.
-Futhermore has the fact that the io-system currently works with
+Furthermore, the fact that the io-system currently works with
characters that can be represented as bytes has been utilized in numerous
applications, so that output from io-functions (i.e. io_lib:format)
has been sent directly to entities only accepting byte input (like
@@ -441,7 +441,7 @@
To make a solution that as far as possible does not break current code
and also keeps (or reverts to) the original intention of the
io-system protocol, I suggest a scheme where the formatting functions
-that return lists, keep to the current behaviour except when the
+that return lists, keep to the current behavior except when the
translation-modifier is used, in which case binaries in UTF-8 encoding
are returned.
@@ -462,7 +462,7 @@
3> io_lib:format("~ts",[UniString]).
-\- would return a (deep) list with the unicode string as a binary::
+\- would return a (deep) list with the Unicode string as a binary::
[[<<208,154,208,176,208,186,208,178,208,190,32,208,181,32,
85,110,105,99,111,100,101,32,63>>]]
@@ -490,7 +490,7 @@
\- would be accepted.
-The corresponding behavior of io:fread/2,3 would be to expect unicode data in this call::
+The corresponding behavior of io:fread/2,3 would be to expect Unicode data in this call::
11> io:fread(File,'',"~ts").
@@ -500,10 +500,10 @@
The actual io-protocol, on the other hand, should deal only with
Unicode, meaning that when data is converted to binaries for sending,
-all data should be transflated into UTF-8. When lists of integers are
-used in communication, the latin1 and unicode representations are the
+all data should be translated into UTF-8. When lists of integers are
+used in communication, the latin1 and Unicode representations are the
same, so no conversion or restrictions apply. Recall that the
-io-system is built so that characters should have one translation
+io-system is built so that characters should have one interpretation
regardless of the io-server. The only possible encoding would be a
Unicode one.
@@ -514,31 +514,31 @@
Generally, writing
to an io-server using the file-module will only be possible with
-byte-oriented data, while using the io-module will work on unicode
+byte-oriented data, while using the io-module will work on Unicode
characters. Calling the function file\:write/2 will send the bytes to
the file as is, as files are byte-oriented, but when writing on a file
-using the io-module, unicode characters are expected and handled.
+using the io-module, Unicode characters are expected and handled.
-The io-protocol will make conversons of bytes into unicode when
+The io-protocol will make conversions of bytes into Unicode when
sending to io-servers, but if the file is byte-oriented, the
conversion back will make this transparent to the user. All bytes are
representable in UTF-8 and can be converted back and forth without hassle.
The incompatible change will have to be to the put_chars function in
-io. It should only allow unicode data, not iodata() as it is
-documented to do now. The big change beeing that any binaries provided
+io. It should only allow Unicode data, not iodata() as it is
+documented to do now. The big change being that any binaries provided
to the function need to be in UTF-8. However, most usage of this
-function is restricted to lists, why this incompatible change not is
-expected to cause trouble for users.
+function is restricted to lists, so this incompatible change is
+not expected to cause trouble for users.
-To handle possible unicode text data on a file, one should be able to
+To handle possible Unicode text data on a file, one should be able to
provide encoding parameters when opening a file. A file should by
default be opened for byte (or latin1) encoding, while the option to
open it for i.e. utf8 translation should be available.
Let's look at some examples:
-Example 1 - common byte-oriented reading
+Example 1 - common byte-oriented writing
........................................
A file is opened as usual with file\:open. We then want to write bytes
@@ -556,45 +556,45 @@
UTF-8 bytes written to the file as expected.
- Using io:put_chars, the io-server will return an error if any of the
- unicode characters sent are not possible to represent in one
+ Unicode characters sent are not possible to represent in one
byte. Characters representable in latin1 will however be written
nicely even though they might be encoded as UTF-8 in binaries sent
- to io:put_chars. As long ars the io_lib:format function is used
+ to io:put_chars. As long as the io_lib:format function is used
without the translation-modifier, everything will be valid latin1
and all return values will be lists, so it is both valid Unicode *and*
possible to write on a default file. Old code will function as
before, except when feeding io:put_chars with latin1 binaries, in
- that case the call should be exchanged for a file\:write call.
+ that case the call should be replaced with a file\:write call.
-Example 2 - Unicode-orinted writing
-...................................
+Example 2 - Unicode-oriented writing
+....................................
-A file is opened using a parameter telling that unicode data should be
+A file is opened using a parameter telling that Unicode data should be
written in a defined encoding, in this case we'll select UTF-16/bigendian to
-avoid mixups with the native UTF-8 encoding. We open the file with
+avoid mix-ups with the native UTF-8 encoding. We open the file with
file\:open(Name,[write,{encoding,utf16,bigendian}]).
- Using file\:write with iodata(), the io-protocol will convert into
- the default unicode representation (UTF-8) and send the data to the
+ the default Unicode representation (UTF-8) and send the data to the
io-server, which will in turn convert the data to UTF-16 and put it
on the file. The file is to be regarded as a text file and all
iodata() sent to it will be regarded as text.
-- If the data is already in unicode representation (say UTF-8) it
+- If the data is already in Unicode representation (say UTF-8) it
should not be written to this type of file using file\:write,
io:put_chars is expected to be used (which is not a problem as
- unicode data should not exist in old code and this is only a problem
+ Unicode data should not exist in old code and this is only a problem
when the file is opened to translate).
- If the data is in the Erlang default Unicode format, it can be
written to the file using io:put_chars. This works for all types of
lists with integers and for binaries in UTF-8, for other
representations (most notably latin1 in binaries) the data should be
- converted using unicode:characters_to_XXX(Data,latin1) prior to
+ converted using unicode:characters_to_XXX(Data,latin1) prior to
sending. For latin1 mixed lists (iodata()), file\:write can also be
used directly.
-To sum up this case - unicode strings (including latin1 lists) are
+To sum up this case - Unicode strings (including latin1 lists) are
written to a converting file using io:put_chars, but pure iodata() can
also be implicitly converted to the encoding by using file\:write.
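Example 2 could be sketched as below. Two assumptions are made for
concreteness: the open-option is written as a nested tuple
{encoding,{utf16,big}} rather than the draft's
{encoding,utf16,bigendian}, and the file name is made up:

```erlang
-module(utf16_write_sketch).
-export([demo/0]).

%% Sketch of example 2: writing through a translating (UTF-16/big)
%% file. "demo.txt" is a hypothetical file name.
demo() ->
    {ok, F} = file:open("demo.txt", [write, {encoding, {utf16, big}}]),
    %% A Unicode string (code points > 255) goes through io:put_chars;
    %% the io-server converts it to UTF-16/big on the file:
    ok = io:put_chars(F, [1050,1072,1082,1074,1086]),   %% "Какво"
    %% The draft additionally lets plain iodata() go through
    %% file:write, implicitly converted to the file's encoding.
    file:close(F).
```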
@@ -605,8 +605,8 @@
together with io:put_chars.
- Data formatted with io_lib:format can still be written to a raw file
- using file\:write. The data will end up beeing written as is. If the
- translation modifier is consequently used when formatting, the file
+ using file\:write. The data will end up being written as is. If the
+ translation modifier is consistently used when formatting, the file
will get the native UTF-8 encoding, if no translation modifiers are
used, the file will have latin1 encoding (each character in the list
returned from io_lib:format will be representable as a latin1
@@ -623,44 +623,44 @@
writing.
- file\:read on any file will expect the io-protocol to deliver data as
- Unicode. Each byte will be converted to unicode by the io_server and
+ Unicode. Each byte will be converted to Unicode by the io_server and
turned back to a byte by file\:read
-- If the file actually contains unicode characters, they will bytewise
- be converted to unicode and then bach, giving file\:read the
+- If the file actually contains Unicode characters, they will be byte-wise
+ converted to Unicode and then back, giving file\:read the
original encoding. If read as (or converted to) binaries they can
- then easilly be converted back to the Erlang default representation
+ then easily be converted back to the Erlang default representation
by means of the conversion routines.
- If the file is read with io:get_chars, all characters will be
returned in a list as expected. All characters will be latin1, but
- that is a subset of unicode and there will be no difference to
- reading a translating file. If the file however contains unicode
+ that is a subset of Unicode and there will be no difference to
+ reading a translating file. If the file however contains Unicode
converted characters and is read in this way, the return value from
io:get_chars will be hard to interpret, but that is to be
expected. If such a functionality is desired, the list can be
converted to a binary with list_to_binary and then explored as a
- unicode entity in the encoding the file actually has.
+ Unicode entity in the encoding the file actually has.
Example 5 - Unicode file reading
................................
-As when writing, reading unicode converting files is best done with
+As when writing, reading Unicode converting files is best done with
the io-module. Let's once again assume UTF-16 on the file.
- When reading using file\:read, the UTF-16 data will be converted into
- a unicode representation native to Erlang and sent to the
+ a Unicode representation native to Erlang and sent to the
client. If the client is using file\:read, it will translate the data
- back to bytes in the same way as bytes where translated to unicode
+ back to bytes in the same way as bytes were translated to Unicode
for the protocol when writing. Is everything representable as bytes,
- the function will succeed, but if any unicode character larger than
+ the function will succeed, but if any Unicode character larger than
255 is present, the function will fail with a decoding error.
-- Unicode data in the range over codepoint 255 can not be retrieved by
+- Unicode data above code point 255 cannot be retrieved by
use of the file-module. The io-module should be used instead.
-- io:get_chars and io:get_line will work on the unicode data provided
- by the io-protocol. All unicode returns will be as unicode lists as
+- io:get_chars and io:get_line will work on the Unicode data provided
+ by the io-protocol. All Unicode returns will be as Unicode lists as
expected. The fread function will return UTF-8 encoded binaries only
when the translation modifier is supplied.
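Reading such a file could be sketched as follows (same assumed
nested-tuple option form as in the writing sketch; the file is
assumed to hold UTF-16/big text):

```erlang
-module(utf16_read_sketch).
-export([demo/0]).

%% Sketch of example 5: reading a translating (UTF-16/big) file with
%% the io-module. "demo.txt" is a hypothetical file name.
demo() ->
    {ok, F} = file:open("demo.txt", [read, {encoding, {utf16, big}}]),
    %% io:get_line works on the Unicode data delivered by the
    %% io-protocol and returns a plain Unicode list:
    Line = io:get_line(F, ''),
    ok = file:close(F),
    %% Any code point > 255 in Line would have made a corresponding
    %% file:read call fail with a decoding error (see above).
    Line.
```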
@@ -668,8 +668,8 @@
.......................
As with writing, only the file module can be used and only byte
-oriented data is read. If encoded, The encoding will remain when
-reading and writing.
+oriented data is read. If encoded, the encoding will remain when
+reading and writing raw files.
Conclusions from the examples
@@ -678,17 +678,17 @@
With this solution, the file module is consistent with latin1
io_servers (aka common files) and raw files. A file type, a translating
file, is added for the io-module to be able to get implicit conversion
-of it's Unicode data (another example of such an io_server with
+of its Unicode data (another example of such an io_server with
implicit conversion would of course be the
terminal). Interface-wise,common files behave as before and we only
get added functionality.
-The downsides are the subtly changed behaviour of io:put_chars and the
+The downsides are the subtly changed behavior of io:put_chars and the
performance impact by the conversion to and from Unicode
representations when using the file module on non-raw files with
default (latin1/byte) encoding. The latter may be possible to change
by extending the io-protocol to tag whole chunks of data as bytes
-(latin1) or unicode, but using raw files for writing large amounts of
+(latin1) or Unicode, but using raw files for writing large amounts of
data is often the better solution in those cases.
@@ -706,9 +706,9 @@
---------------------------------------
I also suggest a module 'unicode', containing functions for
-converting between representations of unicode. The default format for
+converting between representations of Unicode. The default format for
all functions should be utf8 in binaries to point out this as the
-preferred internal representation of unicode characters in binaries.
+preferred internal representation of Unicode characters in binaries.
The two main conversion functions should be characters_to_binary/3 and
characters_to_list/2 as described above.
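A short sketch of the two main conversions with utf8 in binaries as
the default format (names and semantics as specified earlier in this
draft):

```erlang
-module(unicode_mod_sketch).
-export([demo/0]).

%% Sketch of the two main conversion functions in the suggested
%% 'unicode' module, round-tripping both latin1 and wider code points.
demo() ->
    %% latin1 list -> UTF-8 binary and back:
    Bin = unicode:characters_to_binary("hello", latin1),
    "hello" = unicode:characters_to_list(Bin, utf8),
    %% Round-trip of code points above latin1 ("Яко"); 'unicode' is
    %% the alias for the default utf8 encoding:
    L = [1071,1082,1086],
    L = unicode:characters_to_list(
          unicode:characters_to_binary(L, unicode), utf8),
    ok.
```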
@@ -731,7 +731,7 @@
UTF-32 will need to be supported in a similar way as UTF-16, both for
completeness and for the range-checking that will be involved when
-converting unicode characters.
+converting Unicode characters.
Formatting
----------
@@ -743,10 +743,10 @@
functionality for code not using the translation modifier, but will
return UTF-8 binaries when ordered to.
-The fread function should in the same way accept unicode data only
+The fread function should in the same way accept Unicode data only
when the "t" modifier is used.
-The io-protocol need to be changed o always handle Unicode characters.
+The io-protocol needs to be changed to always handle Unicode characters.
Options given when opening a file will allow for implicit conversion of
text files.