[eeps] Commit: r54 - eeps/trunk
raimo+eeps@REDACTED
Wed Nov 12 15:31:44 CET 2008
Author: pan
Date: 2008-11-12 15:31:43 +0100 (Wed, 12 Nov 2008)
New Revision: 54
Modified:
eeps/trunk/eep-0010.txt
Log:
Spell checking and general improvement on readability
Modified: eeps/trunk/eep-0010.txt
===================================================================
--- eeps/trunk/eep-0010.txt 2008-11-11 10:22:48 UTC (rev 53)
+++ eeps/trunk/eep-0010.txt 2008-11-12 14:31:43 UTC (rev 54)
@@ -1,6 +1,6 @@
EEP: 10
Title: Representing Unicode characters in Erlang
-Version: $Id: unicode_in_erlang.txt,v 1.11 2008/11/11 10:12:28 pan Exp pan $
+Version: $Id: unicode_in_erlang.txt,v 1.12 2008/11/12 14:29:51 pan Exp $
Last-Modified: $Date$
Author: Patrik Nyblom
Status: Draft
@@ -39,7 +39,7 @@
system should provide. The Unicode support is by no means complete if
this EEP is implemented, but implementation will be feasible.
-The EEP also suggests library functions and bit sytax to deal with
+The EEP also suggests library functions and bit syntax to deal with
alternative encodings. However, one *standard* encoding is suggested,
which will be what library functions in Erlang are expected to
support, while other representations are supported only in terms of
@@ -235,8 +235,8 @@
unicode:utf8_to_list(Bin) -> UL
-Where Bin is a binary consisting of unicode characters encoded as
-UTF-8 and UL is a plain list of unicode characters.
+Where Bin is a binary consisting of Unicode characters encoded as
+UTF-8 and UL is a plain list of Unicode characters.
To allow for conversion to and from latin1 the functions::
@@ -244,10 +244,10 @@
and::
- unicode:latin1_list_to_list(LM) -> UL
+ unicode:latin1_list_to_unicode_list(LM) -> UL
would do the same job. Actually latin1_list_to_list is not necessary
-in this context, as it is more of a iolist-function, but should be
+in this context, as it is more of an iolist-function, but should be
present for completeness.
The fact that lists of integers representing latin1 characters are a
@@ -273,15 +273,15 @@
The word "characters" is used to denote a possibly complex
representation of characters in the encoding concerned, like a short
word for "a possibly mixed and deep list of characters and/or binaries
-in either latin1 representation or unicode".
+in either latin1 representation or Unicode".
Giving latin1 as the encoding would mean that all of ML should be
interpreted as latin1 characters, implying that integers > 255 in the
-list would be an error. Giving unicode as the encoding would mean that
+list would be an error. Giving Unicode as the encoding would mean that
all integers 0..16#10ffff are accepted and the binaries are expected
to already be UTF-8 coded.
-In the same way, conversion to lists of unicode characters could be done with a function::
+In the same way, conversion to lists of Unicode characters could be done with a function::
unicode:characters_to_list(ML, InEncoding) -> list()
ML := A mixed Unicode list or a mixed latin1 list
@@ -290,16 +290,16 @@
I think the approach of two simple conversion functions
characters_to_binary/2 and characters_to_list/2 is attractive, despite the fact
that certain combinations of in-data would be somewhat harder to
-convert (e.g. combinations of unicode characters > 255 in a list with
+convert (e.g. combinations of Unicode characters > 255 in a list with
binaries in latin1). Extending the bit syntax to cope with UTF-8 would
make it easy to write special conversion functions to handle those
rare situations where the above mentioned functions cannot do the job.
-To accomodate other encodings, the characters_to_binary functionality
+To accommodate other encodings, the characters_to_binary functionality
could be extended to handle other encodings as well. A more general
functionality could be provided with the following functions
-(preferrably placed in their own module, the module name 'unicode'
-beeing a good name-candidate):
+(preferably placed in their own module, the module name 'unicode'
+being a good name-candidate):
**characters_to_binary(ML) -> binary() | {error, Encoded, Rest} | {incomplete, Encoded, Rest}**
@@ -322,7 +322,7 @@
- Rest := Mixed list as specified for ML.
The option 'unicode' is an alias for utf8, as this is the
-preferred encoding for unicode characters in binaries. Error tuples
+preferred encoding for Unicode characters in binaries. Error tuples
are returned when the data cannot be encoded/decoded due to errors
in indata and incomplete tuples when the indata is possibly correct
but truncated.
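A minimal sketch of these return conventions, assuming the semantics
specified above (module and function names per this draft; the byte
values shown follow from the UTF-8 coding rules):

```erlang
-module(conv_sketch).
-export([demo/0]).

%% Sketch of the proposed error/incomplete return conventions for
%% unicode:characters_to_binary/2, assuming the draft's semantics.
demo() ->
    %% A latin1 list converts cleanly to a UTF-8 binary ("åäö"):
    <<195,165,195,164,195,182>> =
        unicode:characters_to_binary([229,228,246], latin1),
    %% <<195>> opens a two-byte UTF-8 sequence that is cut off, so the
    %% indata is possibly correct but truncated -> incomplete-tuple:
    {incomplete, <<>>, <<195>>} =
        unicode:characters_to_binary(<<195>>, utf8),
    %% The byte 255 can never occur in UTF-8, so this is an outright
    %% error; the part encoded before the failure rides in the tuple:
    {error, <<"abc">>, <<255>>} =
        unicode:characters_to_binary(<<97,98,99,255>>, utf8),
    ok.
```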
@@ -343,7 +343,7 @@
Here also the option 'unicode' denotes the default Erlang encoding
of utf8 in binaries and is therefore an alias for utf8. Error- and
-incomplete-tuples ase returned in the same way as for
+incomplete-tuples are returned in the same way as for
characters_to_binary.
Note that as the datatypes returned upon success are well defined,
@@ -351,7 +351,7 @@
returning the clunky {ok, Data} tuples even though the error and
incomplete tuples can be returned. This makes the functions simpler to
use when the encoding is known to be correct while return values can
-still be checked easilly.
+still be checked easily.
Bit syntax
@@ -372,7 +372,7 @@
As bit syntax is often used to interpret data from various external
sources, it would be useful to have corresponding utf16 and utf32
-types as well. While UTF-8 and UTF-16 and UTF-32 is easily interpreted
+types as well. While UTF-8, UTF-16 and UTF-32 are easily interpreted
with the current bit syntax implementation, the suggested specific
types would be convenient for the programmer. Also Unicode imposes
restrictions in terms of range and has some forbidden ranges which are best
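A sketch of how the suggested segment types could read, assuming a
Value/utf8-style syntax (hypothetical until the extension is
implemented):

```erlang
-module(utf_bits_sketch).
-export([demo/0]).

%% Sketch of the suggested utf8/utf16 segment types, using the euro
%% sign U+20AC as a sample code point.
demo() ->
    Euro = 16#20AC,
    %% Construct: the euro sign as three UTF-8 bytes ...
    <<226,130,172>> = <<Euro/utf8>>,
    %% ... and as a single big-endian UTF-16 unit:
    <<16#20,16#AC>> = <<Euro/utf16-big>>,
    %% Match: pick one code point off the front of a UTF-8 binary:
    <<Cp/utf8, Rest/binary>> = <<226,130,172,33>>,
    {16#20AC, <<"!">>} = {Cp, Rest},
    ok.
```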
@@ -405,7 +405,7 @@
implementations use binaries for communication, which in practice has
made the io-protocol contain bytes, not general characters.
-Futhermore has the fact that the io-system currently works with
+Furthermore, the fact that the io-system currently works with
characters that can be represented as bytes has been utilized in numerous
applications, so that output from io-functions (i.e. io_lib:format)
has been sent directly to entities only accepting byte input (like
@@ -441,7 +441,7 @@
To make a solution that as far as possible does not break current code
and also keeps (or reverts to) the original intention of the
io-system protocol, I suggest a scheme where the formatting functions
-that return lists, keep to the current behaviour except when the
+that return lists, keep to the current behavior except when the
translation-modifier is used, in which case binaries in UTF-8 encoding
are returned.
@@ -462,7 +462,7 @@
3> io_lib:format("~ts",[UniString]).
-\- would return a (deep) list with the unicode string as a binary::
+\- would return a (deep) list with the Unicode string as a binary::
[[<<208,154,208,176,208,186,208,178,208,190,32,208,181,32,
85,110,105,99,111,100,101,32,63>>]]
@@ -490,7 +490,7 @@
\- would be accepted.
-The corresponding behavior of io:fread/2,3 would be to expect unicode data in this call::
+The corresponding behavior of io:fread/2,3 would be to expect Unicode data in this call::
11> io:fread(File,'',"~ts").
@@ -500,10 +500,10 @@
The actual io-protocol, on the other hand, should deal only with
Unicode, meaning that when data is converted to binaries for sending,
-all data should be transflated into UTF-8. When lists of integers are
-used in communication, the latin1 and unicode representations are the
+all data should be translated into UTF-8. When lists of integers are
+used in communication, the latin1 and Unicode representations are the
same, so no conversion or restrictions apply. Recall that the
-io-system is built so that characters should have one translation
+io-system is built so that characters should have one interpretation
regardless of the io-server. The only possible encoding would be a
Unicode one.
@@ -514,31 +514,31 @@
Generally, writing
to an io-server using the file-module will only be possible with
-byte-oriented data, while using the io-module will work on unicode
+byte-oriented data, while using the io-module will work on Unicode
characters. Calling the function file\:write/2 will send the bytes to
the file as is, as files are byte-oriented, but when writing on a file
-using the io-module, unicode characters are expected and handled.
+using the io-module, Unicode characters are expected and handled.
-The io-protocol will make conversons of bytes into unicode when
+The io-protocol will make conversions of bytes into Unicode when
sending to io-servers, but if the file is byte-oriented, the
conversion back will make this transparent to the user. All bytes are
representable in UTF-8 and can be converted back and forth without hassle.
The incompatible change will have to be to the put_chars function in
-io. It should only allow unicode data, not iodata() as it is
-documented to do now. The big change beeing that any binaries provided
+io. It should only allow Unicode data, not iodata() as it is
+documented to do now. The big change being that any binaries provided
to the function need to be in UTF-8. However, most usage of this
-function is restricted to lists, why this incompatible change not is
-expected to cause trouble for users.
+function is restricted to lists, so this incompatible change is
+not expected to cause trouble for users.
-To handle possible unicode text data on a file, one should be able to
+To handle possible Unicode text data on a file, one should be able to
provide encoding parameters when opening a file. A file should by
default be opened for byte (or latin1) encoding, while the option to
open it for i.e. utf8 translation should be available.
Let's look at some examples:
-Example 1 - common byte-oriented reading
+Example 1 - common byte-oriented writing
........................................
A file is opened as usual with file\:open. We then want to write bytes
@@ -556,45 +556,45 @@
UTF-8 bytes written to the file as expected.
- Using io:put_chars, the io-server will return an error if any of the
- unicode characters sent are not possible to represent in one
+ Unicode characters sent are not possible to represent in one
byte. Characters representable in latin1 will however be written
nicely even though they might be encoded as UTF-8 in binaries sent
- to io:put_chars. As long ars the io_lib:format function is used
+ to io:put_chars. As long as the io_lib:format function is used
without the translation-modifier, everything will be valid latin1
and all return values will be lists, so it is both valid Unicode *and*
possible to write on a default file. Old code will function as
before, except when feeding io:put_chars with latin1 binaries, in
- that case the call should be exchanged for a file\:write call.
+ that case the call should be replaced with a file\:write call.
-Example 2 - Unicode-orinted writing
-...................................
+Example 2 - Unicode-oriented writing
+....................................
-A file is opened using a parameter telling that unicode data should be
+A file is opened using a parameter telling that Unicode data should be
written in a defined encoding, in this case we'll select UTF-16/bigendian to
-avoid mixups with the native UTF-8 encoding. We open the file with
+avoid mix-ups with the native UTF-8 encoding. We open the file with
file\:open(Name,[write,{encoding,utf16,bigendian}]).
- Using file\:write with iodata(), the io-protocol will convert into
- the default unicode representation (UTF-8) and send the data to the
+ the default Unicode representation (UTF-8) and send the data to the
io-server, which will in turn convert the data to UTF-16 and put it
on the file. The file is to be regarded as a text file and all
iodata() sent to it will be regarded as text.
-- If the data is already in unicode representation (say UTF-8) it
+- If the data is already in Unicode representation (say UTF-8) it
should not be written to this type of file using file\:write,
io:put_chars is expected to be used (which is not a problem as
- unicode data should not exist in old code and this is only a problem
+ Unicode data should not exist in old code and this is only a problem
when the file is opened to translate).
- If the data is in the Erlang default Unicode format, it can be
written to the file using io:put_chars. This works for all types of
lists with integers and for binaries in UTF-8, for other
representations (most notably latin1 in binaries) the data should be
- converted using unicode:characters_to_XXX(Data,latin1) prior to
+ converted using unicode:characters_to_XXX(Data,latin1) prior to
sending. For latin1 mixed lists (iodata()), file\:write can also be
used directly.
-To sum up this case - unicode strings (including latin1 lists) are
+To sum up this case - Unicode strings (including latin1 lists) are
written to a converting file using io:put_chars, but pure iodata() can
also be implicitly converted to the encoding by using file\:write.
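Example 2 could be sketched as below. Two assumptions are made for
concreteness: the open-option is written as a nested tuple
{encoding,{utf16,big}} rather than the draft's
{encoding,utf16,bigendian}, and the file name is made up:

```erlang
-module(utf16_write_sketch).
-export([demo/0]).

%% Sketch of example 2: writing through a translating (UTF-16/big)
%% file. "demo.txt" is a hypothetical file name.
demo() ->
    {ok, F} = file:open("demo.txt", [write, {encoding, {utf16, big}}]),
    %% A Unicode string (code points > 255) goes through io:put_chars;
    %% the io-server converts it to UTF-16/big on the file:
    ok = io:put_chars(F, [1050,1072,1082,1074,1086]),   %% "Какво"
    %% The draft additionally lets plain iodata() go through
    %% file:write, implicitly converted to the file's encoding.
    file:close(F).
```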
@@ -605,8 +605,8 @@
together with io:put_chars.
- Data formatted with io_lib:format can still be written to a raw file
- using file\:write. The data will end up beeing written as is. If the
- translation modifier is consequently used when formatting, the file
+ using file\:write. The data will end up being written as is. If the
+ translation modifier is consistently used when formatting, the file
will get the native UTF-8 encoding, if no translation modifiers are
used, the file will have latin1 encoding (each character in the list
returned from io_lib:format will be representable as a latin1
@@ -623,44 +623,44 @@
writing.
- file\:read on any file will expect the io-protocol to deliver data as
- Unicode. Each byte will be converted to unicode by the io_server and
+ Unicode. Each byte will be converted to Unicode by the io_server and
turned back to a byte by file\:read
-- If the file actually contains unicode characters, they will bytewise
- be converted to unicode and then bach, giving file\:read the
+- If the file actually contains Unicode characters, they will be byte-wise
+ converted to Unicode and then back, giving file\:read the
original encoding. If read as (or converted to) binaries they can
- then easilly be converted back to the Erlang default representation
+ then easily be converted back to the Erlang default representation
by means of the conversion routines.
- If the file is read with io:get_chars, all characters will be
returned in a list as expected. All characters will be latin1, but
- that is a subset of unicode and there will be no difference to
- reading a translating file. If the file however contains unicode
+ that is a subset of Unicode and there will be no difference to
+ reading a translating file. If the file however contains Unicode
converted characters and is read in this way, the return value from
io:get_chars will be hard to interpret, but that is to be
expected. If such a functionality is desired, the list can be
converted to a binary with list_to_binary and then explored as a
- unicode entity in the encoding the file actually has.
+ Unicode entity in the encoding the file actually has.
Example 5 - Unicode file reading
................................
-As when writing, reading unicode converting files is best done with
+As when writing, reading Unicode converting files is best done with
the io-module. Let's once again assume UTF-16 on the file.
- When reading using file\:read, the UTF-16 data will be converted into
- a unicode representation native to Erlang and sent to the
+ a Unicode representation native to Erlang and sent to the
client. If the client is using file\:read, it will translate the data
- back to bytes in the same way as bytes where translated to unicode
+ back to bytes in the same way as bytes were translated to Unicode
for the protocol when writing. Is everything representable as bytes,
- the function will succeed, but if any unicode character larger than
+ the function will succeed, but if any Unicode character larger than
255 is present, the function will fail with a decoding error.
-- Unicode data in the range over codepoint 255 can not be retrieved by
+- Unicode data above code point 255 cannot be retrieved by
use of the file-module. The io-module should be used instead.
-- io:get_chars and io:get_line will work on the unicode data provided
- by the io-protocol. All unicode returns will be as unicode lists as
+- io:get_chars and io:get_line will work on the Unicode data provided
+ by the io-protocol. All Unicode returns will be as Unicode lists as
expected. The fread function will return UTF-8 encoded binaries only
when the translation modifier is supplied.
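Reading such a file could be sketched as follows (same assumed
nested-tuple option form as in the writing sketch; the file is
assumed to hold UTF-16/big text):

```erlang
-module(utf16_read_sketch).
-export([demo/0]).

%% Sketch of example 5: reading a translating (UTF-16/big) file with
%% the io-module. "demo.txt" is a hypothetical file name.
demo() ->
    {ok, F} = file:open("demo.txt", [read, {encoding, {utf16, big}}]),
    %% io:get_line works on the Unicode data delivered by the
    %% io-protocol and returns a plain Unicode list:
    Line = io:get_line(F, ''),
    ok = file:close(F),
    %% Any code point > 255 in Line would have made a corresponding
    %% file:read call fail with a decoding error (see above).
    Line.
```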
@@ -668,8 +668,8 @@
.......................
As with writing, only the file module can be used and only byte
-oriented data is read. If encoded, The encoding will remain when
-reading and writing.
+oriented data is read. If encoded, the encoding will remain when
+reading and writing raw files.
Conclusions from the examples
@@ -678,17 +678,17 @@
With this solution, the file module is consistent with latin1
io_servers (aka common files) and raw files. A file type, a translating
file, is added for the io-module to be able to get implicit conversion
-of it's Unicode data (another example of such an io_server with
+of its Unicode data (another example of such an io_server with
implicit conversion would of course be the
terminal). Interface-wise,common files behave as before and we only
get added functionality.
-The downsides are the subtly changed behaviour of io:put_chars and the
+The downsides are the subtly changed behavior of io:put_chars and the
performance impact by the conversion to and from Unicode
representations when using the file module on non-raw files with
default (latin1/byte) encoding. The latter may be possible to change
by extending the io-protocol to tag whole chunks of data as bytes
-(latin1) or unicode, but using raw files for writing large amounts of
+(latin1) or Unicode, but using raw files for writing large amounts of
data is often the better solution in those cases.
@@ -706,9 +706,9 @@
---------------------------------------
I also suggest a module 'unicode', containing functions for
-converting between representations of unicode. The default format for
+converting between representations of Unicode. The default format for
all functions should be utf8 in binaries to point out this as the
-preferred internal representation of unicode characters in binaries.
+preferred internal representation of Unicode characters in binaries.
The two main conversion functions should be characters_to_binary/3 and
characters_to_list/2 as described above.
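A short sketch of the two main conversions with utf8 in binaries as
the default format (names and semantics as specified earlier in this
draft):

```erlang
-module(unicode_mod_sketch).
-export([demo/0]).

%% Sketch of the two main conversion functions in the suggested
%% 'unicode' module, round-tripping both latin1 and wider code points.
demo() ->
    %% latin1 list -> UTF-8 binary and back:
    Bin = unicode:characters_to_binary("hello", latin1),
    "hello" = unicode:characters_to_list(Bin, utf8),
    %% Round-trip of code points above latin1 ("Яко"); 'unicode' is
    %% the alias for the default utf8 encoding:
    L = [1071,1082,1086],
    L = unicode:characters_to_list(
          unicode:characters_to_binary(L, unicode), utf8),
    ok.
```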
@@ -731,7 +731,7 @@
UTF-32 will need to be supported in a similar way as UTF-16, both for
completeness and for the range-checking that will be involved when
-converting unicode characters.
+converting Unicode characters.
Formatting
----------
@@ -743,10 +743,10 @@
functionality for code not using the translation modifier, but will
return UTF-8 binaries when ordered to.
-The fread function should in the same way accept unicode data only
+The fread function should in the same way accept Unicode data only
when the "t" modifier is used.
-The io-protocol need to be changed o always handle Unicode characters.
+The io-protocol needs to be changed to always handle Unicode characters.
Options given when opening a file will allow for implicit conversion of
text files.