[erlang-questions] Vector instructions

Richard A. O'Keefe ok@REDACTED
Tue Apr 8 04:40:23 CEST 2008


On 7 Apr 2008, at 12:34 pm, Andrew Whitehead wrote:

> Hiya,
>
> Although I don't think you are really suggesting this,

I am not only "not REALLY" suggesting it, I am not suggesting it
to any degree whatsoever.

> the fact that few languages have yet implemented Unicode properly is  
> of course no excuse to ignore the problem.

Nor am I, nor could any competent reader for one moment imagine that I
might be, suggesting that it is.

On the contrary, I have been aware of Unicode issues since Unicode 1.0
came out and have made some specific proposals to the Prolog community
about handling Unicode in Prolog.

What I *AM* saying is that I am fed up to the back teeth with people who
DON'T seem to have thought through the issues thoroughly suggesting that
the designers of a tiny language designed for shipping BYTES around the
net are somehow culpable for not coming up with a string design that has
so far completely eluded the vastly larger and better funded C++, Java,
Javascript, &c communities.  For yet another comparison, look at
SmartEiffel:  CHARACTER is 8 bits, STRING is 8 bits, UNICODE_STRING
elements are INTEGER_32 values, not characters, and Erlang already has
more things you can do with a Unicode string.

What I *AM* suggesting is that instead of slagging off Joe Armstrong &
Co, people who want good Unicode support in Erlang should start writing
code and/or EEPs and get it done themselves.

> Using integer lists in place of explicitly marked strings was never  
> a good design decision.

This is flat-out wrong.
Remember Alan Perlis's "Epigrams in Programming", number 9:
	It is better to have 100 functions operate on one data structure
	than 10 functions on 10 data structures.

> It makes it impossible for functions to distinguish between strings  
> and other lists without walking through every element.

This has nothing to do with the choice of lists as such and everything
to do with the fact that Erlang is dynamically typed.  In particular,
this turned out to be one of the STRENGTHS of Erlang string support:
iolists.  We can build up complex texts using O(1) concatenation and
then flatten in linear time to something flat if we want it.  Now with
Unicode, iolists turn out to be worth even more than just efficient
concatenation.
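
A minimal sketch of the build-then-flatten pattern, using only BIFs
that are in every Erlang release.  The module and function names are
made up for illustration, and iolist_to_binary/1 accepts only bytes
(0..255), so text with larger code points would be flattened with
lists:flatten/1 instead.

```erlang
-module(iolist_demo).
-export([greeting/1, flat/1]).

%% Build a text by nesting, not copying: wrapping existing pieces in a
%% new list costs the same no matter how long the pieces already are.
%% Strings, single characters and binaries can be mixed freely.
greeting(Name) ->
    Line = ["Hello, ", Name, $!],
    [Line, $\n, <<"Goodbye, ">>, Name, $.].

%% Flatten once, in time linear in the total size, and only when a
%% flat result is actually needed.  Assumes every element is a byte.
flat(IoList) ->
    binary_to_list(iolist_to_binary(IoList)).
```

Calling iolist_demo:flat(iolist_demo:greeting("Joe")) gives the flat
string "Hello, Joe!\nGoodbye, Joe."; most output functions will take
the nested form directly, so the flattening step is often unnecessary.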

Think of a "string" as a sequence of "characters".
Now what is a "character"?  (From the viewpoint of a user.)
It is *not* the same thing as a Unicode code point.
The thing that the user expects to step over with a single
Ctrl-F or Right Arrow keypress may well be represented by
several Unicode code-points.

Using lists means that we have a choice: we can represent a text as a
list of Unicode codepoints, but we can ALSO represent a text as a list
of things-the-user-thinks-of-as-characters, where each element might be
a single code-point, or it might be a list of code-points.  This second
representation is more convenient for stepping over; far more
convenient.  But it is also easy to flatten when we've a need to.
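
To make the two representations concrete, here is a small sketch.  The
module name, the function names and the sample text are all invented
for the example; it assumes the text has already been split into
user-level characters by some other means.

```erlang
-module(cluster_demo).
-export([text/0, user_length/1, code_points/1]).

%% "nee" with an acute accent on the first e, spelled with a combining
%% mark: e followed by U+0301 is one character to the user but two
%% code points, so it is kept together as a sublist.
text() ->
    [$n, [$e, 16#0301], $e].

%% Stepping over user-level characters is just stepping over the
%% top-level list.
user_length(Text) ->
    length(Text).

%% Flattening recovers the plain list of code points.
code_points(Text) ->
    lists:flatten(Text).
```

Here user_length/1 counts 3 elements while code_points/1 returns a
flat list of 4 integers.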

> Strings can be encapsulated of course, but this is just a workaround
> for the fact that string constants are completely undistinguished,
> and it means you have to unwrap them before using any of the
> built-in string functions.

This is in no way different from other data types in Erlang.
When does {1,1} represent the pixel near the top left corner of a
window, when does it represent the rational number 1 = 1/1, when does
it represent a latitude/longitude pair, when does it represent the
complex number 1+i, when does it represent "board 0, connector 0", &c?

In *all* cases involving Erlang data, you as programmer have to KNOW
what you are expecting.

I will buy this as an argument for compile-time types.


> How many times have developers had to explain why printed output is  
> showing up as a list of numbers?

Do you mean developers OF Erlang or developers IN Erlang?
If the first, the answer is "as many times as they have been asked by
people who didn't bother to read the documentation thoroughly".
If the second, the answer is "as many times as they have provided raw
Erlang terms to non-developers."

> How many hacks are there in the emulator in order to deal with  
> strings efficiently?

Very few.

> (I'm thinking of the binary representation of terms for example.)  
> Why can I tack atoms and floats and record values onto a list that  
> was created from a string constant, and then have io:write choke  
> halfway through printing it?

Because Erlang has no compile-time types and you were careless.
It's not a problem I've ever had in Prolog or Erlang.
>

> The only benefits of the current implementation are:
> - There's less work required of the language designer.

   + There is far less for the language USER to remember;
     there is only one incrementally-constructible sequence type
     and only one set of function names to remember.

If you don't think that's a problem, you've never found yourself unhappy
with Scheme because STRING-LENGTH and STRING-APPEND are different
functions from LENGTH and APPEND, and you've never found your program
"choke halfway" because you forgot to call STRING->LIST or LIST->STRING
at some point.  And you have never found yourself hopping mad because
there is some function that *is* available for lists but *isn't*
available for strings so you have to do
(list->string ... (string->list ...) ...) just to patch around this.
>
> - Since you can treat strings as lists, you get to reuse some of  
> your list processing functions. But really, the usefulness of this  
> is overrated. How often do you really need to reverse a string,

Fairly often, actually.  Tail recursive functions building a string
commonly end by reversing it.
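
A typical instance, as a sketch only (the module and function names
are hypothetical, and it handles ASCII letters only):

```erlang
-module(rev_demo).
-export([upcase/1]).

%% Upper-case the ASCII letters of a string, leaving every other
%% element untouched.
upcase(String) ->
    upcase(String, []).

%% The accumulator is built back to front, because adding at the head
%% of a list is the cheap operation.
upcase([C | Cs], Acc) when C >= $a, C =< $z ->
    upcase(Cs, [C - 32 | Acc]);
upcase([C | Cs], Acc) ->
    upcase(Cs, [C | Acc]);
upcase([], Acc) ->
    %% One reversal at the end puts the result in reading order.
    lists:reverse(Acc).
```

The same lists:reverse/1 that serves every other accumulator loop
serves this one; no string-specific reverse is needed.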

> or repeat it some number of times,

Fairly often, actually.  In fact strings are almost the only lists I do
that to.
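
With strings as lists this needs nothing beyond the ordinary list
functions.  A sketch, with a made-up module and function name:

```erlang
-module(repeat_demo).
-export([repeat/2]).

%% Repeat a string N times: duplicate the whole list N times, then
%% join the copies into one flat list.
repeat(String, N) when is_integer(N), N >= 0 ->
    lists:append(lists:duplicate(N, String)).
```

For example, repeat_demo:repeat("ab", 3) gives "ababab".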

> or calculate the maximum character value..

How else would you decide whether all the characters
  - are BMP (maximum <= 65535)
  - are Latin 1 (maximum <= 255)
  - are ASCII (maximum <= 127)?
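
The test above can be sketched in a few lines.  The module name, the
function name and the result atoms are invented for the example, and
it assumes a flat list of non-negative integers.

```erlang
-module(range_demo).
-export([classify/1]).

%% Report the narrowest range that holds every element of the string.
%% lists:max/1 is not defined for [], so the empty string is handled
%% separately.
classify([]) ->
    ascii;
classify(String) ->
    Max = lists:max(String),
    if
        Max =< 127   -> ascii;
        Max =< 255   -> latin1;
        Max =< 65535 -> bmp;
        true         -> beyond_bmp
    end.
```

So range_demo:classify("abc") is ascii, while a string containing
16#20AC (the euro sign) is bmp.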

> at least with string-specific versions of the functions you need it  
> wouldn't be possible to corrupt your string.

I've used Scheme enough to detest string-specific versions of functions.
Since lists are immutable, it is impossible right now to corrupt a string.
>
> - It's easier to iterate through characters, for a narrow definition  
> of character. A Unicode-aware implementation should let you iterate  
> through either code points or composed characters.

"composed character" is not in the Unicode glossary;
I suspect you mean "combining character sequence".

There is nothing about lists that prevents an Erlang unicode: library
offering the facility of iterating over combining character sequences.

>
>
> If Erlang strings were in fact lists of Unicode code points then the  
> situation might be more tenable, but they aren't,

Precisely BECAUSE Erlang strings are nothing other than lists of
numbers, there is nothing that stops you using Unicode code points in
"string" data.

We are agreed that Erlang needs good Unicode support.
We are agreed that at the very least any places that limit "strings"
to 8 bits should be relaxed.
I hope we are agreed that library support for Unicode-aware operations
can be at least prototyped without a special data type.

Larceny's string representations are of course tuned for a language
with mutable strings, which Erlang isn't.



