[erlang-questions] Erlang 3000?

Richard O'Keefe ok@REDACTED
Thu Nov 20 03:49:06 CET 2008


On 20 Nov 2008, at 3:04 am, Johnny Billquist wrote:
> And I don't agree. You are mixing semantics with syntax, in my mind
> (syntax is probably not the right word here, but I'm no typographer
> so I don't know the correct term, but I hope you understand what I mean).
> There is no uppercase version of ß, so it can't be converted to uppercase.
> The fact that you write SS instead of ß when you want it in uppercase
> doesn't mean that it's the same letter, just that it has the same meaning.

I think you are asking the wrong question.
The question I would ask is "If I had a string with a sharp s in it
and asked for the upper case version, what would I expect to get?"
I put it to you that anyone asking for a string to be converted to
entirely upper case would be rather upset to find a lower case letter
in the result.
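To make the point concrete: Python 3's str.upper is one implementation
that follows the full Unicode case-mapping rules (SpecialCasing) rather
than a 1-1 character substitution, and it behaves exactly as argued above:

```python
# Full Unicode uppercasing of sharp s: one character becomes two,
# and no lower case letter survives in the result.
print("ß".upper())       # "SS"
print("straße".upper())  # "STRASSE"

# Note that the mapping is not 1-1: the string gets longer.
print(len("ß"), len("ß".upper()))  # 1 2
```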

The whole idea of a 1-1 lower-upper case mapping is historically
absurd.  Even in English and French, we had the distinction between
long s and final s in lower case (but not upper case), where it was
impossible to convert a capital S to lower case without knowing
what its context was.

> Conversion of a string to uppercase can be regarded in two ways. Either
> you replace each character with its uppercase version, and characters
> that don't have an uppercase version you leave be.

This is not the case conversion process needed for centuries' worth
of European writing in Latin scripts, which is unabashedly contextual.
It's purely and simply a computer-oriented HACK that worked in ASCII
and some versions of EBCDIC simply because they had very few letters.
ASCII didn't even provide a full repertoire for English.  It's not
something that was much wanted for its own sake.
>
> Or you can try to convert the string as such to an uppercase version,
> where some letters might need to be replaced by sequences of other
> characters.

It's like comparison:  human-oriented text comparison is *not*
a lexicographical extension of character ordering, not even in
English.  (The way that strcmp() puts 'Boxers' before 'artists'
has never made much sense to people not intimately familiar with
ASCII.)
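The strcmp() behaviour is easy to reproduce in any language that compares
strings by raw code points; a quick sketch in Python:

```python
# All ASCII upper case letters sort before all lower case ones,
# because of where they sit in the code chart:
print(ord('B'), ord('a'))  # 66 97

# So a raw code-point sort puts 'Boxers' before 'artists':
print(sorted(["artists", "Boxers"]))  # ['Boxers', 'artists']
```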

It's always advisable to distinguish between operations performed
for the *computer's* benefit (like sorting a bunch of strings in
order to eliminate duplicates or build an index) and operations
performed for some *human's* benefit (like sorting entries in a
telephone directory).  The second kind is much much harder.
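A minimal illustration of the two kinds of sorting, using raw code-point
order for the computer-oriented one and a simple case-insensitive key as
a crude stand-in for human-oriented collation (real collation, per the
Unicode Collation Algorithm or a locale's rules, is harder still):

```python
words = ["Boxers", "artists", "Artists", "boxers"]

# Computer-oriented: raw code-point order, fine for deduplication
# or building an index, but surprising to humans.
print(sorted(words))  # ['Artists', 'Boxers', 'artists', 'boxers']

# Closer to human expectations: case-insensitive primary key.
# Python's sort is stable, so equal keys keep their input order.
print(sorted(words, key=str.casefold))
# ['artists', 'Artists', 'Boxers', 'boxers']
```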

> I personally am usually satisfied with the former, but I guess that's
> anyone's choice.
>
> And I also believe that this is one of the more serious flaws of
> Unicode. It mixes semantics with syntax. So you have, for instance,
> several A-ring characters, for use in different types of contexts, but
> that is all artificial and unfortunate.
> It's like in the old days, when you had several different minus signs
> on punched cards, for different uses.

Eh?  I did a lot of keypunching in the old days, using both 026 and 029
keypunches.  Even a few cards with a hand punch.  Oh yeah, and some
96-column cards; I don't remember what that punch was called.  I assure
you, each of them had exactly one minus sign.  Which punched cards are
you talking about?

Bear in mind that early punched card codes had at most 64
characters, and that the 026 and 029 keypunches didn't even
have lower case letters.  PL/I (designed in the 60s) even got
by with just 48 characters.  The BCL 6-bit character set used
by Burroughs machines had only one hyphen-minus character and
that was not unusual.  Heck, into the 90s people using COBOL
in IBM shops sometimes didn't even have semicolons on their
print chains.  (Yes, I really did have to mark a large student
assignment written in C but printed on just such a printer, in
the early 90s.)

> Hmm, looking at Unicode, I can see
> that they have reintroduced this ambiguity. You have hyphen-minus
> (U+002D), hyphen (U+2010) and minus (U+2212), and you also have a
> number of different dashes.
> Try to figure out which one you want when you are writing.

As a long-standing TeX user, I find it couldn't be easier:
The thing that links the parts of multi-element words is a hyphen.
It's short.
The thing that links the ends of a range like a--z or 1--10 is an
en-dash.  It's longer.  (It's Option-Minus on a Mac.)
The thing that is sort of like an extra-fat comma---used for example
in parentheses like this one---is an em-dash.  It's longer still.
(It's Option-Shift-Minus on a Mac.)
The thing you use in a mathematical formula is a minus sign
and in TeX you just use the minus key.  But it's clearly distinct
from the other three.
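For anyone unsure which character is which, the code points mentioned in
this thread can be inspected with, for example, Python's unicodedata module:

```python
import unicodedata

# Official Unicode names of the hyphen/dash/minus family:
for ch in "\u002D\u2010\u2013\u2014\u2212":
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")
# U+002D  HYPHEN-MINUS
# U+2010  HYPHEN
# U+2013  EN DASH
# U+2014  EM DASH
# U+2212  MINUS SIGN
```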

It looks really horrible to see people using a hyphen when they
mean a dash.
>

> (According to one myth this "problem" actually caused the Mariner 1 to
> fail and self-destruct, since the poor Fortran programmer had used a
> hyphen instead of a minus for a constant. Not sure if it's true or not,
> and the web doesn't give a sure answer.)

That's not only a mistake, it's not even the RIGHT mistake.
This is supposed to be a quote from the official report:

    NASA-JPL-USAF Mariner R-1 Post-Flight Review Board determined that
    the omission of a hyphen in coded computer instructions transmitted
    incorrect guidance signals to Mariner spacecraft boosted by two-stage
    Atlas-Agena from Cape Canaveral on July 21.  Omission of hyphen in
    data editing caused computer to swing automatically into a series of
    unnecessary course correction signals which threw spacecraft off
    course so that it had to be destroyed.

It wasn't using a hyphen instead of a minus, it was OMITTING a
hyphen that caused the problem.  (It took me 3 minutes to find that.)

> There are more examples like this, where Unicode messes things up
> because it mixes the visual impression of a character with the semantic
> meaning of the character.)

Well, no, not really.  The root problem is the political requirement
for round-trip compatibility with most previous national and
international character set standards.  Unicode had to be bug-compatible.
>
>
> And when I learned German in school many years ago, I was taught that ß
> was more or less the equivalent of sz. :-)

It was indeed an s-z ligature, but it stands for "ss" not "sz".
(Other languages may of course use it for other purposes, which
just adds to the "fun".)
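Unicode's case folding (the caseless-matching operation, which is
distinct from uppercasing) likewise maps ß to "ss"; a quick check:

```python
# Case folding maps sharp s to "ss", so the folded forms of
# "Straße" and "STRASSE" compare equal:
print("ß".casefold())                               # "ss"
print("Straße".casefold() == "STRASSE".casefold())  # True
```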
