[erlang-questions] Erlang Syntax and "Patterns" (Again)
Richard A. O'Keefe
ok@REDACTED
Fri Mar 18 00:50:35 CET 2016
On 17/03/16 11:53 pm, Steve Davis wrote:
> > ROK said:
> > Yawn.
> (What am I doing trying to argue with ROK??? Am I MAD?)
>
> 1) Why is it people rant about "string handling" in Erlang?
Because it is not the same as Java.
>
> 2) Principle of least surprise:
> 1> [H|T] = [22,87,65,84,33].
> [22,87,65,84,33]
> 2> H.
> 22
> 3> T.
> "WAT!"
This is a legitimate complaint, but it confuses two things.
There is *STRING HANDLING*, which is fine, and
there is *LIST PRINTING*, which causes the confusion.
For comparison, DEC-10 Prolog, PDP-11 Prolog, C-Prolog, and Quintus Prolog
all did STRING HANDLING as lists of character codes, but
all did LIST PRINTING without ever converting lists of numbers to strings.
The answer was that there was a library procedure to print a list of
integers as a string and you could call that whenever you wanted to,
such as in a user-defined pretty-printing procedure. Here's a transcript
from SICStus Prolog:
| ?- write([65,66,67]).
[65,66,67]
yes
| ?- write("ABC").
[65,66,67]
yes
The heuristic used by the debugger in some Prologs was that a list of
integers between 32 and 126 inclusive was printed as a string; that
broke down with Latin 1, and broke harder with Unicode. The simple
behaviour mandated by the standard that lists of integers print as
lists of integers confuses people once, then they learn that string
quotes are an input notation, not an output notation, and if they want
string notation in output, they have to call a special procedure to get it.
The ISO Prolog committee introduced a horrible alternative which the
DEC-10 Prolog designers had experienced in some Lisp systems and
learned to hate: flip a switch and "ABC" is read as ['A','B','C']. The
principal reason given for that was that the output was semi-readable.
One of my arguments against it was that this required every Prolog
system to be able to hold 17*2**16 atoms, and I knew for a fact that
many would struggle to do so. The retort was "they must be changed
to make a special case for one-character atoms". Oh well, no silver
bullet.
That does serve as a reminder, though, that using [a,b,c] instead of
[$a,$b,$c] is *possible* in Erlang.
Just to repeat the basic point: the printing of (some) integer lists as
strings is SEPARABLE from the issue of how strings are represented and
processed; that could be changed without anything else in the language
changing.
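In present-day Erlang the two can already be pulled apart at the formatting
level. What follows is a minimal sketch, assuming a stock OTP installation
with the default printable range; the module and function names are made up
for illustration. The ~w control sequence never guesses, while ~p applies
the "looks printable" heuristic that the shell uses.

```erlang
-module(list_printing_sketch).
-export([as_list/1, as_guess/1]).

%% ~w never guesses: a list of integers is written as a list of integers.
as_list(Term) ->
    lists:flatten(io_lib:format("~w", [Term])).

%% ~p applies the printable-list heuristic, so a list of plausible
%% character codes comes out in string notation.
as_guess(Term) ->
    lists:flatten(io_lib:format("~p", [Term])).
```

With this, as_list("WAT!") yields the text [87,65,84,33] and
as_guess([87,65,84,33]) yields the text "WAT!" including the quotes.
The representation is the same list in both cases; only the printing differs.
In the shell itself, shell:strings(false) should switch the heuristic off.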
>
> 3) A codec should be perfectly reversible i.e. X = encode(decode(X)).
> Without tagging, merely parsing out a string as a list is not
> perfectly reversible.
Here you are making a demand that very few other programming languages
can support. For example, take JavaScript. "\u0041" is read as "A",
and you are not going to get "\u0041" back from "A". You're not even
going to get "\x41" back from it, even though "\x41" == "A".
Or take Erlang, where
1> 'foo bar'.
'foo bar'
2> 'foobar'.
foobar
with the same kind of thing happening in Prolog.
And of COURSE reading [1 /* one */, 2 /* deux */, 4 /* kvar */]
in JavaScript preserves the comments so that re-encoding the
data structure restores the input perfectly. </sarc>
Or for that matter consider floating point numbers, where
even the languages that produce the best possible conversions
cannot promise that encode(decode(x)) == x.
No, I'm sorry, this "perfectly reversible codec" requirement sets up
a standard that NO programming language I'm aware of satisfies.
It is, in fact, a straw man. What you *can* ask, and what some
language designers and implementers strive to give you, is
decode(encode(decode(x))) == decode(x).
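As a rough illustration of that weaker property in Erlang terms, here is a
sketch in which decode/1 reads one term from source text and encode/1 writes
it back with ~w. The module name and the choice of erl_scan, erl_parse and
~w as "the codec" are assumptions made for this example, not anything the
thread prescribes.

```erlang
-module(roundtrip_sketch).
-export([decode/1, encode/1, stable/1]).

%% Read a single Erlang term from its source text (no trailing dot needed).
decode(Text) ->
    {ok, Tokens, _End} = erl_scan:string(Text ++ "."),
    {ok, Term} = erl_parse:parse_term(Tokens),
    Term.

%% Write a term back out as text, without the string-printing heuristic.
encode(Term) ->
    lists:flatten(io_lib:format("~w", [Term])).

%% The property worth asking for: decode(encode(decode(X))) =:= decode(X).
stable(Text) ->
    Term = decode(Text),
    decode(encode(Term)) =:= Term.
```

Here encode(decode("'foobar'")) gives foobar without the quotes, and
encode(decode("16#41")) gives 65, so the text does not survive; but
stable/1 returns true for both, because the decoded term does.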
But to repeat the point made earlier, the way that lists of plausible
character codes are printed is SEPARABLE from the way strings are
represented and handled, and in an ancestral language is SEPARATE.
>
> 4) What is the right way to implement the function is_string(List)
> correctly?
>
> *ducks*
That really is a "have you stopped beating your wife, answer yes or no"
sort of question.
It depends on the semantics you *want* it to have. The Quintus
library didn't provide any such predicate, but it did provide
plausible_chars(Term)
when Term is a sequence of integers satisfying
is_graphic(C) or is_space(C),
possibly ending with a tail that is a variable or
a variable bound by numbervars/3.
Notice the careful choice of name: not IS (certainly) a string,
but is a PLAUSIBLE list of characters.
It was good enough for paying customers to be happy with the
module it was part of (which was the one offering the
non-usual portray_chars(Term) command).
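A rough Erlang analogue might look like the sketch below. It is not the
Quintus code: the ASCII-only ranges in is_graphic/1 and is_space/1, the
decision to accept the empty list, and the module name are all assumptions
made for illustration. Erlang has no unbound tails, so that part of the
Quintus definition simply disappears.

```erlang
-module(plausible_sketch).
-export([plausible_chars/1]).

%% true when Term is a proper list whose elements all look like
%% printable or whitespace character codes; false for anything else.
plausible_chars([]) ->
    true;
plausible_chars([C | T]) when is_integer(C) ->
    (is_graphic(C) orelse is_space(C)) andalso plausible_chars(T);
plausible_chars(_) ->
    false.

%% Assumption: ASCII only. Visible characters are 33..126.
is_graphic(C) -> C >= 33 andalso C =< 126.

%% Assumption: blank, plus the control characters TAB through CR.
is_space(C) -> C =:= 32 orelse (C >= 9 andalso C =< 13).
```

plausible_chars("WAT!") is true and plausible_chars([22,87,65,84,33]) is
false, because 22 is neither graphic nor space. OTP's own
io_lib:printable_list/1 plays a similar role, with a wider notion of
"printable".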
One of the representations Quintus used for strings (again, a
library feature, not a core language feature) was in Erlang
notation {external_string,FileName,Offset,Length}, an idea
that struck the customer I developed it for as a great
innovation, when I'd simply stolen it from Smalltalk!
The thing is that STRINGS ARE WRONG for most things,
however represented. For example, when Java changed
the representation of String so that slicing became a
costly operation, I laughed, because I had my own representation
of strings that provided O(1) concatenation as well as cheap
slicing. (Think Erlang iolists and you won't be far wrong.)
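For anyone who has not met iolists, a minimal sketch of the idea follows;
the module and function names are invented for the example. Concatenation
builds one new two-element list around the existing pieces and copies
nothing; the flattening cost is paid once, at the end, if it is paid at all.

```erlang
-module(iolist_sketch).
-export([concat/2, to_binary/1]).

%% O(1) concatenation: wrap the two pieces, copy neither.
concat(A, B) ->
    [A, B].

%% Flatten only when a contiguous result is really needed.
%% Integers inside an iolist must be bytes (0..255).
to_binary(IoList) ->
    iolist_to_binary(IoList).
```

For example, to_binary(concat("abc", [<<"def">>, $g])) gives <<"abcdefg">>,
and iolist_size/1 reports 7 for the same nested structure without
flattening it first.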
The Pop2 language developed and used at Edinburgh
represented file names as lists, e.g., [/dev/null] was in
Erlang notation ['/',dev,'/',null]. This made file name
manipulation easier than representing them as strings.
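Erlang's filename module offers something in the same spirit: split a path
into a list of components, work on the list, and join it again. The sketch
below assumes a non-empty path; replace_last/2 and the module name are
hypothetical helpers for this example.

```erlang
-module(path_sketch).
-export([replace_last/2]).

%% Swap the final component of a non-empty path by working on the
%% list of components rather than on the characters of the string.
replace_last(Path, NewName) ->
    Parts = filename:split(Path),
    filename:join(lists:droplast(Parts) ++ [NewName]).
```

filename:split("/dev/null") gives ["/","dev","null"], so
replace_last("/dev/null", "zero") gives "/dev/zero" with no searching for
separators by hand.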
Any time there is internal structure, any time there is scope
for sharing substructure, any time you need to process
the parts of a string, strings are wrong.
The PERL lesson is that regular expressions are a fantastic
tool for doing the wrong thing quite simply.