[erlang-questions] erlang-questions Digest, Vol 17, Issue 45

Mon Oct 13 04:28:47 CEST 2008

On 10 Oct 2008, at 6:24 pm, Edwin Fine wrote:

> On Thu, Oct 9, 2008 at 11:53 PM, Richard O'Keefe <ok@REDACTED>  
> wrote:
> On 10 Oct 2008, at 3:11 pm, Edwin Fine wrote:
> Ok, then to preserve the Principle Of Least Astonishment, let  
> string:split accept a regular expression, which is just a string  
> with special RE operators. If the string contains no RE operators,  
> use an optimized special case of split (like the one you wrote) that  
> does not use an RE engine. Get the best of both worlds.
>
> No, that *violates* the principle of least astonishment, Big Time!
>
> First, absolutely nothing else whatever in the 'string' module
> has anything to do with regular expressions.  This would be
> highly exceptional and very confusing.
>
> I disagree. Take for example the String classes of Ruby (http://www.ruby-doc.org/core/classes/String.html#M000818 
> ), JavaScript (http://www.w3schools.com/jsref/jsref_split.asp) and  
> Java (http://java.sun.com/j2se/1.4.2/docs/api/java/lang/String.html#split(java.lang.String) 
> ), all of which use split with an RE.

We *agree* that languages that have a 'split' function take
a regular expression as an argument.  If you will recall, that
was my point.

However, it is simply a fact that the 'string' module in
 >>>Erlang<<< has nothing connected with regular expressions in
it.  It used to, but they were removed.  FOR ERLANG, it would
now be astonishing to find just one function relating to
regular expressions in 'string'.

We *have* a split that uses a regular expression argument.
It's in the regular expression module, where it is NOT
surprising to find a regular expression used.

You wanted a function that was a converse of the 'join'
function.  Since split-with-an-RE is *not* a converse of
'join', it cannot be the function you were asking for.
A function of the kind you asked for *can* be written (and
can be more efficient than regexp:split/2), and to be
consistent with the Principle of Least Astonishment, it must
not be given the name 'split', because a converse of the
'join' function does not do what the 'split' function does.
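
To make that concrete, here is a minimal sketch of such a converse.
The module name unjoin_sketch is made up for illustration, and the
name unjoin/2 is only a placeholder; nothing like it is claimed to
exist in 'string'.  The separator is assumed to be a non-empty plain
string and is never interpreted as a regular expression.

```erlang
-module(unjoin_sketch).
-export([unjoin/2]).

%% unjoin(String, Sep) -> [Piece]
%% Cut String at every literal occurrence of the non-empty string Sep,
%% so that string:join(unjoin(String, Sep), Sep) gives back String.
unjoin(String, Sep = [_|_]) ->
    unjoin(String, Sep, length(Sep), [], []).

%% Piece is the current piece, reversed.
%% Acc is the list of finished pieces, reversed.
unjoin([], _Sep, _N, Piece, Acc) ->
    lists:reverse([lists:reverse(Piece) | Acc]);
unjoin(String = [C | Rest], Sep, N, Piece, Acc) ->
    case lists:prefix(Sep, String) of
        true ->
            %% Skip the separator and start a new piece.
            unjoin(lists:nthtail(N, String), Sep, N, [],
                   [lists:reverse(Piece) | Acc]);
        false ->
            unjoin(Rest, Sep, N, [C | Piece], Acc)
    end.
```

With this, unjoin_sketch:unjoin("a.b.c", ".") gives ["a","b","c"],
and the "." is the same literal you would hand to string:join/2.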

> I think it's a matter of taste or opinion. I don't think split  
> belongs in the regexp module.

It's about regular expressions.
There is therefore no better place for it.

Bear in mind that there is more than one syntax for
regular expressions around.  A posix_regexp:split/2 function
and a pcre:split/2 function would behave differently, and it
would be confusing to put them both in 'string'.

> The new re module correctly does not include it. I think  
> regexp:split should be deprecated, in fact, the entire regexp module  
> should be deprecated because enough people have used it and been  
> burned by its lack of performance (because it is written in Erlang  
> when it should have been written in C) that it warrants an entire  
> Caveat section in the Erlang Efficiency Guide.

Rewriting in C a regular expression module that works on
*LINKED LISTS* of character codes would be like painting
the Titanic.  If you are that worried about bulk string
handling, you are using binaries.

We *AGREE* that regular expressions implemented at the level of
C or better working on space-efficient strings would be a good
thing.

The lack of performance of the regexp module has very little to do
with the language it is written in, by the way.  On the contrary,
it has to do with the *design*.  There are two problems with
regexp:split/2.

(1) Presumably for the sake of 'convenience', the functions in
     the 'regexp' module that accept a regular expression accept
     EITHER a translated regular expression OR a string, which
     they first translate.  It is fatally easy to write
	regexp:split(Line, "[ \t]*,[ \t]*")
     instead of
	{ok,Separator} = regexp:parse("[ \t]*,[ \t]*"),
	...
	regexp:split(Line, Separator).

     I haven't measured how large the cost is, but it is certainly
     a trap for the unwary.  (In AWK, the argument is usually
     translated at compile time.)

(2) The internal representation is suboptimal.
     The regexp module translates regular expressions to trees,
     indeed, to ASTs.  At match time, it interprets these trees.
     Grep and Mawk translate to a kind of automaton, which amongst
     other things admits a number of special case hacks.

One thing that could be done fairly straightforwardly in a system
like Erlang would be to have a HiPRE -- a run-time compiler from
regexp ASTs to native code, which could be faster than a C
implementation.

By the way, the manual page for the 're' module starts out
"This is an experimental module and
  the interface is subject to change"
so to praise it for omitting 'split' is to go beyond what
the evidence will bear.
This is all the more so since the existing interface WILL
do 90% of what split/2 needs:
	- compile a regular expression
	- find all the matches of that expression in a binary
	- report matches as location pairs
To build split/2 on top of that, the hardest thing would be
slogging your way through the 're.3' manual page figuring out
the options.
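
As a rough illustration of those three steps, here is a sketch of a
split/2 over binaries.  The module name is hypothetical, and the
option names (global, {capture, first, index}) are the ones in the
're' documentation I am assuming; since the interface is declared
experimental they may differ in your release.  The sketch also
assumes the pattern never matches the empty string.

```erlang
-module(re_split_sketch).
-export([split/2]).

%% split(Subject, Pattern) -> [binary()]
%% Subject is a binary; Pattern is a regular expression in re's syntax.
split(Subject, Pattern) when is_binary(Subject) ->
    %% Step 1: compile the regular expression.
    {ok, MP} = re:compile(Pattern),
    %% Steps 2 and 3: find all matches, reported as {Offset, Length}.
    case re:run(Subject, MP, [global, {capture, first, index}]) of
        nomatch ->
            [Subject];
        {match, Matches} ->
            %% Each match is a one-element list [{Offset, Length}].
            pieces(Subject, 0, [Pos || [Pos] <- Matches])
    end.

%% Emit the text between consecutive matches, then the final tail.
pieces(Subject, From, []) ->
    [binary:part(Subject, From, byte_size(Subject) - From)];
pieces(Subject, From, [{Off, Len} | Rest]) ->
    [binary:part(Subject, From, Off - From)
     | pieces(Subject, Off + Len, Rest)].
```

A caller that splits many lines with the same pattern would hoist
the re:compile/1 call out of split/2 and pass the compiled pattern
in, for exactly the reason given under (1) above.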

> Third, if there were a string:split/2 that used regular
> expressions, that would make it very incompatible with
> string:join/2, which doesn't.  I thought we were wanting
>        string:join(string:split(String, Sep), Sep)
> to give back Sep.
>
> Did you mean to write "to give back String?"

Yes I did.
>
>
> Yes, that is what we want, and it is what will happen, because it is  
> meaningless to join using an RE, isn't it? You can't join things  
> together with a regular expression.

Exactly.  Which is why 'split' is *NOT* a converse of 'join'.

> So Sep would *have* to be a literal string.

A literal string Sep passed to join/2 is treated as a string.
A literal string Sep passed to split/2 is treated as a regular
expression, which is NOT consistent with its use in join/2.

> As it happens, split and join work exactly like this in other  
> languages and don't seem to overly confuse people.

Well, in AWK it doesn't confuse people because AWK doesn't
*have* a join/2 function.  In Perl, the first argument of
join/2 is normally written with quotation marks and the first
argument of split/2 is normally written with slashes, acting
as a visible reminder that it is not a string.

But Erlang does not have the /.../ syntax for regular expressions.
If you can find a language where strings and regular expressions
have identical literal syntax and where people are not confused
by split taking an RE and join taking a string, then you have a
case.  Only then.

> So we are stuck with contriving some name because of history  
> (lists:split should have been lists:split_as from Haskell, regexp  
> already has split).

> This, in spite of there being more than enough precedents amongst  
> other languages that use split as a method of their String classes.

Since Erlang doesn't have classes, so what?
It is you who I understand to be saying that "split" should have
a regular expression argument because that's what other languages
do.  Fine.  THAT'S WHAT WE HAVE IN ERLANG.  We have a 'split'
function that acts just like the split function in Perl (except for
the argument order).  If you want
  - a function called 'split' for splitting strings
	We agree about this!
  - that takes a regular expression argument
	We agree about this!
  - a function called 'join' that takes a string argument
	We agree about this!
  - a function X that is a converse of 'join'
	We agree about this!
THEN the function X should not be called 'split'.

It appears that we don't agree about this conclusion, and I
honestly cannot see why not.

Also, IF there were a unique best regular expression syntax
we would agree that 'split' should be in the 'string' module.
But there are several candidates on the table.
Perl-compatible regular expressions are very expressive, but
the expressiveness comes at a heavy price: efficiency.
POSIX-compatible regular expressions are less expressive,
but in return you get guaranteed linear time matching.
I believe that the regular expression language in HyTime is
comparable to POSIX ones.

While I hold no brief for Perl-style "regular expressions",
they are in wide use, so support is warranted.
Support for efficiently implementable regular expressions
is ALSO warranted, especially considering that there is a
free C implementation that *is* efficient.

So there is good reason to have at least two regular
expression modules.  (We agree that neither of them is the
existing 'regexp'.)  And that means there is good reason
for each to have its own 'split' function.

Let's drop the argument about whether the 'split' function

>
> By the way, I have a use for
>        string:join(reverse([Stuff|tail(reverse(
>        string:unjoin(Thing, ".")))]), ".")
> so I would be very unhappy to have to write one of
> those string literals as "\\." and the other NOT.
>
> That's a VERY good point. It's a glaring inconsistency. Other  
> languages deal with that by having a regular expression type and  
> special syntax (e.g. /abc/ or %r{abc}) to avoid confusion, and  
> (IIRC) if a string type is passed instead of an RE type, the  
> receiving method treats it as a non-RE.

Sadly, no.  Take this example.

	my $data = 'a,b,c';
	my @values = split(',', $data);
	print "$#values\n";
	foreach my $item (@values) {
	    print "$item\n";
	}

The output is
	2
	a
	b
	c

Now change the commas to dots:

	my $data = 'a.b.c';
	my @values = split('.', $data);
	# the rest as before

The output is
	-1

The string has to be changed to '\\.' to work as expected.
(Tested with Perl 5.8.8, tested on UltraSPARC II Solaris 2.10
and Intel Core 2 Duo MacOS 10.5.4.)  I have no idea what
other languages do, except for Squeak Smalltalk, whose analogue
of 'split' takes only a fun as an argument.
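
The same trap can be sketched in Erlang, assuming an OTP release
whose 're' module provides re:split/3 with the {return, list}
option.  The module name and the helper quote/1 are hypothetical,
written only for this illustration; quote/1 assumes an ASCII or
Latin-1 separator.

```erlang
-module(literal_sep_sketch).
-export([split_raw/2, split_literal/2, quote/1]).

%% Sep is handed to re as it stands, so "." means "any character".
split_raw(String, Sep) ->
    re:split(String, Sep, [{return, list}]).

%% Sep is escaped first, so it is matched as plain text.
split_literal(String, Sep) ->
    re:split(String, quote(Sep), [{return, list}]).

%% Hypothetical helper: put a backslash before every character that
%% is not an ASCII letter or digit; characters above 127 are left alone.
quote(Sep) ->
    lists:flatmap(fun quote_char/1, Sep).

quote_char(C) when C >= $a, C =< $z;
                   C >= $A, C =< $Z;
                   C >= $0, C =< $9;
                   C > 127 ->
    [C];
quote_char(C) ->
    [$\\, C].
```

Here split_literal("a.b.c", ".") yields ["a","b","c"], while
split_raw("a.b.c", ".") does not, because every character counts as
a separator.  Needing quote/1 at all is the inconsistency I am
complaining about.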

>  One would then be able to use the same literal in both join and  
> unjoin. Maybe Erlang should have a new regular expression type and  
> syntax, seeing as it is going to be used in more and more  
> applications that do heavy text processing.

We've had that debate before.  The position I took then was that
text patterns in a language like Erlang should be expressed as
trees, rather like Bigloo.  See
http://www-sop.inria.fr/mimosa/fp/Bigloo/doc/bigloo-12.html#Regular-parsing
This is vastly more convenient for programs that *construct* regular
expressions than the traditional AWK-style // stuff, and for tricky
regular expressions, it has been my experience that it is easier to
get right than the linear form.

> The fact that for example "add" and "subtract" are not related
> in that way really doesn't signify anything, just as the fact
> that there are many colours doesn't mean that black is a bad
> choice.
>
> But you wouldn't create a function named "unadd" when "subtract" is  
> a more acceptable usage - would you?

If there was *already* an *incompatible* function called 'subtract',
I might be driven to it.  Come to think of it, in Smalltalk the
opposite of 'add' is 'remove', not 'subtract'.

> Interesting. One of the words you wrote above gave me an idea. How  
> about meeting halfway? How about two new functions,  
> string:separate(String, Separator) and string:unseparate(List,  
> Separator)? No clash and it makes even more sense (to me) than split  
> and join.

Suits me.

'join' already has other meanings in computing, like |><|.