[erlang-questions] [enhancement] string:split/2

Fri Oct 10 07:00:58 CEST 2008

On 10 Oct 2008, at 3:32 pm, Edwin Fine wrote:
> My background and experience has led me to believe that primitive  
> operations, or functions that are likely to be heavily used, should  
> be implemented in a low-level language like C, and not in the higher- 
> level language.

This is why, until Intergraph took it over, Quintus was the
peer of Prologs.

We didn't believe that.  We ate our own dog food.
If some aspect of our Prolog system was so slow that
we were tempted to rewrite it in C, we said "oh no,
our customers will find it too slow too" and improved
our system.

Some other companies said "we must have a fast compiler,
so we'll write our compiler in C."  That meant their customers
found performance problems in generated code before they did.
OOPS.

Whether you call it split/2 or unjoin/2, it ISN'T a primitive
operation.  It's very easy to define in terms of other operations.
And it HASN'T been heavily used, or it would already be there.

> I consider string:split to be something that will be heavily used in  
> text-processing applications (which will also become heavily used, I  
> predict).

I've done a lot of splitting in my time, and I must say that I
have never found it to be the performance bottleneck.  There's one
thing I've learned over the years, and that is never to think I
know where the time will go.  Always measure.  I've been progamming
since the early 70s, and I am *still* surprised every time.

> Since string operations really need to be efficient,

Do they?  Remind me never to use Java or Perl then.
They need to be efficient *enough* not to be the bottleneck.
For some applications, they might be.  For some, not.
Strings are almost always the wrong tool for the job.

> I wrongly assumed that the string operations were written in C and  
> that the equivalent Erlang code would not be nearly as efficient.

The 'string' library deals with strings represented as lists of
integer character codes.  C code _could_ be faster, but it would
be hell to write.  First you get your code working, then you
measure it on real applications, then you improve your compiler,
then you measure again, and only afterwards do you even think about
putting things in C.  And when you do, you think in terms of
better building blocks, not in terms of doing the top level
operation directly.

Now this is what has already been happening slowly in Erlang.
List length, reverse, and member have all migrated to C.  They
turn out to be very common building blocks.

In 'string', instead of hacking unjoin/2, it would be best to
hack the version of 'prefix' that I sent (which either returns
the list after the prefix or the atom 'no').  That version can
do the job of the existing 'prefix' code in 'string', but not
conversely.

> I'd rather ensure that the function is implemented efficiently to  
> start off with, and not require HiPE to be used to get performance.

What is the point of worrying about the efficiency of a function
when we don't even have agreement on its name?  Get it right *first*.

Talking about "not requir[ing] HiPE" is like saying
"I demand that your C compiler produce efficient code
without optimisation."  Setting compiler options routinely
produces more than a factor of 2 in C speed these days.
(Even with the Intel C compiler, I found changing from
its default options gave me a factor of 2.)

Until you have *measured* the actual performance of my
implementation of unjoin/2 in a real program working on
real data, you have no reason to complain of its efficiency.

> One of these nights when I have some spare time, I am going to try  
> your code out (with and without HiPE) against a 'C'  or C++ version  
> of split and join so that I can have some hard data to tell for sure  
> one way or the other.

What counts is the performance of the *whole* program.

What might be more important in a complete program would be
using binaries instead of strings.  That could usefully go in C.

Let's take a look at a use of the 'string' module.

extract_sequences(Fmt, Need0) ->
     case string:chr(Fmt, $~) of
         0 -> {ok,lists:reverse(Need0)};         %That's it
         Pos ->
             Fmt1 = string:substr(Fmt, Pos+1),   %Skip ~
             case extract_sequence(1, Fmt1, Need0) of
                 {ok,Need1,Rest} -> extract_sequences(Rest, Need1);
                 Error -> Error
             end
     end.

The problem here is that whoever did this was thinking in a
very C-like way.  Suppose we had a different version of
member/2, one which like its Lisp original returned a
list: [] means not found, [X|_] means "found here".

	memq(X, [H|T]) when H =/= X -> memq(X, T);
	memq(_, L) -> L.

Then the extract_sequences/2 function could be

extract_sequences(Fmt, Need0) ->
     case memq($~, Fmt)
       of [] ->
	     {ok, lists:reverse(Need0)}
        ; [_|Fmt1] ->
	     case extract_sequence(1, Fmt1, Need0)
	       of {ok,Need1,Rest} ->
		      extract_sequences(Rest, Need1)
		; Error ->
		      Error
	     end
     end.

The original version traverses a list twice.
The revised version traverses it once.
In this case, coding string:substr/2 would have been
the wrong way to get performance; the best way is to
eliminate it entirely.

Recently, I have been saying to myself "Suppose we had a
rigid separation between guards and expressions, suppose
we tried to completely do without 'false' and 'true' in
expressions.  What then?"  The answer so far is "my code
is the better for it.  Every time."  As far as I am
concerned, the effort spent coding member/2 in C was
effort spent on the *wrong* version of that concept.
>
> This is how I see it: even given the increasingly fast computer  
> architectures of today, one should do one's best not to waste  
> processor time when performance is desired.

Indeed.  But a programmer's primary job is to deliver
working code on time and within budget.  Time spent
optimising functions that don't actually matter very
much to the overall performance of the complete system
is time wasted.  Writing things in a low level language
like C takes much more effort than writing them in a
high level language.  So you code as much as you can in
your high level language and play around with it until the
design settles, then you benchmark on real data, and when
you know where the bottlenecks are, do you then recode in
C?  No, you try first to redesign the slow stuff out.
(Like the extract_sequences/2 example above.)

> I have seen users of interpreted, byte-code interpreted, and  
> threaded-interpreted languages implement computationally expensive,  
> heavily used functions in the language instead of in a low-level C  
> library. In almost all cases where the code was not I/O bound, there  
> was a significant penalty for writing such functions. Rewriting  
> those functions in C/C++ often gave one or two orders of magnitude  
> improvement.

Fine.  But you have to know which ones.  Which means you have
to have a sufficiently complete sufficiently working system to
run real benchmarks to find out.

I've implemented enough computationally intensive heavily used
functions in my time to have learned that writing them first in
the highest level language available can be a huge time saver
in getting them right.

Up until a year ago I had a PhD student working on data mining
who decided to do all her coding in Python.  That was the right
decision for her.  The code ran fast *enough*, and there wasn't
any point optimising an algorithm when the point of her research
was to find out what the algorithm should be.  I've just finished
co-supervising a 4th-year project where the student implemented
an information retrieval engine in Common Lisp so that he could
use Genetic Programming to learn a stemming algorithm.  Common
Lisp was fast *enough*, and it took a long time to figure out
what his algorithm (not the learned algorithm) should be.  Any
time spent writing that in C would have slowed the project down,
possibly preventing its completion.  (The year before another
student, trying to do the same thing in Java, came to grief.
Mind you, he was a worse student, but the difficulty of coding
anything in Java, compared with Lisp, certainly didn't help him.)

> I know that the overall algorithm can make a huge difference, and  
> that we should not waste our time with premature optimization, but  
> at the same time we should not waste CPU time with belated  
> pessimization. If we can predict based on experience that something  
> (a) is likely to be heavily used and (b) will be more efficient in a  
> lower-level language then we should implement it in said language.

I don't believe your prediction.
unjoin/2 is useful, true, but it's not *that* useful.

Writing an algorithm in Erlang isn't pessimisation and
it hardly counts as "belated".  If you think that badly
of Erlang, why use it?

> All the little inefficiencies can add up into a steaming great  
> system-wide inefficiency. The death of a thousand cuts.
>
> Now, that having been said, I have not devoted time (yet) to  
> studying the relative efficiency of Erlang-implemented operations  
> compared to C code.

It is an amazingly tricky thing to do.
Erlang is very bad at some things, like floating-point intensive
computations, that C is very good at.
It is very good at other things, like pattern matching and
concurrency, that C is very bad at.

Inter-language comparisons that use both languages effectively
are as scarce as hen's teeth.  (That is, a few have been constructed
by wizards in the lab, but none are known in the wild.)
>
> Here's the old war story ploy.
>
> I once ported a C++ application to an IBM mainframe - which had  
> (gasp!) a C++ compiler - and saw the mainframe get absolutely  
> *caned* by my laptop performance-wise in many cases. Investigation  
> at that time revealed that although the mainframe wiped out my  
> laptop in terms of I/O speed and bandwidth, my laptop just smoked  
> the IBM when it came to what goes on in the Web: dynamic string  
> processing and lots of memory allocations and deallocations. IBM,  
> seeing the writing on the wall, modified its hardware architecture,  
> IIRC by adding special string opcodes, and improved its runtime  
> libraries, to support more efficient string-processing and better  
> dynamic memory handling.

The z/Series ISA is readily available on the Web.
The /360 always *had* special string opcodes.
Some of the new opcodes add Unicode support.

Dynamic memory handling is a library issue, and there are
good, so-so, and spectacularly bad C and C++ libraries for
that job.  In this case, it had nothing to do with the
hardware, but with the system library.  I suspect that it
was (ab)using GETMAIN instead of running suballocators.
I don't know why the "gasp!"; IBM mainframes had C compilers,
quite good ones for the time, before C++ was commonly
available on most UNIX systems.

The really funny thing here is that C++ strings are notorious.
I'm sure it would have been easy to beat C++ on the laptop.