[erlang-questions] [enhancement] string:split/2

Fri Oct 10 04:32:33 CEST 2008

On Thu, Oct 9, 2008 at 7:21 PM, Richard O'Keefe <ok@REDACTED> wrote:

>
> On 9 Oct 2008, at 7:43 pm, Edwin Fine wrote:
>
>> I'm assuming you are providing this code as an "executable requirements
>> document." I was rather hoping that it could be implemented as a BIF.
>> I'll have to study your code below so I can learn some "fast and fancy"
>> Erlang!
>>
>
> Well, no.  I was providing the code as a first draft of what
> could be dropped into string.erl.  I see no reason why it should
> be a BIF, if by that you mean something implemented in C.

Yes, that is what I meant.

>
> If the performance of unjoining should be a bottleneck in some

My background and experience has led me to believe that primitive
operations, or functions that are likely to be heavily used, should be
implemented in a low-level language like C, and not in the higher-level
language. I consider string:split to be something that will be heavily used
in text-processing applications (which will also become heavily used, I
predict). Since string operations really need to be efficient, I wrongly
assumed that the string operations were written in C and that the equivalent
Erlang code would not be nearly as efficient.

> program, after getting over my astonishment I would recommend
> unrolling.  For example,
>
>  unjoin0([C|Cs]) ->
>>   [[C] | unjoin0(Cs)];
>> unjoin0([]) ->
>>   [].
>>
>
> would become
>
> unjoin0([A,B,C,D|S]) -> [[A],[B],[C],[D] | unjoin0(S);
> unjoin0([A,B,C])     -> [[A],[B],[C]];
> unjoin0([A,B])       -> [[A],[B]];
> unjoin0([A])         -> [[A]];
> unjoin0([])          -> [].
>
> That plus HiPE can often do surprisingly well.

I'd rather ensure that the function is implemented efficiently to start off
with, and not *require* HiPE to be used to get performance. One of these
nights when I have some spare time, I am going to try your code out (with
and without HiPE) against a 'C'  or C++ version of split and join so that I
can have some hard data to tell for sure one way or the other.

This is how I see it: even given the increasingly fast computer
architectures of today, one should do one's best not to waste processor time
when performance is desired. I have seen users of interpreted, byte-code
interpreted, and threaded-interpreted languages implement computationally
expensive, heavily used functions in the language instead of in a low-level
C library. In almost all cases where the code was not I/O bound, there was a
significant penalty for writing such functions. Rewriting those functions in
C/C++ often gave one or two orders of magnitude improvement. I know that the
overall algorithm can make a huge difference, and that we should not waste
our time with premature optimization, but at the same time we should not
waste CPU time with belated pessimization. If we can predict based on
experience that something (a) is likely to be heavily used and (b) will be
more efficient in a lower-level language then we should implement it in said
language. All the little inefficiencies can add up into a steaming great
system-wide inefficiency. The death of a thousand cuts.

Now, that having been said, I have not devoted time (yet) to studying the
relative efficiency of Erlang-implemented operations compared to C code.
Obviously in a large-scale sense, Erlang is generally efficient enough given
the uses to which it has been put. On the gripping hand, the uses to which
Erlang has been put are going to push its envelope as it gains more
traction. It's going to be used in many more Web-related applications, which
means *inter alia* much more text processing involving fiddling around with
strings (excuse the unintended pun). Fiddling with strings can and does kill
systems that are not designed to do that.

Here's the old war story ploy.

I once ported a C++ application to an IBM mainframe - which had (gasp!) a
C++ compiler - and saw the mainframe get absolutely *caned* by my laptop
performance-wise in many cases. Investigation at that time revealed that
although the mainframe wiped out my laptop in terms of I/O speed and
bandwidth, my laptop just smoked the IBM when it came to what goes on in the
Web: dynamic string processing and lots of memory allocations and
deallocations. IBM, seeing the writing on the wall, modified its hardware
architecture, IIRC by adding special string opcodes, and improved its
runtime libraries, to support more efficient string-processing and better
dynamic memory handling.

Now Erlang has been known to present challenges when it comes to string
processing (and pattern matching and file I/O - see the WideFinder project)
if the unwary don't take special steps like using binaries for efficiency,
and keeping as much as possible in iolists instead of having everything
flattened. We all know that Erlang was originally intended for soft
real-time type applications. The way things go is that as a language proves
its merits, and becomes more and more in demand because (in Erlang's case)
it has a reputation for creating highly reliable and available systems,
people will start pushing the envelope of its original design parameters.
Reading the history of Awk is interesting in this regard; it was put to uses
which the authors never dreamed of (like implementing a database - horror).
Demands came in for more and better library functions. I predict the same
will happen with Erlang, if it is not already happening.

It is well to say that one shouldn't use one single language to do
everything (and I agree with that caution), but in my experience, software
managers don't like having too many languages floating about because of the
recruitment, maintenance and training burden. Many developers are not as
gifted as the readers of this list and find it difficult to hop between
different languages. This all creates a tendency to use the primary language
for as many things as possible (the old hammer and nails analogy), even if
it wasn't designed for that. It is therefore probably a good idea that we
don't pessimize Erlang by writing fundamental code in Erlang that arguably
should be implemented as 'C' BIFs. And I am sure there will be argument :)

But this is something to do AFTER measurement!
>

Ideally, yes, and then again, maybe not, if we strongly suspect that it's
going to bite us later NOT to do it now. Here's the problem: like I wrote
earlier about the death of a thousand cuts, what if there isn't a few big
bottlenecks? What if the system is slow simply because it is bloated from
too many unnecessary opcodes being executed in a fairly distributed pattern?

Like John Haugeland (sp?) wrote earlier, sometimes 20+ years of programming
experience gives one an intuitive idea regarding performance. Every little
bit counts (excuse the pun). I am just generally saying that if one doesn't
watch out, the system gradually bloats with new high-level code and then
it's not so performant anymore.

Now I will duck into the flame-proof bomb shelter.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://erlang.org/pipermail/erlang-questions/attachments/20081009/b78c807b/attachment.htm>