[erlang-questions] Erlang shows its slow face!

Richard O'Keefe ok@REDACTED
Tue Nov 16 02:55:01 CET 2010


On 16/11/2010, at 4:02 AM, Edmond Begumisa wrote:

> Hello Richard,
> 
> I have a question...
> 
>> Erlang is a fine tool for jobs like this one.
> 
> Interesting... I would have never thought this. I would have instinctively reached for a language with mutable memory, thinking that this would make the most sense for combinatorial* work (I don't know this for fact, and I could be very wrong, but that's what instinct would tell me.) 

> * I've been calling this permutation but I don't know if that's accurate.

It isn't.  "Permutation" has a very specific meaning, which does not apply here.

Let me quibble slightly:  *memoisation* can help a lot for some combinatorial
algorithms, but mutable memory as such doesn't strike me as obviously useful.
(Indeed, since mutation can quietly invalidate memoised results, mutable
memory can seriously get in your way.)  Good support for large integers
helps a lot.
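
For instance, memoisation in a purely functional style just threads a
table of already-computed results through the computation, with no
mutable memory anywhere.  A tiny sketch (an illustration only, nothing
to do with the triples problem):

%% Memoised Fibonacci, threading an immutable dict of results.
%% (Illustrative sketch only.)
fib(N) ->
    {V, _Memo} = fib(N, dict:from_list([{0, 0}, {1, 1}])),
    V.

fib(N, Memo) ->
    case dict:find(N, Memo) of
        {ok, V} ->
            {V, Memo};
        error ->
            {A, M1} = fib(N - 1, Memo),
            {B, M2} = fib(N - 2, M1),
            V = A + B,
            {V, dict:store(N, V, M2)}
    end.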

Different algorithms may well have different needs.
> 
>> There is little point in optimising a bad algorithm.
> 
> Well put. But say you have an 'ok' algorithm. Not the best, but not blatantly flawed either.

I have a C program to compute various things from a data set which consists of
a hundred million lines, each containing four non-negative integers.  My
program had to read it several times.  (The first pass determines the sizes
of major areas; the second pass determines the size of minor areas within
major areas; the third pass loads the data.)  It gets the effect of sorting
without actually sorting (or more precisely, by count-sorting).  The first
version of the program took 68 seconds per pass on the fastest machine I
had access to.  Considering that wc takes 20 seconds to _not_ process the
numbers in the file, that didn't seem too bad.  But bypassing stdio and using
my own code let me get the time down to 6 seconds per pass, which means that
I can afford to run the program a lot more often.

That was a case where there was a _good_ algorithm, but a horrible library.
(Time is linear in the size of the input; results depend on all of the input;
up to a constant factor the algorithm is as good as it gets.)

But in the case we are discussing here, the initial algorithm was O(N**3)
when an O(N**2) algorithm -- in fact, several of them, there's even a
Wikipedia article on enumerating Pythagorean triples -- was not hard to find.

> 
> So my question is: if version 1 isn't performing "reasonably" acceptably for Garcia's purpose, and version 1 isn't blatantly flawed.

To me, version 1 *IS* blatantly flawed.
In a Pythagorean triple (a,b,c), a and b uniquely determine c,
so it seems *obvious* that a quadratic algorithm should be sought.
(It is not obvious that one *exists*, just that it should be sought.)
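
Concretely, one obvious quadratic shape (a sketch of the general idea
only, not necessarily the p2/p3 code whose times appear below) computes
the single candidate c from each a and b:

p2_sketch(N) ->
    [{A, B, C} ||
        A <- lists:seq(1, N),
        B <- lists:seq(1, N),
        %% the only candidate c; round(math:sqrt(...)) is exact enough
        %% at these sizes, though a true integer square root would be
        %% safer for large inputs
        C <- [round(math:sqrt(A*A + B*B))],
        C*C =:= A*A + B*B,
        A + B + C =< N].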

Make no mistake:  I have used brute force algorithms quite happily myself.
But I never blame the language for the performance I get when I do so!

 
> Isn't this a strong indicator that he's using the wrong tool?

Ah, but what precisely is "the tool" in question?
- functional languages as such?
- Erlang?
- list comprehensions as such?
- Erlang's current implementation of list comprehensions?

Here are some more times.  Erlang R14A, MacBook Pro.

8> pyth:t(p3, 300).
{p3,300,{length,146},{msec,10}}
9> pyth:t(p2, 300). 
{p2,300,{length,146},{msec,210}}
10> pyth:t(p1, 300). 
{p1,300,{length,146},{msec,970}}
11> pyth:t(p0, 300). 
{p0,300,{length,146},{msec,420}}

What's p0?
It's the utterly naive cubic loop nest, with the loops written
as function calls working on integers, with the only list in
existence being the one that's being built.
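
In Erlang that sort of code looks roughly like this (a sketch of such a
version, not necessarily the exact code that was timed):

%% Three counting loops written as tail-recursive functions over
%% integers, accumulating the result list as the only allocation.
p0(N) ->
    fa(N, N, []).

fa(0, _N, L) -> L;
fa(A, N, L) -> fa(A - 1, N, fb(N, A, N, L)).

fb(0, _A, _N, L) -> L;
fb(B, A, N, L) -> fb(B - 1, A, N, fc(N, B, A, N, L)).

fc(0, _B, _A, _N, L) -> L;
fc(C, B, A, N, L) when A + B + C =< N, A*A + B*B =:= C*C ->
    fc(C - 1, B, A, N, [{A, B, C} | L]);
fc(C, B, A, N, L) ->
    fc(C - 1, B, A, N, L).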

The time for p0 was 420 milliseconds in Erlang.
On the same machine, the same function, the time was
91 milliseconds in SML/NJ.

(In the tables below, L is the number of triples found and the times
are in seconds.)

            SML/NJ  MLton   Erlang
n=100 L= 34 s=0.004 m=0.004 e=0.020
n=200 L= 86 s=0.028 m=0.027 e=0.130
n=300 L=146 s=0.090 m=0.086 e=0.410
n=400 L=210 s=0.213 m=0.202 e=0.980
n=500 L=274 s=0.415 m=0.394 e=1.900

We see that Erlang is about 5 times slower than SML.
Here's the SML code.

fun p0 n =
    let fun fa     0 l = l
          | fa     a l = fb n a l
        and fb   0 a l = fa (a-1) l
          | fb   b a l = fc n b a l
        and fc 0 b a l = fb (b-1) a l
          | fc c b a l = fc (c-1) b a (
                            if a+b+c <= n andalso a*a+b*b = c*c
                            then (a,b,c)::l else l)
    in fa n []
    end

It's completely tail recursive, building the list from right to
left.  *THIS* is the version that is strictly comparable to any
C or C# code one might plausibly write: it allocates no storage
that is not part of the result, and all looping is done with
"while"-style loops that compare and increment integers.  What's
more, SML is strongly typed, so the compiler *knows* it is dealing
with integers (and the Dialyzer should be able to infer as much
for an Erlang version).

I grant you that this code is not immediately clear.
Clear code in SML might look like this:

fun cat_tab l u f =     (* concatenate + tabulate *)
    let fun loop i r = if i < l then r else loop (i-1) (f i @ r)
     in loop u []
    end                 (* _happens_ not to be in List *)

fun p1 n =
    cat_tab 1 n (fn a =>
        cat_tab 1 n (fn b =>
            cat_tab 1 n (fn c =>
                if a+b+c <= n andalso a*a+b*b = c*c
                then [(a,b,c)] else [])))
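
An Erlang counterpart to cat_tab and p1 might read as follows (again a
sketch for comparison, not necessarily the code behind the Erlang
cat_tab column below):

cat_tab(L, U, F) ->                     % concatenate + tabulate
    cat_tab_loop(L, U, F, []).

cat_tab_loop(L, I, _F, R) when I < L -> R;
cat_tab_loop(L, I, F, R) -> cat_tab_loop(L, I - 1, F, F(I) ++ R).

p1(N) ->
    cat_tab(1, N, fun (A) ->
        cat_tab(1, N, fun (B) ->
            cat_tab(1, N, fun (C) ->
                if A + B + C =< N, A*A + B*B =:= C*C -> [{A, B, C}];
                   true -> []
                end
            end)
        end)
    end).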

What happens to the times?
            (--------direct-------) (-------cat_tab-------) original!
            SML/NJ  MLton   Erlang  SML/NJ  MLton   Erlang  Erlang
n=100 L= 34 s=0.004 m=0.004 e=0.020 s=0.015 m=0.006 e=0.040 o=0.040
n=200 L= 86 s=0.028 m=0.027 e=0.130 s=0.120 m=0.039 e=0.300 o=0.290
n=300 L=146 s=0.090 m=0.086 e=0.410 s=0.403 m=0.128 e=0.950 o=0.980
n=400 L=210 s=0.213 m=0.202 e=0.980 s=0.954 m=0.303 e=2.240 o=2.280
n=500 L=274 s=0.415 m=0.394 e=1.900 s=1.857 m=0.591 e=4.300 o=4.400

The MLton compiler is a whole-program compiler, and I've a
strong suspicion that it is able to inline cat_tab, so it
may be able to avoid building the answer twice.  I'm pretty
sure that SML/NJ isn't inlining cat_tab, and is building the
answer twice.  While the Erlang compiler _could_ be taught
about cat_tab, it's not generally wonderful at inlining recursive
procedures, which is why the rightmost column shows list
comprehension not doing so well.
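
For reference, the list-comprehension version behind that rightmost
column presumably looked something like this (a reconstruction, not
code quoted from the thread):

p1_lc(N) ->
    [{A, B, C} || A <- lists:seq(1, N),
                  B <- lists:seq(1, N),
                  C <- lists:seq(1, N),
                  A + B + C =< N,
                  A*A + B*B =:= C*C].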

There are two basic issues here:
(1) The Erlang compiler doesn't really handle list comprehensions as
    native loops; it compiles them to recursive functions (a sketch of
    the kind of code they turn into follows this list).  lists:seq(1, N)
    is not special syntax -- unlike Haskell's [1..N] -- and almost
    surely builds an actual list.

    This could be changed, and maybe some day it will change.

(2) Erlang is compiled incrementally to support hot loading and
    (again in order to support hot loading) does not do much if any
    cross-module inlining.
    MLton is a whole-program optimising compiler, and compilers for SML
    _are_ allowed to do cross-module inlining by the nature of the
    language.

    This *can't* be changed without making Erlang less useful for its job.    
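
To make point (1) concrete, here is a hand-written sketch of the kind
of code a comprehension turns into; the real compiler generates an
internal helper with a name of its own:

%% e.g. [X*X || X <- lists:seq(1, N), X rem 2 =:= 0] becomes roughly:
squares_of_evens(N) ->
    lc_loop(lists:seq(1, N)).           % the list really is built

lc_loop([X | Xs]) when X rem 2 =:= 0 -> [X*X | lc_loop(Xs)];
lc_loop([_ | Xs]) -> lc_loop(Xs);
lc_loop([]) -> [].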

Ah, but there's one more wrinkle!

            SML/NJ  Erlang  SML/NJ     Smalltalk
            Int.int         IntInf.int
n=100 L= 34 s=0.004 e=0.020 s= 0.250   p1=0.016
n=200 L= 86 s=0.028 e=0.130 s= 1.904   p1=0.109
n=300 L=146 s=0.090 e=0.410 s= 6.567   p1=0.369
n=400 L=210 s=0.213 e=0.980 s=15.581   p1=0.873
n=500 L=274 s=0.415 e=1.900 s=30.218   p1=1.705 

(3) Erlang integer arithmetic is *not* limited to some ridiculous
    size like 32 bits, it is _semantically_ unbounded integer
    arithmetic, with a space-efficient representation and fast
    calculation path for small integers.  To get something
    comparable in SML, you have to use the type IntInf.int
    instead of Int.int.  And when you do *that*, Erlang really
    really shines.

    The difference here is that if you use int in C or Java or C#
    or SML, then you'll find a positive integer n such that n+1 < 0.
    That kind of nonsense simply doesn't happen in Erlang.

    It comes with a price, but the price of *not* using a Lispy
    (or Smalltalky, or Erlangy) approach to integers is the price
    of getting wrong answers without knowing.

    The Smalltalk column was done using my Smalltalk-via-C compiler.
    It might be possible to go a factor of 2 faster using machine-
    specific assembly code, but what we seem to be looking at in
    the Erlang column really does seem to be no more than the
    price of arithmetic that gives correct results.
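
    To see the point concretely, here is the sort of thing that
    happens in the Erlang shell (an illustrative transcript):

    1> N = 1 bsl 62.
    4611686018427387904
    2> N + N.
    9223372036854775808

    A signed 64-bit integer would have wrapped to a negative number
    at the second step; Erlang just keeps counting.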
    
And by the way, no, I am not advocating writing code like p0 above
*UNTIL* it has been determined that the function is a performance
problem and that the algorithm is good.  Like I said, you should
stick with the clean high level code until you understand the
problem space better.




