[erlang-questions] Erlang shows its slow face!

Wed Nov 17 04:08:43 CET 2010

On 17/11/2010, at 3:20 AM, Kostis Sagonas wrote:

> Richard O'Keefe wrote:
>> ... SNIP
>> The time for p0 was 420 milliseconds in Erlang.
>> On the same machine, the same function, the time was
>> 91 milliseconds in SML/NJ.
>>            SML/NJ  MLton   Erlang
>> n=100 L= 34 s=0.004 m=0.004 e=0.020
>> n=200 L= 86 s=0.028 m=0.027 e=0.130
>> n=300 L=146 s=0.090 m=0.086 e=0.410
>> n=400 L=210 s=0.213 m=0.202 e=0.980
>> n=500 L=274 s=0.415 m=0.394 e=1.900
>> We see that Erlang is about 5 times slower than SML.
> 
> Is Erlang running native code here?
> (Because both SML/NJ and MLton are.)

Yes.
> 
>> Here's the SML code.
>> ...  What's more, SML is
>> strongly typed, and the Dialyzer should be able to handle this,
>> so the compiler *knows* it is dealing with integers.
> 
> Let's leave dialyzer out of this discussion -- it is a type analyzer alright, but the types it infers are not necessarily usable for optimization purposes (I can possibly elaborate in some other mail).

Let's not leave dialyzer out just yet, because there's an important distinction.
The dialyzer can discover that certain variables will have integer values,
but that's not at all the same thing as discovering that they will have
*small* integer values.  But with arguments declared or inferred as 'int' in
SML, small hardware-sized integers just the same as C is what you get.  The
best plausible case for Erlang here is code that still has to check
arithmetic operations to see whether bignum arithmetic is needed, at run time.
> 
> Can you post p0 to see whether the compiler can infer whether the program deals with integers or not?
> 
It isn't pretty!

p0(N) when is_integer(N), N >= 3 ->
    p0_A(N, N, []).

p0_A(0, _, L) ->
    L;
p0_A(A, N, L) ->
    p0_B(N, A, N, L).

p0_B(0, A, N, L) ->
    p0_A(A-1, N, L);
p0_B(B, A, N, L) ->
    p0_C(N, A, B, N, L).

p0_C(0, A, B, N, L) ->
    p0_B(B-1, A, N, L);
p0_C(C, A, B, N, L) when A+B+C =< N, A*A+B*B =:= C*C ->
    p0_C(C-1, A, B, N, [{A,B,C}|L]);
p0_C(C, A, B, N, L) ->
    p0_C(C-1, A, B, N, L).

It wouldn't take a *huge* amount of effort to make the existing compiler
recognise "X <- lists:seq(Y, Z)" as a special case and generate code not
entirely unlike this, except that I took the opportunity to count down
to 0.  

Here's what p1 looks like in Smalltalk.

    p1
      |r|
      r := OrderedCollection new.
      1 to: self do: [:a |
        1 to: self do: [:b |
           1 to: self do: [:c |
             (a+b+c <= self and: [a squared + b squared = c squared])
               ifTrue: [r addLast: {a. b. c}]]]].
      ^r

By tradition, Smalltalk compilers *do* treat _ to: _ do: _ as a special
case, and mine is no exception, although the code is still sufficiently
generic to work with any kind of number or even Dates, not just integers.

I'll spare you all the generated C code, but here's the inner loop,
cleaned up a bit.  I'm including it to make a point.  The lines flagged
with ** are the lines that would are affected by (a) the lack of a
static type system and (b) the possibility of bignums.
The 'return is_not_boolean(...)' lines are there because you cannot
rely on comparison operators like < always returning a Boolean.
(The next big change to the compiler will be to deal with this another
way.)  The "i<fn>(...)" macros would be single hardware instructions in
C, but the arguments might not be integers, and if integers, might be
big, and if small, might lead to a big result.  (Oh, #squared should
have had a fast path too; I forgot.)

#define ib_p(x, y) \
    (AREINTS(x, y) ? BOXS(UNINT(x) + UNINT(y)) : b_p(GLOBALS_OUT x, y))

On a SPARC, BOXS(UNINT(x)+UNINT(y)) could be a single tadd, but I'm still
trying to generate portable C, so there's a penalty there.  The minimum
at assembly level would be something like
	or x, y	-> z
	and z, 3 -> z
	brnz z, L1
	add x, y -> z
	boc L2
L1:	call out_of_line_add
L2:
which is still a lot more than a simple add instruction.  The thing I
find astonishing is that it is fast enough to be *useful*.

         a4 /*c*/ = ASINT(1);			/* c := 1 */
         for (;;) {
**         b4 = ib_l(self, a4);			/* self < c */
           if (b4 != FALSE) break;
**         t4 = ib_p(a2 /*a*/, a3 /*b*/);	/* t4 := a + b */
**         t3 = ib_p(t4, a4 /*c*/);		/* t3 := (a+b) + c */
**         t2 = ib_le(t3, self);		/* t2 := t3 <= self */
           if (t2 == FALSE) {
             t1 = FALSE;
           } else
           if (t2 == TRUE) {
**            t5 = u_squared(a2 /*a*/);
**            t6 = u_squared(a3 /*b*/);
**            t3 = ib_p(t5, t6);		/* t3 := a**2 + b**2 */
**            t4 = u_squared(a4 /*c*/);
**            t1 = ib_e(t3, t4);		/* t1 := (a**2+b**2)=c**2 */
           } else {
**           return is_not_boolean(t2);
           }
           if (t1 == TRUE) {
             t3 = OBJECT_NEW(GCLASS(n_Array), ASINT(3));
             ELEM(t3, 1) = a2 /*a*/;
             ELEM(t3, 2) = a3 /*b*/;
             ELEM(t3, 3) = a4 /*c*/;
             (void)k_addLast(l1 /*r*/, t3);
           } else
**         if (t1 != FALSE) {
**           return is_not_boolean(t1);
           }
**         a4 = ib_p(a4, ASINT(1));		/* c := c + 1 */
         }
**       if (b4 != TRUE) return is_not_boolean(b4);

H