Tail/Non tail recursion

Thu Aug 28 17:23:10 CEST 2003

I think Peter Lund's suggestion (corrected variant of Jilani Khaldi's) 
is the fastest in this case (and it is non-tail-recursive):

   square([H|T]) -> [H*H|square(T)];
   square([]) -> [].

The list comprehension variant suggested by Luke Gorrie will probably be 
translated to essentially the same as the above by the compiler so it 
should be just as fast (and it is beautifully short):

   square(L) -> [N*N || N <- L].

The closest competitor would be (and it is tail-recursive):

   square(L) -> square(L, []).
   square([], R) -> lists:reverse(L);
   square([H|T], R) -> square(T, [H*H|R]).

Why do I think the first (second) one is the fastest?

The first one saves two words on the stack (one result H*H (or just H 
doing the multiplication after the return) and one return address) per 
recursion on the way down. Every now and then the stack+heap size will 
exeed the allocated limit causing a garbage collection.

After reaching the end of the input list it starts returning, building 
the result list on the way up - adding two words on the heap (a cons 
cell with one word immediate result value and one word tail pointer) and 
popping the two words off the stack, per return. No stack+heap size 
change, that is. AND since the heap and the stack shares allocated area 
and the heap grows upwards towards the stack downwards growing stack, no 
garbage collections has to be done.

The second one i think is syntactical sugar for the first one.

The third one saves two words on the heap per recursion (one cons cell 
but built in the reverse order). The stack does not change since the 
calls are tail recursive. Garbage collections will occur just as for the 
first suggestion, but now caused by heap growth not stack growth.

After reaching the end of the input list it will build the non-reversed 
result list by calling lists:reverse/1 that adds and removes two words 
on the heap per recursion. No heap size chanbe, that is, BUT the new 
cons cells are allocated from the top of the heap and the freed creates 
holes in the heap, so the holes has to be garbage collected every now 
and then.

Summing it up then, I guess that the first and the third one will reach 
the end of the input list in the same time. Then the first one just has 
to do N returns, but the third one has to call a (fast, but wait...) BIF 
that builds and garbage collects holes in proportion to the input list 
length. That is why I think the first (second) one wins: the 
non-tail-recursion data to be pushed on the stack is not greater than 
the result to build on the heap when returning, and it is fast to build 
the result when returning.

But to know you must benchmark. Messen ist Wissen. (Werner von Siemens: 
  To measure is to know)

/ Raimo Niskanen, Erlang/OTP, Ericsson AB

Jilani Khaldi wrote:
> Hi All,
> Is there a better way to write this code when there is a lot of elements 
> in the list?
> 
> square([H|T]) -> [(H*H|square(T)];
> square([]) -> [].
> 
> Thanks.
> 
> jilani