Tail/Non tail recursion

Raimo Niskanen < >
Thu Aug 28 17:23:10 CEST 2003

```I think Peter Lund's suggestion (corrected variant of Jilani Khaldi's)
is the fastest in this case (and it is non-tail-recursive):

square([H|T]) -> [H*H|square(T)];
square([]) -> [].

The list comprehension variant suggested by Luke Gorrie will probably be
translated to essentially the same as the above by the compiler so it
should be just as fast (and it is beautifully short):

square(L) -> [N*N || N <- L].

The closest competitor would be (and it is tail-recursive):

square(L) -> square(L, []).
square([], R) -> lists:reverse(L);
square([H|T], R) -> square(T, [H*H|R]).

Why do I think the first (second) one is the fastest?

The first one saves two words on the stack (one result H*H (or just H
doing the multiplication after the return) and one return address) per
recursion on the way down. Every now and then the stack+heap size will
exeed the allocated limit causing a garbage collection.

After reaching the end of the input list it starts returning, building
the result list on the way up - adding two words on the heap (a cons
cell with one word immediate result value and one word tail pointer) and
popping the two words off the stack, per return. No stack+heap size
change, that is. AND since the heap and the stack shares allocated area
and the heap grows upwards towards the stack downwards growing stack, no
garbage collections has to be done.

The second one i think is syntactical sugar for the first one.

The third one saves two words on the heap per recursion (one cons cell
but built in the reverse order). The stack does not change since the
calls are tail recursive. Garbage collections will occur just as for the
first suggestion, but now caused by heap growth not stack growth.

After reaching the end of the input list it will build the non-reversed
result list by calling lists:reverse/1 that adds and removes two words
on the heap per recursion. No heap size chanbe, that is, BUT the new
cons cells are allocated from the top of the heap and the freed creates
holes in the heap, so the holes has to be garbage collected every now
and then.

Summing it up then, I guess that the first and the third one will reach
the end of the input list in the same time. Then the first one just has
to do N returns, but the third one has to call a (fast, but wait...) BIF
that builds and garbage collects holes in proportion to the input list
length. That is why I think the first (second) one wins: the
non-tail-recursion data to be pushed on the stack is not greater than
the result to build on the heap when returning, and it is fast to build
the result when returning.

But to know you must benchmark. Messen ist Wissen. (Werner von Siemens:
To measure is to know)

/ Raimo Niskanen, Erlang/OTP, Ericsson AB

Jilani Khaldi wrote:
> Hi All,
> Is there a better way to write this code when there is a lot of elements
> in the list?
>
> square([H|T]) -> [(H*H|square(T)];
> square([]) -> [].
>
> Thanks.
>
> jilani

```