Tail/Non tail recursion

Wed Sep 3 03:43:28 CEST 2003

Thomas Lindgren <thomasl_erlang@REDACTED> wrote:
	Let me explain my terminological choice. 

	For our purposes, the example loop is a two step
	process:

	1. create A = [X+1 | <null value>]

	2. in the next iteration of the loop, update the tail
	element of A.

	The <null value> is written only so that the GC can
	scan the data structure. If the GC can detect this
	anyway, then one can omit that write.

Correct:  there are plausible, even simple, implementations in which the
<null> placeholder is NEVER WRITTEN AT ALL.  Which means that the write
which _does_ fill it in is not a destructive operation.

It seems a little unreasonable to call an operation a "destructive"
update just because *some* (and definitely not all) implementations
might have put something there first.  After all, even for
    X = [H|T]
an implementation *could* do
    R <- allocate a pair cell and fill it with zeros
    st T, 0(R)                ^^^^^^^^^^^^^^^^^^^^^^
    st H, 4(R)
    add R, <list tag>, X
in which case what data construction _wouldn't_ be a "destructive"
operation.

	Between 1 and 2, there may in general be a garbage
	collection. Thus, if a generational collector is used,
	the code needs to check whether A was tenured or not,
	and if tenured, record an old-to-new pointer.

But this is a contigent fact about a garbage collector,
*not* an intrinsic fact about the operation itself.
The significant fact for the garbage collector is not so much
whether the initialisation overwrites defined data or not, but the
direction of the link it establishes.

I thought it might be interesting to get some actual numbers.
Recall that we are discussing

    f([H|T]) -> [H+1 | f(T)];
    f([])    -> [].

I implemented this in C, using only the constructions I thought a
fairly dumb Erlang-to-C compiler might use.  One version used
body recursion; the other used tail recursion with a where-to-store-
the-address argument.  H+1 was implemented by calling an out-of-line
function.  There was no garbage collector, but the code did check
whether a garbage collection was needed.

On a SunBlade 100, the tail recursive version was no less than 12 times
faster.  Of course, the SPARC function calling convention dumps quite a
lot of stuff into memory whenever the register windows overflow, so this
is perhaps a little unfair, but if I had such a compiler, that's the
machine I'd want it for.

On an Alpha, which doesn't have register windows, and has a somewhat
saner function calling convention, the tail recursive version was only
4 times faster.  That's still well worth having.

The *big* advantage of using tail recursion when you can, of course,
is that you don't need so much space for the stack; if you put the
stack and the per-process heap in the same block of memory, you don't
get so many garbage collections because the stack doesn't keep running
into the heap.  This also serves to improve performance.

That's all very well, but would such an implementation strategy be
_useful_ very often?   I decided to take an Erlang file and check.
I didn't write it.   The author knew about tail recursion optimisation,
and may well have avoided tail-recursion-through-a-constructor because
he knew that _didn't_ count as tail recursion.  There are certainly
more calls to reverse/1 than I would have felt comfortable with.
The important thing is that the file is not "about" list processing
or any other data structure.

Functions     :  41
Clauses       :  87    2.12 clauses per function
All returns   : 110    2.68 returns per function, 1.26 per clause
..Appends     :   3     2.7% of all returns
..Constructors:   4     3.6% of all returns
..Simple      :  51    46.4% of all returns
..Tail calls  :  41    37.3% of all returns
..Exits       :  11    10.0% of all returns
Body calls    :  57

Exit: the last action is exit(...)
Tail: the last action is foo(...) for some function, self or other
Simple: the result is a variable, constant, arithmetic expression, &c
Constructor: the result contains at least one function call inside a
        constructor, the last outermost such is picked as "tail" call.
Append: the result is <some expression> ++ a constructor, and can be
    compiled into fast inline code.

If we discount the exit()s as indicating errors, we get
simple returns	: 51   51.5%
plain tail calls: 41   41.4%
constructor tail:  7    7.1%
total           : 99

Roughly one return in 2 is a simple result (constant, variable, &c).
Roughly one function call in 2 is a body call.

Roughly half of the returns (48%) are
roughly half of the function calls (46%) and
are (6/7) or could be (1/7) tail calls.

These are static frequencies, not dynamic frequencies, and they
are only for one file.  To be honest, I was expecting the static
frequency of C and A calls to be lower, and was pleasantly surprised.