[erlang-questions] Priority queues and Erlang

Wed Jun 17 00:39:09 CEST 2009

It's time for some numbers.
I apologise to the list for the delay;
I had this stuff ready yesterday, but wasn't able to get to
work because of snow.

Definitions:

1. An object is mutable if and only if it is possible for its
    state to change without its identity changing (without regard
    for how *fast* such changes might be; SQL tables are mutable).

2. A programming language supports mutability if and only if it
    provides or allows the programmer to create mutable objects
    (without regard for how *fast* such changes might be; SQL
    supports mutability).

3. A programming language supports mutability *efficiently* if and
    only if it provides at least some "small" mutable data structures
    whose state can be changed in about the same time as a function  
call.

4. Mutability does not guarantee an absence of copying.  Assigning one
    Pascal record to another or one C struct to another is defined
    as copying (which is why Simula 67 distinguished between
    := content-copying assignment and :- pointer assignment), but
    is none-the-less change of state without change of identity.

5. Compile-time garbage collection occurs in a programming language
    implementation when the compiler determines that at least some
    data structures are no longer useful and arranges for their
    memory to become available for reuse without any runtime tests.

6. For the purposes of this message, "leftist heaps" are defined by
    the Wikipedia entry http://en.wikipedia.org/wiki/Leftist_tree

7. "Imperative" code referred to below alters the nodes of a tree
    in place using assignment statements.  "Functional" code
    referred to below never alters a node once created, but allocates
    new ones.  No claim is made that these labels have any merit or
    any wider application other than making that distinction in this
    message.

According to these definitions, Erlang does support mutability
through the process dictionary, the process registry, the module
system, ETS, DETS, and connections to external data bases.  It
does not support mutability *efficiently*, except perhaps in the
case of the process dictionary.  Only benchmarks could determine
whether that should count as efficient.  Also according to these
definitions, Erlang does NOT have compile-time garbage collection.

There are functional languages, notably Clean (with uniqueness types)
and Mercury (with dead-input/unique-output modes) that do provide
compile-time garbage collection.  *Semantically* these languages do
not support mutability, but the implementation behaves *as if* they
did; the 'merge' function will not in fact allocate any new storage.

Here's what I did.

I found the Leftist_tree page of the Wikipedia.
It provides "imperative" code for merging leftist heaps in Java.
I took that and wrote a complete Java program that

    - creates a heap with n entries (command line parameter)
    - for several millions of iterations
        - remove the smallest entry
        - create a new entry with the same key + a random number.

This is pretty much the "hold" model for heap performance.
Once this was working, I adapted it to produce a "functional" version.
I then translated both versions to C and to Smalltalk.

The programs were run for values of n ranging from 2 to 100,
and a linear model

     (user+system time in nanoseconds) ~ constant + speed * log(n)

fitted to each using the lm() function in R.  (The model is called
"linear" because it is linear in the _coefficients_; the fact that
it isn't linear in n doesn't matter.)  I tried other models as
well, but these fitted best.  The constant factor was ignored
because (except in the case of Smalltalk) that includes the cost
of program startup, dynamic library loading, and in the case of
Java, JIT compilation.

Rather surprisingly, in the case of the Java programs, the
*actual* measured times for the "functional" version were in
every case LESS than the actual measured times for the "imperative"
version.  Yes, the imperative version was asymptotically more
efficient, but the cross-over was estimated at n = 4,500.  I have
no idea why this is so.

                 C       Java    My ST   Squeak
"Imperative"    40       70     150     1123
"Functional"   240      126     290     1895

C means gcc 4.0; Java means JDK 1.5.0; My ST means my
ANSI Smalltalk via C whole-program compiler; Squeak means
Squeak Smalltalk 3.10.

Interpretation:

     Java is about 2 times slower than C.

     My Smalltalk -> C is about 2 times slower than Java.
     This is a consequnce of
         - the absence of a type system
         - strict encapsulation, so "x->f" and "x->f = e" are
           function calls, not simple loads and stores.

     Squeak is about 7 times slower than my Smalltalk compiler.
     This is a consequence of
         - a fully dynamic system (mine is static whole program)
         - byte code emulation

     Except for C, the "functional" versions are about twice as
     slow as the "imperative" versions.
     This is a consequence of
         - extra memory allocation and garbage collection
         - cache dilution

     The C/Functional case stands out as particularly bad.
     This is a consequence o
         - using an unadapted version of the Boehm GC;
           absolutely no attempt whatever was made to tune it.

Let's consider the My ST/Java pair in a little more detail.
In Java,
	x.s = x.r.s + 1;
should, with the aid of a JIT, turn into something like
	LD t, r(x)
	LD t, s(t)
	ADD t, t, 1
	ST t, s(x)
In Smalltalk, encapsulation means that this has to be
	set_s(x, get_s(get_r(x)) + 1)
with each of get_r(), get_s(), and set_s() involving a
dynamic dispatch, thanks to strict encapsulation, and
even a+b turns into
	if (((a|b)&3) == 0) {
	    long t = (long)a >> 2 + (long)b >> 2;
	    r      = t << 2;
	    if ((r >> 2) != t) r = make_bignum(t);
	} else {
	    r = b_p(a, b);
	}

This could be much faster on a SPARC, at least for 30-bit fixnums.
Something not entirely unlike it is necessary because Smalltalk
integers don't have any arbitrary size limits, just like Erlang.

It took careful coding, especially careful placement of code in
the right objects, to get Smalltalk code running only twice as
slow as Java.  For the Smalltalk/Java pair then

A.  The combination of
	- compile-time type system
	- weak encapsulation
	- crippled integers
     is more important FOR THIS PROBLEM than efficient mutability.

Let's consider the My ST/Squeak pair.

My (not quite finished) Smalltalk system only supports ANSI
Smalltalk plus enough to make up about 200 system classes.
It compiles whole programs to ANSI C.  It has no debugging
support, which is a right pain in the backside.  Compilation
to C is blindingly fast; compilation _of_ the resulting C is
painfully slow.  ANSI Smalltalk provides no way to change a
class at run time, so my system doesn't either.  Currently, no
inlining is done, though it's planned.  Thanks to compilation
via C, "Smalltalk" methods may contain C code, although it
doesn't matter for this benchmark.

Squeak has nearly 4000 classes, and that's the stripped down
version with all the fancy stuff moved out.  It is highly incremental.
Methods and classes can be added, renamed, or removed while a program
is running.  In fact _some_ program is _always_ running.  Its object
representation is large than mine, due in part to garbage collector
differences.  There have been attempts at a JIT for Squeak but they've
always fallen out of date or in one case not yet been finished.  I
didn't try the benchmark in a commercial Smalltalk, but other
benchmarks I've done using VisualWorks NonCommercial have shown it
performing pretty close to my Smalltalk.  So we conclude that

B.  Native code -vs- emulated byte code
     is FAR more important FOR THIS PROBLEM than efficient mutability.

The C/Functional case is seriously anomalous.  I attribute this to
the use of the Boehm garbage collector straight out of the box
without the slightest attempt at tuning or adaptation.  I just
used GC_malloc() instead of malloc().  I conclude that

C.  FOR THIS PROBLEM, the relative cost of immutability depends
     a very great deal on the garbage collector.

I note that Erlang's immutable data structures and its per-process
heaps have the _potential_ to allow a garbage collector to be
more efficient than "general purpose" ones like the Boehm one.

What does all of this tell us about priority queues in Erlang?
Well, the machine I did the benchmarks on didn't have Erlang
installed, but from what we know about it, it seems safe to conclude
that
	- the absence of an enforced static type system
	- seamless integration of fixnums and bignums
will make at least as much of a difference as the lack of
efficient mutability, FOR THIS PROBLEM.  The presence of HiPE
means that that shouldn't be _too_ much of an issue.

Of course support for efficient mutability doesn't just affect the
speed of an algorithm; it affects the very choice of an algorithm
in the first place.

all have either