Dynamic languages are the future

Richard A. O'Keefe ok@REDACTED
Thu Aug 31 06:11:27 CEST 2006

fbg111 <fbg111@REDACTED> wrote:
	Same problem in America.  More good writings on that topic at
	Paul Graham's site.

It must be synchronicity.
We've had a thread about strings that looks like reviving.
I'm in the middle of writing some Java classes for a 4th year student to use.
And now here's mention of Java.

The 4th year student is supposed to be investigating a topic in information
retrieval; I wanted to see if multicore systems could do IR faster, and
Andrew Trotman came up with the key concept for how to structure an index
so that this might actually work.

The student was given a tiny (< 500 line) IR engine in C that reads an XML
document collection and builds an index (~ 300 lines) and that reads
an index and queries (~ 300 lines; the two programs share some code).
It's small, dead simple, and reasonably fast.  It can index the test
collection in under 3 minutes.  The student basically got nowhere modifying
it because despite having had C in 3rd year, she only really knows Java.
Her Java rewrite of my C code (without the XML stuff; the document
collection had to be reformatted) takes eight hours.

Eight hours!  That's at least 160 times slower than C!

Just for grins, I rewrote the index builder in AWK.  44 lines of AWK.
(It works on the same reformatted document collection as the student's
code, uses built in hash tables, and writes numbers in ASCII, not binary.)
Her Java program was more than 50 times slower than AWK.

Profiling to the rescue:  "java -Xprof BuildIndex wsj.data".
It turned out that practically all the time was going in RandomAccessFile.
Guess what:  RandomAccessFile doesn't do any buffering, so each write of a
4-byte int x turned into 4 calls to f.write((byte)(x >> ...)), and each of *those*
is a call to a native method, involving a switch from Java to C and back.

Just adding a few lines of code to buffer stuff into a byte array and
flushing that every so often (just like using fwrite() in C would...)
speeded the program up by a factor of 10.
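
For anyone who wants to see the shape of that fix, here is a minimal sketch.
It is not the student's code: the class name BufferedIntWriter, the
capacity parameter, and the big-endian byte order are all assumptions made
for illustration.  The only library calls are RandomAccessFile.write(byte[],
int, int) and close().

```java
import java.io.IOException;
import java.io.RandomAccessFile;

/** Hypothetical helper: packs ints into a byte array and writes them in bulk. */
final class BufferedIntWriter implements AutoCloseable {
    private final RandomAccessFile file;
    private final byte[] buf;
    private int used = 0;   // number of bytes currently waiting in buf

    BufferedIntWriter(RandomAccessFile file, int capacity) {
        if (capacity < 4) {
            throw new IllegalArgumentException("capacity must be at least 4");
        }
        this.file = file;
        this.buf = new byte[capacity];
    }

    /** Stores x as 4 bytes, most significant first (assumed byte order). */
    void writeInt(int x) throws IOException {
        if (used + 4 > buf.length) {
            flush();        // one native write for the whole buffer
        }
        buf[used++] = (byte) (x >>> 24);
        buf[used++] = (byte) (x >>> 16);
        buf[used++] = (byte) (x >>> 8);
        buf[used++] = (byte) x;
    }

    /** Pushes any pending bytes to the file in a single call. */
    void flush() throws IOException {
        if (used > 0) {
            file.write(buf, 0, used);
            used = 0;
        }
    }

    /** Flushes pending bytes, then closes the underlying file. */
    @Override
    public void close() throws IOException {
        try {
            flush();
        } finally {
            file.close();
        }
    }
}
```

With a buffer of, say, 64 KB, the Java-to-C crossing happens once per
16384 ints instead of four times per int.  Anything that seeks in the file
would have to call flush() first.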

The mawk version is still sniggering at the Java version, but not as loudly.

Of course this says nothing about Java AS A LANGUAGE.
It's a library issue.  But in real Java, practically _everything_ is
a library issue.  There's another student doing a GA+IR project whose
program was speeded up by a large factor by another lecturer.  Same
thing:  run the program with -Xprof, spot that the time is going in a
library class (ArrayList, as it happens), rewrite to use plain arrays,
time goes way down.
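
I haven't seen that program, so the following is only a guess at the kind
of rewrite involved: a growable list of ints held in a plain int[], which
avoids wrapping every element in an Integer object.  The class name
IntArrayList and the initial capacity of 16 are invented for the example.

```java
import java.util.Arrays;

/** Hypothetical growable int list backed by a plain array (no boxing). */
final class IntArrayList {
    private int[] data = new int[16];   // assumed starting capacity
    private int size = 0;

    /** Appends a value, doubling the backing array when it is full. */
    void add(int value) {
        if (size == data.length) {
            data = Arrays.copyOf(data, size * 2);
        }
        data[size++] = value;
    }

    /** Returns the element at index, checking against the logical size. */
    int get(int index) {
        if (index < 0 || index >= size) {
            throw new IndexOutOfBoundsException("index " + index + ", size " + size);
        }
        return data[index];
    }

    int size() {
        return size;
    }
}
```

Whether this helps in any particular program is something only the
profiler can say, which is rather the point.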

One of the things that makes Erlang a practical language for real applications
is the tools for working with Erlang, like 'eprof' and 'cover'.
