[erlang-questions] Pointless md5 Problem

Thu May 7 08:04:27 CEST 2009

On Wednesday, May 06, Colin Z wrote:

> I wrote up a quick and dirty distributed app, but the performance seems
> really bad (around 10,000 keys/sec on a 3700+ Athlon) It scales nicely, but
> 10,000 keys/sec/machine is terrible when other languages can get 1M+/sec
...
> So, now my interest lies in what's so slow about my code....
...
> What does everyone think?

There are two main ways to study performance in a program. One is to
think and profile. The other is to ask under-occupied people on
mailing lists to speculate wildly. Or: I can't be bothered profiling
your program, if you're really interested in why it's slow you should,
and the rest of this mail is just some ideas about things to look at
and think about.

---

Within one node, you're spawning one process for each MD5 you
check. What's the point of doing that?

There's no concurrency there to justify doing it for design reasons.

So the only reason I can think of is to exploit any parallel hardware
you might have(1). If you're going to do that, then what you want is a
fairly small number of processes doing a lot of independent work each,
i.e. on an N-core machine I'd start off trying N processes.

As your code is now, there's no upper limit to the number of
concurrent 'checkMD5' processes you're starting. I expected the number
of processes to explode. But actually it doesn't, at least not on my
machine. Looking more closely, I see that you're generating the random
hex string in the process which spawns 'checkMD5'. That suggests that
making the random string is more work than 'checkMD5'.

Without actually profiling, i.e. speculating, my guess is that
erlang:md5 is much cheaper than both the the flattening and formatting
you're doing in checkMD5 and also much cheaper than all the
flattening, folding, randoming and lists:seq in randomHexString().

So there's a lot of work which could be eliminated. How much?  Depends
on which parts of your solution are 'essential parts of a pointless
problem' and which parts are just bad implementation.  E.g.: is
generating a string representation of something which could be an
MD5sum and then hashing the string to see if the string representation
of that hash is the same as the string an essential part of your
problem? Why not generate random binaries(2) and seeing if they hash
to themselves?

And: why so many calls to random()? Is avoiding a systematic division
of the domain an essential part of your contrived problem or not?

Matt

 1. You are running an SMP emulator, right? 

 2. It might even be worth fossicking around in some of the HIPE
    group's evil functions for destructive updates and seeing if there's
    something useful there. Here's a blog post from someone who would
    appear to approve:

      http://prog21.dadgum.com/41.html

    While you're at it, you could also try native compilation.