[erlang-questions] widefinder update

Sun Oct 28 13:29:51 CET 2007

--- Hynek Vychodil <vychodil.hynek@REDACTED> wrote:

> Hello,
> These results are interesting, but I demur to kind
> of solution. Your
> and Steve's approach have some caveats.
> 
> 1/ File is read all in memory. 

Hynek,

This is true for some versions, but not all. The
'block read' version reads the file in chunks.

> 2/ Workers share resource (ets table) and it is
> principally bad. If
> you have more CPU consuming task and you must use
> more CPU than as
> current task to consume your input data bandwitch
> and  simultaneously
> more result extensive task, you fall in trouble
> again.

Note that the ets table in all proposals but one is
managed by a single process. It is just used as a more
efficient data structure. So the potential problem
here is really if this process becomes a bottleneck.

So, we have so far looked at two extremes:

1. Every worker maintains a local count, these are
then merged into a global count.

2. A single process maintains the global count,
workers send it updates.

But if this becomes problematic, one could also
combine the two by having 1 to N centralized counting
processes to trade off the cost of merging versus the
cost of incrementally sending all counts to a
'master'. (And one could batch the sending of updates
too, come to think of it.)

> As conclusion I think, your solution scale bad for
> both end. When you
> have small amount of CPUs, you run out memory on
> larger datasets. 

Not necessarily. With the block read solution, it
doesn't seem like you run that risk. 

The use of file:read_file/1 just showed that you
_could_ do fast I/O in Erlang, at a time when people
thought Erlang file I/O was very slow indeed. Showing
this was done by switching to a more suitable API
call. But you can be even more sophisticated than
that, e.g., by using file:pread.

> When
> you have more CPU, you fall in bottle neck of your
> shared resource. 

Do you mean that the problem becomes I/O bound? Do
note that all sufficiently fast solutions will
ultimately be limited by a hardware bottleneck of some
sort: CPU, I/O, network ...

In this particular case, you could increase I/O
performance by, say, striping the disk. And you can
increase CPU performance by, say, distributing the
work to multiple hosts/nodes (fairly straightforward
with Erlang, by the way). But with these problems,
even with infinite hardware you will eventually run
into some sequential portion of the code, and that
will limit the speedup as per Amdahl's Law.

Best,
Thomas

__________________________________________________
Do You Yahoo!?
Tired of spam?  Yahoo! Mail has the best spam protection around 
http://mail.yahoo.com