[erlang-questions] widefinder update

Sun Oct 28 14:06:11 CET 2007

On 10/28/07, Thomas Lindgren <thomasl_erlang@REDACTED> wrote:
>
> --- Hynek Vychodil <vychodil.hynek@REDACTED> wrote:
>
> > Hello,
> > These results are interesting, but I have objections to
> > this kind of solution. Your
> > and Steve's approaches have some caveats.
> >
> > 1/ The whole file is read into memory.
>
> Hynek,
>
> This is true for some versions, but not all. The
> 'block read' version reads the file in chunks.

Which version do you mean? tbray_blockread.erl from
http://www.erlang.org/pipermail/erlang-questions/2007-October/030118.html
reads in chunks, but when the workers are slow you run out of memory. Look
at the scan_file/9 loop: there is no limit on the number of blocks held
in memory.
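
A minimal sketch of what I mean by a limit, not taken from any posted
solution: the reader hands out blocks only while fewer than MaxInFlight
are outstanding, and otherwise waits for a worker to finish. The module
name, run/3 and work/1 are hypothetical, and work/1 just counts bytes as a
stand-in for the real per-block scan. The sketch also ignores lines that
straddle a block boundary and crashes on a read error.

```erlang
-module(bounded_reader).
-export([run/3]).

%% run(Path, BlockSize, MaxInFlight) -> {ok, Total}
%% Reads Path in BlockSize chunks. At most MaxInFlight blocks are
%% outstanding (handed to a worker but not yet reported back), so
%% memory use is bounded by roughly BlockSize * MaxInFlight.
run(Path, BlockSize, MaxInFlight)
  when is_integer(MaxInFlight), MaxInFlight > 0 ->
    {ok, Fd} = file:open(Path, [read, raw, binary]),
    try
        loop(Fd, BlockSize, MaxInFlight, 0, 0)
    after
        file:close(Fd)
    end.

%% Below the limit: read one more block and give it to a new worker.
loop(Fd, BlockSize, Max, InFlight, Acc) when InFlight < Max ->
    case file:read(Fd, BlockSize) of
        {ok, Block} ->
            Reader = self(),
            spawn_link(fun() -> Reader ! {done, work(Block)} end),
            loop(Fd, BlockSize, Max, InFlight + 1, Acc);
        eof ->
            drain(InFlight, Acc)
    end;
%% Limit reached: the reader waits until one worker reports back.
loop(Fd, BlockSize, Max, InFlight, Acc) ->
    receive
        {done, N} -> loop(Fd, BlockSize, Max, InFlight - 1, Acc + N)
    end.

%% End of file: collect the results that are still outstanding.
drain(0, Acc) ->
    {ok, Acc};
drain(InFlight, Acc) ->
    receive
        {done, N} -> drain(InFlight - 1, Acc + N)
    end.

%% Placeholder for the real per-block work (an assumption).
work(Block) ->
    byte_size(Block).
```

With slow workers the reader simply stalls in the second loop/5 clause
instead of piling up blocks.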

>
> > 2/ Workers share a resource (the ets table), which is
> > bad in principle. If
> > the task is more CPU-intensive, so that you need
> > more CPUs than
> > the current task does to keep up with your input bandwidth,
> > and at the same time
> > it produces more results, you are in trouble
> > again.
>
> Note that the ets table in all proposals but one is
> managed by a single process. It is just used as a more
> efficient data structure. So the potential problem
> here is really if this process becomes a bottleneck.
>
> So, we have so far looked at two extremes:
>
> 1. Every worker maintains a local count, these are
> then merged into a global count.
>
> 2. A single process maintains the global count,
> workers send it updates.
>
> But if this becomes problematic, one could also
> combine the two by having 1 to N centralized counting
> processes to trade off the cost of merging versus the
> cost of incrementally sending all counts to a
> 'master'. (And one could batch the sending of updates
> too, come to think of it.)
>
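
For illustration, the first extreme (local counts merged into a global
count) could look roughly like this. It is only a sketch under my own
assumptions: the module and function names are hypothetical, the input is
already split into lists of keys, and maps stand in for whatever data
structure a real solution would use.

```erlang
-module(merge_counts).
-export([count_all/1, merge/2]).

%% count_all(Chunks) -> #{Key => Count}
%% Chunks is a list of key lists; each inner list goes to one worker,
%% which builds a local count and sends it back in a single message.
count_all(Chunks) ->
    Parent = self(),
    Refs = [begin
                Ref = make_ref(),
                spawn_link(fun() -> Parent ! {Ref, count(Chunk)} end),
                Ref
            end || Chunk <- Chunks],
    %% Merge the local counts into the global one as they arrive.
    lists:foldl(fun(Ref, Global) ->
                        receive {Ref, Local} -> merge(Local, Global) end
                end, #{}, Refs).

%% Local count kept by one worker.
count(Keys) ->
    lists:foldl(fun(K, M) ->
                        maps:update_with(K, fun(N) -> N + 1 end, 1, M)
                end, #{}, Keys).

%% Add every local count to the global count.
merge(Local, Global) ->
    maps:fold(fun(K, N, M) ->
                      maps:update_with(K, fun(C) -> C + N end, N, M)
              end, Global, Local).
```

Each worker sends exactly one message, so the merge cost grows with the
number of distinct keys per worker rather than with the number of matches.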
> > In conclusion, I think your solution scales badly at
> > both ends. With
> > a small number of CPUs, you run out of memory on
> > larger datasets.
>
> Not necessarily. With the block read solution, it
> doesn't seem like you run that risk.
>

Yes, but where is this solution? I can't see it in this thread.
I may have missed some, but the solutions I have read are driven by the
reader, and the reader does not wait for the workers.

>
> The use of file:read_file/1 just showed that you
> _could_ do fast I/O in Erlang, at a time when people
> thought Erlang file I/O was very slow indeed. Showing
> this was done by switching to a more suitable API
> call. But you can be even more sophisticated than
> that, e.g., by using file:pread.
>
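
As a rough sketch of block-wise reading with file:pread/3 (the module
name, fold_blocks/4 and the callback shape are my assumptions, not code
from this thread): the file is walked by explicit offset, so only one
block is held at a time unless the callback keeps it.

```erlang
-module(pread_blocks).
-export([fold_blocks/4]).

%% fold_blocks(Path, BlockSize, Fun, Acc0) -> Acc
%% Calls Fun(Offset, Block, Acc) for each block, in file order.
fold_blocks(Path, BlockSize, Fun, Acc0) ->
    {ok, Fd} = file:open(Path, [read, raw, binary]),
    try
        fold(Fd, 0, BlockSize, Fun, Acc0)
    after
        file:close(Fd)
    end.

fold(Fd, Offset, BlockSize, Fun, Acc) ->
    %% pread reads at an absolute offset and returns eof past the end.
    case file:pread(Fd, Offset, BlockSize) of
        {ok, Block} ->
            Next = Offset + byte_size(Block),
            fold(Fd, Next, BlockSize, Fun, Fun(Offset, Block, Acc));
        eof ->
            Acc
    end.
```

Because every read names its own offset, the same call could also be
issued by several workers, each on its own range of the file.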
> > With
> > more CPUs, you hit the bottleneck of your
> > shared resource.
>
> Do you mean that the problem becomes I/O bound? Do
> note that all sufficiently fast solutions will
> ultimately be limited by a hardware bottleneck of some
> sort: CPU, I/O, network ...
>
> In this particular case, you could increase I/O
> performance by, say, striping the disk. And you can
> increase CPU performance by, say, distributing the
> work to multiple hosts/nodes (fairly straightforward
> with Erlang, by the way). But with these problems,
> even with infinite hardware you will eventually run
> into some sequential portion of the code, and that
> will limit the speedup as per Amdahl's Law.
>
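
For reference, the usual statement of Amdahl's Law, with symbols that are
not from this thread: p is the fraction of the run time that can be
parallelized and N is the number of processors.

```latex
S(N) = \frac{1}{(1 - p) + \frac{p}{N}},
\qquad
\lim_{N \to \infty} S(N) = \frac{1}{1 - p}
```

For example, with p = 0.95 the speedup can never exceed 20, no matter how
many CPUs are added.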

Yes, you are right. There is no "best" solution. But we can at least
make a memory-safe solution.

Cheers
-- Hynek (Pichi) Vychodil