[erlang-questions] Not an Erlang fan

Sun Sep 23 22:20:41 CEST 2007

--- Pierpaolo Bernardi <olopierpa@REDACTED> wrote:

> On 9/23/07, Thomas Lindgren <thomasl_erlang@REDACTED> wrote:
> 
> > He's also using the obvious, tempting but very slow
> > io:read_line. Reading the entire file into a binary
> > takes 7 ms (sic) using file:read_file, not 34 seconds
> > using io:read_line as he reports.
> 
> He reports 34 seconds for the whole log file of about
> 1 million lines.
> Not for the reduced sample of about 20000 lines that he
> made available on the site (a difference of 50x in size).

Oops, sorry about missing that. Even so, the I/O as
such does not appear very costly. Chunk the processing
into 50 reads of 2MB each, or whatever size is
suitable. The total cost of these operations should
then be on the order of 50*7=350 ms (let's say around
a second, since it really depends on whether the data
have to be fetched from disk, on seeks, bandwidth,
memory management, etc). At a guess, most of the time
will instead be spent scanning and processing the
binaries.
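
To make the chunking concrete, here is a minimal sketch of
reading a file in fixed-size binary chunks and folding a
function over them. The module name chunk_read and the
function fold_chunks/4 are made up for illustration; only
file:open/2, file:read/2 and file:close/1 are standard calls.
It has not been timed against the log in question.

```erlang
-module(chunk_read).
-export([fold_chunks/4]).

%% Fold Fun(Chunk, Acc) over the file at Path, reading it as
%% binaries of at most ChunkSize bytes each.
%% (Hypothetical helper, not part of OTP.)
fold_chunks(Path, ChunkSize, Fun, Acc0) ->
    %% raw + binary bypasses the io server that io:read_line
    %% style reading goes through.
    {ok, Fd} = file:open(Path, [read, raw, binary]),
    try
        loop(Fd, ChunkSize, Fun, Acc0)
    after
        file:close(Fd)
    end.

loop(Fd, ChunkSize, Fun, Acc) ->
    case file:read(Fd, ChunkSize) of
        {ok, Chunk}     -> loop(Fd, ChunkSize, Fun, Fun(Chunk, Acc));
        eof             -> Acc;
        {error, Reason} -> erlang:error({read_failed, Reason})
    end.
```

For example, chunk_read:fold_chunks(File, 2*1024*1024,
fun(B, N) -> N + byte_size(B) end, 0) would just add up the
sizes of the chunks, which is a cheap way to time the I/O
part on its own.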

Regarding parallelism, it looks to me like the reading
and processing can be overlapped. You have to
special-case lines or data that span chunks, but apart
from that, it looks as if you could process each chunk
independently, at least when you are doing
map/filter/reduce style operations (where the output
is combined incrementally as chunks are processed).
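
As a rough sketch of the special case for lines that span
chunks: carry the trailing partial line of each chunk over
to the next one, and only hand complete lines to the
processing function. The names chunk_lines and fold_lines/3
are invented here, and the code assumes a current OTP
release where the binary module is available. It is a
sequential version; a parallel one would have to return
each chunk's leading and trailing fragments and stitch
them together when combining results.

```erlang
-module(chunk_lines).
-export([fold_lines/3]).

%% Fold Fun(Line, Acc) over the lines contained in a list of
%% binary chunks, where a line may straddle a chunk boundary.
%% (Hypothetical helper, not part of OTP.)
fold_lines(Chunks, Fun, Acc0) ->
    {Rest, Acc} = lists:foldl(fun(Chunk, State) -> feed(Chunk, Fun, State) end,
                              {<<>>, Acc0}, Chunks),
    finish(Fun, Rest, Acc).

%% Prepend the carried-over fragment, split on newlines, and
%% hold back the last piece, which may be an incomplete line.
feed(Chunk, Fun, {Rest, Acc}) ->
    Parts = binary:split(<<Rest/binary, Chunk/binary>>, <<"\n">>, [global]),
    {Complete, [Partial]} = lists:split(length(Parts) - 1, Parts),
    {Partial, lists:foldl(Fun, Acc, Complete)}.

%% A non-empty leftover is a final line without a newline.
finish(_Fun, <<>>, Acc) -> Acc;
finish(Fun, Rest, Acc)  -> Fun(Rest, Acc).
```

With the chunks [<<"a\nb">>, <<"c\nd\n">>] the function sees
the lines "a", "bc" and "d", so the line split across the
two chunks is put back together before it is processed.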

Best,
Thomas
