Parsing big files

Ulf Wiger etxuwig@REDACTED
Tue Dec 5 10:03:08 CET 2000

Hi Thomas,

I've attached a file that seems to do the job.


1> fileio:lines("fileio.erl","fileio.erl.out",
                 fun(Str) -> ["=== ",Str] end).

> head -5 fileio.erl.out
=== -module(fileio).
=== -author('etxuwig@REDACTED').
=== %%-compile(export_all).
=== -export([lines/3]).
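
The fun in the call above only prepends text, so the newline that comes with each line passes through untouched. A fun that changes the end of a line has to deal with the newline itself. Below is a minimal sketch of one way to do that; the module name linefun and the function append/1 are made up for illustration and are not part of the attached file.

```erlang
-module(linefun).
-export([append/1]).

%% append(Suffix) -> fun(Line) -> a deep list of chars
%% Returns a fun that can be passed as the third argument to
%% fileio:lines/3. It puts Suffix after the text of the line but
%% before the trailing newline.
append(Suffix) ->
    fun(Line) ->
            case lists:reverse(Line) of
                [$\n | Rev] ->
                    [lists:reverse(Rev), Suffix, $\n];
                _NoNewline ->
                    %% The last line of a file may lack a newline.
                    [Line, Suffix]
            end
    end.
```

It could then be used as fileio:lines("in.txt", "out.txt", linefun:append(";")), where the file names are placeholders.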

On Tue, 5 Dec 2000, Thomas Arts wrote:

>I have got a large file which consists of about 2 million lines.
>The aim is to parse this file, change the format a little and
>write it back to disk.
>No surprise that file:read_file(FileName) helps the erlang runtime
>system to get out of memory. I need a file:open, and thereafter
>read the file in parts and write the changed parts to disk.
>I wonder if someone already wrote a transformation program for
>such large files. I want the scanner to present a scanned
>line at a time, such that I can write a line at a time, but it
>would be nice if I don't have to do the bookkeeping on the
>byte level.

-module(fileio).
-author('etxuwig@REDACTED').
%%-compile(export_all).
-export([lines/3]).

%%% lines(InFile, OutFile, Fun : fun/1) -> ok | {error, Reason}
%%% Process InFile one line at a time. Each line is passed to Fun, and
%%% the return value (a possibly deep list of chars) is written to OutFile.
%%% Each line includes its trailing newline; Fun must keep it in the
%%% return value, or the output lines will run together.
%%% Example:
%%% 2> fileio:lines("fileio.erl","fileio.erl.out",
%%%                 fun(Str) -> ["=== ",Str] end).
%%% ok
%%% would produce the following output in fileio.erl.out:
%%% > head -5 fileio.erl.out
%%%=== -module(fileio).
%%%=== -author('etxuwig@REDACTED').
%%%=== %%-compile(export_all).
%%%=== -export([lines/3]).

lines(InFile, OutFile, Fun) ->
    case file:open(InFile, [read]) of
	{ok, In} ->
	    case file:open(OutFile, [write]) of
		{ok, Out} ->
		    process_files(In, Out, Fun);
		{error, Reason} ->
		    %% don't leak the input file if the output can't be opened
		    file:close(In),
		    {error, {Reason, OutFile}}
	    end;
	{error, Reason} ->
	    {error, {Reason, InFile}}
    end.

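process_files(In, Out, Fun) ->
    %% catch, so that both files get closed even if Fun crashes
    Result = (catch process(In, Out, Fun)),
    file:close(In),
    file:close(Out),
    Result.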

process(In, Out, Fun) ->
    case io:get_line(In, "") of
	eof ->
	    ok;
	{error, Reason} ->
	    {error, Reason};
	Line ->
	    ok = io:put_chars(Out, Fun(Line)),
	    process(In, Out, Fun)
    end.
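
For a file of about 2 million lines, the same loop can also be written against a raw file descriptor, which bypasses the io server. This is only a sketch, and it assumes an OTP release that has file:read_line/1, which is newer than this post. The module name fileio_raw and the 64 kB read-ahead buffer are arbitrary choices for illustration. Note that Fun gets each line as a binary here, not as a string.

```erlang
-module(fileio_raw).
-export([lines/3]).

%% lines(InFile, OutFile, Fun : fun/1) -> ok | {error, Reason}
%% Same idea as fileio:lines/3, but Fun receives each line as a
%% binary (trailing newline included) and returns iodata.
lines(InFile, OutFile, Fun) ->
    case file:open(InFile, [read, raw, binary, {read_ahead, 65536}]) of
        {ok, In} ->
            case file:open(OutFile, [write, raw, delayed_write]) of
                {ok, Out} ->
                    Result = (catch process(In, Out, Fun)),
                    _ = file:close(In),
                    %% With delayed_write, a failed write may only be
                    %% reported when the file is closed.
                    case {Result, file:close(Out)} of
                        {ok, ok} -> ok;
                        {ok, CloseError} -> CloseError;
                        {Other, _} -> Other
                    end;
                {error, Reason} ->
                    _ = file:close(In),
                    {error, {Reason, OutFile}}
            end;
        {error, Reason} ->
            {error, {Reason, InFile}}
    end.

process(In, Out, Fun) ->
    case file:read_line(In) of
        {ok, Line} ->
            ok = file:write(Out, Fun(Line)),
            process(In, Out, Fun);
        eof ->
            ok;
        {error, Reason} ->
            {error, Reason}
    end.
```

The call looks the same as before, apart from the binary: fileio_raw:lines("in.txt", "out.txt", fun(L) -> [<<"=== ">>, L] end). Whether this is actually faster for a given file is something to measure, not assume.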


