list size 'causing VM "problems"
Hendrik Visage
hvjunk@REDACTED
Mon Nov 23 11:09:43 CET 2009
Hi there,
Yes, I know this code is not yet optimal (I'm still learning :), but
it raises a few questions about the VM etc. that I'd like to understand.
1) I've run it fine with a small subset, but once I've loaded the 930k-line
file, the VM sucks up a lot of RAM/virtual memory: a burst
of about 2G (I have a 4G MacBookPro), and then, once it has returned to the
erl shell, the VM goes ballistic and consumes >7G of
virtual memory ;(
Q1: Why did the VM exhibit this behaviour? Is the garbage collector going bad/mad??
2) I will push the data into an ETS of sorts, as I'll try to find
duplicate files, but I was thinking of an initial pull into a list, and
then from there do the tests etc. The idea might be to pull in one
disk, and then compare it to another removable disk's files.
Q2: Should I rather do this straight into an ETS/DETS?
Q3: Should I preferably start to consider DETS 'cause of the size??
Q4: will Mnesia help in this case?
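To make the duplicate-finding idea concrete, here is a minimal sketch of going straight into ETS (re: Q2). The module and function names (dedup, find_dups/1) are my own, not from the original code; it assumes the {Type, File, Hash} tuples produced by process_line/1 below:

```erlang
-module(dedup).
-export([find_dups/1]).

%% Entries is a list of {Type, File, Hash} tuples.
%% Returns the sorted list of hashes that occur for more than one file.
find_dups(Entries) ->
    Tab = ets:new(hashes, [bag]),
    lists:foreach(fun({_Type, File, Hash}) ->
                          ets:insert(Tab, {Hash, File})
                  end, Entries),
    %% Any hash key with more than one stored object is a duplicate.
    Dups = ets:foldl(fun({Hash, _File}, Acc) ->
                             case length(ets:lookup(Tab, Hash)) > 1 of
                                 true  -> [Hash | Acc];
                                 false -> Acc
                             end
                     end, [], Tab),
    ets:delete(Tab),
    lists:usort(Dups).
```

Inserting as you read (instead of building a big intermediate list) also avoids holding 930k tuples on the process heap at once, which is likely part of the memory blow-up in Q1.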
%%--------------------------------------------------------------------
%% Function: process_line/1
%% Description: take a properly formatted line, parse it, and
%% return the tuple {Type,File,Hash}
%% Line: "MD5 (/.file) = d41d8cd98f00b204e9800998ecf8427e"
%% Note: some might be SHA1 in future.
%%--------------------------------------------------------------------
process_line(Line) ->
    %% Note: literal parens in the regex need "\\(" / "\\)" at the
    %% Erlang string level; the original "\(" relied on "\(" being
    %% passed through unchanged, which is confusing and now warned about.
    {match, [Type, File, Hash]} =
        re:run(Line,
               "(.*) \\((.*)\\) = ([0-9a-f]*)\n",
               [{capture, all_but_first, list}]),
    {Type, File, Hash}.
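For reference, a quick usage sketch with the example line from the comment above (assuming the function is exported or pasted into the shell):

```erlang
%% The sample line format documented above the function:
Line = "MD5 (/.file) = d41d8cd98f00b204e9800998ecf8427e\n",
%% Pattern-match asserts the expected parse result.
{"MD5", "/.file", "d41d8cd98f00b204e9800998ecf8427e"} = process_line(Line).
```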
%%--------------------------------------------------------------------
%% Function: read_lines/1
%% Description: read in all the lines from a "properly formatted"
%% md5 output on MacOSX, returning a list of the tuples.
%%--------------------------------------------------------------------
read_lines(IOfd) ->
    case file:read_line(IOfd) of
        {ok, Line} ->
            [process_line(Line) | read_lines(IOfd)];
        eof ->
            []
    end.
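Since read_lines/1 builds the whole 930k-element list before returning (which is where much of the memory goes), one alternative is a fold that hands each parsed tuple to a caller-supplied function as it is read. This is a sketch with my own name (fold_lines/3); the caller could, for example, insert each tuple straight into an ETS table instead of accumulating a list:

```erlang
%% Tail-recursive: each parsed line is consumed immediately by Fun,
%% so only the accumulator (not the whole file) lives on the heap.
fold_lines(IOfd, Fun, Acc0) ->
    case file:read_line(IOfd) of
        {ok, Line} ->
            fold_lines(IOfd, Fun, Fun(process_line(Line), Acc0));
        eof ->
            Acc0
    end.
```

Counting the lines, for instance, becomes fold_lines(Fd, fun(_T, N) -> N + 1 end, 0) without ever materialising the full list.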
More information about the erlang-questions
mailing list