list size 'causing VM "problems"
Hendrik Visage
hvjunk@REDACTED
Mon Nov 23 11:09:43 CET 2009
Hi there,
Yes, I know this code is not yet optimal (I'm still learning :), but
it raises a few questions about the VM etc. that I'd like to understand.
1) I've run it fine with a small subset, but once I've loaded the 930k-line
file, the VM sucks up a lot of RAM/virtual memory: a burst
of about 2G (I have a 4G MacBookPro), and then, once it has returned to the
erl shell, the VM goes ballistic and consumes >7G of
virtual memory ;(
Q1: Why did the VM exhibit this behaviour? Is the garbage collector going bad/mad??
2) I will push the data into an ETS of sorts, as I'll try to find
duplicate files, but I was thinking of an initial pull into a list, and
then from there do the tests etc. The idea might be to pull in one
disk, and then compare it to another removable disk's files.
Q2: Should I rather do this straight into an ETS/DETS?
Q3: Should I preferably start to consider DETS 'cause of the size??
Q4: will Mnesia help in this case?
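To make the duplicate-finding idea concrete, here is a minimal sketch of going straight into ETS (re: Q2). The module and function names (dedup, find_dups/1) are my own, not from the original code; it assumes the {Type, File, Hash} tuples produced by process_line/1 below:

```erlang
-module(dedup).
-export([find_dups/1]).

%% Entries is a list of {Type, File, Hash} tuples.
%% Returns the sorted list of hashes that occur for more than one file.
find_dups(Entries) ->
    Tab = ets:new(hashes, [bag]),
    lists:foreach(fun({_Type, File, Hash}) ->
                          ets:insert(Tab, {Hash, File})
                  end, Entries),
    %% Any hash key with more than one stored object is a duplicate.
    Dups = ets:foldl(fun({Hash, _File}, Acc) ->
                             case length(ets:lookup(Tab, Hash)) > 1 of
                                 true  -> [Hash | Acc];
                                 false -> Acc
                             end
                     end, [], Tab),
    ets:delete(Tab),
    lists:usort(Dups).
```

Inserting as you read (instead of building a big intermediate list) also avoids holding 930k tuples on the process heap at once, which is likely part of the memory blow-up in Q1.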
%%--------------------------------------------------------------------
%% Function: process_line/1
%% Description: take a properly formatted line, parse it, and
%% return the tuple {Type,File,Hash}
%% Line: "MD5 (/.file) = d41d8cd98f00b204e9800998ecf8427e"
%% Note: some might be SHA1 in future.
%%--------------------------------------------------------------------
process_line(Line) ->
    %% Note: literal parens in the regex need "\\(" / "\\)" at the
    %% Erlang string level; the original "\(" relied on "\(" being
    %% passed through unchanged, which is confusing and now warned about.
    {match, [Type, File, Hash]} =
        re:run(Line,
               "(.*) \\((.*)\\) = ([0-9a-f]*)\n",
               [{capture, all_but_first, list}]),
    {Type, File, Hash}.
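For reference, a quick usage sketch with the example line from the comment above (assuming the function is exported or pasted into the shell):

```erlang
%% The sample line format documented above the function:
Line = "MD5 (/.file) = d41d8cd98f00b204e9800998ecf8427e\n",
%% Pattern-match asserts the expected parse result.
{"MD5", "/.file", "d41d8cd98f00b204e9800998ecf8427e"} = process_line(Line).
```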
%%--------------------------------------------------------------------
%% Function: read_lines/1
%% Description: read in all the lines from a "properly formatted"
%% md5 output on MacOSX, returning a list of the tuples.
%%--------------------------------------------------------------------
read_lines(IOfd) ->
    case file:read_line(IOfd) of
        {ok, Line} ->
            [process_line(Line) | read_lines(IOfd)];
        eof ->
            []
    end.
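Since read_lines/1 builds the whole 930k-element list before returning (which is where much of the memory goes), one alternative is a fold that hands each parsed tuple to a caller-supplied function as it is read. This is a sketch with my own name (fold_lines/3); the caller could, for example, insert each tuple straight into an ETS table instead of accumulating a list:

```erlang
%% Tail-recursive: each parsed line is consumed immediately by Fun,
%% so only the accumulator (not the whole file) lives on the heap.
fold_lines(IOfd, Fun, Acc0) ->
    case file:read_line(IOfd) of
        {ok, Line} ->
            fold_lines(IOfd, Fun, Fun(process_line(Line), Acc0));
        eof ->
            Acc0
    end.
```

Counting the lines, for instance, becomes fold_lines(Fd, fun(_T, N) -> N + 1 end, 0) without ever materialising the full list.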
More information about the erlang-questions
mailing list