<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta content="text/html;charset=ISO-8859-1" http-equiv="Content-Type">
<title></title>
</head>
<body bgcolor="#ffffff" text="#000000">
It's official, Erlsom is Saxy... <br>
<br>
I have the latest version from sf.net, and have successfully parsed the
6.6GB file in about 22 minutes. That is an average of 5.5MB/s parsing
speed, which on my T2500 laptop is pretty damn good. It was nice to note
that it was CPU bound, so, knowing the XML structure, I can
potentially split the file processing into blocks and multi-thread it.
Either way, 22 minutes is an acceptable time in my book.<br>
<br>
Thanks for your help Willem.<br>
<br>
Willem de Jong wrote:
<blockquote
cite="mid:407d9ef80806192324x47c9800cp5dd1c853a4c6f2cd@mail.gmail.com"
type="cite">
<div>Hello Matt,</div>
<div> </div>
<div>Yes, this is possible (and documented, but maybe not clear
enough - are you using the latest version? Did you get erlsom from
CEAN? That version is quite old).</div>
<div> </div>
<div>There even is an example included in the distribution that shows
how to do it. I am copying part of it below.</div>
<div> </div>
<div>Good luck,</div>
<div>Willem</div>
<div> </div>
<div>
<p><font face="courier new,monospace">%% Example to show how the
Erlsom Sax parser can be used in combination<br>
%% with a 'continuation function'. This enables parsing of very big
documents<br>
%% in a sort of streaming mode.<br>
%%<br>
%% When the sax parser reaches the end of a block of data, it calls the<br>
%% continuation function. This should return the next block of data.<br>
%%<br>
%% the continuation function is a function that takes 2 arguments: Tail
and<br>
%% State.<br>
%% - Tail is the (short) list of characters that could not yet be
parsed<br>
%% because it might be a special token or not. Since this still
has to<br>
%% be parsed, it should be put in front of the next block of data.<br>
%% - State is information that is passed by the parser to the
callback<br>
%% functions transparently. This can be used to keep track of the<br>
%% location in the file etc.<br>
%% The function returns {NewData, NewState}, where NewData is a list of<br>
%% characters/unicode code points, and NewState the new value for the
State.<br>
<br>
-export([run/0]).</font></p>
<p><font face="courier new,monospace">%% 'chunk' is the number of
characters that is read at a time.<br>
%% should be tuned for the best result. (109 is obviously not a good
value,<br>
%% it should be bigger than that - try it out).<br>
-define(chunk, 109).<br>
<br>
run() -><br>
F = fun count_books/2, %% the callback function that handles the
sax events<br>
G = fun continue_file/2, %% the callback function that returns the
next<br>
%% chunk of data<br>
%% open file<br>
{ok, Handle} = file:open(xml(), [read, raw, binary]),<br>
Position = 0,<br>
CState = {Handle, Position, ?chunk},<br>
SaxCallbackState = undefined,<br>
%% erlsom:parse_sax() returns {ok, Result, TrailingBytes},<br>
%% where TrailingBytes is the rest of the input-document<br>
%% that follows after the last closing tag of the XML, and Result<br>
%% is the value of the State after processing the last SAX event.<br>
{ok, Result, _TrailingBytes} =<br>
erlsom:parse_sax(&lt;&lt;&gt;&gt;, SaxCallbackState, F,<br>
[{continuation_function, G, CState}]),<br>
%% close file<br>
ok = file:close(Handle),<br>
<br>
%% Result is a list [{track_id, count}, ...]<br>
lists:foreach(fun({Date, Count}) -><br>
io:format("Date: ~p - count: ~p~n", [Date, Count])<br>
end, Result),<br>
ok.<br>
<br>
%% this is a continuation function that reads chunks of data<br>
%% from a file.<br>
continue_file(Tail, {Handle, Offset, Chunk}) -><br>
%% read the next chunk<br>
case file:pread(Handle, Offset, Chunk) of<br>
{ok, Data} -><br>
{&lt;&lt;Tail/binary, Data/binary&gt;&gt;, {Handle, Offset +
Chunk, Chunk}};<br>
eof -><br>
{Tail, {Handle, Offset, Chunk}}<br>
end.</font></p>
<p><font face="courier new,monospace">count_books(startDocument, _)
-><br>
etc...</font></p>
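<p>A possible shape for the elided callback, as a minimal sketch only: the
real count_books/2 from the distribution is not shown in this message, so
the module name count_books_sketch, the plain integer state, the "book"
element name and the demo/0 helper are all assumptions made for
illustration. It also assumes that erlsom delivers SAX events such as
startDocument and {startElement, Uri, LocalName, Prefix, Attributes}, with
names as strings, and that the callback returns the new state.</p>
<p><font face="courier new,monospace">-module(count_books_sketch).<br>
-export([count_books/2, demo/0]).<br>
<br>
%% Hypothetical SAX callback: counts the start tags of &lt;book&gt; elements.<br>
%% The state is a plain integer (an assumption, not the original state).<br>
count_books(startDocument, _State) -&gt;<br>
&nbsp;&nbsp;&nbsp; 0;<br>
count_books({startElement, _Uri, "book", _Prefix, _Attributes}, Count) -&gt;<br>
&nbsp;&nbsp;&nbsp; Count + 1;<br>
count_books(_OtherEvent, State) -&gt;<br>
&nbsp;&nbsp;&nbsp; %% every other event leaves the state unchanged<br>
&nbsp;&nbsp;&nbsp; State.<br>
<br>
%% Folds a hand-written event list through the callback, no parser needed.<br>
demo() -&gt;<br>
&nbsp;&nbsp;&nbsp; Events = [startDocument,<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; {startElement, [], "library", [], []},<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; {startElement, [], "book", [], []},<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; {endElement, [], "book", []},<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; {startElement, [], "book", [], []},<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; {endElement, [], "book", []},<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; {endElement, [], "library", []},<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; endDocument],<br>
&nbsp;&nbsp;&nbsp; lists:foldl(fun count_books/2, undefined, Events).</font></p>
<p>demo/0 can be tried without a parser or an input file. Because the state
here is a plain counter, this sketch is not a drop-in replacement for the
callback that run/0 expects, which ends with a list of pairs.</p>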
<br>
</div>
<div><span class="gmail_quote">On 6/19/08, <b
class="gmail_sendername">Matt Harrison</b> &lt;<a
moz-do-not-send="true" href="mailto:matt@lummie.co.uk">matt@lummie.co.uk</a>&gt;
wrote:</span>
<blockquote class="gmail_quote"
style="border-left: 1px solid rgb(204, 204, 204); margin: 0px 0px 0px 0.8ex; padding-left: 1ex;">See
what I did there with the subject title :)<br>
<br>
All<br>
<br>
Has anyone used Erlsom for sax parsing straight from a file, i.e. without<br>
using file:read_file to load the whole file? I can't seem to find an<br>
appropriate call, and the docs don't seem to cover it.<br>
<br>
I have large files (6GB+) that I need to SAX parse, the main requirement<br>
being to parse them with about 1GB of memory.<br>
<br>
Erlsom works a treat with files that are a few hundred MB but requires the<br>
whole file to be loaded in memory, which kinda kills the main benefit of sax<br>
parsing in my opinion.<br>
<br>
I am not especially bothered about speed (it was suggested that I look<br>
at C parsers linked into Erlang), as this is for a data import process<br>
that will only happen rarely (mainly for development and testing
purposes).<br>
<br>
I still don't seem to be able to find any xmerl_eventp examples so if<br>
you have one please let me know.<br>
<br>
regards and thanks,<br>
<br>
Matt<br>
<br>
I haven't discarded the idea of using a C library; I'm just fairly new to Erlang<br>
and would prefer an Erlang solution if possible, as I don't want to<br>
venture into ports quite yet :)<br>
</blockquote>
</div>
<br>
</blockquote>
<br>
<br>
</body>
</html>