<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>

<head>

  <meta content="text/html;charset=ISO-8859-1" http-equiv="Content-Type">

  <title></title>

</head>

<body bgcolor="#ffffff" text="#000000">

It's official, Erlsom is Saxy... <br>

<br>

I have the latest version from sf.net, and have successfully parsed the

6.6 GB file in about 22 minutes. That is an average parsing rate of

5.5 MB/s, which on my T2500 laptop is pretty damn good. It was nice to note

that it was CPU bound, so, knowing the XML structure, I can

potentially split the file processing into blocks and multi-thread it.

Either way, 22 minutes is an acceptable time in my book.<br>

<br>

Thanks for your help Willem.<br>

<br>

Willem de Jong wrote:

<blockquote

 cite="mid:407d9ef80806192324x47c9800cp5dd1c853a4c6f2cd@mail.gmail.com"

 type="cite">

  <div>Hello Matt,</div>

  <div> </div>

  <div>Yes, this is possible (and documented, but maybe not clear

enough - are you using the latest version? Did you get erlsom from

CEAN? That version is quite old).</div>

  <div> </div>

  <div>There even is an example included in the distribution that shows

how to do it. I am copying part of it below.</div>

  <div> </div>

  <div>Good luck,</div>

  <div>Willem</div>

  <div> </div>

  <div>

  <p><font face="courier new,monospace">%% Example to show how the

Erlsom Sax parser can be used in combination<br>

%% with a 'continuation function'. This enables parsing of very big

documents<br>

%% in a sort of streaming mode.<br>

%%<br>

%% When the sax parser reaches the end of a block of data, it calls the<br>

%% continuation function. This should return the next block of data.<br>

%%<br>

%% the continuation function is a function that takes 2 arguments: Tail

and<br>

%% State.<br>

%%    - Tail is the (short) list of characters that could not yet be

parsed<br>

%%      because it might be a special token or not. Since this still

has to<br>

%%      be parsed, it should be put in front of the next block of data.<br>

%%    - State is information that is passed by the parser to the

callback<br>

%%      functions transparently. This can be used to keep track of the<br>

%%      location in the file etc.<br>

%% The function returns {NewData, NewState}, where NewData is a list of<br>

%% characters/unicode code points, and NewState the new value for the

State.<br>

 <br>

-export([run/0]).</font></p>

  <p><font face="courier new,monospace">%% 'chunk' is the number of

bytes that are read at a time.<br>

%% It should be tuned for the best result (109 is obviously not a good

value;<br>

%% it should be bigger than that - try it out).<br>

-define(chunk, 109).<br>

 <br>

run() -><br>

   F = fun count_books/2,   %% the callback function that handles the

sax events<br>

   G = fun continue_file/2, %% the callback function that returns the

next<br>

                            %% chunk of data<br>

   %% open file<br>

   {ok, Handle} = file:open(xml(), [read, raw, binary]),<br>

   Position = 0,<br>

   CState = {Handle, Position, ?chunk},<br>

   SaxCallbackState = undefined,<br>

   %% erlsom:parse_sax() returns {ok, FinalState, TrailingBytes},<br>

   %% where FinalState (bound to Result below) is the value of the State<br>

   %% after processing the last SAX event, and TrailingBytes is the rest of<br>

   %% the input document that follows the last closing tag of the XML.<br>

   {ok, Result, _TrailingBytes} =<br>

     erlsom:parse_sax(&lt;&lt;&gt;&gt;, SaxCallbackState, F,<br>

       [{continuation_function, G, CState}]),<br>

   %% close file<br>

   ok = file:close(Handle),<br>

 <br>

   %% Result is a list [{Date, Count}, ...]<br>

   lists:foreach(fun({Date, Count}) -><br>

                  io:format("Date: ~p - count: ~p~n", [Date, Count])<br>

                 end, Result),<br>

   ok.<br>

 <br>

%% this is a continuation function that reads chunks of data<br>

%% from a file.<br>

continue_file(Tail, {Handle, Offset, Chunk}) -><br>

   %% read the next chunk<br>

   case file:pread(Handle, Offset, Chunk) of<br>

     {ok, Data} -><br>

       {&lt;&lt;Tail/binary, Data/binary&gt;&gt;, {Handle, Offset + Chunk, Chunk}};<br>

     eof -><br>

       {Tail, {Handle, Offset, Chunk}}<br>

   end.</font></p>

  <p><font face="courier new,monospace">count_books(startDocument, _)

-><br>

  etc...</font></p>

  <br>

 </div>
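  <div>

  <p>The callback above is elided, so here is a minimal sketch of what a SAX callback of that shape might look like. It is not the example from the distribution: the module name sax_count_sketch, the function count_elements/2 and the helper bump/2 are hypothetical, and it assumes that start-of-element events have the form {startElement, Uri, LocalName, Prefix, Attributes}. It tallies element names into a list [{LocalName, Count}, ...], i.e. a list of 2-tuples like the one run() prints.</p>

  <pre>
-module(sax_count_sketch).
-export([count_elements/2]).

%% Hypothetical stand-in for the elided count_books/2.
%% The state is a list [{LocalName, Count}, ...].

%% Start of the document: begin with an empty tally.
count_elements(startDocument, _State) ->
    [];
%% Start of an element (assumed event shape): bump the count for its name.
count_elements({startElement, _Uri, LocalName, _Prefix, _Attributes}, Counts) ->
    bump(LocalName, Counts);
%% Every other event leaves the state unchanged.
count_elements(_OtherEvent, Counts) ->
    Counts.

%% Increment the counter for Key, adding it with a count of 1 if absent.
bump(Key, Counts) ->
    case lists:keyfind(Key, 1, Counts) of
        {Key, N} -> lists:keyreplace(Key, 1, Counts, {Key, N + 1});
        false    -> [{Key, 1} | Counts]
    end.
  </pre>

  <p>To try it, one could pass fun sax_count_sketch:count_elements/2 as F in run() above.</p>

  </div>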

  <div><span class="gmail_quote">On 6/19/08, <b

 class="gmail_sendername">Matt Harrison</b> &lt;<a

 moz-do-not-send="true" href="mailto:matt@lummie.co.uk">matt@lummie.co.uk</a>&gt;

wrote:</span>

  <blockquote class="gmail_quote"

 style="border-left: 1px solid rgb(204, 204, 204); margin: 0px 0px 0px 0.8ex; padding-left: 1ex;">See

what I did there with the subject title :)<br>

    <br>

All<br>

    <br>

Has anyone used Erlsom for SAX parsing straight from a file, i.e. without<br>

using file:read_file to load a whole file? I can't seem to find an<br>

appropriate call, and the docs don't seem to cover it.<br>

    <br>

I have large files (6 GB+) that I need to SAX parse, the main requirement<br>

being the ability to parse them with about 1 GB of memory.<br>

    <br>

Erlsom works a treat with files that are a few hundred MB but requires the<br>

whole file to be loaded into memory, which kinda kills the main benefit of SAX<br>

parsing in my opinion.<br>

    <br>

I am not especially bothered about speed (it was suggested that I look<br>

at C parsers linked into Erlang), as this is for a data import process<br>

that will only happen rarely (mainly for development and testing

purposes).<br>

    <br>

I still don't seem to be able to find any xmerl_eventp examples so if<br>

you have one please let me know.<br>

    <br>

regards and thanks,<br>

    <br>

Matt<br>

    <br>

I haven't ruled out using a C library; I'm just fairly new to Erlang<br>

and would prefer an Erlang solution if possible, as I don't want to<br>

venture into ports quite yet :)<br>


  </blockquote>

  </div>

  <br>

</blockquote>

<br>

<br>

</body>

</html>