[erlang-questions] Erlsom it's so close to being SAXy

Matt Harrison matt@REDACTED
Fri Jun 20 21:42:12 CEST 2008


It's official, Erlsom is Saxy...

I have the latest version from sf.net, and have successfully parsed the 
6.6 GB file in about 22 minutes. That is an average of about 5.5 MB/s, 
which on my T2500 laptop is pretty damn good. It was nice to note that 
parsing was CPU bound, so, knowing the XML structure, I can potentially 
split the file processing into blocks and multi-thread it. Either way, 
22 minutes is an acceptable time in my book.

Thanks for your help Willem.

Willem de Jong wrote:
> Hello Matt,
>  
> Yes, this is possible (and documented, but maybe not clear enough - 
> are you using the latest version? Did you get erlsom from CEAN? That 
> version is quite old).
>  
> There even is an example included in the distribution that shows how 
> to do it. I am copying part of it below.
>  
> Good  luck,
> Willem
>  
>
> %% Example to show how the Erlsom Sax parser can be used in combination
> %% with a 'continuation function'. This enables parsing of very big
> %% documents in a sort of streaming mode.
> %%
> %% When the sax parser reaches the end of a block of data, it calls the
> %% continuation function. This should return the next block of data.
> %%
> %% the continuation function is a function that takes 2 arguments:
> %% Tail and State.
> %%    - Tail is the (short) list of characters that could not yet be
> %%      parsed because it might be a special token or not. Since this
> %%      still has to be parsed, it should be put in front of the next
> %%      block of data.
> %%    - State is information that is passed by the parser to the callback
> %%      functions transparently. This can be used to keep track of the
> %%      location in the file etc.
> %% The function returns {NewData, NewState}, where NewData is a list of
> %% characters/unicode code points, and NewState the new value for the
> %% State.
>  
> -export([run/0]).
>
> %% 'chunk' is the number of characters that is read at a time.
> %% should be tuned for the best result. (109 is obviously not a good
> %% value, it should be bigger than that - try it out).
> -define(chunk, 109).
>  
> run() ->
>    F = fun count_books/2,   %% the callback function that handles the
>                             %% sax events
>    G = fun continue_file/2, %% the callback function that returns the next
>                             %% chunk of data
>    %% open file
>    {ok, Handle} = file:open(xml(), [read, raw, binary]),
>    Position = 0,
>    CState = {Handle, Position, ?chunk},
>    SaxCallbackState = undefined,
>    %% erlsom:parse_sax() returns {ok, FinalState, TrailingBytes},
>    %% where TrailingBytes is the rest of the input-document
>    %% that follows after the last closing tag of the XML, and FinalState
>    %% (bound to Result below) is the value of the State after processing
>    %% the last SAX event.
>    {ok, Result, _TrailingBytes} =
>      erlsom:parse_sax(<<>>, SaxCallbackState, F,
>        [{continuation_function, G, CState}]),
>    %% close file
>    ok = file:close(Handle),
>  
>    %% Result is a list [{track_id, count}, ...]
>    lists:foreach(fun({Date, Count}) ->
>                   io:format("Date: ~p - count: ~p~n", [Date, Count])
>                  end, Result),
>    ok.
>  
> %% this is a continuation function that reads chunks of data
> %% from a file.
> continue_file(Tail, {Handle, Offset, Chunk}) ->
>    %% read the next chunk
>    case file:pread(Handle, Offset, Chunk) of
>      {ok, Data} ->
>        {<<Tail/binary, Data/binary>>, {Handle, Offset + Chunk, Chunk}};
>      eof ->
>        {Tail, {Handle, Offset, Chunk}}
>    end.
>
> count_books(startDocument, _) ->
>   etc...
>
>
>  
> On 6/19/08, Matt Harrison <matt@REDACTED> wrote:
>
>     See what I did there with the subject title :)
>
>     All
>
>     Has anyone used Erlsom for SAX parsing straight from a file, i.e.
>     without using file:read_file to load the whole file? I can't seem
>     to find an appropriate call, and the docs don't seem to cover it.
>
>     I have large files (6 GB+) that I need to SAX parse, the main
>     requirement being to parse them with about 1 GB of memory.
>
>     Erlsom works a treat with files of a few hundred MB but requires
>     the whole file to be loaded in memory, which kinda kills the main
>     benefit of SAX parsing in my opinion.
>
>     I am not especially bothered about speed (it was suggested that I
>     look at C parsers linked into Erlang), as this is for a data import
>     process that will only happen rarely (mainly for development and
>     testing purposes).
>
>     I still don't seem to be able to find any xmerl_eventp examples so if
>     you have one please let me know.
>
>     regards and thanks,
>
>     Matt
>
>     I haven't discarded the idea of using a C library; I'm just fairly
>     new to Erlang and would prefer an Erlang solution if possible, as I
>     don't want to venture into ports quite yet :)
>     _______________________________________________
>     erlang-questions mailing list
>     erlang-questions@REDACTED
>     http://www.erlang.org/mailman/listinfo/erlang-questions
>
>
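The contract for the continuation function described in the quoted example (unparsed Tail goes in front of the next block) can be tried out without Erlsom. The following is only a sketch: the module name, read_all/2, loop/3 and demo/0 are hypothetical, and loop/3 is a stand-in for the parser that pretends the last byte of every block could not be parsed yet. continue_file/2 is copied from the quoted example.

```erlang
-module(chunk_reader_sketch).
-export([continue_file/2, read_all/2, demo/0]).

%% Continuation function, as in the quoted example: prepend the unparsed
%% Tail to the next chunk read from the file.
continue_file(Tail, {Handle, Offset, Chunk}) ->
    case file:pread(Handle, Offset, Chunk) of
        {ok, Data} ->
            {<<Tail/binary, Data/binary>>, {Handle, Offset + Chunk, Chunk}};
        eof ->
            {Tail, {Handle, Offset, Chunk}}
    end.

%% Hypothetical driver that plays the role of the parser.
read_all(Path, Chunk) ->
    {ok, Handle} = file:open(Path, [read, raw, binary]),
    try
        loop(<<>>, {Handle, 0, Chunk}, [])
    after
        file:close(Handle)
    end.

loop(Tail, CState, Acc) ->
    case continue_file(Tail, CState) of
        {Tail, _} ->
            %% Nothing new was read: end of file, so flush the Tail.
            iolist_to_binary(lists:reverse([Tail | Acc]));
        {Block, NewCState} ->
            %% Pretend everything except the last byte was parsed.
            KeepLen = byte_size(Block) - 1,
            <<Consumed:KeepLen/binary, NewTail/binary>> = Block,
            loop(NewTail, NewCState, [Consumed | Acc])
    end.

%% Writes a small file, reads it back 7 bytes at a time, and checks that
%% no bytes were lost or duplicated at the chunk boundaries.
demo() ->
    Path = filename:join(os:getenv("TMPDIR", "/tmp"),
                         "chunk_reader_sketch.xml"),
    Xml = <<"<library><book>Erlang</book><book>OTP</book></library>">>,
    ok = file:write_file(Path, Xml),
    Result = read_all(Path, 7),
    ok = file:delete(Path),
    Result =:= Xml.
```

The chunk size of 7 is deliberately tiny so that tags are split across blocks; for real files a much larger value is the sensible choice, as the comment in the quoted example says.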

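The quoted example leaves the body of count_books/2 elided. As an illustration of what a SAX callback of that shape might look like, here is a minimal sketch. It is not the callback from the Erlsom distribution: count_elements/2, the module name and demo/0 are hypothetical, the event tuples are written as I understand them from the Erlsom documentation and should be checked against the version in use, and maps need a much newer OTP (19 or later) than was around when this thread was written.

```erlang
-module(sax_count_sketch).
-export([count_elements/2, demo/0]).

%% Hypothetical SAX callback: called as Fun(Event, State) and returns the
%% new State. It counts how often each element name is opened.
count_elements(startDocument, _State) ->
    #{};                                        %% start with an empty map
count_elements({startElement, _Uri, LocalName, _Prefix, _Attrs}, Counts) ->
    maps:update_with(LocalName, fun(N) -> N + 1 end, 1, Counts);
count_elements(endDocument, Counts) ->
    %% The value returned for the last event becomes the final state.
    lists:sort(maps:to_list(Counts));
count_elements(_OtherEvent, State) ->
    State.                                      %% ignore everything else

%% Feeds a hand-written event list through the callback, so the sketch
%% can be tried without Erlsom installed.
demo() ->
    Events = [startDocument,
              {startElement, [], "library", [], []},
              {startElement, [], "book", [], []},
              {characters, "Erlang"},
              {endElement, [], "book", []},
              {startElement, [], "book", [], []},
              {endElement, [], "book", []},
              {endElement, [], "library", []},
              endDocument],
    lists:foldl(fun count_elements/2, undefined, Events).
```

With Erlsom, a callback like this would be passed in place of `fun count_books/2` in run/0 of the quoted example.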
