[erlang-questions] Erlsom it's so close to being SAXy
Matt Harrison
matt@REDACTED
Fri Jun 20 21:42:12 CEST 2008
It's official, Erlsom is Saxy...
I have the latest version from sf.net, and have successfully parsed the
6.6 GB file in about 22 minutes, an average of about 5.5 MB/s, which on
my T2500 laptop is pretty damn good. It was nice to note that parsing
was CPU bound, so, knowing the XML structure, I could potentially split
the file into blocks and process them in parallel. Either way,
22 minutes is an acceptable time in my book.
Thanks for your help Willem.
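
The idea of splitting a CPU-bound job into blocks can be sketched roughly
as below. This is only an illustration under loud assumptions: the module
name, the scratch file name and count_block/2 are all hypothetical, and
counting "<" bytes is a stand-in for real per-block parsing. Real XML
cannot be cut at arbitrary byte offsets; the blocks would have to start
and end on element boundaries, which this sketch does not attempt.

```erlang
-module(parallel_blocks).
-export([run/0, count_in_blocks/3]).

%% Split [0, Size) into consecutive {Offset, Length} ranges of at most
%% BlockSize bytes.
ranges(Size, BlockSize) ->
    [{Offset, min(BlockSize, Size - Offset)}
     || Offset <- lists:seq(0, Size - 1, BlockSize)].

%% Hypothetical stand-in for per-block work: count the "<" bytes in one
%% range. Each worker opens its own handle and reads only its own range.
count_block(Path, {Offset, Length}) ->
    {ok, Handle} = file:open(Path, [read, raw, binary]),
    {ok, Data} = file:pread(Handle, Offset, Length),
    ok = file:close(Handle),
    length(binary:matches(Data, <<"<">>)).

%% Spawn one process per block and sum the results as they come back.
count_in_blocks(Path, Size, BlockSize) ->
    Parent = self(),
    Refs = [begin
                Ref = make_ref(),
                spawn_link(fun() ->
                               Parent ! {Ref, count_block(Path, Range)}
                           end),
                Ref
            end || Range <- ranges(Size, BlockSize)],
    lists:sum([receive {Ref, Count} -> Count end || Ref <- Refs]).

%% Small self-check on a scratch file (hypothetical file name).
run() ->
    Path = "parallel_blocks.tmp",
    Data = <<"<a><b/><b/><b/></a>">>,
    ok = file:write_file(Path, Data),
    Total = count_in_blocks(Path, byte_size(Data), 4),
    ok = file:delete(Path),
    Total.
```

In Erlang the "threads" would be lightweight processes, so one process
per block is cheap; whether it actually speeds things up depends on the
number of cores and on how fast the disk can feed them.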
Willem de Jong wrote:
> Hello Matt,
>
> Yes, this is possible (and documented, but maybe not clear enough -
> are you using the latest version? Did you get erlsom from CEAN? That
> version is quite old).
>
> There even is an example included in the distribution that shows how
> to do it. I am copying part of it below.
>
> Good luck,
> Willem
>
>
> %% Example to show how the Erlsom Sax parser can be used in combination
> %% with a 'continuation function'. This enables parsing of very big
> %% documents in a sort of streaming mode.
> %%
> %% When the sax parser reaches the end of a block of data, it calls the
> %% continuation function. This should return the next block of data.
> %%
> %% The continuation function is a function that takes 2 arguments:
> %% Tail and State.
> %% - Tail is the (short) list of characters that could not yet be
> %%   parsed because it might be a special token or not. Since this
> %%   still has to be parsed, it should be put in front of the next
> %%   block of data.
> %% - State is information that is passed by the parser to the callback
> %%   functions transparently. This can be used to keep track of the
> %%   location in the file etc.
> %% The function returns {NewData, NewState}, where NewData is a list of
> %% characters/unicode code points, and NewState the new value for the
> %% State.
>
> -export([run/0]).
>
> %% 'chunk' is the number of characters that is read at a time.
> %% should be tuned for the best result. (109 is obviously not a good
> %% value, it should be bigger than that - try it out).
> -define(chunk, 109).
>
> run() ->
>   F = fun count_books/2,   %% the callback function that handles the
>                            %% sax events
>   G = fun continue_file/2, %% the callback function that returns the
>                            %% next chunk of data
>   %% open file
>   {ok, Handle} = file:open(xml(), [read, raw, binary]),
>   Position = 0,
>   CState = {Handle, Position, ?chunk},
>   SaxCallbackState = undefined,
>   %% erlsom:parse_sax() returns {ok, FinalState, TrailingBytes},
>   %% where TrailingBytes is the rest of the input-document
>   %% that follows after the last closing tag of the XML, and Result
>   %% is the value of the State after processing the last SAX event.
>   {ok, Result, _TrailingBytes} =
>     erlsom:parse_sax(<<>>, SaxCallbackState, F,
>       [{continuation_function, G, CState}]),
>   %% close file
>   ok = file:close(Handle),
>
>   %% Result is a list [{track_id, count}, ...]
>   lists:foreach(fun({Date, Count}) ->
>                   io:format("Date: ~p - count: ~p~n", [Date, Count])
>                 end, Result),
>   ok.
>
> %% this is a continuation function that reads chunks of data
> %% from a file.
> continue_file(Tail, {Handle, Offset, Chunk}) ->
>   %% read the next chunk
>   case file:pread(Handle, Offset, Chunk) of
>     {ok, Data} ->
>       {<<Tail/binary, Data/binary>>, {Handle, Offset + Chunk, Chunk}};
>     eof ->
>       {Tail, {Handle, Offset, Chunk}}
>   end.
>
> count_books(startDocument, _) ->
> etc...
>
>
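
To see the Tail/State contract of the continuation function in isolation,
here is a minimal sketch that drives continue_file/2 without Erlsom. The
module name, the scratch file name and drain/3 are hypothetical; drain/3
is only a stand-in for the parser, "consuming" all but the last two bytes
of each block and handing those back as the Tail, the way a parser might
when it cannot yet decide whether they start a special token.

```erlang
-module(continuation_demo).
-export([run/0]).

%% Same shape as continue_file/2 in the example above: prepend the
%% unparsed Tail to the next chunk and advance the offset in the state.
continue_file(Tail, {Handle, Offset, Chunk}) ->
    case file:pread(Handle, Offset, Chunk) of
        {ok, Data} ->
            {<<Tail/binary, Data/binary>>, {Handle, Offset + Chunk, Chunk}};
        eof ->
            {Tail, {Handle, Offset, Chunk}}
    end.

%% Hypothetical stand-in for the parser. It keeps asking for more data
%% until the continuation function returns the Tail unchanged, which in
%% this sketch signals end of file.
drain(Tail, State, Acc) ->
    case continue_file(Tail, State) of
        {Tail, _SameState} ->
            %% nothing new was read: flush what is left
            <<Acc/binary, Tail/binary>>;
        {Block, NewState} ->
            Keep = min(2, byte_size(Block)),
            Used = byte_size(Block) - Keep,
            <<Consumed:Used/binary, NewTail/binary>> = Block,
            drain(NewTail, NewState, <<Acc/binary, Consumed/binary>>)
    end.

%% Write a small scratch file, read it back in 7-byte chunks through the
%% continuation function, and check that no bytes were lost or duplicated.
run() ->
    Path = "continuation_demo.tmp",
    Original = <<"<books><book date=\"2008\"/><book date=\"2007\"/></books>">>,
    ok = file:write_file(Path, Original),
    {ok, Handle} = file:open(Path, [read, raw, binary]),
    Result = drain(<<>>, {Handle, 0, 7}, <<>>),
    ok = file:close(Handle),
    ok = file:delete(Path),
    Result =:= Original.
```

The point of the check in run/0 is that putting the Tail in front of the
next block is what keeps the byte stream intact across chunk boundaries,
whatever chunk size is chosen.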
>
> On 6/19/08, Matt Harrison <matt@REDACTED> wrote:
>
> See what I did there with the subject title :)
>
> All,
>
> Has anyone used Erlsom for SAX parsing straight from a file, i.e.
> without using file:read_file to load the whole file? I can't seem to
> find an appropriate call, and the docs don't seem to cover it.
>
> I have large files (6 GB+) that I need to SAX parse, the main
> requirement being to parse them with about 1 GB of memory.
>
> Erlsom works a treat with files that are a few 100 MB but requires the
> whole file loaded in memory, which kinda kills the main benefit of SAX
> parsing in my opinion.
>
> I am not especially bothered about speed (it was suggested that I look
> at C parsers linked into Erlang), as this is for a data import process
> that will only happen rarely (mainly for development and testing
> purposes).
>
> I still don't seem to be able to find any xmerl_eventp examples, so if
> you have one please let me know.
>
> regards and thanks,
>
> Matt
>
> I haven't discarded using a C library; I'm just fairly new to Erlang
> and would prefer an Erlang solution if possible, as I don't want to
> venture into ports quite yet :)