[erlang-questions] Erlsom it's so close to being SAXy

Willem de Jong w.a.de.jong@REDACTED
Fri Jun 20 08:24:11 CEST 2008


Hello Matt,

Yes, this is possible (and documented, but maybe not clear enough - are you
using the latest version? Did you get erlsom from CEAN? That version is
quite old).

There even is an example included in the distribution that shows how to do
it. I am copying part of it below.

Good luck,
Willem


%% Example to show how the Erlsom Sax parser can be used in combination
%% with a 'continuation function'. This enables parsing of very big
%% documents in a sort of streaming mode.
%%
%% When the sax parser reaches the end of a block of data, it calls the
%% continuation function. This should return the next block of data.
%%
%% The continuation function is a function that takes 2 arguments: Tail and
%% State.
%%    - Tail is the (short) list of characters that could not yet be parsed
%%      because it might be the start of a special token. Since this still
%%      has to be parsed, it should be put in front of the next block of data.
%%    - State is information that is passed by the parser to the callback
%%      functions transparently. This can be used to keep track of the
%%      location in the file etc.
%% The function returns {NewData, NewState}, where NewData is a list of
%% characters/unicode code points, and NewState the new value for the State.
%% (In the code below Tail and NewData are binaries rather than lists, because
%% the file is opened in binary mode.)

-export([run/0]).

%% 'chunk' is the number of bytes that is read at a time.
%% It should be tuned for the best result. (109 is obviously not a good value,
%% it should be bigger than that - try it out).
-define(chunk, 109).

run() ->
   F = fun count_books/2,   %% the callback function that handles the sax events
   G = fun continue_file/2, %% the callback function that returns the next
                            %% chunk of data
   %% open file
   {ok, Handle} = file:open(xml(), [read, raw, binary]),
   Position = 0,
   CState = {Handle, Position, ?chunk},
   SaxCallbackState = undefined,
   %% erlsom:parse_sax() returns {ok, FinalState, TrailingBytes},
   %% where FinalState (bound to Result below) is the value of the State
   %% after processing the last SAX event, and TrailingBytes is the rest of
   %% the input document that follows the last closing tag of the XML.
   {ok, Result, _TrailingBytes} =
     erlsom:parse_sax(<<>>, SaxCallbackState, F,
       [{continuation_function, G, CState}]),
   %% close file
   ok = file:close(Handle),

   %% Result is a list [{Date, Count}, ...]
   lists:foreach(fun({Date, Count}) ->
                  io:format("Date: ~p - count: ~p~n", [Date, Count])
                 end, Result),
   ok.

%% this is a continuation function that reads chunks of data
%% from a file.
continue_file(Tail, {Handle, Offset, Chunk}) ->
   %% read the next chunk
   case file:pread(Handle, Offset, Chunk) of
     {ok, Data} ->
       {<<Tail/binary, Data/binary>>, {Handle, Offset + Chunk, Chunk}};
     eof ->
       {Tail, {Handle, Offset, Chunk}}
   end.

count_books(startDocument, _) ->
  etc...
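
To see the continuation contract in isolation, here is a minimal sketch that does not use erlsom at all. The module name continuation_sketch, the function continue_binary/2 and the drain/3 loop are all made up for illustration: continue_binary/2 follows the same (Tail, State) -> {NewData, NewState} shape as continue_file/2 above but reads from an in-memory binary, and drain/3 is only a stand-in for the parser that pretends the last byte of every block cannot be parsed yet.

```erlang
-module(continuation_sketch).
-export([demo/0, continue_binary/2]).

%% Hypothetical stand-in for continue_file/2. The "file" is an in-memory
%% binary and State is {Source, Offset, Chunk}.
continue_binary(Tail, {Source, Offset, Chunk}) when Offset >= byte_size(Source) ->
    %% Nothing left to read: hand back only the unparsed tail.
    {Tail, {Source, Offset, Chunk}};
continue_binary(Tail, {Source, Offset, Chunk}) ->
    Len = min(Chunk, byte_size(Source) - Offset),
    Data = binary:part(Source, Offset, Len),
    %% The unparsed tail goes in front of the newly read data.
    {<<Tail/binary, Data/binary>>, {Source, Offset + Len, Chunk}}.

%% Stand-in for the parser (NOT how erlsom works internally): it "consumes"
%% everything except the last byte of each block and passes that byte back
%% as Tail on the next call.
drain(Tail, State, Acc) ->
    {Block, NewState} = continue_binary(Tail, State),
    case Block of
        Tail ->
            %% No new data arrived, so the input is exhausted.
            iolist_to_binary(lists:reverse([Tail | Acc]));
        _ ->
            Keep = byte_size(Block) - 1,
            <<Consumed:Keep/binary, Rest/binary>> = Block,
            drain(Rest, NewState, [Consumed | Acc])
    end.

%% Returns true if no byte was lost or duplicated across chunk boundaries.
demo() ->
    Source = <<"<a><b>text</b></a>">>,
    drain(<<>>, {Source, 0, 4}, []) =:= Source.
```

Calling continuation_sketch:demo() checks that the pieces handed out in 4-byte chunks, with the leftover tail prepended each time, add up to the original input. Unlike continue_file/2, this sketch advances the offset by the number of bytes actually read rather than by the chunk size; that is a choice made for the sketch, not something the original example requires.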


On 6/19/08, Matt Harrison <matt@REDACTED> wrote:
>
> See what I did there with the subject title :)
>
> All
>
> Has anyone used Erlsom for SAX parsing straight from a file, i.e. without
> using file:read_file to load a whole file? I can't seem to find an
> appropriate call, and the docs don't seem to cover it.
>
> I have large files (6 GB+) that I need to SAX parse; the main requirement
> is being able to parse them with about 1 GB of memory.
>
> Erlsom works a treat with files that are a few hundred MB but requires the
> whole file loaded in memory, which kinda kills the main benefit of SAX
> parsing in my opinion.
>
> I am not especially bothered about speed (it was suggested that I look
> at C parsers linked into Erlang), as this is for a data import process
> that will only happen rarely (mainly for development and testing purposes).
>
> I still don't seem to be able to find any xmerl_eventp examples so if
> you have one please let me know.
>
> regards and thanks,
>
> Matt
>
> I haven't discarded using a C library; I'm just fairly new to Erlang
> and would prefer an Erlang solution if possible, as I don't want to
> venture into ports quite yet :)