[erlang-questions] xmerl scan stream

Willem de Jong
Tue Dec 9 08:14:47 CET 2008


Hi,

I also think erlsom might be a good fit. I wouldn't do it the way David
describes, but it may be that David found a way to use erlsom that I hadn't
thought of.

I would use the erlsom SAX parser with a continuation function. Personally I
am a big fan of the SAX model, but it may take some time to get used to it.
There are some examples provided with the code (you can find it on
SourceForge).
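
For illustration, here is a minimal sketch of that approach. It assumes that
erlsom:parse_sax/4 accepts a {continuation_function, Fun, State} option and
reports {startElement, ...} events, as described in the erlsom
documentation; please check this against the version you have. The module
and function names (sax_stream_sketch, count_elements/1, handle_event/2,
next_chunk/2) are made up for the example, and a list of binary chunks
stands in for the data that would arrive on a socket:

```erlang
-module(sax_stream_sketch).
-export([count_elements/1]).

%% Hypothetical example: count the start tags in an XML document that is
%% delivered as a non-empty list of binary chunks.
%% Assumption: erlsom:parse_sax/4 returns {ok, Acc, Trailing} and takes
%% the option {continuation_function, Fun, State}.
count_elements([First | Rest]) ->
    {ok, Count, _Trailing} =
        erlsom:parse_sax(First, 0, fun handle_event/2,
                         [{continuation_function, fun next_chunk/2, Rest}]),
    Count.

%% SAX callback: called by the parser for every event, with the
%% accumulator (here a simple counter) threaded through.
handle_event({startElement, _Uri, _LocalName, _Prefix, _Attrs}, Count) ->
    Count + 1;
handle_event(_OtherEvent, Count) ->
    Count.

%% Continuation function: called when the parser runs out of input.
%% Tail is the part that could not be parsed yet; it has to be put in
%% front of the new data.
next_chunk(Tail, [Chunk | More]) ->
    {<<Tail/binary, Chunk/binary>>, More};
next_chunk(Tail, []) ->
    %% No more data: hand back the tail unchanged.
    {Tail, []}.
```

With a TCP socket, the continuation state could be the socket itself, and
next_chunk/2 could call gen_tcp:recv/2 instead of taking the next list
element. Since handle_event/2 is called as each tag is parsed, a complete
element can be acted on as soon as its end tag has arrived.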

In any case you have to make sure that you don't run into trouble if the
stream is quite long (or even endless). It is my impression that xmerl will
build a structure in memory that will continue to grow as long as you are
receiving data (but I may be wrong - I never managed to understand the
xmerl documentation).

Regards,
Willem

2008/12/9 David Budworth

> For what it's worth, I ended up going with erlsom for XML parsing due to
> speed and ease of streaming parses.
>
> I would just loop over my buffer with:
> erlsom:simple_form(Buffer)
>
> which returns
> {ok,Xml,Remainder}
>
> or if there isn't a complete doc in the buffer, it raises
> {error,"Malformed:..."}, so basically I had
>
> stream_loop(Data) ->
>   %% as described above, an incomplete doc raises {error,...}, hence the catch
>   case catch erlsom:simple_form(Data) of
>     {ok,Xml,Rest} -> server ! {newXML,Xml}, stream_loop(Rest);
>     {error,"Malformed"++_} -> server ! {needMore,Data}
>   end.
>
> (note: not real code, just an example)
>
> This may be possible with xmerl as well, I know I had both working with
> just simple straight parsing, but I don't recall the "return struct + left
> over bytes" as being a feature of the parser.
>
> In my testing, I think for my sample XML (smallish, 399 bytes), I was
> getting 90 µs/msg with erlsom and 125 µs/msg with xmerl (running on a
> dual quad-core 3 GHz box with 8 GB RAM, not that that matters for a
> single-threaded test like this).
>
> both perfectly speedy (and faster than any java parser I've used), but for
> this particular app (xml router) speed is king.
>
> plus, I found erlsom's "simple form" to be a bit easier to deal with at the
> time.  Not sure why I thought that now though.
>
> It does solve your problem in that you get your XML doc(s) as soon as you
> receive the bytes.
>
> hope that helps,
>
> -David
>
>
>
> On Mon, Dec 8, 2008 at 6:10 PM, Peter Sabaini wrote:
>
>> Hm, as an afterthought -- this still doesn't solve the original problem,
>> does it?
>>
>> Say I have this on my input stream:
>>
>> % telnet localhost 2345
>> Trying 127.0.0.1...
>> Connected to localhost.local.
>> Escape character is '^]'.
>> <doc>
>> a
>> </doc>
>> <foo />
>>
>>  -----
>>
>> then I only get the <doc>a</doc> structure back as soon as <foo /> is
>> entered, correct?
>>
>> Thanks,
>> peter.
>>
>>
>>
>> On Tuesday 09 December 2008 00:30:42 Peter Sabaini wrote:
>> > On Monday 08 December 2008 23:53:59 Ulf Wiger wrote:
>> > > True, you can't really use it directly, but you can copy
>> > > the code. Basically, the read_chunk/2 function should
>> > > be replaced by something along the lines of:
>> > >
>> > > read_chunk(Sofar) ->
>> > >     receive
>> > >         {tcp, _Socket, Bin} ->
>> > >             {ok, iolist_to_binary([Sofar, Bin])};
>> > >         {tcp_closed, _Socket} ->
>> > >             eof
>> > >     end.
>> >
>> > Ok...
>> >
>> > > (View this as pseudo code.)
>> > >
>> > > You should probably use gen_tcp:recv() instead, or
>> > > at least an {active, once} socket.
>> >
>> > At the moment, this is for "trusted" clients only, so I can code this
>> > rather liberally, without fear that somebody could abuse that -- is that
>> > what you meant?
>> >
>> > > But you need to
>> > > rewrite xmerl_eventp:stream/2 slightly.
>> >
>> > Ok, I'll try that and report any outcome, maybe other people find this
>> > useful too.
>> >
>> > Thanks,
>> > peter.
>> >
>> > > The complication, when you get down to it, is that the
>> > > stream continuation fun must take care not to break
>> > > up the stream in the wrong place. This is because xmerl
>> > > doesn't use a proper tokenizer, but does a one-pass
>> > > parse which relies rather heavily on pattern matching.
>> > >
>> > > This is what the find_good_split() function is for.
>> > >
>> > > BR,
>> > > Ulf W
>> > >
>> > > 2008/12/8 Peter Sabaini:
>> > > > On Monday 08 December 2008 23:09:39 Ulf Wiger wrote:
>> > > >> Hi Peter,
>> > > >>
>> > > >> Have you looked at the module xmerl_eventp in xmerl?
>> > > >>
>> > > >> You might even be able to use it directly.
>> > > >
>> > > > Yes, I suspected that this module might do what I need --
>> > > > unfortunately, being the thick-skulled newbie that I am, I haven't
>> > > > been able to figure out how... The docs here
>> > > > http://www.erlang.org/doc/man/xmerl_eventp.html are pretty
>> > > > succinct. Aren't the functions in xmerl_eventp for scanning files?
>> > > > Or could I use those also with a TCP socket?
>> > > >
>> > > > Thanks,
>> > > > peter.
>> > > >
>> > > >> BR,
>> > > >> Ulf W
>> > > >>
>> > > >> 2008/12/8 Peter Sabaini:
>> > > >> > Hi list,
>> > > >> >
>> > > >> > I am trying to get xmerl to parse a stream of data coming in via
>> > > >> > a TCP socket. The goal would be for xmerl to return xmlRecords as
>> > > >> > soon as one is complete.
>> > > >> >
>> > > >> > I use the continuation function option of xmerl and so far that
>> > > >> > works ok; unfortunately I only get an xmlRecord as soon as the
>> > > >> > next xml element starts. Is there a way to tell xmerl to "evaluate
>> > > >> > eagerly"?
>> > > >> >
>> > > >> > Below is the test code I used; any help much appreciated. Is this
>> > > >> > even possible? Or am I completely on the wrong track and should
>> > > >> > use a SAX model instead?
>> > > >> >
>> > > >> >  -- snip --
>> > > >> >
>> > > >> > -module(ap).
>> > > >> > -compile(export_all).
>> > > >> >
>> > > >> > start_server() ->
>> > > >> >    {ok, Listen} = gen_tcp:listen(2345, [binary, {packet, raw},
>> > > >> >                                         {reuseaddr, true},
>> > > >> >                                         {active, true}]),
>> > > >> >    spawn(fun() -> par_connect(Listen) end).
>> > > >> >
>> > > >> > par_connect(Listen) ->
>> > > >> >    {ok, _Socket} = gen_tcp:accept(Listen),
>> > > >> >    spawn(fun() -> par_connect(Listen) end),
>> > > >> >    io:format("par_c ~n", []),
>> > > >> >    X = xmerl_scan:string("", [{continuation_fun, fun continue/3}]),
>> > > >> >    io:format("X: ~p ~n", [X]).
>> > > >> >
>> > > >> > continue(Continue, Exception, GlobalState) ->
>> > > >> >    io:format("entered continue/3 ~n", []),
>> > > >> >    receive
>> > > >> >        {tcp, _Socket, Bin} ->
>> > > >> >            Str = binary_to_list(Bin),
>> > > >> >            io:format("got Str ~p ~n", [Str]),
>> > > >> >            Continue(Str, GlobalState);
>> > > >> >        {tcp_closed, _} ->
>> > > >> >            io:format("Server socket closed~n" ),
>> > > >> >            Exception(GlobalState)
>> > > >> >    end.
>> > > >> >
>> > > >> > main() ->
>> > > >> >    start_server().
>> > > >> >
>> > > >> >
>> > > >> >  -- snip --
>> > > >> >
>> > > >> > --
>> > > >> >  Peter Sabaini
>> > > >> >  http://sabaini.at/
>> > > >> >
>> > > >> >
>> > > >> > _______________________________________________
>> > > >> > erlang-questions mailing list
>> > > >> > 
>> > > >> > http://www.erlang.org/mailman/listinfo/erlang-questions
>> > > >
>> > > > --
>> > > >  Peter Sabaini
>> > > >  http://sabaini.at/
>>
>> --
>>  Peter Sabaini
>>  http://sabaini.at/
>>
>>
>>
>
>
>