[erlang-questions] xmerl scan stream

Peter Sabaini <>
Tue Dec 9 19:02:06 CET 2008


Hi list,

thanks for all the suggestions -- I've got something that works for me now. In 
case anyone finds it useful (the docs are often a bit sparse), I've written 
down what I did here: 

http://sabaini.at/blog/erlang/learning-erlang-stream-xml.html

Thanks again!

peter.


On Tuesday 09 December 2008 03:35:43 David Budworth wrote:
> For what it's worth, i ended up going with erlsom for xml parsing due to
> speed and ease of streaming parses.
>
> I would just loop over my buffer with:
> erlsom:simple_form(Buffer)
>
> which returns
> {ok,Xml,Remainder}
>
> or if there isn't a complete doc in the buffer, it raises
> {error,"Malformed:..."}, so basically I had
>
> stream_loop(Data) ->
>   case catch erlsom:simple_form(Data) of
>     {ok,Xml,Rest} -> server ! {newXML,Xml}, stream_loop(Rest);
>     {error,"Malformed"++_} -> server ! {needMore,Data}
>   end.
>
> (note: not real code, just an example)
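
A minimal sketch of the "parse, emit, keep the remainder" loop described above. This is not David's code: the module name stream_sketch and the ParseFun argument are made up for illustration. ParseFun stands in for whatever parser call is used (for example a small wrapper around erlsom) and is assumed to return {ok, Doc, Rest} for a complete document, or the atom incomplete when it needs more bytes.

```erlang
-module(stream_sketch).
-export([drain/2]).

%% drain(Buffer, ParseFun) -> {Docs, Rest}
%% Pulls as many complete documents out of Buffer as ParseFun can find.
%% Docs is in arrival order; Rest is the unparsed tail, to be prepended
%% to the next chunk read from the socket.
%% ParseFun is a hypothetical stand-in for the real parser call.
drain(Buffer, ParseFun) ->
    drain(Buffer, ParseFun, []).

drain(Buffer, ParseFun, Acc) ->
    case ParseFun(Buffer) of
        {ok, Doc, Rest} ->
            %% One complete document: keep it and try again on the tail.
            drain(Rest, ParseFun, [Doc | Acc]);
        incomplete ->
            %% Not enough bytes yet: hand back what we have so far.
            {lists:reverse(Acc), Buffer}
    end.
```

The caller would send each element of Docs on to the consumer and call drain/2 again once more data has been appended to Rest.
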
>
> This may be possible with xmerl as well, I know I had both working with
> just simple straight parsing, but I don't recall the "return struct + left
> over bytes" as being a feature of the parser.
>
> in my testing, i think for my sample xml (smallish, 399 bytes), I was
> getting 90uSec / msg with erlsom and 125uSec / msg with xmerl. (running on
> dual quad 3ghz 8gb ram box, not that that matters for a single threaded
> test like this)
>
> both perfectly speedy (and faster than any java parser I've used), but for
> this particular app (xml router) speed is king.
>
> plus, I found erlsom's "simple form" to be a bit easier to deal with at the
> time.  Not sure why I thought that now though.
>
> It does solve your problem in that you get your xml doc(s) as soon as you
> receive the bytes.
>
> hope that helps,
>
> -David
>
> On Mon, Dec 8, 2008 at 6:10 PM, Peter Sabaini <> wrote:
> > Hm, as an afterthought -- this still doesn't solve the original
> > problem, does it?
> >
> > Say I have this on my input stream:
> >
> > % telnet localhost 2345
> > Trying 127.0.0.1...
> > Connected to localhost.local.
> > Escape character is '^]'.
> > <doc>
> > a
> > </doc>
> > <foo />
> >
> >  -----
> >
> > then I only get the <doc>a</doc> structure back as soon as <foo /> is
> > entered, correct?
> >
> > Thanks,
> > peter.
> >
> > On Tuesday 09 December 2008 00:30:42 Peter Sabaini wrote:
> > > On Monday 08 December 2008 23:53:59 Ulf Wiger wrote:
> > > > True, you can't really use it directly, but you can copy
> > > > the code. Basically, the read_chunk/2 function should
> > > > be replaced by something along the lines of:
> > > >
> > > > read_chunk(Sofar) ->
> > > >     receive
> > > >         {tcp, _Socket, Bin} ->
> > > >             {ok, iolist_to_binary([Sofar, Bin])};
> > > > >         {tcp_closed, _Socket} ->
> > > >             eof
> > > >     end.
> > >
> > > Ok...
> > >
> > > > (View this as pseudo code.)
> > > >
> > > > You should probably use gen_tcp:recv() instead, or
> > > > at least an {active, once} socket.
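
A hedged sketch of what the {active, once} variant of read_chunk could look like. The module name recv_sketch is made up, and the code assumes the socket was opened in binary mode, that the calling process is its controlling process, and that Sofar is a binary. With {active, once} the socket delivers a single message and then goes passive again, so the sender cannot flood the receiver's mailbox.

```erlang
-module(recv_sketch).
-export([read_chunk/2]).

%% read_chunk(Socket, Sofar) -> {ok, Binary} | eof | {error, Reason}
%% Asks the socket for exactly one message, waits for it, and returns
%% the new data appended to what has been collected so far.
read_chunk(Socket, Sofar) ->
    case inet:setopts(Socket, [{active, once}]) of
        ok ->
            receive
                {tcp, Socket, Bin} ->
                    {ok, <<Sofar/binary, Bin/binary>>};
                {tcp_closed, Socket} ->
                    eof;
                {tcp_error, Socket, Reason} ->
                    {error, Reason}
            end;
        {error, _} ->
            %% The socket is already gone: treat it as end of stream.
            eof
    end.
```
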
> > >
> > > At the moment, this is for "trusted" clients only, so I can code this
> > > rather liberally, without fear that somebody could abuse that -- is
> > > that what you meant?
> > >
> > > > But you need to
> > > > rewrite xmerl_eventp:stream/2 slightly.
> > >
> > > Ok, I'll try that and report any outcome, maybe other people find this
> > > useful too.
> > >
> > > Thanks,
> > > peter.
> > >
> > > > The complication, when you get down to it, is that the
> > > > stream continuation fun must take care not to break
> > > > up the stream in the wrong place. This is because xmerl
> > > > doesn't use a proper tokenizer, but does a one-pass
> > > > parse which relies rather heavily on pattern matching.
> > > >
> > > > This is what the find_good_split() function is for.
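
To illustrate the splitting problem, here is a minimal sketch of one way to pick a split point. It is not a copy of the xmerl_eventp code: the module and function names are made up, and treating the last whitespace byte as a safe boundary is an assumption made for the example. The idea is only that the parser gets everything up to the boundary and the tail is held back until more data arrives.

```erlang
-module(split_sketch).
-export([split_at_last_ws/1]).

%% split_at_last_ws(Bin) -> {Head, Tail}
%% Head ends with the last whitespace byte in Bin and can be handed to
%% the parser; Tail is kept and prepended to the next chunk. If Bin
%% contains no whitespace at all, Head is empty and everything is kept.
split_at_last_ws(Bin) when is_binary(Bin) ->
    split_at_last_ws(Bin, byte_size(Bin)).

split_at_last_ws(Bin, 0) ->
    {<<>>, Bin};
split_at_last_ws(Bin, N) ->
    %% binary:at/2 is zero-based, so N - 1 is the byte under inspection.
    case binary:at(Bin, N - 1) of
        C when C =:= $\s; C =:= $\t; C =:= $\n; C =:= $\r ->
            split_binary(Bin, N);
        _ ->
            split_at_last_ws(Bin, N - 1)
    end.
```
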
> > > >
> > > > BR,
> > > > Ulf W
> > > >
> > > > 2008/12/8 Peter Sabaini <>:
> > > > > On Monday 08 December 2008 23:09:39 Ulf Wiger wrote:
> > > > >> Hi Peter,
> > > > >>
> > > > >> Have you looked at the module xmerl_eventp in xmerl?
> > > > >>
> > > > >> You might even be able to use it directly.
> > > > >
> > > > > > Yes, I suspected that this module might do what I need --
> > > > > > unfortunately, being the thick-skulled newbie that I am, I haven't
> > > > > > been able to figure out how... The docs here
> > > > > > http://www.erlang.org/doc/man/xmerl_eventp.html are pretty
> > > > > > succinct. Aren't the functions in xmerl_eventp for scanning files?
> > > > > > Or could I use those also with a TCP socket?
> > > > >
> > > > > Thanks,
> > > > > peter.
> > > > >
> > > > >> BR,
> > > > >> Ulf W
> > > > >>
> > > > >> 2008/12/8 Peter Sabaini <>:
> > > > >> > Hi list,
> > > > >> >
> > > > >> > I am trying to get xmerl to parse a stream of data coming in via
> > > > >> > a TCP socket. The goal would be for xmerl to return xmlRecords
> > > > > >> > as soon as one is complete.
> > > > >> >
> > > > >> > I use the continuation function option of xmerl and so far that
> > > > > >> > works ok; unfortunately I only get an xmlRecord as soon as the
> > > > > >> > next xml element starts. Is there a way to tell xmerl to "evaluate
> > > > >> > eagerly"?
> > > > >> >
> > > > >> > Below is the test code I used; any help much appreciated. Is
> > > > >> > this even possible? Or am I completely on the wrong track and
> > > > > >> > should use a SAX model instead?
> > > > >> >
> > > > >> >  -- snip --
> > > > >> >
> > > > >> > -module(ap).
> > > > >> > -compile(export_all).
> > > > >> >
> > > > >> > start_server() ->
> > > > >> >    {ok, Listen} = gen_tcp:listen(2345, [binary, {packet, raw},
> > > > >> >                                         {reuseaddr, true},
> > > > >> >                                         {active, true}]),
> > > > >> >    spawn(fun() -> par_connect(Listen) end).
> > > > >> >
> > > > >> > par_connect(Listen) ->
> > > > >> >    {ok, _Socket} = gen_tcp:accept(Listen),
> > > > >> >    spawn(fun() -> par_connect(Listen) end),
> > > > >> >    io:format("par_c ~n", []),
> > > > > >> >    X = xmerl_scan:string("", [{continuation_fun,
> > > > > >> >                               fun continue/3}]),
> > > > > >> >    io:format("X: ~p ~n", [X]).
> > > > >> >
> > > > >> > continue(Continue, Exception, GlobalState) ->
> > > > >> >    io:format("entered continue/3 ~n", []),
> > > > >> >    receive
> > > > >> >        {tcp, _Socket, Bin} ->
> > > > >> >            Str = binary_to_list(Bin),
> > > > >> >            io:format("got Str ~p ~n", [Str]),
> > > > >> >            Continue(Str, GlobalState);
> > > > >> >        {tcp_closed, _} ->
> > > > >> >            io:format("Server socket closed~n" ),
> > > > >> >            Exception(GlobalState)
> > > > >> >    end.
> > > > >> >
> > > > >> > main() ->
> > > > >> >    start_server().
> > > > >> >
> > > > >> >
> > > > >> >  -- snip --
> > > > >> >
> > > > >> > --
> > > > >> >  Peter Sabaini
> > > > >> >  http://sabaini.at/
> > > > >> >
> > > > >> >
> > > > >> > _______________________________________________
> > > > >> > erlang-questions mailing list
> > > > >> > 
> > > > >> > http://www.erlang.org/mailman/listinfo/erlang-questions
> > > > >
> > > > > --
> > > > >  Peter Sabaini
> > > > >  http://sabaini.at/
> >
> > --
> >  Peter Sabaini
> >  http://sabaini.at/
> >
> >

-- 
  Peter Sabaini
  http://sabaini.at/
  



