[erlang-questions] xmerl scan stream

David Budworth dbudworth@REDACTED
Tue Dec 9 03:35:43 CET 2008


For what it's worth, I ended up going with erlsom for XML parsing because of
its speed and the ease of doing streaming parses.

I would just loop over my buffer with:
erlsom:simple_form(Buffer)

which returns
{ok,XmlStruct,Remainder}

or, if there isn't a complete doc in the buffer, it raises
{error,"Malformed:..."}, so basically I had

stream_loop(Data) ->
  case catch erlsom:simple_form(Data) of
    {ok,Xml,Rest} -> server ! {newXML,Xml}, stream_loop(Rest);
    {error,"Malformed"++_} -> server ! {needMore,Data}
  end.

(note: not real code, just an example)
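
A slightly fuller sketch of the same buffering idea, written so that it does
not depend on erlsom's exact return values. The names stream_buf, drain/2 and
ParseFun are made up for illustration; ParseFun is assumed to return
{ok, Doc, Rest} for a complete document and to raise or return anything else
when the buffer does not yet hold one.

```erlang
-module(stream_buf).
-export([drain/2]).

%% Apply ParseFun to Buffer repeatedly, collecting every complete document.
%% Returns {Docs, Remainder}. The caller appends the next chunk from the
%% socket to Remainder and calls drain/2 again.
drain(ParseFun, Buffer) ->
    drain(ParseFun, Buffer, []).

%% Nothing left to parse (string buffers only): stop.
drain(_ParseFun, [], Acc) ->
    {lists:reverse(Acc), []};
drain(ParseFun, Buffer, Acc) ->
    case catch ParseFun(Buffer) of
        {ok, Doc, Rest} ->
            %% One complete document; keep going with what is left.
            drain(ParseFun, Rest, [Doc | Acc]);
        _Incomplete ->
            %% Assumption: any other result or exception means "need more data".
            {lists:reverse(Acc), Buffer}
    end.
```

With erlsom, ParseFun would presumably be fun erlsom:simple_form/1, assuming
it behaves as described above. Note that treating every failure as "need
more" also hides genuinely malformed input, so a real router would want a
size limit or a timeout on the buffer.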

This may be possible with xmerl as well, I know I had both working with just
simple straight parsing, but I don't recall the "return struct + left over
bytes" as being a feature of the parser.
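
For what it's worth, xmerl_scan:string/1 is documented as returning
{xmlElement(), Rest}, so the leftover input should be available from xmerl
too. A minimal sketch of checking that (the module and function names are
made up for illustration):

```erlang
-module(xmerl_rest).
-export([first_doc/1]).
-include_lib("xmerl/include/xmerl.hrl").

%% Parse the first document in Text (a string) and return the name of its
%% root element together with whatever input follows that document.
first_doc(Text) ->
    {#xmlElement{name = Name}, Rest} = xmerl_scan:string(Text),
    {Name, Rest}.
```

Calling xmerl_rest:first_doc("<doc>a</doc><foo/>") should hand back the doc
element and leave the "<foo/>" part in Rest for the next call.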

In my testing, I think for my sample XML (smallish, 399 bytes) I was
getting about 90 usec/msg with erlsom and 125 usec/msg with xmerl (running on
a dual quad-core 3 GHz box with 8 GB RAM, not that that matters for a
single-threaded test like this).
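
A rough sketch of how a per-message figure like that can be measured with
timer:tc/1. This is not the harness I used; parse_bench and usec_per_msg/3
are hypothetical names, and ParseFun is whichever parser call is being timed.

```erlang
-module(parse_bench).
-export([usec_per_msg/3]).

%% Call ParseFun(Msg) N times and return the mean wall-clock time per call
%% in microseconds (as a float).
usec_per_msg(ParseFun, Msg, N) when is_integer(N), N > 0 ->
    {TotalUsec, ok} = timer:tc(fun() -> repeat(ParseFun, Msg, N) end),
    TotalUsec / N.

%% Run F(Msg) N times, discarding the results.
repeat(_F, _Msg, 0) ->
    ok;
repeat(F, Msg, N) ->
    _ = F(Msg),
    repeat(F, Msg, N - 1).
```

For example, parse_bench:usec_per_msg(fun xmerl_scan:string/1, Xml, 10000)
with Xml bound to the sample document as a string. Use a large N, since a
single call is too short to time reliably.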

Both are perfectly speedy (and faster than any Java parser I've used), but for
this particular app (an XML router) speed is king.

Plus, I found erlsom's "simple form" to be a bit easier to deal with at the
time.  Not sure why I thought that now, though.

It does solve your problem in that you get your XML doc(s) as soon as you
receive the bytes.

hope that helps,

-David


On Mon, Dec 8, 2008 at 6:10 PM, Peter Sabaini <peter@REDACTED> wrote:

> Hm, as an afterthought -- this still doesn't solve the original problem,
> does it?
>
> Say I have this on my input stream:
>
> % telnet localhost 2345
> Trying 127.0.0.1...
> Connected to localhost.local.
> Escape character is '^]'.
> <doc>
> a
> </doc>
> <foo />
>
>  -----
>
> then I only get the <doc>a</doc> structure back as soon as <foo /> is
> entered, correct?
>
> Thanks,
> peter.
>
>
>
> On Tuesday 09 December 2008 00:30:42 Peter Sabaini wrote:
> > On Monday 08 December 2008 23:53:59 Ulf Wiger wrote:
> > > True, you can't really use it directly, but you can copy
> > > the code. Basically, the read_chunk/2 function should
> > > be replaced by something along the lines of:
> > >
> > > read_chunk(Sofar) ->
> > >     receive
> > >         {tcp, _Socket, Bin} ->
> > >             {ok, iolist_to_binary([Sofar, Bin])};
> > >         {tcp_closed, _Socket} ->
> > >             eof
> > >     end.
> >
> > Ok...
> >
> > > (View this as pseudo code.)
> > >
> > > You should probably use gen_tcp:recv() instead, or
> > > at least an {active, once} socket.
> >
> > At the moment, this is for "trusted" clients only, so I can code this
> > rather liberally, without fear that somebody could abuse that -- is that
> > what you meant?
> >
> > > But you need to
> > > rewrite xmerl_eventp:stream/2 slightly.
> >
> > Ok, I'll try that and report any outcome, maybe other people find this
> > useful too.
> >
> > Thanks,
> > peter.
> >
> > > The complication, when you get down to it, is that the
> > > stream continuation fun must take care not to break
> > > up the stream in the wrong place. This is because xmerl
> > > doesn't use a proper tokenizer, but does a one-pass
> > > parse which relies rather heavily on pattern matching.
> > >
> > > This is what the find_good_split() function is for.
> > >
> > > BR,
> > > Ulf W
> > >
> > > 2008/12/8 Peter Sabaini <peter@REDACTED>:
> > > > On Monday 08 December 2008 23:09:39 Ulf Wiger wrote:
> > > >> Hi Peter,
> > > >>
> > > >> Have you looked at the module xmerl_eventp in xmerl?
> > > >>
> > > >> You might even be able to use it directly.
> > > >
> > > > > Yes, I suspected that this module might do what I need --
> > > > > unfortunately, being the thick-skulled newbie that I am, I haven't
> > > > > been able to figure out how... The docs here
> > > > > http://www.erlang.org/doc/man/xmerl_eventp.html are pretty succinct.
> > > > > Aren't the functions in xmerl_eventp for scanning files? Or could I
> > > > > use those also with a TCP socket?
> > > >
> > > > Thanks,
> > > > peter.
> > > >
> > > >> BR,
> > > >> Ulf W
> > > >>
> > > >> 2008/12/8 Peter Sabaini <peter@REDACTED>:
> > > >> > Hi list,
> > > >> >
> > > > >> > I am trying to get xmerl to parse a stream of data coming in via a
> > > > >> > TCP socket. The goal would be for xmerl to return xmlRecords as
> > > > >> > soon as one is complete.
> > > > >> >
> > > > >> > I use the continuation function option of xmerl and so far that
> > > > >> > works ok; unfortunately I only get an xmlRecord as soon as the
> > > > >> > next xml element starts. Is there a way to tell xmerl to "evaluate
> > > > >> > eagerly"?
> > > > >> >
> > > > >> > Below is the test code I used; any help much appreciated. Is this
> > > > >> > even possible? Or am I completely on the wrong track and should
> > > > >> > use a SAX model instead?
> > > >> >
> > > >> >  -- snip --
> > > >> >
> > > >> > -module(ap).
> > > >> > -compile(export_all).
> > > >> >
> > > >> > start_server() ->
> > > >> >    {ok, Listen} = gen_tcp:listen(2345, [binary, {packet, raw},
> > > >> >                                         {reuseaddr, true},
> > > >> >                                         {active, true}]),
> > > >> >    spawn(fun() -> par_connect(Listen) end).
> > > >> >
> > > >> > par_connect(Listen) ->
> > > >> >    {ok, _Socket} = gen_tcp:accept(Listen),
> > > >> >    spawn(fun() -> par_connect(Listen) end),
> > > >> >    io:format("par_c ~n", []),
> > > > >> >    X = xmerl_scan:string("", [{continuation_fun,
> > > > >> >                               fun continue/3}]),
> > > >> >    io:format("X: ~p ~n", [X]).
> > > >> >
> > > >> > continue(Continue, Exception, GlobalState) ->
> > > >> >    io:format("entered continue/3 ~n", []),
> > > >> >    receive
> > > >> >        {tcp, _Socket, Bin} ->
> > > >> >            Str = binary_to_list(Bin),
> > > >> >            io:format("got Str ~p ~n", [Str]),
> > > >> >            Continue(Str, GlobalState);
> > > >> >        {tcp_closed, _} ->
> > > >> >            io:format("Server socket closed~n" ),
> > > >> >            Exception(GlobalState)
> > > >> >    end.
> > > >> >
> > > >> > main() ->
> > > >> >    start_server().
> > > >> >
> > > >> >
> > > >> >  -- snip --
> > > >> >
> > > >> > --
> > > >> >  Peter Sabaini
> > > >> >  http://sabaini.at/
> > > >> >
> > > >> >
> > > >> > _______________________________________________
> > > >> > erlang-questions mailing list
> > > >> > erlang-questions@REDACTED
> > > >> > http://www.erlang.org/mailman/listinfo/erlang-questions
> > > >
> > > > --
> > > >  Peter Sabaini
> > > >  http://sabaini.at/
>
> --
>  Peter Sabaini
>  http://sabaini.at/
>