[erlang-questions] XML parser that works on binaries

Willem de Jong w.a.de.jong@REDACTED
Fri Nov 23 17:44:19 CET 2007

I have a version of erlsom that works on binaries. I made it because I
thought it would be faster, but it turned out that this was not the case. I
guess that this will change with release 12, and I am planning to revive it.

The current version works only on UTF-8 encoded binaries. I remember that I
hadn't quite figured out how to make a version for UTF-16 without
copying the entire SAX parser (and slightly modifying it in many places,
obviously). Actually, it would take two copies: one for big-endian UTF-16
and one for little-endian. That would not be a big problem, but it is just
not very nice.
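
One possible way around the duplication is to keep the byte-level
decoding in a single function and let the rest of the parser see only
code points. The sketch below is not how erlsom does it. It assumes an
OTP release whose bit syntax supports the utf8 and utf16 segment types,
and the module and function names are made up for illustration;
count_chars/2 merely stands in for the real parser.

```erlang
-module(decode_sketch).
-export([next_char/2, count_chars/2]).

%% Decode one code point from the front of the binary.
%% Returns {Char, Rest}, or 'done' when the binary is empty.
%% Malformed input raises function_clause in this sketch.
next_char(utf8,    <<C/utf8, Rest/binary>>)         -> {C, Rest};
next_char(utf16be, <<C/utf16-big, Rest/binary>>)    -> {C, Rest};
next_char(utf16le, <<C/utf16-little, Rest/binary>>) -> {C, Rest};
next_char(_Encoding, <<>>)                          -> done.

%% Stand-in for the parser: it only ever handles code points,
%% so the same clauses serve all three encodings.
count_chars(Encoding, Bin) -> count_chars(Encoding, Bin, 0).

count_chars(Encoding, Bin, N) ->
    case next_char(Encoding, Bin) of
        {_C, Rest} -> count_chars(Encoding, Rest, N + 1);
        done       -> N
    end.
```

The price is an extra function call and a dispatch on the encoding for
every character, which may well cost more than separate copies of the
parser would.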

Why do you want a parser that works on binaries?


On 11/23/07, Ulf Wiger (TN/EAB) <ulf.wiger@REDACTED> wrote:
> Joel Reymont wrote:
> > Has anyone extracted the expat driver code from ejabberd?
> >
> > Is there another XML parser that works on binaries?
> >
> > Would hacking XMERL to work on binaries be a good idea?
> >
> >       Thanks, Joel
>
> Yes and no. The Xmerl parser is not easily hacked to work
> on binaries, and I think it would be a waste of time to
> do so. It would require touching most of the code.
> The right approach would be to redesign it with a proper
> tokenizer. It would then be fairly easy to either support
> two different tokenizers, or one that supports both string
> and binary input.
>
> Of course, this would mean a complete redesign of xmerl_scan,
> but that would have its advantages. I think the API can be kept
> and improved upon (e.g. offering a lightweight parser API
> that uses the same tokenizer and parser.)
>
> The quick way to do this would be e.g. to buy Joe a beer
> and convince him to share his xml parser code. Neither
> xmerl nor erlsom use a tokenizer approach, and having looked
> at Joe's stuff, I'm convinced that's the way to go.
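
To make the tokenizer idea concrete, here is a toy sketch of a
tokenizer that accepts either a string or a UTF-8 binary. It is
neither Joe's code nor anything taken from xmerl or erlsom. The token
set (open, close, slash, text) is invented and far too small for real
XML, and all names are hypothetical. The point is only that next/1 is
the single place that knows how the input is represented.

```erlang
-module(tok_sketch).
-export([tokens/1]).

%% Input may be a string (list of characters) or a UTF-8 binary.
tokens(Input) -> tokens(Input, []).

tokens(Input, Acc) ->
    case next(Input) of
        eof        -> lists:reverse(Acc);
        {$<, Rest} -> tokens(Rest, [open | Acc]);
        {$>, Rest} -> tokens(Rest, [close | Acc]);
        {$/, Rest} -> tokens(Rest, [slash | Acc]);
        {C, Rest}  -> text(Rest, [C], Acc)
    end.

%% Collect a run of ordinary characters into one {text, String} token.
text(Input, Chars, Acc) ->
    case next(Input) of
        {C, Rest} when C =/= $<, C =/= $>, C =/= $/ ->
            text(Rest, [C | Chars], Acc);
        _ ->
            %% Delimiter or end of input: emit the token and let
            %% tokens/2 look at the same input again.
            tokens(Input, [{text, lists:reverse(Chars)} | Acc])
    end.

%% The only function that depends on the input representation.
next([C | Rest])              -> {C, Rest};
next(<<C/utf8, Rest/binary>>) -> {C, Rest};
next([])                      -> eof;
next(<<>>)                    -> eof.
```

With this in place, tok_sketch:tokens("<a/>") and
tok_sketch:tokens(<<"<a/>">>) produce the same token list, and a
parser built on top of it never needs to know which one it was given.
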
> Meanwhile, xmerl_eventp.erl can be used to see how an xmerl
> wrapper can deal with binaries and hook into xmerl_scan.
> The trick is to use the xmerl 'fetch' hook, which is called
> each time xmerl needs more data. The fetch function needs to
> buffer input until it finds some whitespace - see
> xmerl_eventp:find_good_split/6. This has to do with entity
> expansion, if I recall correctly, so if you don't have that
> in your XML, it shouldn't matter... (don't take my word for
> it, though. It's been a few years since I last dabbled with it.)
>
> BR,
> Ulf W
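
The buffering step described above can be sketched on its own, without
touching xmerl. This is not the xmerl_eventp code, just a standalone
illustration with hypothetical names: it splits a buffered binary
right after its last whitespace byte, so that a chunk handed to the
parser never ends in the middle of something like an entity reference.
It assumes UTF-8 or another ASCII-compatible single-byte encoding, and
binary:at/2 requires a reasonably recent OTP release.

```erlang
-module(split_sketch).
-export([split_at_last_ws/1]).

%% Split Bin just after its last whitespace byte.
%% Returns {Ready, Rest}: Ready can be handed to the parser now,
%% Rest stays in the buffer until more input arrives.
%% Returns 'more' if Bin holds no whitespace at all, meaning the
%% caller should read another chunk before trying again.
split_at_last_ws(Bin) ->
    split_at_last_ws(Bin, byte_size(Bin) - 1).

split_at_last_ws(_Bin, Pos) when Pos < 0 ->
    more;
split_at_last_ws(Bin, Pos) ->
    case binary:at(Bin, Pos) of
        C when C =:= $\s; C =:= $\t; C =:= $\n; C =:= $\r ->
            split_binary(Bin, Pos + 1);
        _ ->
            split_at_last_ws(Bin, Pos - 1)
    end.
```

For example, split_sketch:split_at_last_ws(<<"<a> <b">>) returns
{<<"<a> ">>, <<"<b">>}, so the incomplete "<b" waits for the next
chunk.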