[erlang-questions] XML parser that works on binaries

Ulf Wiger (TN/EAB) ulf.wiger@REDACTED
Fri Nov 23 14:41:14 CET 2007


Joel Reymont wrote:
> Has anyone extracted the expat driver code from ejabberd?
> 
> Is there another XML parser that works on binaries?
> 
> Would hacking XMERL to work on binaries be a good idea?
> 
> 	Thanks, Joel

Yes and no. The Xmerl parser is not easily hacked to work
on binaries, and I think it would be a waste of time to
do so. It would require touching most of the code.

The right approach would be to redesign it with a proper
tokenizer. It would then be fairly easy to either support
two different tokenizers, or one that supports both string
and binary input.
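
To make the idea concrete, here is a minimal sketch of a tokenizer
that accepts either a string or a binary. It is not xmerl code and
not Joe's parser; the module name, the token shapes and the next/1
helper are all made up for illustration, and it assumes binaries are
UTF-8 encoded. It only knows about plain tags and text (no
attributes, entities, comments or CDATA). The point is that only
next/1 needs to know how the input is represented.

```erlang
-module(tok_sketch).
-export([tokens/1]).

%% Hypothetical sketch, not part of xmerl.
%% tokens/1 takes a string (list of code points) or a UTF-8 binary and
%% returns a list of {open, Name} | {close, Name} | {text, Chars}.
tokens(Input) ->
    tokens(Input, []).

tokens(Input, Acc) ->
    case next(Input) of
        eof ->
            lists:reverse(Acc);
        {$<, Rest} ->
            {Tok, Rest1} = tag(Rest),
            tokens(Rest1, [Tok | Acc]);
        {_, _} ->
            {Text, Rest1} = text(Input, []),
            tokens(Rest1, [{text, Text} | Acc])
    end.

%% Called after "<": "/name>" is a close tag, "name>" an open tag.
tag(Input) ->
    case next(Input) of
        {$/, Rest} ->
            {Name, Rest1} = until_gt(Rest, []),
            {{close, Name}, Rest1};
        _ ->
            {Name, Rest1} = until_gt(Input, []),
            {{open, Name}, Rest1}
    end.

%% Collect characters up to (and consuming) the next ">".
until_gt(Input, Acc) ->
    case next(Input) of
        {$>, Rest} -> {lists:reverse(Acc), Rest};
        {C, Rest}  -> until_gt(Rest, [C | Acc]);
        eof        -> erlang:error(unterminated_tag)
    end.

%% Collect characters up to the next "<" or end of input;
%% the "<" itself is left unconsumed.
text(Input, Acc) ->
    case next(Input) of
        {C, Rest} when C =/= $< -> text(Rest, [C | Acc]);
        _ -> {lists:reverse(Acc), Input}
    end.

%% The only function that depends on the input representation.
next([C | Rest])              -> {C, Rest};
next(<<C/utf8, Rest/binary>>) -> {C, Rest};
next([])                      -> eof;
next(<<>>)                    -> eof.
```

With this, tokens("<a>hi</a>") and tokens(<<"<a>hi</a>">>) produce
the same token list, and a parser layered on top never needs to care
which kind of input it was given.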

Of course, this would mean a complete redesign of xmerl_scan,
but that would have its advantages. I think the API can be kept
and improved upon (e.g. offering a lightweight parser API
that uses the same tokenizer and parser.)

The quick way to do this would be e.g. to buy Joe a beer
and convince him to share his xml parser code. Neither
xmerl nor erlsom use a tokenizer approach, and having looked
at Joe's stuff, I'm convinced that's the way to go.

Meanwhile, xmerl_eventp.erl can be used to see how an xmerl
wrapper can deal with binaries and hook into xmerl_scan.
The trick is to use the xmerl 'fetch' hook, which is called
each time xmerl needs more data. The fetch function needs to
buffer input until it finds some whitespace - see
xmerl_eventp:find_good_split/6. This has to do with entity
expansion, if I recall correctly, so if your XML contains no
entity references, it shouldn't matter. (Don't take my word for
it, though. It's been a few years since I last dabbled with it.)
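
The buffering step can be sketched roughly as follows. This is not
the actual xmerl_eventp code, and it leaves out the hook into
xmerl_scan entirely; the module and function names are made up for
illustration. It only shows the splitting idea: hand the parser
everything up to the last whitespace byte, and hold the tail back
until more data arrives.

```erlang
-module(split_sketch).
-export([split_at_last_ws/1, feed/2]).

%% Hypothetical sketch, not the xmerl_eventp implementation.
%% Returns {Ready, Pending}. Ready ends at the last whitespace byte
%% in Bin; Pending is the remainder, to be kept until more data comes.
%% If Bin contains no whitespace at all, everything stays pending.
split_at_last_ws(Bin) when is_binary(Bin) ->
    split_at_last_ws(Bin, byte_size(Bin)).

split_at_last_ws(Bin, 0) ->
    {<<>>, Bin};
split_at_last_ws(Bin, N) ->
    case binary:at(Bin, N - 1) of
        C when C =:= $\s; C =:= $\t; C =:= $\n; C =:= $\r ->
            <<Ready:N/binary, Pending/binary>> = Bin,
            {Ready, Pending};
        _ ->
            split_at_last_ws(Bin, N - 1)
    end.

%% Prepend what was held back last time to the newly read chunk,
%% then split again.
feed(Pending, Chunk) when is_binary(Pending), is_binary(Chunk) ->
    split_at_last_ws(<<Pending/binary, Chunk/binary>>).
```

For example, if a chunk ends in the middle of "&amp;", the partial
reference stays in Pending and is only passed on once the next chunk
has completed it and some later whitespace has been seen. At end of
input, whatever is still pending has to be flushed as it is.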

BR,
Ulf W


