[erlang-questions] Rant: I hate parsing XML with Erlang

Joe Armstrong erlang@REDACTED
Tue Oct 23 17:09:11 CEST 2007


I've seen some work on parsing badly formed HTML.

If I remember rightly, you keep a stack of the currently open tags,
plus separate stacks for things like <font>, <b> and <i> tags. So you end up
with several small stacks. Each new open tag pushes something onto these
stacks, and each close tag pops something off them.

When you hit raw data you pattern match over the stacks to figure out
what to do.
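
A minimal sketch of the stack idea described above, in Python for brevity
rather than Erlang. It uses the standard library's html.parser to tokenise
the input. The names StackScraper, FORMAT_TAGS and scrape, and the
(text, styles) output shape, are assumptions made for illustration and are
not taken from the work mentioned above.

```python
from html.parser import HTMLParser

# Hypothetical choice: which tags get their own small stack.
FORMAT_TAGS = ("font", "b", "i")


class StackScraper(HTMLParser):
    """Tracks open tags on stacks and labels each text run with active styles."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # every currently open tag, innermost last
        self.format_stacks = {t: [] for t in FORMAT_TAGS}  # one stack per tag
        self.runs = []  # collected (text, styles) pairs

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)
        if tag in self.format_stacks:
            self.format_stacks[tag].append(dict(attrs))

    def handle_endtag(self, tag):
        # Pop the formatting stack even when tags are closed out of order.
        if self.format_stacks.get(tag):
            self.format_stacks[tag].pop()
        # Drop the most recent matching open tag; ignore stray close tags.
        for i in range(len(self.open_tags) - 1, -1, -1):
            if self.open_tags[i] == tag:
                del self.open_tags[i]
                break

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        # Inspect the stacks: a style applies if its stack is non-empty.
        styles = tuple(t for t in FORMAT_TAGS if self.format_stacks[t])
        self.runs.append((text, styles))


def scrape(html):
    """Return the (text, styles) runs found in possibly badly nested HTML."""
    parser = StackScraper()
    parser.feed(html)
    parser.close()  # flush any text still buffered
    return parser.runs
```

For example, scrape("<p><b>bold <i>both</b> italic</i> plain") copes with
the mis-nested </b> and still reports "italic" as italic only. The sketch
does not special-case void tags such as <br>, which would stay on the
open-tag stack.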

As an aside, it occurred to me that Mozilla is probably pretty good at
screen scraping (or whatever it's called), so it should be possible to write
a Firefox extension to do this that talks through a socket to Erlang.
(Somebody told me this was easy, but they obviously knew more than I do.)

You could then use Erlang as a coordination language controlling
a load of Firefoxes on different machines, telling them to go get pages and
scrape the pages for data, which they send back to Erlang.

If we could use Firefox as a component then we could avoid reinventing
the wheel (again).

/Joe


On 10/23/07, Kevin A. Smith <kevin@REDACTED> wrote:
> Possibly. My understanding was that it still required well-formed
> documents to function. A lot of feeds feature varying amounts of
> "well-formedness", sadly.
>
> --Kevin
> On Oct 23, 2007, at 10:01 AM, Joel Reymont wrote:
>
> >
> > On Oct 23, 2007, at 2:46 PM, Kevin A. Smith wrote:
> >
> >> FWIW, I tried writing a very permissive feedparser but lost
> >> interest partially due to the ugliness of Erlang's XML parsing APIs.
> >
> > Running yaws_html:parse/1 on a sample RSS feed works just fine. I
> > suspect you can't get any more permissive than that.
> >
> > --
> > http://wagerlabs.com


