[erlang-questions] Rant: I hate parsing XML with Erlang

Bob Ippolito bob@REDACTED
Tue Oct 23 17:30:25 CEST 2007


HTML Tidy (http://tidy.sourceforge.net/) is the library I've usually seen
used to turn arbitrary HTML into a valid document quickly, without
re-inventing the wheel. Much easier than trying to integrate with
Mozilla.

-bob

On 10/23/07, Joe Armstrong <erlang@REDACTED> wrote:
> I've seen some work on parsing badly formed HTML.
>
> If I remember rightly you keep a stack of the currently open tags,
> plus separate stacks for things like <font>, <b> and <i> tags. So you
> end up with several small stacks. Each new open or close tag pushes
> something onto, or pops something off, these stacks.
>
> When you hit raw data you pattern match over the stacks to figure out
> what to do.
>
> As an aside it occurred to me that Mozilla is probably pretty good at
> screen scraping (or whatever it's called) - so it should be possible to write
> a Firefox extension to do this that talks through a socket to Erlang.
> <somebody told me this was easy, but they obviously knew more than I do>
>
> You could then use Erlang as a coordination language controlling
> a load of Firefox instances on different machines, telling them to go
> get pages and scrape them for data, which they send back to Erlang.
>
> If we could use Firefox as a component then we could avoid reinventing
> the wheel (again).
>
> /Joe
>
>
> On 10/23/07, Kevin A. Smith <kevin@REDACTED> wrote:
> > Possibly. My understanding was that it still required well-formed
> > documents to function. A lot of feeds feature varying amounts of
> > "well-formedness", sadly.
> >
> > --Kevin
> > On Oct 23, 2007, at 10:01 AM, Joel Reymont wrote:
> >
> > >
> > > On Oct 23, 2007, at 2:46 PM, Kevin A. Smith wrote:
> > >
> > >> FWIW, I tried writing a very permissive feedparser but lost
> > >> interest partially due to the ugliness of Erlang's XML parsing APIs.
> > >
> > > Running yaws_html:parse/1 on a sample RSS feed works just fine. I
> > > suspect you can't get any more permissive than that.
> > >
> > > --
> > > http://wagerlabs.com
> > >
> >
> > _______________________________________________
> > erlang-questions mailing list
> > erlang-questions@REDACTED
> > http://www.erlang.org/mailman/listinfo/erlang-questions
> >
>
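
For what it's worth, the stack approach Joe describes above can be
sketched roughly like this. It is only an illustration under stated
assumptions, not anyone's actual implementation: it is written in Python
rather than Erlang, it leans on the standard library's lenient
html.parser tokenizer, and the names SoupScanner, scan, FORMATTING and
VOID are made up for the example. One stack holds the open structural
tags, each formatting tag (<font>, <b>, <i>) gets its own small stack,
and every run of raw text is recorded together with the state of those
stacks so later code can match on it.

```python
from html.parser import HTMLParser

# Assumption: which tags count as "formatting" and which never get closed.
FORMATTING = {"b", "i", "font"}
VOID = {"br", "hr", "img", "input", "meta", "link"}


class SoupScanner(HTMLParser):
    """Hypothetical helper: records each text run with the tags open around it."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.open_tags = []                     # stack of open structural tags
        self.fmt = {t: [] for t in FORMATTING}  # one small stack per formatting tag
        self.runs = []                          # (text, structural path, active formatting)

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return                              # nothing to push, never closed
        if tag in FORMATTING:
            self.fmt[tag].append(dict(attrs))
        else:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in FORMATTING:
            if self.fmt[tag]:                   # ignore a stray </b>, </i>, </font>
                self.fmt[tag].pop()
        elif tag in self.open_tags:
            # Pop back to the nearest matching open tag, implicitly
            # closing anything that was left open inside it.
            while self.open_tags.pop() != tag:
                pass
        # Otherwise: a close tag with no matching open tag is dropped.

    def handle_data(self, data):
        text = data.strip()
        if text:
            active = sorted(t for t, stack in self.fmt.items() if stack)
            self.runs.append((text, tuple(self.open_tags), active))


def scan(html):
    """Feed a whole document and return the recorded text runs."""
    scanner = SoupScanner()
    scanner.feed(html)
    scanner.close()
    return scanner.runs
```

Calling scan("<p><b>bold <i>both</b> italic</i> plain</p></div><p>next")
copes with the mis-nested <b>/<i> pair and the stray </div>. Keeping the
formatting tags off the main stack is what lets </b> close before </i>
without disturbing the structural path; a real scraper would dispatch on
the (path, formatting) pair at the point where this sketch just appends
to a list.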
