[erlang-questions] Rant: I hate parsing XML with Erlang
Joe Armstrong
erlang@REDACTED
Tue Oct 23 16:58:43 CEST 2007
This indicates that you don't want an XML parser. If the XML (HTML) is not
well formed then you probably just want a tag parser.
My guess is that if you tokenise the input into a sequence of tags and
then pattern match over the tags you'll get what you want.
The tokenised file looks like this:
[...,
 {sTag,a,[{href,"..."}]},
 {sTag,img,[{src,"..."}]},
 {eTag,img},
 {eTag,a},
 {sTag,p,[]},
 {raw,"...."},
 ...
]
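A tokeniser producing a list of this shape could be sketched roughly as
follows. This is only an illustration under stated assumptions: the module
name tag_tokens and all helper names are made up here, comments and CDATA
sections are not handled, and tag and attribute names are turned into atoms
with list_to_atom/1, which is unsafe on untrusted input because atoms are
never garbage collected.

```erlang
-module(tag_tokens).
-export([tokens/1]).

%% tokens(String) -> [{sTag, Name, Attrs} | {eTag, Name} | {raw, Text}]
%% Never fails on missing end tags: it only splits the input at < and >.
tokens([]) -> [];
tokens([$< | T]) ->
    {Inside, Rest} = upto($>, T),
    [tag(Inside) | tokens(Rest)];
tokens(Str) ->
    {Text, Rest} = lists:splitwith(fun(C) -> C =/= $< end, Str),
    [{raw, Text} | tokens(Rest)].

%% Split at the first Char and drop it. If Char is absent, take everything.
upto(Char, Str) ->
    case lists:splitwith(fun(C) -> C =/= Char end, Str) of
        {Before, [Char | After]} -> {Before, After};
        {Before, []}             -> {Before, []}
    end.

%% "/a" becomes an end tag; anything else is a start tag with attributes.
tag([$/ | T]) ->
    {Name, _} = word(T),
    {eTag, name(Name)};
tag(Inside) ->
    {Name, Rest} = word(Inside),
    {sTag, name(Name), attrs(Rest)}.

%% Leading run of characters up to whitespace or a slash.
word(Str) ->
    lists:splitwith(fun(C) -> not is_ws(C) andalso C =/= $/ end, Str).

%% Parse key="value", key='value', key=value and bare key (value "").
attrs(Str) ->
    case skip(Str) of
        [] -> [];
        S ->
            IsKeyChar = fun(C) ->
                not is_ws(C) andalso C =/= $= andalso C =/= $/
            end,
            {Key, Rest} = lists:splitwith(IsKeyChar, S),
            case skip_ws(Rest) of
                [$= | V] ->
                    {Val, Rest1} = value(skip_ws(V)),
                    [{name(Key), Val} | attrs(Rest1)];
                Rest1 ->
                    [{name(Key), ""} | attrs(Rest1)]
            end
    end.

%% A quoted value runs to the matching quote, an unquoted one to whitespace.
value([Q | T]) when Q =:= $"; Q =:= $' -> upto(Q, T);
value(T) -> lists:splitwith(fun(C) -> not is_ws(C) end, T).

is_ws(C) -> lists:member(C, " \t\r\n").

skip_ws(S) -> lists:dropwhile(fun is_ws/1, S).

%% Skip whitespace and the slash of a self-closing tag such as <br/>.
skip(S) -> lists:dropwhile(fun(C) -> is_ws(C) orelse C =:= $/ end, S).

%% Lower-cased atom. Assumption: input is trusted enough for list_to_atom/1.
name(S) -> list_to_atom([lower(C) || C <- S]).

lower(C) when C >= $A, C =< $Z -> C + 32;
lower(C) -> C.
```

Because nothing here checks that tags balance, input with missing end tags
still tokenises; deciding what to do about the imbalance is left to the
patterns written over the token list.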
Then you write patterns to extract the content
...
This is described here:
http://www.trapexit.org/forum/viewtopic.php?p=20670&highlight=&sid=ab39db1f70f1a3a68602f830091ea547
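As a small illustration of the pattern-matching step, the sketch below pulls
every href out of a token list of the shape shown above. The module name
tag_links, the function names and the sample data in demo/0 are assumptions
made for this example only.

```erlang
-module(tag_links).
-export([hrefs/1, demo/0]).

%% Collect the href of every <a> start tag, in document order.
%% All other tokens are skipped, so missing or unbalanced end tags
%% make no difference to the result.
hrefs([{sTag, a, Attrs} | T]) ->
    case lists:keyfind(href, 1, Attrs) of
        {href, Url} -> [Url | hrefs(T)];
        false       -> hrefs(T)
    end;
hrefs([_ | T]) -> hrefs(T);
hrefs([])      -> [].

%% Hypothetical sample input, written out by hand.
demo() ->
    Tokens = [{sTag, a, [{href, "http://example.org/"}]},
              {sTag, img, [{src, "logo.png"}]},
              {eTag, img},
              {eTag, a},
              {sTag, p, []},
              {raw, "some text"}],
    hrefs(Tokens).
```

The catch-all clause is what makes this approach tolerant of bad markup:
the function only looks at the tokens it cares about and ignores the rest.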
From what has been posted I get the following picture:
1) There are lots of XML libraries around (I have a 6-pack);
   other people have mentioned libraries that I was unaware of.
2) The code for these cannot be found in one place.
3) The documentation for how to use these is non-existent.
The solution is:
- move all code to one site
- organise it
- document it
This is a lot of work.
/Joe
On 10/23/07, Anders Nygren <anders.nygren@REDACTED> wrote:
> On 10/23/07, Joel Reymont <joelr1@REDACTED> wrote:
> >
> > On Oct 23, 2007, at 2:46 PM, Kevin A. Smith wrote:
> >
> > > FWIW, I tried writing a very permissive feedparser but lost
> > > interest partially due to the ugliness of Erlang's XML parsing APIs.
> >
> > Running yaws_html:parse/1 on a sample RSS feed works just fine. I
> > suspect you can't get anymore permissive than that.
>
> I tried to use it a couple of years ago and it was of no help to me since
> it actually requires correct HTML. Which the sites I tried to scrape
> refused to provide, (missing end tags and so on).
>
> Anders
> _______________________________________________
> erlang-questions mailing list
> erlang-questions@REDACTED
> http://www.erlang.org/mailman/listinfo/erlang-questions
>