[erlang-questions] Rant: I hate parsing XML with Erlang

Joe Armstrong <>
Tue Oct 23 16:58:43 CEST 2007


This indicates that you don't want an XML parser. If the XML (HTML) is not
well formed then you probably just want a tag parser.

My guess is that if you tokenise the input into a sequence of tags and
then pattern match over the tags you'll get what you want.

The tokenised file looks like this:

     [...,
      {sTag,a,[{href,"..."}]},
      {sTag,img,[{src,"..."}]},
      {eTag,img},
      {eTag,a},
      {sTag,p,[]},
      {raw,"...."},
      ...
     ]
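
A rough sketch of a tokeniser that produces a list of this shape might look
like the code below. The module name tag_tokens and all helper names are made
up for illustration; this is not the code from any of the libraries mentioned
in this thread. It assumes the input is a flat string, and it deliberately
ignores comments, CDATA, single-quoted attribute values and a ">" inside an
attribute value.

```erlang
-module(tag_tokens).
-export([tokenise/1]).

%% tokenise(String) -> [{sTag, Name, Attrs} | {eTag, Name} | {raw, Text}]
%% Hypothetical helper: nothing is checked for well-formedness, so
%% missing end tags simply never show up in the output.
tokenise(Str) -> tokenise(Str, []).

tokenise([], Acc) ->
    lists:reverse(Acc);
tokenise([$<, $/ | T], Acc) ->
    {Body, Rest} = until($>, T),
    {Name, _} = lists:splitwith(fun is_name_char/1, skip_ws(Body)),
    tokenise(Rest, [{eTag, to_atom(Name)} | Acc]);
tokenise([$< | T], Acc) ->
    {Body, Rest} = until($>, T),
    {Name, AttrStr} = lists:splitwith(fun is_name_char/1, skip_ws(Body)),
    tokenise(Rest, [{sTag, to_atom(Name), attrs(AttrStr)} | Acc]);
tokenise(Str, Acc) ->
    %% Everything up to the next "<" is raw text.
    {Text, Rest} = lists:splitwith(fun(C) -> C =/= $< end, Str),
    tokenise(Rest, [{raw, Text} | Acc]).

%% Parse key="value", key=value and bare key attributes.
attrs(Str) ->
    case skip_ws(Str) of
        [] -> [];
        S ->
            case lists:splitwith(fun is_name_char/1, S) of
                {[], [_ | Rest]} ->
                    attrs(Rest);                 % skip a stray char such as "/"
                {Key, [$=, $" | Rest]} ->
                    {Val, Rest1} = until($", Rest),
                    [{to_atom(Key), Val} | attrs(Rest1)];
                {Key, [$= | Rest]} ->
                    {Val, Rest1} =
                        lists:splitwith(fun(C) -> C =/= $\s end, Rest),
                    [{to_atom(Key), Val} | attrs(Rest1)];
                {Key, Rest} ->
                    [{to_atom(Key), ""} | attrs(Rest)]
            end
    end.

%% Split at the first occurrence of Ch and drop Ch itself.
until(Ch, Str) ->
    case lists:splitwith(fun(C) -> C =/= Ch end, Str) of
        {Before, [Ch | After]} -> {Before, After};
        {Before, []}           -> {Before, []}
    end.

skip_ws(Str) ->
    lists:dropwhile(fun(C) -> C =:= $\s orelse C =:= $\t orelse
                              C =:= $\n orelse C =:= $\r end, Str).

is_name_char(C) ->
    (C >= $a andalso C =< $z) orelse (C >= $A andalso C =< $Z) orelse
    (C >= $0 andalso C =< $9) orelse C =:= $- orelse C =:= $_ orelse C =:= $:.

%% Tag and attribute names become lower-case atoms. Creating atoms from
%% untrusted input can fill the atom table, so real code might keep strings.
to_atom(Name) -> list_to_atom([lower(C) || C <- Name]).

lower(C) when C >= $A, C =< $Z -> C + 32;
lower(C) -> C.
```

Note that an unclosed tag such as the "p" in "<p>hi" just yields
{sTag,p,[]} followed by {raw,"hi"}; nothing complains about the missing
end tag, which is the point of using a tag parser here.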

  Then you write patterns to extract the content
      ...

   This is described here:

http://www.trapexit.org/forum/viewtopic.php?p=20670&highlight=&sid=ab39db1f70f1a3a68602f830091ea547
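
As an illustration of the pattern matching step (my own sketch, not the code
from the post linked above), a function that pulls the href out of every
"a" start tag in such a token list could be written like this. The module
name tag_extract and the function links/1 are invented for the example.

```erlang
-module(tag_extract).
-export([links/1]).

%% links(Tokens) -> [Url]
%% Walk a token list of the form
%%   {sTag, Name, Attrs} | {eTag, Name} | {raw, Text}
%% and collect the href attribute of every <a> start tag.
links([{sTag, a, Attrs} | Rest]) ->
    case lists:keyfind(href, 1, Attrs) of
        {href, Url} -> [Url | links(Rest)];
        false       -> links(Rest)      % <a> without an href, skip it
    end;
links([_Other | Rest]) ->
    %% Any other token (end tags, raw text, other start tags) is ignored,
    %% so unbalanced or missing end tags do not matter.
    links(Rest);
links([]) ->
    [].
```

Since the function only looks at one token at a time and skips whatever it
does not recognise, it does not care whether the tags are properly nested.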

   From what has been posted I get the following picture:

   1) There are lots of XML libraries around (I have a 6-pack);
       other people have mentioned libraries that I was unaware of
   2) The code for these cannot be found in one place
   3) The documentation for how to use these is non-existent

    The solution is

    - move all code to one site
    - organise it
    - document it

    This is a lot of work.

/Joe



On 10/23/07, Anders Nygren <> wrote:
> On 10/23/07, Joel Reymont <> wrote:
> >
> > On Oct 23, 2007, at 2:46 PM, Kevin A. Smith wrote:
> >
> > > FWIW, I tried writing a very permissive feedparser but lost
> > > interest partially due to the ugliness of Erlang's XML parsing APIs.
> >
> > Running yaws_html:parse/1 on a sample RSS feed works just fine. I
> > suspect you can't get anymore permissive than that.
>
> I tried to use it a couple of years ago and it was of no help to me since
> it actually requires correct HTML. Which the sites I tried to scrape
> refused to provide, (missing end tags and so on).
>
> Anders


