[erlang-questions] html parsing in erlang?

Wed Jan 20 20:23:18 CET 2010

On Wed, Jan 20, 2010 at 7:41 AM, Carlo Cabanilla
<carlo.cabanilla@REDACTED> wrote:
> Hey,
>
> Can someone recommend a good html parser in Erlang? Something like Python's
> BeautifulSoup that won't choke on bad markup. Saw this thread on Trap Exit:
> http://www.trapexit.org/forum/viewtopic.php?p=38529 but didn't sound
> promising.

I'm a Python guy and love the web tools in that ecosystem. lxml is
another library that rocks!

I haven't found anything like this in Erlang. I did play around with
the mochiweb HTML parsing routine, but didn't use it (can't remember
why).

However, I do a fair amount of web scraping and have reverted to
regular expressions in Erlang (the re module is fine for that).
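
A rough sketch of the regular-expression approach, pulling href values
out of sloppy markup. It is written in Python only to keep the examples
in this post in one language; the pattern uses plain PCRE-style syntax,
so it should carry over to Erlang's re module with a caseless, global
match, but that is an assumption and has not been checked there. The
names HREF_RE and extract_hrefs are hypothetical.

```python
import re

# Matches <a ... href="..."> or <a ... href='...'>, ignoring case.
# Assumption: attribute values are quoted; unquoted hrefs are skipped.
HREF_RE = re.compile(r"""<a\s[^>]*?href\s*=\s*["']([^"']*)["']""", re.IGNORECASE)

def extract_hrefs(html):
    """Return every quoted href value found in anchor tags, in document order."""
    return HREF_RE.findall(html)
```

A pattern like this ignores document structure entirely, which is why
it does not choke on unclosed tags, and also why it can match anchors
inside comments or scripts.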

If your application can process the web content in batch, or by using
a disk-based queue/spool, you could use this:

- Grab + parse web content in Python
- Dump your output (presumably trees, maps, etc.) to an Erlang term
(see the erl_term module in py-interface
http://www.lysator.liu.se/~tab/erlang/py_interface/ - or BERT
http://bert-rpc.org/)
- Read the terms on disk from Erlang
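
A minimal sketch of the Python side of the steps above, under stated
assumptions. Instead of py_interface's erl_term module or BERT, it
writes plain-text Erlang terms, which file:consult/1 can read back on
the Erlang side. It uses the standard library's html.parser rather than
BeautifulSoup or lxml so that it stays self-contained. The names
LinkCollector, erl_binary, to_erlang_terms and scrape_to_terms, and the
{link, Href, Text} tuple shape, are hypothetical.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs from anchors, tolerating sloppy markup."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the anchor currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None

def erl_binary(s):
    """Render a string as an Erlang binary literal of UTF-8 byte values.

    Writing bytes as integers avoids any quoting or encoding questions.
    """
    return "<<" + ",".join(str(b) for b in s.encode("utf-8")) + ">>"

def to_erlang_terms(links):
    """One '{link, Href, Text}.' term per line, as file:consult/1 expects."""
    return "".join(
        "{link, %s, %s}.\n" % (erl_binary(href), erl_binary(text))
        for href, text in links
    )

def scrape_to_terms(html):
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return to_erlang_terms(parser.links)
```

Writing the returned string to a file, say links.terms, should let
Erlang load it with file:consult("links.terms"), which returns
{ok, Terms} on success.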

To avoid the intermediate step of writing to disk, you could set up
your Python app as a port, which I've found to work very well.
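
A minimal sketch of what the Python end of such a port could look like.
It assumes the Erlang side opens the port with
open_port({spawn, Command}, [{packet, 4}, binary]), so that every
message in either direction is preceded by a 4-byte big-endian length.
The function names read_packet, write_packet and serve, and the handler
callback, are hypothetical.

```python
import struct

def read_packet(stream):
    """Read one length-prefixed message; return None at end of input."""
    header = stream.read(4)
    if len(header) < 4:
        return None
    (length,) = struct.unpack(">I", header)
    payload = stream.read(length)
    if len(payload) < length:
        return None  # truncated message: treat as end of input
    return payload

def write_packet(stream, payload):
    """Write one message with its 4-byte big-endian length header."""
    stream.write(struct.pack(">I", len(payload)) + payload)
    stream.flush()  # the Erlang side only sees data once it is flushed

def serve(inp, out, handler):
    """Answer each request with handler(request) until the port closes."""
    while True:
        request = read_packet(inp)
        if request is None:
            break
        write_packet(out, handler(request))
```

In a real port program the loop would be started with something like
serve(sys.stdin.buffer, sys.stdout.buffer, handler), where handler
takes the request bytes (for example a URL) and returns the reply
bytes (for example the serialized scrape result).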

Of course, if you can get by with regular expressions, that presents
the fewest moving parts.

Garrett