Screen scraping
ke han
ke.han@REDACTED
Wed Aug 30 05:24:36 CEST 2006
Joel,
How about jungerl's www_tools?
Here is a snippet of its example code to show how easy it is to
tokenize an HTML stream or file and harvest the elements of interest:
%%********************************
%% Tokenise an HTML file, then harvest its links and image sources.
file(File) ->
    Toks = html_tokenise:file2toks(File),
    analyse(Toks).

%% Returns {Hrefs, Images} with duplicates removed.
%% remove_duplicates/1 is not shown in this snippet.
analyse(Toks) ->
    Hrefs   = [H || {tagStart, "a", L} <- Toks, {"href", H} <- L],
    Images1 = [S || {tagStart, "img", L} <- Toks, {"src", S} <- L],
    Images2 = [S || {tagStart, "body", L} <- Toks, {"background", S} <- L],
    {remove_duplicates(Hrefs), remove_duplicates(Images1 ++ Images2)}.
%%********************************
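
To try the comprehensions without www_tools installed, here is a rough,
self-contained sketch. It assumes the tokens look like
{tagStart, Name, Attrs}, which is only inferred from the patterns in the
snippet above. The module name scrape_sketch, the {other, ...} filler
tokens and the lists:usort/1 stand-in for remove_duplicates/1 are made up
for illustration and are not taken from www_tools.

```erlang
-module(scrape_sketch).
-export([analyse/1, demo/0]).

%% Same comprehensions as in the snippet above. Tokens or attributes
%% that do not match a generator pattern are silently skipped.
analyse(Toks) ->
    Hrefs   = [H || {tagStart, "a", L} <- Toks, {"href", H} <- L],
    Images1 = [S || {tagStart, "img", L} <- Toks, {"src", S} <- L],
    Images2 = [S || {tagStart, "body", L} <- Toks, {"background", S} <- L],
    {remove_duplicates(Hrefs), remove_duplicates(Images1 ++ Images2)}.

%% Assumed stand-in for the helper that the snippet does not show.
%% lists:usort/1 sorts the list and drops duplicates, so the original
%% order is not preserved.
remove_duplicates(List) ->
    lists:usort(List).

%% Hand-written token list in the assumed shape; the {other, ...}
%% entries are hypothetical filler that no pattern matches.
demo() ->
    Toks = [{tagStart, "body", [{"background", "bg.png"}]},
            {tagStart, "a", [{"class", "nav"}, {"href", "/index.html"}]},
            {other, "some text"},
            {tagStart, "img", [{"src", "logo.png"}, {"alt", "logo"}]},
            {tagStart, "a", [{"href", "/index.html"}]},
            {other, "end of body"}],
    analyse(Toks).
```

Calling scrape_sketch:demo() should give
{["/index.html"], ["bg.png", "logo.png"]}: the repeated link collapses to
one entry, and the body background and the img source end up in the same
list.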
ke han
On Aug 30, 2006, at 5:46 AM, Joel Reymont wrote:
> Does anyone have tools for screen scraping with Erlang?
>
> It's a combination of HTTP client with parsing and regexp-ing
> through HTML. Ruby has nice tools for this like hpricot and scrAPI
> and they parse HTML into a structure and let you query for elements
> based on their class, id, name, etc.
>
> Thanks, Joel
>
> --
> http://wagerlabs.com/
More information about the erlang-questions mailing list