Screen scraping

ke han <>
Wed Aug 30 05:24:36 CEST 2006


Joel,
How about jungerl's www_tools?

Here is a snippet from its example code showing how easy it is to
tokenize an HTML stream or file and harvest the elements of interest:

%%********************************
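%% Tokenise an HTML file and extract its links and images.
file(File) ->
     Toks = html_tokenise:file2toks(File),
     analyse(Toks).

%% Returns {Hrefs, Images}, each with duplicates removed.
%% remove_duplicates/1 is not shown in this snippet.
analyse(Toks) ->
     %% href attribute of every <a> start tag
     Hrefs = [H || {tagStart, "a", L} <- Toks, {"href", H} <- L],
     %% src attribute of every <img> start tag
     Images1 = [S || {tagStart, "img", L} <- Toks, {"src", S} <- L],
     %% background attribute of the <body> start tag, if present
     Images2 = [S || {tagStart, "body", L} <- Toks, {"background", S} <- L],
     {remove_duplicates(Hrefs), remove_duplicates(Images1++Images2)}.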
%%********************************
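
To try the comprehensions without www_tools installed, something like the
following self-contained sketch should work. The token list is hand-written
in the {tagStart, Name, Attrs} shape used above; the {other, ...} token is a
made-up placeholder for anything that is not a matching start tag, and
remove_duplicates/1 here is my own stand-in built on lists:usort/1, not
necessarily what the www_tools example uses. The module name is arbitrary.

```erlang
-module(scrape_sketch).
-export([analyse/1, demo/0]).

%% Hypothetical stand-in for the helper the snippet calls:
%% lists:usort/1 sorts the list and drops duplicate entries.
remove_duplicates(L) ->
    lists:usort(L).

%% Same comprehensions as in the snippet above. A token that does not
%% match a generator pattern is silently skipped, so no explicit
%% filtering of other token types is needed.
analyse(Toks) ->
    Hrefs = [H || {tagStart, "a", L} <- Toks, {"href", H} <- L],
    Images1 = [S || {tagStart, "img", L} <- Toks, {"src", S} <- L],
    Images2 = [S || {tagStart, "body", L} <- Toks, {"background", S} <- L],
    {remove_duplicates(Hrefs), remove_duplicates(Images1 ++ Images2)}.

%% Hand-written tokens standing in for html_tokenise output.
demo() ->
    Toks = [{tagStart, "body", [{"background", "bg.png"}]},
            {tagStart, "a", [{"href", "/index.html"}]},
            {other, "placeholder token, shape is an assumption"},
            {tagStart, "img", [{"src", "logo.png"}, {"alt", "logo"}]},
            {tagStart, "a", [{"href", "/index.html"}]}],
    analyse(Toks).
```

Calling scrape_sketch:demo() returns
{["/index.html"], ["bg.png","logo.png"]}: the repeated link collapses to one
entry and the body background is merged into the image list.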

ke han



On Aug 30, 2006, at 5:46 AM, Joel Reymont wrote:

> Does anyone have tools for screen scraping with Erlang?
>
> It's a combination of HTTP client with parsing and regexp-ing
> through HTML. Ruby has nice tools for this like hpricot and scrAPI
> and they parse HTML into a structure and let you query for elements
> based on their class, id, name, etc.
>
> 	Thanks, Joel
>
> --
> http://wagerlabs.com/




More information about the erlang-questions mailing list