Best practice for Erlang's process design to build a website-downloader (super simple crawler)

I Gusti Ngurah Oka Prinarjaya okaprinarjaya@REDACTED
Tue Nov 12 11:06:56 CET 2019


Hi,

Wow, thank you very much for sharing your experience and strategy
with me. I really appreciate it.

OK, I'll start writing my own website crawler now.


Thank you




On Mon, 11 Nov 2019 at 18:06, Сергей Прохоров <seriy.pr@REDACTED> wrote:

> Hi,
>
> I have quite a few years of experience writing web scrapers in Erlang. The
> design that I have settled on over time is the following:
>
> I have a top-level supervisor with the following three processes:
>
> - workers_supervisor (queue_name)
> - queue (queue_name)
> - scheduler (queue_name)
>
> All of them are registered based on the queue name, to make it possible to run
> multiple sets of spiders at the same time; this is not essential if you only
> need one.
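>
> As a rough illustration (not code from a real project), a minimal version of
> that supervision tree could look like the sketch below; the module names
> crawler_sup, crawler_workers_sup, crawler_queue and crawler_scheduler are
> placeholders I am assuming here:
>
>     -module(crawler_sup).
>     -behaviour(supervisor).
>
>     -export([start_link/1]).
>     -export([init/1]).
>
>     start_link(QueueName) ->
>         supervisor:start_link(?MODULE, QueueName).
>
>     init(QueueName) ->
>         %% one_for_all: if the queue dies, the workers' subscriptions are
>         %% gone too, so it is simplest to restart everything together.
>         SupFlags = #{strategy => one_for_all, intensity => 5, period => 10},
>         Children =
>             [#{id    => workers_sup,
>                start => {crawler_workers_sup, start_link, [QueueName]},
>                type  => supervisor},
>              #{id    => queue,
>                start => {crawler_queue, start_link, [QueueName]}},
>              #{id    => scheduler,
>                start => {crawler_scheduler, start_link, [QueueName]}}],
>         {ok, {SupFlags, Children}}.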
>
>    - The workers supervisor is just a `simple_one_for_one` supervisor that
>    supervises the worker processes (sketched after this list).
>    - queue - a gen_server (sketched after this list). How it operates depends
>    on the type of spider; if it is a recursive spider (one that downloads all
>    the pages of a website), this process holds:
>       - a queue of URLs to download (a regular `queue:queue()`),
>       - an `ets` table holding the set of URLs that were ever added to the
>       queue (to avoid downloading the same link more than once: the queue
>       process only adds a URL to the queue if it is NOT already in this ETS
>       table),
>       - a dictionary / map of tasks currently in progress (taken by a
>       worker but not yet finished), kept as a map `#{<worker pid monitoring
>       reference> => task()}` - if a worker crashes, its task can be
>       re-scheduled,
>       - a list of worker pids subscribed to this queue (possibly monitored).
>       - It may also contain a set of rules for excluding some pages (e.g.
>       based on robots.txt).
>       - You should also have a URL normalisation function (e.g. to treat
>       absolute and relative URLs as the same URL; to decide whether
>       `?filter=wasd&page=2` is the same as `?page=2&filter=wasd`; to strip
>       URL fragments such as `#footer`, etc.). A possible normalisation
>       function is sketched after this list.
>
>       The queue has quite a simple API: `push` / `push_many`, `subscribe` and
>       `ack`. Worker gen_servers call `subscribe` and then wait for a task
>       message (it contains a URL and a unique reference). When a task is done,
>       they call `ack(Reference)` and are ready to take the next task.
>    - scheduler: it is basically the entry point and the only "brain" of
>    your spider (sketched after this list). It takes in tasks from wherever you
>    want (PUB/SUB queues / CRON / an HTTP API), puts the "seed" URLs into the
>    queue and spawns the workers (usually at start time, by calling the workers
>    supervisor's API; I only ever used a fixed number of workers, to avoid
>    overloading either the website or the crawler). It can also monitor the
>    queue size; workers may report to the scheduler when a task is taken or
>    done. This part depends heavily on your actual needs.
>    - and of course the workers: gen_servers, supervised by the workers
>    supervisor; their start is initiated by the scheduler (or might simply be
>    fixed at app start time). At start time they call `queue:subscribe` and
>    just wait for messages from the queue. When a message is received, the
>    worker downloads the page, parses it, pushes all found URLs to the queue
>    (the queue decides which URLs to accept and which to ignore), saves the
>    results to the database, calls `queue:ack` at the end and waits for the
>    next task (see the worker sketch after this list).
>    There is a choice: let the worker crash on errors, or wrap the work in a
>    top-level try/catch. I prefer to catch, so as not to spam Erlang's crash
>    logs, but it depends on your requirements and expected error rates.
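>
> For reference, minimal sketches of the processes above follow. They are only a
> starting point under the placeholder module names assumed earlier, not a
> drop-in implementation. The workers supervisor is the smallest piece, a plain
> `simple_one_for_one` supervisor registered under a name derived from the queue
> name:
>
>     -module(crawler_workers_sup).
>     -behaviour(supervisor).
>
>     -export([start_link/1, start_worker/1]).
>     -export([init/1]).
>
>     start_link(QueueName) ->
>         supervisor:start_link({local, sup_name(QueueName)}, ?MODULE, []).
>
>     %% Called by the scheduler, once per worker it wants to spawn.
>     start_worker(QueueName) ->
>         supervisor:start_child(sup_name(QueueName), [QueueName]).
>
>     init([]) ->
>         SupFlags = #{strategy => simple_one_for_one,
>                      intensity => 10, period => 10},
>         Child = #{id      => worker,
>                   start   => {crawler_worker, start_link, []},
>                   restart => transient},
>         {ok, {SupFlags, [Child]}}.
>
>     %% Registered name derived from the queue name, so several crawler
>     %% instances can run side by side.
>     sup_name(QueueName) ->
>         list_to_atom(atom_to_list(QueueName) ++ "_workers_sup").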
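>
> The queue process could take roughly the following shape: it keeps the
> `queue:queue()` of pending URLs, the ETS duplicate filter, the in-progress map
> keyed by monitor reference and the list of idle subscribers, and exposes the
> `push` / `push_many` / `subscribe` / `ack` API described above (normalisation
> and robots.txt rules are left out of the sketch):
>
>     -module(crawler_queue).
>     -behaviour(gen_server).
>
>     -export([start_link/1, push/2, push_many/2, subscribe/1, ack/2]).
>     -export([init/1, handle_call/3, handle_cast/2, handle_info/2]).
>
>     -record(state,
>             {queue = queue:new() :: queue:queue(),
>              seen                :: ets:tid() | undefined,  %% URLs ever enqueued
>              in_progress = #{}   :: #{reference() => {pid(), term()}},
>              waiting = []        :: [pid()]}).              %% idle subscribers
>
>     start_link(QueueName) ->
>         gen_server:start_link({local, QueueName}, ?MODULE, [], []).
>
>     push(QueueName, Url)       -> gen_server:cast(QueueName, {push, [Url]}).
>     push_many(QueueName, Urls) -> gen_server:cast(QueueName, {push, Urls}).
>     subscribe(QueueName)       -> gen_server:call(QueueName, subscribe).
>     ack(QueueName, Ref)        -> gen_server:cast(QueueName, {ack, Ref}).
>
>     init([]) ->
>         {ok, #state{seen = ets:new(seen, [set, private])}}.
>
>     handle_call(subscribe, {Pid, _Tag}, State = #state{waiting = Waiting}) ->
>         {reply, ok, dispatch(State#state{waiting = [Pid | Waiting]})}.
>
>     handle_cast({push, Urls}, State = #state{queue = Q, seen = Seen}) ->
>         %% Duplicate filter: only URLs never seen before enter the queue.
>         New = [U || U <- Urls, ets:insert_new(Seen, {U})],
>         Q1 = lists:foldl(fun queue:in/2, Q, New),
>         {noreply, dispatch(State#state{queue = Q1})};
>     handle_cast({ack, Ref}, State = #state{in_progress = InProgress,
>                                            waiting = Waiting}) ->
>         erlang:demonitor(Ref, [flush]),
>         case maps:take(Ref, InProgress) of
>             {{Pid, _Url}, InProgress1} ->
>                 %% The worker is idle again; hand it the next task, if any.
>                 {noreply, dispatch(State#state{in_progress = InProgress1,
>                                                waiting = [Pid | Waiting]})};
>             error ->
>                 {noreply, State}
>         end.
>
>     handle_info({'DOWN', Ref, process, _Pid, _Reason},
>                 State = #state{queue = Q, in_progress = InProgress}) ->
>         %% A worker died mid-task: put its URL back to be re-scheduled.
>         case maps:take(Ref, InProgress) of
>             {{_Pid, Url}, InProgress1} ->
>                 {noreply, dispatch(State#state{queue = queue:in(Url, Q),
>                                                in_progress = InProgress1})};
>             error ->
>                 {noreply, State}
>         end.
>
>     %% Hand out tasks while there is both a queued URL and an idle worker.
>     dispatch(State = #state{waiting = []}) ->
>         State;
>     dispatch(State = #state{queue = Q, waiting = [Pid | Rest],
>                             in_progress = InProgress}) ->
>         case queue:out(Q) of
>             {{value, Url}, Q1} ->
>                 Ref = erlang:monitor(process, Pid),
>                 Pid ! {task, Ref, Url},
>                 dispatch(State#state{queue = Q1, waiting = Rest,
>                                      in_progress = InProgress#{Ref => {Pid, Url}}});
>             {empty, _} ->
>                 State
>         end.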
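>
> The URL normalisation could be built on OTP's `uri_string` module, for example
> as below (assuming a reasonably recent OTP; whether sorting query parameters
> is correct for a given site is a policy decision you have to make yourself):
>
>     -module(crawler_url).
>     -export([normalize/2]).
>
>     %% Resolve Url against the page it was found on, sort the query
>     %% parameters and drop the #fragment, so equivalent links compare equal.
>     normalize(Url, BaseUrl) ->
>         Abs  = uri_string:resolve(Url, BaseUrl),
>         Map  = uri_string:parse(uri_string:normalize(Abs)),
>         Map1 = case maps:find(query, Map) of
>                    {ok, Query} ->
>                        Sorted = lists:sort(uri_string:dissect_query(Query)),
>                        Map#{query => uri_string:compose_query(Sorted)};
>                    error ->
>                        Map
>                end,
>         unicode:characters_to_binary(
>             uri_string:recompose(maps:remove(fragment, Map1))).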
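>
> The scheduler, in its simplest form, just seeds the queue and starts a fixed
> pool of workers. Where the seed URLs come from (application env here, but
> CRON / HTTP / PUB-SUB in practice) and the pool size are assumptions of the
> sketch:
>
>     -module(crawler_scheduler).
>     -behaviour(gen_server).
>
>     -export([start_link/1]).
>     -export([init/1, handle_call/3, handle_cast/2]).
>
>     -define(NUM_WORKERS, 8).  %% fixed pool, to avoid overloading the site
>
>     start_link(QueueName) ->
>         gen_server:start_link(?MODULE, QueueName, []).
>
>     init(QueueName) ->
>         %% Put the "seed" URLs into the queue, then start the worker pool.
>         Seeds = application:get_env(my_crawler, seed_urls, []),
>         ok = crawler_queue:push_many(QueueName, Seeds),
>         [{ok, _} = crawler_workers_sup:start_worker(QueueName)
>          || _ <- lists:seq(1, ?NUM_WORKERS)],
>         {ok, QueueName}.
>
>     handle_call(_Req, _From, State) -> {reply, ok, State}.
>     handle_cast(_Msg, State) -> {noreply, State}.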
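>
> And a worker, using the try/catch variant. `fetch/1` uses plain `httpc` (the
> `inets` and `ssl` applications must be running), while `extract_links/2` and
> `save_result/2` are left as stubs because they depend entirely on your HTML
> parser and storage:
>
>     -module(crawler_worker).
>     -behaviour(gen_server).
>
>     -export([start_link/1]).
>     -export([init/1, handle_call/3, handle_cast/2, handle_info/2]).
>
>     start_link(QueueName) ->
>         gen_server:start_link(?MODULE, QueueName, []).
>
>     init(QueueName) ->
>         ok = crawler_queue:subscribe(QueueName),
>         {ok, QueueName}.
>
>     %% One task at a time: download, parse, feed new URLs back, then ack.
>     handle_info({task, Ref, Url}, QueueName) ->
>         try
>             {ok, Body} = fetch(Url),
>             Links = extract_links(Body, Url),
>             ok = save_result(Url, Body),
>             ok = crawler_queue:push_many(QueueName, Links)
>         catch
>             %% Catch rather than crash, to keep the crash log quiet;
>             %% the failing URL is simply dropped.
>             Class:Reason ->
>                 logger:warning("crawl of ~s failed: ~p:~p",
>                                [Url, Class, Reason])
>         after
>             crawler_queue:ack(QueueName, Ref)
>         end,
>         {noreply, QueueName}.
>
>     handle_call(_Req, _From, State) -> {reply, ok, State}.
>     handle_cast(_Msg, State) -> {noreply, State}.
>
>     fetch(Url) ->
>         {ok, {{_, 200, _}, _Hdrs, Body}} =
>             httpc:request(get, {unicode:characters_to_list(Url), []},
>                           [], [{body_format, binary}]),
>         {ok, Body}.
>
>     %% Stubs: a real crawler plugs in an HTML parser and a database here.
>     extract_links(_Body, _BaseUrl) -> [].
>     save_result(_Url, _Body) -> ok.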
>
> This structure has proved to be very flexible and allows not only recursive
> crawlers but also other kinds, e.g. non-recursive ones that take their entire
> URL set from an external source, just download what they were asked for and
> save it to the DB (in this case the scheduler fetches URLs from the task
> source and puts them into the queue, and the queue does not use a duplicate
> filter).
> The queue can also have namespaces, in case you want to crawl some website
> more than once and sometimes in parallel: for each task you use the task_id
> as a namespace, so the duplicate filter discards URLs based on the
> {task_id, URL} pair.
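>
> In the queue sketch above, that namespaced filter would just mean keying the
> `seen` ETS table on the pair instead of the bare URL, along these lines
> (TaskId being whatever identifier your task() carries):
>
>     %% duplicate filter variant keyed on {TaskId, Url}
>     New = [U || U <- Urls, ets:insert_new(Seen, {{TaskId, U}})],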
>
> Hope this will help a bit.
>

