Best practice for Erlang's process design to build a website-downloader (super simple crawler)
I Gusti Ngurah Oka Prinarjaya
okaprinarjaya@REDACTED
Tue Nov 12 11:06:56 CET 2019
Hi,
Wowww.. thank you very very much for sharing your experience and strategy
with me. I do really appreciate it.
Ok, I'll start writing my own website-crawler now.
Thank you
On Mon, 11 Nov 2019 at 18:06, Сергей Прохоров <
seriy.pr@REDACTED> wrote:
> Hi,
>
> I have quite a few years of experience writing web scrapers in Erlang. The
> design that I came to over time is the following:
>
> I have a top-level supervisor with the following 3 processes:
>
> - workers_supervisor (queue_name)
> - queue (queue_name)
> - scheduler (queue_name)
>
> All of them are registered based on the queue name, to make it possible to
> have multiple sets of spiders at the same time, but that's not essential if
> you only need one.
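>
> For illustration, a minimal sketch of what that top-level supervisor could
> look like; the module and child names are made up here, just to show the
> shape:
>
>     %% One instance of this supervisor per queue name.
>     -module(crawler_sup).
>     -behaviour(supervisor).
>
>     -export([start_link/1, init/1]).
>
>     start_link(QueueName) ->
>         supervisor:start_link(?MODULE, QueueName).
>
>     init(QueueName) ->
>         %% one_for_all: if the queue dies its state is lost, so the
>         %% scheduler and workers are restarted together with it.
>         SupFlags = #{strategy => one_for_all, intensity => 5, period => 10},
>         Children =
>             [#{id => workers_sup,
>                start => {crawler_workers_sup, start_link, [QueueName]},
>                type => supervisor},
>              #{id => queue,
>                start => {crawler_queue, start_link, [QueueName]}},
>              #{id => scheduler,
>                start => {crawler_scheduler, start_link, [QueueName]}}],
>         {ok, {SupFlags, Children}}.
>
> Each child then registers itself under a name derived from QueueName, so
> several independent sets of spiders can run side by side.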
>
> - workers supervisor is just a `simple_one_for_one` supervisor that
> supervises the worker processes.
> - queue - a gen_server. How it operates depends on the type of spider; if
> it is a recursive spider (downloads all the pages of a website), this
> process holds (see the sketch after this list):
> - a queue of URLs to download (a regular `queue:queue()`),
> - an `ets` table that holds the set of URLs ever added to the
> queue (to avoid downloading the same link more than once: the queue process
> only adds a URL to the queue if it is NOT already in this ETS table),
> - a dictionary / map of tasks currently in progress (taken by a
> worker but not yet finished) as a map `#{<worker pid monitoring reference>
> => task()}` - if a worker crashes, its task can be re-scheduled.
> - a list of worker pids subscribed to this queue (possibly monitored).
> - It may also contain a set of rules to exclude some pages (e.g.,
> based on robots.txt).
> - You should also have a URL normalisation function (e.g., to treat
> absolute and relative URLs as the same URL; decide whether
> `?filter=wasd&page=2` is the same as `?page=2&filter=wasd`; strip URL
> fragments like `#footer`, etc).
>
> The queue has quite a simple API: `push` / `push_many`, `subscribe` and
> `ack`. Worker gen_servers call `subscribe` and wait for a task message (it
> contains the URL and a unique reference). When the task is done, they call
> `ack(Reference)` and are ready for the next task.
> - scheduler: it's basically the entry point and the only "brain" of
> your spider. It takes in tasks from wherever you want to take them
> (PUB/SUB queues / CRON / HTTP API), puts the "seed" URLs into the queue and
> spawns workers (usually at start time, by calling the "workers supervisor"
> API; I have only ever used a fixed number of workers, to avoid overloading
> the website or the crawler); it can also monitor queue size progress;
> workers may report to the scheduler when a task is taken/done; it highly
> depends on your needs.
> - and of course the workers: gen_servers, supervised by the "workers
> supervisor"; their start is initiated by the scheduler (or may simply be
> fixed at app start time). At start time they call `queue:subscribe` and
> just wait for messages from the queue. When a message is received, the
> worker downloads the page, parses it, pushes all found URLs to the queue
> (the queue decides which URLs to accept and which to ignore) and saves the
> results to the database; it calls `queue:ack` at the end and waits for the
> next task (a sketch of such a worker appears further below).
> There is a choice: let the worker crash on errors, or wrap the work in a
> top-level try/catch. I prefer to catch, so as not to spam Erlang's crash
> logs, but it depends on your requirements and expected error rates.
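>
> To make the queue side of this concrete, here is a rough sketch of such a
> queue process; the module name, record fields and message shapes are
> illustrative and error handling is stripped down:
>
>     -module(crawler_queue).
>     -behaviour(gen_server).
>
>     -export([start_link/1, push_many/2, subscribe/1, ack/2]).
>     -export([init/1, handle_call/3, handle_cast/2, handle_info/2]).
>
>     -record(state,
>             {queue :: queue:queue(),    %% URLs waiting to be taken
>              seen :: ets:tid(),         %% every URL ever pushed to the queue
>              in_flight = #{} :: #{reference() => {pid(), term()}},
>              waiting = [] :: [pid()]}). %% idle subscribed workers
>
>     start_link(QueueName) ->
>         gen_server:start_link({local, QueueName}, ?MODULE, [], []).
>
>     push_many(QueueName, Urls) -> gen_server:cast(QueueName, {push, Urls}).
>     subscribe(QueueName)       -> gen_server:call(QueueName, subscribe).
>     ack(QueueName, Ref)        -> gen_server:cast(QueueName, {ack, Ref}).
>
>     init([]) ->
>         {ok, #state{queue = queue:new(),
>                     seen = ets:new(seen, [set, private])}}.
>
>     handle_call(subscribe, {Pid, _Tag}, State = #state{waiting = W}) ->
>         %% Remember the worker and try to hand it a task right away.
>         {reply, ok, dispatch(State#state{waiting = [Pid | W]})}.
>
>     handle_cast({push, Urls}, State = #state{seen = Seen, queue = Q0}) ->
>         %% Duplicate filter: accept only URLs never seen before.
>         New = [U || U <- Urls, ets:insert_new(Seen, {U})],
>         Q = lists:foldl(fun queue:in/2, Q0, New),
>         {noreply, dispatch(State#state{queue = Q})};
>     handle_cast({ack, Ref}, State = #state{in_flight = InFlight, waiting = W}) ->
>         %% Task done: forget it and mark the worker as idle again.
>         case maps:take(Ref, InFlight) of
>             {{Pid, _Url}, Rest} ->
>                 erlang:demonitor(Ref, [flush]),
>                 {noreply, dispatch(State#state{in_flight = Rest,
>                                                waiting = [Pid | W]})};
>             error ->
>                 {noreply, State}
>         end.
>
>     handle_info({'DOWN', Ref, process, _Pid, _Reason},
>                 State = #state{in_flight = InFlight, queue = Q}) ->
>         %% A worker died mid-task: put its URL back into the queue.
>         case maps:take(Ref, InFlight) of
>             {{_Pid, Url}, Rest} ->
>                 {noreply, dispatch(State#state{in_flight = Rest,
>                                                queue = queue:in(Url, Q)})};
>             error ->
>                 {noreply, State}
>         end.
>
>     %% Hand out tasks while there is both an idle worker and a queued URL.
>     dispatch(State = #state{waiting = [Pid | Rest], queue = Q0,
>                             in_flight = InFlight}) ->
>         case queue:out(Q0) of
>             {{value, Url}, Q} ->
>                 Ref = erlang:monitor(process, Pid),
>                 Pid ! {task, Ref, Url},
>                 dispatch(State#state{waiting = Rest, queue = Q,
>                                      in_flight = InFlight#{Ref => {Pid, Url}}});
>             {empty, _} ->
>                 State
>         end;
>     dispatch(State) ->
>         State.
>
> For the non-recursive kind mentioned below you would simply skip the `seen`
> check in the `push` clause.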
>
> This structure has proved to be very flexible and allows not only recursive
> crawlers but other kinds as well, e.g. non-recursive crawlers that take
> their entire URL set from an external source, just download what they were
> asked for and save it to the DB (in this case the scheduler fetches URLs
> from the task source and puts them into the queue; the queue doesn't have a
> duplicate filter).
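>
> In both cases the worker side is the same subscribe / fetch / ack loop.
> A rough sketch, where the HTTP, parsing and storage parts are placeholders
> and `crawler_queue` is the queue process sketched above:
>
>     -module(crawler_worker).
>     -behaviour(gen_server).
>
>     -export([start_link/1]).
>     -export([init/1, handle_call/3, handle_cast/2, handle_info/2]).
>
>     start_link(QueueName) ->
>         gen_server:start_link(?MODULE, QueueName, []).
>
>     init(QueueName) ->
>         ok = crawler_queue:subscribe(QueueName),
>         {ok, #{queue_name => QueueName}}.
>
>     handle_info({task, Ref, Url}, State = #{queue_name := QueueName}) ->
>         %% Catch errors instead of crashing so a bad page does not spam
>         %% the crash log; let it crash instead if you prefer.
>         try
>             {ok, Body} = fetch(Url),
>             %% Feeding discovered links back to the queue only matters for
>             %% the recursive kind; the queue decides which ones are new.
>             ok = crawler_queue:push_many(QueueName, extract_links(Body, Url)),
>             ok = save_to_db(Url, Body)
>         catch
>             Class:Reason ->
>                 logger:warning("~s failed: ~p:~p", [Url, Class, Reason])
>         end,
>         ok = crawler_queue:ack(QueueName, Ref),
>         {noreply, State}.
>
>     handle_call(_Req, _From, State) -> {reply, ignored, State}.
>     handle_cast(_Msg, State) -> {noreply, State}.
>
>     %% Placeholders: plug in your own HTTP client, parser and storage.
>     fetch(Url) ->
>         %% assumes the inets application is started (OTP's httpc)
>         case httpc:request(get, {Url, []}, [], [{body_format, binary}]) of
>             {ok, {{_Vsn, 200, _}, _Hdrs, Body}} -> {ok, Body};
>             Other -> {error, Other}
>         end.
>
>     extract_links(_Body, _BaseUrl) -> [].
>     save_to_db(_Url, _Body) -> ok.
>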
> The queue can have namespaces in case you want to parse some website more
> than once, and sometimes in parallel: for each task you use the task_id as
> a namespace, so the duplicate filter discards URLs based on the
> {task_id, URL} pair.
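>
> With namespaces the only real change in the queue sketch above is the key
> of the duplicate filter, something like:
>
>     %% Returns true (and records the pair) if this {TaskId, Url} has not
>     %% been queued before; TaskId identifies one crawl run.
>     is_new(SeenTab, TaskId, Url) ->
>         ets:insert_new(SeenTab, {{TaskId, Url}}).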
>
> Hope this will help a bit.
>