Best practice for Erlang's process design to build a website-downloader (super simple crawler)

Mon Nov 11 12:05:50 CET 2019

Hi,

I have quite some years of experience writing web scrapers in Erlang. The
design that I arrived at over time is the following:

I have a top-level supervisor with the following 3 processes:

- workers_supervisor (queue_name)
- queue (queue_name)
- scheduler (queue_name)

All of them are registered based on queue name to make it possible to have
multiple sets of spiders at the same time, but it's not essential if you
only need one.
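
A minimal sketch of such a top-level supervisor could look like the code below. The module names (`spider_sup`, `spider_workers_sup`, `spider_queue`, `spider_scheduler`), the `reg_name/2` helper and the `one_for_all` restart strategy are assumptions made for illustration only; they are not part of the design described above.

```erlang
-module(spider_sup).
-behaviour(supervisor).

-export([start_link/1, init/1]).

%% QueueName is an atom such as my_site. It is used to derive the registered
%% name, so several sets of spiders can run side by side.
start_link(QueueName) ->
    supervisor:start_link({local, reg_name(QueueName, sup)}, ?MODULE, QueueName).

init(QueueName) ->
    %% Assumption: one_for_all, because the scheduler and the workers both
    %% depend on the state held by the queue process.
    SupFlags = #{strategy => one_for_all, intensity => 5, period => 10},
    Children = [
        %% Started first, so the scheduler can ask it to spawn workers.
        #{id => workers_supervisor,
          start => {spider_workers_sup, start_link, [QueueName]},
          type => supervisor},
        #{id => queue,
          start => {spider_queue, start_link, [QueueName]}},
        #{id => scheduler,
          start => {spider_scheduler, start_link, [QueueName]}}
    ],
    {ok, {SupFlags, Children}}.

%% Hypothetical helper: my_site + sup -> my_site_sup.
reg_name(QueueName, Role) ->
    list_to_atom(atom_to_list(QueueName) ++ "_" ++ atom_to_list(Role)).
```

Each child receives the queue name, so it can register itself under a name derived from it in the same way.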

   - workers supervisor is just a `simple_one_for_one` supervisor that
   supervises worker processes.
   - queue - gen_server. How it operates depends on the type of the
   spider; if it is a recursive spider (downloads all the pages of a website),
   this process holds:
      - a queue of urls to download (regular `queue:queue()`),
      - an `ets` table that holds the set of URLs that were ever added to the
      queue (to avoid downloading the same link more than once: the queue
      process only adds a URL to the queue if it is NOT in this ETS table),
      - a dictionary / map of tasks currently in progress (taken by a worker
      but not yet finished) as a map `#{<worker pid monitoring reference> =>
      task()}` - if a worker crashes, its task can be rescheduled,
      - a list of worker pids subscribed to this queue (maybe monitored).
      - It may also contain some set of rules to exclude some pages (eg,
      based on robots.txt).
      - You should also have a URL normalisation function (eg, to treat
      absolute and relative URLs as the same URL; it should decide if
      `?filter=wasd&page=2` is the same as `?page=2&filter=wasd`, strip URL
      hashes such as `#footer`, etc).

      The queue has quite a simple API: `push` / `push_many`, `subscribe` and
      `ack`. Worker gen_servers call `subscribe` and wait for a task
      message (it contains the URL and a unique reference). When the task is
      done, they call `ack(Reference)` and are ready to get the next task.
   - scheduler: it's basically an entry point and the only "brain" of your
   spider. It takes in tasks from wherever you want to take them (PUB/SUB
   queues / CRON / HTTP API), puts the "seed" URLs into the queue and spawns
   workers (usually at start time, by calling the "workers supervisor" API; I
   have only ever used a fixed number of workers to avoid overloading the
   website or the crawler). It can also monitor queue size and progress, and
   workers may report to the scheduler when a task is taken or done; this
   depends heavily on your needs.
   - and of course workers: gen_servers, supervised by the "workers
   supervisor"; their start is initiated by the scheduler (or might just be
   fixed at app start time). At start time they call `queue:subscribe` and
   just wait for messages from the queue. When a message is received, the
   worker downloads the page, parses it, pushes all found URLs to the queue
   (the queue decides which URLs to accept and which to ignore) and saves the
   results to the database; it calls `queue:ack` at the end and waits for the
   next task.
   There is a choice - let the worker crash on errors or have a top-level
   "try - catch". I prefer to catch so as not to spam Erlang's crash logs, but
   it depends on your requirements and expected error rates.
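
The queue process for a recursive spider could be sketched roughly as below. This is only an illustration under some assumptions: the module name `spider_queue`, the `#st{}` record, the `{task, Ref, Url}` message shape and the choice to use the monitor reference itself as the task's unique reference are mine, and exclusion rules and URL normalisation are left out.

```erlang
-module(spider_queue).
-behaviour(gen_server).

-export([start_link/1, push/2, push_many/2, subscribe/1, ack/2]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

-record(st, {urls = queue:new(),  % URLs waiting to be downloaded
             seen,                % ETS set of every URL ever queued
             busy = #{},          % MonitorRef => {WorkerPid, Url}
             idle = []}).         % subscribed workers waiting for a task

start_link(Name) -> gen_server:start_link({local, Name}, ?MODULE, [], []).

push(Name, Url) -> push_many(Name, [Url]).
push_many(Name, Urls) -> gen_server:cast(Name, {push, Urls}).
%% After subscribing, the caller receives {task, Ref, Url} messages.
subscribe(Name) -> gen_server:call(Name, {subscribe, self()}).
ack(Name, Ref) -> gen_server:cast(Name, {ack, Ref}).

init([]) ->
    {ok, #st{seen = ets:new(seen, [set, private])}}.

handle_call({subscribe, Pid}, _From, St = #st{idle = Idle}) ->
    {reply, ok, dispatch(St#st{idle = Idle ++ [Pid]})}.

handle_cast({push, Urls}, St = #st{urls = Q0, seen = Seen}) ->
    %% ets:insert_new/2 returns false for URLs that were queued before.
    New = [U || U <- Urls, ets:insert_new(Seen, {U})],
    Q = lists:foldl(fun queue:in/2, Q0, New),
    {noreply, dispatch(St#st{urls = Q})};
handle_cast({ack, Ref}, St = #st{busy = Busy, idle = Idle}) ->
    case maps:take(Ref, Busy) of
        {{Pid, _Url}, Busy1} ->
            erlang:demonitor(Ref, [flush]),
            {noreply, dispatch(St#st{busy = Busy1, idle = Idle ++ [Pid]})};
        error ->
            {noreply, St}
    end.

%% A worker died while holding a task: put its URL back into the queue.
handle_info({'DOWN', Ref, process, _, _Reason}, St = #st{busy = Busy, urls = Q}) ->
    case maps:take(Ref, Busy) of
        {{_Pid, Url}, Busy1} ->
            {noreply, dispatch(St#st{busy = Busy1, urls = queue:in(Url, Q)})};
        error ->
            {noreply, St}
    end;
handle_info(_Other, St) ->
    {noreply, St}.

%% Hand out pending URLs while there are both idle workers and URLs.
dispatch(St = #st{idle = [Pid | Idle], urls = Q0, busy = Busy}) ->
    case queue:out(Q0) of
        {{value, Url}, Q} ->
            Ref = erlang:monitor(process, Pid),
            Pid ! {task, Ref, Url},
            dispatch(St#st{idle = Idle, urls = Q, busy = Busy#{Ref => {Pid, Url}}});
        {empty, _} ->
            St
    end;
dispatch(St) ->
    St.
```

In this sketch a worker is monitored only while it holds a task. If an idle worker has died in the meantime, the monitor set up at dispatch time fires immediately and the URL goes back into the queue.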

This structure proved to be very flexible and allows not only recursive
crawlers but also other kinds of crawlers, eg non-recursive ones that take
their entire URL set from an external source and just download what they
were asked for and save it to the DB (in this case the scheduler fetches URLs
from the task source and puts them into the queue; the queue doesn't have a
duplicate filter).
The queue can have namespaces in case you want to parse some website more than
once and sometimes in parallel: for each task you use task_id as a
namespace, so the duplicate filter discards URLs based on the {task_id, URL} pair.
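
As an illustration, URL normalisation and a namespaced duplicate filter could be sketched like this. The module name `spider_dedup` and its functions are hypothetical; the sketch relies on the standard `uri_string` module (I assume an OTP release recent enough to have `uri_string:resolve/2`) and does not handle malformed URLs, for which `uri_string` returns error tuples.

```erlang
-module(spider_dedup).
-export([normalize/2, new/0, is_new/3]).

%% Resolve Link against the page it was found on, drop the #fragment and
%% sort the query parameters, so that equivalent URLs compare equal.
normalize(Link, BaseUrl) ->
    Abs = uri_string:resolve(Link, BaseUrl),
    Parts0 = maps:remove(fragment, uri_string:parse(Abs)),
    Parts = case Parts0 of
        #{query := Query} ->
            Sorted = lists:sort(uri_string:dissect_query(Query)),
            Parts0#{query := uri_string:compose_query(Sorted)};
        _ ->
            Parts0
    end,
    uri_string:recompose(Parts).

%% The table of {TaskId, Url} pairs that have been seen so far.
new() ->
    ets:new(seen, [set, public]).

%% Returns true only the first time a given {TaskId, Url} pair is offered.
is_new(Seen, TaskId, Url) ->
    ets:insert_new(Seen, {{TaskId, Url}}).
```

With this, `?filter=wasd&page=2` and `?page=2&filter=wasd#footer` found on the same page should normalise to the same string, and the same URL is accepted once per task_id rather than once overall. Whether reordering query parameters is safe depends on the website, so treat that step as optional.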

Hope this will help a bit.