Erlang question on Artima blog

Sat Mar 13 22:24:29 CET 2004

On Sat, 13 Mar 2004 08:56:25 -0800 (PST), Isaac Gouy <igouy2@REDACTED> 
wrote:

> There's a question about the granularity of Erlang's approach to
> fault-tolerance (and the relationship to Design by Contract) on the
> Artima blog:
>
> http://www.artima.com/forums/flat.jsp?forum=226&thread=37878
>
> Could some experienced individual please provide answers? ;-)

I was going to, but the Artima user registration service seemed
to have problems sending me a password for my account. When I wrote
their webmaster and was asked to reply to some generated email
address, I got an NDN back. Oh, well...

Basically, the answer was going to be something like:

1) If system failure is not an option, you have to go with
    hardware redundancy, so a process might be allowed to
    crash the processor/OS it is running on. This is important,
    as it allows you to write "kernel processes" that have to be
    assumed correct for the node to be operational. In Erlang,
    you can build a system using multiple "Erlang nodes", where
    distribution aspects can be either transparent or explicit,
    depending on the role of your program. This is how redundancy
    is normally implemented, and it can be done in several ways,
    depending on requirements:
    a) Hot standby: typically, a process on another computer
       would monitor the active process, and the two would
       employ some replication protocol to stay in synch.
       This implies quite explicit exception handling on the
       part of the standby process. However, the logic required
       can be packaged as a reusable framework, so that the
       process assuming the active role is notified through a
       simple callback function.
    b) Cold standby: The Erlang nodes can be configured so that
       the applications running on one node will be restarted
       on another in case of failure. The applications can detect
       that they are starting due to "failover" from another node,
       or they can start as they normally do.
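
To make the hot standby idea in 1a) a bit more concrete, here is a minimal single-node sketch, not a real replication framework. The module name standby_sketch, the {replicate, State} message and the TakeOver callback are all made up for illustration. In a real system the two processes would run on different Erlang nodes; erlang:monitor/2 works across nodes as well.

```erlang
-module(standby_sketch).
-export([start_standby/2, demo/0]).

%% Start a standby process that monitors ActivePid. TakeOver is a
%% hypothetical callback, fun(LastReplicatedState, DownReason), that is
%% called when the active process goes down.
start_standby(ActivePid, TakeOver) ->
    Parent = self(),
    Standby = spawn(fun() ->
        Ref = erlang:monitor(process, ActivePid),
        %% Tell the caller that the monitor is in place.
        Parent ! {standby_ready, self()},
        standby_loop(Ref, TakeOver, undefined)
    end),
    receive {standby_ready, Standby} -> Standby end.

standby_loop(Ref, TakeOver, State) ->
    receive
        {replicate, NewState} ->
            %% Stand-in for a real replication protocol.
            standby_loop(Ref, TakeOver, NewState);
        {'DOWN', Ref, process, _Pid, Reason} ->
            %% The active process died: notify through the callback.
            TakeOver(State, Reason)
    end.

%% Simulate an active process that replicates one state and then crashes.
demo() ->
    Self = self(),
    Active = spawn(fun() ->
        receive
            {peer, Peer} ->
                Peer ! {replicate, 42},
                exit(crashed)
        end
    end),
    Standby = start_standby(Active,
        fun(S, R) -> Self ! {took_over, S, R} end),
    Active ! {peer, Standby},
    receive
        {took_over, State, Reason} -> {State, Reason}
    after 1000 -> timeout
    end.
```

Calling standby_sketch:demo() should return the last replicated state together with the reason the active process went down. The callback is the only part the application writer would normally have to supply.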

2) A process crash does not have to lead to a node crash.
    Erlang's "process linking" concept can be used in a variety
    of ways.
    a) The default behaviour is that if a process dies, all
       processes linked to it will also die. This is called
       "cascading exit", and allows you to clean up a fairly
       large amount of work automatically.
    b) A process that wants to take action when another dies
       can trap exits. Example: if process A wants to open
       a file, the file library spawns a process B that opens
       the file and acts as a middle man; B becomes A's file
       handle. If A dies, B, having linked itself to A and
       trapping exits, detects this (it receives an 'EXIT'
       message from A), closes the file, and then exits.
    c) Supervisors are special processes built on the linking
       concept. If a supervised process dies, it is restarted
       with default values by its supervisor. If necessary,
       the supervisor can be configured to restart a group
       of processes, as this may simplify the re-synchronization.
       If the restart frequency exceeds a configured limit, the
       supervisor exits, and lets the next-level supervisor
       handle the situation (escalated restart).
    d) Re-acquiring a process handle may not be necessary. A
       process can register itself using a logical name, and
       other processes wanting to talk to it can use the
       logical name as the destination for message sending.
       After a crash, the new process registers under the
       same name, and other processes may never know the
       difference.
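
The middle man in 2b) can be sketched roughly as follows. This is not how the file library is actually implemented; the module name handle_sketch and the {closed, Reason} notification are assumptions, and the "resource" is just an observer pid standing in for a real file descriptor. A real version would call something like file:close/1 in cleanup/2.

```erlang
-module(handle_sketch).
-export([open/1, demo/0]).

%% Called by the owner (process A). Spawns the middle man (process B),
%% which traps exits and links itself to the owner. We wait for an
%% acknowledgement so that the link exists before open/1 returns.
open(Resource) ->
    Owner = self(),
    Handle = spawn(fun() ->
        process_flag(trap_exit, true),
        link(Owner),
        Owner ! {self(), ready},
        loop(Owner, Resource)
    end),
    receive {Handle, ready} -> Handle end.

loop(Owner, Resource) ->
    receive
        {'EXIT', Owner, Reason} ->
            %% The owner died: release the resource, then terminate.
            cleanup(Resource, Reason);
        {Owner, close} ->
            %% Orderly close requested by the owner.
            cleanup(Resource, closed)
    end.

%% Stand-in for closing a file: just report to the observer pid.
cleanup(Observer, Reason) ->
    Observer ! {closed, Reason},
    ok.

%% The owner opens a handle and then crashes; the middle man notices.
demo() ->
    Observer = self(),
    spawn(fun() ->
        _Handle = open(Observer),
        exit(owner_crashed)
    end),
    receive
        {closed, Reason} -> Reason
    after 1000 -> timeout
    end.
```

Because B traps exits, the exit signal from A arrives as an ordinary {'EXIT', Pid, Reason} message instead of killing B, which is what gives it the chance to clean up.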

3) Erlang doesn't really use Design by Contract, but relies
    rather heavily on pattern matching. For example,
    the function file:open(File, Mode) is defined so that it
    returns {ok, FileDescr} or {error, Reason}. A typical
    call to this function would be formulated:

    {ok, Fd} = file:open("foo.txt", [read]).

    This means that the caller will assert that the returned
    value is a 2-tuple where the first element is the constant
    'ok', and the second is some object that becomes bound to
    the free variable Fd. If the function instead returned e.g.
    {error, enoent}, the match would fail and the caller would
    crash with a badmatch error. This is called "programming for
    the correct case", and is widely used in Erlang. It works
    wonderfully for both large and small systems.
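
As a small illustration of what the failed match looks like from the outside, here is a sketch that catches the crash only so the result can be inspected. The module name badmatch_sketch and the path are made up, and the path is assumed not to exist; normal code would not wrap the call like this, but would let the process die and leave recovery to a linked process or supervisor.

```erlang
-module(badmatch_sketch).
-export([open_or_crash/1, demo/0]).

%% Programming for the correct case: assert that the open succeeds.
open_or_crash(Path) ->
    {ok, Fd} = file:open(Path, [read]),
    Fd.

%% Try to open a file that is assumed not to exist, and show the
%% exception that the failed match raises.
demo() ->
    try open_or_crash("/no/such/dir/foo.txt") of
        Fd ->
            file:close(Fd),
            opened
    catch
        error:{badmatch, {error, Reason}} ->
            {crashed, Reason}
    end.
```

The value that failed to match is carried inside the badmatch term, so the crash report tells you what file:open/2 actually returned.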
    Pattern matching can also be used on the inputs to a
    function. For example the function hd(List), extracting
    the first element from a linked list, could be written:

    hd([Head|_]) -> Head.

    Meaning that the function will only accept as input a
    list containing at least one element (_ is a "don't care"
    pattern, and in this case represents the tail of the list).
    Any other input will cause a function_clause exception.
    This could also be written explicitly as:

    hd([Head|_]) -> Head;
    hd(Other) -> exit({function_clause, Other}).
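
The implicit version can be tried out with a sketch like the one below. The function is named first/1 here (a made-up name) only to avoid clashing with the built-in hd/1, and the module name clause_sketch is likewise an assumption. When no clause matches, the runtime raises function_clause as an exception of class error, which is what the try expression catches.

```erlang
-module(clause_sketch).
-export([first/1, demo/0]).

%% Accepts only a list with at least one element.
first([Head|_]) -> Head.

%% Call first/1 once with valid input and once with an empty list.
demo() ->
    Ok = first([a, b, c]),
    Crash = try first([])
            catch error:function_clause -> function_clause
            end,
    {Ok, Crash}.
```

Calling clause_sketch:demo() returns the head of the valid list together with the atom showing that the empty list was rejected.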

/Uffe
-- 
Ulf Wiger