catching errors from linked processes (Long)

Wed Apr 23 09:36:14 CEST 2003

These sorts of questions are what make Erlang so interesting
to me.  They appear simple and broad brush at first, but if
you look closely they are actually very subtle and involve
tradeoffs that you would never recognize in other languages.
In C, you must code defensively because once you core dump
there are no options, and with runaway code there is no telling
what will happen next.  In Erlang you have so many choices it
is difficult to decide what to do.

There are actually three issues at play in this question: how
should code be structured; how should errors be reported;
and how should processes be layered.  Each of these issues
depends on the situation at hand.  Taken together they raise
another issue: how do you plan to reuse or share the code
and processes?

Code should be as simple as possible and no simpler.

This is the standard Erlang admonishment.  Sounds well enough --
challenges your skill but makes sense.  When it comes to real
code though, it is actually deeper than that.  The structure of
code depends on the stage of development.  There are at least
four stages: get it working; make it clear; worry about errors; and
worry about performance.  They may not always be performed
in this order, but in most cases it is better if they are.

Get it working is the great globs of wet clay stuff, where you hack
for a few hours until you see reasonable results occasionally.  You
only feed it valid data with expected results and shape it until it looks
like it is working.

Make it clear is when you understand the problem after having solved
it once.  You recognize the repetition and generalization and
restructure the code to capture the problem space rather than to
issue a handful of correct answers.  Much code ends after this step.

Worry about errors is when the code structure is fairly stable, but
you now consider what wacky input might arrive and what the
consequences are.  Even if "let it fail" is your motto, you need to
reason about what that means at this stage.  The clarity model might
get dramatically shifted to handle errors because operationally it
turns out error handling is more important than code maintenance.

Worry about performance should only happen when enough of the
other stages are correct so that slow performance is the most
glaring problem.  Again the code may be restructured against
clarity to improve performance.

Simple as possible but no simpler means different things for each
of the stages of "code development" (as in growth and maturity
just like "child development").

Errors

Don't worry about errors, just let it fail.  That is the Erlang way.
Like all things Erlang it sounds simple and easy, but is actually a
very subtle thing.

I can think of three good reasons to let it fail: highlight errors in the
code so they can be corrected; prevent something really bad from
happening and reset to a known state; or notify external systems
(or users) about bad data they are supplying so it can be corrected.

If you are still actively developing the code, failure is the quickest
way to find the exact point where things went wrong.  It facilitates
finding coding errors even if other users report the error to you.
Here the point of failure is most important, followed by information
about the failure.

Runaway code can be dangerous in some situations.  Banks
don't want money given away (the US Govt once sent out thousands
of checks with refunds for people who shouldn't have gotten them,
but it was cheaper to let them keep it than to try to get it back),
flying machines can't fail catastrophically, etc.  Once the system
detects that it is in trouble, failing can stop things from getting out
of control as well as reset the state of the process to the initial known
restart state.  Stopping and restarting is more important here than
why the failure occurred.  (Could this be called "defensive coding"?)

When user input or input from other systems causes unexpected
processing, failure can notify them that something should be altered,
although it is essential that you determine what effect failure has
on the external system (whether human or otherwise).  In this case,
the information conveyed is probably more important than the failure
or where it happened.  The external system has to change its behavior
based on the error result.

In each of these three cases the type and amount of information
communicated as part of the failure serves different goals and would
vary accordingly.

Process layering

Supervisor processes can be used to monitor other processes.
Typically they would do two things: relay error states and restart
the failed process.

If collecting and logging the failure is useful, an external process can
catch the failures, sort the errors and relay them to other processes,
produce statistics or log them to files or databases.  [During debugging
I could imagine a pop up window like the toolbar process window that
showed a tally of process failures counted by type as detected in the
error stack trace state.]
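
A minimal sketch of such a collecting process, written against current
Erlang/OTP rather than anything from the original thread.  The module name
fail_tally and the helpers collect/2, classify/1 and tag/1 are made up for
illustration.  It links to a batch of workers, traps their exits, and counts
the exit reasons by their leading tag.

```erlang
-module(fail_tally).
-export([run/1]).

%% Run each fun in Funs in its own linked process and return a map of
%% exit tag -> count.  The collector is a separate process so the
%% caller's own trap_exit flag is left alone.
run(Funs) ->
    Parent = self(),
    Collector = spawn(fun() ->
        process_flag(trap_exit, true),
        Pids = [spawn_link(F) || F <- Funs],
        Parent ! {self(), collect(length(Pids), #{})}
    end),
    receive
        {Collector, Tally} -> Tally
    end.

%% Wait for one 'EXIT' message per worker (normal exits included).
collect(0, Tally) ->
    Tally;
collect(N, Tally) ->
    receive
        {'EXIT', _Pid, Reason} ->
            Key = classify(Reason),
            Bump = fun(Count) -> Count + 1 end,
            collect(N - 1, maps:update_with(Key, Bump, 1, Tally))
    end.

%% Runtime errors arrive as {Error, StackTrace}; explicit exits arrive as-is.
classify({Error, Stack}) when is_list(Stack) -> tag(Error);
classify(Reason) -> tag(Reason).

tag({Tag, _Detail}) when is_atom(Tag) -> Tag;
tag(Tag) when is_atom(Tag) -> Tag;
tag(_Other) -> other.
```

A real collector would more likely log each reason or forward it to another
process than return a map, but the trap_exit and receive loop would look
about the same.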

If the process is a service that needs to remain available, the supervisor
can restart it in a known functioning state (the initial state).  It can also
try to reason out why the failure occurred (or just use some simple rule of
thumb) and try alternatives like restarting on another node or substituting
a different process.  After having no luck, it can fail itself and let a
higher level supervisor try to restart this supervisor, causing a downward
cascade of restarts with new parameters supplied by higher level logic.
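
The restart-then-escalate idea can be sketched by hand as below.  This is
only an illustration: the module name mini_sup, the restart limit and the
{restart_limit, Reason} exit reason are assumptions, and a production system
would normally use the OTP supervisor behaviour instead.

```erlang
-module(mini_sup).
-export([run/2]).

%% Start a supervising process for StartFun, wait for it to finish, and
%% return its exit reason: normal if the child eventually finished
%% cleanly, {restart_limit, Reason} if it gave up.
run(StartFun, MaxRestarts) ->
    {Sup, Ref} = spawn_monitor(fun() ->
        process_flag(trap_exit, true),
        loop(StartFun, spawn_link(StartFun), MaxRestarts)
    end),
    receive
        {'DOWN', Ref, process, Sup, Reason} -> Reason
    end.

loop(StartFun, Child, Left) ->
    receive
        {'EXIT', Child, normal} ->
            ok;
        {'EXIT', Child, _Reason} when Left > 0 ->
            %% Restart from the known initial state.
            loop(StartFun, spawn_link(StartFun), Left - 1);
        {'EXIT', Child, Reason} ->
            %% Out of luck: fail and let a higher level decide.
            exit({restart_limit, Reason})
    end.
```

Here the caller of run/2 plays the part of the higher level supervisor.  It
sees {restart_limit, Reason} and could retry with different parameters.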

Again, it depends on the goals and purpose as to what behavior the
supervisor should take (if one is even necessary).

Code and process reuse

The partitioning of code and processes may get restructured as soon as
you have a second system which needs to reuse some of the functionality.
Splitting the code into processes may facilitate reuse, recovery, failure
and supervision.

I'll have to check my list of reasons for processes.  I'm sure I had one for
abstraction and code reuse, but I don't think I had one for fault isolation.

Fault Isolation = use a process where failure will keep a bad state from
propagating through the system causing more damage, and when a
restart of the faulted process can produce a fully functioning system
automatically (possibly after moving the process or changing its config
parameters before restart).
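
As a rough sketch of fault isolation, the hypothetical helper below
(isolate:call/1, not something from the original discussion) runs a risky
computation in its own monitored process.  If the computation dies, the
caller's state is untouched and it simply receives the exit reason.

```erlang
-module(isolate).
-export([call/1]).

%% Run Fun in a separate process.  Returns {ok, Result} on success or
%% {failed, Reason} if the worker process died before replying.
call(Fun) ->
    Parent = self(),
    Tag = make_ref(),
    {Pid, Ref} = spawn_monitor(fun() -> Parent ! {Tag, Fun()} end),
    receive
        {Tag, Result} ->
            %% Drop the monitor and any 'DOWN' message already queued.
            erlang:demonitor(Ref, [flush]),
            {ok, Result};
        {'DOWN', Ref, process, Pid, Reason} ->
            {failed, Reason}
    end.
```

The caller can then decide whether to retry, substitute another process, or
fail in turn.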

The above shows that what Erlang tries to do is provide a layered
approach to code development.  Make each layer understandable
and uncluttered, layer process logic so that algorithm, error handling
and recovery are not intertwined, and use processes judiciously to
manage the abstraction, isolate faults and restart (or upgrade code)
in pieces without requiring the entire system to fail or stop.

So the simple "let it fail" to me means:

1) Don't worry about errors when trying to get the code to work

2) When organizing a system of processes (which may occur before
#1), really, really worry about what *should* happen when a process dies.
Architect the system to "do the right thing" and to dictate what the code
should *intend* in the failure cases and how other processes should react.

3) Expect *every* line of code you write to fail the entire process and
write the code so that the (un)intended failure happens in an intended way

In Chris' code example he used a case statement and received an error
that was vague.  Using function clause failure provides more information.
Here are three different ways of writing the same code.

foo(Pred, X) ->
    case Pred of
       true -> bar(X)
    end.

This is the method that allows you to add error handling clauses in the case.
If you really, really don't want a failure here, put an open ended clause that
does something useful (and maybe causes the failure elsewhere).  If you
don't need too much information to react, the closed case above will fail but
not give a lot of info.  In that scenario, the clarity of code would
override the need for detailed failure info.
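
One possible shape for the open ended clause, as a sketch only.  The module
name foo_open, the {error, {bad_pred, Other}} return value, the caller/2
function and the stand-in bar/1 are all assumptions added for illustration.
foo/2 no longer fails itself; the failure moves to whoever insists on an
{ok, _} result.

```erlang
-module(foo_open).
-export([foo/2, caller/2]).

%% Stand-in for the real work; the original example leaves bar/1 undefined.
bar(X) -> {ok, X}.

foo(Pred, X) ->
    case Pred of
        true  -> bar(X);
        %% Open ended clause: report the odd value instead of crashing here.
        Other -> {error, {bad_pred, Other}}
    end.

%% The failure now happens here, as a badmatch carrying the error term.
caller(Pred, X) ->
    {ok, Value} = foo(Pred, X),
    Value.
```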

foo(Pred, X) ->
    Pred = true, bar(X).

Here even less failure info is provided, although the code is probably clearer.
In the first example, the reader wonders why the case is left dangling with
something left out.  Here the Pred is intentionally a roadblock.

foo(true, X) -> bar(X).

You may end up with this version for clarity or for performance reasons.
It is more concise, but relies more on the language and compiler.  As it
happens, it also provides more info in the failure case.
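
To compare what each of the three variants reports, here is a small sketch
that calls them and returns the error reasons.  The module name fail_styles,
the foo_case/foo_match/foo_clause names and the stand-in bar/1 are
assumptions.  It also uses try ... catch, which is newer than this post.

```erlang
-module(fail_styles).
-export([reasons/1]).

%% Stand-in for the real work; the original example leaves bar/1 undefined.
bar(X) -> {ok, X}.

foo_case(Pred, X) ->
    case Pred of
        true -> bar(X)
    end.

foo_match(Pred, X) ->
    Pred = true, bar(X).

foo_clause(true, X) -> bar(X).

%% Return the error reason raised by F, or no_failure if it succeeds.
reason(F) ->
    try F() of
        _ -> no_failure
    catch
        error:Reason -> Reason
    end.

%% Error reasons of the three variants, in the order shown above.
reasons(Pred) ->
    [reason(fun() -> foo_case(Pred, 1) end),
     reason(fun() -> foo_match(Pred, 1) end),
     reason(fun() -> foo_clause(Pred, 1) end)].
```

fail_styles:reasons(false) gives
[{case_clause,false},{badmatch,true},function_clause].  The badmatch
reports the right hand side value (true), not the offending Pred.  The
function_clause reason itself is bare, but its stack trace names the
function and the actual arguments it was called with, which is where the
extra information comes from.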

Any of the three choices is valid, but it depends on the stage of code
development and the existence and sophistication of external processes
and supervisor processes.  If you later decide to reuse the code, the
restrictions on failure cases, error reporting requirements or other
constraints may change and cause a refactoring or rewriting of the code.

Notice that in C / Java you don't get the choice.  You have to handle the error
and assume the caller will understand the error flags you pass back.  If
the process fails you have no recourse and no second chance.  You can
use try and catch in Java but the algorithm becomes cluttered and the
logic is confusing to get right.

jay