catching errors from linked processes (Long)

Thu Apr 24 00:08:31 CEST 2003

On Wed, 23 Apr 2003 00:36:14 -0700
Jay Nelson <jay@REDACTED> wrote:

> Code should be as simple as possible and no simpler.

Concurrency is harder than you believe :)

> Errors
> 
> Don't worry about errors, just let it fail.  That is the erlang way.
> Like all things erlang it sounds simple and easy, but is actually a
> very subtle thing.
> 
> I can think of three good reasons to let it fail: highlight errors in
> the code so they can be corrected; avoid something really bad from
> happening and reset to a known state; or notify external systems
> (or users) about bad data they are supplying so it can be corrected.

My program is definitely an example of the third case.

I left out the details of my program to more easily communicate the
problems, but for the curious, here's some info.  Basically we have a
toy programming language that's laxly specified, especially when it
comes to limitations.  There are many implementations of the language
(compilers & interpreters) and in order to attain various goals, their
authors have placed various limitations on them.  The idea behind my
program was to write a general interpreter for which the limitations
can be specified by the user.  That way, the user can test their code
to get an idea of what kind of implementations it will run on (and
exactly how it will fail on others.)

I started out handling errors (and because the user can select the
limitations, there are a wide variety of possible errors) by returning
either {ok, Result} or {error, Reason} to the caller.  This is fine when
it's only the immediate caller that cares about whether your function
succeeded or not.

Unfortunately that's rarely the case.  Often, a function two or three
levels up from the immediate caller will care.  So you have to cascade
the error back up the chain.  You can usually tell that code does this
when it has the following pattern in it:

  case Blah of
    ...
    {error, Reason} ->
      {error, Reason}
  end

It's unwieldy to do it manually like this.
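
A minimal sketch of that cascade, assuming a hypothetical three-level
call chain (run/1, middle/1 and bottom/1 are made-up names, not taken
from my interpreter).  Only run/1 actually cares about the error, but
middle/1 still has to pass it along by hand:

```erlang
-module(cascade_demo).
-export([run/1]).

%% Hypothetical call chain: run/1 -> middle/1 -> bottom/1.

run(X) ->
    case middle(X) of
        {ok, Result}    -> {ok, Result * 2};
        {error, Reason} -> {error, Reason}   % forwarded unchanged
    end.

middle(X) ->
    case bottom(X) of
        {ok, Result}    -> {ok, Result + 1};
        {error, Reason} -> {error, Reason}   % forwarded unchanged, again
    end.

%% The only place where the error is actually detected.
bottom(X) when is_integer(X), X >= 0 -> {ok, X};
bottom(_)                            -> {error, out_of_range}.
```

Every level between the detector and the function that cares gains a
clause that does nothing except copy the error tuple.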

Enter: catch and sometimes throw.

This allows a non-local exit, so we can communicate the error to the
function, several levels up, that cares about it, without manually
passing it around.  catch is fine when it's only the immediate
*process* that cares about whether your function succeeded or not.
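
A rough sketch of the same idea with catch and throw, again with
made-up names (run/1, middle/1, bottom/1 and the limit_exceeded tag are
assumptions for illustration).  The intermediate function no longer
mentions errors at all:

```erlang
-module(throw_demo).
-export([run/1]).

%% Hypothetical call chain: run/1 -> middle/1 -> bottom/1.

run(X) ->
    case catch middle(X) of
        {limit_exceeded, What} -> {error, What};   % our own throw
        {'EXIT', Reason}       -> exit(Reason);    % a genuine crash: re-raise it
        Result                 -> {ok, Result * 2}
    end.

%% No error handling here; a throw from below passes straight through.
middle(X) -> bottom(X) + 1.

bottom(X) when is_integer(X), X >= 0 -> X;
bottom(_) -> throw({limit_exceeded, out_of_range}).
```

Note that catch also turns runtime errors into {'EXIT', Reason} values,
so the caller has to tell those apart from its own thrown terms.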

Unfortunately *that's* not always the case either, especially when you
start exploiting concurrency (most implementations of the toy language
in question are strictly sequential; mine currently uses 5 processes.)
Sometimes another process will care about the result, so you *still*
have to cascade the error back up a chain, this time a chain of
processes.

Enter: [spawn_]link and sometimes exit.

*Then* you have error bliss :)
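
Here is a small sketch of an error travelling up a chain of linked
processes.  The module and function names are hypothetical.  The
process in the middle does not trap exits, so it dies with the same
reason and passes the signal on; only the top process traps exits and
sees it as a message:

```erlang
-module(link_demo).
-export([run/1]).

%% run(Reason): the innermost process exits with Reason (which should not
%% be 'normal', since normal exits do not propagate to non-trapping
%% processes). Returns {failed, Reason} once the signal reaches the caller.
run(Reason) ->
    Old = process_flag(trap_exit, true),
    Middle = spawn_link(fun() -> middle(Reason) end),
    Outcome = receive
                  {'EXIT', Middle, Why} -> {failed, Why}
              after 1000 ->
                  timeout
              end,
    process_flag(trap_exit, Old),   % restore the caller's previous setting
    Outcome.

%% Links to a worker, then just waits. It contains no error handling;
%% the link is what carries the failure upwards.
middle(Reason) ->
    spawn_link(fun() -> exit(Reason) end),
    receive after infinity -> ok end.
```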

> Fault Isolation = use a process where failure will keep a bad state
> from propagating through the system causing more damage, and when a
> restart of the faulted process can produce a fully functioning system
> automatically (possibly after moving the process or changing its
> config parameters before restart).

Or, in my case at least, make it fail the entire system as a unit, in a
consistent way.

What Lennart said is starting to make sense; my problem is that I'm too
used to the idea (from sequential programming) that the process you
start is going to be the process that does the work.  Not so.  Clearly it is
sometimes better for the process you start to be the one that
*supervises* the processes that do the work and *centralizes* their
results (errors & otherwise), like a funnel.
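
A sketch of such a funnel, under the assumption that the work can be
described as a list of zero-argument funs (funnel_demo, run/1 and the
message shapes are all invented for this example; this is not an OTP
supervisor).  The coordinator does no work itself; it links to the
workers, collects their results in order, and reports the first
failure:

```erlang
-module(funnel_demo).
-export([run/1]).

%% run(Jobs) -> {ok, Results} | {error, Reason}
%% Jobs is a list of zero-arity funs, each run in its own process.
run(Jobs) ->
    Caller = self(),
    Ref = make_ref(),
    spawn(fun() -> coordinator(Caller, Ref, Jobs) end),
    receive
        {Ref, Outcome} -> Outcome
    end.

coordinator(Caller, Ref, Jobs) ->
    process_flag(trap_exit, true),
    Self = self(),
    Pids = [spawn_link(fun() -> Self ! {result, self(), Job()} end)
            || Job <- Jobs],
    Outcome = collect(Pids, []),
    Caller ! {Ref, Outcome},
    case Outcome of
        {ok, _}         -> ok;
        %% Exiting abnormally takes any still-running linked workers down,
        %% so the whole group fails as a unit.
        {error, Reason} -> exit(Reason)
    end.

%% Wait for each worker in turn: either its result or its abnormal exit.
collect([], Acc) ->
    {ok, lists:reverse(Acc)};
collect([Pid | Rest], Acc) ->
    receive
        {result, Pid, Value} ->
            collect(Rest, [Value | Acc]);
        {'EXIT', Pid, Reason} when Reason =/= normal ->
            {error, Reason}
    end.
```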

> foo(true, X) -> bar(X).
> 
> This may end up for clarity or for performance reasons.  It is more
> concise, but relies more on the language and compiler.  As it happens,
> it also provides more info in the failure case.

This is what I'd prefer, but my code was a little too complicated to
allow it.  A nice side-effect of it is that if you name the function
well, you get an almost self-descriptive error message.
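
To illustrate, here is a small sketch with a hypothetical, descriptively
named function that only has a clause for the 'true' case.  Calling it
with anything else fails with function_clause, and the stack in the
error term typically starts with the function that had no matching
clause:

```erlang
-module(clause_demo).
-export([store_if_in_range/2, try_store/2]).

%% Hypothetical name. There is deliberately no clause for 'false':
%% an out-of-range value simply makes the call fail.
store_if_in_range(true, X) -> {stored, X}.

%% A caller that chooses to look at the failure instead of dying with it.
try_store(InRange, X) ->
    case catch store_if_in_range(InRange, X) of
        {'EXIT', {function_clause, _Stack}} -> {rejected, X};
        Stored                              -> Stored
    end.
```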

I still think it would be nice if, whenever an error occurred, the
'EXIT' message contained not only the call stack, but also the
parameters to the function that failed.  Still, you can't have
everything, and when I do need the supervisor to know something about
the state of the process that failed, I can arrange it fairly easily.
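
One way to arrange it, sketched with made-up names: put the interesting
part of the state into the exit reason, so that whoever is linked and
trapping exits receives it along with the failure.  The counter loop
and the limit_reached tag are assumptions for the example only:

```erlang
-module(state_exit_demo).
-export([run/1]).

%% run(Limit): starts a linked worker that counts up to Limit and then
%% fails. Returns the exit reason, which carries the worker's state.
run(Limit) ->
    Old = process_flag(trap_exit, true),
    Pid = spawn_link(fun() -> count(0, Limit) end),
    Outcome = receive
                  {'EXIT', Pid, Reason} -> Reason
              end,
    process_flag(trap_exit, Old),   % restore the caller's previous setting
    Outcome.

%% The worker reports its own state as part of the exit reason.
count(N, Limit) when N >= Limit ->
    exit({limit_reached, [{counter, N}, {limit, Limit}]});
count(N, Limit) ->
    count(N + 1, Limit).
```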

> Notice that in C / Java you don't get the choice.  You have to handle
> the error and assume the caller will understand the error flags you
> pass back.  If the process fails you have no recourse and no second
> chance.  You can use try and catch in Java but the algorithm becomes
> cluttered and the logic is confusing to get right.
> 
> jay

What's even worse in C is that you can't even return an aggregate like
{error, Reason}.  If you want to return an error, you either have to
incorporate it into the data type you're returning (like -1 if all your
valid results are positive integers) or pass a parameter by reference.

And figuring out where to put a catch (in Java, Erlang or any other
language) is always a bit tricky - you want it to be neither too high
nor too low in the chain.  A supervisor can let you neatly avoid having
to decide that in many cases, I'm sure.

-Chris