Let some other process fix the error (Long)
Joe Armstrong
joe@REDACTED
Thu Apr 24 16:48:43 CEST 2003
On Wed, 23 Apr 2003, Jay Nelson wrote:
> These sorts of questions are what make erlang so interesting
> to me. They appear simple and broad brush at first, but if
> you look closely they are actually very subtle and involve
> tradeoffs that you would never recognize in other languages.
> In C, you must code defensively because once you core dump
> there are no options and with runaway code there is no telling
> what will happen next. In erlang you have so many choices it
> is difficult to decide what to do.
>
... a very nice description of Erlang philosophy ...
Amazingly, people like Jay (whom I have never met) and many others seem
to have intuitively understood the Erlang principles - this must be
more by "osmosis" than by reading the documentation.
> Don't worry about errors, just let it fail. That is the Erlang way.
> Like all things Erlang it sounds simple and easy, but is actually a
> very subtle thing.
The real principle is "let some other process fix the error"
The "let it fail" philosophy is a consequence of this.
Let me explain: Erlang was *designed* for making fault-tolerant
systems, so:
1) To make something fault tolerant you need at least *two* computers
(obviously)
2) If one computer fails you must fix the error on the *other*
computer.
This means that:
3) To fix an error you do not make any attempt to do it locally -
you can't fix an error on a computer if the computer has just crashed
- you must do it somewhere else.
In the Erlang model *everything* is a process - even computers - so
we want the same semantics. Thus processes do not do their own error
recovery (they can't, they have crashed) - other processes must clean
up after them. This is the "let some other process fix the error"
principle.
Question 1: If some process (A) crashes which of the many other
processes in the system is responsible for the recovery operations?
Answer: Those processes which are linked to A.
Question 2: How do the linked processes know what to do - surely
they need to know why A died?
Answer: The reason for the exit is sent as an argument in a signal
which is sent from the dying process to all the processes in its
link set.
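The two answers above can be sketched in a few lines. This is only a minimal illustration, not anything from a real system: the module name link_demo and the exit reason out_of_coffee are made up for the example. The parent traps exits, so the exit signal from the linked worker arrives as an ordinary {'EXIT', Pid, Reason} message carrying the reason.

```erlang
-module(link_demo).
-export([run/0]).

%% Spawn a linked worker that dies, and report the exit reason
%% delivered to the linked (parent) process.
run() ->
    %% Convert incoming exit signals into ordinary messages,
    %% remembering the previous setting so it can be restored.
    Old = process_flag(trap_exit, true),
    Pid = spawn_link(fun() -> exit(out_of_coffee) end),
    Result =
        receive
            {'EXIT', Pid, Reason} ->
                {worker_died, Reason}
        after 1000 ->
            timeout
        end,
    process_flag(trap_exit, Old),
    Result.
```

Calling link_demo:run() should hand back the worker's exit reason to the caller, which is then free to decide what recovery (if any) is appropriate.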
To implement this places a number of requirements on the programming
language and run-time system - namely:
1) We must be able to remotely detect errors
2) We must be able to automatically diagnose errors
"Let it fail" is often the *only* sensible thing to do.
Let me explain...
- There are exceptions
- There are errors
- They are not the same thing
Start with exceptions:
The run-time system generates exceptions - these occur when the
run-time system does not know what to do. For example if a divide by
zero condition occurs the run-time system does not know what to do -
so what does it do? - it aborts the process with a {'EXIT',
divide_by_zero} exception.
This is fine and in line with our fault-handling philosophy "some
other process will fix the error."
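As a rough sketch of the above (the module name div_demo and the helper divide/2 are invented for illustration), the worker below does nothing defensive at all. It divides by zero, the run-time system aborts it, and a monitoring process is told why. Note that in the actual run-time system the reason is reported as badarith together with a stack trace, rather than under the name divide_by_zero.

```erlang
-module(div_demo).
-export([run/0]).

%% Start a worker that divides by zero and report why it died.
run() ->
    {Pid, Ref} = spawn_monitor(fun() -> divide(1, 0) end),
    receive
        %% An uncaught run-time error ends the process with
        %% the exit reason {ErrorReason, Stacktrace}.
        {'DOWN', Ref, process, Pid, {Reason, _Stacktrace}} ->
            {crashed, Reason}
    after 1000 ->
        timeout
    end.

%% No checks on B: if it is zero the run-time system raises the exception.
divide(A, B) -> A div B.
```

The monitoring process never has to inspect the worker's code to find out what went wrong; the reason arrives in the 'DOWN' message.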
What about errors? Well, what is an error? An error is "a deviation between
what the program is supposed to do and what it is observed to do."
What it is supposed to do is "what was in the specification".
Example (my favorite) -- let's suppose the spec says we are to write
a function asm that turns a load opcode into the instruction 1 and
a store opcode into the instruction 2. This is easy:
    asm(load) -> 1;
    asm(store) -> 2.
Now suppose that the system tries to call asm(jump) - what should happen?
Suppose you are the programmer and you are used to writing defensive code
(just like they taught you) - you'd write:
    asm(load) -> 1;
    asm(store) -> 2;
    asm(X) ->
and then what??
What code do you write? - the programmer is now in the situation that the
run-time system was faced with when it encountered a divide-by-zero situation -
you cannot write any sensible code here - all you can do is terminate the
program. Remember "Some other process will fix the error".
So maybe you write:
    asm(load) -> 1;
    asm(store) -> 2;
    asm(X) -> exit({bad_arg_to, asm, X}).
But why bother. The Erlang compiler compiles
    asm(load) -> 1;
    asm(store) -> 2.
almost as if it had been defined:
    asm(load) -> 1;
    asm(store) -> 2;
    asm(X) -> exit({bad_arg, asm, X}).
The defensive code *detracts* from the pure case and confuses the
reader - and the diagnostic is often no better than that which the
compiler supplies automatically.
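To see the automatically supplied diagnostic, here is a small sketch. The module name asm_demo and the helper try_asm/1 are hypothetical, and the try ... catch is there only to display what the run-time system raises; it is not a recommendation to catch it in real code. In current Erlang systems the error raised when no clause matches is function_clause.

```erlang
-module(asm_demo).
-export([asm/1, try_asm/1]).

%% The "pure" two-clause definition, with no defensive clause.
asm(load)  -> 1;
asm(store) -> 2.

%% Illustration only: call asm/1 and report what is raised
%% when no clause matches the argument.
try_asm(Op) ->
    try asm(Op) of
        Code -> {ok, Code}
    catch
        error:function_clause -> {error, function_clause}
    end.
```

With this, try_asm(load) gives {ok, 1}, while try_asm(jump) gives {error, function_clause} without any extra clause having been written in asm/1.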
Now the "some other process will fix the error" philosophy only
makes sense if you have a process based language with total
independence between processes.
You *can't do this in a sequential language* - you get ONE try (your
process) and if it crashes you lose control.
You *can't do this with thread based concurrency* - threads *share*
resources (usually memory) - if one thread corrupts shared memory -
disaster.
You *can't do this with Unix-process-like concurrency* - you can
observe failure but not accurately diagnose the reason for failure.
This design was not accidental - Erlang was *designed* to program
fault-tolerant systems. The key requirement, the *one* requirement that
I always considered far more important than anything else, was to be
able to make a system which could recover from software errors.
We knew that our systems would end up with millions of lines of code
and be written by large teams of programmers - in such systems there
are bound to be many mistakes.
I can think of no other way of programming such a system *without*
independent processes.
The *reason* for independent processes is NOT efficiency (I don't
give a hoot about efficiency) - it is to allow large teams of
programmers to work together. Give each programmer their own processes
to work with and let them hack away - if their process dies, who
cares, "some other process will fix the error."
From this the worker-supervisor model is a short step away.
The basic idea is "try to do something - if you can't do it give up
and try to do something simpler."
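The "give up and try something simpler" idea can be sketched roughly as follows. This is not how OTP supervisors are implemented; it is a toy under stated assumptions, and the module name fallback_demo, the function run/1 and the demo funs are all invented for the example. Each attempt runs in its own monitored process; if that process dies, the caller moves on to the next, simpler attempt.

```erlang
-module(fallback_demo).
-export([run/1, demo/0]).

%% Try each fun in turn, each in its own monitored process.
%% If one crashes, give up on it and try the next (simpler) one.
run([]) ->
    {error, all_failed};
run([Fun | Simpler]) ->
    Parent = self(),
    Tag = make_ref(),
    {Pid, Ref} = spawn_monitor(fun() -> Parent ! {Tag, Fun()} end),
    receive
        {Tag, Result} ->
            %% The worker succeeded; drop the monitor and any 'DOWN' message.
            erlang:demonitor(Ref, [flush]),
            {ok, Result};
        {'DOWN', Ref, process, Pid, _Reason} ->
            %% The worker died before answering; try something simpler.
            run(Simpler)
    end.

%% The first attempt fails on purpose, the second one succeeds.
demo() ->
    run([fun() -> exit(too_hard) end,
         fun() -> simple_answer end]).
```

The code doing the actual work contains no recovery logic at all; the decision about what to try next lives entirely in the process that watches it.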
There are two other points to note:
1) All programmers are not equal
Some are better than others - so let your better programmers
design the error recovery strategies, and let them identify and
program the code that does the error recovery.
2) All code is not equal
As Martin Björklund once said:
- There is code that can recover from errors
- There is code that will not recover from errors
- You have to make your mind up
In particular the error recovery code *must* be correct (so don't mess
with error_handler.erl)
Taking 1) and 2) together you arrive at the following:
Try to structure your problem so that you can write it as
- "lots of regular 'pure' code with a well defined structure" *and*
- "a small module of stuff that sucks"
Get your inexperienced programmers to write "referentially
transparent" *pure* code
Get your lead programmers to write the messy stuff.
Now if you use OTP you get the mess for free - every time I look at
the OTP stuff I think "Oh my goodness - couldn't it have been written
in a *much* more simple manner?" - I start hacking and only then do I
remember *why* it was written as it was written.
There is an underlying logic to Erlang - which I have always known
but find very difficult to explain - it is particularly difficult to
explain it to programmers who think "threads" and "sequential" code.
My current "best argument" is one I've partially ventilated here:
" ... look, to make a fault-tolerant system you need TWO computers
not ONE, right ... and if you've got TWO computers you need to start
thinking about distributed programming *whether you like it or not* and
if you're going to do distributed computing then you'll have to think
about the following ... and so on..."
The reaction is varied - everything from (rarely) "you're
absolutely right" to - (more commonly) "what about efficiency"
Me, I can make an incorrect program run arbitrarily quickly - that is
no challenge, the following program, for example, computes
factorial(10000000000) in less than a picotwinkle.
    factorial(N) -> 42.
/Joe