Let some other process fix the error (Long)

Thu Apr 24 16:48:43 CEST 2003

On Wed, 23 Apr 2003, Jay Nelson wrote:

> These sorts of questions are what make erlang so interesting
> to me.  They appear simple and broad brush at first, but if
> you look closely they are actually very subtle and involve
> tradeoffs that you would never recognize in other languages.
> In C, you must code defensively because once you core dump
> there are no options and with runaway code there is no telling
> what will happen next.  In erlang you have so many choices it
> is difficult to decide what to do.
>

... a very nice description of Erlang philosophy ...

  Amazingly people like Jay (who I have never met) and many others seem
to have intuitively understood the Erlang principles - this must be
more by "osmosis" than by reading the documentation.

> Don't worry about errors, just let it fail.  That is the Erlang way.
> Like all things Erlang it sounds simple and easy, but is actually a
> very subtle thing.

  The real principle is "let some other process fix the error"

  The "let it fail" philosophy is a consequence of this.

  Let me  explain: Erlang  was *designed* for  making fault-tolerant
systems, so:

  1) To make something fault tolerant  you need at least *two* computers
(obviously)

  2) If  one computer  fails you  must fix  the error  on  the *other*
computer.

  This means that:

  3) To fix an  error you do not  make any attempt to do  it locally -
you can't fix an error on  a computer if the computer has just crashed
- you must do it somewhere else.

  In the Erlang model *everything* is a process - even computers - so
we want the same semantics. Thus processes do not do their own error
recovery (they can't, they have crashed) - other processes must clean
up after them. This is the "let some other process fix the error"
principle.

  Question  1: If some  process (A)  crashes which  of the  many other
processes in the system is responsible for the recovery operations?

  Answer: Those processes which are linked to A.

  Question 2:  How do the  linked processes know  what to do  - surely
they need to know why A died?

  Answer: The reason for the exit is sent as an argument in a signal
which is sent from the dying process to all the processes in its
link set.

  To implement this places a number of requirements on the programming
language and run-time system - namely:

	1) We must be able to remotely detect errors
	2) We must be able to automatically diagnose errors
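
  A minimal sketch of both requirements, assuming a current Erlang/OTP
release. The module name fix_elsewhere and the reason term
{bad_config, "port missing"} are made up for illustration. The worker
crashes, and the linked process that traps exits receives the reason
and decides what to do:

```erlang
-module(fix_elsewhere).
-export([run/0]).

%% Spawn a linked worker that crashes, and report the exit reason
%% that the link delivers to the surviving process.
run() ->
    %% Turn incoming exit signals into ordinary messages instead of
    %% letting them kill this process as well.
    OldFlag = process_flag(trap_exit, true),
    %% Hypothetical failure: the worker gives up with a descriptive reason.
    Pid = spawn_link(fun() -> exit({bad_config, "port missing"}) end),
    Result =
        receive
            {'EXIT', Pid, Reason} ->
                %% Remote detection (the message arrived) and diagnosis
                %% (the Reason term) both happen outside the worker.
                {worker_died, Reason}
        after 1000 ->
            timeout
        end,
    %% Restore the caller's original trap_exit setting.
    process_flag(trap_exit, OldFlag),
    Result.
```

  Calling fix_elsewhere:run() returns {worker_died, Reason} with the
reason the worker supplied. The worker itself contains no recovery code.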

  "Let it fail" is often the *only* sensible thing to do.

  Let me explain...

  - There are exceptions
  - There are errors
  - They are not the same thing

  Start with exceptions:

  The  run-time system generates  exceptions -  these occur  when the
run-time system does  not know what to do. For example  if a divide by
zero condition occurs the run-time system  does not know what to do -
so  what  does  it do?   -  it  aborts  the  process with  a  {'EXIT',
divide_by_zero} exception.

  This is  fine and in  line with our fault-handling  philosophy "some
other process will fix the error."
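
  As a hedged illustration: on the Erlang/OTP releases I know of, the
reason reported for an integer division by zero is badarith, paired
with a stack trace. The sketch below uses a monitor (a one-way relative
of a link) so the observer does not need to trap exits. The module and
function names are hypothetical:

```erlang
-module(div_zero).
-export([run/0]).

%% Let a process hit a division by zero and observe, from outside,
%% the reason the run-time system gives for aborting it.
run() ->
    {Pid, Ref} = spawn_monitor(fun() -> divide(1, 0) end),
    receive
        %% Run-time errors arrive as {Reason, Stacktrace}.
        {'DOWN', Ref, process, Pid, {Reason, _Stack}} ->
            {aborted, Reason};
        {'DOWN', Ref, process, Pid, Other} ->
            {aborted, Other}
    after 1000 ->
        timeout
    end.

%% No guard and no check for zero: the run-time system handles it.
divide(A, B) -> A div B.
```

  Note that divide/2 contains no defensive code at all. The diagnosis
is produced by the run-time system and delivered to whoever is watching.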

  What about Errors? Well what is an error? An error is "a deviation between
what the program is supposed to do and what it is observed to do".

  What it is supposed to do is "what was in the specification".

  Example (my favorite) -- let's suppose the spec says we are to write
a function asm that turns a load opcode into the instruction 1 and
a store opcode into the instruction 2. This is easy:

	asm(load)  -> 1;
	asm(store) -> 2.

  Now suppose that the system tries to call asm(jump) - what should happen?

  Suppose you are the programmer and you are used to writing defensive code
(just like they taught you) - you'd write:

	asm(load) -> 1;
	asm(store) -> 2;
	asm(X) ->

  and then what?? 

  What code do you write? - the programmer is now in the situation that the
run-time system was faced with when it encountered a divide-by-zero situation -
you cannot write any sensible code here - all you can do is terminate the
program. Remember "Some other process will fix the error".

So maybe you write:

	asm(load)  -> 1;
	asm(store) -> 2;
	asm(X)     -> exit({bad_arg_to, asm, X}).

But why bother? The Erlang compiler compiles

	asm(load)  -> 1;
	asm(store) -> 2.

almost as if it had been defined:

	asm(load)  -> 1;
	asm(store) -> 2;
	asm(X)     -> exit({bad_arg, asm, X}).

  The defensive  code *detracts* from  the pure case and  confuses the
reader -  and the diagnostic  is often no  better than that  which the
compiler supplies automatically.
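
  To see the automatic diagnostic, here is a small sketch. It assumes
Erlang/OTP 21 or later for the Class:Reason:Stacktrace catch syntax,
and the module name asm_demo and the helper try_jump/0 are hypothetical.
On the releases I know of, the failure is reported as a function_clause
error rather than an exit with bad_arg, and the first stack frame names
the module, the function and the offending arguments:

```erlang
-module(asm_demo).
-export([asm/1, try_jump/0]).

%% The pure case only, exactly as in the specification.
asm(load)  -> 1;
asm(store) -> 2.

%% Call asm/1 with an opcode the spec does not mention and return
%% what the run-time system reports.
try_jump() ->
    try
        asm(jump)
    catch
        error:function_clause:Stack ->
            %% The first frame carries {Module, Function, Args, Location}.
            {failed, function_clause, hd(Stack)}
    end.
```

  The try ... catch is here only to display the diagnostic. In a real
system the process would simply die and a linked process would receive
the same information.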

  Now the "some other process will fix the error" philosophy only
makes sense if you have a process based language with total
independence between processes.

  You *can't do this in a sequential language* - you get ONE try (your
process) and if it crashes you lose control.

  You *can't do this with  thread based concurrency* - threads *share*
resources (usually  memory) - if  one thread corrupts shared  memory -
disaster.

  You *can't do this with Unix-process-like concurrency* - you can
observe failure but not accurately diagnose the reason for failure.

  This design was not accidental - Erlang was *designed* to program
fault-tolerant systems. The key requirement, the *one* requirement that
I always considered far more important than anything else, was to be
able to make a system which could recover from software errors.

  We knew that our systems would end up with millions of lines of code
and be written  by large teams of programmers -  in such systems there
are bound to be many mistakes.

  I can think  of no other way of programming  such a system *without*
independent processes.

  The *reason*  for independent processes is NOT  efficiency (I don't
give  a  hoot about  efficiency)  -  it is  to  allow  large teams  of
programmers to work together.  Give each programmer their own processes
to  work with and  let them  hack away  - if  their process  dies, who
cares, "some other process will fix the error."

  From this the worker-supervisor model is a short step away.

  The basic idea is "try to do  something - if you can't do it give up
and try to do something simpler."
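
  A rough sketch of that idea, and emphatically not the OTP supervisor
implementation: each attempt runs in its own monitored process, and when
one dies the caller falls back to the next, simpler attempt. The module
name simpler, the function first_success/1 and the two demo tasks are
all hypothetical:

```erlang
-module(simpler).
-export([first_success/1, demo/0]).

%% Run each fun in its own monitored process, in order, and return
%% the result of the first one that finishes without crashing.
first_success([]) ->
    all_failed;
first_success([Task | Simpler]) ->
    Parent = self(),
    Tag = make_ref(),
    {Pid, Ref} = spawn_monitor(fun() -> Parent ! {Tag, Task()} end),
    receive
        {Tag, Result} ->
            %% Success: drop the monitor and any pending 'DOWN' message.
            erlang:demonitor(Ref, [flush]),
            {ok, Result};
        {'DOWN', Ref, process, Pid, _Reason} ->
            %% This attempt died; give up on it and try something simpler.
            first_success(Simpler)
    end.

%% An ambitious task that fails, followed by a modest one that works.
demo() ->
    Ambitious = fun() -> exit(no_database) end,
    Fallback  = fun() -> cached_answer end,
    first_success([Ambitious, Fallback]).
```

  Neither task contains any recovery logic. The decision to fall back
lives entirely in the process that watches them.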

  There are two other points to note:

  1) All programmers are  not equal

  Some are better than others - so you let your better programmers
design the error recovery strategies, and let them identify and
program the code that does the error recovery.

  2) All code is not equal

  As Martin Björklund once said:

	- There is code that can recover from errors
	- There is code that will not recover from errors
	- You have to make your mind up

  In particular the error recovery code *must* be correct (so don't mess
with error_handler.erl)

  Taking 1) and 2) together you arrive at the following:

  Try to structure your problem so that you can write it as
"lots of regular 'pure' code with a well defined structure" *and*
"a small module of stuff that sucks"

  Get   your  inexperienced   programmers  to   write  "referentially
transparent" *pure* code

  Get your lead programmers to write the messy stuff.

  Now if you use OTP you get the mess for free - every time I look at
the OTP stuff I think "Oh my goodness - couldn't it have been written
in a *much* simpler manner?" - I start hacking and only then do I
remember *why* it was written as it was written.

  There is an  underlying logic to Erlang - which  I have always known
but find very  difficult to explain - it  is particularly difficult to
explain it to programmers who think "threads" and "sequential" code.

  My current "best argument" is one I've partially ventilated here:

  " ...  look  to make a fault-tolerant system  you need TWO computers
not ONE right ... and if you've got TWO computers you need to start
thinking about distributed programming *whether you like it or not* and
if you're going to do distributed computing then you'll have to think
about the following ... and so on..."

  The  reaction   is  varied  -  everything   from  (rarely)  "you're
absolutely right" to - (more commonly) "what about efficiency"

  Me, I can make an incorrect program run arbitrarily quickly - that is
no challenge. The following program, for example, computes
factorial(10000000000) in less than a picotwinkle.

  factorial(N) -> 42.

/Joe