Making reliable distributed systems in the presence of software errors

Thu Nov 13 11:01:58 CET 2003

On 12 Nov 2003, Luke Gorrie wrote:

> Bjarne Däcker <bjarne@REDACTED> writes:
> 
> > http://www.sics.se/~joe/thesis/spikblad.html
> 
> Nice typography! :-)
> 

  Thanks -

  Shameless plug follows.

  Hello all Erlangers ...

  You might like to *read*

     http://www.sics.se/~joe/thesis/armstrong_thesis_2003.pdf

  The central problem of this thesis is "How to make reliable systems in
the presence of software errors".

  We know how to make reliable systems in the presence of *hardware*
errors (answer: replicate) - but what about *software* errors? Here
replication does not help at all - it just makes matters worse. Instead
of one failing program we have two failing programs, both of which fail
for exactly the same reason.

  Since most things fail because of software errors, this problem
seems much more interesting than the "hardware" fault-tolerance
problem.

  Erlang  is part  of the  story -  the thesis  contains  (among other
things)

 - A philosophy of programming (called Concurrency Oriented Programming)
 - A description of Erlang
 - Examples of how to program in Erlang
 - A method for programming fault tolerant systems
 - A description of an implementation of this method
   (i.e. a description of the major OTP behaviours)
 - Examples of how to program with the OTP behaviours
 - Case studies to see if the method works (I claim it does)
 - A method for specifying the interaction between components (UBF)

  Much of the material in the thesis can be viewed as "the missing
Erlang documentation", since it records not "how things are done" but,
more importantly, "why things were done".

  Have a good read

  /Joe