[erlang-questions] Processes & Fault Tolerance

Joe Armstrong erlang@REDACTED
Mon Jan 3 14:36:30 CET 2011

I think you are getting confused between OTP supervisors etc. and the
general notion of "take-over"

Let's start with how we do error recovery.

Imagine two linked processes A and B on separate machines. A is the
master process.
B is a passive processes that will take over if A fails.

A sends a stream of state update messages S1, S2, S3,... to B. The
state messages contain enough information
for B to do whatever A was doing should A fail. If A fails B will
receive an EXIT signal.

If B does receive an exit signal it knows A has failed, so it carries
on doing whatever A was doing
using the information it last received in a state update message.

That's it and all about it - nothing to do with supervisors etc,

<aside> This is how we do things in a distributed system, obviously
real fault tolerance needs more than one machine to cover the case
where an entire machine fails. This is also why we copy everything, if
an entire machine has failed and we have a pointer to vital data on
that machine then we are in big trouble.

If it works like this in the distributed case then we reasoned it
should work exactly the same way when both processes are on the same
machine. Why is this? - because it would be very difficult to program
if the primitives and mechanisms involved were completely different
when the remote process lay on the same machine as the process it was
talking to or on a different machine.

It might be that we actually handle these two cases differently in the
VM, but as far as the programmer is concerned the system should behave
the same way, and the only observable change should
concerned with latencies.

To make the above possible we need a couple of primitives in Erlang,
namely process_flag trap_exits, and spawn_link. That's all you need.

The basic system properties you need to make a fault-tolerant systems
are the abilities to detect remote errors and send and receive messages.

What the OTP supervisors etc. do are build layers of abstraction on
top of spawn-link and trap_exits
in order to simplify programming common use-cases.

Fault detection and recovery is not arbitrary or magic in any sense,
It has to be part of the system architecture like everything else. One
common fault in making systems is to specify and design for the normal
but not specify and design how fault-detection and correction work.
This is probably due to a legacy of
programming in languages that have very limited ability to detect
errors and recover from them.

A C programmer with a single thread of control will take extreme
measures to avoid crashing their program
they have to,  there is only one thread of control. Given multiple
threads of control and the ability to remotely
detect errors your entire way of thinking changes and the emphasis
changes from thinking "how to I avoid a crash"
to thinking "given that a crash has occurred and been detected, how to
I fix things up and carry on"

The Erlang philosophy is that "things will crash", so we have to
detect the crashes, fix the bit that broke and
carry on - we do not make excessive efforts to avoid crashes, but we
do try very hard to repair things after they have gone wrong.

Fixing things after they have broken is often easier than preventing
them from breaking. Preventing them
from breaking is not theoretically possible - we can't even prove
simple things, like that a program will terminate
(the famous halting problem) - but we can easily detect failures and
put the system to rights by applying
certain invariants.

OTP supervisors are just trees of processes with certain restart rules
that experience has shown us
to be useful.

The main value of supervisors etc. (and of all the OTP behaviors) is
in building large systems with many
programmers involved. Rather than inventing their own patterns, the
programmers reuse a common set of
patterns, which makes the projects easier to manage.


On Mon, Jan 3, 2011 at 6:21 AM, Edmond Begumisa
<ebegumisa@REDACTED> wrote:
> Thanks for your response.
> Firstly, let me make my question a little clearer...
> To rephrase: For processes, "share nothing for the sake of concurrency" - I
> get, both in concept and application. "Share nothing for the sake of
> fault-tolerance" - I get in concept but not in application.
> Yet as I understand it, it is for the latter reason Erlang shares nothing*
> and not the former. Interpretation: I must be completely missing the point
> in regards to Erlang processes and sharing nothing. This is what I want to
> understand in application. In addition to the "side" effect of sane
> concurrency (which coming from a chaotic multi-threading shared-memory world
> I fully appreciate and practically make use of everyday), how can I also
> make use of the "real" reason Erlang processes share nothing -- fault
> tolerance? Practically/illustratively speaking?
> *ETS being the obvious exception.
> Secondly, here's a mantra from Joe Armstrong...
> @ minute 17:26
> http://www.se-radio.net/2008/03/episode-89-joe-armstrong-on-erlang/
> "[message passing concurrency]... the original reasons have to do with fault
> tolerance... you have to copy all the data you need from computer 1 to
> computer 2... if computer 1 crashes you take over on computer 2... you can't
> have dangling pointers... that's the reason for copying everything... it's
> got nothing to do with concurrency, it's got a lot to do with
> fault-tolerance... if they don't crash you could just have a dangling
> pointer and copy less data but it won't work in the presence of errors..."
> I interpret this to mean that share-nothing between processes is more about
> replicating valid state than isolating corrupted state as you described.
> Indeed, Joe created an example on his blog...
> http://armstrongonsoftware.blogspot.com/2007/07/scalable-fault-tolerant-upgradable.html
> It's algorithm 3 there I'm struggling with. Particularly where he says...
> "... In practise we would send an asynchronous stream of messages from N to
> N+1 containing enough information to recover if things go wrong."
> Unfortunately, I couldn't find part II to that post (I don't think there is
> one.) And I'm too green and inexperienced in the field of fault-tolerant
> systems to figure it out on my own. I'm having trouble visualising the
> practical here from the conceptual -- I need to be shown how :(
> Also, I seem to be under the impression that the Erlang language has some
> sort of schematics to do this built-in (i.e. deal with one process taking
> over from another if the first fails) and this is the reason processes share
> nothing. This seems to me to be something different from supervision trees,
> which use exit-trapping to re-spawn if a process fails with the active 'job'
> disappearing and any errors logged (like restarting a daemon). My
> interpretation of the fault-tolerance Erlang is supposed to enable (for
> those in the know) is seamless take-over. The 'job' lives on but elsewhere.
> Using telecoms as an example: a phone call wouldn't be cut-off when a fault
> occurs, another node would seamlessly take over. This is how I interpreted
> Joe's post and other descriptions of Erlang's fault-tolerant features and I
> understand the key is in the share nothing policy for processes. I'm sure
> I've mis-understood something or everything :)
> - Edmond -
> PS: I've read the Manning draft. Great book. I don't know if the answer lies
> in OTP (I searched and didn't find it). I suspect it's lower -- probalby how
> you organise your processes. Some distributed-programming black-magic only
> Erlanger's know about :)
> On Mon, 03 Jan 2011 14:56:51 +1100, Alain O'Dea <alain.odea@REDACTED>
> wrote:
>> On 2011-01-02, at 22:36, "Edmond Begumisa" <ebegumisa@REDACTED>
>> wrote:
>>> Slight correction...
>>> On Mon, 03 Jan 2011 12:38:38 +1100, Edmond Begumisa
>>> <ebegumisa@REDACTED> wrote:
>>>> Hello all,
>>>> I've been trying to wrap my Erlang's fault tolerant features
>>>> particularly in relation to processes.
>>> Should be: I've been trying to wrap my head around Erlang's fault
>>> tolerant features particularly in relation to processes.
>>> Sorry.
>>>> I've heard/read repeatedly that the primary reason why Erlang's
>>>> designers opted for a share-nothing policy is not rooted in concurrency but
>>>> rather in fault-tolerance. When nothing is shared, everything is copied.
>>>> When everything is copied processes can take over from one another when
>>>> things fail. I follow this reasoning but I don't follow how to apply it.
>>>> I fully understand and appreciate how supervision trees are used to
>>>> restart processes if they fail. What I don't get is what to do when you
>>>> don't want to restart but want to take over, say on another node. I know
>>>> that at a higher-level, OTP has some take-over/fail-over schematics (at the
>>>> application level.) I'm trying to understand things at the processes level -
>>>> why Erlang is the way it is so I can better use it to make my currently
>>>> fault-intolerant program fault tolerant.
>>>> Specifically, how can one process take over from another if it fails? It
>>>> appears to may that the only way to do this would be to somehow retrieve not
>>>> only the state of the process (say, gen_server's state) but also the
>>>> messages in its mailbox. Where does the design decision to share-nothing for
>>>> the sake of fault-tolerance come into play for processes? Please help me
>>>> "get" this!
>>>> Thanks in advance.
>>>> - Edmond -
>> Hi Edmond:
>> Share-nothing helps with concurrent fault-tolerance by preventing one
>> process from corrupting the state of another. Receive is a process' choice
>> and it corrupts its own state if it receives bad data and lets it in.
>> AFAIK OTP fault-tolerance doesn't mean no requests will fail, it means the
>> system/sub-system will recover if a single request causes a process to
>> crash.  It's kind of like proper try/catch recovery applied to concurrent
>> code.  How you recover from the crash depends on the supervision strategy
>> chosen.  In some cases the supervisor can pass the state to the replacement
>> process. In others this isn't necessary or even desirable since the state
>> itself may involve resources lost in the crash or corrupted state that led
>> to the crash.
>> I am straying outside my knowledge here so this paragraph is guesswork.
>>  The message queue for a gen_server need not necessarily be lost when the
>> callback module crashes.  In theory OTP could (and might already) simply
>> delegate the messages to the replacement process following a crash.  Someone
>> who knows OTP better than me would need to weigh in here though.
>> I found http://manning.com/logan very informative in understanding OTP and
>> its supervisor hierarchies.
>> Cheers,
>> Alain
>> ________________________________________________________________
>> erlang-questions (at) erlang.org mailing list.
>> See http://www.erlang.org/faq.html
>> To unsubscribe; mailto:erlang-questions-unsubscribe@REDACTED
> --
> Using Opera's revolutionary e-mail client: http://www.opera.com/mail/
> ________________________________________________________________
> erlang-questions (at) erlang.org mailing list.
> See http://www.erlang.org/faq.html
> To unsubscribe; mailto:erlang-questions-unsubscribe@REDACTED

More information about the erlang-questions mailing list