[erlang-questions] Processes & Fault Tolerance

Mon Jan 3 09:35:25 CET 2011

On 3 Jan 2011, at 02:38, Edmond Begumisa wrote:

> I fully understand and appreciate how supervision trees are used to restart processes if they fail. What I don't get is what to do when you don't want to restart but want to take over, say on another node. I know that at a higher-level, OTP has some take-over/fail-over schematics (at the application level.) I'm trying to understand things at the processes level - why Erlang is the way it is so I can better use it to make my currently fault-intolerant program fault tolerant.
> 
> Specifically, how can one process take over from another if it fails? It appears to may that the only way to do this would be to somehow retrieve not only the state of the process (say, gen_server's state) but also the messages in its mailbox. Where does the design decision to share-nothing for the sake of fault-tolerance come into play for processes? Please help me "get" this!

To start with the simpler form of redundancy - "cold standby" - it can be
achieved by grouping your code into OTP applications, and configuring
the Distributed Application Controller (dist_ac), so that it will run an
application on one node (e.g. A), if available, and otherwise start it on B.

The application started on B will be given a hint in the first argument of 
the start/2 function: Type = {failover, A}, indicating the reason why it is 
being started. The application can then recover state in the initialization
function.

(I made a simple example of this in 
http://www.trapexit.org/OTP_Release_Handling_Tutorial
I actually developed this example further for a tutorial in Israel,
but my Subversion repository became corrupt and I lost most of it
before I could put it on line. It involved adding mnesia replication,
and I had a more-or-less working riak alternative too. :)

The "recover state" part requires some forethought, as, obviously, the 
original instance of the application is no more. For "less hot" versions of 
standby, one can use some form of "stable-state replication", e.g. by 
storing data in a replicated mnesia table, in riak, memBase, etc.

There is no single answer for how to do this. It depends on the robustness
requirements and dynamics of your application. You have to decide what
types of errors you need to handle transparently to the user, and what type
of recovery action it is reasonable to expect of the user itself.

To tie back to telecoms, the basic service that Erlang was designed for is
the telephone call. In this case, we (used to) rely on the phone service being
extremely reliable - it practically never happened that we picked up the 
phone and didn't get dial tone; when it did, usually, if we tried again, we 
*would* get dial tone. Also, if an ongoing call suddenly disconnected, we 
would try again, and usually succeed. As long as this didn't occur too often,
we wouldn't think twice about it. With mobile telephony, it is a much more 
common event (in fairness, mobile telephony is *much* more complex),
but we accept it, since the service gives us much greater freedom.

From a programming point of view then, recovering a phone call would 
involve resetting it to its most stable state (on hook) and preparing for 
the event that the user may try again. In the AXD 301, we would take it 
a bit further and replicate the "connected" state to a standby node. This
was done using erlang message passing, with an intermediary that 
aggregated lots of call states together in order to reduce cost. You could
view this as a form of "best-effort hot standby", where it wasn't really 
necessary to know if each state made it to the standby; we might lose some
in the transition. We used a 'standby' application on the standby node, which
collected the call states and prepared the data structures (exactly how this 
was done changed over time). Then we used the failover mechanism above
to fire up the call handling application on the other side. Since it knew it 
was a failover, it also knew to handshake with the standby app to get the
data. It then had a few seconds to get everything in order and respond 
appropriately to the status enquiry messages from the clients, which 
were built into the protocol. The actual media stream was handled by 
different hardware, so it was necessary to audit all known calls with that
hardware layer. All calls that were not known to the Erlang side were 
simply reset.

We had some true hot-standby problems as well. In some cases, 
Erlang was involved, but depending on the problem, we had to
resort to hardware-based solutions for some, and in at least one case,
I remember that putting a cheap, passive (i.e. reliable) splitter in 
front of the system was the simplest way to "replicate" the signal,
making hot standby possible. The hottest of them all was that we 
could fail over from one ATM switching plane to another, losing at
most 6 ATM cells. No Erlang there...

To wrap this up, Erlang provides a number of different ways to detect 
that it's time to take over. I find that the OTP application concept offers
a very clean way - with the start({failover,FromNode}, Args) function -
allowing you to contain the logic in one place. For some requirements,
this may not be sufficiently "hot", in which case you may need to roll 
your own mechanism. The trick, as you noted, is to get hold of the data
needed to take over. Erlang doesn't solve this automatically; you have
to plan your data storage, state replication, etc. to make it possible.

In my experience, you should really think about what you can get away
with. Users normally accept outages, if they are sufficiently rare and
short, and the user doesn't lose money or other precious assets.
Don't knock the simple solutions because it's sexy to be "infinitely
robust" - most likely, the added complexity will make your system *less*
robust - not more so.

BR,
Ulf W 

Ulf Wiger, CTO, Erlang Solutions, Ltd.
http://erlang-solutions.com