[erlang-questions] Processes & Fault Tolerance
Edmond Begumisa
ebegumisa@REDACTED
Mon Jan 3 17:21:19 CET 2011
Thank you Ulf, now I'm beginning to understand things better at the higher
OTP level. As usual, your AXD stories from the wild are priceless (I
always tag them in my e-mail client). I think you should write a book or a
series of tutorials on designing large systems with Erlang -- it would be
a great follow-up to the existing three main texts (Armstrong; Cesarini &
Thompson; Manning).
However, there's something going on at the lower level that I still don't get
-- why Erlang copies everything between processes and why this is said to
lead to fault tolerance.
Say I'm writing a fault-tolerant distributed database. I have code in
place to replicate the actual writes on disk on different nodes. Now for
reads. A request from a client comes in. Process A accesses the disk on
Node 1 but the disk fails, so the database needs a mechanism to have
Process B on Node 2 handle the very same request (it should log the
problem but not burden the client with it). Is there no natural way I can
have a redundant Process B wait for a disk_fail_error (or a
predetermined list of errors) from Process A and then handle any pending
requests? Would I have to code centrally for the database to spawn a new
process and resend any messages that were sent to Process A to the new
Process B?
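Something like this is what I'm picturing -- just a rough sketch, with
the module name and the disk_fail_error reason made up:

  %% Sketch only (names invented): Process B monitors Process A and
  %% steps in when A dies with a known disk error.
  -module(reader_b).
  -export([standby/1]).

  standby(ProcA) ->
      Ref = erlang:monitor(process, ProcA),
      receive
          {'DOWN', Ref, process, ProcA, disk_fail_error} ->
              %% A is gone -- but how do I get at the requests that were
              %% still sitting in A's mailbox? They died with A.
              take_over_pending_requests();
          {'DOWN', Ref, process, ProcA, _OtherReason} ->
              ok
      end.

  take_over_pending_requests() ->
      ok.   %% placeholder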
- Edmond -
On Mon, 03 Jan 2011 19:35:25 +1100, Ulf Wiger
<ulf.wiger@REDACTED> wrote:
>
> On 3 Jan 2011, at 02:38, Edmond Begumisa wrote:
>
>> I fully understand and appreciate how supervision trees are used to
>> restart processes if they fail. What I don't get is what to do when you
>> don't want to restart but want to take over, say on another node. I
>> know that at a higher level, OTP has some take-over/fail-over
>> semantics (at the application level). I'm trying to understand things
>> at the process level - why Erlang is the way it is, so I can better
>> use it to make my currently fault-intolerant program fault tolerant.
>>
>> Specifically, how can one process take over from another if it fails?
>> It appears to me that the only way to do this would be to somehow
>> retrieve not only the state of the process (say, gen_server's state)
>> but also the messages in its mailbox. Where does the design decision to
>> share-nothing for the sake of fault-tolerance come into play for
>> processes? Please help me "get" this!
>
>
> To start with the simpler form of redundancy - "cold standby" - this can
> be achieved by grouping your code into OTP applications and configuring
> the Distributed Application Controller (dist_ac) so that it runs an
> application on one node (e.g. A) if available, and otherwise starts it
> on B.
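>
> As a minimal sketch (the application name my_db and the node names are
> invented), the sys.config entry on a@host1 could look something like
> this, with a mirror image on b@host2:
>
>   %% dist_ac runs my_db on a@host1 when that node is up, and otherwise
>   %% starts it on b@host2 after the 5000 ms timeout.
>   [{kernel,
>     [{distributed, [{my_db, 5000, [a@host1, b@host2]}]},
>      {sync_nodes_optional, [b@host2]},
>      {sync_nodes_timeout, 10000}]}].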
>
> The application started on B will be given a hint in the first argument
> of the start/2 function: Type = {failover, A}, indicating the reason why
> it is being started. The application can then recover state in the
> initialization function.
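>
> A sketch of what the callback module could look like (my_db_app,
> my_db_sup and my_db_state:recover/1 are invented names):
>
>   -module(my_db_app).
>   -behaviour(application).
>   -export([start/2, stop/1]).
>
>   start(normal, Args) ->
>       my_db_sup:start_link(Args);
>   start({takeover, _FromNode}, Args) ->
>       %% planned takeover: the old instance is still up on the other node
>       my_db_sup:start_link(Args);
>   start({failover, FromNode}, Args) ->
>       %% FromNode went down; recover what we can (e.g. from replicated
>       %% storage) before accepting traffic. Note that {failover, Node}
>       %% is only passed when start_phases is defined in the .app file -
>       %% otherwise the start type is plain 'normal'.
>       ok = my_db_state:recover(FromNode),   %% invented helper
>       my_db_sup:start_link(Args).
>
>   stop(_State) ->
>       ok.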
>
> (I made a simple example of this in
> http://www.trapexit.org/OTP_Release_Handling_Tutorial
> I actually developed this example further for a tutorial in Israel,
> but my Subversion repository became corrupt and I lost most of it
> before I could put it online. It involved adding mnesia replication,
> and I had a more-or-less working riak alternative too. :)
>
> The "recover state" part requires some forethought, as, obviously, the
> original instance of the application is no more. For "less hot" versions
> of
> standby, one can use some form of "stable-state replication", e.g. by
> storing data in a replicated mnesia table, in riak, memBase, etc.
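>
> For the mnesia variant, a minimal sketch (table and node names invented;
> assumes a disc schema exists on both nodes):
>
>   mnesia:create_table(call_state,
>                       [{disc_copies, [a@host1, b@host2]},
>                        {attributes, [key, value]}]),
>   %% writes on either node are replicated to the disc copy on the
>   %% other, so an application starting with {failover, Node} can read
>   %% its last stable state back:
>   mnesia:dirty_write({call_state, call_17, connected}).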
>
> There is no single answer for how to do this. It depends on the
> robustness requirements and dynamics of your application. You have to
> decide what types of errors you need to handle transparently to the
> user, and what type of recovery action it is reasonable to expect of
> the user.
>
> To tie back to telecoms, the basic service that Erlang was designed for
> is the telephone call. In this case, we (used to) rely on the phone
> service being extremely reliable - it practically never happened that we
> picked up the phone and didn't get dial tone; when it did, usually, if
> we tried again, we *would* get dial tone. Also, if an ongoing call
> suddenly disconnected, we would try again, and usually succeed. As long
> as this didn't occur too often, we wouldn't think twice about it. With
> mobile telephony, it is a much more common event (in fairness, mobile
> telephony is *much* more complex), but we accept it, since the service
> gives us much greater freedom.
>
> From a programming point of view then, recovering a phone call would
> involve resetting it to its most stable state (on hook) and preparing
> for the event that the user may try again. In the AXD 301, we would take
> it a bit further and replicate the "connected" state to a standby node.
> This was done using Erlang message passing, with an intermediary that
> aggregated lots of call states together in order to reduce cost. You
> could view this as a form of "best-effort hot standby", where it wasn't
> really necessary to know if each state made it to the standby; we might
> lose some in the transition. We used a 'standby' application on the
> standby node, which collected the call states and prepared the data
> structures (exactly how this was done changed over time). Then we used
> the failover mechanism above to fire up the call handling application on
> the other side. Since it knew it was a failover, it also knew to
> handshake with the standby app to get the data. It then had a few
> seconds to get everything in order and respond appropriately to the
> status enquiry messages from the clients, which were built into the
> protocol. The actual media stream was handled by different hardware, so
> it was necessary to audit all known calls with that hardware layer. All
> calls that were not known to the Erlang side were simply reset.
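>
> If it helps, a toy sketch of the intermediary idea (nothing like the
> real AXD code; module and message names invented):
>
>   -module(state_fwd).
>   -export([loop/2]).
>
>   %% Batch incoming call states and push them to the standby node with
>   %% casts - no replies are expected, so losing a batch around the
>   %% switch-over is acceptable ("best effort").
>   loop(StandbyNode, Acc) ->
>       receive
>           {call_state, CallId, State} ->
>               loop(StandbyNode, [{CallId, State} | Acc])
>       after 100 ->
>           case Acc of
>               [] -> ok;
>               _  -> gen_server:cast({standby_collector, StandbyNode},
>                                     {batch, Acc})
>           end,
>           loop(StandbyNode, [])
>       end.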
>
> We had some true hot-standby problems as well. In some cases,
> Erlang was involved, but depending on the problem, we had to
> resort to hardware-based solutions for some, and in at least one case,
> I remember that putting a cheap, passive (i.e. reliable) splitter in
> front of the system was the simplest way to "replicate" the signal,
> making hot standby possible. The hottest of them all was that we
> could fail over from one ATM switching plane to another, losing at
> most 6 ATM cells. No Erlang there...
>
> To wrap this up, Erlang provides a number of different ways to detect
> that it's time to take over. I find that the OTP application concept
> offers a very clean way - with the start({failover,FromNode}, Args)
> function - allowing you to contain the logic in one place. For some
> requirements, this may not be sufficiently "hot", in which case you may
> need to roll your own mechanism. The trick, as you noted, is to get hold
> of the data needed to take over. Erlang doesn't solve this
> automatically; you have to plan your data storage, state replication,
> etc. to make it possible.
>
> In my experience, you should really think about what you can get away
> with. Users normally accept outages, if they are sufficiently rare and
> short, and the user doesn't lose money or other precious assets.
> Don't knock the simple solutions because it's sexy to be "infinitely
> robust" - most likely, the added complexity will make your system *less*
> robust - not more so.
>
> BR,
> Ulf W
>
> Ulf Wiger, CTO, Erlang Solutions, Ltd.
> http://erlang-solutions.com
--
Using Opera's revolutionary e-mail client: http://www.opera.com/mail/