[erlang-questions] Processes & Fault Tolerance

Mon Jan 3 17:30:03 CET 2011

> Let's start with how we do error recovery.
>
> Imagine two linked processes A and B on separate machines. A is the
> master process.
> B is a passive processes that will take over if A fails.
>
> A sends a stream of state update messages S1, S2, S3,... to B. The
> state messages contain enough information
> for B to do whatever A was doing should A fail. If A fails B will
> receive an EXIT signal.
>
> If B does receive an exit signal it knows A has failed, so it carries
> on doing whatever A was doing
> using the information it last received in a state update message.
>
> That's it and all about it - nothing to do with supervisors etc,

Ooooohh! THANK YOU, THANK YOU, THANK YOU.

*Now* I get it. This is precisely what I was trying to understand. The  
missing link was sending redundant messages. It all makes sense now --  
it's so simple. All I have to do to get fault tolerance at the process  
level is have a group of N redundant processes waiting for exit-signals  
and forward state changes to N-1 of them. I could even send to N/2 and  
have some decent tolerance at the expense of consistency of the nodes.

To preemptively answer the question I sent Ulf and Mazen... for that  
distributed database, Process B would be set up to also receive read  
requests forwarded from Process A but not respond to them unless it gets  
an exit signal it can understand like disk_fail_error. Any changes in  
state would also be forwarded to existing redundant processes.

Thanks everyone for your patience in helping me understand  
fault-tolerance. I've already started thinking about some cool things I  
can do with this in my program.

- Edmond -

On Tue, 04 Jan 2011 00:36:30 +1100, Joe Armstrong <erlang@REDACTED> wrote:

> I think you are getting confused between OTP supervisors etc. and the
> general notion of "take-over"
>
> Let's start with how we do error recovery.
>
> Imagine two linked processes A and B on separate machines. A is the
> master process.
> B is a passive processes that will take over if A fails.
>
> A sends a stream of state update messages S1, S2, S3,... to B. The
> state messages contain enough information
> for B to do whatever A was doing should A fail. If A fails B will
> receive an EXIT signal.
>
> If B does receive an exit signal it knows A has failed, so it carries
> on doing whatever A was doing
> using the information it last received in a state update message.
>
> That's it and all about it - nothing to do with supervisors etc,
> <aside> This is how we do things in a distributed system, obviously
> real fault tolerance needs more than one machine to cover the case
> where an entire machine fails. This is also why we copy everything, if
> an entire machine has failed and we have a pointer to vital data on
> that machine then we are in big trouble.
>
> If it works like this in the distributed case then we reasoned it
> should work exactly the same way when both processes are on the same
> machine. Why is this? - because it would be very difficult to program
> if the primitives and mechanisms involved were completely different
> when the remote process lay on the same machine as the process it was
> talking to or on a different machine.
>
> It might be that we actually handle these two cases differently in the
> VM, but as far as the programmer is concerned the system should behave
> the same way, and the only observable change should
> concerned with latencies.
> </aside>
>
> To make the above possible we need a couple of primitives in Erlang,
> namely process_flag trap_exits, and spawn_link. That's all you need.
>
> The basic system properties you need to make a fault-tolerant systems
> are the abilities to detect remote errors and send and receive messages.
>
> What the OTP supervisors etc. do are build layers of abstraction on
> top of spawn-link and trap_exits
> in order to simplify programming common use-cases.
>
> Fault detection and recovery is not arbitrary or magic in any sense,
> It has to be part of the system architecture like everything else. One
> common fault in making systems is to specify and design for the normal
> behavior
> but not specify and design how fault-detection and correction work.
> This is probably due to a legacy of
> programming in languages that have very limited ability to detect
> errors and recover from them.
>
> A C programmer with a single thread of control will take extreme
> measures to avoid crashing their program
> they have to,  there is only one thread of control. Given multiple
> threads of control and the ability to remotely
> detect errors your entire way of thinking changes and the emphasis
> changes from thinking "how to I avoid a crash"
> to thinking "given that a crash has occurred and been detected, how to
> I fix things up and carry on"
>
> The Erlang philosophy is that "things will crash", so we have to
> detect the crashes, fix the bit that broke and
> carry on - we do not make excessive efforts to avoid crashes, but we
> do try very hard to repair things after they have gone wrong.
>
> Fixing things after they have broken is often easier than preventing
> them from breaking. Preventing them
> from breaking is not theoretically possible - we can't even prove
> simple things, like that a program will terminate
> (the famous halting problem) - but we can easily detect failures and
> put the system to rights by applying
> certain invariants.
>
> OTP supervisors are just trees of processes with certain restart rules
> that experience has shown us
> to be useful.
>
> The main value of supervisors etc. (and of all the OTP behaviors) is
> in building large systems with many
> programmers involved. Rather than inventing their own patterns, the
> programmers reuse a common set of
> patterns, which makes the projects easier to manage.
>
> /Joe
>
>
>
>
> On Mon, Jan 3, 2011 at 6:21 AM, Edmond Begumisa
> <ebegumisa@REDACTED> wrote:
>> Thanks for your response.
>>
>> Firstly, let me make my question a little clearer...
>>
>> To rephrase: For processes, "share nothing for the sake of concurrency"  
>> - I
>> get, both in concept and application. "Share nothing for the sake of
>> fault-tolerance" - I get in concept but not in application.
>>
>> Yet as I understand it, it is for the latter reason Erlang shares  
>> nothing*
>> and not the former. Interpretation: I must be completely missing the  
>> point
>> in regards to Erlang processes and sharing nothing. This is what I want  
>> to
>> understand in application. In addition to the "side" effect of sane
>> concurrency (which coming from a chaotic multi-threading shared-memory  
>> world
>> I fully appreciate and practically make use of everyday), how can I also
>> make use of the "real" reason Erlang processes share nothing -- fault
>> tolerance? Practically/illustratively speaking?
>>
>> *ETS being the obvious exception.
>>
>> Secondly, here's a mantra from Joe Armstrong...
>>
>> @ minute 17:26
>> http://www.se-radio.net/2008/03/episode-89-joe-armstrong-on-erlang/
>>
>> "[message passing concurrency]... the original reasons have to do with  
>> fault
>> tolerance... you have to copy all the data you need from computer 1 to
>> computer 2... if computer 1 crashes you take over on computer 2... you  
>> can't
>> have dangling pointers... that's the reason for copying everything...  
>> it's
>> got nothing to do with concurrency, it's got a lot to do with
>> fault-tolerance... if they don't crash you could just have a dangling
>> pointer and copy less data but it won't work in the presence of  
>> errors..."
>>
>> I interpret this to mean that share-nothing between processes is more  
>> about
>> replicating valid state than isolating corrupted state as you described.
>>
>> Indeed, Joe created an example on his blog...
>>
>> http://armstrongonsoftware.blogspot.com/2007/07/scalable-fault-tolerant-upgradable.html
>>
>> It's algorithm 3 there I'm struggling with. Particularly where he  
>> says...
>>
>> "... In practise we would send an asynchronous stream of messages from  
>> N to
>> N+1 containing enough information to recover if things go wrong."
>>
>> Unfortunately, I couldn't find part II to that post (I don't think  
>> there is
>> one.) And I'm too green and inexperienced in the field of fault-tolerant
>> systems to figure it out on my own. I'm having trouble visualising the
>> practical here from the conceptual -- I need to be shown how :(
>>
>> Also, I seem to be under the impression that the Erlang language has  
>> some
>> sort of schematics to do this built-in (i.e. deal with one process  
>> taking
>> over from another if the first fails) and this is the reason processes  
>> share
>> nothing. This seems to me to be something different from supervision  
>> trees,
>> which use exit-trapping to re-spawn if a process fails with the active  
>> 'job'
>> disappearing and any errors logged (like restarting a daemon). My
>> interpretation of the fault-tolerance Erlang is supposed to enable (for
>> those in the know) is seamless take-over. The 'job' lives on but  
>> elsewhere.
>>
>> Using telecoms as an example: a phone call wouldn't be cut-off when a  
>> fault
>> occurs, another node would seamlessly take over. This is how I  
>> interpreted
>> Joe's post and other descriptions of Erlang's fault-tolerant features  
>> and I
>> understand the key is in the share nothing policy for processes. I'm  
>> sure
>> I've mis-understood something or everything :)
>>
>> - Edmond -
>>
>> PS: I've read the Manning draft. Great book. I don't know if the answer  
>> lies
>> in OTP (I searched and didn't find it). I suspect it's lower --  
>> probalby how
>> you organise your processes. Some distributed-programming black-magic  
>> only
>> Erlanger's know about :)
>>
>>
>> On Mon, 03 Jan 2011 14:56:51 +1100, Alain O'Dea <alain.odea@REDACTED>
>> wrote:
>>
>>> On 2011-01-02, at 22:36, "Edmond Begumisa"  
>>> <ebegumisa@REDACTED>
>>> wrote:
>>>
>>>> Slight correction...
>>>>
>>>> On Mon, 03 Jan 2011 12:38:38 +1100, Edmond Begumisa
>>>> <ebegumisa@REDACTED> wrote:
>>>>
>>>>> Hello all,
>>>>>
>>>>> I've been trying to wrap my Erlang's fault tolerant features
>>>>> particularly in relation to processes.
>>>>>
>>>>
>>>> Should be: I've been trying to wrap my head around Erlang's fault
>>>> tolerant features particularly in relation to processes.
>>>>
>>>> Sorry.
>>>>
>>>>> I've heard/read repeatedly that the primary reason why Erlang's
>>>>> designers opted for a share-nothing policy is not rooted in  
>>>>> concurrency but
>>>>> rather in fault-tolerance. When nothing is shared, everything is  
>>>>> copied.
>>>>> When everything is copied processes can take over from one another  
>>>>> when
>>>>> things fail. I follow this reasoning but I don't follow how to apply  
>>>>> it.
>>>>>
>>>>> I fully understand and appreciate how supervision trees are used to
>>>>> restart processes if they fail. What I don't get is what to do when  
>>>>> you
>>>>> don't want to restart but want to take over, say on another node. I  
>>>>> know
>>>>> that at a higher-level, OTP has some take-over/fail-over schematics  
>>>>> (at the
>>>>> application level.) I'm trying to understand things at the processes  
>>>>> level -
>>>>> why Erlang is the way it is so I can better use it to make my  
>>>>> currently
>>>>> fault-intolerant program fault tolerant.
>>>>>
>>>>> Specifically, how can one process take over from another if it  
>>>>> fails? It
>>>>> appears to may that the only way to do this would be to somehow  
>>>>> retrieve not
>>>>> only the state of the process (say, gen_server's state) but also the
>>>>> messages in its mailbox. Where does the design decision to  
>>>>> share-nothing for
>>>>> the sake of fault-tolerance come into play for processes? Please  
>>>>> help me
>>>>> "get" this!
>>>>>
>>>>> Thanks in advance.
>>>>>
>>>>> - Edmond -
>>>>>
>>>>>
>>>
>>> Hi Edmond:
>>>
>>> Share-nothing helps with concurrent fault-tolerance by preventing one
>>> process from corrupting the state of another. Receive is a process'  
>>> choice
>>> and it corrupts its own state if it receives bad data and lets it in.
>>>
>>> AFAIK OTP fault-tolerance doesn't mean no requests will fail, it means  
>>> the
>>> system/sub-system will recover if a single request causes a process to
>>> crash.  It's kind of like proper try/catch recovery applied to  
>>> concurrent
>>> code.  How you recover from the crash depends on the supervision  
>>> strategy
>>> chosen.  In some cases the supervisor can pass the state to the  
>>> replacement
>>> process. In others this isn't necessary or even desirable since the  
>>> state
>>> itself may involve resources lost in the crash or corrupted state that  
>>> led
>>> to the crash.
>>>
>>> I am straying outside my knowledge here so this paragraph is guesswork.
>>>  The message queue for a gen_server need not necessarily be lost when  
>>> the
>>> callback module crashes.  In theory OTP could (and might already)  
>>> simply
>>> delegate the messages to the replacement process following a crash.  
>>>  Someone
>>> who knows OTP better than me would need to weigh in here though.
>>>
>>> I found http://manning.com/logan very informative in understanding OTP  
>>> and
>>> its supervisor hierarchies.
>>>
>>> Cheers,
>>> Alain
>>> ________________________________________________________________
>>> erlang-questions (at) erlang.org mailing list.
>>> See http://www.erlang.org/faq.html
>>> To unsubscribe; mailto:erlang-questions-unsubscribe@REDACTED
>>>
>>
>>
>> --
>> Using Opera's revolutionary e-mail client: http://www.opera.com/mail/
>>
>> ________________________________________________________________
>> erlang-questions (at) erlang.org mailing list.
>> See http://www.erlang.org/faq.html
>> To unsubscribe; mailto:erlang-questions-unsubscribe@REDACTED
>>
>>
>
> ________________________________________________________________
> erlang-questions (at) erlang.org mailing list.
> See http://www.erlang.org/faq.html
> To unsubscribe; mailto:erlang-questions-unsubscribe@REDACTED
>

-- 
Using Opera's revolutionary e-mail client: http://www.opera.com/mail/