[erlang-questions] Processes & Fault Tolerance

Edmond Begumisa ebegumisa@REDACTED
Mon Jan 3 17:21:24 CET 2011


Thanks. I see what you're saying... so share-nothing is more about  
fault-isolation than fault-tolerance. Isolate the fault to affect as few  
processes as possible rather than deal with the fault then continue.
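
A minimal sketch of that isolation, using nothing beyond plain spawn and spawn_monitor. The module name, the counter process and the message shapes are made up for illustration; they don't come from OTP or from anyone's post:

```erlang
-module(isolation_demo).
-export([run/0]).

%% A process whose state (N) lives only in its own loop argument.
counter(N) ->
    receive
        incr -> counter(N + 1);
        {get, From} -> From ! {count, N}, counter(N)
    end.

run() ->
    Counter = spawn(fun() -> counter(0) end),
    Counter ! incr,
    %% A second process that fails straight away.
    {Crasher, Ref} = spawn_monitor(fun() -> exit(boom) end),
    receive
        {'DOWN', Ref, process, Crasher, Reason} ->
            %% The failure shows up as an ordinary message; the counter
            %% was never reachable from the crashed process, so its
            %% state is intact.
            Counter ! {get, self()},
            receive
                {count, N} ->
                    exit(Counter, kill),   % tidy up the demo process
                    {crashed_with, Reason, counter_still_has, N}
            end
    end.
```

Calling isolation_demo:run() should give {crashed_with, boom, counter_still_has, 1}: the fault is contained and reported, nothing more.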

So there's no natural way I can arrange a group of active redundant processes  
on different nodes sharing replicated state, ready to take over if one  
fails. Like what distributed fault-tolerant databases do with disks, but  
instead of nodes with replicated disks, I have nodes with replicated  
processes.
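
A hand-rolled version of the idea can at least be sketched with plain processes and a monitor, along the lines of the "asynchronous stream of messages from N to N+1" in Joe's post. This is only a sketch under assumptions: the module, the checkpoint message and the crash message are invented for illustration, everything runs on one node, and the client is assumed to know it should talk to the standby after the failure.

```erlang
-module(standby_demo).
-export([run/0]).

%% Primary: does the work and streams a checkpoint after every step.
primary(Standby, Count) ->
    receive
        incr ->
            New = Count + 1,
            Standby ! {checkpoint, New},   % asynchronous, copied data
            primary(Standby, New);
        crash ->
            exit(simulated_fault)
    end.

%% Standby: starts and monitors the primary, then waits.
standby_init(Parent) ->
    Self = self(),
    {Primary, _Ref} = spawn_monitor(fun() -> primary(Self, 0) end),
    Parent ! {primary, Primary},
    standby(0).

%% Remember the last checkpoint; become active when the primary dies.
%% Client requests that arrive early stay in the mailbox until then.
standby(Last) ->
    receive
        {checkpoint, Count} -> standby(Count);
        {'DOWN', _Ref, process, _Pid, _Reason} -> active(Last)
    end.

%% After take-over the standby serves requests from the last checkpoint.
active(Count) ->
    receive
        incr -> active(Count + 1);
        {get, From} -> From ! {count, Count}, active(Count)
    end.

run() ->
    Parent = self(),
    Standby = spawn(fun() -> standby_init(Parent) end),
    Primary = receive {primary, P} -> P end,
    Primary ! incr,
    Primary ! incr,
    Primary ! crash,
    %% From here on the client talks to the standby instead.
    Standby ! incr,
    Standby ! {get, self()},
    Result = receive {count, N} -> N end,
    exit(Standby, kill),   % tidy up the demo process
    Result.
```

standby_demo:run() should return 3: two increments checkpointed by the primary plus one handled after the take-over. Only what was checkpointed survives. Anything still sitting in the primary's mailbox, or done after its last checkpoint, is gone. Monitors also work on pids from other nodes, so the same shape could in principle span two nodes, but routing clients to the survivor is left out here.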

- Edmond -

On Mon, 03 Jan 2011 21:26:23 +1100, Mazen Harake  
<mazen.harake@REDACTED> wrote:

> Don't think of it as "taking over" but more as "recovering from" and  
> "minimizing the affected area".
>
> If process A dies you are not expected to "save" the state of A and  
> _transfer_ it to B and thus continue as if B was A. This doesn't make  
> any sense. The idea is rather that if A crashes then you spawn a new  
> process B which a) has not affected any other process (say C and D)  
> because there is no data corruption and b) when it starts it might read  
> in a persistent state (say from an ETS table) which it uses as a base to  
> continue an operation. This doesn't mean that you transferred the state  
> of A to B, you recovered from A by starting B. There is a difference  
> between types of state as well, e.g. a gen_server state is not  
> necessarily important to store in a resilient manner but perhaps the id  
> for the active session is because it can be used to recover and  
> repopulate the gen_server state.
>
> Some examples:
>
> * If you have an ongoing call with someone and there is a bug that  
> disconnects your call then someone else who is having a call should not  
> be affected. The fault tolerance part here is not that your call will  
> continue, it is that you are able to pick up the dial again and call the  
> user back, i.e. the system is still alive even though it suffered a minor  
> glitch in a particular process. (There are ways to keep the call alive  
> but I'm not sure they do that).
>
> * If you have say an IM session running (ejabberd) then you might have a  
> process per request/message/whatever. In this case perhaps your id,  
> session id and some other key data would be stored and shared on several  
> mnesia nodes but it isn't shared in the sense that it is used by many  
> processes, it is just persistent. This means that if a process which is  
> in the middle of something that has to do with that state crashes then  
> another (newly spawned process) can read the various keys and ids and  
> repopulate the state and continue.
>
> Share nothing in both cases means that wherever the data is, it is only  
> used by one process at a time. The first example might not have  
> persistent data that it keeps but it doesn't affect any other part of  
> the system if the call goes down. The second example has persistent data  
> but it doesn't mean it is shared among processes, it just means it  
> handles node disruptions so that new processes can continue.
>
> Now if you move on to Node level then these two scenarios still apply  
> but with small differences. The first example will cut off all calls  
> routed through that node but as soon as someone tries to call again it  
> would simply go through another node. The second example would  
> distribute the session state so that if a node goes down a new process  
> on another node can handle the continuation of the session. In the latter  
> case it is important to realize that the shared data (the state between  
> the nodes) is an obvious bottleneck but it is another type of shared  
> data because it is, through abstraction, only manipulated by 1 process.
>
> Makes sense? %-)
>
> On 03/01/2011 06:21, Edmond Begumisa wrote:
>> Thanks for your response.
>>
>> Firstly, let me make my question a little clearer...
>>
>> To rephrase: For processes, "share nothing for the sake of concurrency"  
>> - I get, both in concept and application. "Share nothing for the sake  
>> of fault-tolerance" - I get in concept but not in application.
>>
>> Yet as I understand it, it is for the latter reason Erlang shares  
>> nothing* and not the former. Interpretation: I must be completely  
>> missing the point in regards to Erlang processes and sharing nothing.  
>> This is what I want to understand in application. In addition to the  
>> "side" effect of sane concurrency (which coming from a chaotic  
>> multi-threading shared-memory world I fully appreciate and practically  
>> make use of everyday), how can I also make use of the "real" reason  
>> Erlang processes share nothing -- fault tolerance?  
>> Practically/illustratively speaking?
>>
>> *ETS being the obvious exception.
>>
>> Secondly, here's a mantra from Joe Armstrong...
>>
>> @ minute 17:26  
>> http://www.se-radio.net/2008/03/episode-89-joe-armstrong-on-erlang/
>>
>> "[message passing concurrency]... the original reasons have to do with  
>> fault tolerance... you have to copy all the data you need from computer  
>> 1 to computer 2... if computer 1 crashes you take over on computer 2...  
>> you can't have dangling pointers... that's the reason for copying  
>> everything... it's got nothing to do with concurrency, it's got a lot  
>> to do with fault-tolerance... if they don't crash you could just have a  
>> dangling pointer and copy less data but it won't work in the presence  
>> of errors..."
>>
>> I interpret this to mean that share-nothing between processes is more  
>> about replicating valid state than isolating corrupted state as you  
>> described.
>>
>> Indeed, Joe created an example on his blog...
>>
>> http://armstrongonsoftware.blogspot.com/2007/07/scalable-fault-tolerant-upgradable.html  
>> It's algorithm 3 there I'm struggling with. Particularly where he  
>> says...
>>
>> "... In practise we would send an asynchronous stream of messages from  
>> N to N+1 containing enough information to recover if things go wrong."
>>
>> Unfortunately, I couldn't find part II to that post (I don't think  
>> there is one.) And I'm too green and inexperienced in the field of  
>> fault-tolerant systems to figure it out on my own. I'm having trouble  
>> visualising the practical here from the conceptual -- I need to be  
>> shown how :(
>>
>> Also, I seem to be under the impression that the Erlang language has  
>> some sort of semantics to do this built-in (i.e. deal with one process  
>> taking over from another if the first fails) and this is the reason  
>> processes share nothing. This seems to me to be something different  
>> from supervision trees, which use exit-trapping to re-spawn if a  
>> process fails with the active 'job' disappearing and any errors logged  
>> (like restarting a daemon). My interpretation of the fault-tolerance  
>> Erlang is supposed to enable (for those in the know) is seamless  
>> take-over. The 'job' lives on but elsewhere.
>>
>> Using telecoms as an example: a phone call wouldn't be cut-off when a  
>> fault occurs, another node would seamlessly take over. This is how I  
>> interpreted Joe's post and other descriptions of Erlang's  
>> fault-tolerant features and I understand the key is in the share  
>> nothing policy for processes. I'm sure I've misunderstood something or  
>> everything :)
>>
>> - Edmond -
>>
>> PS: I've read the Manning draft. Great book. I don't know if the answer  
>> lies in OTP (I searched and didn't find it). I suspect it's lower --  
>> probably how you organise your processes. Some distributed-programming  
>> black-magic only Erlangers know about :)
>>
>>
>> On Mon, 03 Jan 2011 14:56:51 +1100, Alain O'Dea <alain.odea@REDACTED>  
>> wrote:
>>
>>> On 2011-01-02, at 22:36, "Edmond Begumisa"  
>>> <ebegumisa@REDACTED> wrote:
>>>
>>>> Slight correction...
>>>>
>>>> On Mon, 03 Jan 2011 12:38:38 +1100, Edmond Begumisa  
>>>> <ebegumisa@REDACTED> wrote:
>>>>
>>>>> Hello all,
>>>>>
>>>>> I've been trying to wrap my Erlang's fault tolerant features  
>>>>> particularly in relation to processes.
>>>>>
>>>>
>>>> Should be: I've been trying to wrap my head around Erlang's fault  
>>>> tolerant features particularly in relation to processes.
>>>>
>>>> Sorry.
>>>>
>>>>> I've heard/read repeatedly that the primary reason why Erlang's  
>>>>> designers opted for a share-nothing policy is not rooted in  
>>>>> concurrency but rather in fault-tolerance. When nothing is shared,  
>>>>> everything is copied. When everything is copied processes can take  
>>>>> over from one another when things fail. I follow this reasoning but  
>>>>> I don't follow how to apply it.
>>>>>
>>>>> I fully understand and appreciate how supervision trees are used to  
>>>>> restart processes if they fail. What I don't get is what to do when  
>>>>> you don't want to restart but want to take over, say on another  
>>>>> node. I know that at a higher-level, OTP has some  
>>>>> take-over/fail-over semantics (at the application level.) I'm  
>>>>> trying to understand things at the processes level - why Erlang is  
>>>>> the way it is so I can better use it to make my currently  
>>>>> fault-intolerant program fault tolerant.
>>>>>
>>>>> Specifically, how can one process take over from another if it  
>>>>> fails? It appears to me that the only way to do this would be to  
>>>>> somehow retrieve not only the state of the process (say,  
>>>>> gen_server's state) but also the messages in its mailbox. Where does  
>>>>> the design decision to share-nothing for the sake of fault-tolerance  
>>>>> come into play for processes? Please help me "get" this!
>>>>>
>>>>> Thanks in advance.
>>>>>
>>>>> - Edmond -
>>>>>
>>>>>
>>>
>>> Hi Edmond:
>>>
>>> Share-nothing helps with concurrent fault-tolerance by preventing one  
>>> process from corrupting the state of another. Receive is a process'  
>>> choice and it corrupts its own state if it receives bad data and lets  
>>> it in.
>>>
>>> AFAIK OTP fault-tolerance doesn't mean no requests will fail, it means  
>>> the system/sub-system will recover if a single request causes a  
>>> process to crash.  It's kind of like proper try/catch recovery applied  
>>> to concurrent code.  How you recover from the crash depends on the  
>>> supervision strategy chosen.  In some cases the supervisor can pass  
>>> the state to the replacement process. In others this isn't necessary  
>>> or even desirable since the state itself may involve resources lost in  
>>> the crash or corrupted state that led to the crash.
>>>
>>> I am straying outside my knowledge here so this paragraph is  
>>> guesswork.  The message queue for a gen_server need not necessarily be  
>>> lost when the callback module crashes.  In theory OTP could (and might  
>>> already) simply delegate the messages to the replacement process  
>>> following a crash.  Someone who knows OTP better than me would need to  
>>> weigh in here though.
>>>
>>> I found http://manning.com/logan very informative in understanding OTP  
>>> and its supervisor hierarchies.
>>>
>>> Cheers,
>>> Alain
>>> ________________________________________________________________
>>> erlang-questions (at) erlang.org mailing list.
>>> See http://www.erlang.org/faq.html
>>> To unsubscribe; mailto:erlang-questions-unsubscribe@REDACTED
>>>
>>
>>

