[erlang-questions] Processes & Fault Tolerance

Edmond Begumisa ebegumisa@REDACTED
Mon Jan 3 06:21:07 CET 2011


Thanks for your response.

Firstly, let me make my question a little clearer...

To rephrase: For processes, "share nothing for the sake of concurrency" -  
I get, both in concept and application. "Share nothing for the sake of  
fault-tolerance" - I get in concept but not in application.

Yet as I understand it, it is for the latter reason that Erlang shares
nothing*, not the former. Interpretation: I must be completely missing the
point with regard to Erlang processes and sharing nothing. This is what I
want to understand in application. In addition to the "side" effect of sane
concurrency (which, coming from a chaotic multi-threading shared-memory
world, I fully appreciate and make practical use of every day), how can I
also make use of the "real" reason Erlang processes share nothing -- fault
tolerance? Practically/illustratively speaking?

*ETS being the obvious exception.

Secondly, here's a mantra from Joe Armstrong...

@ minute 17:26  
http://www.se-radio.net/2008/03/episode-89-joe-armstrong-on-erlang/

"[message passing concurrency]... the original reasons have to do with  
fault tolerance... you have to copy all the data you need from computer 1  
to computer 2... if computer 1 crashes you take over on computer 2... you  
can't have dangling pointers... that's the reason for copying  
everything... it's got nothing to do with concurrency, it's got a lot to  
do with fault-tolerance... if they don't crash you could just have a  
dangling pointer and copy less data but it won't work in the presence of  
errors..."

I interpret this to mean that share-nothing between processes is more  
about replicating valid state than isolating corrupted state as you  
described.

Indeed, Joe created an example on his blog...

http://armstrongonsoftware.blogspot.com/2007/07/scalable-fault-tolerant-upgradable.html

It's algorithm 3 there I'm struggling with. Particularly where he says...

"... In practise we would send an asynchronous stream of messages from N  
to N+1 containing enough information to recover if things go wrong."

Unfortunately, I couldn't find part II to that post (I don't think there  
is one.) And I'm too green and inexperienced in the field of  
fault-tolerant systems to figure it out on my own. I'm having trouble  
visualising the practical here from the conceptual -- I need to be shown  
how :(
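
As a rough illustration of the sentence quoted above, here is a minimal
sketch of a worker that streams checkpoints to a standby process, which
takes over when the worker dies. It is not Joe's code and not an OTP
facility. The module name, message shapes and the counter "job" are all
hypothetical, chosen only to keep the example small.

```erlang
-module(takeover_sketch).
-export([start/0]).

%% All names here are hypothetical; this only illustrates the idea of
%% "N sends N+1 enough information to recover".

%% Start the standby first, then a worker that knows the standby's pid.
start() ->
    Standby = spawn(fun() -> standby(0) end),
    Worker = spawn(fun() -> worker(Standby, 0) end),
    Standby ! {watch, Worker},
    {Worker, Standby}.

%% The worker's "job" is a counter. After each step it sends a copy of
%% its new state to the standby and carries on without awaiting a reply.
worker(Standby, Count) ->
    receive
        step ->
            NewCount = Count + 1,
            Standby ! {checkpoint, NewCount},
            worker(Standby, NewCount);
        {get, From} ->
            From ! {count, Count},
            worker(Standby, Count);
        crash ->
            exit(simulated_fault)
    end.

%% The standby keeps only the most recent checkpoint. When the monitored
%% worker goes down, it continues the job from that copy.
standby(Last) ->
    receive
        {watch, Worker} ->
            erlang:monitor(process, Worker),
            standby(Last);
        {checkpoint, State} ->
            standby(State);
        {'DOWN', _Ref, process, _Pid, _Reason} ->
            active(Last)
    end.

%% After take-over the standby serves requests itself (with no further
%% standby behind it, to keep the sketch short).
active(Count) ->
    receive
        step -> active(Count + 1);
        {get, From} -> From ! {count, Count}, active(Count)
    end.
```

Because the checkpoint is a copy, nothing in the standby refers back to
the dead worker's memory. The sketch leaves real problems unsolved:
messages still sitting in the dead worker's mailbox are lost, and callers
have to find out by some other means (a registered name, for instance)
that they should now talk to the standby. The same shape should carry over
to two nodes by spawning the standby on the other node, since monitors
work across nodes.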

Also, I seem to be under the impression that the Erlang language has some
sort of built-in semantics for this (i.e. for one process taking
over from another if the first fails) and this is the reason processes
share nothing. This seems to me to be something different from supervision  
trees, which use exit-trapping to re-spawn if a process fails with the  
active 'job' disappearing and any errors logged (like restarting a  
daemon). My interpretation of the fault-tolerance Erlang is supposed to  
enable (for those in the know) is seamless take-over. The 'job' lives on  
but elsewhere.
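
For contrast, the restart behaviour described above can be approximated
by hand in a few lines. This is only a sketch with hypothetical names. Real
code would use the OTP supervisor behaviour rather than a hand-rolled
loop. The point it shows is that the replacement starts from the initial
state, so the old 'job' is gone.

```erlang
-module(restart_sketch).
-export([start/0]).

%% Hypothetical hand-rolled restarter, roughly what a supervisor does in
%% spirit. Names and message shapes are made up for this example.
start() ->
    spawn(fun() ->
        process_flag(trap_exit, true),
        keeper(start_job())
    end).

start_job() ->
    spawn_link(fun() -> job(0) end).

%% The keeper traps exits. When the linked job dies it starts a fresh one
%% from the initial state; whatever the old job held is not recovered.
keeper(Job) ->
    receive
        {'EXIT', Job, _Reason} ->
            keeper(start_job());
        {job_pid, From} ->
            From ! {job_pid, Job},
            keeper(Job)
    end.

%% The job itself: a counter that can be told to crash.
job(Count) ->
    receive
        step -> job(Count + 1);
        {get, From} -> From ! {count, Count}, job(Count);
        crash -> exit(simulated_fault)
    end.
```

After a crash, asking the keeper for the job pid returns a new process
whose count is back at 0. That is the daemon-style restart, as opposed to
the seamless take-over where the count would survive.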

Using telecoms as an example: a phone call wouldn't be cut off when a
fault occurs; another node would seamlessly take over. This is how I
interpreted Joe's post and other descriptions of Erlang's fault-tolerant
features, and I understand the key is in the share-nothing policy for
processes. I'm sure I've misunderstood something or everything :)

- Edmond -

PS: I've read the Manning draft. Great book. I don't know if the answer  
lies in OTP (I searched and didn't find it). I suspect it's lower --
probably how you organise your processes. Some distributed-programming
black magic only Erlangers know about :)


On Mon, 03 Jan 2011 14:56:51 +1100, Alain O'Dea <alain.odea@REDACTED>  
wrote:

> On 2011-01-02, at 22:36, "Edmond Begumisa" <ebegumisa@REDACTED>  
> wrote:
>
>> Slight correction...
>>
>> On Mon, 03 Jan 2011 12:38:38 +1100, Edmond Begumisa  
>> <ebegumisa@REDACTED> wrote:
>>
>>> Hello all,
>>>
>>> I've been trying to wrap my Erlang's fault tolerant features  
>>> particularly in relation to processes.
>>>
>>
>> Should be: I've been trying to wrap my head around Erlang's fault  
>> tolerant features particularly in relation to processes.
>>
>> Sorry.
>>
>>> I've heard/read repeatedly that the primary reason why Erlang's  
>>> designers opted for a share-nothing policy is not rooted in  
>>> concurrency but rather in fault-tolerance. When nothing is shared,  
>>> everything is copied. When everything is copied processes can take  
>>> over from one another when things fail. I follow this reasoning but I  
>>> don't follow how to apply it.
>>>
>>> I fully understand and appreciate how supervision trees are used to  
>>> restart processes if they fail. What I don't get is what to do when  
>>> you don't want to restart but want to take over, say on another node.  
>>> I know that at a higher level, OTP has some take-over/fail-over
>>> semantics (at the application level). I'm trying to understand things
>>> at the process level - why Erlang is the way it is so I can better
>>> use it to make my currently fault-intolerant program fault tolerant.
>>>
>>> Specifically, how can one process take over from another if it fails?  
>>> It appears to me that the only way to do this would be to somehow
>>> retrieve not only the state of the process (say, gen_server's state)  
>>> but also the messages in its mailbox. Where does the design decision  
>>> to share-nothing for the sake of fault-tolerance come into play for  
>>> processes? Please help me "get" this!
>>>
>>> Thanks in advance.
>>>
>>> - Edmond -
>>>
>>>
>
> Hi Edmond:
>
> Share-nothing helps with concurrent fault-tolerance by preventing one  
> process from corrupting the state of another. Receive is a process'  
> choice and it corrupts its own state if it receives bad data and lets it  
> in.
>
> AFAIK OTP fault-tolerance doesn't mean no requests will fail, it means  
> the system/sub-system will recover if a single request causes a process  
> to crash.  It's kind of like proper try/catch recovery applied to  
> concurrent code.  How you recover from the crash depends on the  
> supervision strategy chosen.  In some cases the supervisor can pass the  
> state to the replacement process. In others this isn't necessary or even  
> desirable since the state itself may involve resources lost in the crash  
> or corrupted state that led to the crash.
>
> I am straying outside my knowledge here so this paragraph is guesswork.   
> The message queue for a gen_server need not necessarily be lost when the  
> callback module crashes.  In theory OTP could (and might already) simply  
> delegate the messages to the replacement process following a crash.   
> Someone who knows OTP better than me would need to weigh in here though.
>
> I found http://manning.com/logan very informative in understanding OTP  
> and its supervisor hierarchies.
>
> Cheers,
> Alain



