[erlang-questions] Processes & Fault Tolerance

Mon Jan 3 04:56:51 CET 2011

On 2011-01-02, at 22:36, "Edmond Begumisa" <ebegumisa@REDACTED> wrote:

> Slight correction...
> 
> On Mon, 03 Jan 2011 12:38:38 +1100, Edmond Begumisa <ebegumisa@REDACTED> wrote:
> 
>> Hello all,
>> 
>> I've been trying to wrap my Erlang's fault tolerant features particularly in relation to processes.
>> 
> 
> Should be: I've been trying to wrap my head around Erlang's fault tolerant features particularly in relation to processes.
> 
> Sorry.
> 
>> I've heard/read repeatedly that the primary reason why Erlang's designers opted for a share-nothing policy is not rooted in concurrency but rather in fault-tolerance. When nothing is shared, everything is copied. When everything is copied processes can take over from one another when things fail. I follow this reasoning but I don't follow how to apply it.
>> 
>> I fully understand and appreciate how supervision trees are used to restart processes if they fail. What I don't get is what to do when you don't want to restart but want to take over, say on another node. I know that at a higher level, OTP has some take-over/fail-over schematics (at the application level). I'm trying to understand things at the process level - why Erlang is the way it is so I can better use it to make my currently fault-intolerant program fault tolerant.
>> 
>> Specifically, how can one process take over from another if it fails? It appears to me that the only way to do this would be to somehow retrieve not only the state of the process (say, gen_server's state) but also the messages in its mailbox. Where does the design decision to share-nothing for the sake of fault-tolerance come into play for processes? Please help me "get" this!
>> 
>> Thanks in advance.
>> 
>> - Edmond -
>> 
>> 

Hi Edmond:

Share-nothing helps with concurrent fault-tolerance by preventing one process from corrupting the state of another. What a process receives is its own choice, so a process can only corrupt its own state, and only by accepting bad data and letting it in.
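
A minimal sketch of that isolation, assuming made-up names (isolation_demo, holder, worker) that are not from this thread: the worker gets a copy of the holder's list in a message, mangles the copy and exits, and the holder still answers with its original list.

```erlang
-module(isolation_demo).
-export([run/0]).

%% Hypothetical demo: the holder owns a list, the worker only ever sees a copy.
run() ->
    Holder = spawn(fun() -> holder([1, 2, 3]) end),
    {Worker, Ref} = spawn_monitor(fun() -> worker(Holder) end),
    receive
        {'DOWN', Ref, process, Worker, Reason} ->
            %% The worker is gone; ask the holder what it still has.
            Holder ! {get, self()},
            receive
                {state, State} ->
                    Holder ! stop,
                    {worker_exit, Reason, holder_state, State}
            end
    end.

holder(State) ->
    receive
        {get, From} -> From ! {state, State}, holder(State);
        stop -> ok
    end.

worker(Holder) ->
    Holder ! {get, self()},
    receive
        {state, Copy} ->
            %% Whatever happens to Copy stays inside this process.
            _Mangled = lists:reverse(Copy),
            exit(crashed_on_purpose)
    end.
```

Calling isolation_demo:run() returns {worker_exit, crashed_on_purpose, holder_state, [1,2,3]}.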

AFAIK OTP fault-tolerance doesn't mean that no request will ever fail; it means the system or sub-system will recover when a single request causes a process to crash.  It's kind of like proper try/catch recovery applied to concurrent code.  How you recover from the crash depends on the supervision strategy chosen.  In some cases the replacement process can be handed the old state again, provided a copy was kept somewhere that survives the crash (a stock OTP supervisor simply restarts the child from its start arguments).  In others this isn't necessary or even desirable, since the state may include resources lost in the crash, or may be the very corruption that led to it.
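
As a rough sketch of the restart case, assuming hypothetical names (restart_demo, restart_demo_sup, counter) and a bare spawn_link'd worker instead of a real gen_server so that everything fits in one module: a one_for_one supervisor restarts a crashed counter, and the replacement starts again from its initial state.

```erlang
-module(restart_demo).
-behaviour(supervisor).
-export([demo/0, start_link/0, init/1, start_counter/0, bump/0]).

start_link() ->
    supervisor:start_link({local, restart_demo_sup}, ?MODULE, []).

%% supervisor callback: one permanent child, restarted on its own when it dies.
init([]) ->
    SupFlags = #{strategy => one_for_one, intensity => 3, period => 10},
    Child = #{id => counter,
              start => {?MODULE, start_counter, []},
              restart => permanent},
    {ok, {SupFlags, [Child]}}.

%% Child start function: runs in the supervisor, links to it, returns {ok, Pid}.
start_counter() ->
    Pid = spawn_link(fun() -> loop(0) end),
    true = register(counter, Pid),
    {ok, Pid}.

loop(N) ->
    receive
        {bump, From} -> From ! {count, N + 1}, loop(N + 1);
        crash -> exit(simulated_bug)
    end.

bump() ->
    counter ! {bump, self()},
    receive {count, N} -> N end.

demo() ->
    {ok, Sup} = start_link(),
    1 = bump(),
    Before = bump(),
    Old = whereis(counter),
    Ref = monitor(process, Old),
    counter ! crash,
    receive {'DOWN', Ref, process, Old, simulated_bug} -> ok end,
    ok = wait_until_registered(),
    AfterRestart = bump(),
    %% Clean up: stop the supervisor without taking this process down with it.
    SupRef = monitor(process, Sup),
    unlink(Sup),
    exit(Sup, shutdown),
    receive {'DOWN', SupRef, process, Sup, _} -> ok end,
    {before_crash, Before, after_restart, AfterRestart}.

%% Poll until the supervisor has registered the replacement.
wait_until_registered() ->
    case whereis(counter) of
        undefined -> timer:sleep(10), wait_until_registered();
        _Pid -> ok
    end.
```

restart_demo:demo() returns {before_crash, 2, after_restart, 1}: the count is back at its starting value because the supervisor only re-runs the start function. Carrying state across the crash would need a copy kept outside the worker, which this sketch deliberately leaves out.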

I am straying outside my knowledge here, so this paragraph is guesswork.  The message queue of a gen_server need not be lost when the callback module crashes.  In theory OTP could (and might already) simply hand the queued messages over to the replacement process following a crash.  Someone who knows OTP better than I do would need to weigh in here, though.
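
One way to probe this at the plain-process level is a small experiment. The sketch below uses hypothetical names (mailbox_check, doomed) and bare processes rather than a gen_server, so it only shows what the runtime does with a dead process's mailbox, not what a behaviour layered on top could add.

```erlang
-module(mailbox_check).
-export([run/0]).

%% Hypothetical experiment: queue messages for a process that exits before
%% reading them, then look at the mailbox of a freshly spawned replacement.
run() ->
    {Old, Ref} = spawn_monitor(fun doomed/0),
    Old ! crash,
    Old ! {job, 1},
    Old ! {job, 2},
    Old ! go,
    receive {'DOWN', Ref, process, Old, Reason} -> ok end,
    Parent = self(),
    New = spawn(fun() ->
                    Parent ! {self(), process_info(self(), message_queue_len)}
                end),
    receive
        {New, {message_queue_len, Len}} ->
            {old_exit, Reason, replacement_queue_len, Len}
    end.

%% Waits for go (sent last, so everything else is already queued), then
%% exits on crash while both {job, _} messages are still unread.
doomed() ->
    receive go -> ok end,
    receive crash -> exit(simulated_bug) end.
```

mailbox_check:run() returns {old_exit, simulated_bug, replacement_queue_len, 0}: a mailbox belongs to its process and goes away with it, so any hand-over of pending messages has to be arranged by something that outlives the crash.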

I found http://manning.com/logan very informative in understanding OTP and its supervisor hierarchies.

Cheers,
Alain