Processes & Fault Tolerance

Mon Jan 3 02:38:38 CET 2011

Hello all,

I've been trying to wrap my head around Erlang's fault-tolerance features,
particularly in relation to processes.

I've heard/read repeatedly that the primary reason why Erlang's designers  
opted for a share-nothing policy is not rooted in concurrency but rather  
in fault tolerance. When nothing is shared, everything is copied. When
everything is copied, processes can take over from one another when things
fail. I follow this reasoning but I don't follow how to apply it.

I fully understand and appreciate how supervision trees are used to  
restart processes if they fail. What I don't get is what to do when you  
don't want to restart but want to take over, say on another node. I know
that at a higher level, OTP has some take-over/fail-over semantics (at
the application level). I'm trying to understand things at the process
level - why Erlang is the way it is - so I can better use it to make my
currently fault-intolerant program fault-tolerant.

Specifically, how can one process take over from another if it fails? It
appears to me that the only way to do this would be to somehow retrieve
not only the state of the process (say, gen_server's state) but also the
messages in its mailbox. Where does the design decision to share-nothing  
for the sake of fault-tolerance come into play for processes? Please help  
me "get" this!

Thanks in advance.

- Edmond -
