[erlang-questions] Processes & Fault Tolerance

Mazen Harake <>
Mon Jan 3 11:26:23 CET 2011

Don't think of it as "taking over" but more about "recovering from" and 
"minimizing affect area".

If process A dies you are not expected to "save" the state of A and 
_transfer_ it to B and thus continue as if B was A. This doesn't make 
any sense. The idea is rather that if A crashes then you spawn a new 
process B which a) has unaffected any other process (say C and D) 
because there is no data corruption and b) when it starts it might read 
in a persistent state (say from an ETS table) which it uses as a base to 
continue an operation. This doesn't mean that you transferred the state 
of A to B, you recovered from A by starting B. There is a difference 
between types of state as well, E.g. a gen_server state is not 
necessarily important to store in a resilient matter but perhaps the id 
for the active session is because it can be used to recover and 
repopulate the gen_server state.

Some examples:

* If you have an ongoing call with someone and there is a bug that 
disconnects your call then someone else who is having a call should not 
be affected. The fault tolerance part here is not that your call will 
continue, it is that you are able to pick up the dial again and call the 
user back I.e. the system is still alive even though it suffered a minor 
glitch in a particular process. (There are ways to keep the call alive 
but I'm not sure they do that).

* If you have say an IM session running (ejabberd) then you might have a 
process per request/message/whatever. In this case perhaps your id, 
session id and some other key data would be stored and shared on several 
mnesia nodes but it isn't shared in the sense that it is used by many 
processes, it is just persistent. This means that if a process which is 
in the middle of something that has to do with that state crashes then 
another (newly spawned process) can read the various keys and ids and 
repopulate the state and continue.

Share nothing in both cases mean that where ever the data is, it is only 
used by one process at a time. The first example might not have 
persistent data that it keeps but it doesn't affect any other part of 
the system if the call goes down. The second example has persistent data 
but it doesn't mean it is shared among processes it just means it 
handles node disruptions so that new ones can continue.

Now if you move on to Node level then these two scenarios still apply 
but with small differences. The first example will cut off all calls 
routed through that node but as soon as some would try to call again it 
would simply go through another node. The second example would 
distribute the session state so that if a node goes down a new process 
on another node can handle the continuation of the session. In the later 
case it is important to realize that the shared data (the state between 
the nodes) is an obvious bottle neck but it is another type of shared 
data because it is, through abstraction, only manipulated by 1 process.

Makes sense? %-)

On 03/01/2011 06:21, Edmond Begumisa wrote:
> Thanks for your response.
> Firstly, let me make my question a little clearer...
> To rephrase: For processes, "share nothing for the sake of 
> concurrency" - I get, both in concept and application. "Share nothing 
> for the sake of fault-tolerance" - I get in concept but not in 
> application.
> Yet as I understand it, it is for the latter reason Erlang shares 
> nothing* and not the former. Interpretation: I must be completely 
> missing the point in regards to Erlang processes and sharing nothing. 
> This is what I want to understand in application. In addition to the 
> "side" effect of sane concurrency (which coming from a chaotic 
> multi-threading shared-memory world I fully appreciate and practically 
> make use of everyday), how can I also make use of the "real" reason 
> Erlang processes share nothing -- fault tolerance? 
> Practically/illustratively speaking?
> *ETS being the obvious exception.
> Secondly, here's a mantra from Joe Armstrong...
> @ minute 17:26 
> http://www.se-radio.net/2008/03/episode-89-joe-armstrong-on-erlang/
> "[message passing concurrency]... the original reasons have to do with 
> fault tolerance... you have to copy all the data you need from 
> computer 1 to computer 2... if computer 1 crashes you take over on 
> computer 2... you can't have dangling pointers... that's the reason 
> for copying everything... it's got nothing to do with concurrency, 
> it's got a lot to do with fault-tolerance... if they don't crash you 
> could just have a dangling pointer and copy less data but it won't 
> work in the presence of errors..."
> I interpret this to mean that share-nothing between processes is more 
> about replicating valid state than isolating corrupted state as you 
> described.
> Indeed, Joe created an example on his blog...
> http://armstrongonsoftware.blogspot.com/2007/07/scalable-fault-tolerant-upgradable.html 
> It's algorithm 3 there I'm struggling with. Particularly where he says...
> "... In practise we would send an asynchronous stream of messages from 
> N to N+1 containing enough information to recover if things go wrong."
> Unfortunately, I couldn't find part II to that post (I don't think 
> there is one.) And I'm too green and inexperienced in the field of 
> fault-tolerant systems to figure it out on my own. I'm having trouble 
> visualising the practical here from the conceptual -- I need to be 
> shown how :(
> Also, I seem to be under the impression that the Erlang language has 
> some sort of schematics to do this built-in (i.e. deal with one 
> process taking over from another if the first fails) and this is the 
> reason processes share nothing. This seems to me to be something 
> different from supervision trees, which use exit-trapping to re-spawn 
> if a process fails with the active 'job' disappearing and any errors 
> logged (like restarting a daemon). My interpretation of the 
> fault-tolerance Erlang is supposed to enable (for those in the know) 
> is seamless take-over. The 'job' lives on but elsewhere.
> Using telecoms as an example: a phone call wouldn't be cut-off when a 
> fault occurs, another node would seamlessly take over. This is how I 
> interpreted Joe's post and other descriptions of Erlang's 
> fault-tolerant features and I understand the key is in the share 
> nothing policy for processes. I'm sure I've mis-understood something 
> or everything :)
> - Edmond -
> PS: I've read the Manning draft. Great book. I don't know if the 
> answer lies in OTP (I searched and didn't find it). I suspect it's 
> lower -- probalby how you organise your processes. Some 
> distributed-programming black-magic only Erlanger's know about :)
> On Mon, 03 Jan 2011 14:56:51 +1100, Alain O'Dea <> 
> wrote:
>> On 2011-01-02, at 22:36, "Edmond Begumisa" 
>> <> wrote:
>>> Slight correction...
>>> On Mon, 03 Jan 2011 12:38:38 +1100, Edmond Begumisa 
>>> <> wrote:
>>>> Hello all,
>>>> I've been trying to wrap my Erlang's fault tolerant features 
>>>> particularly in relation to processes.
>>> Should be: I've been trying to wrap my head around Erlang's fault 
>>> tolerant features particularly in relation to processes.
>>> Sorry.
>>>> I've heard/read repeatedly that the primary reason why Erlang's 
>>>> designers opted for a share-nothing policy is not rooted in 
>>>> concurrency but rather in fault-tolerance. When nothing is shared, 
>>>> everything is copied. When everything is copied processes can take 
>>>> over from one another when things fail. I follow this reasoning but 
>>>> I don't follow how to apply it.
>>>> I fully understand and appreciate how supervision trees are used to 
>>>> restart processes if they fail. What I don't get is what to do when 
>>>> you don't want to restart but want to take over, say on another 
>>>> node. I know that at a higher-level, OTP has some 
>>>> take-over/fail-over schematics (at the application level.) I'm 
>>>> trying to understand things at the processes level - why Erlang is 
>>>> the way it is so I can better use it to make my currently 
>>>> fault-intolerant program fault tolerant.
>>>> Specifically, how can one process take over from another if it 
>>>> fails? It appears to may that the only way to do this would be to 
>>>> somehow retrieve not only the state of the process (say, 
>>>> gen_server's state) but also the messages in its mailbox. Where 
>>>> does the design decision to share-nothing for the sake of 
>>>> fault-tolerance come into play for processes? Please help me "get" 
>>>> this!
>>>> Thanks in advance.
>>>> - Edmond -
>> Hi Edmond:
>> Share-nothing helps with concurrent fault-tolerance by preventing one 
>> process from corrupting the state of another. Receive is a process' 
>> choice and it corrupts its own state if it receives bad data and lets 
>> it in.
>> AFAIK OTP fault-tolerance doesn't mean no requests will fail, it 
>> means the system/sub-system will recover if a single request causes a 
>> process to crash.  It's kind of like proper try/catch recovery 
>> applied to concurrent code.  How you recover from the crash depends 
>> on the supervision strategy chosen.  In some cases the supervisor can 
>> pass the state to the replacement process. In others this isn't 
>> necessary or even desirable since the state itself may involve 
>> resources lost in the crash or corrupted state that led to the crash.
>> I am straying outside my knowledge here so this paragraph is 
>> guesswork.  The message queue for a gen_server need not necessarily 
>> be lost when the callback module crashes.  In theory OTP could (and 
>> might already) simply delegate the messages to the replacement 
>> process following a crash.  Someone who knows OTP better than me 
>> would need to weigh in here though.
>> I found http://manning.com/logan very informative in understanding 
>> OTP and its supervisor hierarchies.
>> Cheers,
>> Alain
>> ________________________________________________________________
>> erlang-questions (at) erlang.org mailing list.
>> See http://www.erlang.org/faq.html
>> To unsubscribe; mailto:

More information about the erlang-questions mailing list