[erlang-questions] Processes & Fault Tolerance
Edmond Begumisa
ebegumisa@REDACTED
Mon Jan 3 17:21:24 CET 2011
Thanks. I see what you're saying... so share-nothing is more about
fault-isolation than fault-tolerance. Isolate the fault to affect as few
processes as possible rather than deal with the fault then continue.
So there's no natural way I can arrange a group of active redundant processes
on different nodes sharing replicated state, ready to take over if one
fails. Like what distributed fault-tolerant databases do with disks, but
instead of nodes with replicated disks, I have nodes with replicated
processes.
- Edmond -
On Mon, 03 Jan 2011 21:26:23 +1100, Mazen Harake
<mazen.harake@REDACTED> wrote:
> Don't think of it as "taking over" but more as "recovering from" and
> "minimizing the affected area".
>
> If process A dies you are not expected to "save" the state of A and
> _transfer_ it to B and thus continue as if B was A. This doesn't make
> any sense. The idea is rather that if A crashes then you spawn a new
> process B. Then a) the crash has not affected any other process (say C
> and D) because there is no data corruption and b) when B starts it
> might read
> in a persistent state (say from an ETS table) which it uses as a base to
> continue an operation. This doesn't mean that you transferred the state
> of A to B, you recovered from A by starting B. There is a difference
> between types of state as well, e.g. a gen_server state is not
> necessarily important to store in a resilient manner but perhaps the id
> for the active session is because it can be used to recover and
> repopulate the gen_server state.
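
A minimal sketch of the "recover by starting B" idea above, not anyone's production code. The module name session_worker, the table name session_store and the counter state are all made up for illustration. It assumes the ETS table is owned by some longer-lived process (here, whoever calls demo/0), since an ETS table disappears when its owner dies. Each start rebuilds the state from the stored value; nothing is handed over by the crashed process.

```erlang
-module(session_worker).
-behaviour(gen_server).

-export([start_link/1, bump/1, count/1, demo/0]).
-export([init/1, handle_call/3, handle_cast/2]).

%% All names here (session_worker, session_store, bump, count) are
%% hypothetical and exist only for this illustration.

start_link(SessionId) ->
    gen_server:start_link(?MODULE, SessionId, []).

bump(Pid)  -> gen_server:call(Pid, bump).
count(Pid) -> gen_server:call(Pid, count).

%% On every (re)start, rebuild the state from whatever was stored.
init(SessionId) ->
    Count = case ets:lookup(session_store, SessionId) of
                [{SessionId, Saved}] -> Saved;
                []                   -> 0
            end,
    {ok, {SessionId, Count}}.

handle_call(bump, _From, {SessionId, Count}) ->
    New = Count + 1,
    %% Store only the part that is needed for recovery.
    true = ets:insert(session_store, {SessionId, New}),
    {reply, New, {SessionId, New}};
handle_call(count, _From, {_SessionId, Count} = State) ->
    {reply, Count, State}.

handle_cast(_Msg, State) ->
    {noreply, State}.

%% Process A is killed; process B starts from the stored value.
%% gen_server:start/3 (unlinked) is used so the kill does not also
%% take down the caller, which owns the table in this demo.
demo() ->
    session_store = ets:new(session_store, [named_table, public, set]),
    {ok, A} = gen_server:start(?MODULE, s1, []),
    1 = bump(A),
    2 = bump(A),
    exit(A, kill),
    {ok, B} = gen_server:start(?MODULE, s1, []),
    2 = count(B),
    ok = gen_server:stop(B),
    true = ets:delete(session_store),
    ok.
```

In a real system a supervisor would do the restarting via start_link/1; the demo does it by hand only to keep the example in one module.
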
>
> Some examples:
>
> * If you have an ongoing call with someone and there is a bug that
> disconnects your call then someone else who is having a call should not
> be affected. The fault tolerance part here is not that your call will
> continue, it is that you are able to pick up the phone again and call the
> user back, i.e. the system is still alive even though it suffered a minor
> glitch in a particular process. (There are ways to keep the call alive
> but I'm not sure they do that).
>
> * If you have say an IM session running (ejabberd) then you might have a
> process per request/message/whatever. In this case perhaps your id,
> session id and some other key data would be stored and shared on several
> mnesia nodes but it isn't shared in the sense that it is used by many
> processes, it is just persistent. This means that if a process which is
> in the middle of something that has to do with that state crashes then
> another (newly spawned process) can read the various keys and ids and
> repopulate the state and continue.
>
> Share-nothing in both cases means that wherever the data is, it is only
> used by one process at a time. The first example might not have
> persistent data that it keeps but it doesn't affect any other part of
> the system if the call goes down. The second example has persistent data
> but that doesn't mean it is shared among processes; it just means it
> handles disruptions so that new processes can continue.
>
> Now if you move on to Node level then these two scenarios still apply
> but with small differences. The first example will cut off all calls
> routed through that node but as soon as someone tries to call again it
> would simply go through another node. The second example would
> distribute the session state so that if a node goes down a new process
> on another node can handle the continuation of the session. In the latter
> case it is important to realize that the shared data (the state between
> the nodes) is an obvious bottleneck but it is another type of shared
> data because it is, through abstraction, only manipulated by one process.
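
One way to keep such session keys available on several nodes is a replicated Mnesia table. The sketch below is only an illustration: the module session_db, the session record and the function names are invented here, and it assumes Mnesia is already connected across whatever nodes are passed to init/1 (with [node()] it runs on a single node).

```erlang
-module(session_db).
-export([init/1, save/2, restore/1]).

%% Hypothetical record: only the keys needed to rebuild a session.
-record(session, {id, data}).

%% Nodes is the list of nodes that should hold an in-memory replica.
init(Nodes) ->
    ok = mnesia:start(),
    case mnesia:create_table(session,
                             [{attributes, record_info(fields, session)},
                              {ram_copies, Nodes}]) of
        {atomic, ok}                         -> ok;
        {aborted, {already_exists, session}} -> ok
    end.

%% Called by whichever process currently handles the session.
save(Id, Data) ->
    {atomic, ok} = mnesia:transaction(
                     fun() ->
                         mnesia:write(#session{id = Id, data = Data})
                     end),
    ok.

%% Called by a newly spawned process, possibly on another node,
%% to repopulate its state and continue.
restore(Id) ->
    case mnesia:transaction(fun() -> mnesia:read(session, Id) end) of
        {atomic, [#session{data = Data}]} -> {ok, Data};
        {atomic, []}                      -> not_found
    end.
```

The table is replicated, but every access goes through a transaction, so from the point of view of the processes nothing is shared directly.
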
>
> Makes sense? %-)
>
> On 03/01/2011 06:21, Edmond Begumisa wrote:
>> Thanks for your response.
>>
>> Firstly, let me make my question a little clearer...
>>
>> To rephrase: For processes, "share nothing for the sake of concurrency"
>> - I get, both in concept and application. "Share nothing for the sake
>> of fault-tolerance" - I get in concept but not in application.
>>
>> Yet as I understand it, it is for the latter reason Erlang shares
>> nothing* and not the former. Interpretation: I must be completely
>> missing the point in regards to Erlang processes and sharing nothing.
>> This is what I want to understand in application. In addition to the
>> "side" effect of sane concurrency (which coming from a chaotic
>> multi-threading shared-memory world I fully appreciate and practically
>> make use of every day), how can I also make use of the "real" reason
>> Erlang processes share nothing -- fault tolerance?
>> Practically/illustratively speaking?
>>
>> *ETS being the obvious exception.
>>
>> Secondly, here's a mantra from Joe Armstrong...
>>
>> @ minute 17:26
>> http://www.se-radio.net/2008/03/episode-89-joe-armstrong-on-erlang/
>>
>> "[message passing concurrency]... the original reasons have to do with
>> fault tolerance... you have to copy all the data you need from computer
>> 1 to computer 2... if computer 1 crashes you take over on computer 2...
>> you can't have dangling pointers... that's the reason for copying
>> everything... it's got nothing to do with concurrency, it's got a lot
>> to do with fault-tolerance... if they don't crash you could just have a
>> dangling pointer and copy less data but it won't work in the presence
>> of errors..."
>>
>> I interpret this to mean that share-nothing between processes is more
>> about replicating valid state than isolating corrupted state as you
>> described.
>>
>> Indeed, Joe created an example on his blog...
>>
>> http://armstrongonsoftware.blogspot.com/2007/07/scalable-fault-tolerant-upgradable.html
>> It's algorithm 3 there I'm struggling with. Particularly where he
>> says...
>>
>> "... In practise we would send an asynchronous stream of messages from
>> N to N+1 containing enough information to recover if things go wrong."
>>
>> Unfortunately, I couldn't find part II to that post (I don't think
>> there is one.) And I'm too green and inexperienced in the field of
>> fault-tolerant systems to figure it out on my own. I'm having trouble
>> visualising the practical here from the conceptual -- I need to be
>> shown how :(
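
One reading of the "asynchronous stream of messages from N to N+1" is a primary process that sends a checkpoint to a standby after every change, with the standby monitoring the primary and serving from the last checkpoint once it sees the primary go down. What follows is a sketch under that assumption, not the algorithm from the blog post. The module name standby, the message shapes and the running-total state are invented, and telling clients to switch to the standby is left out. Both processes run on one node here; spawn/2 with a node name would place the standby elsewhere.

```erlang
-module(standby).
-export([start/0, add/2, total/1, demo/0]).

%% Hypothetical example: the "job" is just a running total.
start() ->
    Backup  = spawn(fun backup_wait/0),
    Primary = spawn(fun() -> primary(Backup, 0) end),
    Backup ! {watch, Primary},
    {Primary, Backup}.

add(Pid, X) -> Pid ! {add, X}, ok.

total(Pid) ->
    Pid ! {total, self()},
    receive {total, Pid, T} -> T after 1000 -> timeout end.

primary(Backup, Total) ->
    receive
        {add, X} ->
            New = Total + X,
            Backup ! {checkpoint, New},   % asynchronous, no reply awaited
            primary(Backup, New);
        {total, From} ->
            From ! {total, self(), Total},
            primary(Backup, Total)
    end.

backup_wait() ->
    receive
        {watch, Primary} ->
            Ref = erlang:monitor(process, Primary),
            backup(Ref, 0)
    end.

%% Record checkpoints until the primary goes down, then take over.
%% Client requests that arrive early stay in the mailbox until then.
backup(Ref, Last) ->
    receive
        {checkpoint, Total} -> backup(Ref, Total);
        {'DOWN', Ref, process, _Pid, _Reason} -> serve(Last)
    end.

serve(Total) ->
    receive
        {add, X}      -> serve(Total + X);
        {total, From} -> From ! {total, self(), Total}, serve(Total)
    end.

demo() ->
    {P, B} = start(),
    add(P, 5),
    add(P, 7),
    12 = total(P),   % also confirms both adds were processed
    exit(P, kill),
    12 = total(B),   % the standby answers from the last checkpoint
    add(B, 1),
    13 = total(B),
    exit(B, kill),
    ok.
```

Anything the primary had received but not yet checkpointed is lost in this scheme, which is why the sketch only promises recovery from the last checkpoint.
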
>>
>> Also, I seem to be under the impression that the Erlang language has
>> some sort of built-in semantics to do this (i.e. deal with one process
>> taking over from another if the first fails) and this is the reason
>> processes share nothing. This seems to me to be something different
>> from supervision trees, which use exit-trapping to re-spawn if a
>> process fails with the active 'job' disappearing and any errors logged
>> (like restarting a daemon). My interpretation of the fault-tolerance
>> Erlang is supposed to enable (for those in the know) is seamless
>> take-over. The 'job' lives on but elsewhere.
>>
>> Using telecoms as an example: a phone call wouldn't be cut-off when a
>> fault occurs, another node would seamlessly take over. This is how I
>> interpreted Joe's post and other descriptions of Erlang's
>> fault-tolerant features and I understand the key is in the
>> share-nothing policy for processes. I'm sure I've misunderstood something or
>> everything :)
>>
>> - Edmond -
>>
>> PS: I've read the Manning draft. Great book. I don't know if the answer
>> lies in OTP (I searched and didn't find it). I suspect it's lower --
>> probably how you organise your processes. Some distributed-programming
>> black-magic only Erlangers know about :)
>>
>>
>> On Mon, 03 Jan 2011 14:56:51 +1100, Alain O'Dea <alain.odea@REDACTED>
>> wrote:
>>
>>> On 2011-01-02, at 22:36, "Edmond Begumisa"
>>> <ebegumisa@REDACTED> wrote:
>>>
>>>> Slight correction...
>>>>
>>>> On Mon, 03 Jan 2011 12:38:38 +1100, Edmond Begumisa
>>>> <ebegumisa@REDACTED> wrote:
>>>>
>>>>> Hello all,
>>>>>
>>>>> I've been trying to wrap my Erlang's fault tolerant features
>>>>> particularly in relation to processes.
>>>>>
>>>>
>>>> Should be: I've been trying to wrap my head around Erlang's fault
>>>> tolerant features particularly in relation to processes.
>>>>
>>>> Sorry.
>>>>
>>>>> I've heard/read repeatedly that the primary reason why Erlang's
>>>>> designers opted for a share-nothing policy is not rooted in
>>>>> concurrency but rather in fault-tolerance. When nothing is shared,
>>>>> everything is copied. When everything is copied processes can take
>>>>> over from one another when things fail. I follow this reasoning but
>>>>> I don't follow how to apply it.
>>>>>
>>>>> I fully understand and appreciate how supervision trees are used to
>>>>> restart processes if they fail. What I don't get is what to do when
>>>>> you don't want to restart but want to take over, say on another
>>>>> node. I know that at a higher level, OTP has some
>>>>> take-over/fail-over semantics (at the application level). I'm
>>>>> trying to understand things at the process level - why Erlang is
>>>>> the way it is so I can better use it to make my currently
>>>>> fault-intolerant program fault tolerant.
>>>>>
>>>>> Specifically, how can one process take over from another if it
>>>>> fails? It appears to me that the only way to do this would be to
>>>>> somehow retrieve not only the state of the process (say,
>>>>> gen_server's state) but also the messages in its mailbox. Where does
>>>>> the design decision to share-nothing for the sake of fault-tolerance
>>>>> come into play for processes? Please help me "get" this!
>>>>>
>>>>> Thanks in advance.
>>>>>
>>>>> - Edmond -
>>>>>
>>>>>
>>>
>>> Hi Edmond:
>>>
>>> Share-nothing helps with concurrent fault-tolerance by preventing one
>>> process from corrupting the state of another. Receive is a process'
>>> choice and it corrupts its own state if it receives bad data and lets
>>> it in.
>>>
>>> AFAIK OTP fault-tolerance doesn't mean no requests will fail, it means
>>> the system/sub-system will recover if a single request causes a
>>> process to crash. It's kind of like proper try/catch recovery applied
>>> to concurrent code. How you recover from the crash depends on the
>>> supervision strategy chosen. In some cases the supervisor can pass
>>> the state to the replacement process. In others this isn't necessary
>>> or even desirable since the state itself may involve resources lost in
>>> the crash or corrupted state that led to the crash.
>>>
>>> I am straying outside my knowledge here so this paragraph is
>>> guesswork. The message queue for a gen_server need not necessarily be
>>> lost when the callback module crashes. In theory OTP could (and might
>>> already) simply delegate the messages to the replacement process
>>> following a crash. Someone who knows OTP better than me would need to
>>> weigh in here though.
>>>
>>> I found http://manning.com/logan very informative in understanding OTP
>>> and its supervisor hierarchies.
>>>
>>> Cheers,
>>> Alain
>>> ________________________________________________________________
>>> erlang-questions (at) erlang.org mailing list.
>>> See http://www.erlang.org/faq.html
>>> To unsubscribe; mailto:erlang-questions-unsubscribe@REDACTED
>>>
>>
>>
>
>