[erlang-questions] Processes & Fault Tolerance
Ryan Zezeski
rzezeski@REDACTED
Wed Jan 12 05:03:12 CET 2011
This thread is pure GOLD. I know that all this stuff is covered
in various places but seeing it compressed in this thread was really
profound...for me anyways. I'm going to keep this in my pocket for the
presentation I plan to give later in the year.
-Ryan
On Thu, Jan 6, 2011 at 1:32 AM, Edmond Begumisa <ebegumisa@REDACTED
> wrote:
> UPDATE:
>
> I'm excited! I've just added hot take-over to my server program (for
> specific jobs). I've watched one node take-over the handling of a request
> from another as its network connection died. The client didn't notice. Very
> eerie to witness, and quite impressive how easy it was to add that
> functionality if like me, you've come from other languages and imagined the
> pain involved in doing this.
>
> Using Joe's illustration, I was able to do this without the functionality
> of OTP supervisors, application take-over/fail-over, etc (even though my
> application is an OTP application.) Just using redundant gen_server worker
> processes watching each-other and updating each-other on state changes + new
> requests.
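
A minimal sketch of what such a pair of redundant gen_server workers might look like. It assumes monitors (erlang:monitor/2) as the "watching" mechanism, and the module name, the active/standby roles and the message shapes are all hypothetical; the actual program may be organised quite differently.

```erlang
-module(redundant_worker).
-behaviour(gen_server).

-export([start/1, watch/2, put_state/2, info/1, demo/0]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

%% Role is active | standby (hypothetical labels for this sketch).
start(Role)          -> gen_server:start(?MODULE, Role, []).
watch(Pid, Peer)     -> gen_server:call(Pid, {watch, Peer}).
put_state(Pid, Data) -> gen_server:call(Pid, {put, Data}).
info(Pid)            -> gen_server:call(Pid, info).

init(Role) ->
    {ok, #{role => Role, peer => undefined, data => undefined}}.

%% Start watching the other worker.
handle_call({watch, Peer}, _From, St) ->
    erlang:monitor(process, Peer),
    {reply, ok, St#{peer := Peer}};
%% Only the active worker accepts changes; it forwards each one to its peer.
%% (A put sent to a standby crashes it with function_clause in this sketch.)
handle_call({put, Data}, _From, St = #{role := active, peer := Peer}) ->
    gen_server:cast(Peer, {replicate, Data}),
    {reply, ok, St#{data := Data}};
handle_call(info, _From, St = #{role := Role, data := Data}) ->
    {reply, {Role, Data}, St}.

%% The standby just remembers the latest replicated state.
handle_cast({replicate, Data}, St) ->
    {noreply, St#{data := Data}}.

%% The watched peer died: a standby promotes itself and carries on
%% with the last state it was sent.
handle_info({'DOWN', _Ref, process, Peer, _Reason},
            St = #{peer := Peer, role := standby}) ->
    {noreply, St#{role := active, peer := undefined}};
handle_info(_Other, St) ->
    {noreply, St}.

%% Poll until the given worker reports itself active, then return its data.
wait_active(Pid) ->
    case info(Pid) of
        {active, Data} -> Data;
        _ -> timer:sleep(10), wait_active(Pid)
    end.

demo() ->
    {ok, A} = start(active),
    {ok, B} = start(standby),
    ok = watch(A, B),
    ok = watch(B, A),
    ok = put_state(A, job_42),
    exit(A, kill),              % simulate the active worker dying
    wait_active(B).             % B has taken over and still holds job_42
```

Both workers are started with gen_server:start/3 rather than start_link/3 so that killing one does not take the caller down with it. Across nodes the same code applies, since a monitor on a remote pid also fires when the connection to that node is lost.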
>
> To summarise the important lessons I've learned from all who helped: In
> regards to processes, fault-tolerance and the share-nothing principle...
>
> * We can say: Share nothing because we want to replicate valid state which
> leads to one form of fault tolerance (a failed job can be done elsewhere).
> * Or we can say: Share nothing because we want to isolate invalid state
> which leads to another form of fault tolerance (a failed job won't affect the
> rest of the system).
> * For job hot take-over + redundancy, one should think in terms of the
> former (you need two or more nodes for this).
> * For worker supervision + restarts, one should think in terms of the
> latter (you can use one or more nodes for this).
>
> Supervisors monitor workers. Workers do jobs. In case of faults: workers
> can be restarted but also **jobs can be taken over by one redundant worker
> from another**. This is powerful, Erlang is pretty cool :)
>
> I'm gonna go marvel as I plug and unplug my network cables now :P
>
> Thanks again to all.
>
> - Edmond -
>
>
>
> On Thu, 06 Jan 2011 02:17:39 +1100, Edmond Begumisa <
> ebegumisa@REDACTED> wrote:
>
> On Tue, 04 Jan 2011 20:52:50 +1100, Joe Armstrong <erlang@REDACTED>
>> wrote:
>>
>> On Mon, Jan 3, 2011 at 5:30 PM, Edmond Begumisa
>>> <ebegumisa@REDACTED> wrote:
>>>
>>>> Let's start with how we do error recovery.
>>>>>
>>>>> Imagine two linked processes A and B on separate machines. A is the
>>>>> master process.
>>>>> B is a passive process that will take over if A fails.
>>>>>
>>>>> A sends a stream of state update messages S1, S2, S3,... to B. The
>>>>> state messages contain enough information
>>>>> for B to do whatever A was doing should A fail. If A fails B will
>>>>> receive an EXIT signal.
>>>>>
>>>>> If B does receive an exit signal it knows A has failed, so it carries
>>>>> on doing whatever A was doing
>>>>> using the information it last received in a state update message.
>>>>>
>>>>> That's it and all about it - nothing to do with supervisors etc,
>>>>>
>>>>
>>>> Ooooohh! THANK YOU, THANK YOU, THANK YOU.
>>>>
>>>
>>> Thanks.
>>>
>>> Another problem in understanding is that we assume the meaning of words.
>>>
>>>
>> Yes. There was some confusion about whether we were all talking about
>> replicating valid state or isolating invalid state. Which I guess
>> "fault-tolerance" could be interpreted to mean either. But I'm no expert on
>> the subject so I shouldn't comment on terminology -- I just knew the feature
>> I wanted and was pretty sure it could be done at the process level, I just
>> had no idea how! From now on, I'll stick to the term "redundancy" or Ulf's
>> "hot take-over" so it's clearer what I mean.
>>
>> - Edmond -
>>
>>
>> Once upon a time I was talking to a guy and used the word "fault
>>> tolerance"
>>> we talked for hours and thought we understood what this word meant. Then
>>> he said something that implied he was building a fault-tolerant system
>>> on one computer.
>>> I said it was impossible. He said, "you catch and handle all the
>>> exceptions .."
>>>
>>> I said, "The entire computer might crash"
>>>
>>> He said, "Oh"
>>>
>>> There was a long silence.
>>>
>>> I told him about Erlang ...
>>>
>>>
>>> /Joe
>>>
>>>
>>>> *Now* I get it. This is precisely what I was trying to understand. The
>>>> missing link was sending redundant messages. It all makes sense now --
>>>> it's
>>>> so simple. All I have to do to get fault tolerance at the process level
>>>> is
>>>> have a group of N redundant processes waiting for exit-signals and
>>>> forward
>>>> state changes to N-1 of them. I could even send to N/2 and have some
>>>> decent
>>>> tolerance at the expense of consistency of the nodes.
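
A rough sketch of the "forward state changes to the other replicas" idea, assuming plain processes and hypothetical message shapes ({state_update, S} and {get, From}). It only shows the replication half; deciding which surviving replica takes over is a separate problem and is left out here.

```erlang
-module(replica_sketch).
-export([demo/0]).

%% A replica remembers the most recent state update it has seen.
replica(Last) ->
    receive
        {state_update, S} ->
            replica(S);
        {get, From} ->
            From ! {last_state, self(), Last},
            replica(Last)
    end.

%% Forward one state change to every redundant process.
broadcast(Replicas, S) ->
    lists:foreach(fun(R) -> R ! {state_update, S} end, Replicas).

demo() ->
    Replicas = [spawn(fun() -> replica(undefined) end) || _ <- lists:seq(1, 3)],
    lists:foreach(fun(S) -> broadcast(Replicas, S) end, [s1, s2, s3]),
    %% Messages between one pair of processes arrive in sending order,
    %% so each replica has handled s1..s3 before it sees our get request.
    States = [begin
                  R ! {get, self()},
                  receive {last_state, R, L} -> L after 1000 -> timeout end
              end || R <- Replicas],
    [exit(R, kill) || R <- Replicas],
    States.
```

Sending to only some of the replicas, as suggested above, would amount to passing a sublist to broadcast/2; the replicas left out would then hold stale state.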
>>>>
>>>> To preemptively answer the question I sent Ulf and Mazen... for that
>>>> distributed database, Process B would be set up to also receive read
>>>> requests forwarded from Process A but not respond to them unless it gets
>>>> an
>>>> exit signal it can understand like disk_fail_error. Any changes in state
>>>> would also be forwarded to existing redundant processes.
>>>>
>>>> Thanks everyone for your patience in helping me understand
>>>> fault-tolerance.
>>>> I've already started thinking about some cool things I can do with this
>>>> in
>>>> my program.
>>>>
>>>> - Edmond -
>>>>
>>>>
>>>>
>>>> On Tue, 04 Jan 2011 00:36:30 +1100, Joe Armstrong <erlang@REDACTED>
>>>> wrote:
>>>>
>>>> I think you are getting confused between OTP supervisors etc. and the
>>>>> general notion of "take-over"
>>>>>
>>>>> Let's start with how we do error recovery.
>>>>>
>>>>> Imagine two linked processes A and B on separate machines. A is the
>>>>> master process.
>>>>> B is a passive process that will take over if A fails.
>>>>>
>>>>> A sends a stream of state update messages S1, S2, S3,... to B. The
>>>>> state messages contain enough information
>>>>> for B to do whatever A was doing should A fail. If A fails B will
>>>>> receive an EXIT signal.
>>>>>
>>>>> If B does receive an exit signal it knows A has failed, so it carries
>>>>> on doing whatever A was doing
>>>>> using the information it last received in a state update message.
>>>>>
>>>>> That's it and all about it - nothing to do with supervisors etc,
>>>>> <aside> This is how we do things in a distributed system, obviously
>>>>> real fault tolerance needs more than one machine to cover the case
>>>>> where an entire machine fails. This is also why we copy everything, if
>>>>> an entire machine has failed and we have a pointer to vital data on
>>>>> that machine then we are in big trouble.
>>>>>
>>>>> If it works like this in the distributed case then we reasoned it
>>>>> should work exactly the same way when both processes are on the same
>>>>> machine. Why is this? - because it would be very difficult to program
>>>>> if the primitives and mechanisms involved were completely different
>>>>> when the remote process lay on the same machine as the process it was
>>>>> talking to or on a different machine.
>>>>>
>>>>> It might be that we actually handle these two cases differently in the
>>>>> VM, but as far as the programmer is concerned the system should behave
>>>>> the same way, and the only observable change should be
>>>>> concerned with latencies.
>>>>> </aside>
>>>>>
>>>>> To make the above possible we need a couple of primitives in Erlang,
>>>>> namely process_flag(trap_exit, true) and spawn_link. That's all you need.
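
The A/B scheme described above can be sketched with just those two primitives. This is a minimal illustration, not anyone's production code: the module name, the message shapes and the simulated crash are assumptions, and both processes run on one node here. On separate machines one would use a spawn_link variant that takes a node name, with the rest unchanged.

```erlang
-module(takeover_sketch).
-export([demo/0]).

%% B: the passive process. It traps exits, links to A, and remembers the
%% last state update it received.
standby(Parent) ->
    process_flag(trap_exit, true),
    Self = self(),
    Primary = spawn_link(fun() -> primary(Self, 1) end),
    standby_loop(Parent, Primary, undefined).

standby_loop(Parent, Primary, Last) ->
    receive
        {state_update, Primary, S} ->
            standby_loop(Parent, Primary, S);
        {'EXIT', Primary, Reason} ->
            %% A has failed: carry on from the last state we were sent.
            %% Here "carrying on" is reduced to reporting what we know.
            Parent ! {took_over, self(), Last, Reason}
    end.

%% A: the master process. It sends state updates S1, S2, S3 to B and
%% then crashes (simulated with exit/1).
primary(Standby, N) when N =< 3 ->
    Standby ! {state_update, self(), N},
    primary(Standby, N + 1);
primary(_Standby, _N) ->
    exit(simulated_crash).

demo() ->
    Parent = self(),
    Standby = spawn(fun() -> standby(Parent) end),
    receive
        {took_over, Standby, LastState, Reason} -> {LastState, Reason}
    after 5000 ->
        timeout
    end.
```

Because signals between a pair of processes are delivered in the order they were sent, B sees all three state updates before the EXIT signal, so takeover_sketch:demo() returns {3, simulated_crash}.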
>>>>>
>>>>> The basic system properties you need to make a fault-tolerant system
>>>>> are the abilities to detect remote errors and send and receive
>>>>> messages.
>>>>>
>>>>> What the OTP supervisors etc. do is build layers of abstraction on
>>>>> top of spawn_link and trap_exit
>>>>> in order to simplify programming common use-cases.
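
To make that layering concrete, here is a deliberately tiny hand-rolled "supervisor" built from nothing but spawn_link and trap_exit. It is a sketch under stated assumptions: the module name and the restart rule (restart on any abnormal exit, stop on a normal one) are invented for illustration, and it has none of the restart-intensity limits or child specifications of the real OTP supervisor.

```erlang
-module(mini_sup).
-export([start/1, demo/0]).

%% Start a process that keeps one worker (a zero-arity fun) running.
start(WorkerFun) ->
    spawn(fun() ->
        process_flag(trap_exit, true),
        supervise(WorkerFun, spawn_link(WorkerFun))
    end).

supervise(WorkerFun, Pid) ->
    receive
        {'EXIT', Pid, normal} ->
            ok;                                      % worker finished: stop
        {'EXIT', Pid, _Reason} ->
            %% worker crashed: start a fresh one in its place
            supervise(WorkerFun, spawn_link(WorkerFun))
    end.

demo() ->
    Parent = self(),
    Worker = fun() ->
        Parent ! {started, self()},
        receive
            crash -> exit(boom);
            stop  -> ok
        end
    end,
    _Sup = start(Worker),
    P1 = receive {started, A} -> A after 1000 -> exit(timeout) end,
    P1 ! crash,                                      % make the first worker fail
    P2 = receive {started, B} -> B after 1000 -> exit(timeout) end,
    P2 ! stop,                                       % let the second one end normally
    P1 =/= P2.                                       % a new process replaced the old one
```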
>>>>>
>>>>> Fault detection and recovery is not arbitrary or magic in any sense.
>>>>> It has to be part of the system architecture like everything else. One
>>>>> common fault in making systems is to specify and design for the normal
>>>>> behavior
>>>>> but not specify and design how fault-detection and correction work.
>>>>> This is probably due to a legacy of
>>>>> programming in languages that have very limited ability to detect
>>>>> errors and recover from them.
>>>>>
>>>>> A C programmer with a single thread of control will take extreme
>>>>> measures to avoid crashing their program;
>>>>> they have to, there is only one thread of control. Given multiple
>>>>> threads of control and the ability to remotely
>>>>> detect errors your entire way of thinking changes and the emphasis
>>>>> changes from thinking "how do I avoid a crash"
>>>>> to thinking "given that a crash has occurred and been detected, how do
>>>>> I fix things up and carry on".
>>>>>
>>>>> The Erlang philosophy is that "things will crash", so we have to
>>>>> detect the crashes, fix the bit that broke and
>>>>> carry on - we do not make excessive efforts to avoid crashes, but we
>>>>> do try very hard to repair things after they have gone wrong.
>>>>>
>>>>> Fixing things after they have broken is often easier than preventing
>>>>> them from breaking. Preventing them
>>>>> from breaking is not theoretically possible - we can't even prove
>>>>> simple things, like that a program will terminate
>>>>> (the famous halting problem) - but we can easily detect failures and
>>>>> put the system to rights by applying
>>>>> certain invariants.
>>>>>
>>>>> OTP supervisors are just trees of processes with certain restart rules
>>>>> that experience has shown us
>>>>> to be useful.
>>>>>
>>>>> The main value of supervisors etc. (and of all the OTP behaviors) is
>>>>> in building large systems with many
>>>>> programmers involved. Rather than inventing their own patterns, the
>>>>> programmers reuse a common set of
>>>>> patterns, which makes the projects easier to manage.
>>>>>
>>>>> /Joe
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Mon, Jan 3, 2011 at 6:21 AM, Edmond Begumisa
>>>>> <ebegumisa@REDACTED> wrote:
>>>>>
>>>>>>
>>>>>> Thanks for your response.
>>>>>>
>>>>>> Firstly, let me make my question a little clearer...
>>>>>>
>>>>>> To rephrase: For processes, "share nothing for the sake of
>>>>>> concurrency" -
>>>>>> I
>>>>>> get, both in concept and application. "Share nothing for the sake of
>>>>>> fault-tolerance" - I get in concept but not in application.
>>>>>>
>>>>>> Yet as I understand it, it is for the latter reason Erlang shares
>>>>>> nothing*
>>>>>> and not the former. Interpretation: I must be completely missing the
>>>>>> point
>>>>>> in regards to Erlang processes and sharing nothing. This is what I
>>>>>> want
>>>>>> to
>>>>>> understand in application. In addition to the "side" effect of sane
>>>>>> concurrency (which coming from a chaotic multi-threading shared-memory
>>>>>> world
>>>>>> I fully appreciate and practically make use of everyday), how can I
>>>>>> also
>>>>>> make use of the "real" reason Erlang processes share nothing -- fault
>>>>>> tolerance? Practically/illustratively speaking?
>>>>>>
>>>>>> *ETS being the obvious exception.
>>>>>>
>>>>>> Secondly, here's a mantra from Joe Armstrong...
>>>>>>
>>>>>> @ minute 17:26
>>>>>> http://www.se-radio.net/2008/03/episode-89-joe-armstrong-on-erlang/
>>>>>>
>>>>>> "[message passing concurrency]... the original reasons have to do with
>>>>>> fault
>>>>>> tolerance... you have to copy all the data you need from computer 1 to
>>>>>> computer 2... if computer 1 crashes you take over on computer 2... you
>>>>>> can't
>>>>>> have dangling pointers... that's the reason for copying everything...
>>>>>> it's
>>>>>> got nothing to do with concurrency, it's got a lot to do with
>>>>>> fault-tolerance... if they don't crash you could just have a dangling
>>>>>> pointer and copy less data but it won't work in the presence of
>>>>>> errors..."
>>>>>>
>>>>>> I interpret this to mean that share-nothing between processes is more
>>>>>> about
>>>>>> replicating valid state than isolating corrupted state as you
>>>>>> described.
>>>>>>
>>>>>> Indeed, Joe created an example on his blog...
>>>>>>
>>>>>>
>>>>>>
>>>>>> http://armstrongonsoftware.blogspot.com/2007/07/scalable-fault-tolerant-upgradable.html
>>>>>>
>>>>>> It's algorithm 3 there I'm struggling with. Particularly where he
>>>>>> says...
>>>>>>
>>>>>> "... In practise we would send an asynchronous stream of messages from
>>>>>> N
>>>>>> to
>>>>>> N+1 containing enough information to recover if things go wrong."
>>>>>>
>>>>>> Unfortunately, I couldn't find part II to that post (I don't think
>>>>>> there
>>>>>> is
>>>>>> one.) And I'm too green and inexperienced in the field of
>>>>>> fault-tolerant
>>>>>> systems to figure it out on my own. I'm having trouble visualising the
>>>>>> practical here from the conceptual -- I need to be shown how :(
>>>>>>
>>>>>> Also, I seem to be under the impression that the Erlang language has
>>>>>> some
>>>>>> sort of schematics to do this built-in (i.e. deal with one process
>>>>>> taking
>>>>>> over from another if the first fails) and this is the reason processes
>>>>>> share
>>>>>> nothing. This seems to me to be something different from supervision
>>>>>> trees,
>>>>>> which use exit-trapping to re-spawn if a process fails with the active
>>>>>> 'job'
>>>>>> disappearing and any errors logged (like restarting a daemon). My
>>>>>> interpretation of the fault-tolerance Erlang is supposed to enable
>>>>>> (for
>>>>>> those in the know) is seamless take-over. The 'job' lives on but
>>>>>> elsewhere.
>>>>>>
>>>>>> Using telecoms as an example: a phone call wouldn't be cut-off when a
>>>>>> fault
>>>>>> occurs, another node would seamlessly take over. This is how I
>>>>>> interpreted
>>>>>> Joe's post and other descriptions of Erlang's fault-tolerant features
>>>>>> and
>>>>>> I
>>>>>> understand the key is in the share nothing policy for processes. I'm
>>>>>> sure
>>>>>> I've mis-understood something or everything :)
>>>>>>
>>>>>> - Edmond -
>>>>>>
>>>>>> PS: I've read the Manning draft. Great book. I don't know if the
>>>>>> answer
>>>>>> lies
>>>>>> in OTP (I searched and didn't find it). I suspect it's lower --
>>>>>> probably
>>>>>> how
>>>>>> you organise your processes. Some distributed-programming black-magic
>>>>>> only
>>>>>> Erlangers know about :)
>>>>>>
>>>>>>
>>>>>> On Mon, 03 Jan 2011 14:56:51 +1100, Alain O'Dea <alain.odea@REDACTED
>>>>>> >
>>>>>> wrote:
>>>>>>
>>>>>> On 2011-01-02, at 22:36, "Edmond Begumisa" <
>>>>>>> ebegumisa@REDACTED>
>>>>>>> wrote:
>>>>>>>
>>>>>>> Slight correction...
>>>>>>>>
>>>>>>>> On Mon, 03 Jan 2011 12:38:38 +1100, Edmond Begumisa
>>>>>>>> <ebegumisa@REDACTED> wrote:
>>>>>>>>
>>>>>>>> Hello all,
>>>>>>>>>
>>>>>>>>> I've been trying to wrap my Erlang's fault tolerant features
>>>>>>>>> particularly in relation to processes.
>>>>>>>>>
>>>>>>>>>
>>>>>>>> Should be: I've been trying to wrap my head around Erlang's fault
>>>>>>>> tolerant features particularly in relation to processes.
>>>>>>>>
>>>>>>>> Sorry.
>>>>>>>>
>>>>>>>> I've heard/read repeatedly that the primary reason why Erlang's
>>>>>>>>> designers opted for a share-nothing policy is not rooted in
>>>>>>>>> concurrency but
>>>>>>>>> rather in fault-tolerance. When nothing is shared, everything is
>>>>>>>>> copied.
>>>>>>>>> When everything is copied processes can take over from one another
>>>>>>>>> when
>>>>>>>>> things fail. I follow this reasoning but I don't follow how to
>>>>>>>>> apply
>>>>>>>>> it.
>>>>>>>>>
>>>>>>>>> I fully understand and appreciate how supervision trees are used to
>>>>>>>>> restart processes if they fail. What I don't get is what to do when
>>>>>>>>> you
>>>>>>>>> don't want to restart but want to take over, say on another node. I
>>>>>>>>> know
>>>>>>>>> that at a higher-level, OTP has some take-over/fail-over schematics
>>>>>>>>> (at the
>>>>>>>>> application level.) I'm trying to understand things at the
>>>>>>>>> processes
>>>>>>>>> level -
>>>>>>>>> why Erlang is the way it is so I can better use it to make my
>>>>>>>>> currently
>>>>>>>>> fault-intolerant program fault tolerant.
>>>>>>>>>
>>>>>>>>> Specifically, how can one process take over from another if it
>>>>>>>>> fails?
>>>>>>>>> It
>>>>>>>>> appears to me that the only way to do this would be to somehow
>>>>>>>>> retrieve not
>>>>>>>>> only the state of the process (say, gen_server's state) but also
>>>>>>>>> the
>>>>>>>>> messages in its mailbox. Where does the design decision to
>>>>>>>>> share-nothing for
>>>>>>>>> the sake of fault-tolerance come into play for processes? Please
>>>>>>>>> help
>>>>>>>>> me
>>>>>>>>> "get" this!
>>>>>>>>>
>>>>>>>>> Thanks in advance.
>>>>>>>>>
>>>>>>>>> - Edmond -
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>> Hi Edmond:
>>>>>>>
>>>>>>> Share-nothing helps with concurrent fault-tolerance by preventing one
>>>>>>> process from corrupting the state of another. Receive is a process'
>>>>>>> choice
>>>>>>> and it corrupts its own state if it receives bad data and lets it in.
>>>>>>>
>>>>>>> AFAIK OTP fault-tolerance doesn't mean no requests will fail, it
>>>>>>> means
>>>>>>> the
>>>>>>> system/sub-system will recover if a single request causes a process
>>>>>>> to
>>>>>>> crash. It's kind of like proper try/catch recovery applied to
>>>>>>> concurrent
>>>>>>> code. How you recover from the crash depends on the supervision
>>>>>>> strategy
>>>>>>> chosen. In some cases the supervisor can pass the state to the
>>>>>>> replacement
>>>>>>> process. In others this isn't necessary or even desirable since the
>>>>>>> state
>>>>>>> itself may involve resources lost in the crash or corrupted state
>>>>>>> that
>>>>>>> led
>>>>>>> to the crash.
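
For reference, a supervision strategy is chosen in the supervisor's init/1 callback. The sketch below assumes a hypothetical worker module job_worker that exports start_link/0; the strategy and limits are example values only. Note that a restarted child is started again through the start function in its child specification, so it begins from whatever that function sets up rather than from the crashed process's state.

```erlang
-module(job_sup).
-behaviour(supervisor).

-export([start_link/0, init/1]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

init([]) ->
    %% one_for_one: restart only the child that crashed.
    %% Give up if more than 5 restarts happen within 10 seconds.
    RestartStrategy = {one_for_one, 5, 10},
    %% job_worker is a hypothetical module exporting start_link/0.
    Worker = {job_worker,                        % child id
              {job_worker, start_link, []},      % how to (re)start it
              permanent,                         % always restart
              5000,                              % shutdown timeout in ms
              worker,
              [job_worker]},
    {ok, {RestartStrategy, [Worker]}}.
```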
>>>>>>>
>>>>>>> I am straying outside my knowledge here so this paragraph is
>>>>>>> guesswork.
>>>>>>> The message queue for a gen_server need not necessarily be lost when
>>>>>>> the
>>>>>>> callback module crashes. In theory OTP could (and might already)
>>>>>>> simply
>>>>>>> delegate the messages to the replacement process following a crash.
>>>>>>> Someone
>>>>>>> who knows OTP better than me would need to weigh in here though.
>>>>>>>
>>>>>>> I found http://manning.com/logan very informative in understanding
>>>>>>> OTP
>>>>>>> and
>>>>>>> its supervisor hierarchies.
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Alain
>>>>>>> ________________________________________________________________
>>>>>>> erlang-questions (at) erlang.org mailing list.
>>>>>>> See http://www.erlang.org/faq.html
>>>>>>> To unsubscribe; mailto:erlang-questions-unsubscribe@REDACTED
>>>>>>>
>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Using Opera's revolutionary e-mail client: http://www.opera.com/mail/