[erlang-questions] erlang-questions Digest, Vol 171, Issue 9

Youngkin, Rich richard.youngkin@REDACTED
Fri Jun 27 16:08:05 CEST 2014

Fred Hebert's blog post "It's About the Guarantees" has a good discussion
of why it's not always about "restart & pray".



> Message: 20
> Date: Thu, 26 Jun 2014 21:40:31 +0200
> From: Loïc Hoguin <essen@REDACTED>
> To: Raoul Duke <raould@REDACTED>,  erlang-questions
>         <erlang-questions@REDACTED>
> Subject: Re: [erlang-questions] obviously no bugs? (Re: Alternative
>         supervision approaches)
> On 06/26/2014 09:10 PM, Raoul Duke wrote:
> >> I was referring to something formally based on an FSM and detecting the
> >> current environment and state of the node before restarting the children
> >> and not just using timeouts. Not sure if that was clear based on your
> >> response.
> >
> > whatever the mechanism, the point to me is that i get the impression
> > that:
> >
> > a) it failed at all in the first place and b) it seems like it is not
> > always an expected failure and c) the "fix" is to restart & pray and
> > then if the prayers don't work then wait a little longer or do some
> > as-yet-unspecified random jiggery-pokery of the 'environment' and
> > start praying again.
> a) Even statically checked things fail. The program may run out of
> memory. The network might go down. Network might be slow or bumpy.
> Hardware may fail. Knowing that your program works on paper is nice, but
> has its limits.
> b) An infinite number of things may fail. The only thing you can expect
> is that something *will* fail. You can't prevent that. You can only make
> sure that when something does fail, you recover as gracefully as possible.
> c) I'm not sure where you got the "wait a little longer" part, because
> neither Erlang nor OTP waits. The point is to restart processes in a
> consistent state, and then if we can't do that (for example we depend on
> an ets table managed above in the tree), restart a larger group of
> processes. The VM crashing is when something *really bad* happened and
> you typically want a dirty human to look at it, but many odd failures
> aren't worth spending much time over.
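
A minimal sketch of the restart behaviour described in (c), assuming a
hypothetical module cache_sup that supervises an ets table owner and a
reader that depends on that table. None of the names below come from the
thread, and the map-based supervisor flags assume OTP 18 or later. With
rest_for_one, a crash of the reader restarts only the reader; a crash of
the owner also restarts the reader, so it never runs against a missing
table. If more than 3 restarts happen within 10 seconds, the supervisor
itself gives up and its own parent gets to restart the larger group.

```erlang
-module(cache_sup).
-behaviour(supervisor).

-export([start_link/0, init/1]).
-export([start_owner/0, owner_init/1, start_reader/0, reader_init/1]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

init([]) ->
    %% rest_for_one: when a child dies, it and every child started after
    %% it are restarted. The owner is listed first, so the reader always
    %% starts against a freshly created table.
    SupFlags = #{strategy => rest_for_one, intensity => 3, period => 10},
    Children = [
        #{id => table_owner,  start => {?MODULE, start_owner, []}},
        #{id => table_reader, start => {?MODULE, start_reader, []}}
    ],
    {ok, {SupFlags, Children}}.

%% Hypothetical table owner: creates the named table, then idles.
start_owner() ->
    proc_lib:start_link(?MODULE, owner_init, [self()]).

owner_init(Parent) ->
    cache_table = ets:new(cache_table, [named_table, protected]),
    %% Acknowledge only once the table exists, so the supervisor does
    %% not start the reader too early.
    proc_lib:init_ack(Parent, {ok, self()}),
    idle().

%% Hypothetical reader: refuses to start unless the table is there.
start_reader() ->
    proc_lib:start_link(?MODULE, reader_init, [self()]).

reader_init(Parent) ->
    %% ets:lookup/2 raises badarg if the table does not exist.
    [] = ets:lookup(cache_table, no_such_key),
    proc_lib:init_ack(Parent, {ok, self()}),
    idle().

idle() ->
    receive _ -> idle() end.
```

After cache_sup:start_link(), calling supervisor:which_children(cache_sup)
before and after killing the owner with exit(Pid, kill) should show new
pids for both children. Nothing here waits or retries on a timer: a
restart either succeeds immediately or counts against the intensity limit.
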
> > i just wonder how many other people have similar pie-in-the-sky
> > day-dream wishes or if the standard groupthink is, "eh, whatever,
> > restarts will be enough to get us shipping and maybe making a profit!"
> It's about solving a real world problem, which is that programs fail.
> > on the other flipper, if one were programming with CSP or some such
> > then in theory you'd "simply" run FDR over your code and get answers
> > about deadlock for "free". vs. doing something in TLA and then
> > maintaining it vs. the "real" code.
> Dialyzer has some race condition checks nowadays. Concuerror can help
> you find many race conditions and deadlocks too.
> But neither of those can anticipate all possible cases that make a program
> fail. They are very nice to have because they allow you to increase the
> quality of your codebase, but they do not solve the problem of programs
> failing.
> You can think of and test for many failure cases but you can never cover
> them all. You still need supervision and restarts regardless of how well
> your codebase is tested.
> Stop dreaming and start building real things, I say.
> --
> Loïc Hoguin
> http://ninenines.eu