[erlang-questions] obviously no bugs? (Re: Alternative supervision approaches)
Loïc Hoguin
essen@REDACTED
Thu Jun 26 21:40:31 CEST 2014
On 06/26/2014 09:10 PM, Raoul Duke wrote:
>> I was referring to something formally based on an FSM and detecting the
>> current environment and state of the node before restarting the children
>> and not just using timeouts. Not sure if that was clear based on your
>> response.
>
> whatever the mechanism, the point to me is that i get the impression that:
>
> a) it failed at all in the first place and b) it seems like it is not
> always an expected failure and c) the "fix" is to restart & pray and
> then if the prayers don't work then wait a little longer or do some
> as-yet-unspecified random jiggery-pokery of the 'environment' and
> start praying again.
a) Even statically checked things fail. The program may run out of
memory. The network may go down, or be slow or unreliable.
Hardware may fail. Knowing that your program works on paper is nice, but
it has its limits.
b) An infinite number of things may fail. The only thing you can expect
is that something *will* fail. You can't prevent that. You can only make
sure that when something does fail, you recover as gracefully as possible.
c) I'm not sure where you got the "wait a little longer" part, because
neither Erlang nor OTP waits. The point is to restart processes in a
consistent state, and then if we can't do that (for example we depend on
an ets table managed above in the tree), restart a larger group of
processes. The VM crashing is when something *really bad* happened and
you typically want a dirty human to look at it, but many odd failures
aren't worth spending much time over.
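To make the escalation concrete, here is a minimal sketch of a supervisor that groups a table owner with the worker that depends on it. The module names cache_sup, table_owner and cache_worker are hypothetical, the restart limits are arbitrary, and the map-based flags and child specs assume OTP 18 or later. It shows one possible layout, not a recommendation for any particular application.

```erlang
-module(cache_sup).
-behaviour(supervisor).

-export([start_link/0]).
-export([init/1]).

%% Start the supervisor and register it locally under the module name.
start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

init([]) ->
    %% rest_for_one: when a child terminates, that child and every child
    %% started after it are restarted. If more than 3 restarts happen
    %% within 10 seconds, this supervisor terminates its children and
    %% then itself, leaving the decision to its own parent.
    SupFlags = #{strategy => rest_for_one,
                 intensity => 3,
                 period => 10},
    %% Hypothetical process that creates and owns the ets table.
    TableOwner = #{id => table_owner,
                   start => {table_owner, start_link, []}},
    %% Hypothetical worker that reads from and writes to that table.
    Worker = #{id => cache_worker,
               start => {cache_worker, start_link, []}},
    %% Order matters: the table owner is started before the worker.
    {ok, {SupFlags, [TableOwner, Worker]}}.
```

With this layout a crash in cache_worker restarts only the worker, while a crash in table_owner restarts both. If the worker keeps crashing past the limit, the supervisor gives up and its parent restarts the whole group, table owner included, which is the "larger group of processes" case.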
> i just wonder how many other people have similar pie-in-the-sky
> day-dream wishes or if the standard groupthink is, "eh, whatever,
> restarts will be enough to get us shipping and maybe making a profit!"
It's about solving a real world problem, which is that programs fail.
> on the other flipper, if one were programming with CSP or some such
> then in theory you'd "simply" run FDR over your code and get answers
> about deadlock for "free". vs. doing something in TLA and then
> maintaining it vs. the "real" code.
Dialyzer has some race condition checks nowadays. Concuerror can help
you find many race conditions and deadlocks too.
But neither of those can anticipate all possible cases that make a program
fail. They are very nice to have because they allow you to increase the
quality of your codebase, but they do not solve the problem of programs
failing.
You can think of and test for many failure cases but you can never cover
them all. You still need supervision and restarts regardless of how well
your codebase is tested.
Stop dreaming and start building real things, I say.
--
Loïc Hoguin
http://ninenines.eu