[erlang-questions] obviously no bugs? (Re: Alternative supervision approaches)

Thu Jun 26 21:40:31 CEST 2014

On 06/26/2014 09:10 PM, Raoul Duke wrote:
>> I was referring to something formally based on an FSM and detecting the
>> current
>> environment and state of the node before restarting the children and not
>> just using
>> timeouts. Not sure if that was clear based on your response.
>
> whatever the mechanism, the point to me is that i get the impression that:
>
> a) it failed at all in the first place and b) it seems like it is not
> always an expected failure and c) the "fix" is to restart & pray and
> then if the prayers don't work then wait a little longer or do some
> as-yet-unspecified random jiggery-pokery of the 'environment' and
> start praying again.

a) Even statically checked things fail. The program may run out of 
memory. The network might go down, or it might be slow or bumpy. 
Hardware may fail. Knowing that your program works on paper is nice, but 
it has its limits.

b) An infinite number of things may fail. The only thing you can expect 
is that something *will* fail. You can't prevent that. You can only make 
sure that when something does fail, you recover as gracefully as possible.

c) I'm not sure where you got the "wait a little longer" part, because 
neither Erlang nor OTP waits. The point is to restart processes in a 
consistent state, and then if we can't do that (for example we depend on 
an ets table managed above in the tree), restart a larger group of 
processes. The VM crashing is when something *really bad* happened and 
you typically want a dirty human to look at it, but many odd failures 
aren't worth spending much time over.
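
To make the escalation concrete, here is a minimal sketch, not anything 
from a real project. The module name restart_sketch, the table name 
sketch_table and the child ids owner and user are all made up for 
illustration, and the map-based supervisor flags assume OTP 18 or later. 
One child owns an ets table, and a second child that depends on it is 
restarted along with it.

```erlang
-module(restart_sketch).
-behaviour(supervisor).

-export([start_link/0, init/1]).
-export([start_owner/0, owner_init/0, start_user/0, user_init/0]).

%% Start the (hypothetical) supervisor under a local registered name.
start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

init([]) ->
    %% rest_for_one: when a child dies, it and every child started
    %% after it are restarted. More than 3 restarts within 10 seconds
    %% make this supervisor give up and exit, which leaves it to its
    %% own parent to restart a larger group of processes.
    SupFlags = #{strategy => rest_for_one, intensity => 3, period => 10},
    Children = [
        #{id => owner, start => {?MODULE, start_owner, []}},
        #{id => user,  start => {?MODULE, start_user, []}}
    ],
    {ok, {SupFlags, Children}}.

%% Owns the ets table. The table is deleted when this process dies.
start_owner() ->
    proc_lib:start_link(?MODULE, owner_init, []).

owner_init() ->
    sketch_table = ets:new(sketch_table, [named_table, public]),
    %% Acknowledge only after the table exists, so the next child
    %% is never started before it.
    proc_lib:init_ack({ok, self()}),
    receive stop -> ok end.

%% Depends on the table, so it is listed after the owner.
start_user() ->
    proc_lib:start_link(?MODULE, user_init, []).

user_init() ->
    true = ets:insert(sketch_table, {started_by, self()}),
    proc_lib:init_ack({ok, self()}),
    receive stop -> ok end.
```

After restart_sketch:start_link(), killing the owner (its pid can be 
found with supervisor:which_children(restart_sketch)) should bring both 
children back with a fresh table, with no timers involved. The choice of 
rest_for_one here is just one option; one_for_all would do the same job 
for a group where every child depends on every other.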

> i just wonder how many other people have similar pie-in-the-sky
> day-dream wishes or if the standard groupthink is, "eh, whatever,
> restarts will be enough to get us shipping and maybe making a profit!"

It's about solving a real world problem, which is that programs fail.

> on the other flipper, if one were programming with CSP or some such
> then in theory you'd "simply" run FDR over your code and get answers
> about deadlock for "free". vs. doing something in TLA and then
> maintaining it vs. the "real" code.

Dialyzer has some race condition checks nowadays. Concuerror can help 
you find many race conditions and deadlocks too.

But neither of those can anticipate all possible cases that make a 
program fail. They are very nice to have because they allow you to 
increase the quality of your codebase, but they do not solve the problem 
of programs failing.

You can think of and test for many failure cases but you can never cover 
them all. You still need supervision and restarts regardless of how well 
your codebase is tested.

Stop dreaming and start building real things, I say.

-- 
Loïc Hoguin
http://ninenines.eu