[erlang-questions] System limit bringing down rex and the VM

Fri Sep 10 09:33:27 CEST 2010

On 09/09/2010 07:33 PM, Musumeci, Antonio S wrote:
> 
> I'm seeing mnesia, rex and timer_server in my dump. If you
> kill timer_server though it restarts.

Actually, I consider this a bug.

Let's see what actually happens when timer_server is killed.

Eshell V5.7.5  (abort with ^G)
1> F = fun() ->
         timer:send_after(15000, self(), hello),
         receive
             Msg ->
                 io:fwrite("got ~p~n", [Msg])
         end
       end.
#Fun<erl_eval.20.67289768>
2> f(P), P = spawn(F), time().
{9,25,48}
got hello
3> time().
{9,26,6}
4> whereis(timer_server).
<0.38.0>
5> f(P), P = spawn(F), time().
{9,26,22}
6> exit(whereis(timer_server),kill).
true
7> whereis(timer_server).
<0.43.0>
8> time().
{9,27,0}
9> process_info(P).
[{current_function,{erl_eval,receive_clauses,6}},
 {initial_call,{erlang,apply,2}},
 {status,waiting},
 {message_queue_len,0},
 {messages,[]},
 ...

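So killing timer_server caused it to bounce back, but in the process,
it forgot all outstanding requests. P was spawned at 9:26:22 and should
have received hello 15 seconds later; well after 9:27:00 it is still
waiting with an empty message queue. Any processes depending on the
reliable service of the timer server are now left hanging, with no
indication whatsoever that something went wrong.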
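
For illustration only, here is a minimal sketch of how a client could
at least notice the failure, assuming (as in the session above) that
timer:send_after/3 is served by the registered timer_server process.
The module name timer_guard and the function wait_for_timer/1 are
hypothetical names made up for this example; this is not something the
timer module does for you.

```erlang
-module(timer_guard).
-export([wait_for_timer/1]).

%% Hypothetical helper: ask the timer server for a 'hello' message after
%% Millis milliseconds, and monitor timer_server while waiting, so that
%% its death shows up as an error instead of a silent hang.
wait_for_timer(Millis) ->
    %% Make sure timer_server is registered before monitoring it;
    %% monitoring an unregistered name yields an immediate 'DOWN'.
    ok = timer:start(),
    MRef = erlang:monitor(process, timer_server),
    {ok, _TRef} = timer:send_after(Millis, self(), hello),
    receive
        hello ->
            %% Timer fired normally; drop the monitor and any queued 'DOWN'.
            erlang:demonitor(MRef, [flush]),
            {ok, hello};
        {'DOWN', MRef, process, _Name, Reason} ->
            %% The server died while we were waiting. On a release where
            %% the request lived in that server, it is assumed to be lost.
            {error, {timer_server_down, Reason}}
    end.
```

Calling timer_guard:wait_for_timer(15000) and then killing timer_server
from another process should return {error, {timer_server_down, killed}}
rather than blocking forever. Note that this only makes the failure
visible to the caller; it does nothing to recover the lost request, and
there is still a small window between timer:start() and the monitor call.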

Personally, I think it would be much better if the timer server in
fact stayed dead and brought the whole node down with it - that, or
made sure that its dying and restarting is truly transparent. Choosing
a middle way of merely pretending to be robust is the worst possible
choice.

Rather than concluding that the OTP team are incompetent in matters
of robustness (as there is overwhelming evidence that they are
anything but), I'd like to see this as yet another example of how
desperately difficult and dangerous it is to go down the path you're
suggesting. It may seem like a respectful thing to do, but you take
on a very heavy burden, and may well be far more likely to compound
the problem than to help.

BR,
Ulf W