[erlang-questions] System limit bringing down rex and the VM

Fri Sep 10 12:27:08 CEST 2010

If it's not documented, it's a bug; otherwise it's just bad behavior. If
you can do something about it, it's a feature.

Agreed: a middle way is bad unless it is clearly documented and,
preferably, there is some reasonable way to handle it. Perhaps an ets
table using the heir option, with a custom primary supervisor as the
heir: it would take ownership of the data while it restarted the server
and then transfer the table back?
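A minimal sketch of the heir idea, assuming a single worker that keeps its outstanding requests in one ets table. For brevity the calling process stands in for the "custom primary supervisor"; the module name ets_heir_sketch, the table name pending_requests and the message tags are hypothetical. The ets calls themselves are standard: ets:new/2 with {heir, Pid, HeirData}, and ets:give_away/3. When the owner dies, the heir is sent {'ETS-TRANSFER', Tab, FromPid, HeirData}, and the table (with its contents) survives.

```erlang
-module(ets_heir_sketch).   %% hypothetical module name
-export([run/0]).

%% Kills the table owner, lets the heir keep the data, then hands the
%% table to a freshly started owner. Returns the value that survived.
run() ->
    Self = self(),  %% this process plays the hypothetical "supervisor"/heir
    Owner = spawn(fun() -> owner_init(Self) end),
    receive {table_ready, Owner} -> ok end,
    MRef = erlang:monitor(process, Owner),
    exit(Owner, kill),
    receive {'DOWN', MRef, process, Owner, killed} -> ok end,
    %% The heir now owns the table; its contents are intact.
    Tab = receive {'ETS-TRANSFER', T, Owner, owner_died} -> T end,
    %% "Restart" the worker and transfer the table back to it.
    NewOwner = spawn(fun() -> owner_resume(Self) end),
    true = ets:give_away(Tab, NewOwner, resumed),
    receive {value, NewOwner, V} -> V end.

%% First incarnation: creates the table with Heir as heir and
%% records one outstanding request before it is killed.
owner_init(Heir) ->
    Tab = ets:new(pending_requests, [set, protected, {heir, Heir, owner_died}]),
    true = ets:insert(Tab, {req1, hello}),
    Heir ! {table_ready, self()},
    receive stop -> ok end.

%% Restarted incarnation: waits for the table and reads back the request.
owner_resume(Heir) ->
    receive
        {'ETS-TRANSFER', Tab, _From, resumed} ->
            [{req1, V}] = ets:lookup(Tab, req1),
            Heir ! {value, self(), V}
    end.
```

In the shell, c(ets_heir_sketch), ets_heir_sketch:run(). should return hello, the entry written before the owner was killed. A real version would also have to decide what happens if the heir itself dies, which is exactly the kind of burden discussed below.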

Looking over the timer code, it wraps the spawn in a catch, so the
spawn itself shouldn't have been the reason for its crash. I'll have to
look at it more closely to see what happened in my case.

On Fri, 10 Sep 2010 09:33:27 +0200
Ulf Wiger <ulf.wiger@REDACTED> wrote:

> On 09/09/2010 07:33 PM, Musumeci, Antonio S wrote:
> > 
> > I'm seeing mnesia, rex and timer_server in my dump. If you
> > kill timer_server, though, it restarts.
> 
> Actually, I consider this a bug.
> 
> Let's check what happens when timer_server is killed.
> 
> Eshell V5.7.5  (abort with ^G)
> 1> F = fun() ->
>          timer:send_after(15000,self(),hello),
>          receive
>             Msg ->
>                io:fwrite("got ~p~n", [Msg])
>             end
>         end.
> #Fun<erl_eval.20.67289768>
> 2> f(P), P = spawn(F), time().
> {9,25,48}
> got hello
> 3> time().
> {9,26,6}
> 4> whereis(timer_server).
> <0.38.0>
> 5> f(P), P = spawn(F), time().
> {9,26,22}
> 6> exit(whereis(timer_server),kill).
> true
> 7> whereis(timer_server).
> <0.43.0>
> 8> time().
> {9,27,0}
> 9> process_info(P).
> [{current_function,{erl_eval,receive_clauses,6}},
>  {initial_call,{erlang,apply,2}},
>  {status,waiting},
>  {message_queue_len,0},
>  {messages,[]},
>  ...
> 
> So killing timer_server caused it to bounce back, but in the process
> it forgot all outstanding requests. P was spawned at 9:26:22 and is
> still waiting at 9:27:00, well past its 15-second timer, so any
> processes depending on the reliable service of the timer server are
> now left hanging, with no indication whatsoever that something went
> wrong.
> 
> Personally, I think it would be much better if the timer server would
> in fact stay dead and bring the whole node down with it; either that,
> or make sure that its dying and restarting is truly transparent.
> Choosing a middle way of merely pretending to be robust is the worst
> possible choice.
> 
> Rather than concluding that the OTP team are incompetent in matters
> of robustness (as there is overwhelming evidence that they are
> anything but), I'd like to see this as yet another example of how
> desperately difficult and dangerous it is to go down the path you're
> suggesting. It may seem like a respectful thing to do, but you take
> on a very heavy burden, and may well be much more likely to compound
> the problem than to help it.
> 
> BR,
> Ulf W