[erlang-questions] System limit bringing down rex and the VM

Thu Sep 9 14:37:36 CEST 2010

So rex failing and bringing down the VM is a purposeful design decision? Aren't there cleaner ways to do this? Why not a watchdog that pays attention to resources and halts the VM when a limit is hit? Why isn't it built into the system itself? This may be valid behavior for your situation, but it's not for mine. I don't want my system to just collapse with no chance to do something about it when it's possible to do so. And as I recall, if you start Erlang in minimal mode and manually start rex, it is not part of the primary supervision tree and can therefore be killed. Why doesn't it then force the VM to fail?
________________________________
From: Chandru [mailto:chandrashekhar.mullaparthi@REDACTED]
Sent: Thursday, September 09, 2010 6:16 AM
To: bile@REDACTED
Cc: Musumeci, Antonio S (Enterprise Infrastructure); erlang-questions@REDACTED
Subject: Re: [erlang-questions] System limit bringing down rex and the VM

On 8 September 2010 23:34, <bile@REDACTED> wrote:

How many other limits cause the platform to shit the bed? I suspect
few. There is a massive difference between the entire platform
collapsing and RPC not working / restarting. If spawning of processes
is so fundamental why do the core processes fit into the process
limit? Linux as mentioned before prevents userland from wrecking things
by safeguarding some amount of RAM for itself. Couldn't BEAM do the
same? Why doesn't it auto shutdown if the limit is hit? Why no warnings
from the system? Why is it triggered by rex, a minor component of the
base process tree?

As Ulf pointed out, BEAM does throw an exception. It is indeed the rex process which decides to give up. What I meant to say was that almost all applications written in Erlang assume they are operating within the system limits. In the case of rex, it decides to die if it can't spawn a process. rex is only active if you are using distributed Erlang. That is a design decision, and it is a valid one, at least for those of us who use it a lot in real-world situations.

There is no reason to take control away from the developer. Especially
when it means the entire platform will collapse from underneath them
for something entirely controllable.

It is a trade-off. You have complete control over what happens in the system when you program in C. That doesn't necessarily mean it is the best choice.

> The error message you see about mnesia_recover is the "effect", not
> the "cause".

The mnesia_recover rpc call is the catalyst for the failure. It's
issuing the RPC command. The error is caused by the rpc failure.

You are wrong. In fact, if you look closely at it, mnesia is trying to make an rpc call when it was trying to dump core, which means it was already dying at that point. You need to dig deeper to find the real cause.

Chandru