[erlang-questions] System limit bringing down rex and the VM

Thu Sep 9 00:34:52 CEST 2010

> This is the way erlang has been designed. If it hits a system limit,
> it crashes. Simply because it cannot cope. By not crashing, and
> returning error codes, you get defensive programming which makes
> programs harder to read/write/test. As far as BEAM is concerned, the
> ability to spawn a process is a fundamental requirement. If that
> cannot be met, all bets are off. I suspect the resource exhaustion is
> being caused by your own code, and a bit of overload control, coupled
> with configuration of BEAM as pointed out by others should solve the
> problem.

How many other limits cause the platform to shit the bed? I suspect
few. There is a massive difference between the entire platform
collapsing and RPC not working / restarting. If spawning of processes
is so fundamental, why do the core processes count against the process
limit? Linux, as mentioned before, prevents userland from wrecking things
by safeguarding some amount of RAM for itself. Couldn't BEAM do the
same? Why doesn't it auto shutdown if the limit is hit? Why no warnings
from the system? Why is it triggered by rex, a minor component of the
base process tree?

Those who defend this behavior are not consistent. The behavior of the
core processes is not consistent either. Just look at the code.

There is no reason to take control away from the developer. Especially
when it means the entire platform will collapse from underneath them
for something entirely controllable.

> 
> > I shouldn't have to build my own spawn wrapper to keep track of the
> > number of processes. The VM already does this. Besides, this
> > problem couldn't be fully addressed that way.
> 
> 
> You don't have to. I suspect you need to do some sort of load
> regulation in your system.

Load regulation? My system is designed to support arbitrary process
creation. I was maxing out the processes as a scale test. If for some
reason it can't spawn new processes then I want control over what
happens next. The behavior of rex and the supervisor takes that control
from me. At best I can poll the process count and warn that the system
will soon fail, but I am powerless to do anything about it. The failure
of a non-essential component of the system should not cause the VM to
fail, just as a bad process in an OS should not cause the OS to halt.
Could one of you please explain to me how that analogy is incorrect?
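
To be concrete about what I can and cannot do from my side, here is a
minimal sketch. The module name proc_watch, the function names and the
threshold value are my own for illustration and are not part of OTP; the
only real calls are erlang:system_info/1 and spawn/1. It polls the
process table usage and catches the system_limit error when spawning
from my own code.

```erlang
-module(proc_watch).
-export([usage/0, check/1, try_spawn/1]).

%% Fraction of the process table currently in use (0.0 to 1.0).
usage() ->
    erlang:system_info(process_count) / erlang:system_info(process_limit).

%% Returns ok, or {warning, Usage} once usage reaches Threshold.
%% Threshold is a hypothetical tuning value, e.g. 0.9.
check(Threshold) when is_float(Threshold) ->
    case usage() of
        U when U >= Threshold -> {warning, U};
        _ -> ok
    end.

%% Spawn Fun, turning the system_limit error into a return value
%% so the caller decides what happens next.
try_spawn(Fun) when is_function(Fun, 0) ->
    try spawn(Fun) of
        Pid -> {ok, Pid}
    catch
        error:system_limit -> {error, system_limit}
    end.
```

This only covers spawns made by my own code. It does nothing for a
spawn that fails inside rex or a supervisor, which is exactly the case
I am complaining about.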

> 
> > This is caused by mnesia reacting to some event causing
> > mnesia_recover to send a rpc. I'm not in control of that.
> >
> 
> The error message you see about mnesia_recover is the "effect", not
> the "cause".

The mnesia_recover RPC call is the catalyst for the failure. It is the
one issuing the RPC. The error is caused by the RPC failure.