[erlang-questions] System limit bringing down rex and the VM

Thu Sep 9 19:33:55 CEST 2010

To me, {badrpc,_} means that the RPC failed, and the second element says why. Not being able to spawn a process is one such reason.
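
As a minimal sketch of what I mean by treating the second element as the reason: the module and function names below (rpc_result, call/4) are made up for illustration, and only rpc:call/4 and its {badrpc, Reason} return shape come from OTP. Which reasons rex actually reports is the point under discussion, so the sketch does not assume any particular one.

```erlang
-module(rpc_result).
-export([call/4]).

%% Hypothetical wrapper: turn the result of rpc:call/4 into a tagged tuple.
%% Reason is passed through untouched, whatever rpc reports.
call(Node, M, F, A) ->
    case rpc:call(Node, M, F, A) of
        {badrpc, Reason} -> {error, Reason};   % the RPC failed; Reason says why
        Result           -> {ok, Result}       % the remote function's own return value
    end.
```

A caller can then match on {error, Reason} and decide per reason whether to retry, fail over, or give up.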

It's obvious that other services can't handle process exhaustion either. I'm seeing mnesia, rex and timer_server in my dump. If you kill timer_server, though, it restarts. If I kill mnesia, it's just dead. But mnesia is prevented from dealing with the failure of rex because of its behavior, and I'm unable to deal with the possible failure of mnesia for the same reason. They are separate (but in this case connected) issues and should be addressed separately. Obviously I could set the limit arbitrarily high, but then what's the point of the limit?

Again, I don't see what the problem is with giving developers more control. If you want the VM to die when the process limit is reached, you can still arrange that easily. Telling people just to raise limits whitewashes the problem and hides possible reliability improvements.

-----Original Message-----
From: Ulf Wiger [mailto:ulf.wiger@REDACTED] 
Sent: Thursday, September 09, 2010 11:46 AM
To: Musumeci, Antonio S (Enterprise Infrastructure)
Cc: erlang-questions@REDACTED
Subject: Re: [erlang-questions] System limit bringing down rex and the VM

On 09/09/2010 02:15 PM, Musumeci, Antonio S wrote:
> I understand your points completely... however, there is certainly a 
> difference from having an erlang process die and allowing its peers 
> to handle the cleanup and having the erlang vm die. The vm has no peer 
> in that way.

It does have a peer if you are running with redundancy, and Erlang/OTP was primarily designed for systems where redundancy is pretty much a given.

> Yes Mnesia needs RPC... but so do a lot of things and if the pattern 
> is to be followed that you die and allow the peers to respond... 
> that's not what is happening here. Rex dies and brings the world down 
> with it. Mnesia is unable to respond to the issue. The mnesia code 
> shows that it is prepared for {badrpc,_} errors.

Yes, but 'badrpc' means "I was not able to communicate with the other node" - not "I wasn't even able to try". Mnesia is quite prepared to handle the case that other nodes disappear.
In this case, a system resource has been exhausted, and as Chandru pointed out, the crash came as mnesia was trying to create a core dump, which means your system was going down anyway.

This is fairly typical if you exhaust a system limit. Even if you could theoretically write code to handle it, most of the libraries you likely want to use have not been designed that way, so something is going to break.

If it were only rex, it is easy enough to write your own RPC library that behaves differently. But mnesia is not prepared to cope with not being able to spawn a process, or create an ETS table, which is another system limit that can bring down mnesia.
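
A rough sketch of the "behaves differently" idea, under stated assumptions: the module safe_rpc and the function local_apply/3 are hypothetical names, and this covers only the spawning step on the local node, not a full RPC library. It relies on the spawn BIFs raising a system_limit error when the process table is full, and reports that as a {badrpc, system_limit} value instead of letting the caller crash.

```erlang
-module(safe_rpc).
-export([local_apply/3]).

%% Hypothetical helper: run M:F(A...) in a fresh monitored process.
%% A failed spawn is returned as {badrpc, system_limit} rather than
%% crashing the calling process.
local_apply(M, F, A) ->
    Parent = self(),
    Ref = make_ref(),
    try spawn_monitor(fun() -> Parent ! {Ref, apply(M, F, A)} end) of
        {Pid, MRef} ->
            receive
                {Ref, Result} ->
                    %% Worker replied; drop the monitor and any pending 'DOWN'.
                    erlang:demonitor(MRef, [flush]),
                    Result;
                {'DOWN', MRef, process, Pid, Reason} ->
                    %% Worker died before replying.
                    {badrpc, {'EXIT', Reason}}
            end
    catch
        error:system_limit ->
            %% No process could be created: the limit has been reached.
            {badrpc, system_limit}
    end.
```

This only moves the failure into a return value. Whatever calls it (mnesia included) would still have to be written to do something sensible with that value.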

You call it a completely arbitrary limit, but it is no more arbitrary than the number of open file descriptors you may have.
You are not forced to accept the default limit, and just like with the file descriptor limit, you probably couldn't for any system of significant size. Try setting the process limit to a suitably high number.
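
For illustration, the limit is set at startup with the +P emulator flag (for example erl +P 1000000, where the number is an arbitrary choice), and the running node can report how close it is to the limit. The module name proc_headroom and the function headroom/0 below are assumptions made for the sketch; the two erlang:system_info/1 items are standard.

```erlang
-module(proc_headroom).
-export([headroom/0]).

%% Hypothetical helper: report how many processes exist, the configured
%% limit, and how many more could still be created.
headroom() ->
    Limit = erlang:system_info(process_limit),   % the limit in effect (set with +P)
    Count = erlang:system_info(process_count),   % processes currently alive
    {Count, Limit, Limit - Count}.
```

Polling something like this from a monitoring process is one way to notice a process leak before the limit is actually reached.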

BR,
Ulf W