[erlang-questions] System limit bringing down rex and the VM

Thu Sep 9 11:14:33 CEST 2010

On 09/09/2010 12:34 AM, bile@REDACTED wrote:
> 
> How many other limits cause the platform to shit the bed? I suspect
> few.

Actually, hitting the system limits themselves will not cause
the VM to crash. OOM is the only exception I can think of right now.

Trying to spawn a process when the max number of processes has
been reached will simply raise an exception. However, some (most)
application code will not include logic to cope with the situation where
you can't spawn a process, so most likely, applications will come
crashing down when this happens. Which one goes first is mainly
up to chance.
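
To make that concrete, here is a minimal sketch of what coping could
look like. The module and function names (safe_spawn, try_spawn) are
made up for illustration; the only thing taken from the runtime is
that spawn/1 fails with a system_limit error when the process table
is full.

```erlang
-module(safe_spawn).
-export([try_spawn/1]).

%% Hypothetical helper: try to spawn Fun and return a tagged tuple
%% instead of letting the caller crash when the process limit has
%% been reached.
try_spawn(Fun) when is_function(Fun, 0) ->
    try spawn(Fun) of
        Pid -> {ok, Pid}
    catch
        %% spawn/1 raises error:system_limit when no more processes
        %% can be created.
        error:system_limit -> {error, system_limit}
    end.
```

The caller then gets to decide what {error, system_limit} means for
it: reject the request, queue it, or crash on purpose.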

Erlang is a concurrency-oriented language. The spawn() function
is about as fundamental as new() in OO languages. In most other
languages, you treat the spawning of processes as something
scary that you don't want to do unless you absolutely have to.
Also, the limits are usually pretty low. In Erlang, you can
raise the limit to > 200 million processes, if you have enough
memory for it. The default limit is fairly low (32,768) mainly
for historical reasons, but also because it makes sense to
keep memory footprint low by default, and 32K processes is
plenty enough for most uses.
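
For reference, the limit is set when the VM starts (erl +P Number),
and both the limit and the current count can be read at runtime. A
small sketch, assuming a hypothetical module name proc_limits:

```erlang
-module(proc_limits).
-export([headroom/0]).

%% Returns {Count, Limit, Free}, where Free is how many more
%% processes could be created right now. The limit itself is fixed
%% at startup, e.g. "erl +P 1000000".
headroom() ->
    Limit = erlang:system_info(process_limit),
    Count = erlang:system_info(process_count),
    {Count, Limit, Limit - Count}.
```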

> Those who defend this behavior are not consistent. The behavior
> of the core processes are not consistent. Just look at the code.

It would be better if you mention specific instances rather than
asking people to "just look at the code". OTOH, we can probably
stipulate that the code base is inconsistent in many ways....

In general, application code does not cope with system limits
being exhausted. This is in line with Erlang's "let it crash"
philosophy as well as the fact that if you want true robustness
in the kind of products Erlang was designed for, you have to
have a redundant setup anyway.

The thing about redundancy is that it works best if the failing
side fails quickly and distinctly, rather than trying in vain
to correct the problem locally. This is the essence of "fail-fast"
programming. Although this is not 100% consistently implemented
in Erlang/OTP either, you might want to keep in mind that Erlang
has been breaking new ground in this respect, and most of the
people who've worked on OTP components over the years were
originally steeped in the same programming mindset as everyone
else. ;-)

>>> I shouldn't have to build my own spawn wrapper to keep track of the
>>> number of processes. The VM already does this. Besides, this
>>> problem couldn't be fully addressed that way.
>>
>> You don't have to. I suspect you need to do some sort of load
>> regulation in your system.
> 
> Load regulation? My system is designed to support arbitrary process
> creation. I was maxing out the processes as a scale test. If for some
> reason it can't spawn new processes then I want control over what
> happens next. Rex and the supervisor's behavior takes that control
> from me. At best I can poll the process count and warn that the system
> will soon fail but am powerless to do anything about it.

With any programming language or operating environment, you have
the responsibility as a developer to understand and respect the
fundamental assumptions made when developing the environment.
If your requirements don't match well enough, it might be better
to find another language/environment that fits your problem
better.

You seem to be saying that you shouldn't have to worry about
the user of your application throwing more work at you than the
system is capable of handling? This may be a valid requirement in
some domains, but Erlang is fundamentally a language for developing
messaging systems, which have to cope with overload situations
(including Denial-of-Service attacks) in a structured way. When
subjected to a DoS attack, you typically don't just want to accept
the challenge and likely die honorably as a result. The
normal way to handle it is to push back, or shed load, so that it
doesn't overwhelm the core components in your system.
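
One crude way to shed load at the point where work enters the
system is sketched below. This is only an illustration, not how any
OTP component does it; the module name (shed), the function
maybe_spawn/2 and the MaxUsage threshold are all assumptions of
mine.

```erlang
-module(shed).
-export([maybe_spawn/2]).

%% Hypothetical admission check: only spawn Fun while the process
%% table is less than MaxUsage (a fraction, e.g. 0.8) full.
%% Otherwise refuse the work so the caller can push back.
maybe_spawn(Fun, MaxUsage) when is_function(Fun, 0), is_number(MaxUsage) ->
    Limit = erlang:system_info(process_limit),
    Count = erlang:system_info(process_count),
    case Count < Limit * MaxUsage of
        true ->
            %% The check above is racy, so still guard the spawn.
            try spawn(Fun) of
                Pid -> {ok, Pid}
            catch
                error:system_limit -> {error, overload}
            end;
        false ->
            {error, overload}
    end.
```

Keeping the threshold well below 1.0 leaves headroom for rpc, the
supervisors and other system processes that still need to spawn.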

You might compare this with the recurring discussions about
active vs passive sockets. Sockets in POSIX are by design passive,
as you have to explicitly read data from the buffer, but in Erlang,
the default is that the VM empties the buffer and delivers the data
to the socket owner asynchronously. I think this was the wrong
default, and the recommendation is to use passive, or {active,once}
to avoid being swamped by input from the network. This is in line
with the idea of not accepting more work than you can cope with.
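
A typical {active,once} receive loop looks roughly like this. The
module name (once_loop) and the Handler fun are placeholders; the
socket is assumed to be a connected gen_tcp socket owned by the
calling process.

```erlang
-module(once_loop).
-export([loop/2]).

%% Socket: a connected gen_tcp socket, opened with {active, false}.
%% Handler: hypothetical fun(Data) supplied by the caller.
loop(Socket, Handler) ->
    %% Ask for exactly one message; the socket then goes back to
    %% passive mode until we ask again.
    case inet:setopts(Socket, [{active, once}]) of
        ok ->
            receive
                {tcp, Socket, Data} ->
                    Handler(Data),
                    loop(Socket, Handler);
                {tcp_closed, Socket} ->
                    ok;
                {tcp_error, Socket, Reason} ->
                    {error, Reason}
            end;
        {error, Reason} ->
            {error, Reason}
    end.
```

Since the next packet is only requested after Handler returns, a
slow consumer makes TCP flow control push back on the sender
instead of filling the process mailbox.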

> The failure of
> a non-essential component of the system should not cause the VM to fail
> just like a bad process in an OS should not cause it to halt. Could one
> of you please explain to me how that analogy is incorrect?

RPC is not a non-essential component, especially not to mnesia.
Mnesia assumes that rpc will work, unless something really bad has
happened. The recommended behaviour when something really bad happens
in Erlang is to die and let the rest of the system take over.

BR,
Ulf W