[erlang-questions] Why Beam.smp crashes when memory is over?

Mon Nov 9 07:48:11 CET 2009

On Sun, Nov 08, 2009 at 10:43:25PM -0700, Tony Arcieri wrote:
> I'm not really sure if we're speaking about the same thing then.  Erlang is
> best utilized as a stateless platform, with a "fail early fail often"
> philosophy, and crashing the entire interpreter in the event of some
> underlying problem with the OS fits that entirely.  You seem to be seeking a
> platform that lets you program defensively, handling errors rather than
> crashing and restoring with a clean state.  In that regard, Erlang may not
> be the best language for you to work with, but then again the problem of
> successfully recovering from an out-of-resource condition is an extremely
> difficult one to handle well without scrapping it all and trying to restart
> with a clean state.
>

I agree that Erlang is best used when your application is as stateless
as possible, or has clear transfer of responsibility (HTTP server, SMTP
server, etc), but sometimes Erlang still has a lot of benefits even if
you're lugging some state around, and that's when the current OOM
behaviour is a little annoying. I don't say that the current behaviour
is not the best in a lot of situations, but it'd be nice to have an
alternative in cases where nuking the VM and having to recover or simply
discard a lot of state is unpleasant.

In an ideal world I'd like to see an *optional* OOM behaviour similar to
the following:

On startup, the VM pre-allocates enough memory to be able to attempt OOM
recovery without additional malloc calls. If and when an OOM condition
is detected, freeze the VM, iterate the process list and simply
exit(FatPid, oom) the process with the most memory allocated (obviously
this strategy is full of holes, epecially with regard to off-heap
binaries), but *sometimes* it's better than just blowing away the whole
thing. Then, if the malloc fails again, revert to the current behaviour.

You could, of course try to be smarter about it (look at the process
calling malloc, look at the reference counted binaries, try to avoid
killing SASL or system processes), but the point is that sometimes the
obvious offender is the only one you need to kill to restore a system's
functionality.

Of course, I'm no expert on this, these are merely musings from having
(entirely due to my own fault) triggered plenty of OOM VM kills that
would have been as easily solved if a single process had been killed.

Andrew