[erlang-questions] Why Beam.smp crashes when memory is over?

Ulf Wiger ulf.wiger@REDACTED
Tue Nov 10 14:40:52 CET 2009

Joe Armstrong wrote:
> Just killing processes when they have done nothing wrong is not a good idea.

Well, it's optional, of course.  :)

Imagine, OTOH, a well-tested system where memory characteristics
have been well charted for the foreseeable cases. It might be
defensible to set resource limits so that everything we expect
to see falls well within the limit, and stuff that we don't
expect might trigger the killing of some process. If this is
done on temporary processes, we should be able to accept it
as long as the number of spurious kills is low.
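As a concrete sketch of such a per-process limit: later OTP releases added a max_heap_size process flag that does exactly this kind of killing for you. The size below (in words) and the worker fun are illustrative assumptions, not a recommendation.

```erlang
%% Sketch: spawn a temporary worker whose heap may not grow past
%% ~10 M words. If it does, the VM kills the process (and logs it)
%% instead of letting it drag the whole node down.
%% do_work/0 is a hypothetical worker function.
Pid = spawn_opt(fun() -> do_work() end,
                [{max_heap_size,
                  #{size => 10 * 1024 * 1024,  %% limit in words, example figure
                    kill => true,              %% kill on breach rather than warn
                    error_logger => true}}]).  %% log the kill for post-mortem
```

For well-charted temporary processes, a breach of a generous limit like this is precisely the "stuff we don't expect" case above, so the spurious-kill rate should stay low.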

This is not much stranger than things that we do routinely in
other cases:

- If dets or disk_log notice that a file hasn't been properly
   closed, it 'repairs' the file - that is, it repairs the index.
   Corrupt objects are simply discarded, not repaired.

- Replication in the AXD 301 and similar products was asynchronous
   with a bulking factor. Some failure cases could lead to dropped
   calls, but as long as they were few, it was acceptable.

- Some complex state machines would bail out for unexpected
   sequences (I showed an example of this in my Structured Network
   Programming talk at EUC). This was a form of "complexity
   overload", and hugely unfair to the poor process running the
   code, as it was probably not a real failure case.

- Mnesia's deadlock prevention algorithm, or indeed any deadlock
   prevention algo, will restart transactions if there is even
   the smallest chance of deadlock. Granted, this should be
   transparent if the transaction fun is well written, but there
   will be false positives, and this will affect performance.
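The "well written" caveat on the last point can be made concrete. A transaction fun that touches nothing outside Mnesia can be aborted and re-run any number of times without harm; the sketch below assumes a hypothetical `account` record and table.

```erlang
%% Sketch: a restart-safe Mnesia transaction. Because the fun has no
%% side effects outside Mnesia, the lock manager can abort and re-run
%% it transparently whenever it suspects a deadlock.
-record(account, {id, balance}).

transfer(From, To, Amount) ->
    F = fun() ->
            [#account{balance = B1}] = mnesia:read(account, From, write),
            [#account{balance = B2}] = mnesia:read(account, To, write),
            B1 >= Amount orelse mnesia:abort(insufficient_funds),
            mnesia:write(#account{id = From, balance = B1 - Amount}),
            mnesia:write(#account{id = To, balance = B2 + Amount})
        end,
    mnesia:transaction(F).
```

Put a side effect (a message send, an io:format) inside the fun, and every false-positive restart repeats it - which is where badly written transactions get hurt.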

On the other hand, there can be situations where a rogue process
gobbles up all available memory, rendering the VM unresponsive
for several minutes (e.g. due to the infamous "loss of sharing"),
or cases where a number of unexpectedly large processes "gang up"
and kill the VM in one big memory spike, or a difficult-to-reproduce
bug that sends some application into an infinite retry loop,
rendering the system unusable. In all these cases, killing off
the poor culprits, guilty or not, may well result in a less deadly
disturbance for the system as a whole.
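A crude watchdog along those lines is easy to sketch with stock BIFs. This is a hypothetical example, not an OTP facility: when total memory crosses a limit, it kills the single largest process on the theory that one dead process beats one dead node.

```erlang
%% Sketch: kill the biggest process when total memory exceeds Limit
%% (in bytes). In a real system you would run this periodically,
%% whitelist system processes, and log the victim before killing it.
check(Limit) ->
    case erlang:memory(total) > Limit of
        true ->
            %% Pair each live process with its memory use; processes
            %% that die mid-scan return 'undefined' and are filtered
            %% out by the pattern match.
            Sizes = [{M, P} || P <- erlang:processes(),
                               {memory, M} <- [erlang:process_info(P, memory)]],
            {_Mem, Victim} = lists:max(Sizes),
            exit(Victim, kill);
        false ->
            ok
    end.
```

The victim may well be innocent - exactly the unfairness conceded above - but the disturbance is local and survivable.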

Ulf W
Ulf Wiger
CTO, Erlang Training & Consulting Ltd
