[erlang-questions] Preventing memory crashes

Mon Nov 21 22:35:20 CET 2016

On Mon, Nov 21, 2016 at 7:29 PM Lyn Headley <lheadley@REDACTED> wrote:

> Other thoughts?
>
Many of the suggestions in this thread are good. Let me put in a
counterpoint: Detect the OOM situation, but build your system to eventually
cope with it through a node crash.

First, define the capacity of your system. Once you hit the capacity limit,
don't add more work to the system. Gracefully reject work, and handle the
situation by adding more nodes if you need more capacity. You need to know
the engineering capacity (nominal operation) and peak capacity (when things
start going seriously wrong).
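
To make "gracefully reject work" concrete, here is a minimal sketch of an
admission gate. The module name capacity_gate, its function names, and the
idea of counting in-flight jobs in ETS are my assumptions for illustration;
the limit itself has to come from measuring your own system.

```erlang
-module(capacity_gate).
-export([new/1, try_acquire/1, release/1]).

%% Hypothetical admission gate. Limit is the capacity you measured for
%% this node (for example, the maximum number of concurrent jobs).
new(Limit) when is_integer(Limit), Limit > 0 ->
    Tab = ets:new(capacity_gate, [public, set]),
    true = ets:insert(Tab, [{in_flight, 0}, {limit, Limit}]),
    Tab.

%% Atomically claim a slot. If that pushes us past the limit, give the
%% slot back and reject the work instead of queueing it.
try_acquire(Tab) ->
    [{limit, Limit}] = ets:lookup(Tab, limit),
    case ets:update_counter(Tab, in_flight, 1) of
        N when N =< Limit ->
            ok;
        _TooMany ->
            ets:update_counter(Tab, in_flight, -1),
            {error, overloaded}
    end.

%% Return a slot when a job finishes. The counter is clamped at zero
%% so a stray extra release cannot drive it negative.
release(Tab) ->
    ets:update_counter(Tab, in_flight, {2, -1, 0, 0}),
    ok.
```

A caller would wrap each unit of work in try_acquire/1 and release/1 and
answer the client with a "try again later" on {error, overloaded}.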

Second, the system_monitor, suggested by Taavi Talvik, is usually a good
idea to enable, since you can log whenever a single process uses more than,
say, 5% of the system's memory. Also look into the alarm_handler and
piggyback on set and cleared alarms to warn when things start going
wrong.
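
A rough sketch of both ideas in one module follows. The module name
mem_watch and its function names are made up for illustration. It assumes
you pass in the machine's total memory in bytes yourself (memsup in os_mon
is one possible source) and that sasl is running before you install the
alarm logger. Note that large_heap is checked after garbage collection and
is measured in words, and that a node has only one system monitor process
at a time.

```erlang
-module(mem_watch).
-behaviour(gen_event).

-export([heap_limit_words/2, start_heap_monitor/1, install_alarm_logger/0]).
-export([init/1, handle_event/2, handle_call/2, handle_info/2,
         terminate/2, code_change/3]).

%% Convert "Percent of TotalBytes" into heap words for large_heap.
heap_limit_words(TotalBytes, Percent) ->
    (TotalBytes * Percent) div (100 * erlang:system_info(wordsize)).

%% Spawn a process that logs every process whose heap exceeds 5% of
%% TotalBytes. This replaces any previous system_monitor setting.
start_heap_monitor(TotalBytes) ->
    spawn(fun() ->
                  Limit = heap_limit_words(TotalBytes, 5),
                  erlang:system_monitor(self(), [{large_heap, Limit}]),
                  monitor_loop()
          end).

monitor_loop() ->
    receive
        {monitor, Pid, large_heap, Info} ->
            error_logger:warning_msg("large heap in ~p: ~p~n", [Pid, Info]),
            monitor_loop();
        _Other ->
            monitor_loop()
    end.

%% Piggyback on alarm_handler (owned by sasl) as an extra handler.
install_alarm_logger() ->
    gen_event:add_handler(alarm_handler, ?MODULE, []).

%% gen_event callbacks. The state is the list of active alarm ids.
init([]) ->
    {ok, []}.

handle_event({set_alarm, {Id, _Desc} = Alarm}, Active) ->
    error_logger:warning_msg("alarm set: ~p~n", [Alarm]),
    {ok, [Id | Active]};
handle_event({clear_alarm, Id}, Active) ->
    error_logger:info_msg("alarm cleared: ~p~n", [Id]),
    {ok, lists:delete(Id, Active)};
handle_event(_Other, Active) ->
    {ok, Active}.

handle_call(active, Active) ->
    {ok, Active, Active};
handle_call(_Request, Active) ->
    {ok, {error, unknown_call}, Active}.

handle_info(_Msg, Active) ->
    {ok, Active}.

terminate(_Reason, _Active) ->
    ok.

code_change(_OldVsn, Active, _Extra) ->
    {ok, Active}.
```

If os_mon is started, memsup raises alarms such as
system_memory_high_watermark through this same alarm_handler, so a handler
like the one above would see them without further wiring.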

Here is why: fatal errors are like many boss enemies in computer games -
they telegraph their attacks long before they happen. A fatal error usually
makes itself known at a smaller scale long before the fatal error takes
down the system.

Third, if things start going wrong, chances are you can't gracefully
recover from them. Better to wipe the whole node and let some other node
take over the work. If possible, build your system so that it can start
from a safe invariant state that it periodically stores to disk. Any system
reaching its memory limits is susceptible to a rather fast death through a
SIGKILL anyway.
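
One way to keep such a safe state on disk is to write it to a temporary
file and rename it into place, so a SIGKILL in the middle of a write leaves
the previous checkpoint intact. This is only a sketch: the module name
checkpoint and its functions are hypothetical, Path is assumed to be a
string, and the code does not fsync, so it guards against a killed VM
rather than a power loss.

```erlang
-module(checkpoint).
-export([save/2, load/2]).

%% Write State next to Path, then rename over Path. A rename within
%% one filesystem replaces the old file in a single step on POSIX.
save(Path, State) when is_list(Path) ->
    Tmp = Path ++ ".tmp",
    ok = file:write_file(Tmp, term_to_binary(State)),
    file:rename(Tmp, Path).

%% Read the last checkpoint, or fall back to Default when the file is
%% missing or does not decode as an Erlang term.
load(Path, Default) ->
    case file:read_file(Path) of
        {ok, Bin} ->
            try
                binary_to_term(Bin)
            catch
                error:badarg -> Default
            end;
        {error, _Reason} ->
            Default
    end.
```

A restarted node would call load/2 once at boot and have a timer call
save/2 whenever the state is known to satisfy its invariants.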

The alternative solution is to buy a Turing Machine with infinite tape...