[erlang-questions] all nodes in cluster crashing with eheap_alloc at the same time

Caragea Silviu silviu.cpp@REDACTED
Wed Sep 21 19:50:37 CEST 2016


Hello guys,

We have an Erlang server based on ejabberd (heavily customized to fit our
needs) which worked without any problem for 2 years.
Suddenly, over the last 2 weeks, we had 3 big downtimes when all the nodes
crashed at the same time with:

eheap_alloc: Cannot allocate 18446744063279941840 bytes of memory (of type "heap_frag").

Of course the number of bytes was different each time, but all of them
were over 18 GB.

The servers are running on machines with 24 cores and around 300 GB of
memory. When they crashed, no crash dump was generated, and core dumps
were not enabled on those machines either.

The fact that no crash dump was generated made me believe it might be a
problem in a NIF library, but then I found the following in the ERTS
changelog: "Make sure to create a crash dump when running out of
memory. This was accidentally removed in the erts-7.3 release."

What we did quickly was:

1. Updated to the latest Erlang, 19.0.7
2. Compiled the virtual machine ourselves with SystemTap enabled
3. Enabled core dumps on all boxes
4. Limited all session processes to 50 MB (process_flag(max_heap_size, ...))
and all other processes to 100 MB using erlang:system_flag(max_heap_size,
...), the values being calculated as shown below:

-define(SESSION_MAX_HEAP_SIZE,
        get_env(session_max_heap_size, 10000000) div ?SYSTEM_WORDSIZE).
-define(DEFAULT_MAX_HEAP_SIZE,
        get_env(default_max_heap_size, 10000000) div ?SYSTEM_WORDSIZE).

where:

{default_max_heap_size, 100000000},
{session_max_heap_size, 50000000},
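
For reference, here is a minimal sketch (not our exact code, the function
names are just illustrative) of how these values end up being applied.
?SYSTEM_WORDSIZE is erlang:system_info(wordsize), i.e. 8 on our 64-bit
boxes, so the 50000000 bytes for sessions become the 6250000 words visible
as Max Heap Size in the log entry further down:

%% sketch only: node-wide default limit for all newly spawned processes
set_default_heap_limit() ->
    erlang:system_flag(max_heap_size,
                       #{size => ?DEFAULT_MAX_HEAP_SIZE,
                         kill => true,
                         error_logger => true}).

%% sketch only: tighter limit, called from inside each session process
set_session_heap_limit() ->
    process_flag(max_heap_size,
                 #{size => ?SESSION_MAX_HEAP_SIZE,
                   kill => true,
                   error_logger => true}).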

Looking at the logs, from time to time we can see a lot of entries like:

Process:          <0.19379.617> on node 'prod@REDACTED'
Context:          maximum heap size reached
Max Heap Size:    6250000
Total Heap Size:  30360946
Kill:             true
Error Logger:     true
GC Info:          [{old_heap_block_size,10958},{heap_block_size,15405490},
                   {mbuf_size,14944498},{recent_size,9},{stack_size,16},
                   {old_heap_size,917},{heap_size,9},{bin_vheap_size,4295970},
                   {bin_vheap_block_size,9235836},{bin_old_vheap_size,21},
                   {bin_old_vheap_block_size,46422}]

Observations:

mbuf_size (which corresponds to the message queue) is pretty big, but so
are the bin_vheap numbers, which tells us that the process allocates large
binaries.
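
As a side note, the GC Info numbers above are in words, not bytes; here is
a hypothetical helper, just to put the figures into MB on a 64-bit node:

%% convert a heap size reported in words to megabytes
words_to_mb(Words) ->
    Words * erlang:system_info(wordsize) / (1024 * 1024).

words_to_mb(30360946) gives ~231 MB for the total heap size and
words_to_mb(14944498) gives ~114 MB just for the mbufs.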

The only question I have now is:

How can I include more information in the logs before the process dies,
like the number of messages in the queue?

We also tried to set up a system monitor, triggered well below the limit at
which the process gets killed:

Options = [{long_gc, 10000}, {large_heap, 1000000}, busy_port, busy_dist_port],
erlang:system_monitor(self(), Options),

handle_info({monitor, Pid, Type, Details}, State) ->
    log_system_event({Type, Pid, Details}),
    {noreply, State};

%% on large_heap events, gather the process info and its message queue
%% from a spawned process
log_system_event({large_heap, GcPid, Info}) ->
    LogFun = fun() ->
        case recon:info(GcPid, messages) of
            {messages, Messages} ->
                ?WARNING_MSG("Large heap (~p): ~p~nProcess info: ~p~nProcess state size (words in the heap): ~p~nMessage queue(first 10):~p~n",
                             [GcPid, Info, recon:info(GcPid),
                              erts_debug:size(recon:get_state(GcPid)), Messages]);
            undefined ->
                ?WARNING_MSG("Large heap (~p): ~p~nProcess info is not available",
                             [GcPid, Info])
        end
    end,
    spawn(LogFun);

But unfortunately the processes that have this issue live for less than 4
seconds, and this event is never triggered in time.

Any help is appreciated!

Silviu