<div dir="ltr">Hello guys,<br><br>We have an Erlang server based on ejabberd (heavily modified to fit our needs) which ran without any problems for 2 years.<br>Then, within the last 2 weeks, we had 3 major downtimes in which all the nodes crashed at the same time with:<br>eheap_alloc: Cannot allocate 18446744063279941840 bytes of memory (of type "heap_frag").<br>The exact number of bytes differed each time, but all were of the same order, about 18 EB (the one above is just under 2^64).<br><br>The servers run on machines with 24 cores and around 300 GB of memory. The crashes did not generate<br>any crash dump, and core dumps were not enabled on those machines.<br><br>At first, the missing crash dump made me suspect a problem in a NIF library, but then I found the following in<br>the ERTS changelog: "Make sure to create a crash dump when running out of memory. This was accidentally removed in the erts-7.3 release."<br><br>What we quickly did was:<br><br>1. Updated to the latest Erlang, 19.0.7<br>2. Compiled the virtual machine ourselves with SystemTap enabled<br>3. Enabled core dumps on all boxes<br>4. Limited all session processes to 50 MB (process_flag(max_heap_size, ..)) and all other processes to 100 MB using erlang:system_flag(max_heap_size, ...), with the
values being calculated as:<br><br>-define(SESSION_MAX_HEAP_SIZE, get_env(session_max_heap_size, 10000000) div ?SYSTEM_WORDSIZE).<br>-define(DEFAULT_MAX_HEAP_SIZE, get_env(default_max_heap_size, 10000000) div ?SYSTEM_WORDSIZE).<br><br>where:<br><br>{default_max_heap_size, 100000000},<br>{session_max_heap_size, 50000000},<br><br>Looking at the logs, from time to time we now see many entries like this:<br><br>Process: &lt;0.19379.617&gt; on node 'prod@10.11.1.177' Context: maximum heap size reached Max Heap Size: 6250000 Total Heap Size: 30360946 Kill: true Error Logger: true GC Info: [{old_heap_block_size,10958},{heap_block_size,15405490},{mbuf_size,14944498},{recent_size,9},{stack_size,16},{old_heap_size,917},{heap_size,9},{bin_vheap_size,4295970},{bin_vheap_block_size,9235836},{bin_old_vheap_size,21},{bin_old_vheap_block_size,46422}]<br><br>Observations:<br><br>mbuf_size (which corresponds to the message queue) is quite large, and so are the bin_vheap numbers, which suggests that the process allocates large binaries.<br><br>The only question I have now is:<br><br>How can I include more information in the logs before the process dies,
such as the number of messages in the queue?<br><br>We also tried to set up a system monitor that triggers well below the limit at which the process is killed:<br><br>Options = [{long_gc, 10000}, {large_heap, 1000000}, busy_port, busy_dist_port],<br>erlang:system_monitor(self(), Options),<br><br>%% Forward every system monitor event to the logger<br>handle_info({monitor, Pid, Type, Details}, State) -><br>    log_system_event({Type, Pid, Details}),<br>    {noreply, State}.<br><br>%% Log details in a separate process so the monitor itself is not blocked<br>log_system_event({large_heap, GcPid, Info}) -><br>    LogFun = fun() -><br>        case recon:info(GcPid, messages) of<br>            {messages, Messages} -><br>                ?WARNING_MSG("Large heap (~p): ~p~nProcess info: ~p~nProcess state size (words in the heap): ~p~nMessage queue (first 10): ~p~n",<br>                    [GcPid, Info, recon:info(GcPid), erts_debug:size(recon:get_state(GcPid)), lists:sublist(Messages, 10)]);<br>            undefined -><br>                %% The process was already dead when we looked<br>                ?WARNING_MSG("Large heap (~p): ~p~nProcess info is not available", [GcPid, Info])<br>        end<br>    end,<br>    spawn(LogFun).<br><br>Unfortunately, the processes that have this issue live for less than 4 seconds, so this event is never triggered in time.<br><br>Any help is appreciated!<br><br>Silviu<br></div>