[erlang-questions] eheap cannot allocate for which process?

Mon Mar 17 15:57:46 CET 2014

On Mon, Mar 17, 2014 at 08:51:13AM +0000, József Bérces wrote:
> I receive the classic "eheap_alloc: Cannot allocate..." message. It wants to
> allocate ~1GB memory and that fails. That is fine, I am doing something wrong.
> So I took the crash dump and tried to find out which one of my processes is 
> the guilty one.

Some months ago I had a similar problem: an application that had been running
happily with ~400 MB of RAM (on a host with 2 GB), mostly consisting of four
"major consumers" ("huge state" FSMs, ~80 MB each), started crashing with the
same eheap_alloc: Cannot allocate 729810240 bytes of memory (of type "heap").
message. After some investigation (and switching from the stock SASL
error_logger to lager) I found that the "guilty" processes were error_logger
and gproc, and that the problem goes a bit deeper.

Some shell output: after the FSM crashed, restarted and rebuilt its state, I saw:

Pid                   Initial Call                          Heap     Reds Msgs
Registered            Current Function                     Stack
<0.5.0>               gen_event:init_it/6               59786060  5199822    0
error_logger          gen_event:fetch_msg/5                    8
<0.47.0>              gproc:init/1                      19590700  1650033    0
gproc                 gen_server:loop/6                        9
<0.221.0>             ebgp_conn:init/1                  19590700 20382184    0
                      gen_fsm:loop/7                          10

where <0.221.0> is my "fat FSM" after the crash, restart and "state download".
process_info(pid(0,5,0)) shows:

 {total_heap_size,107614910},
 {heap_size,59786060},
 {stack_size,8},
 {reductions,5199822},
 {garbage_collection,[{min_bin_vheap_size,46368},
                      {min_heap_size,233},
                      {fullsweep_after,65535},
                      {minor_gcs,1}]},

but after a manual call to garbage_collect(pid(0, 5, 0)) the heap usage
decreased significantly:

 {total_heap_size,233},
 {heap_size,233},
 {stack_size,8},
 {reductions,5199822},

and the same memory decrease happened with gproc. 
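
For anyone who wants to repeat the check above on a live node, here is a
minimal sketch. The module name heap_top and its function names are my own
invention for illustration; it only uses erlang:processes/0,
erlang:process_info/2, erlang:system_info/1 and erlang:garbage_collect/1.
Note that process_info reports heap sizes in words, so the sketch multiplies
by the word size to get bytes.

```erlang
-module(heap_top).
-export([top/1, gc_and_report/1]).

%% Return up to N {Bytes, Pid, RegisteredName} tuples, largest heap first.
%% Processes that exit while we are looking are silently skipped.
top(N) when is_integer(N), N >= 0 ->
    WordSize = erlang:system_info(wordsize),
    Sizes = [{Words * WordSize, Pid, name(Pid)}
             || Pid <- erlang:processes(),
                {total_heap_size, Words} <-
                    [erlang:process_info(Pid, total_heap_size)]],
    lists:sublist(lists:reverse(lists:sort(Sizes)), N).

%% Force a garbage collection of Pid and return the total heap size in
%% bytes before and after, as {Before, After}.
gc_and_report(Pid) when is_pid(Pid) ->
    Before = heap_bytes(Pid),
    _ = erlang:garbage_collect(Pid),
    {Before, heap_bytes(Pid)}.

heap_bytes(Pid) ->
    case erlang:process_info(Pid, total_heap_size) of
        {total_heap_size, Words} -> Words * erlang:system_info(wordsize);
        undefined -> undefined          % process is already gone
    end.

name(Pid) ->
    case erlang:process_info(Pid, registered_name) of
        {registered_name, Name} -> Name;
        _ -> undefined                  % unregistered or already gone
    end.
```

Something like heap_top:top(10) followed by heap_top:gc_and_report(Pid) on
the suspicious entries should show whether a big heap is live data or just
garbage that has not been collected yet.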

Here is how I explain the VM crash (I am not 100% sure, I still consider
myself a novice in Erlang): when a process crashes, its state is sent to all
processes monitoring it (gproc in this case) and to error_logger. The state is
big in my case (and in yours too), and there is no shared memory in Erlang.
So it is pretty logical that the state of the failed process was duplicated
(maybe even triplicated, if the copy happens before the original process heap
has been freed), and this duplication can cause the eheap error, especially
when more than one "fat" process crashes at the same time.
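
The copying itself is easy to see in isolation. The sketch below is only an
illustration of "a big term sent as a message ends up on the receiver's heap
too"; it does not reproduce the crash path through gproc or error_logger.
The module name copy_demo and the message tags are made up for this example.

```erlang
-module(copy_demo).
-export([run/1]).

%% Send a list of N integers to a fresh process and return the receiver's
%% memory footprint in bytes as {BeforeSend, AfterReceive}.
run(N) when is_integer(N), N > 0 ->
    Big = lists:seq(1, N),
    Receiver = spawn(fun hold/0),
    {memory, Before} = erlang:process_info(Receiver, memory),
    Receiver ! {keep, self(), Big},
    %% Wait until the receiver has taken the message out of its mailbox.
    receive {kept, Receiver} -> ok end,
    {memory, After} = erlang:process_info(Receiver, memory),
    Receiver ! stop,
    {Before, After}.

hold() ->
    receive
        {keep, From, Term} ->
            From ! {kept, self()},
            hold(Term)
    end.

%% Term is used after the receive, so it stays live until we are told to stop.
hold(Term) ->
    receive
        stop -> length(Term)
    end.
```

With a large N the second number should be noticeably bigger than the first,
while the sender still holds its own copy of the list.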

Lesson learned: while the "let it crash" approach is generally good, it is
not so good with "fat" processes, especially heavily linked/monitored
"fat" processes.
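
One thing that may help with the error_logger side (this is not what I did,
just a sketch): gen_fsm calls the optional format_status/2 callback with
Opt = terminate when it logs an abnormal termination, so the callback can
return a short summary instead of the whole state. The #state{} record and
its fields below are hypothetical, and in real code the function would live
in the FSM callback module itself. It does not change what linked or
monitoring processes receive as the exit reason.

```erlang
-module(fat_fsm_status).
-export([format_status/2]).

%% Hypothetical state record; a real FSM would have its own.
-record(state, {peer, routes = []}).

%% Called by gen_fsm when the process terminates abnormally and the error
%% is logged: return a small summary instead of the full state.
format_status(terminate, [_PDict, #state{peer = Peer, routes = Routes}]) ->
    [{peer, Peer}, {route_count, length(Routes)}];
%% Called for sys:get_status/1,2: keep the default-looking layout.
format_status(normal, [_PDict, StateData]) ->
    [{data, [{"StateData", StateData}]}].
```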

PS: error_logger and gproc are of course not guilty. They are just
efficient enough that their garbage collector had not been triggered yet.

> Unfortunately, I cannot tell it from the crash dump.
>
> The memory section says:
>
> =memory
> total: 15447352528
> processes: 15140232809
> processes_used: 15140005610
> system: 307119719
> atom: 512601
> atom_used: 496586
> binary: 148574400
> code: 21228007
> ets: 119988984
>
> I have 16GB RAM, so the processes use almost all. There are 4010 processes. 1
> garbing, 31 scheduled, 3978 waiting. If I sum stack+heap of all the processes
> then I get ~700MB. That is very far from 16GB. Here are the top 10 stack+heap
> processes:
>
> Pid          State                      Reductions   Stack+heap  MsgQ Length
> <0.21060.67> Garbing (limited info)  1,508,838,180  145,962,050            1
> <0.25689.27> Waiting                    86,670,344  145,962,050            0
> <0.10003.68> Waiting                     1,363,039   38,263,080            0
> <0.15943.66> Waiting                 1,882,465,380   30,610,465            0
> <0.15879.68> Waiting                       471,549   30,610,465            0
> <0.31854.67> Waiting                   154,500,777   24,488,375            0
> <0.16221.68> Waiting                       262,114   24,488,375            0
> <0.16628.68> Waiting                       117,268   24,488,375            0
> <0.15878.68> Waiting                       453,490   19,590,700            0
> <0.16235.68> Waiting                       181,968   19,590,700            0
>
> Any ideas how to tell which process needs ~1GB memory?
>
> Thanks,
> Jozsef

-- 
In theory, there is no difference between theory and practice. 
But, in practice, there is.