[erlang-questions] eheap cannot allocate for which process?

Mon Mar 17 16:48:46 CET 2014

For this issue, see the format_status callback that OTP behaviours support:
http://www.erlang.org/doc/man/gen_server.html#Module:format_status-2

It should let you control what gets logged about a process's state, and
reduce the size of the messages logged.
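Below is a minimal sketch of a gen_server that uses format_status/2 to keep a possibly huge state out of crash reports. The module name fat_server, the map-based state and the summary term are hypothetical. On recent OTP releases, format_status/1 is the newer, preferred form of this callback.

```erlang
%% fat_server: hypothetical example module; only format_status/2 is the
%% point of interest, the rest is a minimal gen_server skeleton.
-module(fat_server).
-behaviour(gen_server).

-export([start_link/0]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2,
         terminate/2, code_change/3, format_status/2]).

%% Stand-in for a large state (e.g. a big routing table).
-record(state, {table = #{} :: map()}).

start_link() ->
    gen_server:start_link(?MODULE, [], []).

init([]) ->
    {ok, #state{}}.

handle_call({put, K, V}, _From, S = #state{table = T}) ->
    {reply, ok, S#state{table = maps:put(K, V, T)}};
handle_call(_Req, _From, S) ->
    {reply, {error, unknown_call}, S}.

handle_cast(_Msg, S) -> {noreply, S}.
handle_info(_Info, S) -> {noreply, S}.
terminate(_Reason, _S) -> ok.
code_change(_OldVsn, S, _Extra) -> {ok, S}.

%% Opt = terminate: called when the server terminates abnormally; the
%% result replaces the state in the crash report.
%% Opt = normal: called by sys:get_status/1.
%% Both return a small summary instead of the full table.
format_status(terminate, [_PDict, #state{table = T}]) ->
    summary(T);
format_status(normal, [_PDict, #state{table = T}]) ->
    [{data, [{"State", summary(T)}]}].

summary(T) ->
    {state_summary, [{entries, maps:size(T)}]}.
```

With this in place, the error report for a crash shows `{state_summary, [{entries, N}]}` instead of the whole map.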

More generally, however, if you need to dig into a crash dump, I
recommend using the scripts I added to recon:
https://github.com/ferd/recon/tree/master/script
One of them runs a quick diagnostic over the crash dump and outputs the
information I've always found useful while debugging, and the awk script
lists the functions that processes were running when their mailboxes
were huge.
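To list the processes with the largest heaps straight from a dump, here is a rough sketch. The module dump_top is hypothetical and is not one of the recon scripts. It assumes the dump marks each process with a `=proc:<Pid>` header and contains `Stack+heap: N` and `OldHeap: N` lines inside that section. It adds the two numbers per process.

```erlang
%% dump_top: hypothetical helper, not part of recon or OTP.
%% Reports the processes with the largest Stack+heap (+ OldHeap) in a
%% crash dump, assuming "=proc:<Pid>" section headers and
%% "Stack+heap: N" / "OldHeap: N" lines inside each section.
-module(dump_top).
-export([top/2, top_from_lines/2]).

%% Read the whole dump and return the N largest {PidBinary, Size} pairs.
top(File, N) ->
    {ok, Bin} = file:read_file(File),
    Lines = binary:split(Bin, [<<"\r\n">>, <<"\n">>], [global]),
    top_from_lines(Lines, N).

%% Same as top/2, but on an already split list of lines.
top_from_lines(Lines, N) ->
    Sizes = collect(Lines, undefined, #{}),
    Sorted = lists:reverse(lists:keysort(2, maps:to_list(Sizes))),
    lists:sublist(Sorted, N).

%% Walk the lines, remembering which =proc section we are in.
collect([], _Current, Acc) ->
    Acc;
collect([<<"=proc:", Pid/binary>> | Rest], _Current, Acc) ->
    collect(Rest, Pid, Acc);
collect([<<"=", _/binary>> | Rest], _Current, Acc) ->
    %% Any other section header (=port, =ets, =proc_heap, ...) ends it.
    collect(Rest, undefined, Acc);
collect([<<"Stack+heap: ", N/binary>> | Rest], Pid, Acc) when Pid =/= undefined ->
    collect(Rest, Pid, add(Pid, N, Acc));
collect([<<"OldHeap: ", N/binary>> | Rest], Pid, Acc) when Pid =/= undefined ->
    collect(Rest, Pid, add(Pid, N, Acc));
collect([_ | Rest], Current, Acc) ->
    collect(Rest, Current, Acc).

%% Add one size field to the running total for Pid.
add(Pid, NBin, Acc) ->
    Size = list_to_integer(string:trim(binary_to_list(NBin))),
    maps:update_with(Pid, fun(S) -> S + Size end, Size, Acc).
```

Usage: `dump_top:top("erl_crash.dump", 10).` These sizes may be in words, as heap sizes from process_info/2 are. If so, multiply them by `erlang:system_info(wordsize)` before comparing them with byte figures such as the `=memory` section. Treat that as an assumption to check against your OTP version's crash dump documentation.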

If the node is still running when you see problems appearing, I'd
suggest looking into recon as a whole (docs: http://ferd.github.io/recon/)
and seeing whether the issues are related to the total memory size and
how it's allocated (see recon_alloc), binary memory "leaks" (see
recon:bin_leak/1), and so on.
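On a live node with recon loaded, `recon:proc_count(memory, 10)` lists the ten processes using the most memory. If recon is not available, here is a rough stand-in that uses only `erlang:processes/0` and `erlang:process_info/2`. The module name proc_top is hypothetical.

```erlang
%% proc_top: hypothetical module, a dependency-free stand-in for
%% recon:proc_count(memory, N).
-module(proc_top).
-export([by_memory/1]).

%% Return the N live processes with the largest total memory (bytes),
%% as {Pid, Bytes} pairs sorted from largest to smallest.
by_memory(N) ->
    Sizes = [{P, Bytes}
             || P <- erlang:processes(),
                %% process_info/2 returns undefined for a process that
                %% exited meanwhile; the generator pattern skips it.
                {memory, Bytes} <- [erlang:process_info(P, memory)]],
    lists:sublist(lists:reverse(lists:keysort(2, Sizes)), N).
```

Suspects can then be inspected with, for example, `erlang:process_info(Pid, [registered_name, current_function, message_queue_len])`.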

If you find that garbage collections tend to solve the problems, binary
memory issues wouldn't surprise me, but that's always system-specific.
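The idea behind recon:bin_leak/1 is to compare how many reference-counted binaries each process holds before and after a forced garbage collection. Here is a rough, dependency-free sketch of that idea. It is not recon's implementation, and the module name bin_gc_diff is hypothetical. It collects processes one at a time rather than snapshotting them all first. Forcing a GC on every process can be expensive on a busy node.

```erlang
%% bin_gc_diff: hypothetical module sketching the idea behind
%% recon:bin_leak/1; not recon's actual implementation.
-module(bin_gc_diff).
-export([top/1]).

%% Return the N processes whose binary-reference count dropped the most
%% after a forced GC, as {Pid, Dropped} pairs, largest drop first.
top(N) ->
    Diffs = lists:filtermap(fun diff/1, erlang:processes()),
    lists:sublist(lists:reverse(lists:keysort(2, Diffs)), N).

diff(Pid) ->
    case bin_count(Pid) of
        undefined ->
            false;                      % process already gone
        Before ->
            _ = erlang:garbage_collect(Pid),
            case bin_count(Pid) of
                undefined -> false;     % process exited during the GC
                After -> {true, {Pid, Before - After}}
            end
    end.

%% Number of binaries the process currently references.
bin_count(Pid) ->
    case erlang:process_info(Pid, binary) of
        {binary, Bins} -> length(Bins);
        undefined -> undefined
    end.
```

A process that repeatedly shows a large drop is holding on to binaries it no longer needs until its next collection.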

Regards,
Fred.

On 03/17, Alexandre Snarskii wrote:
> On Mon, Mar 17, 2014 at 08:51:13AM +0000, József Bérces wrote:
> > I receive the classic "eheap_alloc: Cannot allocate..." message. It wants to
> > allocate ~1GB of memory and that fails. That is fine; I am doing something wrong.
> > So I took the crash dump and tried to find out which one of my processes is
> > the guilty one.
> 
> Some months ago I had a similar problem: an application running happily
> with ~400MB of RAM (on a 2GB RAM host), mostly consisting of four "major
> consumers" ("huge state" FSMs, ~80MB each), started crashing with the
> same eheap_alloc: Cannot allocate 729810240 bytes of memory (of type "heap")
> message. After some investigation (and switching from the stock SASL
> error_logger to lager) I found that the "guilty" processes were error_logger
> and gproc, and that the problem is a bit deeper.
> 
> Some screens: after the FSM crashed, restarted and rebuilt its state, I saw:
> 
> Pid                   Initial Call                          Heap     Reds Msgs
> Registered            Current Function                     Stack
> <0.5.0>               gen_event:init_it/6               59786060  5199822    0
> error_logger          gen_event:fetch_msg/5                    8
> <0.47.0>              gproc:init/1                      19590700  1650033    0
> gproc                 gen_server:loop/6                        9
> <0.221.0>             ebgp_conn:init/1                  19590700 20382184    0
>                       gen_fsm:loop/7                          10
> 
> where <0.221.0> is my "fat FSM" after the crash, restart and "state download".
> process_info(pid(0,5,0)) shows:
> 
>  {total_heap_size,107614910},
>  {heap_size,59786060},
>  {stack_size,8},
>  {reductions,5199822},
>  {garbage_collection,[{min_bin_vheap_size,46368},
>                       {min_heap_size,233},
>                       {fullsweep_after,65535},
>                       {minor_gcs,1}]},
> 
> but after a manual call to garbage_collect(pid(0, 5, 0)), heap usage
> decreased significantly:
> 
>  {total_heap_size,233},
>  {heap_size,233},
>  {stack_size,8},
>  {reductions,5199822},
> 
> and the same memory decrease happened with gproc. 
> 
> How I explain the VM crash (I'm not 100% sure; I still consider myself a
> novice in Erlang): when a process crashes, its state is sent to all processes
> monitoring it (gproc in this case) and to error_logger. The state is big
> in my case (and in yours too), and there is no shared memory in Erlang.
> So it's pretty logical that the state of the failed process gets duplicated
> (maybe even triplicated, if the copy happens while the original process heap
> is not yet freed), and this duplication can cause the eheap error,
> especially when more than one "fat" process crashes at the same time.
> 
> Lesson learned: while the "let it crash" approach is generally good, it is
> not so good with "fat" processes, especially heavily linked/monitored
> "fat" processes.
> 
> PS: error_logger and gproc are, of course, not guilty. They were just
> efficient enough that their garbage collector had not been called yet.
> 
> 
> >
> > Unfortunately, I cannot tell which one from the crash dump.
> >
> > The memory section says:
> >
> > =memory
> > total: 15447352528
> > processes: 15140232809
> > processes_used: 15140005610
> > system: 307119719
> > atom: 512601
> > atom_used: 496586
> > binary: 148574400
> > code: 21228007
> > ets: 119988984
> >
> > I have 16GB of RAM, so the processes use almost all of it. There are 4010
> > processes: 1 garbing, 31 scheduled, 3978 waiting. If I sum stack+heap of
> > all the processes, I get ~700MB. That is very far from 16GB. Here are the
> > top 10 processes by stack+heap:
> >
> > Pid          State                     Reductions   Stack+heap  MsgQ Length
> > <0.21060.67> Garbing (limited info) 1,508,838,180  145,962,050            1
> > <0.25689.27> Waiting                   86,670,344  145,962,050            0
> > <0.10003.68> Waiting                    1,363,039   38,263,080            0
> > <0.15943.66> Waiting                1,882,465,380   30,610,465            0
> > <0.15879.68> Waiting                      471,549   30,610,465            0
> > <0.31854.67> Waiting                  154,500,777   24,488,375            0
> > <0.16221.68> Waiting                      262,114   24,488,375            0
> > <0.16628.68> Waiting                      117,268   24,488,375            0
> > <0.15878.68> Waiting                      453,490   19,590,700            0
> > <0.16235.68> Waiting                      181,968   19,590,700            0
> >
> > Any ideas on how to tell which process needs ~1GB of memory?
> >
> > Thanks,
> > Jozsef
> > 
> 
> > _______________________________________________
> > erlang-questions mailing list
> > erlang-questions@REDACTED
> > http://erlang.org/mailman/listinfo/erlang-questions
> 
> 
> -- 
> In theory, there is no difference between theory and practice. 
> But, in practice, there is. 