[erlang-questions] eheap cannot allocate for which process?

Fri Mar 21 06:15:57 CET 2014

Thanks for all the replies so far.

I played a bit with recon and I see a huge gap between the allocated and the used memory:

28> recon_alloc:memory(used).
368489488
29> recon_alloc:memory(allocated).
15139661416
30> recon_alloc:memory(unused).
14757773688
31> recon_alloc:memory(allocated_types,current).
[{binary_alloc,199241928},
 {driver_alloc,950472},
 {eheap_alloc,438374600},
 {ets_alloc,14441853128},
 {fix_alloc,9117896},
 {ll_alloc,33554472},
 {sl_alloc,164040},
 {std_alloc,7545032},
 {temp_alloc,655560}]
32> recon_alloc:memory(allocated_instances,current).
[{0,43327848},
 {1,5131874624},
 {2,5086572864},
 {3,4623593792},
 {4,246088000}]
33> erlang:memory().
[{total,399193088},
 {processes,259732526},
 {processes_used,259670741},
 {system,139460562},
 {atom,512601},
 {atom_used,501488},
 {binary,62563168},
 {code,21154095},
 {ets,41092024}]

More than 14GB of the allocated memory is unused. This is still the case hours after the heavy workers finished and exited. Shouldn't the VM return at least a portion of the unused memory to the OS after a while?
It seems to be because of ETS: ets_alloc is about 14GB, whilst ets (from erlang:memory/0) is only about 40MB. I use R15B01. Can I expect smaller gaps if I upgrade the VM?
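
The size of that gap can be made explicit with a little arithmetic on the figures pasted above. This is only a sketch in plain Erlang; the module name alloc_gap and both functions are made up for illustration and are not part of recon or OTP:

```erlang
-module(alloc_gap).
-export([gap/2, percent_used/2]).

%% Bytes held by an allocator but not occupied by live data.
gap(Allocated, Used) when Allocated >= Used ->
    Allocated - Used.

%% Share of the allocated bytes that is actually in use, in percent.
percent_used(Allocated, Used) when Allocated > 0 ->
    100 * Used / Allocated.
```

With the ETS numbers above, alloc_gap:gap(14441853128, 41092024) gives 14400761104 bytes, and alloc_gap:percent_used/2 comes out below 0.3 percent. The two inputs come from different calls (recon_alloc and erlang:memory/0), so they are snapshots taken at slightly different moments.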

Thanks,
Jozsef

-----Original Message-----
From: Fred Hebert [mailto:mononcqc@REDACTED] 
Sent: Monday, March 17, 2014 22:49
To: Alexandre Snarskii
Cc: József Bérces; erlang-questions@REDACTED
Subject: Re: [erlang-questions] eheap cannot allocate for which process?

For this issue, see the format_status callback that OTP behaviours provide:
http://www.erlang.org/doc/man/gen_server.html#Module:format_status-2

They should let you do things the way you want and reduce the size of the messages logged.
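
As a rough illustration of that callback, here is a minimal gen_server sketch that reports a summary instead of its full state. The module name fat_server, the state record and its routes field are all hypothetical, and the summary format is just one possible choice. Newer OTP releases also offer a format_status/1 variant, so treat the /2 form below as matching the documentation linked above rather than the latest API:

```erlang
-module(fat_server).
-behaviour(gen_server).

-export([start_link/1]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2,
         terminate/2, code_change/3, format_status/2]).

%% Hypothetical "huge" state: a long list of routes.
-record(state, {routes = []}).

start_link(Routes) ->
    gen_server:start_link(?MODULE, Routes, []).

init(Routes) -> {ok, #state{routes = Routes}}.

handle_call(_Req, _From, State) -> {reply, ok, State}.
handle_cast(_Msg, State) -> {noreply, State}.
handle_info(_Info, State) -> {noreply, State}.
terminate(_Reason, _State) -> ok.
code_change(_OldVsn, State, _Extra) -> {ok, State}.

%% terminate: the returned term replaces the state in the crash log.
%% normal: the returned term is shown by sys:get_status/1.
format_status(terminate, [_PDict, #state{routes = Routes}]) ->
    {state_summary, [{route_count, length(Routes)}]};
format_status(normal, [_PDict, #state{routes = Routes}]) ->
    [{data, [{"State", {state_summary, [{route_count, length(Routes)}]}}]}].
```

This only trims the state part of the report. If the exit reason itself carries a large term, that part is still logged in full.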

More generally, however, if you need to dig into a crash dump, I recommend using the scripts I added to recon:
https://github.com/ferd/recon/tree/master/script

One of them will do a quick diagnostic over the crash dump and output the info I've always found useful while debugging, and the awk script will output all functions that were running if mailboxes were huge.

If the node is still running when you see problems appearing, I'd suggest looking into recon as a whole (docs: http://ferd.github.io/recon/) to see whether the issues are related to the total memory size and how it's allocated (see recon_alloc), binary memory "leaks" (see recon:bin_leak/1), and so on.

The binary memory stuff wouldn't necessarily surprise me if you find out GCs tend to solve problems, but that's always system-specific.
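
One way to check whether garbage collection alone gives the binary memory back, using only BIFs, is to compare erlang:memory(binary) before and after collecting every process. The sketch below assumes that forcing a GC on all processes is acceptable on the node in question (it can be disruptive on a busy system); the module name gc_probe is made up. recon:bin_leak/1 does a similar comparison per process and is the better tool when recon is available:

```erlang
-module(gc_probe).
-export([binary_delta/0]).

%% Returns {Before, After, Freed} in bytes for the binary allocator
%% figures reported by erlang:memory/1.
binary_delta() ->
    Before = erlang:memory(binary),
    %% garbage_collect/1 returns false for processes that died in
    %% the meantime, so the results are simply ignored.
    _ = [erlang:garbage_collect(Pid) || Pid <- erlang:processes()],
    After = erlang:memory(binary),
    {Before, After, Before - After}.
```

A large positive third element suggests that processes were holding on to references to binaries they no longer needed.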

Regards,
Fred.

On 03/17, Alexandre Snarskii wrote:
> On Mon, Mar 17, 2014 at 08:51:13AM +0000, József Bérces wrote:
> > I receive the classic "eheap_alloc: Cannot allocate..." message. It
> > wants to allocate ~1GB memory and that fails. That is fine, I am doing
> > something wrong. So I took the crash dump and tried to find out which
> > one of my processes is the guilty one.
> 
> Some months ago I had a similar problem: an application running
> happily with ~400MB RAM (on a 2GB RAM host), mostly consisting of four
> "major consumers" ("huge state" FSMs, ~80MB each), started crashing
> with the same message:
> eheap_alloc: Cannot allocate 729810240 bytes of memory (of type "heap").
> After some investigation (and switching from the stock SASL
> error_logger to lager) I found that the "guilty" processes were
> error_logger and gproc, and that the problem is a bit deeper.
> 
> Some screens: after FSM crash, restart and rebuilding its state I saw: 
> 
> Pid                   Initial Call                          Heap     Reds Msgs
> Registered            Current Function                     Stack
> <0.5.0>               gen_event:init_it/6               59786060  5199822    0
> error_logger          gen_event:fetch_msg/5                    8
> <0.47.0>              gproc:init/1                      19590700  1650033    0
> gproc                 gen_server:loop/6                        9
> <0.221.0>             ebgp_conn:init/1                  19590700 20382184    0
>                       gen_fsm:loop/7                          10
> 
> where 0.221.0 is my "fat FSM" after crash, restart and "state download". 
> process_info(pid(0,5,0)) shows
> 
>  {total_heap_size,107614910},
>  {heap_size,59786060},
>  {stack_size,8},
>  {reductions,5199822},
>  {garbage_collection,[{min_bin_vheap_size,46368},
>                       {min_heap_size,233},
>                       {fullsweep_after,65535},
>                       {minor_gcs,1}]},
> 
> but after a manual call to garbage_collect(pid(0, 5, 0)) heap usage
> decreased significantly:
> 
>  {total_heap_size,233},
>  {heap_size,233},
>  {stack_size,8},
>  {reductions,5199822},
> 
> and the same memory decrease happened with gproc. 
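
The before/after comparison above can be wrapped in a small helper. This is a sketch with made-up names (heap_probe, gc_effect/1, words_to_bytes/1). It relies on process_info/2 reporting heap sizes in words, so the result is converted with erlang:system_info(wordsize). It crashes with a badmatch if the process is no longer alive:

```erlang
-module(heap_probe).
-export([gc_effect/1, words_to_bytes/1]).

%% process_info/2 reports heap sizes in words, not bytes.
words_to_bytes(Words) ->
    Words * erlang:system_info(wordsize).

%% Returns {BytesBefore, BytesAfter} around a forced GC of Pid.
gc_effect(Pid) ->
    {total_heap_size, Before} = erlang:process_info(Pid, total_heap_size),
    true = erlang:garbage_collect(Pid),
    {total_heap_size, After} = erlang:process_info(Pid, total_heap_size),
    {words_to_bytes(Before), words_to_bytes(After)}.
```

For scale: assuming a 64-bit VM with 8-byte words (an assumption about that host), the total_heap_size of 107614910 words shown above corresponds to 860919280 bytes.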
> 
> How I explain the VM crash (not 100% sure, I still consider myself a
> novice in Erlang): when a process crashes, its state is sent to all
> processes monitoring it (gproc in this case) and to error_logger.
> The state is big in my case (and in yours too), and there is no shared
> memory in Erlang. So it's pretty logical that the state of the failed
> process was duplicated (maybe even triplicated, if the copy happens
> while the original process heap is not yet freed), and this
> duplication can cause the eheap error, especially when more than one
> "fat" process crashes at once.
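
The copying described above can be observed directly: a term sent as a message is copied for the receiver (large binaries are the exception, since they are reference-counted). A minimal sketch, with a made-up module name copy_demo, that sends a list of N integers to a fresh process and reports that process's heap size in words before and after it holds the list:

```erlang
-module(copy_demo).
-export([run/1]).

%% Returns {Length, WordsBefore, WordsAfter} for the receiving process.
run(N) ->
    Big = lists:seq(1, N),
    Parent = self(),
    Pid = spawn(fun() ->
        receive
            {state, S} ->
                %% Measure while S is still referenced below.
                {total_heap_size, Words} =
                    erlang:process_info(self(), total_heap_size),
                Parent ! {held, length(S), Words}
        end
    end),
    {total_heap_size, Before} = erlang:process_info(Pid, total_heap_size),
    Pid ! {state, Big},
    receive
        {held, Len, After} -> {Len, Before, After}
    end.
```

Calling copy_demo:run(100000) should show the receiver growing from a few hundred words to well over a hundred thousand, while the sender still holds its own copy of the list.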
> 
> Lesson learned: while the "let it crash" approach is generally good, it
> is not so good with "fat" processes, especially with heavily
> linked/monitored "fat" processes.
> 
> PS: and error_logger and gproc are of course not guilty. They are just
> efficient enough that their garbage collector had not been called yet.
> 
> 
> > Unfortunately, I cannot tell it from the crash dump.
> > 
> > The memory section says:
> > 
> > =memory
> > total: 15447352528
> > processes: 15140232809
> > processes_used: 15140005610
> > system: 307119719
> > atom: 512601
> > atom_used: 496586
> > binary: 148574400
> > code: 21228007
> > ets: 119988984
> > 
> > I have 16GB RAM, so the processes use almost all of it. There are 4010
> > processes: 1 garbing, 31 scheduled, 3978 waiting. If I sum
> > stack+heap of all the processes then I get ~700MB. That is very far
> > from 16GB. Here are the top 10 processes by stack+heap:
> > 
> > Pid          State                     Reductions  Stack+heap MsgQ Length
> > <0.21060.67> Garbing (limited info) 1,508,838,180 145,962,050           1
> > <0.25689.27> Waiting                   86,670,344 145,962,050           0
> > <0.10003.68> Waiting                    1,363,039  38,263,080           0
> > <0.15943.66> Waiting                1,882,465,380  30,610,465           0
> > <0.15879.68> Waiting                      471,549  30,610,465           0
> > <0.31854.67> Waiting                  154,500,777  24,488,375           0
> > <0.16221.68> Waiting                      262,114  24,488,375           0
> > <0.16628.68> Waiting                      117,268  24,488,375           0
> > <0.15878.68> Waiting                      453,490  19,590,700           0
> > <0.16235.68> Waiting                      181,968  19,590,700           0
> > 
> > Any ideas how to tell which process needs ~1GB memory?
> > 
> > Thanks,
> > Jozsef
> > 
> 
> > _______________________________________________
> > erlang-questions mailing list
> > erlang-questions@REDACTED
> > http://erlang.org/mailman/listinfo/erlang-questions
> 
> 
> --
> In theory, there is no difference between theory and practice. 
> But, in practice, there is. 