[erlang-questions] Garbage Collection, BEAM memory and Erlang memory

Thu Jan 22 17:33:57 CET 2015

Dear List,
I'm having trouble pinpointing why a node is crashing due to
memory issues.
For info, when it crashes, it does not produce a crash dump. However, I've
monitored it live and seen the BEAM process eat up all memory until it
abruptly exits.

The system is a big router that relays data coming from TCP connections
into other TCP connections.
I'm using cowboy as the HTTP server that initiates the long-lived TCP
connections.

I've done all the obvious checks:

   - Checked the states of my gen_servers and processes.
   - Checked my processes mailboxes (the ones with the longest queue have 1
   item in the inbox).
   - My ETS table memory is constant (see below).
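
For reference, the mailbox check in the list above can be sketched roughly
like this. It is a minimal sketch, not necessarily the exact call that was
used: LongestQueues is a hypothetical name, and processes that exit while the
scan is running are simply skipped.

```erlang
%% Hypothetical helper: the N processes with the longest mailboxes.
LongestQueues = fun(N) ->
  Lens = [{Pid, Len}
          || Pid <- processes(),
             %% process_info/2 returns undefined for a dead process;
             %% the pattern below filters those out.
             {message_queue_len, Len} <-
               [erlang:process_info(Pid, message_queue_len)]],
  %% keysort/2 sorts ascending on the length, so reverse for longest first.
  lists:sublist(lists:reverse(lists:keysort(2, Lens)), N)
end.
```

Calling LongestQueues(5) returns up to five {Pid, QueueLength} tuples,
longest queue first.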

I put the system under controlled load, and I can see with
length(processes()). that my process count is stable, always around 120,000.

I check which processes are using the most memory with this call:

MostMemory = fun(N) ->
  lists:sublist(
    lists:sort(
      fun({_, _, V1}, {_, _, V2}) -> V1 >= V2 end,
      [try
        [{memory, Mem}, {registered_name, RegName}] =
          erlang:process_info(Pid, [memory, registered_name]),
        {Pid, RegName, Mem}
      catch _:_ ->
        {Pid, undefined, 0}
      end || Pid <- processes(), Pid =/= self()]
    ), N)
  end.

Which always returns very similar numbers:

1> MostMemory(20).
[{<0.96.0>,[],5180448},
 {<0.78.0>,tls_connection_sup,4525096},
 {<0.6.0>,error_logger,743776},
 {<0.7.0>,application_controller,372592},
 {<0.77.0>,ssl_manager,284640},
 {<0.11.0>,kernel_sup,176712},
 {<0.26.0>,code_server,176272},
 {<0.33.0>,[],143064},
 {<0.419.0>,[],142896},
 {<0.420.0>,[],142896},
 {<0.421.0>,[],142896},
 {<0.422.0>,[],142896},
 {<0.423.0>,[],142896},
 {<0.424.0>,[],142896},
 {<0.425.0>,[],142896},
 {<0.426.0>,[],142896},
 {<0.427.0>,[],142896},
 {<0.428.0>,[],142896},
 {<0.429.0>,[],142896},
 {<0.430.0>,[],142896}]

See the last processes there, all with identical memory? These are the
processes handling the connections, and they stay at that same
number throughout the whole test.

I get the pid of the BEAM process, and I check its reported RES memory
with top -p beam-pid-here.
I get my erlang memory with this simple call (I just convert everything to
GB, thanks to Ferd and his article
https://blog.heroku.com/archives/2013/11/7/logplex-down-the-rabbit-hole):

[{K,V / math:pow(1024,3)} || {K,V} <- erlang:memory()].

This is what I get (at random time intervals):

- BEAM process RES memory: 2.751 GB
- Erlang memory:
[{total,2.11871287971735},
 {processes,1.6582859307527542},
 {processes_used,1.6581560596823692},
 {system,0.4604269489645958},
 {atom,4.000673070549965e-4},
 {atom_used,3.846092149615288e-4},
 {binary,0.29880597442388535},
 {code,0.009268132038414478},
 {ets,0.004808835685253143}]

- BEAM process RES memory: 3.039 GB
- Erlang memory:
[{total,2.2570599243044853},
 {processes,1.7243007272481918},
 {processes_used,1.7241046279668808},
 {system,0.5327591970562935},
 {atom,4.000673070549965e-4},
 {atom_used,3.846092149615288e-4},
 {binary,0.37129393219947815},
 {code,0.009268132038414478},
 {ets,0.004808835685253143}]

- BEAM process RES memory: 3.630 GB
- Erlang memory:
[{total,2.677028402686119},
 {processes,2.1421403884887695},
 {processes_used,2.142106533050537},
 {system,0.5348880141973495},
 {atom,4.000673070549965e-4},
 {atom_used,3.846092149615288e-4},
 {binary,0.37329262495040894},
 {code,0.009268132038414478},
 {ets,0.004808835685253143}]

- BEAM process RES memory: 3.807 GB
- Erlang memory:
[{total,2.9233806803822517},
 {processes,2.277688652276993},
 {processes_used,2.277618482708931},
 {system,0.6456920281052589},
 {atom,4.000673070549965e-4},
 {atom_used,3.846092149615288e-4},
 {binary,0.48407071083784103},
 {code,0.009268132038414478},
 {ets,0.004808835685253143}]

- BEAM process RES memory: 4.026 GB
- Erlang memory:
[{total,2.8762372359633446},
 {processes,2.100425034761429},
 {processes_used,2.1003194376826286},
 {system,0.7758122012019157},
 {atom,4.000673070549965e-4},
 {atom_used,3.846092149615288e-4},
 {binary,0.6143399104475975},
 {code,0.009268132038414478},
 {ets,0.004808835685253143}]

- BEAM process RES memory: 4.136 GB
- Erlang memory:
[{total,2.9030912443995476},
 {processes,2.028559662401676},
 {processes_used,2.0283572375774384},
 {system,0.8745315819978714},
 {atom,4.000673070549965e-4},
 {atom_used,3.847004845738411e-4},
 {binary,0.7129654437303543},
 {code,0.00929991528391838},
 {ets,0.004809550940990448}]

- BEAM process RES memory: 4.222 GB
- Erlang memory:
[{total,2.785604253411293},
 {processes,1.875294029712677},
 {processes_used,1.8752291351556778},
 {system,0.910310223698616},
 {atom,4.000673070549965e-4},
 {atom_used,3.847004845738411e-4},
 {binary,0.7487552836537361},
 {code,0.00929991528391838},
 {ets,0.004809550940990448}]

As you can see, at the beginning both the BEAM RES memory and the total
Erlang memory increase, but after a while it becomes clear that the BEAM
process memory keeps increasing while the memory reported as used by Erlang
stabilizes, and even decreases.
Erlang reported memory never surpasses 3 GB.
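
To make the divergence concrete, here is a small sketch that computes the gap
between RES and erlang:memory(total) for the seven samples above. The values
are copied from the listings and rounded to three decimals; Samples and Gaps
are just illustrative names.

```erlang
%% {RES in GB, erlang:memory(total) in GB}, in the order sampled above.
Samples = [{2.751, 2.119}, {3.039, 2.257}, {3.630, 2.677}, {3.807, 2.923},
           {4.026, 2.876}, {4.136, 2.903}, {4.222, 2.786}].

%% Memory held by the OS process but not accounted for by erlang:memory(total).
Gaps = [Res - Total || {Res, Total} <- Samples].
```

The gap grows from roughly 0.63 GB in the first sample to roughly 1.44 GB in
the last one, although not strictly monotonically.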

At this point I tried forcing a Garbage Collection:

[erlang:garbage_collect(Pid) || Pid <- processes()]

After that, we went back to:

- BEAM process RES memory: 3.336 GB
- Erlang memory:
[{total,1.9107630401849747},
 {processes,1.5669479593634605},
 {processes_used,1.5668926388025284},
 {system,0.34381508082151413},
 {atom,4.000673070549965e-4},
 {atom_used,3.847004845738411e-4},
 {binary,0.18235664814710617},
 {code,0.00929991528391838},
 {ets,0.004809550940990448}]

However, when I let the system run after that, it kept showing the same
behavior (and the BEAM memory kept increasing).
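
For anyone who wants to repeat the forced collection and see how much it
frees, a minimal sketch could look like this. GcAndMeasure is a hypothetical
name; it only reports what erlang:memory(total) sees, not the RES figure from
top.

```erlang
%% Hypothetical helper: force a GC on every process and report the effect.
GcAndMeasure = fun() ->
  Before = erlang:memory(total),
  %% garbage_collect/1 returns false for processes that have already exited;
  %% the results are ignored here.
  _ = [erlang:garbage_collect(Pid) || Pid <- processes()],
  After = erlang:memory(total),
  %% All three values are in bytes; the last one is the amount freed.
  {Before, After, Before - After}
end.
```

Sweeping every process like this is expensive on a node with about 120,000
processes, so it is a diagnostic step rather than something to run
continuously.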

What puzzles me is that you can clearly see that:

   - The total memory used by processes is increasing, however the top
   processes always use the same amount of memory (and the process count is
   always stable).
   - Binary consumption also increases, but in proportion with process
   memory (and my data is <64K so I don't anticipate it being an issue of
   Refc-binaries not being garbage collected).
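
Since the top 20 list only shows the largest individual processes, one way to
see where the aggregate process memory goes is to sum memory per kind of
process. The following is a sketch under assumptions: MemoryByInitialCall is
a hypothetical name, and grouping by initial_call is coarse, because
processes started through proc_lib (gen_servers and the like) all report
proc_lib:init_p/5.

```erlang
%% Hypothetical helper: total memory and process count per initial call.
MemoryByInitialCall = fun() ->
  Totals = lists:foldl(
    fun(Pid, Acc) ->
      case erlang:process_info(Pid, [memory, initial_call]) of
        [{memory, Mem}, {initial_call, MFA}] ->
          {Count, Sum} = maps:get(MFA, Acc, {0, 0}),
          maps:put(MFA, {Count + 1, Sum + Mem}, Acc);
        undefined ->
          %% The process exited during the scan.
          Acc
      end
    end, #{}, processes()),
  Rows = [{MFA, Count, Sum} || {MFA, {Count, Sum}} <- maps:to_list(Totals)],
  %% Largest total memory first.
  lists:reverse(lists:keysort(3, Rows))
end.
```

Each row is {InitialCall, ProcessCount, TotalBytes}. Comparing two runs taken
a few minutes apart shows which group is growing even when no single process
stands out.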

I already hibernate most of the long-term open connections.
I also added a periodic garbage collection on the main router, since it
touches all the binaries that go through it, to ensure that all
Refc-binaries that the router still holds a reference to are garbage collected.
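
To check which processes are actually holding on to Refc-binaries, a sketch
along these lines can be used. BinaryRefs is a hypothetical name, and
process_info(Pid, binary) is documented as a debugging item whose format may
change without notice, so the {Id, Size, RefCount} shape is an assumption
about the current implementation.

```erlang
%% Hypothetical helper: the N processes referencing the most Refc-binary bytes.
BinaryRefs = fun(N) ->
  Sizes = [{Pid, length(Bins), lists:sum([Size || {_Id, Size, _Refs} <- Bins])}
           || Pid <- processes(),
              %% Dead processes return undefined and are filtered out here.
              {binary, Bins} <- [erlang:process_info(Pid, binary)]],
  %% Sort on the summed size, largest first.
  lists:sublist(lists:reverse(lists:keysort(3, Sizes)), N)
end.
```

Each result is {Pid, ReferenceCount, TotalBytes}. A binary shared by several
processes is counted once per process, so the totals can add up to more than
erlang:memory(binary).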

So I tried the hard approach, and I've set fullsweep_after to 0 as a system
flag (passed in as an environment variable -env ERL_FULLSWEEP_AFTER 0).

After this, I could see notable improvements:

- BEAM process RES memory: 2.049 GB
- Erlang memory:
[{total,1.597476489841938},
 {processes,1.2037805244326591},
 {processes_used,1.2036690935492516},
 {system,0.39369596540927887},
 {atom,4.000673070549965e-4},
 {atom_used,3.846092149615288e-4},
 {binary,0.2321353331208229},
 {code,0.009268132038414478},
 {ets,0.004821933805942535}]

- BEAM process RES memory: 1.919 GB
- Erlang memory:
[{total,1.549286112189293},
 {processes,1.1740453317761421},
 {processes_used,1.1739420965313911},
 {system,0.3752407804131508},
 {atom,4.000673070549965e-4},
 {atom_used,3.846092149615288e-4},
 {binary,0.2134672999382019},
 {code,0.009268132038414478},
 {ets,0.004821933805942535}]

- BEAM process RES memory: 2.004 GB
- Erlang memory:
[{total,1.6023957282304764},
 {processes,1.2192133665084839},
 {processes_used,1.219102293252945},
 {system,0.3831823617219925},
 {atom,4.000673070549965e-4},
 {atom_used,3.846092149615288e-4},
 {binary,0.22155668586492538},
 {code,0.009268132038414478},
 {ets,0.004821933805942535}]

- BEAM process RES memory: 2.456 GB
- Erlang memory:
[{total,1.7860298827290535},
 {processes,1.4158401936292648},
 {processes_used,1.4157484397292137},
 {system,0.37018968909978867},
 {atom,4.000673070549965e-4},
 {atom_used,3.846092149615288e-4},
 {binary,0.20867645740509033},
 {code,0.009268132038414478},
 {ets,0.004821933805942535}]

- BEAM process RES memory: 2.455 GB
- Erlang memory:
[{total,1.8919306173920631},
 {processes,1.4726912006735802},
 {processes_used,1.4726523533463478},
 {system,0.41923941671848297},
 {atom,4.000673070549965e-4},
 {atom_used,3.846092149615288e-4},
 {binary,0.25766071677207947},
 {code,0.009268132038414478},
 {ets,0.004821933805942535}]

However, the downside is obviously that the CPU load increased
by almost a point.
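
A narrower variant of the same idea is to set fullsweep_after only on selected
processes at spawn time instead of node-wide, which may limit the CPU cost.
The following is a minimal sketch, not what is deployed here; Worker is a
hypothetical placeholder process.

```erlang
%% Hypothetical worker that just waits for a stop message.
Worker = fun() -> receive stop -> ok end end.

%% Only this process does a fullsweep on every collection.
Pid = spawn_opt(Worker, [{fullsweep_after, 0}]).

%% Confirm the setting took effect for this process.
{garbage_collection, GcInfo} = erlang:process_info(Pid, garbage_collection).
proplists:get_value(fullsweep_after, GcInfo).
```

For a gen_server the same option can be passed at start, for example
gen_server:start_link(my_router, [], [{spawn_opt, [{fullsweep_after, 0}]}]),
where my_router is a made-up module name.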

I also have a GC "guardian" similar to the one that Fred implemented in
Heroku's logplex:
https://github.com/heroku/logplex/blob/master/src/logplex_leak.erl

But this obviously is a guard, not a solution per se.

Can anyone give me some pointers on how I can proceed to identify what is
going on?

Thank you,
r.