[erlang-questions] Garbage Collection, BEAM memory and Erlang memory
Roberto Ostinelli
roberto@REDACTED
Thu Jan 22 17:33:57 CET 2015
Dear List,
I'm having some trouble pinpointing why a node is crashing due to
memory issues.
For info, when it crashes it does not produce a crash dump. However,
I've monitored it live and seen the beam process eat up all memory until
it abruptly exits.
The system is a big router that relays data coming in from TCP connections
into other TCP connections.
I'm using cowboy as the HTTP server that initiates the long-lived TCP
connections.
I've done all the obvious:
- Checked the States of my gen_servers and processes.
- Checked my processes' mailboxes (the ones with the longest queues have 1
item in the inbox; see the snippet right after this list).
- My ETS table memory is constant (see below).
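Something along these lines can be used from the shell to eyeball the
longest queues (same spirit as the MostMemory fun further down; a quick
sketch, not verbatim what I ran):

LongestQueues = fun(N) ->
    %% message_queue_len per process; processes that died in the meantime
    %% return undefined and simply don't match the generator pattern
    QLens = [{Pid, QLen}
             || Pid <- processes(), Pid =/= self(),
                {message_queue_len, QLen} <-
                    [erlang:process_info(Pid, message_queue_len)]],
    lists:sublist(lists:reverse(lists:keysort(2, QLens)), N)
end.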
I put the system under controlled load, and I can see with
length(processes()) that my process count is stable, always around 120,000.
I check the processes that are using most memory with this call:
MostMemory = fun(N) ->
    lists:sublist(
        lists:sort(
            fun({_, _, V1}, {_, _, V2}) -> V1 >= V2 end,
            [try
                 [{memory, Mem}, {registered_name, RegName}] =
                     erlang:process_info(Pid, [memory, registered_name]),
                 {Pid, RegName, Mem}
             catch _:_ ->
                 {Pid, undefined, 0}
             end || Pid <- processes(), Pid =/= self()]
        ), N)
end.
It always returns very similar numbers:
1> MostMemory(20).
[{<0.96.0>,[],5180448},
{<0.78.0>,tls_connection_sup,4525096},
{<0.6.0>,error_logger,743776},
{<0.7.0>,application_controller,372592},
{<0.77.0>,ssl_manager,284640},
{<0.11.0>,kernel_sup,176712},
{<0.26.0>,code_server,176272},
{<0.33.0>,[],143064},
{<0.419.0>,[],142896},
{<0.420.0>,[],142896},
{<0.421.0>,[],142896},
{<0.422.0>,[],142896},
{<0.423.0>,[],142896},
{<0.424.0>,[],142896},
{<0.425.0>,[],142896},
{<0.426.0>,[],142896},
{<0.427.0>,[],142896},
{<0.428.0>,[],142896},
{<0.429.0>,[],142896},
{<0.430.0>,[],142896}]
See the last processes there, all with identical memory? Those are the
processes handling the connections, and they stay stable at that same
number throughout the whole test.
I get the OS pid of the beam process, and I check its reported RES memory
with top -p beam-pid-here.
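(For reference, the OS pid can be grabbed from inside the node with
os:getpid(), so the right process is easy to find:)

%% OS pid of the emulator, from inside the node:
OsPid = os:getpid().
%% then, on the same box:  top -p <that pid>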
I get my erlang memory with this simple call (I just convert everything to
GB, thanks to Ferd and his article
https://blog.heroku.com/archives/2013/11/7/logplex-down-the-rabbit-hole):
[{K,V / math:pow(1024,3)} || {K,V} <- erlang:memory()].
This is what I get (at random time intervals):
- BEAM process RES memory: 2.751 GB
- Erlang memory:
[{total,2.11871287971735},
{processes,1.6582859307527542},
{processes_used,1.6581560596823692},
{system,0.4604269489645958},
{atom,4.000673070549965e-4},
{atom_used,3.846092149615288e-4},
{binary,0.29880597442388535},
{code,0.009268132038414478},
{ets,0.004808835685253143}]
- BEAM process RES memory: 3.039 GB
- Erlang memory:
[{total,2.2570599243044853},
{processes,1.7243007272481918},
{processes_used,1.7241046279668808},
{system,0.5327591970562935},
{atom,4.000673070549965e-4},
{atom_used,3.846092149615288e-4},
{binary,0.37129393219947815},
{code,0.009268132038414478},
{ets,0.004808835685253143}]
- BEAM process RES memory: 3.630 GB
- Erlang memory:
[{total,2.677028402686119},
{processes,2.1421403884887695},
{processes_used,2.142106533050537},
{system,0.5348880141973495},
{atom,4.000673070549965e-4},
{atom_used,3.846092149615288e-4},
{binary,0.37329262495040894},
{code,0.009268132038414478},
{ets,0.004808835685253143}]
- BEAM process RES memory: 3.807 GB
- Erlang memory:
[{total,2.9233806803822517},
{processes,2.277688652276993},
{processes_used,2.277618482708931},
{system,0.6456920281052589},
{atom,4.000673070549965e-4},
{atom_used,3.846092149615288e-4},
{binary,0.48407071083784103},
{code,0.009268132038414478},
{ets,0.004808835685253143}]
- BEAM process RES memory: 4.026 GB
- Erlang memory:
[{total,2.8762372359633446},
{processes,2.100425034761429},
{processes_used,2.1003194376826286},
{system,0.7758122012019157},
{atom,4.000673070549965e-4},
{atom_used,3.846092149615288e-4},
{binary,0.6143399104475975},
{code,0.009268132038414478},
{ets,0.004808835685253143}]
- BEAM process RES memory: 4.136 GB
- Erlang memory:
[{total,2.9030912443995476},
{processes,2.028559662401676},
{processes_used,2.0283572375774384},
{system,0.8745315819978714},
{atom,4.000673070549965e-4},
{atom_used,3.847004845738411e-4},
{binary,0.7129654437303543},
{code,0.00929991528391838},
{ets,0.004809550940990448}]
- BEAM process RES memory: 4.222 GB
- Erlang memory:
[{total,2.785604253411293},
{processes,1.875294029712677},
{processes_used,1.8752291351556778},
{system,0.910310223698616},
{atom,4.000673070549965e-4},
{atom_used,3.847004845738411e-4},
{binary,0.7487552836537361},
{code,0.00929991528391838},
{ets,0.004809550940990448}]
As you can see, at the beginning both the BEAM RES memory and the total
Erlang memory increase, but after a while it becomes clear that the BEAM
process memory keeps increasing while the memory reported as used by Erlang
stabilizes, and even decreases.
The Erlang-reported memory never surpasses 3 GB.
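For the record, both numbers can be read side by side from the shell with
something like this (a rough sketch that assumes Linux, since it parses
/proc):

RssVsErlang = fun() ->
    GB = math:pow(1024, 3),
    %% OS-reported resident set size of this very emulator
    {ok, Status} = file:read_file("/proc/" ++ os:getpid() ++ "/status"),
    {match, [KB]} = re:run(Status, "VmRSS:\\s+(\\d+) kB",
                           [{capture, all_but_first, list}]),
    {os_rss_gb, list_to_integer(KB) * 1024 / GB,
     erlang_total_gb, erlang:memory(total) / GB}
end.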
At this point I tried forcing a garbage collection on every process:
[erlang:garbage_collect(Pid) || Pid <- processes()]
After that, we went back to:
- BEAM process RES memory: 3.336 GB
- Erlang memory:
[{total,1.9107630401849747},
{processes,1.5669479593634605},
{processes_used,1.5668926388025284},
{system,0.34381508082151413},
{atom,4.000673070549965e-4},
{atom_used,3.847004845738411e-4},
{binary,0.18235664814710617},
{code,0.00929991528391838},
{ets,0.004809550940990448}]
However, after that I let the system run and it kept showing the same
behavior (with the BEAM memory still increasing).
What puzzles me is that you can clearly see that:
- The total memory used by processes is increasing, yet the top processes
always use the same amount of memory (and the process count is always
stable).
- Binary consumption also increases, but in proportion to process memory
(and my data is <64K, so I don't anticipate this being an issue of
refc binaries not being garbage collected).
I already hibernate most of the long-term open connections.
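To be concrete, by "hibernate" I mean the connection handlers return the
hibernate hint from their gen_server callbacks, roughly like this (a
simplified sketch: the module name, the relay/2 helper and the state are
made up for illustration, it is not my actual code):

-module(conn_sketch).
-behaviour(gen_server).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2,
         terminate/2, code_change/3]).

init(Socket) ->
    {ok, Socket, hibernate}.

handle_info({tcp, _Socket, Data}, State) ->
    relay(Data, State),                 %% hand the data to the paired connection
    {noreply, State, hibernate};        %% drop back to a minimal heap while idle
handle_info(_Other, State) ->
    {noreply, State, hibernate}.

handle_call(_Req, _From, State) -> {reply, ok, State}.
handle_cast(_Msg, State) -> {noreply, State}.
terminate(_Reason, _State) -> ok.
code_change(_OldVsn, State, _Extra) -> {ok, State}.

relay(_Data, _State) -> ok.             %% placeholder for the real relay logic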
I also added a periodic garbage collector on the main router: since it
touches all the binaries that go through it, this ensures that any refc
binaries the router still holds references to get collected.
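The periodic GC boils down to something like this (the registered name
my_router and the 30-second interval are made up for the example):

spawn(fun GcLoop() ->
          %% force a full collection of the router's heap every 30 seconds
          catch erlang:garbage_collect(whereis(my_router)),
          timer:sleep(30000),
          GcLoop()
      end).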
So I tried the hard approach and set fullsweep_after to 0 as a system
flag (passed in via the environment variable -env ERL_FULLSWEEP_AFTER 0).
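For completeness, that is the system-wide flag; the same option could be
set per process via spawn_opt instead (connection_loop/0 is a made-up
function name, just to illustrate):

%% system-wide, as above:
%%   erl -env ERL_FULLSWEEP_AFTER 0 ...
%% or per process, only where it matters:
spawn_opt(fun() -> connection_loop() end, [{fullsweep_after, 0}]).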
After this, I could see notable improvements:
- BEAM process RES memory: 2.049 GB
- Erlang memory:
[{total,1.597476489841938},
{processes,1.2037805244326591},
{processes_used,1.2036690935492516},
{system,0.39369596540927887},
{atom,4.000673070549965e-4},
{atom_used,3.846092149615288e-4},
{binary,0.2321353331208229},
{code,0.009268132038414478},
{ets,0.004821933805942535}]
- BEAM process RES memory: 1.919 GB
- Erlang memory:
[{total,1.549286112189293},
{processes,1.1740453317761421},
{processes_used,1.1739420965313911},
{system,0.3752407804131508},
{atom,4.000673070549965e-4},
{atom_used,3.846092149615288e-4},
{binary,0.2134672999382019},
{code,0.009268132038414478},
{ets,0.004821933805942535}]
- BEAM process RES memory: 2.004 GB
- Erlang memory:
[{total,1.6023957282304764},
{processes,1.2192133665084839},
{processes_used,1.219102293252945},
{system,0.3831823617219925},
{atom,4.000673070549965e-4},
{atom_used,3.846092149615288e-4},
{binary,0.22155668586492538},
{code,0.009268132038414478},
{ets,0.004821933805942535}]
- BEAM process RES memory: 2.456 GB
- Erlang memory:
[{total,1.7860298827290535},
{processes,1.4158401936292648},
{processes_used,1.4157484397292137},
{system,0.37018968909978867},
{atom,4.000673070549965e-4},
{atom_used,3.846092149615288e-4},
{binary,0.20867645740509033},
{code,0.009268132038414478},
{ets,0.004821933805942535}]
- BEAM process RES memory: 2.455 GB
- Erlang memory:
[{total,1.8919306173920631},
{processes,1.4726912006735802},
{processes_used,1.4726523533463478},
{system,0.41923941671848297},
{atom,4.000673070549965e-4},
{atom_used,3.846092149615288e-4},
{binary,0.25766071677207947},
{code,0.009268132038414478},
{ets,0.004821933805942535}]
However, the downside to this is obviously that the CPU load increased by
almost a full point.
I also have a GC "guardian" similar to the one that Fred implemented in
Heroku's logplex:
https://github.com/heroku/logplex/blob/master/src/logplex_leak.erl
But this is obviously a safeguard, not a solution per se.
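For context, the guardian boils down to roughly the following (a simplified
sketch; the 3 GB threshold and the 10-second interval are made up, see
logplex_leak for the real thing):

spawn(fun Guard() ->
          ThresholdBytes = 3 * 1024 * 1024 * 1024,
          case erlang:memory(total) > ThresholdBytes of
              %% above the threshold: force a collection on every process
              true  -> [erlang:garbage_collect(P) || P <- processes()];
              false -> ok
          end,
          timer:sleep(10000),
          Guard()
      end).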
Can anyone give me some pointers on how I can proceed to identify what is
going on?
Thank you,
r.