[erlang-questions] Are crash dumps collecting my garbage?
Fred Hebert
mononcqc@REDACTED
Thu Jul 18 18:25:40 CEST 2013
Hi everyone.
I was investigating a refc binary leak in a couple of production nodes,
and had a few questions regarding Erlang crash dumps following some
discoveries I made.
tl;dr: is the Erlang VM garbage-collecting refc binary references
when generating the crash dump or am I not reading the dumps
right?
--
First of all, I diagnosed the refc binary leak problem by using a
function a bit like the following:
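    %% Forget any previous shell binding for MostLeaky.
    f(MostLeaky).
    MostLeaky = fun(N) ->
        lists:sublist(
            %% Sort ascending by delta, so the processes that released
            %% the most binary references during the GC come first.
            lists:usort(
                fun({K1,V1},{K2,V2}) -> {V1,K1} =< {V2,K2} end,
                [try
                     {_,Pre} = erlang:process_info(Pid, binary),
                     erlang:garbage_collect(Pid),
                     {_,Post} = erlang:process_info(Pid, binary),
                     {Pid, length(Post)-length(Pre)}
                 catch
                     %% e.g. the process died in the meantime
                     _:_ -> {Pid, 0}
                 end || Pid <- processes()]),
            N)
    end.
    %% Pairs = MostLeaky(25).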
I ran this function on a production node, using the 'binary' option of
process_info/2 to get the list of binaries referenced by the process. I
quickly found I had processes leaking dozens to hundreds of thousands of
them.
I took a similar node, running the same code at the same time in the
same cluster and under a similar load and let it crash dump once it got
out of memory.
I noticed that the fields `=proc_heap:Pid` in the crash dump often
referred to binaries (in `=binary:Id` fields) by using
`<Hex>:Yc<Id>:Hex` and decided to write a script[1] to find what
processes were hogging refc binary references when the node died.
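To illustrate the kind of counting involved, here is a minimal Erlang sketch. It is not the awk script from [1]. The module name `dump_refs`, the function `count/1`, and the assumption that every refc binary reference appears as `:Yc` on a line inside a `=proc_heap:Pid` section are hypothetical, based only on the `<Hex>:Yc<Id>:Hex` pattern described above.

```erlang
-module(dump_refs).
-export([count/1]).

%% Count ":Yc" (refc binary) references per process.
%% Lines is a list of binaries, one per crash dump line.
%% Returns a map of Pid (as a binary) to reference count.
count(Lines) ->
    {_, Counts} = lists:foldl(fun scan/2, {none, #{}}, Lines),
    Counts.

%% A "=proc_heap:Pid" header starts a process heap section.
scan(<<"=proc_heap:", Pid/binary>>, {_, Acc}) ->
    {Pid, Acc};
%% Any other "=tag" header ends the current heap section.
scan(<<"=", _/binary>>, {_, Acc}) ->
    {none, Acc};
%% Lines outside a heap section are ignored.
scan(_Line, {none, Acc}) ->
    {none, Acc};
%% Inside a heap section: count the ":Yc" markers on the line
%% (assumed marker format, see the note above).
scan(Line, {Pid, Acc}) ->
    N = length(binary:matches(Line, <<":Yc">>)),
    {Pid, maps:update_with(Pid, fun(C) -> C + N end, N, Acc)}.
```

The lines could come from `file:read_file/1` followed by `binary:split(Bin, <<"\n">>, [global])`, although a multi-gigabyte dump would be better read in chunks.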
However, I found out that this script gave me output that made it look
like there was no actual refc leak (highest counts were fully
reasonable), and there was a huge discrepancy in the values returned by
the `=memory` field and the individual binary sizes (calculated by using
the prefix in `HexPrefix:HexBinary` in each `=binary` field). The
results I had, for example, were:
binary: 3360361264
binary-memory-counted: 138502020
I expect some difference due to how binaries are stored, with the
refcounts, fragmentation and everything, but I do not expect this to
cause a ratio of 24:1 between the two numbers I counted, which
would be mighty scary.
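For reference, the arithmetic behind that ratio can be checked in an Erlang shell. The variable names `Reported` and `Counted` are only labels chosen for this sketch; the numbers are the two figures quoted above.

```erlang
%% Figures quoted above, in bytes.
Reported = 3360361264.   % "binary" value from the =memory section
Counted  = 138502020.    % sum of the sizes in the =binary sections
Reported / Counted.      % about 24.26
Reported - Counted.      % 3221859244 bytes not accounted for
```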
Have I made some kind of mistake in my script or in my understanding of
the crash dump when counting refs, or is the VM really omitting the
garbage-collectable references to refc binaries when dumping?
Regards,
Fred.
[1]: https://gist.github.com/ferd/6030174 (awk to show biggest processes
hogging references, and refc binaries with their respective refc count)