[erlang-questions] Problem with Beam process not freeing memory

Tue Sep 17 17:03:53 CEST 2013

As far as I can see, erlang:garbage_collect/0,1 does do full sweeps [1].

   [1]:
https://github.com/erlang/otp/blob/maint/erts/emulator/beam/bif.c#L3770
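
A quick way to check this on a live node is to count the refc binary
references a process holds before and after an explicit collection. The
following is only a minimal sketch: the module and function names
(gc_check, binary_refs, collect_and_compare) are made up for illustration,
and it relies on erlang:process_info(Pid, binary) returning one entry per
binary reference held by the process.

```erlang
-module(gc_check).
-export([binary_refs/1, collect_and_compare/1]).

%% Number of refc binary references currently held by Pid.
%% Assumes process_info(Pid, binary) yields one list entry per reference.
binary_refs(Pid) ->
    case erlang:process_info(Pid, binary) of
        {binary, Refs} -> length(Refs);
        undefined -> 0   % process is no longer alive
    end.

%% Force a collection of Pid and report the reference count before and
%% after, so that stale references show up as a drop between the two.
collect_and_compare(Pid) ->
    Before = binary_refs(Pid),
    _ = erlang:garbage_collect(Pid),
    After = binary_refs(Pid),
    {Before, After}.
```

Calling gc_check:collect_and_compare(Pid) on a suspect process returns a
{Before, After} pair; a large drop suggests the process was holding on to
stale binary references that the explicit collection released.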

On Tue, Sep 17, 2013 at 4:53 PM, Robert Virding <
robert.virding@REDACTED> wrote:

> OK, I meant doing an explicit garbage collection and was under the
> impression that calling erlang:garbage_collect() actually did a full sweep
> of the process, which is why hibernating seemed a bit of an overkill.
>
> Robert
>
> ----- Original Message -----
> > From: "Fred Hebert" <mononcqc@REDACTED>
> > To: "Robert Virding" <robert.virding@REDACTED>
> > Cc: "Erlang Questions Mailing List" <erlang-questions@REDACTED>
> > Sent: Tuesday, 17 September, 2013 4:36:19 PM
> > Subject: Re: [erlang-questions] Problem with Beam process not freeing memory
> >
> > Hi Robert,
> >
> > I CC'd the mailing list on this post, because I felt it could be
> > interesting to share with everyone.
> >
> > To get into some more details, the difference between a
> > garbage-collection and hibernating is that hibernation forces a
> > full-sweep and does compaction work. It is more likely to actually
> > remove old refs to binaries.
> >
> > The tricky part about binary leaks in these cases is that if the process
> > you're garbage collecting holds some very long-lived references (or
> > takes a long while before enabling them), you will move the references
> > to the old heap, if my understanding is correct. Now if your process
> > that leaks resources is busy or bogged down by some task, there are
> > chances that manually calling GC at higher frequencies will force
> > short-lived references to the old heap.  Eventually, most of the
> > subsequent GCs are done for no good reason until there's a full sweep to
> > free the references, if my understanding is right.
> >
> > In comparison, some hibernations may turn out to be beneficial due to
> > how they do the full sweep, especially on less active processes, without
> > changing spawn_opt values or whatever.
> >
> > Both cases are not necessarily great, and I don't think there's one easy
> > way to deal with it.
> >
> > One thing I've had in mind to try for a while was to run a function a
> > bit like:
> >
> > %% ?THRESHOLD is a byte budget to be set with -define(THRESHOLD, ...).
> > -spec gc_count(non_neg_integer(), binary()) -> non_neg_integer().
> > gc_count(PreviousCounter, Bin) ->
> >     case byte_size(Bin) of
> >         N when N >= 64 -> % refc binary
> >             NewCount = N + PreviousCounter,
> >             case NewCount >= ?THRESHOLD of
> >                 true ->
> >                     erlang:garbage_collect(),
> >                     0;
> >                 false ->
> >                     NewCount
> >             end;
> >         N -> % heap binary
> >             PreviousCounter+N
> >     end.
> >
> > that could alternatively force some hibernation instead of GC'ing. This
> > one could basically track the size of all binaries seen manually and
> > force something when you go over a certain amount. It sucks, though,
> > because that's basically manually doing your collection, and it doesn't
> > mean that because you have seen a binary, it's ready to be GC'd. I've
> > thus avoided trying it in the real world for now.
> >
> > In practice, at Heroku, we've decided to go for a hybrid approach in
> > logplex. We force hibernation on some important events that interrupt
> > our work flow no matter what (such as a socket disconnection, or long
> > periods of time [seconds] without activity for a process), and have put
> > a workaround in place to force VM-wide GCs when we're reaching critical
> > amounts of memory:
> > https://github.com/heroku/logplex/blob/master/src/logplex_leak.erl
> >
> > The objective was to use global GC as a last measure in case individual
> > (unobtrusive) hibernates were not enough to save a node.
> >
> > This later prompted us to explore the allocators of the VM -- the
> > value used (erlang:memory(total)) didn't represent the OS-imposed limits
> > on the VM: nodes would be killed for running out of memory without first
> > having had the chance to run the global GC. This led to discovering
> > things about fragmentation and characterizing our workloads to pick
> > better allocation strategies that seem to work decently so far, so that
> > erlang:memory(total), for one, has the right values, and also that we
> > have a better time releasing allocated blocks of memory when most
> > binaries vanish.
> >
> > I hope we can remove both the artificial hibernation calls and the
> > workarounds to force some global GCs in the near future. Ideally, it
> > sounds like the VM should possibly do more when it comes to the weight
> > of refc binaries to individual processes' memory for GC, but I don't
> > have a good idea of how this should be done in practice without having
> > elephant-sized assumptions and holes in the solution without adding more
> > knobs to the VM to configure things. Plus I'd have no idea on how to
> > actually implement it.
> >
> > Regards,
> > Fred.
> >
> >
> > On 09/17, Robert Virding wrote:
> > > Hi Fred,
> > >
> > > You recommend hibernating a process. Do you think this is better than
> > > calling the garbage collector in a process? I have no idea but
> > > hibernating seems more drastic, especially if the process is "in use"?
> > >
> > > Robert
> > >
> > > ----- Original Message -----
> > > > From: "Fred Hebert" <mononcqc@REDACTED>
> > > > To: "Tino Breddin" <tino.breddin@REDACTED>
> > > > Cc: "Erlang Questions Mailing List" <erlang-questions@REDACTED>
> > > > Sent: Tuesday, 17 September, 2013 2:39:41 AM
> > > > Subject: Re: [erlang-questions] Problem with Beam process not freeing
> > > > memory
> > > >
> > > > I've recently run into similar issues and have received a bit of
> > > > help from Lukas Larsson, which I'm glad to pass on to you. Whatever
> > > > he taught me, I tried to put into the recon library, currently on a
> > > > different branch awaiting review and prone to change:
> > > > https://github.com/ferd/recon/tree/allocators
> > > >
> > > > 1. Checking for binary leaks, I recommend calling
> > > > `recon:bin_leak(N)` where `N` is the number of 'highest results' you
> > > > want. The function will take a snapshot of the number of binary refs
> > > > in your processes, then GC the node entirely, and then take another
> > > > snapshot, make a diff, and return the N biggest deltas in binaries.
> > > > This will let you know what processes hold the most references to
> > > > stale refc binaries. I recommend hibernation as a first way to try
> > > > and fix this if it is the problem.
> > > >
> > > > 2. Check the reported/allocated memory with `recon_alloc:memory(Arg)`
> > > > where `Arg` can be:
> > > >  - `used` for the memory actively used (i.e. erlang:memory(total))
> > > >  - `allocated` for the memory reserved by individual allocators (sum)
> > > >  - `usage` for the percentage.
> > > > If the result of `allocated` is close to what the OS reports, you
> > > > probably have fragmentation issues. If not, you may have a NIF or
> > > > driver that allocates data outside of the ERTS allocators.
> > > >
> > > > 3. Check individual allocator usage levels with
> > > > `recon_alloc:fragmentation(current)`. It will return usage
> > > > percentages for mbcs and sbcs. Mbcs are multiblock carriers and are
> > > > where data goes by default. When the data allocated is too large
> > > > (larger than the single block carrier threshold [sbct]), it goes into
> > > > its own single block carrier (sbc). Compare the results with what you
> > > > get with `recon_alloc:fragmentation(max)`. If the current values have
> > > > very low usage but the max ones have high usage, you may have
> > > > lingering data, possibly held in long-term references or whatever,
> > > > that blocks deallocation of specific carriers. Different carrier
> > > > strategies can help, which we can dive into if you see a problem with
> > > > this.
> > > >
> > > > Feel free to read the comments in `recon_alloc` until I actually
> > > > merge it in master, they contain some of the details about what to do
> > > > or look for.
> > > >
> > > > Lukas may want to correct me on the content of this post. I'm going
> > > > from the limited knowledge he transmitted to me here, or rather, my
> > > > limited understanding of it :)
> > > >
> > > > Regards,
> > > > Fred.
> > > >
> > > > On 09/16, Tino Breddin wrote:
> > > > > Hi list,
> > > > >
> > > > > I'm experiencing issues with a couple of Beam nodes where I see a
> > > > > huge gap between the node's reported memory usage and the
> > > > > underlying Linux Kernel's view.
> > > > >
> > > > > This is using R15B01.
> > > > >
> > > > > As a start an application in such a node stores a lot of tuples
> > > > > (containing atoms and binary data) in ETS tables. That proceeds
> > > > > until a point where memory usage is 70% (6GB) of the available
> > > > > memory. At that point erlang:memory() and top (or /proc/PID/status)
> > > > > agree roughly on the memory usage. Then an internal cleanup task is
> > > > > performed, which clears obsolete records from the ETS tables.
> > > > > Afterwards, erlang:memory() reports an expected low value of
> > > > > roughly 60MB memory usage. (This includes binary data). However,
> > > > > the kernel still reports the high memory usage values (both VmRss
> > > > > and VmTotal) for the node. The kernel's usage view will stay stable
> > > > > until the ETS tables are filled to a point where the real memory
> > > > > usage exceeds the kernel's view, then the kernel reported usage
> > > > > will grow as well.
> > > > >
> > > > > Now having checked the node in some detail I'm wondering what
> > > > > causes this difference between the BEAM's view and the Kernel's
> > > > > view on memory usage. I have 2 ideas which I'm checking right now.
> > > > >
> > > > > (1) Not GC'ed binaries: Could it be that binary data is not GC'ed
> > > > > because the original dispatcher process which it was passed through
> > > > > before being stored in an ETS table is still alive. Thus there is
> > > > > still some reference to it? However, this would not explain why
> > > > > erlang:memory() reports a very low value for used memory for
> > > > > binaries.
> > > > >
> > > > > (2) low-level memory leak: Some driver or NIF leaking memory, which
> > > > > would obviously not be reported by erlang:memory(). However, then
> > > > > it surprises me that the Kernel's view stays stable while the
> > > > > BEAM's actual memory usage is still below the Kernel's view. It
> > > > > should be continuously growing in this case imho.
> > > > >
> > > > > I'd appreciate if anyone has some more insight or experience with
> > > > > such a behaviour, while I'm further digging into this.
> > > > >
> > > > > Cheers,
> > > > > Tino
> > > >
> > > > > _______________________________________________
> > > > > erlang-questions mailing list
> > > > > erlang-questions@REDACTED
> > > > > http://erlang.org/mailman/listinfo/erlang-questions
> > > >
> >