[erlang-questions] VM leaking memory

Thu Jan 31 23:14:15 CET 2019

On 01/31, Frank Muller wrote:
>After adding a new feature to my app (running non-stop for 5 years), it
>started leaking memory in staging.
>
>Obviously, I’m suspecting this new feature. Command top shows RES going
>from 410m (during startup) to 6.2g in less than 12h.
>
>For stupid security reasons, it will take me weeks to be allowed to share
>collected statistics (from recon, entop) here, but I can share them in
>private if someone is willing to help.
>

I'd recommend checking things like:

- recon_alloc:memory(usage), to see if the ratio of used to allocated
  memory is high or very low; a low ratio can point towards memory
  fragmentation.
- in case there is fragmentation (or something that looks like it) 
  recon_alloc:fragmentation(current) will return lists of all the 
  various allocators and types, which should help point towards which 
  type of memory is causing issues
- if usage seems high, check recon_alloc:memory(allocated_types) to see
  if any allocator is much higher than the others; ETS, binary, or eheap
  will tend to point towards an ETS table, a refc binary leak, or some
  process gathering lots of memory
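
As a rough shell sketch of the three checks above (this assumes recon is
loaded on the node; the variable names are only illustrative):

```erlang
%% Ratio of used to allocated memory, between 0.0 and 1.0.
Usage = recon_alloc:memory(usage).

%% Per-allocator-instance details, worth reading if Usage looks low.
Frag = recon_alloc:fragmentation(current).

%% Allocated bytes per allocator type, largest first.
Types = recon_alloc:memory(allocated_types),
lists:reverse(lists:keysort(2, Types)).
```

What counts as "low" or "high" depends on the workload, so comparing a
reading taken shortly after startup with one taken a few hours later is
probably more telling than any single number.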

Based on this it might be possible to then orient towards other avenues 
without you having to share any numbers.

A quick check, if it's binary memory, is to call recon:bin_leak(10),
which will probe all processes for their binary usage, run a GC on all
of them, then run a probe again, and give you the 10 with the largest
gap. This can point to the processes that held the most dead memory.
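
In the shell that could look like the following sketch (assuming recon
is loaded; as far as I know each entry is a {Pid, Delta, Info} tuple,
where Delta is the change in the number of refc binaries the process
held across the GC):

```erlang
%% Forces a GC on every process, so expect a short pause on a big node.
Leaky = recon:bin_leak(10).

%% Keep only the pids, e.g. to inspect them further afterwards.
LeakyPids = [Pid || {Pid, _Delta, _Info} <- Leaky].
```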

There's an undocumented 'binary_memory' option that recon:info,
recon:proc_count, and recon:proc_window all support -- it's undocumented
because it might be expensive and not always safe to run. You can use it
to find which processes are holding the most binary memory; after a
call to bin_leak, this can let you know about the biggest users.
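
A minimal sketch of that, assuming recon is loaded and that the node can
afford the extra work (the count of 10 and the variable names are
arbitrary):

```erlang
%% Top 10 processes by binary memory referenced.
BinTop = recon:proc_count(binary_memory, 10).

%% Same attribute for a single process, here the first result.
[{TopPid, _Bytes, _Info} | _] = BinTop,
recon:info(TopPid, binary_memory).
```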

You can also use proc_count with:
- message_queue_len for large mailboxes
- memory for eheap usage

You can use the same values with proc_window to see who is currently 
allocating the most.
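
For instance (a sketch assuming recon is loaded; the count of 10 and the
5000 ms window are arbitrary choices):

```erlang
%% Largest mailboxes and largest process memory right now.
recon:proc_count(message_queue_len, 10).
recon:proc_count(memory, 10).

%% Largest growth in each attribute over a 5000 ms window.
recon:proc_window(message_queue_len, 10, 5000).
recon:proc_window(memory, 10, 5000).
```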

If ETS is taking a lot of space, calling ets:i() can show a bunch of
tables with content; you might have a runaway cache table or something
like that.
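
If the ets:i() listing is too long to eyeball, the tables can also be
sorted by size using only OTP calls. This is a sketch, not something
from recon; ets:info/2 reports memory in words, hence the conversion to
bytes:

```erlang
WordSize = erlang:system_info(wordsize),
%% Skip tables deleted between ets:all() and ets:info/2; in that
%% case ets:info/2 returns undefined instead of an integer.
Tabs = [{ets:info(T, memory), ets:info(T, size), ets:info(T, name)}
        || T <- ets:all()],
Live = [{Words * WordSize, Size, Name}
        || {Words, Size, Name} <- Tabs, is_integer(Words)],
%% Ten largest tables as {Bytes, NumberOfObjects, Name}.
Top = lists:sublist(lists:reverse(lists:sort(Live)), 10).
```

Running it twice a few minutes apart should show whether one table
keeps growing.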

Regards,
Fred.