[erlang-questions] Process heap inspector

Tue Nov 29 08:28:53 CET 2011

On Nov 28, 2011, at 5:29 PM, Witold Baryluk wrote:

> On 11-28 09:23, Paul Davis wrote:
>> On Mon, Nov 28, 2011 at 7:55 AM, Kostis Sagonas <kostis@REDACTED> wrote:
>>> On 11/28/2011 08:39 AM, Michal Ptaszek wrote:
>>>> 
>>>> Hi everyone,
>>>> 
>>>> This idea was born in my mind when debugging a complex, live system
>>>> and trying to figure out where all my memory went.
>>>> 
>>>> So, when debugging a live system, investigating suspicious memory
>>>> consumption patterns, or simply trying to better understand what's
>>>> going on with our processes, it might be useful to take a peek at
>>>> the data a given process operates on.
>>>> 
>>>> ...
>>>> 
>>>> The implementation is rather simple: if the process we probe is not the
>>>> caller (i.e. we are not doing erlang:inspect_heap(self())), the data is
>>>> copied from the callee heap to the caller heap (to avoid having
>>>> cross-process references in variables), then we compute flat size of the
>>>> each term we moved. The rootset is also included in the summary
>>>> (i.e. process dict, seq tokens, etc.).
>>>> 
>>>> Code is included in my inspect_heap OTP branch on:
>>>>  github: https://github.com/paulgray/otp/tree/inspect_heap
>>>> 
>>>> I am still a little bit hesitant about suspending the process we probe:
>>>> can anyone tell me if acquiring the main process lock would be enough to
>>>> keep its heap untouched during the call?
>>>> 
>>>> Please do point out any bugs and tell me what you think about the idea.
>>> 
>>> I can see that this may be handy to have in some situations, but provided I
>>> understand what is happening at the implementation level (disclaimer: I have
>>> not looked at the implementation), I think it's actually a pretty bad idea
>>> to include in a non debug-enabled runtime system.
>>> 
>>> The reason is that this breaks all assumptions/invariants of the runtime
>>> system in that Erlang processes are independent and can be scheduled to
>>> execute concurrently on an SMP without being preempted by anything other
>>> than exhausting their reduction step count or being stuck on some receive.
>>> With this "built-in feature" processes need to be able to stop at more or
>>> less any random point and stay suspended for an indefinite amount of time
>>> based on code that _another_ process is executing.
>>> 
>> 
>> Bit confused, but wouldn't this objection also apply to
>> erlang:suspend_process/2 [1] as well? I use this quite often in
>> production on long lived processes that are chewing up resources. It's
>> quite the handy tool in certain cases.
>> 
>> [1] http://erlang.org/doc/man/erlang.html#suspend_process-2
>> 
> 
> I think the problem with such a feature is that it breaks the soft
> real-time behaviour and preemptibility of all Erlang processes. By creating
> and calling such a BIF you essentially make it impossible to schedule other
> processes if you have a single scheduler and a single CPU.
> Most long-running BIFs run in separate async threads or are written
> in such a way that they can be stopped at any reasonable point
> and continued later. This way a long-running BIF is broken
> into some (maybe large) number of incremental steps, each one bringing
> you closer to the result, and at each transition you can choose
> to perform the next step or go back to the scheduler (due to reduction
> exhaustion) and be scheduled later to continue these steps...
> 
> This is, for example, the situation in the re module (regular expression)
> BIFs, or even in simple ones like length/1.
> 
> So unless such a BIF is written in a preemptible way, it should not be
> included in the non-debug build.
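
The "incremental steps" idea described above can be sketched at the Erlang level. This is only an analogy: real BIFs are written in C and hand control back to the scheduler from inside the emulator, whereas this sketch uses erlang:yield/0. The module name chunked_len and the chunk size of 1000 are hypothetical, chosen for illustration.

```erlang
-module(chunked_len).
-export([len/1]).

%% Hypothetical step budget: elements processed before yielding.
-define(CHUNK, 1000).

%% Count the elements of a list in bounded steps.
len(List) ->
    len(List, 0, ?CHUNK).

%% Done: return the accumulated count.
len([], Acc, _Budget) ->
    Acc;
%% Budget exhausted: let other processes run, then continue
%% from where we stopped with a fresh budget.
len(List, Acc, 0) ->
    erlang:yield(),
    len(List, Acc, ?CHUNK);
%% Normal step: consume one element and one unit of budget.
len([_ | Tail], Acc, Budget) ->
    len(Tail, Acc + 1, Budget - 1).
```

The state needed to continue (remaining list and running count) is carried in the function arguments, which is what makes it safe to stop between steps.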

I'm sorry, but I disagree. In this case one process operates on the heap
of another process. If we allow the caller to be preempted, we have two
ways to go:
a) resume the callee and risk GC or new terms being stored on its heap
b) leave the callee suspended and risk the caller being terminated (I assume
we can be killed, e.g. by exit(Pid, kill) from any other process in the
system?). If so, the callee will never be brought back to life and will
remain suspended for good. Plus, if we allow the caller process to be
preempted, the time required for the callee to be woken up grows again.
Also, if we take a look at the debug call process_info(Pid, messages): it
also does not implement any kind of process interleaving, even if the
message queue is extremely large (and on non-SMP VMs it can take a while).
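
For comparison, here is a minimal sketch of the suspend/inspect/resume pattern using only BIFs that already exist (erlang:suspend_process/1, erlang:process_info/2, erlang:resume_process/1). It is not the proposed erlang:inspect_heap; the module name peek and the function snapshot/1 are hypothetical.

```erlang
-module(peek).
-export([snapshot/1]).

%% Suspend Pid, read a few summary items, and always resume it.
%% Must not be called on self(): suspend_process/1 rejects that.
snapshot(Pid) when is_pid(Pid), Pid =/= self() ->
    true = erlang:suspend_process(Pid),
    try
        %% The target cannot run (and so cannot allocate) while suspended.
        erlang:process_info(Pid, [heap_size, total_heap_size,
                                  message_queue_len, dictionary])
    after
        %% Runs on normal return and on exceptions, but not if this
        %% process is killed with exit(Pid, kill) in the meantime.
        erlang:resume_process(Pid)
    end.
```

The try ... after covers ordinary failures in the caller; it offers no protection against an untrappable kill, which is the case b) above is concerned with.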

As was pointed out, debug tools are used for debugging, and thus should be
operated with knowledge of the consequences of the call. We have
init:stop/0 in the API, but no one complains that someone might apply it by
accident on a live system.

On Nov 28, 2011, at 2:55 PM, Kostis Sagonas wrote:
> I am also concerned about how/whether sharing of subterms is preserved or not when doing the copying. (Based on the phrasing that "then we compute flat size of the each term we moved", I suspect the answer is no.)  Why is this useful?  You may end up with an arbitrarily bigger heap in the caller than the one that the callee currently has. Call me unimaginative but I do not really see why you would want that...

Right, that's yet another thing I must work on: thank you for the hint!
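
To illustrate the sharing concern, here is a small sketch using erts_debug:size/1 (which counts shared subterms once) and erts_debug:flat_size/1 (which counts them as if every occurrence were a separate copy). The module name sharing_demo is hypothetical.

```erlang
-module(sharing_demo).
-export([sizes/0]).

%% Build a tuple whose three elements all point at the same list,
%% then report its size in words with and without sharing.
sizes() ->
    Big = lists:seq(1, 1000),
    Shared = {Big, Big, Big},
    {erts_debug:size(Shared), erts_debug:flat_size(Shared)}.
```

If the copy to the caller's heap does not preserve sharing, the second number is closer to what the caller ends up holding, which is how the copy can grow larger than the heap it was taken from.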

Kind regards,
Michal Ptaszek