[erlang-questions] Process heap inspector

Wed Nov 30 12:07:59 CET 2011

On Nov 29, 2011, at 7:56 PM, Witold Baryluk wrote:

> On 11-29 08:28, Michal Ptaszek wrote:
>> 
>>>> Bit confused, but wouldn't this objection also apply to
>>>> erlang:suspend_process/2 [1] as well? I use this quite often in
>>>> production on long lived processes that are chewing up resources. It's
>>>> quite the handy tool in certain cases.
>>>> 
>>>> [1] http://erlang.org/doc/man/erlang.html#suspend_process-2
>>>> 
>>> 
>>> I think the problem with such a feature is that it breaks the soft-realtime
>>> behaviour and preemptibility of all Erlang processes. By creating and calling
>>> such a BIF you essentially make it impossible to schedule other processes
>>> if you have a single scheduler and a single CPU.
>>> Most long-running BIFs run in separate async threads or are written
>>> in such a way that they can be stopped at any reasonable point
>>> and continued later. This way a long-running BIF is broken
>>> into some (maybe large) incremental steps, each one bringing
>>> you closer to the result, and at each transition you can choose
>>> to perform the next step or go back to the scheduler (due to reduction
>>> exhaustion) and be scheduled later to continue these steps...
>>> 
>>> This is, for example, the situation in the re module (regular expression)
>>> BIFs, or even simple ones like length/1.

I might be wrong, but I cannot see where length/1 has SMP
process interleaving implemented. To me it's just a simple loop
over the elements of the list, without any kind of pre-emption
(erl_bif_guard.c, length_1 function).

I agree regarding the re module. Still, 're' is used in the code and called
pretty extensively when working on textual data. No one should ever
think of putting erlang:process_info, erlang:inspect_heap or erlang:trace
in the code and relying on the return values (unless the code is used to
automate the debugging process rather than to implement the actual logic).

>>> 
>>> So unless such a BIF is written in a preemptible way, it should not be
>>> included in the non-debug build.
>> 
>> I'm sorry, but I disagree. In this case one process operates on the heap
>> of another process - if we let the caller be preempted we have two ways to
>> go:
>> a) resume the callee and risk GC/storing new terms on the heap
> 
> I was saying that it affects all OTHER processes.  We should not resume
> the target process, exactly to prevent storing anything on its heap or
> running GC on it. But we should be able to schedule OTHER processes, which
> are completely independent of both caller and callee.
> 
>> b) leave the callee suspended and risk the caller being terminated (I assume that
>> we can be e.g. killed by the exit(Pid, kill) BIF by any other process in the system?).
>> If so - the callee will never be brought back to life again and will remain suspended
>> for good. Plus, if we allow the caller process to be preempted, the time required for
>> the callee to be awakened grows again.

Actually I was wrong about resuming suspended processes on suspender
termination. It's done automatically, as the Process structure keeps a list of
the processes it suspended and resumes them on its exit. My argument is irrelevant here then.
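
The automatic resume can be observed from Erlang code. This is only a
minimal sketch: the module name suspend_demo and the 1000 ms timeout are
made up for illustration, and it relies on erlang:suspend_process/1
behaving as documented (it returns once the target is suspended).

```erlang
-module(suspend_demo).
-export([run/0]).

%% Suspend a target from a helper process, kill the helper, and check
%% whether the target starts answering messages again.
%% Returns {StatusWhileSuspended, resumed | still_suspended}.
run() ->
    Parent = self(),
    Target = spawn(fun loop/0),
    Suspender = spawn(fun() ->
        true = erlang:suspend_process(Target),
        Parent ! {suspended, self()},
        receive stop -> ok end
    end),
    receive {suspended, Suspender} -> ok end,
    {status, Status} = erlang:process_info(Target, status),
    %% Kill the suspender and wait until it is really gone.
    Ref = erlang:monitor(process, Suspender),
    exit(Suspender, kill),
    receive {'DOWN', Ref, process, Suspender, _} -> ok end,
    %% The ping just sits in the queue for as long as Target is suspended.
    Target ! {ping, self()},
    Outcome = receive pong -> resumed after 1000 -> still_suspended end,
    exit(Target, kill),
    {Status, Outcome}.

loop() ->
    receive {ping, From} -> From ! pong, loop() end.
```

If the second element comes back as resumed, the target was woken up by
the suspender's exit alone, without any explicit erlang:resume_process/1.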

> How about fixing the exit(Pid, kill) BIF? It needs to perform cleanup anyway
> (like processing linked processes), and if we are locking there is actually no
> way to kill a process which is doing heap inspection. If we are copying in
> an incremental way it is simple. Already copied data will be easily
> cleaned up/deallocated by the existing memory cleanup procedure and GC.
> 
> We just should not allow the target process to be scheduled or running if
> any other process is performing heap inspection on it.
> 
> We should, however, allow all other processes to continue operation, as long
> as they do not communicate with either of the two processes or call our BIF
> at the same time.

Agree.

> Only the message queue will then be affected, but we can solve this in three possible ways:
> 
> 1) block senders when they try to queue a message into the receiving process
>  - not really an option, it breaks asynchronous sending, and it
>    is essentially impossible in distributed Erlang
> 2) copy the message queue and ignore new messages queued after copying started.
> 3) do not copy at all (leave it to the process_info(Pid, messages) BIF).
> 
> In fact solution 3) (that is process_info(Pid, messages)) should use solution 2).

To be honest I am a little bit more biased towards 3). As you mentioned before,
it might be interesting to implement process_info(Pid, messages) in a safe way,
however this might require a little bit more work (+ might change the behavior
of the system as well).

Also, it all boils down to priorities: i.e. whether we want the system to reply
to our debug requests promptly or simply treat them as any other calls and not
favor them over anything else.

> In fact we do not even need to fully lock the queue. If we block
> scheduling of the target process, then it cannot dequeue anything from its
> message queue, and we should be happy with just returning how the queue
> looked at the moment we started copying (from the head to the current
> end element). If someone queues a new message into the target's message queue,
> it can be done without problems, as it will be added only at the end of the
> queue, and that is secured by a separate lock AFAIK. We can just stop copying
> the message queue when we hit the saved end element, or then retry locking and
> copy the new elements one more time (but not again, as that could indeed
> cause the copying to never end). This way it is all predictable,
> doesn't affect other processes, and ends in finite time.
> 
>> Also, if we take a look at the debug process_info(Pid, messages) - it also
>> does not implement any kind of process interleaving, even if the message 
>> queue is extremely large (and on non-SMP VMs can take a while). 
> 
> I think it can be improved. Because even in an SMP build, what will happen
> if multiple processes try to perform process_info(Pid, messages) on
> multiple other processes? Either it will all be scheduled on a single
> predetermined processor (which lets other processes be scheduled
> without problems on other CPUs, but may introduce a deadlock situation;
> I'm not sure about that, or about how serialization is done in this scenario),
> or each call will run on a separate CPU (which causes the same problem if we
> do not have enough CPUs).
> 
> In fact, if process_info(Pid, messages) were improved as presented in the
> previous paragraph, it could be used by heap_inspector, which currently does
> not return the message queue anyway.
> 
> 
> 
>> 
>> As it was pointed out, debug tools are used for debugging, and thus should be
>> operated with knowledge of what the consequences of the call are. We have
>> init:stop/0 in the API, but no one complains that someone might apply it by
>> accident on a live system.
> 
> Everybody knows what init:stop/0 does, and it is obvious what it will
> do.
> 
> IMHO it is possible to write such heap inspector functionality in
> a correct way, which will make it possible to run it correctly on a live
> system as well. Especially since not everybody will know that using
> such a debugging function may affect the whole system. Also, an inexperienced
> developer may be tempted to use it because he/she believes that the target
> process has a small heap and it will finish quickly. But there is no easy way
> to know this for sure in advance (one would essentially need to
> check various items like total_heap_size and message_queue_len,
> returned by process_info(Pid, ItemSpec), etc.).
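
A minimal sketch of the kind of pre-check described above. The module
name inspect_guard and the limit tuple are hypothetical; only
erlang:process_info/2 with the total_heap_size and message_queue_len
items is taken from the discussion.

```erlang
-module(inspect_guard).
-export([cheap_enough/2]).

%% True when the target's total heap size (in words) and its message
%% queue length are both within the limits supplied by the caller.
%% A dead process yields false.
cheap_enough(Pid, {MaxHeapWords, MaxQueueLen}) ->
    case erlang:process_info(Pid, [total_heap_size, message_queue_len]) of
        undefined ->
            false;
        [{total_heap_size, Words}, {message_queue_len, Len}] ->
            Words =< MaxHeapWords andalso Len =< MaxQueueLen
    end.
```

The numbers are only a snapshot: the target keeps running after the
call, so they may be stale by the time the inspection starts, and they
say nothing about how large the heap becomes once it is copied.
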
> 
> In fact the documentation of erlang:process_info/2 doesn't mention at all
> that it breaks Erlang's scheduling promises. There is a note for
> erlang:process_info/1 that it should only be used for debugging, but
> there is nothing about erlang:process_info/2, even when calling it to return
> messages.
> 
> """
>  Warning:
>            This  BIF  is intended for debugging only, use process_info/2
>              for all other purposes.
> """
> 
> So this needs fixing, as it clearly gives me the impression that calling
> process_info(Pid, messages | links | dictionary) is safe in some sense
> (I only mention these three ItemSpec items, because the rest are probably
> computable in constant time, so they will have negligible impact even if they
> use full process locking, or even full VM locking).
> 
> 
> 
> The existence of badly implemented BIFs is not a justification for creating
> more badly implemented BIFs. I'm not against helpful BIFs (well,
> actually in some sense I am, because they add substantially to the VM
> complexity), especially outside of OTP, but if one wants better
> integration and official support, IMHO they should be implemented
> properly.
> 
> And it is not hard; you just need an explicit stack which is
> used to track where in the process of copying we are. For copying a single
> term there is already functionality in the VM (used when copying a message to
> another process's message queue) and it is probably already written in
> a correct way (modulo the next paragraph, which is of small practical
> importance).
> 
> Another interesting problem is what to do with terms constructed by a process
> like this:
> 
>  A = {"just small list"},
>  B = {A,A,A,A,A,A,A,A,A,A,A,A,A,A,A,A},
>  C = {B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B},
>  D = {C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C},
>  E = {D,D,D,D,D,D,D,D,D,D,D,D,D,D,D,D},
>  F = {E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E},
>  G = {F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F},
>  337781283 = size(term_to_binary(G)).  % this actually crashes my beam with an eheap_alloc failure.
> 
> It can be a disaster when trying to copy them to another process
> in the normal way (using the existing term-copying functionality).
> (It is possible to copy them safely, maybe even quite fast
> when doing this in C, because it is in fact possible even in pure
> Erlang: http://erlang.org/pipermail/erlang-questions/2009-September/046452.html )
> And even checking process_info(Pid, heap_size) will not
> help the developer know whether it is safe to perform heap inspection.
> Such terms don't occur normally, and developers will know
> when to expect them. Another possibility is to have an additional
> option to the heap inspector which allows seeing only part of the heap
> (you will not inspect 300MB of heap manually anyway, so it is
> in most cases pointless to copy it all).
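
The blow-up can be measured without actually copying anything. A rough
sketch, assuming erts_debug:size/1 (words used with sharing preserved)
and erts_debug:flat_size/1 (words needed once sharing is lost) behave
as in the releases I have looked at; erts_debug is an internal module,
and sharing_demo is just an illustrative name.

```erlang
-module(sharing_demo).
-export([nested/1, sizes/1]).

%% Build Depth levels of 16-element tuples on top of one small tuple.
%% Each level refers to the same subterm 16 times, like A..G above.
%% The innermost term is built at runtime so that it lives on the
%% process heap.
nested(Depth) when is_integer(Depth), Depth >= 0 ->
    nest(Depth, {lists:seq(1, 15)}).

nest(0, Term) -> Term;
nest(N, Term) -> nest(N - 1, erlang:make_tuple(16, Term)).

%% Returns {WordsWithSharing, WordsWhenFlattened}.
sizes(Term) ->
    {erts_debug:size(Term), erts_debug:flat_size(Term)}.
```

For example, sharing_demo:sizes(sharing_demo:nested(3)) already shows a
large gap between the two numbers. The flattened figure grows roughly
16 times with each extra level, while the shared one grows by a single
tuple, so keep the depth small when trying this.
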
> 
> 
> Maybe there is something I missed which makes it impossible to write
> a heap inspector properly at all? If so, then well, I will
> accept reality. I just do not want to make the situation of Erlang's
> BIFs worse than it currently is, if possible.
> 
> Regards,
> Witek

Cheers, 
Michal