[erlang-patches] gen_stream intermittent errors

Tue Aug 24 16:21:05 CEST 2010

On Mon, Aug 23, 2010 at 08:03:00AM -0700, jay@REDACTED wrote:
> >
> > The question is if the problem should be solved in the test suite
> > or in the {stop} handshake (there seems to be no handshake today)?
> 
> The {stop} message is internal to a single opaque gen_stream instance.  It
> is a cast message to the self-spawned and owned internal processes
> delivering the stream in parallel, and is sent on first detection of
> 'end_of_stream' internal to the gen_stream process.
> 
> >
> > By looping over the list orig_procs with is_process_alive/1 there
> > are inherent race conditions since you can not know if a process
> > is on the way down or not.
> >
> > You either write an explicit handshake where the process receiving
> > {stop} replies, or the process sending {stop} can use erlang:monitor/2
> > to ensure the stopped process has stopped.
> 
> These problems are merely test suite artifacts of my own creation in
> trying to make the test suite thorough enough to prove the code works.  I
> am just trying to show that when a gen_stream starts it will spin up the
> requested number of processes, and when the stream reaches completion, the
> processes are no longer running.  I could provide no such guarantee, and
> just say that when you stop the gen_stream all linked internal processes
> will die.

Ok.

> 
> >
> > I have not had time to dig into and understand the gen_stream code.
> > If there should be a handshake for {stop} and if it should be
> > through monitor depends on if one gen_stream can be stopped
> > from several others. It is a design question; how is a
> > gen_stream state defined and how can other gen_streams
> > know the state? Think distributed. If gen_streams are on
> > different nodes, what happens then?
> 
> The initial implementation of gen_stream is to provide multi-core benefits
> by reading a serial stream using multiple processes on the same node. The
> serial data is striped across the local processes so that a single logical
> buffer is generated by the user-requested number of processes internally.
> 
> When client callers do a gen_stream:next call, the next serial chunk of
> the stream is served from an internal process and the counter is
> incremented so that a different process will serve the next chunk.
> 
> When the last chunk of the entire stream is served, internally I cast a
> message to all internal processes to stop executing.  Continued calls to
> gen_stream:next just return 'end_of_stream' directly from the parent
> gen_server with complete ignorance as to whether the internal worker
> processes are still alive.  All internal processes are linked to the
> parent gen_server / gen_stream process, so they will get cleaned up if the
> parent is terminated.
> 
> I foolishly added a check to the test suite to see that once
> 'end_of_stream' is being returned, the internal processes should have
> terminated.  I wanted to prove there is no process leakage during the time
> between end_of_stream and when the parent process is terminated.

That is a very good test! I wish we had more of those.
We have process counting in some test suites, and if yours
leaves behind stray processes it could annoy later suites...

> 
> >
> > To solve this problem only in the test suite sounds like sweeping
> > a real problem under the rug, but I am not certain about it.
> > You can e.g. use erlang:monitor/2 to observe the processes going
> > down before calling gen_stream:proc_info/1, or delay and
> > repeat until it replies correctly.
> >
> 
> I really never even wanted to expose the proc_info interface (as there is
> no way to later hide it), since the user should neither access nor
> communicate with the internal processes, but I was intent on having
> thorough testing to get it through the review process.

I agree with your first approach here. Exposing internals just for testing
is seldom a good idea.

> 
> Should I leave this approach in, call proc_info and do a busy wait on the
> monitors for a few seconds?  Or should I remove the portion of the tests
> which count the number of processes internal to the opaque gen_stream
> process?

How about removing proc_info and using the known fact that the workers are
linked? Write a utility function in the test suite that calls
erlang:process_info(GenStream, links), filters out the parent of GenStream,
and checks whether all workers are gone, busy waiting with a one second
delay between attempts. Let the test suite timeout detect a failure.
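
A minimal sketch of such a utility function, assuming the test case knows
the pid that started GenStream (Parent below), that GenStream runs on the
local node, and that GenStream has no links other than that parent and its
workers. The module and function names are made up for illustration and are
not part of gen_stream.

```erlang
-module(gen_stream_test_util).
-export([linked_workers/2, wait_for_workers_gone/2]).

%% Return the pids linked to GenStream, excluding Parent (the process
%% that started GenStream) and any linked ports.
%% Assumption: every remaining linked pid is an internal worker.
linked_workers(GenStream, Parent) ->
    case erlang:process_info(GenStream, links) of
        {links, Links} ->
            [P || P <- Links, is_pid(P), P =/= Parent];
        undefined ->
            %% GenStream itself is dead, so nothing is linked to it.
            []
    end.

%% Poll once per second until no workers remain linked.
%% There is deliberately no retry limit here: if the workers never
%% go away, the test case timeout reports the failure.
wait_for_workers_gone(GenStream, Parent) ->
    case linked_workers(GenStream, Parent) of
        [] ->
            ok;
        _StillLinked ->
            timer:sleep(1000),
            wait_for_workers_gone(GenStream, Parent)
    end.
```

A test case would call gen_stream_test_util:wait_for_workers_gone(GenStream,
self()) once 'end_of_stream' has been returned. A link disappears from the
list when the linked process has exited, so this does not depend on
is_process_alive/1.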

> 
> jay
> 

-- 

/ Raimo Niskanen, Erlang/OTP, Ericsson AB