[erlang-questions] Overlong wait in receive after

Robert Raschke rtrlists@REDACTED
Mon Feb 22 14:31:57 CET 2010


On Fri, Feb 19, 2010 at 7:55 PM, Michael <erlangy@REDACTED> wrote:

> On Fri, Feb 19, 2010 at 05:33:36PM +0000, Robert Raschke wrote:
> > Hi,
> >
> > has anyone encountered spurious overlong waits using receive ... after? That
> > is, the code says "after 1000 ->" but it's way longer than 1 second before
> > it gets there.
> >
> > My code starts a Java jinterface node, the Java code creates the node,
> > creates an mbox, and waits a minute for a message to appear and sends one
> > back.
> >
> > The Erlang code starts the Java as a port program, and starts sending ok
> > messages to the registered mbox of the Java program spaced 1 second apart.
> > My logging indicates that the sending of the messages is actually spaced
> > somewhere between 5 and 12 seconds apart!
> >
> > This is on a Windows 2003 Server running R12B-5.
> >
> > In Erlang (started with -sname "erl@REDACTED") this is roughly what's
> > happening (apologies for typos):
> >
> > run() ->
> >     Node_Name = "foo@REDACTED",
> >     Mbox = {box, list_to_atom(Node_Name)},
> >     Port_Pid = spawn_link(?MODULE, run_port_program,
> >                           [Node_Name, atom_to_list(erlang:get_cookie())]),
> >     {ok, Java_Pid} = shake_hands(Mbox, 0).
> >
> > shake_hands(Mbox, N) when N < 50 ->
> >     error_logger:info_report([{module, ?MODULE}, {handshake, N},
> >                               {mbox, Mbox}]),
> >     Mbox ! {self(), ok},
> >     receive
> >         {ok, Pid} ->
> >             {ok, Pid}
> >     after 1000 ->
> >         error_logger:error_report([{module, ?MODULE},
> >                                    {'handshake timeout', N}]),
> >         shake_hands(Mbox, N+1)
> >     end;
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
>  Seems likely that messages other than {ok, Pid} are coming in
>  and growing the message queue.
>
>  try adding a catchall to check ...
>
>
> >     receive
> >         {ok, Pid} ->
> >             {ok, Pid}
>
>
>       ; _Other -> io:fwrite("Catchall: ~p~n", [_Other])
>
>
> >     after 1000 ->
>
>
>  then, presuming it's junk, you can just throw the spurious msgs out
>
>
> ~Michael
>
>

I am currently trying to narrow down where the time is going, and I have a
feeling that the delay is actually in the send operation, not the receive.
I have sent more extensive logging out to the installation, but will have to
wait until I hear back.
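
To make that concrete, here is a minimal sketch of the kind of timing I mean.
It is not the code that is deployed: the module name handshake_timing and the
functions timed_handshake/1 and wait_reply/1 are made up for illustration.
It assumes timer:tc/3 runs the call in the calling process, so the reply
still arrives in the right mailbox. If the send figure is the large one, the
time is probably going into setting up the connection to the Java node
rather than into the receive timeout.

```erlang
-module(handshake_timing).
-export([timed_handshake/1, wait_reply/1]).

%% Hypothetical helper: one handshake attempt, with the send and the
%% receive timed separately. timer:tc/3 returns {Microseconds, Result}.
timed_handshake(Mbox) ->
    %% erlang:send/2 is the function form of Mbox ! Msg.
    {Send_Us, _} = timer:tc(erlang, send, [Mbox, {self(), ok}]),
    {Recv_Us, Reply} = timer:tc(?MODULE, wait_reply, [1000]),
    error_logger:info_report([{module, ?MODULE},
                              {send_ms, Send_Us div 1000},
                              {receive_ms, Recv_Us div 1000},
                              {reply, Reply}]),
    {Send_Us, Recv_Us, Reply}.

%% Same receive as in shake_hands/2, pulled out so that it can be timed.
%% Returns the atom timeout instead of retrying.
wait_reply(Timeout) ->
    receive
        {ok, Pid} ->
            {ok, Pid}
    after Timeout ->
        timeout
    end.
```

Calling handshake_timing:timed_handshake(Mbox) in place of one round of
shake_hands/2 would log both durations for that attempt.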

Incidentally, I have just received another report of this sequence failing
at a different location. The software had been in place for several months
and, after the weekend, is now exhibiting the same timeout behaviour. One
additional bit of information: it looks like epmd.exe is being killed
spuriously, though not consistently.
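
For anyone wanting to check the epmd side on their own machine, something
along these lines should do as a rough probe. The module name epmd_check and
the function check/0 are made up for illustration; the only real call is
net_adm:names/0, which asks the local epmd for its registered nodes and, as
far as I know, returns {error, address} when epmd is not answering.

```erlang
-module(epmd_check).
-export([check/0]).

%% Hypothetical helper: reports whether the local epmd answers a names
%% query, and which nodes it currently has registered.
check() ->
    case net_adm:names() of
        {ok, Names} ->
            %% Names is a list of {NodeName, Port} tuples.
            {epmd_up, Names};
        {error, Reason} ->
            {epmd_down, Reason}
    end.
```

Running epmd_check:check() before each handshake attempt and logging the
result would show whether the timeouts coincide with epmd going away.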

Is anyone out there aware of any MS patches, virus DB updates, etc. that may
be suddenly impacting Erlang communications?

Thanks for any pointers,
Robby

