long ping delay after network partition/process suspension

Thu Jan 8 08:41:50 CET 2004

I've written a simple test that measures the time an Erlang ping takes 
among a set of nodes [using timer:tc(net_adm, ping, [Node])].
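
A minimal sketch of such a measurement follows. The module name 
ping_timer and the function measure/1 are made up for illustration and 
are not taken from the actual test; only timer:tc/3 and net_adm:ping/1 
are standard calls.

```erlang
-module(ping_timer).
-export([measure/1]).

%% Hypothetical helper: ping each node once and report
%% {Node, pong | pang, ElapsedMillis}.
measure(Nodes) ->
    [begin
         %% timer:tc/3 returns {ElapsedMicroseconds, ReturnValue}.
         {Micros, Result} = timer:tc(net_adm, ping, [Node]),
         {Node, Result, Micros div 1000}
     end || Node <- Nodes].
```

Calling ping_timer:measure([a@host1, b@host2]) (node names are 
placeholders) returns one tuple per node, so a slow "pang" shows up 
directly in the third element.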

My test setup either suspends a beam process (^Z) or disconnects the 
network to that node.  I see the ping hang until the net_tick timeout 
hits and the connection is removed.  This is all as expected, of 
course.

However, even after the connection is removed

=ERROR REPORT==== 7-Jan-2004::23:23:28 ===
** Node a@REDACTED not responding **
** Removing (timedout) connection **

I continue to see huge ping round-trip times to a failed node (some 7 
seconds).  My expectation is that I should receive a "pang" really 
quickly (i.e. within a few dozen milliseconds at most).

Tuning the net_tick time down to e.g. 4s (rather than the default 60s) 
does not help.  Note that my nodes do not use DNS names, so there should 
be no lag from attempts to resolve names either.  I've suspected epmd, 
but it seems that's not the culprit (I tuned the packet_timeout to no 
effect, and don't see any hanging resolution attempts with the -d flag).
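
For reference, one way to lower the tick time to 4s is sketched below. 
This assumes the setting in question is the kernel application 
parameter net_ticktime and that it is supplied through a config file; 
the file name sys.config is only an example.

```erlang
%% sys.config (example file name), loaded with: erl -config sys
%% Assumption: net_ticktime is the parameter being tuned; value in seconds.
[{kernel, [{net_ticktime, 4}]}].
```

The same value can be given on the command line as 
erl -kernel net_ticktime 4. All connected nodes should use the same 
tick time.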

Originally I hit this issue with a multi_call that includes nodes that 
can get partitioned away for some time.  In that case, the timeout on 
the gen_server:multi_call is not honored, and the call only seems to 
return after some half a dozen seconds when a connection has timed 
out.  Further multi_calls with the same node set continue to return 
only after several seconds, despite the timed-out connection.
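
The multi_call case can be timed in the same way. The sketch below 
assumes a registered gen_server Name on each node; the module mc_timer 
and the function timed_multi_call/4 are hypothetical names used only 
for illustration.

```erlang
-module(mc_timer).
-export([timed_multi_call/4]).

%% Hypothetical helper: run gen_server:multi_call/4 and report how long
%% it actually took, next to the timeout that was requested.
timed_multi_call(Nodes, Name, Request, TimeoutMs) ->
    {Micros, {Replies, BadNodes}} =
        timer:tc(gen_server, multi_call,
                 [Nodes, Name, Request, TimeoutMs]),
    {Micros div 1000, Replies, BadNodes}.
```

If the first element of the result is well above TimeoutMs while the 
partitioned node appears in BadNodes, the delay described above has 
been reproduced.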

Looking at the source I can only suspect that {reg_name, node} ! Msg 
hangs for a long time if it tries to send on a timed-out connection, 
but I refuse to believe this is so ;-)   If this is so anyway, can I 
tell erl not to wait too long for the socket connection (i.e. <7s)?

Can someone enlighten me on how to work around this issue (which 
parameters to tune)?

Thanks,
- Reto