long ping delay after network partition/process suspension
Reto Kramer
kramer@REDACTED
Thu Jan 8 08:41:50 CET 2004
I've written a simple test that measures the time an Erlang ping takes
among a set of nodes [using timer:tc(net_adm, ping, [Node])].
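
A minimal sketch of such a measurement, not the actual test code. The module name ping_timer and the function time_pings/1 are hypothetical; timer:tc/3 and net_adm:ping/1 are the standard library calls.

```erlang
-module(ping_timer).
-export([time_pings/1]).

%% Time net_adm:ping/1 against each node in Nodes.
%% Returns a list of {Node, pong | pang, ElapsedMilliseconds}.
time_pings(Nodes) ->
    [begin
         %% timer:tc/3 returns {ElapsedMicroseconds, ReturnValue}.
         {Micros, Result} = timer:tc(net_adm, ping, [Node]),
         {Node, Result, Micros div 1000}
     end || Node <- Nodes].
```

Calling ping_timer:time_pings/1 with the node list repeatedly, before and after suspending a node, should make the difference in round trip times visible.
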
My test setup either suspends a beam process (^Z), or disconnects the
network to that node. I see the ping hang until the net_tick timeout
hits and the connection is removed due to the timeout. This is all
great of course.
However, even after the connection is removed
=ERROR REPORT==== 7-Jan-2004::23:23:28 ===
** Node a@REDACTED not responding **
** Removing (timedout) connection **
I continue to see huge ping round trip times to a failed node (some 7
seconds). My expectation is that I should receive a "pang" really
quickly (i.e. a few dozen millis max).
Tuning the net_tick time down to e.g. 4s (rather than the default 60s)
does not help. Note that my nodes do not use DNS names, so there's no lag
from any attempt to resolve names either. I've suspected epmd, but it
seems that's not the culprit (tuned the packet_timeout to no effect,
don't see any hanging resolution attempts with -d flag).
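
For reference, a sketch of how the tick time can be lowered through the kernel application's net_ticktime parameter. The file name sys.config is an assumption; the same setting can be passed on the command line as "erl -kernel net_ticktime 4".

```erlang
%% sys.config (hypothetical file name), loaded with: erl -config sys
%% net_ticktime is given in seconds; the default is 60.
[{kernel, [{net_ticktime, 4}]}].
```

All nodes in the cluster should use the same net_ticktime value.
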
Originally I hit this issue with a multi_call that includes nodes that
can get partitioned away for some time. In that case, the timeout on
the gen_server:multi_call is not getting honored and the call only
seems to return after some half a dozen seconds in case a connection
timed out. Further multi_calls with the same node set continue to
return only after seconds despite the timed out connection.
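
A minimal sketch of how the multi_call duration can be compared against the requested timeout. The module name multi_call_timer and the function timed_multi_call/4 are hypothetical; gen_server:multi_call/4 and timer:tc/3 are the standard library calls.

```erlang
-module(multi_call_timer).
-export([timed_multi_call/4]).

%% Wrap gen_server:multi_call/4 in timer:tc/3 so the wall-clock duration
%% can be compared with the requested Timeout (in milliseconds).
%% Returns {ElapsedMilliseconds, Replies, BadNodes}.
timed_multi_call(Nodes, Name, Request, Timeout) ->
    {Micros, {Replies, BadNodes}} =
        timer:tc(gen_server, multi_call, [Nodes, Name, Request, Timeout]),
    {Micros div 1000, Replies, BadNodes}.
```

If the elapsed time is well above Timeout whenever a partitioned node is in Nodes, the delay is being spent before the timeout on the call itself takes effect.
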
Looking at the source I can only suspect that {reg_name, node} ! Msg
hangs for a long time if it tries to send on a timed out connection,
but I refuse to believe this is so ;-) If this is so anyway, can I
tell erl to not wait for the socket connection too long (i.e. <7s)?
Can someone enlighten me how to work around this issue (what parameters
to tune)?
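
One candidate parameter, offered as an assumption and not as a confirmed fix: the kernel application documents net_setuptime, the maximum time in seconds allowed for each network operation during connection setup to another node, with a default of 7 seconds. That default is close to the delay described above. Whether it is the cause here, and whether the parameter is available in the release in use, has not been verified. The value 2 below is purely illustrative.

```erlang
%% sys.config (hypothetical file name), loaded with: erl -config sys
%% net_setuptime: seconds allowed per network operation during
%% connection setup (documented default: 7).
[{kernel, [{net_setuptime, 2}]}].
```
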
Thanks,
- Reto