Request for next net_kernel release: new switch "net_setuptime_millis"

Fri Jan 9 16:03:14 CET 2004

Hi,

Well, the reason for the 7s delay is simply that a suspended erlang node 
is still in some sense "alive". The listen socket is still listening and 
it's still registered with its local epmd. On the other hand, if the
node is completely halted, epmd has it unregistered and connection 
attempts fail instantly. Here is what happens for a suspended node (the
suspended node is named b@REDACTED and is connected to from a@REDACTED):

a@REDACTED asks epmd on host2 for b@REDACTED. The request is answered
immediately; the answer is a port number.

a@REDACTED connects to the given port (which succeeds, since that is how
TCP/IP and the socket interface work).

a@REDACTED sends the initial handshake string.

a@REDACTED waits the stipulated timeout for the answer.

-> pang after 7 s

If the network cable is cut to host2, we would instead get:

a@REDACTED tries to connect to epmd on host2. The connection attempt fails 
after the stipulated timeout.

-> pang after 7 s

On the other hand, if the node has never been alive:

a@REDACTED issues a request to epmd, which immediately answers that there
is no such node.

-> pang after a few millis.
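
The three cases above can be told apart from the calling side by timing a
ping. The following is only a minimal sketch: the module name ping_timing
and the function timed_ping/1 are made up for illustration, while
timer:tc/3 and net_adm:ping/1 are the standard library calls.

```erlang
-module(ping_timing).
-export([timed_ping/1]).

%% Ping Node and return the result (pong | pang) together with the
%% elapsed wall-clock time in milliseconds.
%% Hypothetical helper, not part of OTP.
timed_ping(Node) when is_atom(Node) ->
    {Micros, Result} = timer:tc(net_adm, ping, [Node]),
    {Result, Micros div 1000}.
```

Calling ping_timing:timed_ping('b@host2') (the node name is a placeholder)
from a@REDACTED should give pang with an elapsed time close to the setup
timeout for a suspended node, and pang after only a few milliseconds for a
node that epmd does not know.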

So, it's not that the distribution in node A keeps any data. It's simply
that we run into a timeout when a node is suspended, regardless of whether
we have known the node before or not.

TCP/IP is unfortunately not a protocol made for realtime applications...
That's where all this long-timeout hassle comes from. A better network
protocol would make things much more fun for Erlang: no ticking, no
timing out, etc... A reliable protocol that monitors the links... a
protocol designed not for web browsing but for telecom... Sigh...
Timeouts shorter than 1 s could be useful, though, I agree. I'll add the
possibility to define fractions of a second in the configuration.
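
As a rough illustration of what accepting fractions of a second could look
like, here is a sketch modelled on the connecttime() excerpt quoted below.
It is an assumption about one possible shape, not the change that actually
went into net_kernel: the module name setuptime_sketch and the helper
to_millis/1 are hypothetical, and the guard uses is_number/1 so that both
integers and floats are accepted.

```erlang
-module(setuptime_sketch).
-export([connecttime/0, to_millis/1]).

-define(SETUPTIME, 7000).    % default (ms), as in the quoted excerpt
-define(MAX_SECONDS, 120).   % upper bound (s), as in the quoted excerpt

%% Read kernel's net_setuptime (in seconds, possibly fractional)
%% and return the setup timeout in milliseconds.
connecttime() ->
    to_millis(application:get_env(kernel, net_setuptime)).

%% Hypothetical helper: takes the result of application:get_env/2.
%% Valid values are converted to whole milliseconds (at least 1 ms);
%% anything else falls back to the default.
to_millis({ok, Time}) when is_number(Time), Time > 0, Time < ?MAX_SECONDS ->
    max(1, round(Time * 1000));
to_millis(_) ->
    ?SETUPTIME.
```

With something like this in place, a half-second timeout could be
requested by starting the node with "erl -kernel net_setuptime 0.5", and
existing integer settings would keep their current meaning.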

Cheers,
/Patrik

Reto Kramer wrote:

> I found where the 7s ping/net_connect delay comes from that I was 
> seeing when pinging (or multi_call'ing) a node whose connection had 
> timed out (see net_kernel excerpt below).
>
> I'm glad it's configurable (thanks for the foresight, whoever did 
> it!), however I need it smaller than 1s, which is the current minimum 
> I can set.
>
> Is there any chance that in the next release of the net_kernel we 
> could add a new (equally undocumented, i.e. unsupported) switch: 
> net_setuptime_millis? I suggest the net_setuptime switch dominates if 
> present to maintain backwards compatibility.
>
> Funnily enough, when the node in question (say c) never started in the 
> first place, a multi_call to [a,b,c] does not suffer from the 
> connection setup delay. It's only because a connection (that timed 
> out) had been set up initially that I see the kernel setuptime 
> timeout/delay.
>
> Perhaps someone could educate me here - it seems that perhaps not all 
> the data structures associated with the timed out connection are 
> cleaned up (i.e. behavior is different from when the connection has 
> never been created in the first place).
>
> Thanks,
> - Reto
>
> PS: in the meantime, I'll resort to removing the failed node from the 
> nodeset I multicall to, and then re-add it using a multicast based 
> discovery protocol (the latter I have anyway). It would be nice if my 
> app could be naive about all that and just blindly multi_call to a 
> node set.
>
> --------- net_kernel: ---------
> [...]
> %% Default connection setup timeout in milliseconds.
> %% This timeout is set for every distributed action during
> %% the connection setup.
> -define(SETUPTIME, 7000).
> [...]
> connecttime() ->
>     case application:get_env(kernel, net_setuptime) of
>         {ok, Time} when integer(Time), Time > 0, Time < 120 ->
>             Time * 1000;
>         _ ->
>             ?SETUPTIME
>     end.
> [...]
> --------- net_kernel: ---------
>
> ______________________
> An engineer can do for a dime what any fool can do for a dollar.
> -- unknown