[erlang-questions] running without net tick

Valentin Micic v@REDACTED
Fri Sep 25 12:04:20 CEST 2009


You may change the TICK value all day long, but if the underlying infrastructure
is in some kind of trouble, that alone is not going to solve the problem.

The following is just speculation, but quite plausible in my mind:

AFAIK, ERTS multiplexes inter-node traffic over a single socket. Thus,
if the socket is heavily utilized, the send buffer may get congested due
to a dynamically reduced TCP window size (because the remote side is not
draining its buffer fast enough; if the same process is reading and writing
the socket, this may cause a deadlock under heavy load). While I am not
certain about the particular implementation here, I know that the sender
will not wait forever: it will eventually time out, and this (exception?)
has to be handled somehow by the sender. The reasonable course of action
would be to reset the connection. If and when that happens, the node can be
declared unreachable; therefore a "net-split" may occur. In other words, a
net-split may occur with or without the "ticker" process running and
regardless of real network availability (*).
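
As an illustration of the "time out, then reset" behaviour described above,
here is a minimal user-level sketch with gen_tcp. It does not claim to show
what ERTS does on its distribution socket. The module and function names are
made up for the example; the send_timeout and send_timeout_close socket
options are standard inet options.

```erlang
-module(send_timeout_sketch).
-export([connect/3, send_or_reset/2]).

%% Hypothetical helper: open a TCP connection whose sends give up after
%% TimeoutMs instead of blocking forever on a congested send buffer.
connect(Host, Port, TimeoutMs) ->
    gen_tcp:connect(Host, Port,
                    [binary,
                     {active, false},
                     {send_timeout, TimeoutMs},
                     {send_timeout_close, true}]).

%% Send Data. On any send error (including a send timeout) close the
%% socket, so the caller can treat the peer as unreachable.
send_or_reset(Socket, Data) ->
    case gen_tcp:send(Socket, Data) of
        ok ->
            ok;
        {error, timeout} ->
            _ = gen_tcp:close(Socket),
            {error, connection_reset};
        {error, Reason} ->
            _ = gen_tcp:close(Socket),
            {error, Reason}
    end.
```

The point of the sketch is that the peer can be declared unreachable purely
because a send timed out, even though the network itself is fine.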


I think the net-tick method is good on its own; however, it is using the
*wrong* transport! IMO, the tick should be handled as out-of-band data, and
this cannot be done using TCP/IP (well, at least not at the user level). My
suggestion would be to use UDP for net-kernel communication (including TICK
messages). This way one would be able to find out about peer health more
reliably (yes, a small protocol may be required, but that's relatively
easy).

To make things simpler regarding the distribution, one may use the same port
number as advertised in EPMD for a particular node, and bind the UDP socket
to that number.
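
A rough sketch of what such a small UDP protocol could look like, using
gen_udp. Everything here is an assumption made for illustration: the module
name, the <<"TICK">>/<<"TOCK">> payloads, and the single-shot probe with no
retries. The port is passed in as a parameter; under the suggestion above it
would be the number the node advertises in EPMD.

```erlang
-module(udp_tick_sketch).
-export([start_responder/1, probe/3]).

%% Start a process that owns a UDP socket on Port (0 picks a free port)
%% and answers every <<"TICK">> datagram with <<"TOCK">>.
%% Returns {ok, Pid, ActualPort}.
start_responder(Port) ->
    Parent = self(),
    Pid = spawn(fun() ->
                    {ok, Sock} = gen_udp:open(Port, [binary, {active, true}]),
                    {ok, Bound} = inet:port(Sock),
                    Parent ! {self(), Bound},
                    responder_loop(Sock)
                end),
    receive
        {Pid, ActualPort} -> {ok, Pid, ActualPort}
    after 5000 ->
        {error, timeout}
    end.

responder_loop(Sock) ->
    receive
        {udp, Sock, Ip, PeerPort, <<"TICK">>} ->
            _ = gen_udp:send(Sock, Ip, PeerPort, <<"TOCK">>),
            responder_loop(Sock);
        stop ->
            gen_udp:close(Sock);
        _Other ->
            responder_loop(Sock)
    end.

%% Send one TICK to Host:Port and wait up to TimeoutMs for the TOCK.
%% Returns alive | unreachable.
probe(Host, Port, TimeoutMs) ->
    {ok, Sock} = gen_udp:open(0, [binary, {active, false}]),
    try
        ok = gen_udp:send(Sock, Host, Port, <<"TICK">>),
        case gen_udp:recv(Sock, 0, TimeoutMs) of
            {ok, {_Ip, _Port, <<"TOCK">>}} -> alive;
            {ok, _Unexpected}              -> unreachable;
            {error, _Reason}               -> unreachable
        end
    after
        gen_udp:close(Sock)
    end.
```

Since UDP may drop datagrams, a real protocol would need a few retries before
declaring the peer down; a single lost TICK should not count as a net-split.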

V/

(*) I've seen "net-splits" between nodes collocated on the same machine,
which points to a TCP buffer/load related issue. Perhaps the situation could
be improved by creating more than one connection between two nodes, but that
may come with a bag of problems of its own.


-----Original Message-----
From: erlang-questions@REDACTED [mailto:erlang-questions@REDACTED] On
Behalf Of Ulf Wiger
Sent: 25 September 2009 09:13 AM
To: erlang-questions Questions
Subject: [erlang-questions] running without net tick


The problem of netsplits in Erlang comes up now and again.
I've mentioned that we used to have a more robust
supervision algorithm for device processor monitoring in
AXD 301...

I read the following comment in kernel/src/dist_util.erl

%% Send a TICK to the other side.
%%
%% This will happen every 15 seconds (by default)
%% The idea here is that every 15 secs, we write a little
%% something on the connection if we haven't written anything for
%% the last 15 secs.
%% This will ensure that nodes that are not responding due to
%% hardware errors (Or being suspended by means of ^Z) will
%% be considered to be down. If we do not want to have this
%% we must start the net_kernel (in erlang) without its
%% ticker process, In that case this code will never run


...and thought: promising - it is then possible to experiment
with other tick algorithms?

However, looking at net_kernel.erl:

init({Name, LongOrShortNames, TickT}) ->
     process_flag(trap_exit,true),
     case init_node(Name, LongOrShortNames) of
         {ok, Node, Listeners} ->
             process_flag(priority, max),
             Ticktime = to_integer(TickT),
             Ticker = spawn_link(net_kernel, ticker, [self(), Ticktime]),

In other words, you can't set net_ticktime to anything other
than an integer (and it has to be a smallint, since it's used
in a receive ... after expression).

(To do justice to the comment above, couldn't a net_ticktime
of, say, 0 turn off net ticking altogether?)

What one can do then, is to set net_ticktime to a very large
number, and then run a user-level heartbeat. If netsplits are
still experienced without visible problems in the user-level
monitoring, or perhaps even serviced traffic during this
interval, then something is definitely wrong with the tick
algorithm. :)
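
A minimal sketch of such a user-level heartbeat, assuming the nodes are
started with a large tick time (for example "erl -kernel net_ticktime 3600").
The module name, the registered name hb_echo and the message shapes are all
made up for this example.

```erlang
-module(user_heartbeat_sketch).
-export([start_echo/0, check/2]).

%% Run this on the node to be monitored. It registers a small echo
%% process under the (hypothetical) name hb_echo.
start_echo() ->
    Pid = spawn(fun echo_loop/0),
    true = register(hb_echo, Pid),
    Pid.

echo_loop() ->
    receive
        {ping, From, Ref} ->
            From ! {pong, Ref},
            echo_loop();
        stop ->
            ok
    end.

%% Ping hb_echo on Node and wait up to TimeoutMs for the reply.
%% Returns ok | timeout. A pong arriving after the timeout stays in
%% the caller's mailbox; this sketch does not clean it up.
check(Node, TimeoutMs) ->
    Ref = make_ref(),
    {hb_echo, Node} ! {ping, self(), Ref},
    receive
        {pong, Ref} -> ok
    after TimeoutMs ->
        timeout
    end.
```

Calling check/2 periodically from the monitoring node and logging every
timeout next to any nodedown events would give the comparison described
above: netsplits without heartbeat timeouts point at the tick algorithm.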

BR,
Ulf W
-- 
Ulf Wiger
CTO, Erlang Training & Consulting Ltd
http://www.erlang-consulting.com
