[erlang-questions] strange node communication startup delay

Anthony Shipman als@REDACTED
Tue Mar 4 15:11:51 CET 2008

I have several nodes running on the same computer, so inter-node
communication is normally very fast.  Shortly after starting the first
node starts a third node which has a gen_server that immediately registers
with a gen_server in the first node. There is a long round-trip delay
of up to 3 seconds at this point. (The second node is idle at
this time).

The third node performs these steps at startup:

    case net_kernel:connect_node(Node) of
    true ->
	log:format("muConnector: connected to ~p", [Node]),
	monitor_node(Node, true),
    log:format("muConnector: MU is up"),
    log:format("fleetMgrImpl: MU connected"),

    regIface:register(fleet, ID, self()),
    log:format("fleetMgrImpl: Fleet process registered"),

which results in these log messages on the third node:

    5-Mar-2008 00:35:14.517| muConnector: connected to 'SMU_mySite@REDACTED'
    5-Mar-2008 00:35:14.723| muConnector: MU is up
    5-Mar-2008 00:35:14.723| fleetMgrImpl: MU connected
    5-Mar-2008 00:35:16.174| regIface:rtt delay 1449075 us
    5-Mar-2008 00:35:16.215| fleetMgrImpl: Fleet process registered
    5-Mar-2008 00:35:16.215| regIface:rtt delay 842 us

The first call to rtt has measured a 1.45s delay for a simple echo.

I have process tracing on the first node with options: send, 'receive',
procs, timestamp, running, set_on_spawn.  The trace on the first node
shows lots of global name registration activity around 00:35:14.721
to 00:35:14.722:

    5-Mar-2008 00:35:14.721| [<0.11.0>,
    5-Mar-2008 00:35:14.722| [<0.11.0>,
    5-Mar-2008 00:35:14.722| [<0.11.0>,
    5-Mar-2008 00:35:14.722| [<0.11.0>,
    5-Mar-2008 00:35:14.723| [<0.12.0>,
    5-Mar-2008 00:35:14.761| [<0.11.0>,
    5-Mar-2008 00:35:14.761| [<0.12.0>,out,{global,loop_the_locker,1}]

Next the trace shows the arrival of the echo message with its
time-stamp. The message left the third node at 00:35:14.725 but it
did not arrive at the first node until 00:35:16.174. The reply was
essentially immediate.

    5-Mar-2008 00:35:16.174| [<0.84.0>,
    5-Mar-2008 00:35:16.174| [<0.84.0>,in,{gen_server,loop,6}]
    5-Mar-2008 00:35:16.174| [<0.84.0>,send,{#Ref<8232.0.0.204>,{ok,

According to the process trace nothing else happened on the first node
between 00:35:14.761 and 00:35:16.174.  So why the 1.4 second delay
in communication?

Can I get traces of the inter-node communication?

More information about the erlang-questions mailing list