[erlang-questions] strange node communication startup delay

Tue Mar 4 15:11:51 CET 2008

I have several nodes running on the same computer, so inter-node
communication is normally very fast.  Shortly after starting the first
node starts a third node which has a gen_server that immediately registers
with a gen_server in the first node. There is a long round-trip delay
of up to 3 seconds at this point. (The second node is idle at
this time).

The third node performs these steps at startup:

    case net_kernel:connect_node(Node) of
    true ->
	log:format("muConnector: connected to ~p", [Node]),
	global:sync(),
	monitor_node(Node, true),
	onUp(State);
    ...
    log:format("muConnector: MU is up"),
    ...
    log:format("fleetMgrImpl: MU connected"),

    regIface:rtt(),
    regIface:register(fleet, ID, self()),
    log:format("fleetMgrImpl: Fleet process registered"),
    regIface:rtt(),

which results in these log messages on the third node:

    5-Mar-2008 00:35:14.517| muConnector: connected to 'SMU_mySite@REDACTED'
    5-Mar-2008 00:35:14.723| muConnector: MU is up
    5-Mar-2008 00:35:14.723| fleetMgrImpl: MU connected
    5-Mar-2008 00:35:16.174| regIface:rtt delay 1449075 us
    5-Mar-2008 00:35:16.215| fleetMgrImpl: Fleet process registered
    5-Mar-2008 00:35:16.215| regIface:rtt delay 842 us

The first call to rtt has measured a 1.45s delay for a simple echo.

I have process tracing on the first node with options: send, 'receive',
procs, timestamp, running, set_on_spawn.  The trace on the first node
shows lots of global name registration activity around 00:35:14.721
to 00:35:14.722:

    5-Mar-2008 00:35:14.721| [<0.11.0>,
     'receive',
     {'$gen_call',{<8232.12.0>,{#Ref<8232.0.0.182>,'SMU_mySite@REDACTED'}},
		  {set_lock,{global,[<0.12.0>,<8232.12.0>]}}}]
    ................
    5-Mar-2008 00:35:14.722| [<0.11.0>,
     send,
     {'$gen_cast',
	 {exchange,
	     'SMU_mySite@REDACTED',
	     [{'mySite/registry',<0.84.0>,{global,random_exit_name}},
	      {'mySite/sigMgr',<0.108.0>,{global,random_exit_name}},
	      {'mySite/evtMgr',<0.90.0>,{global,random_exit_name}},
	      {'mySite\001muMgr',<0.106.0>,{global,random_exit_name}}],
	     [],
	     {1204,637714,517547}}},
     {global_name_server,fleet_1_mySite@REDACTED}]
    ................
    5-Mar-2008 00:35:14.722| [<0.11.0>,
     send,
     {'$gen_cast',{resolved,'SMU_mySite@REDACTED',
			    [],
			    ['SPU1_mySite@REDACTED'],
			    ['SPU1_mySite@REDACTED'],
			    [],
			    {1204,637714,517547}}},
     {global_name_server,fleet_1_mySite@REDACTED}]
    5-Mar-2008 00:35:14.722| [<0.11.0>,
     send,
     {cancel,fleet_1_mySite@REDACTED,
{1204,637714,517912},#Fun<global.7.70730269>},
     <0.12.0>]
    ................
    5-Mar-2008 00:35:14.723| [<0.12.0>,
     send,
     {'$gen_call',{<0.12.0>,{#Ref<0.0.0.388>,'SPU1_mySite@REDACTED'}},
		  {del_lock,{global,[<0.12.0>,<8232.12.0>]}}},
     {global_name_server,'SPU1_mySite@REDACTED'}]
    ................
    5-Mar-2008 00:35:14.761| [<0.11.0>,
     'receive',
     {'$gen_call',{<8232.12.0>,{#Ref<8232.0.0.192>,'SMU_mySite@REDACTED'}},
		  {del_lock,{global,[<0.12.0>,<8232.12.0>]}}}]
    5-Mar-2008 00:35:14.761| [<0.12.0>,out,{global,loop_the_locker,1}]

Next the trace shows the arrival of the echo message with its
time-stamp. The message left the third node at 00:35:14.725 but it
did not arrive at the first node until 00:35:16.174. The reply was
essentially immediate.

    5-Mar-2008 00:35:16.174| [<0.84.0>,
     'receive',
     {'$gen_call',{<8232.87.0>,#Ref<8232.0.0.204>},
		  {regEcho,{1204,637714,725844}}}]
    5-Mar-2008 00:35:16.174| [<0.84.0>,in,{gen_server,loop,6}]
    5-Mar-2008 00:35:16.174| [<0.84.0>,send,{#Ref<8232.0.0.204>,{ok,
{1204,637714,725844}}},<8232.87.0>]

According to the process trace nothing else happened on the first node
between 00:35:14.761 and 00:35:16.174.  So why the 1.4 second delay
in communication?

Can I get traces of the inter-node communication?