[erlang-questions] Intermittent failures reconnecting C hidden nodes

Sat Jul 7 04:30:13 CEST 2007

Here's the latest status, for anyone who cares.

I'm getting the atom "nok" in response to sending my node name when
the failure occurs.  It's only happened once since we restarted
yesterday.  The node had been down for about eight minutes, and came
back up on a different cluster machine.  It failed to connect several
times (receiving the packet "snok"), and then about a minute later it
succeeded.

Here's our log from the Erlang side (the times are in CDT):
16:16:45 - <0.73.0> : received nodedown for:'c4780@REDACTED' [{nodedown_reason, connection_closed},{node_type,hidden}]
16:25:24 - <0.73.0> : received nodeup for:'c4780@REDACTED' [{node_type, hidden}]

I amended the erl_interface code to show the actual packet recv_status received.

(the times are in GMT... don't ask)
ei_xconnect: Fri Jul  6 21:24:18 2007: -> CONNECT attempt to connect to yt
ei_epmd_r4_port: Fri Jul  6 21:24:18 2007: -> PORT2_REQ alive=yt ip=204.11.209.42
ei_epmd_r4_port: Fri Jul  6 21:24:18 2007: <- PORT2_RESP result=0 (ok)
ei_epmd_r4_port: Fri Jul  6 21:24:18 2007:    port=4000 ntype=77 proto=0 dist-high=5 dist-low=5
ei_xconnect: Fri Jul  6 21:24:18 2007: -> CONNECT connected to remote
recv_status: Fri Jul  6 21:24:18 2007: <- RECV_STATUS not ok; got: 73 6e 6f 6b = "snok"
ei_xconnect: Fri Jul  6 21:24:18 2007: -> CONNECT failed
erl_connect: Input/output error
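
For anyone following along, here is a minimal sketch in C of how the
four bytes above break down: a tag byte 's' (0x73) followed by the
status text "nok".  This is not the erl_interface source; the enum and
the function name classify_status are made up for illustration.  The
status strings are the ones the distribution handshake defines, as far
as I know.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical result codes for this sketch; not erl_interface names. */
enum hs_status {
    HS_OK, HS_OK_SIMULTANEOUS, HS_NOK, HS_NOT_ALLOWED, HS_ALIVE,
    HS_MALFORMED
};

/*
 * Classify a handshake status message body: the tag byte 's' followed
 * by the status text, with no terminating NUL.  `len` counts the tag.
 */
static enum hs_status classify_status(const unsigned char *buf, size_t len)
{
    static const struct { const char *text; enum hs_status code; } known[] = {
        { "ok",              HS_OK },
        { "ok_simultaneous", HS_OK_SIMULTANEOUS },
        { "nok",             HS_NOK },
        { "not_allowed",     HS_NOT_ALLOWED },
        { "alive",           HS_ALIVE }
    };
    size_t i;

    assert(buf != NULL);

    /* Need the tag plus at least one byte of status text. */
    if (len < 2 || buf[0] != 's')
        return HS_MALFORMED;

    for (i = 0; i < sizeof known / sizeof known[0]; i++) {
        size_t n = strlen(known[i].text);
        /* Require an exact length match so "ok" does not match "ok_simultaneous". */
        if (len - 1 == n && memcmp(buf + 1, known[i].text, n) == 0)
            return known[i].code;
    }
    return HS_MALFORMED;
}
```

Feeding it the bytes 73 6e 6f 6b from the log yields HS_NOK, i.e. the
remote node refused this connection attempt.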

Now... In my reading of the code, the only way the 'nok' can be sent
is if handle_info({...,{accept_pending,...}},...) in net_kernel.erl
replies 'nok_pending' to mark_pending/1 in dist_util.erl, like so:

handle_info({AcceptPid, {accept_pending,MyNode,Node,Address,Type}}, State) ->
    case ets:lookup(sys_dist, Node) of
        [#connection{state=pending}=Conn] ->
            if
                MyNode > Node ->
                    AcceptPid ! {self(),{accept_pending,nok_pending}},
                    {noreply,State};
                true ->
                    [...snip]

If I'm reading that right, it's doing a *lexical comparison* of the
local node name atom with the connecting one.  I cannot fathom why you
would want to do that... can someone clue me in?  In our case, it is
true that yt@REDACTED > c4780@REDACTED.
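
To make the comparison concrete, here is a rough C equivalent.  It
assumes that Erlang orders atoms by their text and that, for plain
ASCII node names, strcmp gives the same answer.  The function name
name_sorts_after is hypothetical, and any host name will do for trying
it out, since the real ones are redacted.

```c
#include <assert.h>
#include <string.h>

/*
 * Returns 1 if my_node sorts strictly after node, 0 otherwise.
 * Sketch of the `MyNode > Node` test under the assumption that atom
 * ordering matches byte-wise string ordering for ASCII names.
 */
static int name_sorts_after(const char *my_node, const char *node)
{
    assert(my_node != NULL && node != NULL);
    return strcmp(my_node, node) > 0;
}
```

Because 'y' (0x79) is greater than 'c' (0x63), the first character
alone decides it: a node named yt@... always sorts after one named
c4780@..., whatever the host part is.  So whenever a pending entry
exists for the C node, our side takes the nok_pending branch.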

Regardless, it seems that my previous connection was still pending, even
though I got the nodedown.  I scoured the rest of my server logs for
any place where some other c4780 could have made an abortive attempt
to connect but I don't see anything.  So my conjecture at this point
is that some part of the previous connection is stuck for a little
while but eventually clears out.

I saw this in the changelog for R11B-4 and was curious:
    OTP-6447  Under rare circumstances a terminating connection between two
              nodes could cause an instantaneous reconnect between the two
              nodes to fail on the runtime system with SMP support.

Might that change have any impact (we are running R11B-2 at present)?
There were also extensive changes in R11B-5 for node monitoring.  Is
there a publicly-accessible Erlang bug tracker, by the way, where I
could look up issue OTP-6447 specifically?  Just grasping at straws.

I am now compiling R11B-5 with dist_debug defined to true in
net_kernel.erl and dist_util.hrl.  So either the upgrade will fix the
issue or I will have even more data, and then you guys can tell me how
to configure my ethernet switch.
