[erlang-questions] Intermittent failures connecting C hidden nodes

Fri Jul 6 16:12:25 CEST 2007

Thanks for the replies.

Just to clarify something, though: there is no real I/O error.
erl_interface sets errno to EIO if *anything* goes slightly wrong --
in this case, simply because the remote end sent something in response
other than an "sok" packet.  It is not timing out; it fails instantly,
so as you said, the TCP connection is obviously correctly established.
There are also several other connections to beam from different
cnodes on the same machine at the time of the failure.

So the only possibility of a firewall issue is if a firewall were
resetting connections right after they were established, which I've
never heard of before.  The firewall rules are not stateful among the
cluster nodes, anyway, and there is no NAT.

I also discovered that I was indeed using erl_interface 3.5.5.2 on the
client and R11B-2 on the server.

strace - well, a mountain of data is an understatement; it would give
me more like a planet of data.  The whole cluster has to be running
for at least a day or so before this shows up for the first time, and
it can happen on any machine, after thousands of simultaneous
player-hours.  tcpdump on the erlang port is feasible though and I'll
do that next time.

The important missing piece of data is the actual response that beam
sends when the connection fails.  I am now logging that, but haven't
seen the problem recur yet.
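
The kind of logging meant here can be sketched roughly as below.  This
is only an illustration, not the code actually running in the cluster:
classify_status and hs_result are hypothetical names, and the sketch
assumes the status message body has already been read from the socket
with its length prefix removed.  Only "sok" is treated as success, which
mirrors the erl_interface behaviour described above; anything else is
kept in printable form so it can be written to a log.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Result of inspecting the peer's handshake status message (hypothetical type). */
typedef enum {
    HS_OK,            /* body is exactly "sok" */
    HS_OTHER_STATUS,  /* starts with the 's' tag but is not "sok" */
    HS_MALFORMED      /* empty body, or first byte is not 's' */
} hs_result;

/*
 * Classify a handshake status body and copy a printable rendering of it
 * into `text` so the caller can log exactly what the peer sent.
 * Non-printable bytes are replaced with '.'; the copy is truncated to fit.
 */
hs_result classify_status(const unsigned char *buf, size_t len,
                          char *text, size_t text_size)
{
    size_t i, n = 0;

    assert(text != NULL && text_size > 0);

    for (i = 0; buf != NULL && i < len && n + 1 < text_size; i++) {
        unsigned char c = buf[i];
        text[n++] = (c >= 0x20 && c < 0x7f) ? (char)c : '.';
    }
    text[n] = '\0';

    if (buf == NULL || len == 0 || buf[0] != 's')
        return HS_MALFORMED;
    if (len == 3 && memcmp(buf, "sok", 3) == 0)
        return HS_OK;
    return HS_OTHER_STATUS;
}
```

A caller would log `text` together with the peer's node name and a
timestamp whenever the result is not HS_OK, so that the next failure
shows what beam actually answered instead of a bare EIO.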

During the time that the erlang connection fails, other connections
continue to work and the rest of the system proceeds normally.  The
simulator process that tried to connect gives up each time and goes
about its business for a while before attempting to reconnect to the
erlang part of the system, and each connection attempt either succeeds
or fails within a second.

Another interesting thing to note is that I was manufacturing process
ids from the cnode which were referenced by other processes in the
system, and there was a bug where those processes wouldn't stop
sending messages to them after the cnode went down.  This bug has now
been fixed, but I wonder if that had anything to do with it.  Guess
I'll find out soon.