[erlang-questions] Intermittent failures connecting C hidden nodes
Fri Jul 6 01:06:18 CEST 2007
We recently integrated Erlang into our MMO game's cluster, and
we're having a couple of problems, but I want to address them one at a
time.
There are several simulation processes which, as they are needed,
fork() on any of several machines and subsequently connect to beam
(currently running only on one machine) as a C node, using
erl_interface, with an sname of "" where S is the integer
sectorid and voc1-X is the machine it runs on, obviously. When the
simulation of that sector of space is no longer needed, the process
disconnects and dies.
This works fine, except occasionally a process will fail to connect on
our production cluster. It then aggressively retries every time an
erl_send() is attempted, and after several consecutive failures will
eventually succeed in connecting.
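For comparison, a less aggressive retry policy can be sketched as below. This is only an illustration, not what the simulation processes currently do: the names backoff_delay_ms, connect_with_backoff, try_connect and sleep_ms are hypothetical, and try_connect is assumed to wrap erl_connect() and return a non-negative descriptor on success or a negative value on failure.

```c
#include <stddef.h>

/* Delay before retry number `attempt` (0-based): base_ms doubled on each
 * retry and clamped to cap_ms. Written to avoid unsigned overflow. */
unsigned backoff_delay_ms(unsigned attempt, unsigned base_ms, unsigned cap_ms)
{
    unsigned delay = base_ms;

    while (attempt-- > 0 && delay < cap_ms) {
        if (delay > cap_ms / 2) {   /* doubling would exceed the cap */
            delay = cap_ms;
            break;
        }
        delay *= 2;
    }
    return delay < cap_ms ? delay : cap_ms;
}

/* Try to connect up to max_attempts times, sleeping between failures.
 * try_connect and sleep_ms are caller-supplied (hypothetical) callbacks.
 * Returns the descriptor from try_connect, or -1 if every attempt failed. */
int connect_with_backoff(int (*try_connect)(void *), void *ctx,
                         void (*sleep_ms)(unsigned),
                         unsigned max_attempts,
                         unsigned base_ms, unsigned cap_ms)
{
    for (unsigned attempt = 0; attempt < max_attempts; attempt++) {
        int fd = try_connect(ctx);
        if (fd >= 0)
            return fd;
        if (attempt + 1 < max_attempts)   /* no sleep after the last try */
            sleep_ms(backoff_delay_ms(attempt, base_ms, cap_ms));
    }
    return -1;
}
```

With base_ms=100 and cap_ms=5000 the delays run 100, 200, 400, ... and settle at 5000 ms, which also makes it easier to correlate each failed attempt with the nodeup/nodedown log on the Erlang side.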
We are logging nodeups and nodedowns to ensure that as sectors start
up and shut down, they are properly connecting and disconnecting; my
initial thought was that maybe a sector is trying to connect as
"c200", for instance, but that name is taken. This doesn't appear to
be the case. A node that has been down for several hours can still
fail to connect. But certain sectors tend to fail more often, or at
least it seems that way to me at a glance.
To get more info, I increased ei_tracelevel during the erl_connect()
call. This is what I see when there is a failure:
ei_xconnect: Wed Jul 4 10:25:04 2007: -> CONNECT attempt to connect to yt
ei_epmd_r4_port: Wed Jul 4 10:25:04 2007: -> PORT2_REQ alive=yt
ei_epmd_r4_port: Wed Jul 4 10:25:04 2007: <- PORT2_RESP result=0 (ok)
ei_epmd_r4_port: Wed Jul 4 10:25:04 2007: port=4000 ntype=77 proto=0 dist-high=5 dist-low=5
ei_xconnect: Wed Jul 4 10:25:04 2007: -> CONNECT connected to remote
ei_xconnect: Wed Jul 4 10:25:04 2007: -> CONNECT failed
erl_connect: Input/output error
So it's making the TCP connection and getting through send_name
without any errors. recv_status then reads a response other than
"sok" but, unhelpfully, doesn't report what the response actually
was and just fails with EIO. This is what led me to believe
beam thinks there's something wrong with the sname.
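To make the status step concrete, here is a minimal sketch of decoding such a reply so that it can be logged. It is not the code in ei_connect.c. It assumes the framing described in the Erlang distribution protocol documentation: a 2-byte big-endian length, then the tag byte 's' followed by the status text, so "sok" is the status "ok". The function name decode_status_frame is hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Decode one handshake status frame held in buf[0..buflen).
 * Assumed layout: 2-byte big-endian payload length, then a payload whose
 * first byte is 's' and whose remaining bytes are the status text.
 * On success copies the status text (without the 's') into out as a
 * NUL-terminated string and returns 0. Returns -1 if the frame is
 * truncated, has a different tag, or does not fit in out. */
int decode_status_frame(const unsigned char *buf, size_t buflen,
                        char *out, size_t outlen)
{
    size_t len, textlen;

    if (buf == NULL || out == NULL || buflen < 3 || outlen == 0)
        return -1;

    len = ((size_t)buf[0] << 8) | buf[1];
    if (len < 1 || len > buflen - 2)   /* payload must fit in the buffer */
        return -1;
    if (buf[2] != 's')                 /* not a status message */
        return -1;

    textlen = len - 1;                 /* payload minus the tag byte */
    if (textlen >= outlen)
        return -1;

    memcpy(out, buf + 3, textlen);
    out[textlen] = '\0';
    return 0;
}
```

The protocol documentation lists the possible status values as ok, ok_simultaneous, nok, not_allowed and alive. Logging the decoded text on failure would show which of these, if any, the server is sending.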
The server is using R11B2 on FreeBSD, built from /usr/ports.
The clients are Linux using the erl_interface library 18.104.22.168 from
R11B3 currently. (The changelog between 22.214.171.124 and 126.96.36.199 seems
like it wouldn't make any difference here).
I'm going to patch ei_connect.c to log the actual response in
recv_status to get some insight, but in the meantime I would
appreciate any advice. In particular: is there any way to get more
verbose information from Erlang about any attempts to connect? Something
like net_kernel:monitor_nodes but more low-level?
I intend to upgrade both client and server to R11B5 at some point in
the near future, but I don't believe that that will solve this issue,
given the changelogs.