[erlang-bugs] Re: [erlang-questions] epmd leaving ports in TIME_WAIT?

Tue Mar 23 22:04:03 CET 2010

On Tue, Mar 23, 2010 at 01:15:18PM -0400, Nicholas Frechette wrote:
> Hi,
> I did as you suggested and ran epmd -d.
> It ends up outputting something like:
> epmd: Tue Mar 23 09:26:39 2010: ** sent PORT2_RESP (error) for "rodc10"
> epmd: Tue Mar 23 09:26:40 2010: ** got PORT2_REQ
> 
> Over and over.
> This is because one of my nodes pings (net_adm:ping) a node that doesn't
> exist from time to time. (Every couple seconds or so)

Right, so every time the node connects to epmd and disconnects, the TCP
session will go into TIME_WAIT.
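
To make that concrete, here is a rough sketch of what a single lookup looks
like on the wire, assuming epmd is on localhost at the default port 4369 and
that the request/response tags are 122 and 119 as in the distribution
protocol docs. The module and function names (epmd_probe, request/1,
port2_req/1) are made up for illustration; this is not how net_adm:ping is
actually implemented, just the same kind of short-lived session done by hand.

```erlang
-module(epmd_probe).
-export([request/1, port2_req/1]).

%% Build a PORT2_REQ frame: 2-byte big-endian length, tag 122,
%% then the short node name (e.g. "rodc10").
request(Name) when is_list(Name) ->
    Body = list_to_binary([122 | Name]),
    Len = byte_size(Body),
    <<Len:16, Body/binary>>.

%% One lookup is one TCP session: connect, send, read 2 bytes, close.
%% Assumes epmd listens on 127.0.0.1:4369 (the default).
port2_req(Name) ->
    case gen_tcp:connect({127,0,0,1}, 4369, [binary, {active, false}], 1000) of
        {ok, S} ->
            ok = gen_tcp:send(S, request(Name)),
            Reply = gen_tcp:recv(S, 2, 1000),
            %% Whichever side closes first holds the session in TIME_WAIT.
            ok = gen_tcp:close(S),
            case Reply of
                {ok, <<119, 0>>}    -> found;           % result 0: node is registered
                {ok, <<119, _Err>>} -> not_registered;  % the "(error)" case in the log
                Other               -> {unexpected, Other}
            end;
        {error, Reason} ->
            {error, Reason}
    end.
```

Calling epmd_probe:port2_req("rodc10") every couple of seconds for a name
that is not registered should produce the same PORT2_REQ / PORT2_RESP (error)
pairs in the epmd -d output, one closed session per call.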

> Also, when epmd dies, the ports are closed properly. In any case, I find it
> surprising that epmd has to open so many sockets to ask around if someone
> has seen the missing node.

1> [ begin {ok,S} = gen_tcp:connect({127,0,0,1},4369,[]), ok = gen_tcp:close(S) end || _ <- lists:seq(1,10000) ].

That will generate 10,000 sessions in TIME_WAIT :) I guess the question
is why your nodes keep disappearing from the network.
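
If you want to watch the count from inside a node, something along these
lines might do. It is only a sketch: it assumes a Linux-style "netstat -tn"
whose lines carry local address, remote address and state as
whitespace-separated fields, and the module name tw_count is made up.

```erlang
-module(tw_count).
-export([count/1, count_lines/2]).

%% Count TIME_WAIT sessions that have Port on either end.
%% Assumes a Linux-style netstat is on the PATH.
count(Port) ->
    count_lines(os:cmd("netstat -tn"), Port).

%% Pure helper so the parsing can be checked without running netstat.
count_lines(Output, Port) ->
    Suffix = ":" ++ integer_to_list(Port),
    Lines = string:split(Output, "\n", all),
    length([L || L <- Lines, is_time_wait(L, Suffix)]).

%% A line counts if one field is TIME_WAIT and one address ends in ":Port".
is_time_wait(Line, Suffix) ->
    Fields = string:lexemes(Line, " \t"),
    lists:member("TIME_WAIT", Fields) andalso
        lists:any(fun(F) -> lists:suffix(Suffix, F) end, Fields).
```

Running tw_count:count(4369) before and after the list comprehension above
should show the number jump, then drain again once the TIME_WAIT period has
passed.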

> On Mon, Mar 22, 2010 at 11:45 AM, Michael Santos
> <michael.santos@REDACTED> wrote:
> 
> > On Mon, Mar 22, 2010 at 11:17:25AM -0400, Nicholas Frechette wrote:
> > > Escalating to erlang-bugs.
> > > I've restarted both my server and laptop over the weekend.
> > > On both machines, I restarted my 2 erlang applications (4 nodes,
> > > connected in pairs: A <-> B, C <-> D, with pairs on the same computer)
> > >
> > > This was yesterday. This morning I did another netstat -t, and indeed, I
> > > have >100 sockets stuck in TIME_WAIT on both computers.
> >
> > Sockets in TIME_WAIT state are normal. After the socket is closed,
> > the OS puts the socket into TIME_WAIT to ensure any pending packets
> > queued somewhere in the network for the socket pair have time to arrive.
> > Usually TIME_WAIT is 2 or 4 minutes.
> >
> > It looks as if there are a number of TCP connections being
> > established and closed to your epmd.
> >
> > > Both with outgoing
> > > on localhost and the other pc, in about equal proportion.
> > > No node has crashed/restarted. None of the nodes does anything fancy,
> > > simply net_adm:ping to connect the nodes and then data is exchanged
> > > using messages.
> > >
> > > The problem seems somewhat related to the fact that epmd seems to restart
> > > from time to time as the OS gets confused and cannot retrieve the PID that
> > > originally opened the sockets (although port shows it is epmd)
> >
> > What is restarting epmd?
> >
> > See anything in your logs? Maybe try running epmd in debug mode. Kill
> > epmd if it is running and run: epmd -d
> >
> > > I briefly looked at the epmd code and did see a few comments in there
> > > about // should probably always close and a few other potential places
> > > where it might leak sockets. Unfortunately I ran out of time.
> >
> > Doesn't appear to be leaking fd's, but you can check with lsof.
> >
> > > Can anyone confirm if they see similar behavior? Note that on both
> > > computers, both nodes are started manually (not automated yet) and as such
> > > it isn't a race to see which node can start epmd first. Although, I wonder
> > > if it might be related to the problem of the epmd 100% cpu use, I believe
> > > another poster made the point that it would happen when epmd runs out of
> > > file descriptor (which would happen if it leaks sockets in TIME_WAIT).
> >
> > That's just one error condition; for example, the connection could have
> > been aborted or the socket could have been closed. Are you seeing a lot
> > of CPU usage?