[erlang-bugs] Re: [erlang-questions] epmd leaving ports in TIME_WAIT?

Mon Mar 22 16:45:37 CET 2010

On Mon, Mar 22, 2010 at 11:17:25AM -0400, Nicholas Frechette wrote:
> Escalating to erlang-bugs.
> I've restarted both my server and laptop over the weekend.
> On both machines, I restarted my 2 erlang applications (4 nodes, connected
> in pairs: A <-> B, C <-> D, with pairs on the same computer)
> 
> This was yesterday. This morning I did another netstat -t, and indeed, I
> have >100 sockets stuck in TIME_WAIT on both computers.

Sockets in TIME_WAIT state are normal. After the socket is closed,
the OS puts the socket into TIME_WAIT to ensure any pending packets
queued somewhere in the network for the socket pair have time to arrive.
Usually TIME_WAIT is 2 or 4 minutes.
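
To see how the sockets are spread across states, the netstat output
can be tallied. A minimal sketch in Python, assuming the usual Linux
`netstat -tn` column layout (Proto, Recv-Q, Send-Q, Local Address,
Foreign Address, State). The function name and the sample addresses
are made up for illustration and are not from the original report:

```python
from collections import Counter

def count_tcp_states(netstat_output):
    """Tally TCP connection states from `netstat -tn` style text."""
    states = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Data rows start with "tcp" or "tcp6"; the state is the 6th column.
        if len(fields) >= 6 and fields[0].startswith("tcp"):
            states[fields[5]] += 1
    return states

# Hypothetical sample input, not taken from the original report.
SAMPLE = """\
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 127.0.0.1:4369          127.0.0.1:51234         TIME_WAIT
tcp        0      0 127.0.0.1:4369          127.0.0.1:51240         TIME_WAIT
tcp        0      0 127.0.0.1:4369          127.0.0.1:40001         ESTABLISHED
"""

if __name__ == "__main__":
    print(count_tcp_states(SAMPLE))
```

Running the tally a few minutes apart should show whether the
TIME_WAIT count keeps growing or simply turns over.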

It looks as if a number of TCP connections are being established to
your epmd and then closed.
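
For reference, a lookup against epmd is a short-lived TCP connection:
the client connects, sends one request, reads the reply, and the
connection is closed, so one side ends up holding a TIME_WAIT entry.
A rough sketch in Python of a names query, based on my understanding
of the epmd protocol (a 2-byte big-endian length followed by the
request byte 110 for NAMES_REQ). The helper names are my own, and the
host and port assume a default local setup:

```python
import socket
import struct

EPMD_PORT = 4369   # default epmd port; assumed unchanged
NAMES_REQ = 110    # request tag for "list registered nodes"

def build_names_req():
    """Frame a NAMES_REQ: 2-byte big-endian length, then the tag byte."""
    payload = bytes([NAMES_REQ])
    return struct.pack(">H", len(payload)) + payload

def query_names(host="127.0.0.1", port=EPMD_PORT, timeout=2.0):
    """Open one connection, ask for the registered names, read until EOF."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(build_names_req())
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:          # peer closed: the reply is complete
                break
            chunks.append(data)
    reply = b"".join(chunks)
    # The first 4 bytes carry epmd's port number; the rest is text,
    # one registered node per line.
    return reply[4:].decode("latin-1")
```

Each call to query_names() is one more connection opened and closed,
so anything that polls epmd regularly would produce a steady stream
of TIME_WAIT sockets without any leak being involved.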

> Both with outgoing
> on localhost and the other pc, in about equal proportion.
> No node has crashed/restarted. None of the nodes does anything fancy, simply
> net_adm:ping to connect the nodes and then data is exchanged using messages.
> 
> The problem seems somewhat related to the fact that epmd seems to restart
> from time to time as the OS gets confused and cannot retrieve the PID that
> originally opened the sockets (although port shows it is epmd)

What is restarting epmd?

Do you see anything in your logs? Maybe try running epmd in debug
mode. Kill epmd if it is running, then run: epmd -d

> I briefly looked at the epmd code and did see a few comments in there about
> // should probably always close and a few other potential places where it
> might leak sockets. Unfortunately I ran out of time.

It doesn't appear to be leaking fds, but you can check with lsof.
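
Note that a socket in TIME_WAIT no longer belongs to any process, so
it does not count against epmd's descriptor limit. What matters for a
leak is the number of descriptors the process still holds. A small
sketch in Python for watching that number on Linux, assuming the
standard /proc/<pid>/fd layout; the function name and the proc_root
parameter are my own, for illustration only:

```python
import os

def count_open_fds(pid, proc_root="/proc"):
    """Count the file descriptors a process holds (Linux only).

    Each entry under <proc_root>/<pid>/fd is one open descriptor.
    Reading another user's process may need elevated privileges.
    """
    fd_dir = os.path.join(proc_root, str(pid), "fd")
    return len(os.listdir(fd_dir))
```

Calling this with epmd's pid every few minutes should show a steadily
rising count if descriptors really are being leaked.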

> Can anyone confirm if they see similar behavior? Note that on both
> computers, both nodes are started manually (not automated yet) and as such
> it isn't a race to see which node can start epmd first. Although, I wonder
> if it might be related to the problem of the epmd 100% cpu use, I believe
> another poster made the point that it would happen when epmd runs out of
> file descriptor (which would happen if it leaks sockets in TIME_WAIT).

Running out of file descriptors is just one error condition that could
cause that; for example, the connection could have been aborted or the
socket could have been closed. Are you seeing a lot of CPU usage?