[erlang-questions] Unable to restart epmd, sockets stuck in close_wait

Fri Aug 26 17:46:45 CEST 2016

Hi Folks,
I've got a system running Erlang/OTP 18.3.4.1 and RabbitMQ 3.6.3.  Everything
is local to the system and there is no clustering.

We are seeing intermittent failures when we stop, uninstall, reinstall, and
start epmd.

When this happens we also see many sockets stuck in CLOSE_WAIT, like so:
tcp       48      0 0.0.0.0:4369            0.0.0.0:*               LISTEN      0          570937     1/systemd
tcp        5      0 127.0.0.1:4369          127.0.0.1:37560         CLOSE_WAIT  0          0          -
tcp       31      0 127.0.0.1:4369          127.0.0.1:42564         CLOSE_WAIT  0          0          -
tcp       31      0 127.0.0.1:4369          127.0.0.1:53126         CLOSE_WAIT  0          0          -
tcp       31      0 127.0.0.1:4369          127.0.0.1:40222         CLOSE_WAIT  0          0          -
tcp       38      0 127.0.0.1:4369          127.0.0.1:33506         CLOSE_WAIT  0          0          -
tcp       31      0 127.0.0.1:4369          127.0.0.1:56332         CLOSE_WAIT  0          0          -
tcp       31      0 127.0.0.1:4369          127.0.0.1:50511         CLOSE_WAIT  0          0          -
tcp       31      0 127.0.0.1:4369          127.0.0.1:45528         CLOSE_WAIT  0          0          -
tcp       31      0 127.0.0.1:4369          127.0.0.1:59487         CLOSE_WAIT  0          0          -
tcp        4      0 127.0.0.1:4369          127.0.0.1:37506         CLOSE_WAIT  0          0          -
tcp       31      0 127.0.0.1:4369          127.0.0.1:41554         CLOSE_WAIT  0          0          -
tcp       31      0 127.0.0.1:4369          127.0.0.1:40080         CLOSE_WAIT  0          0          -
tcp       31      0 127.0.0.1:4369          127.0.0.1:32903         CLOSE_WAIT  0          0          -
tcp       31      0 127.0.0.1:4369          127.0.0.1:48851         CLOSE_WAIT  0          0          -
tcp       31      0 127.0.0.1:4369          127.0.0.1:35177         CLOSE_WAIT  0          0          -
tcp       31      0 127.0.0.1:4369          127.0.0.1:44931         CLOSE_WAIT  0          0          -
tcp       31      0 127.0.0.1:4369          127.0.0.1:54730         CLOSE_WAIT  0          0          -
tcp       31      0 127.0.0.1:4369          127.0.0.1:48311         CLOSE_WAIT  0          0          -
tcp       31      0 127.0.0.1:4369          127.0.0.1:39159         CLOSE_WAIT  0          0          -
tcp       31      0 127.0.0.1:4369          127.0.0.1:47166         CLOSE_WAIT  0          0          -
tcp        2      0 127.0.0.1:4369          127.0.0.1:37541         CLOSE_WAIT  0          0          -
tcp       31      0 127.0.0.1:4369          127.0.0.1:38290         CLOSE_WAIT  0          0          -
tcp       31      0 127.0.0.1:4369          127.0.0.1:43044         CLOSE_WAIT  0          0          -
tcp        2      0 127.0.0.1:4369          127.0.0.1:37540         CLOSE_WAIT  0          0          -
tcp        2      0 127.0.0.1:4369          127.0.0.1:37544         CLOSE_WAIT  0          0          -
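
To summarise a listing like the one above, something along these lines might
help. It is only a rough Python sketch: tally_states is a made-up helper name,
and it assumes the column order shown above (protocol, Recv-Q, Send-Q, local
address, foreign address, state), rejoining any entry that got wrapped onto a
second line.

```python
from collections import Counter


def tally_states(netstat_text, port=4369):
    """Count TCP states and sum Recv-Q for sockets bound to the given local port.

    Hypothetical helper: assumes netstat-style columns
    (proto, Recv-Q, Send-Q, local, foreign, state, ...).
    """
    # Rejoin entries whose trailing columns were wrapped onto the next line.
    entries = []
    for line in netstat_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("tcp"):
            entries.append(line)
        elif entries:
            entries[-1] += " " + line

    states = Counter()   # state -> number of sockets
    recv_q = Counter()   # state -> total Recv-Q over those sockets
    for entry in entries:
        fields = entry.split()
        if len(fields) < 6:
            continue  # not a complete socket line
        _proto, rq, _sq, local, _foreign, state = fields[:6]
        if not local.endswith(":%d" % port):
            continue  # only sockets whose local port matches
        states[state] += 1
        recv_q[state] += int(rq)
    return states, recv_q
```

Feeding it the saved netstat output as a string returns two counters keyed by
state, which makes it easier to compare the hung system with the working one.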

On an identical working system the output looks like this:
tcp        0      0 0.0.0.0:4369            0.0.0.0:*               LISTEN      1/systemd
tcp        0      0 <ip address>:4369        9.47.80.245:36368       TIME_WAIT   -
tcp        0      0 127.0.0.1:34836         127.0.0.1:4369          ESTABLISHED 22713/beam.smp
tcp        0      0 127.0.0.1:4369          127.0.0.1:34836         ESTABLISHED 21186/epmd

On the hung system:
epmd -names and epmd -kill both hang indefinitely.
Attempting to restart epmd.socket or epmd.service gives the error:
epmd.socket failed to listen on sockets: Address already in use
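
Because epmd -names never returns here, a probe with its own timeout may be a
safer way to check whether anything is answering on port 4369. The following
is a minimal Python sketch, not a replacement for the epmd client: the
function names are hypothetical, and it assumes the documented EPMD wire
format (a 2-byte big-endian length, then the NAMES_REQ tag 110; the reply is a
4-byte port number followed by text lines).

```python
import socket
import struct

NAMES_REQ = 110  # EPMD request tag for "list registered names"


def build_names_request():
    # 2-byte big-endian length prefix (1), then the 1-byte request tag.
    return struct.pack(">H", 1) + bytes([NAMES_REQ])


def parse_names_response(data):
    # First 4 bytes: the port epmd reports; the rest is plain text, one node per line.
    if len(data) < 4:
        raise ValueError("reply too short to be a NAMES response")
    (epmd_port,) = struct.unpack(">I", data[:4])
    return epmd_port, data[4:].decode("latin-1").splitlines()


def probe_epmd(host="127.0.0.1", port=4369, timeout=3.0):
    """Return (status, payload); status is 'ok', 'timeout', or 'error'."""
    try:
        # The timeout applies to connect, send, and each recv.
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(build_names_request())
            chunks = []
            while True:
                chunk = sock.recv(4096)
                if not chunk:  # peer closed the connection: reply complete
                    break
                chunks.append(chunk)
        return "ok", parse_names_response(b"".join(chunks))
    except socket.timeout:
        # Nothing accepted or answered in time (the hung case).
        return "timeout", None
    except (OSError, ValueError) as exc:
        return "error", str(exc)


if __name__ == "__main__":
    print(probe_epmd())
```

A 'timeout' result would suggest that connections are being queued on the
listening socket but never serviced, rather than being refused outright.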

Is there any way to
a) Get more information about what is causing the state to occur (so I can
hopefully prevent it in the future)
or
b) Recover from this state (without rebooting the system)?

Thanks!