gen_server:call across nodes hangs indefinitely, net_adm:ping works

Simon MacMullen simon@REDACTED
Fri Oct 22 15:59:10 CEST 2010


One of our customers is seeing some weird issues with RabbitMQ, and it's 
beginning to look like an Erlang problem.

They have a two-node system, with both nodes running R13B04 on 
virtualised CentOS 5. Both VMs use iptables as a firewall, with the 
epmd port open, plus one other port for inter-node message passing 
(inet_dist_listen_min and inet_dist_listen_max are both set to that 
exact port).
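For reference, the port restriction looks roughly like the fragment below. This is only a sketch: the port number 9100 is a made-up example, not the customer's actual value, and the same kernel parameters can equally be passed on the erl command line with -kernel.

```erlang
%% sys.config fragment (sketch). 9100 is a hypothetical port; the real
%% value is whatever single port the firewall leaves open.
[{kernel, [{inet_dist_listen_min, 9100},   % lowest port the node may listen on
           {inet_dist_listen_max, 9100}]}  % highest port; equal to min pins it
].
```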

After the system has been up for a number of hours / days, they're 
seeing RabbitMQ hanging in various ways. Investigation of the system in 
this state shows that:

* epmd is up and working
* epmd -d -names gives the expected results
* net_adm:ping/1 from either node to the other works
* any Rabbit APIs that invoke gen_server:call/3 locally work
* any Rabbit APIs that invoke gen_server:call/3 across nodes hang
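The last three checks can be compared side by side with something like the sketch below. The module name dist_probe and the function check/2 are invented for illustration. rpc:call/5 stands in for a Rabbit API here, on the assumption that in this release it is itself carried out as a gen_server:call to the rex server on the remote node; that may not hold for other OTP releases.

```erlang
-module(dist_probe).
-export([check/2]).

%% Compare a ping with a time-bounded remote call to the same node.
%% Node is the remote node name, Timeout is in milliseconds.
check(Node, Timeout) ->
    %% net_adm:ping/1 returns pong if the node answers, pang otherwise.
    Ping = net_adm:ping(Node),
    %% rpc:call/5 returns the result of erlang:node() evaluated on Node,
    %% or {badrpc, Reason} (for example {badrpc, timeout}) on failure.
    Rpc = rpc:call(Node, erlang, node, [], Timeout),
    {Ping, Rpc}.
```

Calling dist_probe:check(OtherNode, 5000) from each node gives a pair to compare. A pong alongside {badrpc, timeout} would match the symptoms listed above, although I have not been able to confirm that on the affected system.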

We've also seen a weird-looking error pop up around the same time 
(user_sup dies with "eio", see attached log), although I'm unclear as to 
whether this is a cause or a symptom.

Unfortunately I'm still a long way from cutting this down to a minimal 
test case; I can't even replicate it myself. But does this look like 
anything anyone has seen before?

Cheers, Simon
-- 
Simon MacMullen
Staff Engineer, RabbitMQ
SpringSource, a division of VMware

-------------- next part --------------
A non-text attachment was scrubbed...
Name: node1-sasl.log
Type: text/x-log
Size: 982 bytes
Desc: not available
URL: <http://erlang.org/pipermail/erlang-questions/attachments/20101022/dc00c300/attachment.bin>

