[erlang-questions] gen_server:call across nodes hangs indefinitely, net_adm:ping works

Jesper Louis Andersen jesper.louis.andersen@REDACTED
Fri Oct 22 16:50:14 CEST 2010


On Fri, Oct 22, 2010 at 3:59 PM, Simon MacMullen <simon@REDACTED> wrote:
> We're seeing a customer have some weird issues with RabbitMQ, and it's
> beginning to look like an Erlang problem.

Personally, I would go aggressively for the network stack. When the
system hangs, tcpdump the connection and see what kind of messages are
passed. Are they received on the other end at all? For these kinds of
problems, I like going from the copper and upwards, simply to
establish what is actually true about the problem. You might have
built up assumptions about certain lower-level things. Those
assumptions must be confirmed by inspection.

* Is there firewall state in iptables? (Assuming conntrack is used,
/proc knows about the table.)
* Are packets arriving at the other machine at all (tcpdump)?
* Can you inspect the amount of network traffic going on at the time
of the incident? If you have a large setup, sysadmins tend to have
nice graphs of this kind of stuff.
* Duplex/speed negotiation problems somewhere in the network?
* strace the Erlang process to look at its network communication?
* Kernel message logs (dmesg)?
* Virtual machines? I've seen VMware's internal routing switch go nuts
and throw away things or delay things.
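
As a rough sketch, the checks above could be collected like this. The
helper only prints the commands rather than running them, since most
need root. The function name, the peer address, the PID and the
interface name are placeholders and not from the original report, and
the conntrack path under /proc varies by kernel version.

```shell
#!/bin/sh
# Dry-run helper: prints the diagnostic commands instead of executing them.
# peer, pid and iface are hypothetical placeholders; substitute your own.
print_diag_cmds() {
    peer=$1
    pid=$2
    iface=${3:-eth0}   # assumed default interface name
    cat <<EOF
# Is traffic on the wire, and does it arrive on the other end?
tcpdump -n -i any host $peer
# Conntrack state for the peer (path varies by kernel version)
grep $peer /proc/net/nf_conntrack
# Negotiated speed and duplex
ethtool $iface
# Network-related system calls made by the Erlang VM
strace -f -e trace=network -p $pid
# Recent kernel messages
dmesg | tail -n 50
EOF
}

# Example: peer 192.0.2.10, Erlang VM PID 12345, default interface
print_diag_cmds 192.0.2.10 12345
```

Run the tcpdump line on both machines at the same time, so you can
tell whether a packet that leaves one end shows up at the other.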

At least these are some points I would go for. It may be an Erlang
bug, but I would guess it is somehow related to the infrastructure -
otherwise others would probably have seen it.


-- 
J.

