[erlang-questions] Intermittent failures connecting C hidden nodes
Fri Jul 6 17:47:11 CEST 2007
Jeff >> * Double checking that the switches between the two machines are locked
Jeff >> at the correct speed, e.g. 100 Mbps full-duplex.
Matthias > Verifying that the interfaces are running as expected, e.g. that both
Matthias > ends have the same idea about what they're doing, is good.
Chasing possible ethernet problems is not the first thing I'd do if
presented with the evidence Andy gave.
I posted because a small part of Jeff's advice (above) is prone to
misinterpretation. And also because my calling in life is to prevent
people from making that particular mistake.
Richard Andrews writes:
> TCP should take care of collisions and it is already established
> that data is being exchanged between the two hosts as TCP connect
> succeeds which requires SYN, SYN-ACK, ACK. Ethernet is obviously
This is misleading.
Collisions are part of the normal operation of a half-duplex
ethernet. TCP does not take care of them. Collisions are normally
taken care of by ethernet itself (by retransmission*).
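To make the retransmission concrete, here is a minimal sketch of the truncated binary exponential backoff a half-duplex ethernet station applies after a collision. The constants are the usual 10/100 Mb/s CSMA/CD values; the function names are made up for this illustration and do not come from any driver or library.

```python
import random

SLOT_BIT_TIMES = 512   # one slot time, in bit times (10/100 Mb/s ethernet)
ATTEMPT_LIMIT = 16     # the frame is given up after this many attempts
BACKOFF_LIMIT = 10     # the backoff exponent stops growing here


def max_backoff_slots(collisions):
    """Largest backoff, in slot times, allowed after the n-th successive collision."""
    return 2 ** min(collisions, BACKOFF_LIMIT) - 1


def backoff_slots(collisions, rng=random):
    """Pick a random backoff after the n-th successive collision (1-based).

    Returns None once the attempt limit is reached, i.e. the frame is dropped
    and recovery is left to a higher layer.
    """
    if collisions >= ATTEMPT_LIMIT:
        return None
    return rng.randint(0, max_backoff_slots(collisions))


def slots_to_microseconds(slots, mbps=100):
    """Convert slot times to microseconds at the given link speed."""
    return slots * SLOT_BIT_TIMES / mbps
```

At 100 Mb/s one slot is 5.12 microseconds, so even the largest backoff (1023 slots) is only about 5 ms. Ordinary collisions are therefore resolved far below anything TCP would notice.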
Late collisions are not a normal part of ethernet. Late collisions
result in lost packets. TCP will take care of packet losses, but
throughput will often suffer dramatically. Timeouts chosen to be
reasonable for a LAN will also be exceeded.
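A rough sketch of why the timeouts get exceeded: when a segment is lost, TCP waits for its retransmission timer and doubles that timer each time the retransmission is lost as well. The 0.2 s starting value below is an assumption picked only for illustration; real stacks derive it from measured round-trip times and clamp it to their own minimum.

```python
def cumulative_wait(initial_rto, consecutive_losses):
    """Seconds spent waiting on retransmission timers when the same segment
    is lost `consecutive_losses` times in a row and the timer doubles after
    each expiry."""
    return sum(initial_rto * 2 ** i for i in range(consecutive_losses))


# ASSUMPTION: 0.2 s starting timeout, chosen only for illustration.
for losses in range(1, 5):
    print(losses, "losses in a row:", round(cumulative_wait(0.2, losses), 1), "s")
```

Under that assumption, three losses in a row on one segment already cost more than a second, which an application timeout tuned for a healthy LAN may well not allow for.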
Getting a successful TCP connection is a necessary condition for
having an OK ethernet, but it's not sufficient.
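Since a working connect proves little, it is worth looking at what each interface thinks it negotiated. A minimal sketch, assuming a Linux host that exposes `speed` and `duplex` under `/sys/class/net/<interface>/`; the function names and the `eth0` default are hypothetical, and other systems need a different approach.

```python
from pathlib import Path


def link_settings(iface, sysfs_root="/sys/class/net"):
    """Return (speed_mbps, duplex) as reported by the kernel for one interface.

    Either value is None if it cannot be read, for example when the link is down.
    """
    base = Path(sysfs_root) / iface

    def read(name):
        try:
            return (base / name).read_text().strip()
        except OSError:
            return None

    speed = read("speed")
    try:
        speed = int(speed) if speed is not None else None
    except ValueError:
        speed = None
    return speed, read("duplex")


def looks_ok(iface="eth0", want_speed=100, want_duplex="full",
             sysfs_root="/sys/class/net"):
    """True only if the interface reports the expected speed and duplex."""
    return link_settings(iface, sysfs_root) == (want_speed, want_duplex)
```

This only reports one end. The result has to be compared with what the switch port, or the other machine, reports, because a duplex mismatch means precisely that the two ends disagree.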
The rest of your advice is pretty good.
> I would suggest
> 1) attach strace - you get a mountain of data but you can see what your EIO is
> at the socket level.
> 2) tcpdump - try and get packet trace of the connection; you can use "tcpdump
> -w -X -s0 ..." and just log everything and post-process it later.
> It could be a firewall or NAT issue; I've seen stateful firewalls get confused
> and block connections for a long time.
> It is important to find out what is happening around the connection retries (
> I'm assuming this means making new TCP connections). I think packet logs are
> your most important tool. Maybe you have an intermittent IP address conflict.
> It might be another host outside the client and server which is causing the
> problem. Might not even be connected to an erlang node - of course the response
> won't make sense then.
> You'll slap yourself on the forehead when you see it.
* There is a limit to the number of retransmission attempts. Exceeding
that limit is unusual.