[erlang-questions] Intermittent failures connecting C hidden nodes

Matthias Lang matthias@REDACTED
Fri Jul 6 17:47:11 CEST 2007


 Jeff >> * Double checking that the switches between the two machines are locked
 Jeff >> at the correct speed, e.g. 100 Mbps full-duplex.

 Matthias > Verifying that the interfaces are running as expected, e.g. that both
 Matthias > ends have the same idea about what they're doing, is good.
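
One way to see what a Linux host believes about its own link is to read
the speed and duplex files under /sys/class/net. The sketch below
assumes Linux and its sysfs layout; the function names and the
100/full expectation are only illustrative, and it reports one end
only. The switch port has to be checked separately, on the switch.

```python
from pathlib import Path


def _read(path):
    """Return the stripped file contents, or None if it cannot be read."""
    try:
        return path.read_text().strip()
    except OSError:
        # Reading these files fails on some interfaces, e.g. when the link is down.
        return None


def link_settings(iface, sysfs_root="/sys/class/net"):
    """Return (speed_mbps, duplex) as this host sees them; None if unknown."""
    base = Path(sysfs_root) / iface
    speed = _read(base / "speed")
    duplex = _read(base / "duplex")
    # The kernel may report -1 or nothing at all when the speed is unknown.
    speed_mbps = int(speed) if speed and speed.isdigit() else None
    return speed_mbps, duplex


def matches_expectation(settings, expected=(100, "full")):
    """True if the local settings are what we expect the far end to use too."""
    return settings == expected
```

For example, matches_expectation(link_settings("eth0")) checks a
hypothetical interface named eth0 against 100 Mbps full-duplex.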

Chasing possible ethernet problems is not the first thing I'd do if
presented with the evidence Andy gave. 

I posted because a small part of Jeff's advice (above) is prone to
misinterpretation. And also because my calling in life is to prevent
people from making that particular mistake.

Richard Andrews writes:

 > TCP should take care of collisions and it is already established 
 > that data is being exchanged between the two hosts as TCP connect 
 > succeeds which requires SYN, SYN-ACK, ACK. Ethernet is obviously
 > OK.

This is misleading.

Collisions are part of the normal operation of a half-duplex
ethernet. TCP does not take care of them. Collisions are normally
taken care of by ethernet itself (by retransmission*).

Late collisions are not a normal part of ethernet. They are typically
a symptom of a duplex mismatch, i.e. the two ends disagreeing about
what they're doing. Late collisions result in lost packets. TCP will
take care of packet losses, but throughput will often suffer
dramatically. Timeouts chosen to be reasonable for a LAN may also be
exceeded.
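
To get a feel for why LAN-sized timeouts get exceeded, here is a rough
model of TCP's retransmission timer: each consecutive loss doubles the
wait. The starting values are assumptions taken from the RFCs (RFC 6298
recommends an initial timeout of 1 second; the older RFC 2988
recommended 3 seconds), and real stacks differ in the details, so treat
this as a sketch rather than a description of any particular kernel.

```python
def retransmit_times(initial_rto=1.0, max_rto=60.0, attempts=5):
    """Seconds after the first send at which each retransmission goes out,
    assuming every earlier copy of the segment was lost."""
    times = []
    elapsed = 0.0
    rto = initial_rto
    for _ in range(attempts):
        elapsed += rto                # wait for the current timeout to expire
        times.append(elapsed)
        rto = min(rto * 2, max_rto)   # exponential backoff, capped
    return times


# With a 3 second initial timeout, one lost SYN already costs 3 seconds
# and two lost in a row cost 9.
print(retransmit_times(initial_rto=3.0, attempts=3))  # [3.0, 9.0, 21.0]
```

An application timeout of a second or two, which is generous on a
healthy LAN, is gone after a single lost SYN in this model.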

Getting a successful TCP connection is a necessary condition for
having an OK ethernet, but it's not sufficient.

The rest of your advice is pretty good.

Matthias

 > I would suggest 
 >  1) attach strace - you get a mountain of data but you can see what your EIO is
 > at the socket level.
 >  2) tcpdump - try and get packet trace of the connection; you can use "tcpdump
 > -X -s0 -w <file> ..." and just log everything and post-process it later.
 > 
 > It could be a firewall or NAT issue; I've seen stateful firewalls get confused
 > and block connections for a long time.
 > 
 > It is important to find out what is happening around the connection retries
 > (I'm assuming this means making new TCP connections). I think packet logs are
 > your most important tool. Maybe you have an intermittent IP address conflict.
 > It might be another host outside the client and server which is causing the
 > problem. Might not even be connected to an erlang node - of course the response
 > won't make sense then.
 > 
 > You'll slap yourself on the forehead when you see it.

* There is a limit to the number of retransmission attempts. Exceeding
  that limit is unusual.
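
  For the curious, the retransmission scheme on half-duplex ethernet is
  truncated binary exponential backoff. A minimal sketch, using the
  limits from IEEE 802.3 as I understand them (16 attempts per frame,
  backoff range capped after 10 collisions); the function and constant
  names are made up for illustration:

```python
import random

ATTEMPT_LIMIT = 16   # the frame is given up on at the 16th collision
BACKOFF_LIMIT = 10   # the random range stops doubling after 10 collisions


def max_backoff_slots(collisions):
    """Largest number of slot times a station may wait after this many
    consecutive collisions on the same frame."""
    return 2 ** min(collisions, BACKOFF_LIMIT) - 1


def backoff_slots(collisions, rng=random):
    """Pick the wait (in slot times) before the next transmission attempt."""
    if collisions >= ATTEMPT_LIMIT:
        # Excessive collisions: the frame is dropped and an error is reported.
        raise RuntimeError("excessive collisions, frame dropped")
    return rng.randint(0, max_backoff_slots(collisions))
```

  After the first collision a station waits 0 or 1 slot times; after the
  tenth and later ones, anything from 0 to 1023.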


