Fri Dec 14 12:26:58 CET 2012
Preemptive apologies that this turned into a bit of a novel...
I've been working on debugging a cluster of three Erlang nodes for a
couple days and I've run up against some interesting data points that
I can't quite figure out how to investigate much further.
For background, I'm running R14B01 on three nodes with one of the
three nodes in a remote data center that's about 40ms away from the
other two which are <1ms apart.
What I'm observing is that the remote node accumulates processes
stuck in erlang:bif_return_trap/1 until it exhausts RAM and reboots
(if I let it go that long). Each process stuck in bif_return_trap is
related to distributed message passing.
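For anyone wanting to look for the same symptom, here is a minimal
sketch of how such processes could be counted from a shell on the
remote node. It assumes the stuck processes report bif_return_trap as
their current_function; the arity is left as a wildcard, and the
variable names are only illustrative.

```erlang
%% Collect every local process whose current function is
%% erlang:bif_return_trap (any arity). process_info/2 returns
%% 'undefined' for a process that has already exited, which the
%% catch-all clause filters out.
Stuck = [P || P <- erlang:processes(),
              case erlang:process_info(P, current_function) of
                  {current_function, {erlang, bif_return_trap, _}} -> true;
                  _ -> false
              end].
length(Stuck).

%% Look at a handful of them in more detail.
[erlang:process_info(P, [initial_call, message_queue_len, reductions])
 || P <- lists:sublist(Stuck, 5)].
```

Running this a few times, some seconds apart, would show whether the
count is growing steadily or just fluctuating.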
My biggest question is what would be holding processes in this
function. From the Googling and source browsing I've been doing, it
looks as though it has to do with reductions and scheduling, but I
can't quite connect that to the behavior I'm seeing.
The obvious theory is that this is an IO issue over a long-distance
link and I'm simply seeing network saturation. Watching the Boundary
graphs, though, I can see that the link is not at all saturated: I'm
running around 200Mbps over a 1Gbps link. There are also spikes above
that level, which suggests it's not just a busy link.
This email has gotten a bit long, so I'll just enumerate a number of
questions and call it quits:
1. What exactly could bif_return_trap be waiting on?
2. How might reductions affect bif_return_trap?
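On the reductions question, one rough way to tell whether a process
is still being scheduled is to sample its reduction count twice and
compare. The sketch below is only an assumption about how one might
check this; Sample and its arguments are hypothetical names, not
anything from OTP.

```erlang
%% Sample the reduction count of each pid twice, Ms milliseconds
%% apart, and return {Pid, Delta} pairs. Pids that exit before either
%% sample are dropped, because process_info/2 returns 'undefined' for
%% them and the generator patterns do not match.
Sample = fun(Pids, Ms) ->
    Before = [{P, erlang:process_info(P, reductions)} || P <- Pids],
    timer:sleep(Ms),
    [{P, R1 - R0} || {P, {reductions, R0}} <- Before,
                     {reductions, R1} <- [erlang:process_info(P, reductions)]]
end.

%% Example: measure the shell process itself over 100 ms.
Sample([self()], 100).
```

A delta of zero would suggest the process is not running at all
during the interval, while a growing count would suggest it is being
scheduled repeatedly.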
Also, links of note that I keep seeing in my Googling:
Neither of which have an obvious resolution.