odd problem with distributed erlang system
Wed Mar 26 19:28:36 CET 2003
i'm trying to debug a strange problem we're seeing in one
of our systems. we've got a number of linux clusters, all
running vanilla redhat 7.x, and we only see this on one
of them. the application is a pretty basic one, where we
have a master node that starts up processes on other compute
nodes. each of these others runs an erlang port process
which invokes an external program to handle queries. the whole
thing is accessed through a web site which does queries of the
master server. the master forwards the query to the appropriate
child. each type of information server is started on a different
node, in on-demand order.
the problem is this: the first query comes in, causing a server,
say, foo, to run on node1. all is good. a different query comes
in which needs a baz server, which gets started on node2. this query
will run *realllllly* slow, eventually likely timing out before it returns
a result (which it will do eventually). when i say slow, i'm talking
minutes. the really odd thing is that the child process is writing to
a log file (nfs mounted from master), using the erlang dblog stuff,
and debug messages that occur in the same function, only a few lines
apart, come in several minutes apart.
the problem is independent of the order in which the nodes start,
or the servers start, or which subserver runs on which node, etc.
we've eliminated (i think) all the obvious cases. it really feels
like a machine configuration problem. this cluster is in a secure
lab, and has gone through a few rounds of network rearrangements.
the problem doesn't occur on any of the other similar systems we've
deployed, and we can't recreate it in our development systems.
so, does anybody have any ideas about what to look for, or what kinds
of things could make erlang run so slow on a machine? it feels like a
timeout of some sort, but we've looked at the obvious name lookup
things and such (does erlang require dns? we're using local /etc/hosts
for name resolution). is there something else internal that we might
be waiting on?
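
to make the name lookup question concrete, here is a minimal sketch of the kind of check that could be run from an erlang shell on the master. it is not something from our system: the module name lookup_timing and the example node name 'baz@node2' are made up for illustration. it only uses timer:tc/3, inet:gethostbyname/1 and net_adm:ping/1, and it times a lookup done from inside the erlang vm separately from a round trip over erlang distribution, so a stall in one or the other should show up in the numbers.

```erlang
%% hypothetical diagnostic module; the name is illustrative only.
-module(lookup_timing).
-export([time_lookup/1, time_ping/1, report/2]).

%% time a hostname lookup made from inside the erlang vm.
%% returns {Microseconds, Result}, where Result is whatever
%% inet:gethostbyname/1 returned ({ok, Hostent} or {error, Reason}).
time_lookup(Host) ->
    timer:tc(inet, gethostbyname, [Host]).

%% time a round trip to another node over erlang distribution.
%% net_adm:ping/1 returns pong on success and pang on failure.
time_ping(Node) ->
    timer:tc(net_adm, ping, [Node]).

%% print both timings side by side for one host/node pair.
report(Host, Node) ->
    {LookupUs, LookupRes} = time_lookup(Host),
    {PingUs, PingRes} = time_ping(Node),
    io:format("lookup ~s: ~p us -> ~p~n", [Host, LookupUs, LookupRes]),
    io:format("ping ~p: ~p us -> ~p~n", [Node, PingUs, PingRes]).
```

after compiling, a call like lookup_timing:report("node2", 'baz@node2') would be made once per compute node. if the lookup time is small but the ping takes seconds, name resolution inside the vm is probably not the thing we are waiting on.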
any ideas, suggestions, or pointers to useful test cases would be
greatly appreciated. it's awkward to debug since we have only
restricted access to the broken system.
thanks for your help.