odd problem with distributed erlang system

Mon Mar 31 13:30:56 CEST 2003

I have had a probably related problem.

It was on Debian 3.0 (Linux 2.4.18 kernel) with OTP R9B and an 
automounted NFS volume. I had a lot of trouble making the automounter 
use the mount options I wanted, partly because the automounter I used 
(amd) seems to use a library API rather than the mount command, so by 
the time I was done I had changed automounters, written some weird 
scripts, etc.

But if I try to get down to the core of the problem:

Writing a file on the NFS volume took forever (some 10..100 times slower 
than local disk). The file was opened with file:open(Name, [write, 
delayed_write, binary]) so small writes were buffered up to 64 KBytes 
(default for delayed_write). The writes were small, one line at a 
time. (This is probably the way disk_log does it too, by the way.)
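
For reference, a minimal sketch of that write pattern. The module and 
function names (delayed_write_demo, write_lines/2) are made up for 
illustration, and the explicit {delayed_write, Size, Delay} form just 
spells out what I understand the documented defaults to be (roughly 
64 KBytes and 2 seconds):

```erlang
-module(delayed_write_demo).
-export([write_lines/2]).

%% Write N short lines to Name, one file:write/2 call per line.
%% With delayed_write the small writes are buffered until at least
%% 64 KBytes are pending or the oldest buffered data is 2000 ms old.
write_lines(Name, N) ->
    {ok, Fd} = file:open(Name, [write, {delayed_write, 64 * 1024, 2000}, binary]),
    lists:foreach(
      fun(I) ->
              Line = io_lib:format("line ~w~n", [I]),
              ok = file:write(Fd, Line)
      end,
      lists:seq(1, N)),
    %% close/1 flushes whatever is still buffered; with delayed_write
    %% it is also where an error from an earlier write can show up.
    file:close(Fd).
```

Each file:write/2 call here returns quickly because the data only goes 
into the buffer; the actual write to the filesystem happens later.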

What happened was that the file driver buffered write requests up to 64 
KBytes, but to minimize copying they are queued, and later written using 
writev() with a vector of 1024 (on the Linux in question) small buffers. 
Some code in the Linux kernel that writes the data on the NFS filesystem 
(or the NFS code itself) then chops this up into 1024 write requests 
over NFS instead of holding the writev request together. The NFS volume 
was mounted synchronously, which resulted in about 4 NFS messages back 
and forth for each write request. Performance sucked.

The solution was to not mount the volume synchronously, i.e. I added the 
mount options: "async,actimeo=0". The "actimeo=0" option sets the 
attribute caching timeout to 0, which is pretty synchronous, and will do 
for me.

With these mount options the NFS filesystem realizes that it can bundle 
those 1024 small buffers into one write request, with some handshaking 
around it, and it goes as fast as one can expect.
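
A rough way to check whether a given mount has this problem is to time 
the same small-write loop against a local path and against the NFS 
path. This is only a sketch: the module name nfs_write_timing and the 
function time_write/2 are invented here, and the paths you pass in are 
your own.

```erlang
-module(nfs_write_timing).
-export([time_write/2]).

%% Return the wall-clock time in milliseconds needed to write N
%% one-line records to Path through a delayed_write file,
%% including the final flush done by file:close/1.
time_write(Path, N) ->
    {Micros, ok} = timer:tc(fun() -> write_lines(Path, N) end),
    Micros div 1000.

%% One file:write/2 call per line, as in the slow case described above.
write_lines(Path, N) ->
    {ok, Fd} = file:open(Path, [write, delayed_write, binary]),
    [ok = file:write(Fd, [<<"line ">>, integer_to_list(I), $\n])
     || I <- lists:seq(1, N)],
    file:close(Fd).
```

Call time_write/2 once with a file on local disk and once with a file 
on the NFS volume, using the same N. A difference of the order 
mentioned above (10..100 times) would point at the mount options 
rather than at Erlang.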

/ Raimo Niskanen, Erlang/OTP

Garry Hodgson wrote:
> i'm trying to debug a strange problem we're seeing in one
> of our systems.  we've got a number of linux clusters, all
> running vanilla redhat 7.x, and we only see this on one
> of them.  the application is a pretty basic one, where we
> have a master node that starts up processes on other compute
> nodes.  each of these others runs an erlang port process
> which invokes an external program to handle queries.  the whole
> thing is accessed through a web site which does queries of the
> master server.  the master forwards the query to the appropriate
> child.  each type of information server is started on a different
> node, in on-demand order.
> 
> the problem is this: the first query comes in, causing a server,
> say, foo, to run on node1.  all is good.  a different query comes
> in which needs a baz server, which gets started on node2.  this query
> will run *realllllly* slow, eventually likely timing out before it returns
> a result (which it will do eventually).  when i say slow, i'm talking
> minutes.  the really odd thing is that the child process is writing to
> a log file (nfs mounted from master), using the erlang dblog stuff,
> and debug messages that occur in the same function, only a few lines
> apart, come in several minutes apart.
> 
> the problem is independent of the order in which the nodes start, 
> or the servers start, or which subserver runs on which node, etc.
> we've eliminated (i think) all the obvious cases.  it really feels
> like a machine configuration problem.  this cluster is in a secure
> lab, and has gone through a few rounds of network rearrangements.
> the problem doesn't occur on any of the other similar systems we've
> deployed, and we can't recreate it in our development systems.
> 
> so, does anybody have any ideas about what to look for, or what kinds
> of things could make erlang run so slow on a machine?  it feels like a
> timeout of some sort, but we've looked at the obvious name lookup
> things and such (does erlang require dns?  we're using local /etc/hosts
> for name resolution).  is there something else internal that we might
> be waiting on?
> 
> any ideas, suggestions, or pointers to useful test cases would be
> greatly appreciated.  it's awkward to debug since we have only
> restricted access to the broken system.
> 
> thanks for your help.
>