Thank you Edwin and Per for your suggestions.<br><br>So far no solution...<br><br>Lowering TIME_WAIT didn't have a noticeable effect. We're getting the best results on our EC2 Fedora Core Release 4 test machine (1.7 GB of memory). On this machine we're now able to push ~300 ibrowse HTTP request/responses through before we start to get a large number of conn_failed and req_timedout messages from ibrowse.<br>


<br>Digging deeper into ibrowse... conn_failed is a timeout error from gen_tcp:connect(). This would appear to mean that gen_tcp:connect() isn't able to establish a connection at all.<br>
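<br>One rough way to check whether connection attempts are really stalling is to tally the TCP states the kernel reports while the test loop runs. The sketch below assumes the usual Linux `netstat -tan` column layout (state in the sixth column); the function name tally_tcp_states is made up for illustration. A large SYN_SENT count while conn_failed shows up would be consistent with connects that never complete.<br>

```shell
#!/bin/sh
# tally_tcp_states: read `netstat -tan` style output on stdin and print
# one "STATE count" line per TCP state, sorted by state name.
# Assumption: the state is the 6th whitespace-separated column, as in
# Linux net-tools output. Header lines are skipped because they do not
# start with "tcp".
tally_tcp_states() {
    awk '$1 ~ /^tcp/ { count[$6]++ }
         END { for (s in count) print s, count[s] }' | LC_ALL=C sort
}

# Usage (while the ibrowse test loop is running):
#   netstat -tan | tally_tcp_states
```

Running it a few times during a test pass and comparing the SYN_SENT, ESTABLISHED and TIME_WAIT counts gives a coarse picture without having to read a full packet capture.<br>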

<br>req_timedout is triggered when the socket stays open(?) for longer than the user-supplied timeout (10 seconds in our case). (BTW, all of the sites we're hitting are available within a 10 second window, so we should only rarely get this type of timeout. Our test run hits 500 URLs.)<br>


<br>It still seems like we don't have enough sockets available to us.<br><br>I dug deeper on this. How do you tell how many sockets a given process has open? One way seems to be to run ls on /proc/[the beam process id]/fd. This gives a list of numbers that presumably correspond to the file descriptors (sockets) for a process. On the first pass of the test (right after starting the erlang process), the number of fds shown is approximately the number of successful requests (around 300). Yet on repeat runs the number of fds doesn't exceed approximately 1000 (they stay open for a while), which would seem to mean that erlang still doesn't have more than 1024 sockets available to it, despite what ulimit says. This doesn't explain why it doesn't work right on the first pass, though, since we're only looping through 500 urls.<br>
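<br>The /proc check described above can be scripted so that sockets are counted separately from other descriptors. This is only a sketch: the function name count_fds and the PROC_ROOT override are invented for illustration (PROC_ROOT just makes it possible to point the function at a test directory), and it relies on Linux showing socket descriptors as symlinks of the form socket:[inode].<br>

```shell
#!/bin/sh
# count_fds PID: print how many descriptors a process has open and how
# many of them are sockets, based on the symlinks in /proc/PID/fd.
# PROC_ROOT is a hypothetical override for illustration; it defaults to /proc.
count_fds() {
    pid="$1"
    dir="${PROC_ROOT:-/proc}/$pid/fd"
    if [ ! -d "$dir" ]; then
        echo "no fd directory for pid $pid" >&2
        return 1
    fi
    total=0
    sockets=0
    for fd in "$dir"/*; do
        # Skip the unexpanded glob pattern when the directory is empty.
        [ -L "$fd" ] || [ -e "$fd" ] || continue
        total=$((total + 1))
        # On Linux, socket descriptors link to "socket:[inode]".
        case "$(readlink "$fd" 2>/dev/null)" in
            socket:*) sockets=$((sockets + 1)) ;;
        esac
    done
    echo "pid=$pid fds=$total sockets=$sockets"
}

# Usage (substitute the beam process id):
#   count_fds 12345
```

Comparing the reported numbers with the output of `ulimit -n` in the same shell that starts erlang shows whether the process is actually getting close to its descriptor limit.<br>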

<br>I have tried setting ERL_MAX_PORTS to 50000 before starting erlang from the command prompt. This doesn't appear to do anything.<br>

<br>What to try next? Approximately how many good request/response cycles should we *expect* to get if everything is working right? (From what I've read, it seems we should expect many, many more....) Do the ibrowse folks have any insight on any of this? Is there anything we can do to get the system to give us more information about what is going on? Is there an erlang error log we can look at?<br>

<br>PS I tried looking at the tcpdump of one of our request loops but wasn't able to see anything meaningful there. Any idea what I should be looking for in the tcpdump output?<br>

<br>Thanks!<br><br>Steve<br><br><div class="gmail_quote">On Sun, Feb 15, 2009 at 7:15 AM, Per Hedeland <span dir="ltr"><<a href="mailto:per@hedeland.org" target="_blank">per@hedeland.org</a>></span> wrote:<br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">


<div>Edwin Fine <<a href="mailto:erlang-questions_efine@usa.net" target="_blank">erlang-questions_efine@usa.net</a>> wrote:<br>

><br>

>If that doesn't help, try decreasing TIME_WAIT (but first read<br>

><a href="http://www.erlang.org/pipermail/erlang-questions/2008-September/038154.html" target="_blank">http://www.erlang.org/pipermail/erlang-questions/2008-September/038154.html</a> and<br>

><a href="http://www.developerweb.net/forum/showthread.php?t=2941" target="_blank">http://www.developerweb.net/forum/showthread.php?t=2941</a>).<br>

><br>

># Set TIME_WAIT timeout to 30 seconds instead of 120<br>

>sudo /sbin/sysctl -w net.ipv4.tcp_fin_timeout=30<br>

<br>

</div>That may help, but change the TIME_WAIT time (which isn't really a<br>

"timeout", it's not waiting for anything to "happen") it does not, as<br>

one might guess from the name of the variable. It reduces the timeout<br>

waiting for a close from the peer in the FIN_WAIT_2 state (see the state<br>

diagram in RFC 793), by default 60 seconds on Linux I believe. Note that<br>

this state is generally of short duration, and you shouldn't hit the timeout<br>

unless connectivity with the peer is lost - but reducing it too<br>

aggressively might cause loss of data.<br>

<br>

As far as I know there is no way to reduce the TIME_WAIT time on Linux<br>

other than modifying the kernel - it's a #define (60*HZ) in a kernel<br>

header file. There are other ways to deal with the problem of having a<br>

lot of connections in TIME_WAIT on Linux though.<br>

<div><br>

>2009/2/13 steve ellis <<a href="mailto:steve.e.123@gmail.com" target="_blank">steve.e.123@gmail.com</a>><br>

><br>

>> We're trying to build an app that uses ibrowse to make concurrent requests.<br>

>> We are not able to get more than a few concurrent requests at a time to<br>

>> return successfully. We repeatedly get "conn_failed"<br>

<br>

</div>If that's all you get, I'm afraid it's pretty useless. Did it time out,<br>

get "connection refused" i.e. RST, was the connection established but<br>

immediately closed, or did it run into the "lack of ports" problem? I<br>

wouldn't say that unhandled "let it crash" is appropriate for problems<br>

occurring way below the user interface of an application, but a badmatch<br>

would at least have told us what gen_tcp said.<br>

<br>

Anyway if you get problems with connections in the low hundreds, "lack<br>

of ports" is really unlikely. So dig deeper instead of blindly trying to<br>

fix a problem that you may not have - find the place(s) in the source<br>

where 'conn_failed' is generated and make it/them report what actually<br>

happened, and/or use tcpdump or similar to figure out what goes wrong<br>

with the connections.<br>

<font color="#888888"><br>

--Per<br>

</font></blockquote></div><br>