[erlang-questions] Erlang crash gen_tcp related (probably only under Windows)
Edwin Fine
erlang-questions_efine@REDACTED
Sat Sep 13 05:49:14 CEST 2008
Michael,
I was able to duplicate your problem on Windows XP *SP3* ;) on an Intel
E6600 dual-core system in 32-bit mode with 4 GB of RAM, running Erlang R12B-3.
It took more than 1000 processes (I used 10000), at which point the emulator
crashed with the same error message you reported. It crashed in the second
test, but I didn't try to get the first test to crash after that.
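For anyone repeating the experiment, it may help to tally the connect results rather than let each process die with a badmatch, so that the mix of error reasons reported further down the thread (econnrefused, eaddrinuse, system_limit) is visible at a glance. The following is only a sketch: the module name tcp_probe, the connect options and the 5000 ms connect timeout are assumptions of mine, while the address and port match the original tcp_test.

```erlang
-module(tcp_probe).
-export([run/1, run/3]).

%% Spawn N connection attempts against 127.0.0.1:2222 (same target as
%% tcp_test) and return a sorted tally such as
%% [{eaddrinuse,50},{econnrefused,950}].
run(N) ->
    run(N, {127,0,0,1}, 2222).

run(N, Ip, Port) ->
    Parent = self(),
    Ref = make_ref(),
    lists:foreach(
      fun(_) -> spawn(fun() -> Parent ! {Ref, attempt(Ip, Port)} end) end,
      lists:seq(1, N)),
    collect(Ref, N, #{}).

%% One attempt: 'ok' on success, otherwise the error reason.
%% Options and timeout are illustrative assumptions.
attempt(Ip, Port) ->
    case gen_tcp:connect(Ip, Port, [binary, {active, false}], 5000) of
        {ok, Sock} ->
            gen_tcp:close(Sock),
            ok;
        {error, Reason} ->
            Reason
    end.

%% Wait for exactly N results and count each distinct outcome.
collect(_Ref, 0, Tally) ->
    lists:sort(maps:to_list(Tally));
collect(Ref, N, Tally) ->
    receive
        {Ref, Result} ->
            Bump = fun(Count) -> Count + 1 end,
            collect(Ref, N - 1, maps:update_with(Result, Bump, 1, Tally))
    end.
```

A large share of eaddrinuse results from tcp_probe:run(10000) would point towards port exhaustion, whereas only econnrefused would suggest the ports are not the limiting factor. This does nothing about the emulator crash itself; it only makes the ordinary failures easier to read.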
At the very least, this should be reported as a bug to erlang-bugs because
the crash dump report is unhelpful.
I wonder if you are running into an ephemeral port starvation (TIME_WAIT)
problem? If you run netstat in a command window, you will most likely see
many ports in the TIME_WAIT state. One way to find out is to increase the
number of ephemeral ports and see whether that delays or eliminates the
problem, at least until you run with more processes or run the test more often.
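The netstat check can also be done from the Erlang shell. This is a rough sketch rather than a tested tool: the module name time_wait_count is made up for illustration, and it assumes that "netstat -an" is on the path and prints the state name TIME_WAIT literally, which is the case on the Windows and Linux systems I know of.

```erlang
-module(time_wait_count).
-export([count/0, count/1]).

%% Run netstat and count the lines that mention TIME_WAIT.
%% Assumes "netstat -an" is available on the path.
count() ->
    count(os:cmd("netstat -an")).

%% Count TIME_WAIT lines in netstat output that was already captured.
count(Output) ->
    Lines = string:split(Output, "\n", all),
    length([Line || Line <- Lines,
                    string:find(Line, "TIME_WAIT") =/= nomatch]).
```

Calling time_wait_count:count() before and right after a test run gives a crude idea of how many sockets the run left in TIME_WAIT.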
A good explanation about the TIME_WAIT state can be found at
http://www.developerweb.net/forum/showthread.php?t=2941
On Linux:
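# Lower tcp_fin_timeout from its default of 60 seconds to 30. (Strictly, this
# bounds FIN_WAIT_2; the kernel's TIME_WAIT interval is fixed at 60 seconds.)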
/sbin/sysctl -w net.ipv4.tcp_fin_timeout=30
On Windows:
In the registry, under
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters
- Add DWORD MaxUserPort = 65534 (decimal).
- Add DWORD TcpTimedWaitDelay = 60 (decimal).
Explanation:
MaxUserPort controls the maximum port number that is used when a program
requests any available user port from the system. Typically, ephemeral
(short-lived) ports are allocated between the values of 1024 and 5000
inclusive, so the default value is 5000. Increasing it to 65534 (decimal)
makes far more ephemeral ports available.
TcpTimedWaitDelay determines the time that a connection stays in the
TIME_WAIT state when it is closing. As long as a connection is in the
TIME_WAIT state, the socket pair cannot be re-used. This is also known as
the "2MSL" state. According to RFC793, the value should be two times the
maximum segment lifetime on the network. See RFC793 for more information.
However, in fast LAN environments (same-segment 1GB/sec, for example), this
can be lowered to 60 seconds or less.
TcpTimedWaitDelay - Set to 60
Key: Tcpip\Parameters
Value Type: REG_DWORD - Time in seconds
Valid Range: 30-300 (decimal)
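To put numbers on the two settings, here is a back-of-the-envelope calculation. It is a simplification: it assumes every closed connection keeps its local port unavailable for the full TIME_WAIT interval, and the module and function names are invented for the example.

```erlang
-module(port_budget).
-export([ports/2, max_rate/3]).

%% Number of ephemeral ports in the inclusive range First..Last.
ports(First, Last) when Last >= First ->
    Last - First + 1.

%% Rough upper bound on new outbound connections per second that can be
%% sustained if each closed connection ties up its port for TimeWaitSecs.
max_rate(First, Last, TimeWaitSecs) when TimeWaitSecs > 0 ->
    ports(First, Last) div TimeWaitSecs.
```

With the default range, port_budget:ports(1024, 5000) is 3977 ports, so a 60 second TIME_WAIT allows roughly port_budget:max_rate(1024, 5000, 60) = 66 new connections per second. Raising MaxUserPort to 65534 gives port_budget:max_rate(1024, 65534, 60) = 1075 per second under the same assumption.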
Hope this helps.
On Fri, Sep 12, 2008 at 6:20 PM, Edwin Fine
<erlang-questions_efine@REDACTED> wrote:
> My mistake, thanks.
>
>
> On Fri, Sep 12, 2008 at 5:42 PM, Michael Regen <michael.regen@REDACTED> wrote:
>
>> I would be very interested to hear about some tests from others! And
>> Edwin, I guess you mean SP3. Windows XP SP3 is the most recent service pack.
>> SP4 is if at all a giant trojan, isn't it? ;)
>>
>> Regards,
>> Michael
>>
>>
>> On Fri, Sep 12, 2008 at 9:01 PM, Edwin Fine <
>> erlang-questions_efine@REDACTED> wrote:
>>
>>> I hear you :).
>>>
>>> When I've got some time I will try out my R12B-3 installation on SP4 with
>>> your program, but can't right now. I am interested to see what happens.
>>>
>>> Rgds,
>>> Ed
>>>
>>>
>>> On Fri, Sep 12, 2008 at 2:47 PM, Michael Regen <michael.regen@REDACTED> wrote:
>>>
>>>> Just tested it on another machine with SP3 installed. No difference.
>>>> Same problem.
>>>>
>>>> Yes, Windows is flaky, and I personally would like to be able to say I
>>>> would rather go to hell than install a server on Windows that is
>>>> expected to run robustly.
>>>> But well, either Erlang is robust on Windows as well or no Erlang on
>>>> Windows. :(
>>>>
>>>> Regards,
>>>> Michael
>>>>
>>>> --
>>>> Quote from a >3000 employees IT centric company's CIO I had the pleasure
>>>> to witness four weeks ago: 'For the messaging back end? No, we can't use
>>>> Java. Java is too slow.'
>>>>
>>>>
>>>>
>>>>
>>>> On Fri, Sep 12, 2008 at 7:54 PM, Edwin Fine <
>>>> erlang-questions_efine@REDACTED> wrote:
>>>>
>>>>> Michael,
>>>>>
>>>>> I've always felt that the Windows version of Erlang is a bit flaky.
>>>>> Then again, I think Windows itself is more than a bit flaky, so maybe it's
>>>>> not Erlang's fault ;)
>>>>> I wonder if running on SP4 would improve things?
>>>>>
>>>>>
>>>>> On Fri, Sep 12, 2008 at 1:39 PM, Michael Regen <
>>>>> michael.regen@REDACTED> wrote:
>>>>>
>>>>>> Hi Edwin,
>>>>>>
>>>>>> It is possible that both issues have a similar source but I do not see
>>>>>> many reasons why there must be a common source.
>>>>>> I was running my tests on a 32bit single core Windows XP SP2 system
>>>>>> just by running
>>>>>> werl.exe -boot start_sasl
>>>>>> or
>>>>>> werl.exe
>>>>>>
>>>>>> and did nothing fancy. My R12B-3 version is self compiled, R12B-4 is
>>>>>> out of the erlang.org box.
>>>>>> Client and server tests were done by starting two different instances
>>>>>> of werl.
>>>>>> Furthermore my tcp_test:test does not care whether results from
>>>>>> gen_tcp:connect are correct or not. It just assumes {ok, Socket} and crashes
>>>>>> the process if otherwise. Of course it was a surprise that under some
>>>>>> circumstances the whole emulator crashes.
>>>>>>
>>>>>> By the way, the crash dump slogan is unspectacular: 'Slogan:
>>>>>> Inconsistent, why isnt io reported?'
>>>>>>
>>>>>> UPDATE: I got some more observations which puzzle me even more:
>>>>>>
>>>>>> Just did some more of the same tests but this time by starting:
>>>>>> erl.exe
>>>>>> application:start(tcp_server).
>>>>>> and
>>>>>> erl.exe
>>>>>> tcp_test:test(1000).
>>>>>>
>>>>>> There seems to be a difference between erl.exe and werl.exe.
>>>>>>
>>>>>> This time results are pretty different:
>>>>>> Now it is much harder to crash the emulator. It takes significantly more
>>>>>> processes / tries until something bad happens:
>>>>>>
>>>>>> client only (tcp_test:test(5000)) crashes eventually in the same way
>>>>>> but Windows' cmd.exe now follows with a:
>>>>>> The exception unknown software exception (0x40000015) occurred in the
>>>>>> application at location 0x008fff86
>>>>>>
>>>>>> after the 'Crash dump was written to: erl_crash.dump / Inconsistent,
>>>>>> why isnt io reported?' message and the crash dump file.
>>>>>>
>>>>>> The exception seems to always occur at the same location.
>>>>>>
>>>>>> A lot more error messages are printed now (as expected) until the
>>>>>> crash.
>>>>>> Besides the {{badmatch,{error,econnrefused}},[{tcp_test,test_con,0}]}
>>>>>> I can now also watch lots of
>>>>>> {{badmatch,{error,eaddrinuse}},[{tcp_test,test_con,0}]}
>>>>>> and
>>>>>> {{badmatch,{error,system_limit}},[{tcp_test,test_con,0}]}
>>>>>> errors.
>>>>>>
>>>>>> The good news: During tests together with the server backend I was
>>>>>> not able to crash the server. But I am not convinced that erl.exe solves
>>>>>> everything server side.
>>>>>>
>>>>>> Regards,
>>>>>> Michael
>>>>>>
>>>>>>
>>>>>> On Fri, Sep 12, 2008 at 6:32 PM, Edwin Fine <emofine@REDACTED> wrote:
>>>>>>
>>>>>>> Please be aware that I reported a bug a while ago on erlang-bugs,
>>>>>>> where attempting to connect to a socket that is not being listened on will
>>>>>>> sometimes return an actual success return, but subsequent operations will
>>>>>>> fail. Here is an excerpt from that bug report.
>>>>>>>
>>>>>>> When calling gen_tcp:connect/3 or /4 on a host/port that does not have a
>>>>>>> running program listening on it, at random intervals gen_tcp:connect returns
>>>>>>> an {ok, Sock} instead of the expected {error, econnrefused}. If
>>>>>>> gen_tcp:recv(Sock, 0) is called immediately using the socket just returned,
>>>>>>> it returns an {error, econnrefused}. Connection options used were [binary,
>>>>>>> {packet, raw}, {active, false}]. It should be noted that the gen_tcp:connect
>>>>>>> succeeds when there is a program listening on that same host/port, so it's
>>>>>>> unlikely to be a firewall issue.
>>>>>>>
>>>>>>> See
>>>>>>> http://www.erlang.org/pipermail/erlang-bugs/2008-August/000931.html
>>>>>>>
>>>>>>> This bug is still present in R12B-4. Could this be affecting you?
>>>>>>>
>>>>>>> Regards,
>>>>>>> Edwin Fine
>>>>>>>
>>>>>>> 2008/9/12 Michael Regen <michael.regen@REDACTED>
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I got a series of troubles with gen_tcp all eventually resulting in
>>>>>>>> crashes. I tested this under Windows XP and with R12B-3 as well as R12B-4.
>>>>>>>> Under Linux it seems to work but I am not perfectly sure since the crash
>>>>>>>> happens sporadically and seems to be timing related.
>>>>>>>>
>>>>>>>> The two problems below lead me to a couple of questions:
>>>>>>>> a) What is the real cause? Is it the socket error enfile? Do both
>>>>>>>> problems have the same root cause?
>>>>>>>> b) Is there a bug in Erlang? I guess this should not lead to a
>>>>>>>> crash.
>>>>>>>> c) How do you avoid this problem on systems you do not control
>>>>>>>> yourself?
>>>>>>>>
>>>>>>>>
>>>>>>>> Problem #1:
>>>>>>>> ###########
>>>>>>>>
>>>>>>>> Just compile the following code and run it with sasl enabled and the
>>>>>>>> following command:
>>>>>>>> tcp_test:test(1000).
>>>>>>>> and - yes - without anything listening on port 2222. And sometimes
>>>>>>>> you have to try two times!
>>>>>>>>
>>>>>>>> -------------------------- start: tcp_test.erl --------------------------
>>>>>>>> -module(tcp_test).
>>>>>>>>
>>>>>>>> -export([test/1, test_con/0]).
>>>>>>>>
>>>>>>>> -define(DEF_PORT, 2222).
>>>>>>>> -define(DEF_IP, {127,0,0,1}).
>>>>>>>>
>>>>>>>> test(0) -> ok;
>>>>>>>> test(HowManyProcs) ->
>>>>>>>>     spawn(?MODULE, test_con, []),
>>>>>>>>     test(HowManyProcs-1).
>>>>>>>>
>>>>>>>> test_con() ->
>>>>>>>>     {ok,S} = gen_tcp:connect(?DEF_IP, ?DEF_PORT,[]),
>>>>>>>>     gen_tcp:send(S,<<0,5,65,66,67,68,69>>),
>>>>>>>>     receive
>>>>>>>>         {tcp_closed, _Socket} -> ok;
>>>>>>>>         _Msg -> gen_tcp:close(S)
>>>>>>>>     after 500 ->
>>>>>>>>         gen_tcp:close(S)
>>>>>>>>     end.
>>>>>>>> -------------------------- end: tcp_test.erl --------------------------
>>>>>>>>
>>>>>>>> It just spawns a bunch of processes all trying to connect to a
>>>>>>>> currently closed port and sending some garbage there. This is what happens:
>>>>>>>>
>>>>>>>> -------------------------- start: log tcp_test.erl --------------------------
>>>>>>>> =ERROR REPORT==== 12-Sep-2008::15:28:47 ===
>>>>>>>> Error in process <0.41.0> with exit value:
>>>>>>>> {{badmatch,{error,econnrefused}},[{tcp_test,test_con,0}]}
>>>>>>>>
>>>>>>>> [... a couple of them but usually between 1 and 20.]
>>>>>>>>
>>>>>>>> =ERROR REPORT==== 12-Sep-2008::15:28:47 ===
>>>>>>>> Error in process <0.103.0> with exit value:
>>>>>>>> {{badmatch,{error,econnrefused}},[{tcp_test,test_con,0}]}
>>>>>>>>
>>>>>>>>
>>>>>>>> Crash dump was written to: erl_crash.dump
>>>>>>>> Inconsistent, why isnt io reported?
>>>>>>>>
>>>>>>>> Abnormal termination
>>>>>>>> -------------------------- end: log tcp_test.erl --------------------------
>>>>>>>>
>>>>>>>> It might have something to do with the socket error enfile 'file
>>>>>>>> table overflow' but I guess it should not simply crash the emulator!?
>>>>>>>> Searching google for 'Inconsistent, why isnt io reported?' just
>>>>>>>> gives one hit to Erlang's source code.
>>>>>>>> I can provide the crash dump if needed. Just did not want to spam
>>>>>>>> the whole list with big attachments.
>>>>>>>> Spawning only 500 processes (tcp_test:test(500).) usually leads to a
>>>>>>>> crash, spawning only 200 seems to work.
>>>>>>>>
>>>>>>>>
>>>>>>>> Problem #2:
>>>>>>>> ###########
>>>>>>>>
>>>>>>>> Now let's try the same with a server answering to port 2222: Just
>>>>>>>> take the code from the trapexit tutorial 'Building a Non-blocking TCP server
>>>>>>>> using OTP principles'
>>>>>>>> http://trapexit.org/Building_a_Non-blocking_TCP_server_using_OTP_principles
>>>>>>>> Start it first and then our test module in a different erlang node
>>>>>>>> as described above. Now, usually the client survives (have seen crashes as
>>>>>>>> well!) and the server crashes in a similar way. Sometimes it survives and in
>>>>>>>> very rare cases you will see the following logs in the erlang server
>>>>>>>> instance:
>>>>>>>>
>>>>>>>> -------------------------- start: log server --------------------------
>>>>>>>> =ERROR REPORT==== 12-Sep-2008::12:58:56 ===
>>>>>>>> File operation error: system_limit. Function: get_cwd. Process:
>>>>>>>> code_server.
>>>>>>>>
>>>>>>>> =ERROR REPORT==== 12-Sep-2008::12:58:56 ===
>>>>>>>> Error in async accept: {async_accept,"file table overflow"}.
>>>>>>>>
>>>>>>>> =ERROR REPORT==== 12-Sep-2008::12:58:56 ===
>>>>>>>> ** Generic server tcp_listener terminating
>>>>>>>> ** Last message in was
>>>>>>>> {inet_async,#Port<0.109>,1019,{ok,#Port<0.2141>}}
>>>>>>>> ** When Server state == {state,#Port<0.109>,1019,tcp_echo_fsm}
>>>>>>>> ** Reason for termination ==
>>>>>>>> ** {async_accept,"file table overflow"}
>>>>>>>>
>>>>>>>> [...]
>>>>>>>> -------------------------- end: log server --------------------------
>>>>>>>>
>>>>>>>> Can anyone help? Thank you!
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Michael
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> erlang-questions mailing list
>>>>>>>> erlang-questions@REDACTED
>>>>>>>> http://www.erlang.org/mailman/listinfo/erlang-questions