node startup crash

Serge Aleynikov serge@REDACTED
Tue Apr 26 19:12:28 CEST 2005


After banging my head with this problem for a couple of days now, I 
finally figured out what was causing this crash.

In case of this server netstat ("netstat -a | grep 4369") wasn't seeing 
anything listening on 127.0.0.1:4369, yet any attempt to bind to it 
would fail, stating that the address was not available:

4> {ok, S} = gen_tcp:listen(4369, [{ip, {127,0,0,1}},{reuseaddr, 
true}]).

=ERROR REPORT==== 26-Apr-2005::12:44:32 ===
Error in process <0.32.0> with exit value: 
{{badmatch,{error,eaddrnotavail}},[{erl_eval,expr,3}]}

epmd starts by binding to port 4369 using INADDR_ANY (telling the kernel 
to choose the IP address for it).  So, since the loopback address 
"127.0.0.1" was taken by some other process, it bound to some other IP 
address ("172.16.202.77" in case of this host) defined by the eth0 
interface.

The registration attempts from Erlang were coming on the 127.0.0.1:4369 
address, and therefore couldn't find the listener failing with timeout.

I ended up recycling the loopback interface on the server ("/sbin/ifdown 
lo", "/sbin/ifup lo", and this cleared up its state, allowing epmd to 
bind properly.

7> f(S), {ok, S} = gen_tcp:listen(4369, [{ip, {127,0,0,1}},{reuseaddr, 
true}]).
{ok,#Port<0.89>}
8> q().

And finally the glorious start of the distributed erl:

~/tmp/otp_src_R10B-4>erl -sname a
Erlang (BEAM) emulator version 5.4.5 [source] [hipe] [threads:0]

Eshell V5.4.5  (abort with ^G)
(a@REDACTED)1>

I have no clue why the kernel couldn't fulfill the original bind 
request, other than to blame the state of that interface.

I wanted to share with you this finding so that this might be helpful 
for someone else experincing the same issue.

Regards,

Serge

Serge Aleynikov wrote:
> In my attempt to troubleshoot this issue of a "erl -sname a" startup 
> crash, I determined that the problem is indeed related to epmd.  I 
> slightly modified epmd_srv.c to output more tracing:
> 
> --- /home/serge/tmp/epmd_srv.c  2005-04-25 18:45:31.328089912 -0400
> +++ epmd_srv.c  2005-04-25 18:41:23.570754752 -0400
> @@ -227,8 +227,12 @@
>           if (g->delay_accept)          /* Test of busy server */
>             sleep(g->delay_accept);
> 
> -         if(FD_ISSET(listensock,&read_mask))
> -           do_accept(g,listensock);
> +          dbg_tty_printf(g,2,"select() triggered. read_mask=%d 
> (listensock=%d)", read_mask, listensock);
> +
> +         if(FD_ISSET(listensock,&read_mask)) {
> +           dbg_tty_printf(g,2,"accepting the socket");
> +            do_accept(g,listensock);
> +          }
> 
>           /* Go over all connections and look for open ones */
>           {
> 
> Then I ran epmd in the debug mode ("epmd -d -d"), fired off an "erl" in 
> a separate terminal, and typed:
> 
>  > net_kernel:start([a]).
> 
> The call hung, and crashed after a timeout
> =INFO REPORT==== 25-Apr-2005::18:51:07 ===
> Protocol: "inet_tcp": register error: {timeout,
>                                           {gen_server,
>                                               call,
>                                               [erl_epmd,
>                                                {register,a,35418},
>                                                15000]}}
> 
> This is what I saw in the epmd trace:
> epmd: Mon Apr 25 18:50:48 2005: epmd running - daemon = 0
> epmd: Mon Apr 25 18:50:48 2005: try to initiate listening port 4369
> epmd: Mon Apr 25 18:50:48 2005: starting
> epmd: Mon Apr 25 18:50:48 2005: entering the main select() loop
> epmd: Mon Apr 25 18:50:53 2005: select() triggered. read_mask=0 
> (listensock=0)
> epmd: Mon Apr 25 18:50:58 2005: select() triggered. read_mask=0 
> (listensock=0)
> epmd: Mon Apr 25 18:51:03 2005: select() triggered. read_mask=0 
> (listensock=0)
> epmd: Mon Apr 25 18:51:08 2005: select() triggered. read_mask=0 
> (listensock=0)
> 
> What's extreemly odd is that the listensock fd is not set in the 
> read_mask, even though the select() call woke up detecting data...
> 
> I suppose that there's some incompatibility with libc, but wouldn't 
> configure take care of this during the build process? Is there a 
> problem, perhaps that I am running the 2.6.8.1 kernel on this host?
> 
> ~/tmp/otp_src_R10B-4/erts/epmd/src>uname -a
> Linux stardev1.corp.idt.net 2.6.8.1 #2 SMP Tue Sep 28 16:04:54 EDT 2004 
> i686 i686 i386 GNU/Linux
> 
>  >ldd /home/serge/tmp/otp_src_R10B-4/bin/i686-pc-linux-gnu/epmd
>         libncurses.so.5 => /usr/lib/libncurses.so.5 (0x4002c000)
>         libdl.so.2 => /lib/libdl.so.2 (0x4006c000)
>         libm.so.6 => /lib/tls/libm.so.6 (0x4006f000)
>         libc.so.6 => /lib/tls/libc.so.6 (0x40091000)
>         libgpm.so.1 => /usr/lib/libgpm.so.1 (0x401c8000)
>         /lib/ld-linux.so.2 => /lib/ld-linux.so.2 (0x40000000)
> Serge
> 
> 
> Serge Aleynikov wrote:
> 
>> Thanks for the advice, but I tried that a couple of times before 
>> submitting this posting.  Then I also did:
>>
>> ~/otp_src_R10B-4>killall epmd beam
>> ~/otp_src_R10B-4>rm -fr /usr/local/lib/erlang
>> ~/otp_src_R10B-4>make install
>> ~/otp_src_R10B-4>erl -sname abc
>> {error_logger,{{2005,4,23},{21,41,36}},'Protocol: ~p: register error: 
>> ~p~n',[inet_tcp,{timeout,{gen_server,call,[erl_epmd,{register,abc,35363},15000]}}]} 
>>
>> {error_logger,{{2005,4,23},{21,41,36}},crash_report,[[{pid,<0.18.0>},{registered_name,net_kernel},{error_info,{error,badarg} 
>>
>> ...
>>
>> The same crash.  There's not a lot of stuff running on this host 
>> either.  Below is the output of "ps -ax".  We can also see that epmd 
>> was started from the right distribution (erts-5.4.5).  This is 
>> somewhat annoying, as  I am not sure what else can be checked in order 
>> to overcome this crash...
>>
>> Thank you.
>>
>> Serge
>>
>>   PID TTY      STAT   TIME COMMAND
>>     1 ?        S      0:00 init [3]
>>     2 ?        SW     0:00 [migration/0]
>>     3 ?        SWN    0:00 [ksoftirqd/0]
>>     4 ?        SW     0:00 [migration/1]
>>     5 ?        SWN    0:00 [ksoftirqd/1]
>>     6 ?        SW     0:03 [migration/2]
>>     7 ?        SWN    0:00 [ksoftirqd/2]
>>     8 ?        SW     0:00 [migration/3]
>>     9 ?        SWN    0:00 [ksoftirqd/3]
>>    10 ?        SW<    0:00 [events/0]
>>    11 ?        SW<    0:00 [events/1]
>>    12 ?        SW<    0:00 [events/2]
>>    13 ?        SW<    0:00 [events/3]
>>    14 ?        SW<    0:00 [khelper]
>>    15 ?        SW<    0:00 [kblockd/0]
>>    16 ?        SW<    0:00 [kblockd/1]
>>    17 ?        SW<    0:00 [kblockd/2]
>>    18 ?        SW<    0:00 [kblockd/3]
>>    53 ?        SW     0:00 [kirqd]
>>    54 ?        SW     0:00 [pdflush]
>>    55 ?        SW     0:11 [pdflush]
>>    56 ?        SW     0:00 [kswapd0]
>>    57 ?        SW<    0:00 [aio/0]
>>    58 ?        SW<    0:00 [aio/1]
>>    59 ?        SW<    0:00 [aio/2]
>>    60 ?        SW<    0:00 [aio/3]
>>   153 ?        SW     0:00 [kseriod]
>>   190 ?        SW     0:00 [scsi_eh_0]
>>   191 ?        SW     0:00 [aacraid]
>>   204 ?        SW     0:18 [kjournald]
>>   520 ?        SW     0:00 [khubd]
>>  1043 ?        SW     0:00 [kjournald]
>>  1044 ?        SW     0:00 [kjournald]
>>  1045 ?        SW     0:06 [kjournald]
>>  1463 ?        S      0:00 syslogd -m 0
>>  1467 ?        S      0:00 klogd -x
>>  1859 ?        S      0:02 /usr/sbin/sshd
>>  1884 ?        S      0:00 xinetd -stayalive -pidfile /var/run/xinetd.pid
>>  1947 ?        S      0:00 crond
>>  2010 ?        S      0:00 proftpd: (accepting connections)
>>  2017 tty1     S      0:00 /sbin/mingetty tty1
>>  2018 tty2     S      0:00 /sbin/mingetty tty2
>>  2019 tty3     S      0:00 /sbin/mingetty tty3
>>  2020 tty4     S      0:00 /sbin/mingetty tty4
>>  2021 tty5     S      0:00 /sbin/mingetty tty5
>>  2022 tty6     S      0:00 /sbin/mingetty tty6
>> 11707 ?        S      0:00 sshd: serge [priv]
>> 11709 ?        S      0:01 sshd: serge@REDACTED/1
>> 11710 pts/1    S      0:01 -bash
>> 31265 pts/1    S      0:00 rxvt -sr -fg white -bg black -fn Clean -sl 300
>> 31266 pts/0    S      0:00 bash
>> 27564 ?        S      0:00 /usr/local/lib/erlang/erts-5.4.5/bin/epmd 
>> -daemon
>> 27557 pts/0    R      0:00 ps -ax
>>
>> klacke@REDACTED wrote:
>>
>>> On Fri, Apr 22, 2005 at 10:05:23AM -0400, Serge Aleynikov wrote:
>>>
>>>> Hi,
>>>>
>>>> I upgraded Erlang on one of our Linux servers to R10B-4, and started 
>>>> getting this error at startup.  This problem wasn't observed before 
>>>> the upgrade.
>>>>
>>>>
>>>>> erl -sname a
>>>>
>>>>
>>>>
>>>> {error_logger,{{2005,4,21},{21,30,39}},'Protocol: ~p: register 
>>>> error: 
>>>> ~p~n',[inet_tcp,{timeout,{gen_server,call,[erl_epmd,{register,a,35269},15000]}}]} 
>>>>
>>>> {error_logger,{{2005,4,21},{21,30,39}},crash_report,
>>>
>>>
>>>
>>>
>>> .........
>>>
>>> One of the early Erlang slogans was: "Symolic information is always 
>>> available".
>>> Somehow, that spung to my mind what I saw your higly _symbolic_
>>> error message :-)
>>>
>>>
>>> It does though look as if there is some trouble while
>>> registering the node name with epmd.
>>> Maybe incompatible epmd vsns ??, 'killall epmd' and retry.
>>>
>>>
>>> /klacke
>>
>>
>>
> 

-- 
================================================================
| Serge Aleynikov                          Tel: (973) 438-3436
| MIS Telecom                              Fax: (973) 438-1457
| IDT Corp.                                   serge@REDACTED
================================================================



More information about the erlang-questions mailing list