[erlang-questions] A problem with exit erlang node.

adam chan <>
Thu Nov 13 10:25:26 CET 2014


I added some code to stop the receiver and sender supervisors, and also a sleep(500) before calling erlang:halt(),
but the problem remains.


[gs_gateway.erl]
...
stop() ->
    supervisor:terminate_child(gs_sup, gs_tcp_client_sup),
    supervisor:terminate_child(gs_sup, gs_tcp_listener_sup),
    timer:sleep(500),
    io:format("gs_gateway stop~n"),
    ok.



gs_tcp_client_sup is the supervisor of the receiver processes:
[gs_tcp_client_sup.erl]
...
start_link(Mod) ->
    supervisor:start_link({local,?MODULE}, ?MODULE, [Mod]).
init([Mod]) ->
    {ok, {{simple_one_for_one, 10, 10},
          [{Mod, {Mod, start_link, []},
            temporary, brutal_kill, worker, [Mod]}]}}.



Since I use the 'brutal_kill' option, the receiver process is killed immediately when gs_tcp_client_sup is terminated.
In that case the receiver process obviously never calls gen_tcp:close(Socket).
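
For comparison, here is a minimal sketch of a receiver that does get a chance to close its socket: it is given a shutdown timeout instead of brutal_kill, traps exits, and calls gen_tcp:close/1 in terminate/2. The module name gs_receiver_sketch, the 5000 ms timeout, and the assumption that start_link/1 receives the accepted socket are all hypothetical, since the real receiver module is not shown in this thread.

```erlang
-module(gs_receiver_sketch).
-behaviour(gen_server).

-export([start_link/1, child_spec/0]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2,
         terminate/2, code_change/3]).

%% Child spec in the same tuple format used by gs_tcp_client_sup, but with
%% a 5000 ms shutdown timeout instead of brutal_kill (value is an assumption).
child_spec() ->
    {?MODULE, {?MODULE, start_link, []},
     temporary, 5000, worker, [?MODULE]}.

%% Assumption: the supervisor passes the accepted socket as the only argument.
start_link(Socket) ->
    gen_server:start_link(?MODULE, Socket, []).

init(Socket) ->
    %% Without trap_exit, the supervisor's shutdown signal terminates the
    %% process directly and terminate/2 is never called.
    process_flag(trap_exit, true),
    {ok, Socket}.

handle_call(_Request, _From, Socket) -> {reply, ok, Socket}.

handle_cast(_Msg, Socket) -> {noreply, Socket}.

%% Stop when the peer closes the connection; ignore everything else.
handle_info({tcp_closed, Socket}, Socket) -> {stop, normal, Socket};
handle_info(_Info, Socket) -> {noreply, Socket}.

%% Runs on normal stop and on supervisor shutdown (because exits are trapped).
terminate(_Reason, Socket) ->
    gen_tcp:close(Socket),
    ok.

code_change(_OldVsn, Socket, _Extra) -> {ok, Socket}.
```

Note that a socket is a port, and the runtime closes a port when its owning process dies. So if the receiver is the socket's controlling process, brutal_kill by itself should not leave the socket open; the sketch only matters if something has to happen before the close.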


It is true that my 'gateway' node may have some bugs, even though it appears to work normally.
When I strace the normally working 'gateway' node, it shows many 'recvfrom' errors, and the count grows over time:
[ scripts]# strace -c -p 21376
Process 21376 attached - interrupt to quit
^CProcess 21376 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 94.61    0.004370           2      1988           epoll_wait
  1.91    0.000088           1        61           writev
  1.28    0.000059           1       107           times
  1.28    0.000059           1       118           epoll_ctl
  0.93    0.000043           0       105        15 recvfrom
------ ----------- ----------- --------- --------- ----------------
100.00    0.004619                  2379        15 total
[ scripts]# strace -p 21376 -e 'recvfrom'
Process 21376 attached - interrupt to quit
recvfrom(28, "\0\0\0\16N!", 6, 0, NULL, NULL) = 6
recvfrom(28, "\0\0\0\307\0\3\r\252", 8, 0, NULL, NULL) = 8
recvfrom(28, 0x7fac5a3d5e00, 6, 0, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
recvfrom(23, "\0\0\0\204\203D\3\5\2D\5\350h\3a\2R\1gR\0\0\0\0\314\0\0\0\0\1h\2"..., 1460, 0, NULL, NULL) = 136
recvfrom(23, "\0\0\0001\203D\3\5\2D\5\350h\3a\2R\1gR\0\0\0\0\314\0\0\0\0\1h\2"..., 1460, 0, NULL, NULL) = 53
recvfrom(28, "\0\0\0\31N\"", 6, 0, NULL, NULL) = 6
recvfrom(28, "\0\0\0\0\0\0\0006\0\3dev\0\1\0\3\16\v", 19, 0, NULL, NULL) = 19
recvfrom(28, 0x7fac5a3d5e00, 6, 0, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
recvfrom(23, "\0\0\0V\203D\3\5\2D\5\350h\3a\2R\1gR\0\0\0\0\314\0\0\0\0\1h\2"..., 1460, 0, NULL, NULL) = 170
recvfrom(28, "\0\0\0\16N!", 6, 0, NULL, NULL) = 6
recvfrom(28, "\0\0\0\307\0\3\r\251", 8, 0, NULL, NULL) = 8
recvfrom(28, 0x7fac5a3d5e00, 6, 0, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
...


But I don't know how to dig deeper here to find out why there are recvfrom errors.

In addition, I used the pstack command to inspect the stack of the 'fake dead' gateway node that did not exit normally:
[ scripts]# pstack 27304
#0  0x0000000000496233 in erts_deliver_time ()
#1  0x000000000051ce2c in erts_check_io_kp ()
#2  0x0000000000516a4d in erl_sys_schedule ()
#3  0x000000000048d406 in schedule ()
#4  0x00000000004fcb34 in process_main ()
#5  0x0000000000450fd5 in erl_start ()
#6  0x0000000000435add in main ()
[ scripts]# pstack 27304
#0  0x000000000051bf2d in erts_poll_interrupt_kp ()
#1  0x000000000048d431 in schedule ()
#2  0x00000000004fcb34 in process_main ()
#3  0x0000000000450fd5 in erl_start ()
#4  0x0000000000435add in main ()



It seems that the Erlang scheduler is still running after erlang:halt().
So, any new suggestions?
Thanks~



------------------ Original ------------------
From:  "Imants Cekusins";<>;
Date:  Thu, Nov 13, 2014 07:35 AM
To:  "adam chan"<>; 
Cc:  "erlang-questions"<>; 
Subject:  Re: [erlang-questions] A problem with exit erlang node.




Could the 'gateway' node close its open sockets before shutting down?
 
Maybe pause for a second after sending the exit signal to the gateway node and stopping the apps, but before calling erlang:halt()?
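
One way to act on this suggestion, as a rough sketch rather than a known fix: stop the applications, pause, and then halt with output flushing disabled. The module name gs_stop_sketch and the 1000 ms pause are hypothetical. The sketch also assumes that erlang:halt/2 and its {flush, boolean()} option are available on the release in use; please check the erlang:halt documentation for your erts version before relying on it.

```erlang
-module(gs_stop_sketch).
-export([stop/1, halt_options/0]).

%% Options passed to erlang:halt/2. {flush, false} asks the emulator to
%% exit without first waiting for pending port output to be flushed.
halt_options() ->
    [{flush, false}].

%% Stop the given applications in list order, pause briefly, then halt
%% with exit status 0. Errors from application:stop/1 (for example,
%% {error, {not_started, App}}) are deliberately ignored.
stop(Apps) ->
    lists:foreach(fun(App) -> _ = application:stop(App) end, Apps),
    timer:sleep(1000),
    erlang:halt(0, halt_options()).
```

It would be called as gs_stop_sketch:stop([gs_main, sasl]). init:stop/0 is the other documented way to take a node down in an orderly fashion, and may be worth trying in place of the explicit halt.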
 On 12 Nov 2014 17:14, "adam chan" <> wrote:
Hi List,


I have a problem stopping (exiting) an Erlang node.
When I call erlang:halt(), the node is 'fake dead' (it looks stopped but the OS process remains), and the CPU goes up to 100%.


Here is the situation:
I'm running OTP R15B02 on CentOS 6.3.


I have 3 nodes named 'server', 'unite' and 'gateway', which are connected to each other.
The 'gateway' node listens on a port, receives socket data from clients, and then forwards it to 'server' and 'unite'.
The response data from 'server' and 'unite' is sent back to the client through the 'gateway' node too.
 
When I want to stop all 3 nodes, the 'gateway' node sometimes (with small probability) CANNOT exit completely.
The nodes run inside screen sessions on Linux; the start scripts look like this:


[start_all.sh]
...
/usr/bin/screen -dmS server -s $ScriptPath/start_server.sh $Log
...
/usr/bin/screen -dmS unite -s $ScriptPath/start_unite.sh $Log
...
/usr/bin/screen -dmS gateway -s $ScriptPath/start_gateway.sh $Log


[start_gateway.sh]
#!/bin/bash
cd /data/web/server/server/config
ulimit -s 262140
erl -kernel inet_dist_listen_min 40001 -kernel inet_dist_listen_max 40100 +P 1024000 +K true -smp disable -name  -setcookie abc -boot start_sasl -config gs_main -pa ../ebin -s gs_main start -extra 192.168.7.100 9001 2





I stop the nodes in the order 'gateway' -> 'unite' -> 'server'.
The stop scripts look like this:
[stop_all.sh]
#!/bin/bash
cd /data/web/server/server/scripts/
chmod +x stop_gateway.sh
chmod +x stop_unite.sh
chmod +x stop_server.sh
./stop_gateway.sh
./stop_unite.sh
./stop_server.sh



[stop_gateway.sh]
#!/bin/bash
cd /data/web/server/server/config
erl -noshell -hidden -name  -setcookie abc -pa ../ebin -eval "rpc:call('', gs_main, stop, [])." -s c q



[gs_main.erl]
-define(SERVER_APPS, [sasl, gs_main]).
...
stop() ->
    ok = stop_applications(?SERVER_APPS),
    erlang:halt().





The 'server' and 'unite' nodes exit completely every time, and the screen sessions running them exit too.
But the 'gateway' node sometimes (with small probability) can't exit, and its screen session remains:


[ logs]# screen -ls
There are screens on:
        20107.gateway  (Detached)


[ logs]# ps -ef | grep gateway
root     20107     1  0 Nov10 ?        00:00:00 /usr/bin/SCREEN -dmS gateway -s /data/web/server/server/scripts/start_gateway.sh -L -c /data/web/server/server/var/logs/screenrc_gateway
root     20110 20107  0 Nov10 pts/7    00:00:00 /bin/bash /data/web/server/server/scripts/start_gateway.sh
root     20111 20110 90 Nov10 pts/7    1-19:56:53 /usr/local/lib/erlang/erts-5.9.2/bin/beam -P 1024000 -K true -- -root /usr/local/lib/erlang -progname erl -- -home /root -- -kernel inet_dist_listen_min 40001 -kernel inet_dist_listen_max 40100 -smp disable -name  -setcookie abc -boot start_sasl -config gs_main -pa ../ebin -s gs_main start -extra 192.168.7.100 9001 2



[ logs]# strace -c -p 20111
Process 20111 attached - interrupt to quit
^CProcess 20111 detached



The strace command shows nothing here, and one CPU core keeps running at 100%.
At the end of the 'gateway' node's log, it says the application has exited:
[gateway.log]
=INFO REPORT==== 11-Nov-2014::10:21:18 ===
    application: gs_main
    exited: stopped
    type: temporary



It seems that some endless loop occurred after the =INFO REPORT= was printed.

The node has not really exited; otherwise the 'ps -ef | grep gateway' command would not find process 20111.


Any ideas?
Thanks in advance.


------------------
Adam Chan

 


 
