[erlang-questions] A problem with exiting an Erlang node

adam chan 114999420@REDACTED
Wed Nov 12 11:18:22 CET 2014


Hi List,


I have a problem stopping an Erlang node.
When I call erlang:halt(), the node looks dead but never actually exits, and CPU usage goes up to 100%.


Here is the situation:
I'm running OTP_R15B02 on CentOS 6.3.


I have 3 nodes named 'server', 'unite' and 'gateway', which are connected to each other.
The 'gateway' node listens on a port, receives socket data from clients, and forwards it to 'server' and 'unite'.
Responses from 'server' and 'unite' are sent back to the client through the 'gateway' node as well.

When I stop all 3 nodes, the 'gateway' node occasionally (with low probability) CANNOT exit completely.
The nodes run inside Linux screen sessions; the start scripts look like this:


[start_all.sh]
...
/usr/bin/screen -dmS server -s $ScriptPath/start_server.sh $Log
...
/usr/bin/screen -dmS unite -s $ScriptPath/start_unite.sh $Log
...
/usr/bin/screen -dmS gateway -s $ScriptPath/start_gateway.sh $Log


[start_gateway.sh]
#!/bin/bash
cd /data/web/server/server/config
ulimit -s 262140
erl -kernel inet_dist_listen_min 40001 -kernel inet_dist_listen_max 40100 +P 1024000 +K true -smp disable -name gateway@REDACTED -setcookie abc -boot start_sasl -config gs_main -pa ../ebin -s gs_main start -extra 192.168.7.100 9001 2





I stop the nodes in the order 'gateway' -> 'unite' -> 'server'.
The stop scripts look like this:
[stop_all.sh]
#!/bin/bash
cd /data/web/server/server/scripts/
chmod +x stop_gateway.sh
chmod +x stop_unite.sh
chmod +x stop_server.sh
./stop_gateway.sh
./stop_unite.sh
./stop_server.sh



[stop_gateway.sh]
#!/bin/bash
cd /data/web/server/server/config
erl -noshell -hidden -name stop_gateway@REDACTED -setcookie abc -pa ../ebin -eval "rpc:call('gateway@REDACTED', gs_main, stop, [])." -s c q



[gs_main.erl]
-define(SERVER_APPS, [sasl, gs_main]).
...
stop() ->
    ok = stop_applications(?SERVER_APPS),
    erlang:halt().
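stop_all.sh runs the three stop scripts back to back and never checks whether the gateway's beam process actually went away, so a hang like this one only shows up later in screen -ls. Below is a minimal sketch of such a check, assuming the scripts run as root (as the ps output in this message suggests) so that kill -0 is permitted on the node's process. wait_for_pid_exit is a hypothetical helper name, not part of the original scripts.

```shell
#!/bin/bash
# Hypothetical helper, not part of the original start/stop scripts.
# wait_for_pid_exit PID TIMEOUT
#   Polls once per second until process PID no longer exists.
#   Returns 0 once it is gone, or 1 if it is still alive after TIMEOUT seconds.
wait_for_pid_exit() {
    local pid="$1" timeout="$2" waited=0
    # kill -0 sends no signal; it only checks that the process exists
    # and that we may signal it (always true when running as root).
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$waited" -ge "$timeout" ]; then
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}
```

In stop_all.sh the PID could be captured before stopping and checked afterwards, e.g. pid=$(pgrep -f 'beam.*-name gateway@'), then ./stop_gateway.sh, then wait_for_pid_exit "$pid" 30 || echo "gateway (pid $pid) did not exit" >&2. The 30-second timeout and the pgrep pattern are assumptions; the pattern relies on the beam command line containing "-name gateway@", as it does in the ps output in this message. This only detects the hang; it does not explain it.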





The 'server' and 'unite' nodes exit completely every time, and the screen sessions running them exit as well.
But the 'gateway' node occasionally (with low probability) does not exit, and its screen session remains too:


[root@REDACTED logs]# screen -ls
There are screens on:
        20107.gateway  (Detached)


[root@REDACTED logs]# ps -ef | grep gateway
root     20107     1  0 Nov10 ?        00:00:00 /usr/bin/SCREEN -dmS gateway -s /data/web/server/server/scripts/start_gateway.sh -L -c /data/web/server/server/var/logs/screenrc_gateway
root     20110 20107  0 Nov10 pts/7    00:00:00 /bin/bash /data/web/server/server/scripts/start_gateway.sh
root     20111 20110 90 Nov10 pts/7    1-19:56:53 /usr/local/lib/erlang/erts-5.9.2/bin/beam -P 1024000 -K true -- -root /usr/local/lib/erlang -progname erl -- -home /root -- -kernel inet_dist_listen_min 40001 -kernel inet_dist_listen_max 40100 -smp disable -name gateway@REDACTED -setcookie abc -boot start_sasl -config gs_main -pa ../ebin -s gs_main start -extra 192.168.7.100 9001 2



[root@REDACTED logs]# strace -c -p 20111
Process 20111 attached - interrupt to quit
^CProcess 20111 detached



strace shows nothing here; no system call activity is reported. Meanwhile one CPU core keeps running at 100%.
The last entry in the 'gateway' node's log reports that the application has exited:
[gateway.log]
=INFO REPORT==== 11-Nov-2014::10:21:18 ===
    application: gs_main
    exited: stopped
    type: temporary



It seems that some endless loop occurred after the =INFO REPORT= was printed.

The node (the beam process) has not really exited; otherwise 'ps -ef | grep gateway' would not find process 20111.


Any ideas?
Thanks in advance.


------------------
Adam Chan