[erlang-questions] R17 - Possible wedged scheduler
Matthew Evans
mattevans123@REDACTED
Fri Dec 9 19:31:36 CET 2016
Thanks, I'll give that a go.
Looking at code_server.erl, there does appear to be a possibility of a deadlock, in that the function cpc_recv/4 does a receive with no timeout.
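To illustrate the difference a timeout makes, here is a minimal sketch of a bounded receive. The module and function names (recv_timeout_sketch, wait_reply/2) are hypothetical and are not taken from code_server.erl; the point is only that an "after" clause lets the caller give up instead of waiting forever.

```erlang
-module(recv_timeout_sketch).
-export([wait_reply/2]).

%% Hypothetical helper, not part of code_server: wait for a reply tagged
%% with Ref, but give up after TimeoutMs milliseconds instead of
%% blocking indefinitely.
wait_reply(Ref, TimeoutMs) ->
    receive
        {Ref, Reply} ->
            {ok, Reply}
    after TimeoutMs ->
        %% No matching message arrived in time; the caller decides
        %% whether to retry, log, or escalate.
        {error, timeout}
    end.
```

Whether a timeout is appropriate inside the code server itself is a separate design question; this only shows the mechanism.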
Thanks again
Matt
________________________________
From: Michael Truog <mjtruog@REDACTED>
Sent: Friday, December 9, 2016 1:03 PM
To: Matthew Evans; Erlang/OTP discussions
Subject: Re: [erlang-questions] R17 - Possible wedged scheduler
If this is scheduler collapse, it would mean you have a port driver or NIF that has internal latency greater than 1 millisecond. To handle scheduler collapse, you can use the erl command line option "-heart" combined with:
heart:set_options([check_schedulers]).
(see http://erlang.org/doc/man/heart.html#set_options-1 )
That will allow the system to restart when schedulers have collapsed.
A test that is meant to cause scheduler collapse is at:
https://github.com/basho/nifwait
So you could use that to prove to yourself that the behavior is the same with the system you are having problems with.
On 12/09/2016 08:35 AM, Matthew Evans wrote:
It happened again; it appears that code_server is wedged:
admin@REDACTED:~$ doErlangFun "erlang:process_info(whereis(code_server))."
[{registered_name,code_server},
{current_function,{code_server,cpc_recv,4}},
{initial_call,{erlang,apply,2}},
{status,waiting},
{message_queue_len,23},
{messages,[{code_call,<6805.4097.0>,{ensure_loaded,switch_type_module}},
{code_call,<6805.4146.0>,{ensure_loaded,switch_type_module}},
{code_call,<6805.941.0>,{ensure_loaded,pc_port_autoneg}},
{code_call,<6805.541.0>,{ensure_loaded,plexxiStatistics_types}},
{code_call,<6805.520.0>,{ensure_loaded,switch_type_module}},
{code_call,<6805.5123.0>,{ensure_loaded,secondary_erlang_node}},
{code_call,<6805.5122.0>,{ensure_loaded,secondary_erlang_node}},
{code_call,<6805.5162.0>,{ensure_loaded,icmp}},
{code_call,<6805.5321.0>,
{ensure_loaded,mac_entries_record_handler}},
{code_call,<6805.5483.0>,{ensure_loaded,icmp}},
{code_call,<6805.6647.0>,{ensure_loaded,icmp}},
{code_call,<6805.7232.0>,{ensure_loaded,icmp}},
{code_call,<6805.7274.0>,{ensure_loaded,icmp}},
{code_call,<6805.7304.0>,{ensure_loaded,icmp}},
{code_call,<6805.8889.0>,
{ensure_loaded,mac_entries_record_handler}},
{code_call,<6805.8951.0>,
{ensure_loaded,mac_entries_record_handler}},
{code_call,<6805.576.0>,
{ensure_loaded,cross_connect_unicast_utils}},
{code_call,<6805.19300.12>,{ensure_loaded,shell}},
{code_call,<6805.20313.12>,{ensure_loaded,shell}},
{code_call,<6805.21339.12>,{ensure_loaded,dbg}},
{code_call,<6805.31109.13>,get_mode},
{code_call,<6805.1255.14>,get_mode},
{system,{<6805.2521.14>,#Ref<6805.0.23.35356>},get_status}]},
{links,[<6805.11.0>]},
{dictionary,[{any_native_code_loaded,false}]},
{trap_exit,true},
{error_handler,error_handler},
{priority,normal},
{group_leader,<6805.9.0>},
{total_heap_size,86071},
{heap_size,10958},
{stack_size,25},
{reductions,13172282},
{garbage_collection,[{min_bin_vheap_size,46422},
{min_heap_size,233},
{fullsweep_after,65535},
{minor_gcs,71}]},
{suspending,[]}]
admin@REDACTED:~$
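When a process looks stuck like this, it can help to fetch only the items that show where it is blocked. The following is a sketch, assuming a hypothetical helper module wedge_probe; it uses erlang:process_info/2 with a list of items, including current_stacktrace, to show the call chain leading to the blocked receive.

```erlang
-module(wedge_probe).
-export([probe/1]).

%% Hypothetical helper: summarise the state of a registered process.
%% Returns {error, not_registered} if the name is unknown, 'undefined'
%% if the process died between whereis/1 and process_info/2, and a
%% list of {Item, Value} tuples otherwise.
probe(Name) when is_atom(Name) ->
    case whereis(Name) of
        undefined ->
            {error, not_registered};
        Pid ->
            erlang:process_info(Pid, [status,
                                      current_function,
                                      current_stacktrace,
                                      message_queue_len])
    end.
```

For example, wedge_probe:probe(code_server) could be run through the same rpc path used for the output above.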
________________________________
From: erlang-questions-bounces@REDACTED on behalf of Matthew Evans <mattevans123@REDACTED>
Sent: Friday, December 9, 2016 9:56 AM
To: Erlang/OTP discussions
Subject: [erlang-questions] R17 - Possible wedged scheduler
Hi,
We just hit a situation where it appeared that one scheduler was wedged. Some parts of our application were working, but others appeared to be stuck. I could connect via a cnode application and an escript, but I couldn't connect via the Erlang shell. We have an escript that does rpc calls; some worked, but others (e.g. anything involving the code server or tracing) failed.
CPU load was minimal at the time, and heart didn't complain. We only have a single NIF, but this is not called on this hardware variant. We do use CNODE to talk to C applications.
We are running R17, Intel quad core CPU on Debian.
This is the first time this has been seen, so the questions are:
1. Has anyone seen this before?
2. What can we do if we hit this condition in the future to debug?
3. Since heart doesn't detect this can anyone think of any alternative mechanisms?
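As one possible direction for question 3, here is a sketch of an application-level liveness probe. The module liveness_probe and its check/2 function are hypothetical: the idea is to run a call that depends on a critical server in a throwaway process and treat a missing answer within a deadline as a failure, so that the watchdog itself never blocks on the wedged server.

```erlang
-module(liveness_probe).
-export([check/2]).

%% Hypothetical watchdog helper: run Fun in a separate monitored
%% process and wait at most TimeoutMs milliseconds for its result.
check(Fun, TimeoutMs) when is_function(Fun, 0) ->
    Parent = self(),
    Tag = make_ref(),
    {Pid, MRef} = spawn_monitor(fun() -> Parent ! {Tag, Fun()} end),
    receive
        {Tag, Result} ->
            %% Worker answered in time; drop the pending 'DOWN'.
            erlang:demonitor(MRef, [flush]),
            {ok, Result};
        {'DOWN', MRef, process, Pid, Reason} ->
            %% Worker died before sending a result.
            {error, {crashed, Reason}}
    after TimeoutMs ->
        %% Worker is presumed stuck: kill it, wait for the 'DOWN'
        %% (always delivered after a kill), then discard any result
        %% that was sent just before the kill took effect.
        exit(Pid, kill),
        receive {'DOWN', MRef, process, Pid, _} -> ok end,
        receive {Tag, _Late} -> ok after 0 -> ok end,
        {error, timeout}
    end.
```

A periodic check such as liveness_probe:check(fun() -> code:get_mode() end, 5000) would return {error, timeout} if the call never completes; what to do on failure (log, dump state, restart the node) is left to the application.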
Thanks
Matt
_______________________________________________
erlang-questions mailing list
erlang-questions@REDACTED
http://erlang.org/mailman/listinfo/erlang-questions