[erlang-questions] R17 - Possible wedged scheduler

Fri Dec 9 19:31:36 CET 2016

Thanks, I'll give that a go.

Looking at code_server.erl, there does appear to be a possibility of a deadlock, inasmuch as the function cpc_recv/4 does a receive with no timeout.
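To illustrate the general concern (this is not the code_server source), here is a minimal sketch of a receive that gives up after a bounded wait. The module name, the function wait_reply/2 and the {Ref, Reply} message shape are hypothetical, chosen only for the example:

```erlang
-module(recv_timeout_sketch).
-export([wait_reply/2]).

%% Hypothetical helper, not taken from code_server.erl.
%% Waits for a {Ref, Reply} message, but returns {error, timeout}
%% after TimeoutMs milliseconds instead of blocking forever.
wait_reply(Ref, TimeoutMs) ->
    receive
        {Ref, Reply} ->
            {ok, Reply}
    after TimeoutMs ->
        {error, timeout}
    end.
```

Without the after clause, a process whose expected message never arrives stays in status waiting indefinitely while its message queue grows.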

Thanks again

Matt

________________________________
From: Michael Truog <mjtruog@REDACTED>
Sent: Friday, December 9, 2016 1:03 PM
To: Matthew Evans; Erlang/OTP discussions
Subject: Re: [erlang-questions] R17 - Possible wedged scheduler

If this is scheduler collapse, it would mean you have a port driver or NIF that has internal latency greater than 1 millisecond.  To handle scheduler collapse, you can use the erl command line option "-heart" combined with:
heart:set_options([check_schedulers]).

(see http://erlang.org/doc/man/heart.html#set_options-1 )

That will allow the system to restart when schedulers have collapsed.

A test that is meant to cause scheduler collapse is at:
https://github.com/basho/nifwait

So you could use that to prove to yourself that the behavior is the same with the system you are having problems with.

On 12/09/2016 08:35 AM, Matthew Evans wrote:

Happened again, it appears that code_server is wedged:

admin@REDACTED:~$ doErlangFun "erlang:process_info(whereis(code_server))."

[{registered_name,code_server},
 {current_function,{code_server,cpc_recv,4}},
 {initial_call,{erlang,apply,2}},
 {status,waiting},
 {message_queue_len,23},
 {messages,[{code_call,<6805.4097.0>,{ensure_loaded,switch_type_module}},
            {code_call,<6805.4146.0>,{ensure_loaded,switch_type_module}},
            {code_call,<6805.941.0>,{ensure_loaded,pc_port_autoneg}},
            {code_call,<6805.541.0>,{ensure_loaded,plexxiStatistics_types}},
            {code_call,<6805.520.0>,{ensure_loaded,switch_type_module}},
            {code_call,<6805.5123.0>,{ensure_loaded,secondary_erlang_node}},
            {code_call,<6805.5122.0>,{ensure_loaded,secondary_erlang_node}},
            {code_call,<6805.5162.0>,{ensure_loaded,icmp}},
            {code_call,<6805.5321.0>,
                       {ensure_loaded,mac_entries_record_handler}},
            {code_call,<6805.5483.0>,{ensure_loaded,icmp}},
            {code_call,<6805.6647.0>,{ensure_loaded,icmp}},
            {code_call,<6805.7232.0>,{ensure_loaded,icmp}},
            {code_call,<6805.7274.0>,{ensure_loaded,icmp}},
            {code_call,<6805.7304.0>,{ensure_loaded,icmp}},
            {code_call,<6805.8889.0>,
                       {ensure_loaded,mac_entries_record_handler}},
            {code_call,<6805.8951.0>,
                       {ensure_loaded,mac_entries_record_handler}},
            {code_call,<6805.576.0>,
                       {ensure_loaded,cross_connect_unicast_utils}},
            {code_call,<6805.19300.12>,{ensure_loaded,shell}},
            {code_call,<6805.20313.12>,{ensure_loaded,shell}},
            {code_call,<6805.21339.12>,{ensure_loaded,dbg}},
            {code_call,<6805.31109.13>,get_mode},
            {code_call,<6805.1255.14>,get_mode},
            {system,{<6805.2521.14>,#Ref<6805.0.23.35356>},get_status}]},
 {links,[<6805.11.0>]},
 {dictionary,[{any_native_code_loaded,false}]},
 {trap_exit,true},
 {error_handler,error_handler},
 {priority,normal},
 {group_leader,<6805.9.0>},
 {total_heap_size,86071},
 {heap_size,10958},
 {stack_size,25},
 {reductions,13172282},
 {garbage_collection,[{min_bin_vheap_size,46422},
                      {min_heap_size,233},
                      {fullsweep_after,65535},
                      {minor_gcs,71}]},
 {suspending,[]}]

admin@REDACTED:~$

________________________________
From: erlang-questions-bounces@REDACTED on behalf of Matthew Evans <mattevans123@REDACTED>
Sent: Friday, December 9, 2016 9:56 AM
To: Erlang/OTP discussions
Subject: [erlang-questions] R17 - Possible wedged scheduler

Hi,

We just hit a situation where it appeared that one scheduler was wedged. Some parts of our application were working, but others appeared to be stuck. I could connect via a cnode application and an escript, but I couldn't connect via the Erlang shell. We have an escript that does rpc calls; some worked, but others (e.g. anything involving the code server or tracing) failed.

CPU load was minimal at the time, and heart didn't complain. We only have a single NIF, but it is not called on this hardware variant. We do use C nodes to talk to C applications.

We are running R17, Intel quad core CPU on Debian.

This is the first time this has been seen, so the questions are:

1. Has anyone seen this before?

2. What can we do if we hit this condition in the future to debug?

3. Since heart doesn't detect this, can anyone think of any alternative mechanisms?
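For question 3, one possible approach is a small probe that inspects a registered process from the outside with erlang:process_info/2, in the same way the manual process_info check did, so it does not depend on the inspected process answering. This is only a sketch: the module name, check/1, check_proc/2 and the queue-length threshold are hypothetical, and a single snapshot can give false positives, so a real watchdog would probably poll and require several consecutive suspect results before acting.

```erlang
-module(code_server_probe).
-export([check/1, check_proc/2]).

%% Hypothetical probe: flag code_server as suspect when its message
%% queue is longer than MaxQueueLen.
check(MaxQueueLen) ->
    check_proc(code_server, MaxQueueLen).

%% Same check for any registered name (an atom).
check_proc(Name, MaxQueueLen) ->
    case whereis(Name) of
        undefined ->
            {error, not_registered};
        Pid ->
            %% process_info/2 returns the items in the requested order,
            %% or undefined if the process has died in the meantime.
            case erlang:process_info(Pid, [message_queue_len,
                                           current_function]) of
                undefined ->
                    {error, dead};
                [{message_queue_len, Len}, {current_function, MFA}]
                  when Len > MaxQueueLen ->
                    {suspect, Len, MFA};
                [{message_queue_len, Len}, {current_function, _}] ->
                    {ok, Len}
            end
    end.
```

A call such as code_server_probe:check(10) could be made periodically from a supervised process or over rpc. Note that the probe itself needs a working scheduler to run on, so it complements heart rather than replacing it.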

Thanks

Matt
