[erlang-questions] R17 - Possible wedged scheduler

Michael Truog mjtruog@REDACTED
Fri Dec 9 19:03:55 CET 2016


If this is scheduler collapse, it would mean you have a port driver or NIF whose calls take longer than 1 millisecond to return.  To handle scheduler collapse, you can start erl with the command line option "-heart" and then call:
heart:set_options([check_schedulers]).

(see http://erlang.org/doc/man/heart.html#set_options-1 )

That will allow the system to restart when schedulers have collapsed.
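
As a minimal sketch of how that call could be wrapped so it is safe to run at application start: the module name heart_setup and the function enable_scheduler_check/0 are hypothetical names for illustration only. It assumes the node was started with "-heart" and that HEART_COMMAND is set in the environment, since heart needs it to know how to restart the node. As far as I know, heart:set_options/1 was added in OTP 18.3, so it would not be available on R17 without an upgrade.

```erlang
-module(heart_setup).
-export([enable_scheduler_check/0]).

%% Hypothetical helper: ask heart to also verify that the schedulers
%% are responsive before it sends each heartbeat.
%% Returns ok on success, or {error, heart_not_running} when the node
%% was started without "-heart" (no process registered as heart).
enable_scheduler_check() ->
    case whereis(heart) of
        undefined ->
            {error, heart_not_running};
        _Pid ->
            heart:set_options([check_schedulers])
    end.
```

Calling heart_setup:enable_scheduler_check() early in startup (for example from your top-level application callback) would then turn the check on once per node start.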

A test that is meant to cause scheduler collapse is at:
https://github.com/basho/nifwait

You could use that to check whether the behavior matches what you are seeing on the system you are having problems with.


On 12/09/2016 08:35 AM, Matthew Evans wrote:
>
> Happened again, it appears that code_server is wedged:
>
>
> admin@REDACTED:~$ doErlangFun "erlang:process_info(whereis(code_server))."
> [{registered_name,code_server},
>  {current_function,{code_server,cpc_recv,4}},
>  {initial_call,{erlang,apply,2}},
>  {status,waiting},
>  {message_queue_len,23},
>  {messages,[{code_call,<6805.4097.0>,{ensure_loaded,switch_type_module}},
>             {code_call,<6805.4146.0>,{ensure_loaded,switch_type_module}},
>             {code_call,<6805.941.0>,{ensure_loaded,pc_port_autoneg}},
>             {code_call,<6805.541.0>,{ensure_loaded,plexxiStatistics_types}},
>             {code_call,<6805.520.0>,{ensure_loaded,switch_type_module}},
>             {code_call,<6805.5123.0>,{ensure_loaded,secondary_erlang_node}},
>             {code_call,<6805.5122.0>,{ensure_loaded,secondary_erlang_node}},
>             {code_call,<6805.5162.0>,{ensure_loaded,icmp}},
>             {code_call,<6805.5321.0>,{ensure_loaded,mac_entries_record_handler}},
>             {code_call,<6805.5483.0>,{ensure_loaded,icmp}},
>             {code_call,<6805.6647.0>,{ensure_loaded,icmp}},
>             {code_call,<6805.7232.0>,{ensure_loaded,icmp}},
>             {code_call,<6805.7274.0>,{ensure_loaded,icmp}},
>             {code_call,<6805.7304.0>,{ensure_loaded,icmp}},
>             {code_call,<6805.8889.0>,{ensure_loaded,mac_entries_record_handler}},
>             {code_call,<6805.8951.0>,{ensure_loaded,mac_entries_record_handler}},
>             {code_call,<6805.576.0>,{ensure_loaded,cross_connect_unicast_utils}},
>             {code_call,<6805.19300.12>,{ensure_loaded,shell}},
>             {code_call,<6805.20313.12>,{ensure_loaded,shell}},
>             {code_call,<6805.21339.12>,{ensure_loaded,dbg}},
>             {code_call,<6805.31109.13>,get_mode},
>             {code_call,<6805.1255.14>,get_mode},
>             {system,{<6805.2521.14>,#Ref<6805.0.23.35356>},get_status}]},
>  {links,[<6805.11.0>]},
>  {dictionary,[{any_native_code_loaded,false}]},
>  {trap_exit,true},
>  {error_handler,error_handler},
>  {priority,normal},
>  {group_leader,<6805.9.0>},
>  {total_heap_size,86071},
>  {heap_size,10958},
>  {stack_size,25},
>  {reductions,13172282},
>  {garbage_collection,[{min_bin_vheap_size,46422},
>                       {min_heap_size,233},
>                       {fullsweep_after,65535},
>                       {minor_gcs,71}]},
>  {suspending,[]}]
> admin@REDACTED:~$
>
> ------------------------------------------------------------------------
> *From:* erlang-questions-bounces@REDACTED <erlang-questions-bounces@REDACTED> on behalf of Matthew Evans <mattevans123@REDACTED>
> *Sent:* Friday, December 9, 2016 9:56 AM
> *To:* Erlang/OTP discussions
> *Subject:* [erlang-questions] R17 - Possible wedged scheduler
>
> Hi,
>
>
> We just hit a situation where it appeared that one scheduler was wedged. Some parts of our application were working, but others appeared to be stuck. I could connect via a cnode application and an escript, but I couldn't connect via the Erlang shell. We have an escript that does rpc calls; some worked, while others (e.g. anything involving the code server or tracing) failed.
>
>
> CPU load was minimal at the time, and heart didn't complain. We only have a single NIF, but this is not called on this hardware variant. We do use CNODE to talk to C applications.
>
>
> We are running R17, Intel quad core CPU on Debian.
>
>
> This is the first time this has been seen, so the questions are:
>
>
> 1. Has anyone seen this before?
>
> 2. What can we do if we hit this condition in the future to debug?
>
> 3. Since heart doesn't detect this can anyone think of any alternative mechanisms?
>
>
> Thanks
>
>
> Matt
>
>
>
> _______________________________________________
> erlang-questions mailing list
> erlang-questions@REDACTED
> http://erlang.org/mailman/listinfo/erlang-questions
