[erlang-bugs] Scheduler Wall Time Statistics live|dead locking a process.

Wed Jul 18 15:50:08 CEST 2012

Hi there Patrik!

I can't really afford to dump core on one of the servers right now,
because it would interrupt the service. I could, however, set up a
separate VM on the server that just sits there calling vmstats while
doing some busy test work, pushing its stats to a fake StatsD server,
to try to reproduce the problem; that might work.
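
A minimal sketch of what such a reproduction VM could run, under stated
assumptions: the module name swt_repro, the function names, and the
interval and timeout values are all hypothetical and are not part of
vmstats. Each sample is taken in a throwaway worker process, so a call
that never returns shows up as a timeout in the watcher instead of
blocking it. The busy test work and the push to the fake StatsD server
are left out.

```erlang
-module(swt_repro).
-export([start/0, start/2, sample/1]).

%% Hypothetical reproduction helper (illustrative names only).

%% Sample once per second; treat 5 seconds without an answer as stuck.
start() ->
    start(1000, 5000).

start(IntervalMs, TimeoutMs) ->
    spawn(fun() ->
        %% Scheduler wall time is only collected once this flag is on.
        %% It is set inside the long-lived loop process on purpose.
        erlang:system_flag(scheduler_wall_time, true),
        loop(IntervalMs, TimeoutMs)
    end).

loop(IntervalMs, TimeoutMs) ->
    case sample(TimeoutMs) of
        {ok, _Stats} ->
            timer:sleep(IntervalMs),
            loop(IntervalMs, TimeoutMs);
        {stuck, Pid} ->
            %% Log where the worker is blocked, then stop sampling.
            %% The worker is left alive so it can still be inspected.
            Info = erlang:process_info(Pid, [current_function, status]),
            error_logger:error_msg("scheduler_wall_time stuck in ~p: ~p~n",
                                   [Pid, Info]),
            ok
    end.

%% Run the statistics call in a separate worker and wait for its answer.
%% Returns {ok, Stats} or {stuck, WorkerPid} after TimeoutMs milliseconds.
sample(TimeoutMs) ->
    Parent = self(),
    Ref = make_ref(),
    Pid = spawn(fun() ->
        Parent ! {Ref, erlang:statistics(scheduler_wall_time)}
    end),
    receive
        {Ref, Stats} -> {ok, Stats}
    after TimeoutMs ->
        {stuck, Pid}
    end.
```

If the hang reproduces, the logged process info for the worker would be
expected to show the same {erlang,sched_wall_time,3} and waiting status
as the dump quoted below, and that VM could then be dumped without
touching the production ones.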

I can get going setting stuff up; how should I dump the core for it to 
be useful for you?

Regards,
Fred.

On 12-07-18 9:44 AM, pan@REDACTED wrote:
> Hi Fred!
>
> On Wed, 18 Jul 2012, Fred Hebert wrote:
>
>> Hi there,
>>
>> If you go on erlang-questions, you'll find the following thread I 
>> started regarding one of my gen_servers locking up forever until I 
>> try to connect to the VM: 
>> http://erlang.org/pipermail/erlang-questions/2012-July/068097.html
>>
>> And the information following it in 
>> http://erlang.org/pipermail/erlang-questions/2012-July/068099.html
>>
>> The gist of it is that apparently, the gen_server gets stuck while 
>> calling erlang:statistics(scheduler_wall_time). A process info dump 
>> on it returns:
>>
>> [{registered_name,vmstats_server},
>> {current_function,{erlang,sched_wall_time,3}},
>> {initial_call,{proc_lib,init_p,5}},
>> {status,waiting},
>> {message_queue_len,2},
>> {messages,[{system,{<5998.7341.243>,#Ref<5998.0.3810.221818>},get_status},
>>            {system,{<5998.28757.800>,#Ref<5998.0.3811.260443>},get_status}]},
>> {links,[<5998.918.0>]},
>> {dictionary,[{random_seed,{17770,13214,15044}},
>>              {'$ancestors',[vmstats_sup,<5998.917.0>]},
>>              {'$initial_call',{vmstats_server,init,1}}]},
>> {trap_exit,false},
>> {error_handler,error_handler},
>> {priority,normal},
>> {group_leader,<5998.916.0>},
>> {total_heap_size,122003},
>> {heap_size,121393},
>> {stack_size,21},
>> {reductions,314325681},
>> {garbage_collection,[{min_bin_vheap_size,46368},
>>                      {min_heap_size,233},
>>                      {fullsweep_after,65535},
>>                      {minor_gcs,23774}]},
>> {suspending,[]}]
>> ok
>>
>> with the interesting parts:
>> {current_function,{erlang,sched_wall_time,3}},
>> {status,waiting},
>>
>> I'm unsure what exactly causes the problem, and we're running the VM 
>> with default arguments when it comes to scheduling and layout. It 
>> happens even when the virtual machine is under relatively low load 
>> (scheduler active wall time is less than 5%, but more than 2% of the 
>> total wall time when averaging all cores) and can also happen under 
>> higher load.
>
> Ouch... Seems like one of the schedulers does not understand that it 
> should report data back to the process. Is there any chance of dumping 
> core of a machine where it hangs, or would that mean interruption of 
> service? I *really* would like to know what the schedulers are doing 
> when they should be reporting back...
>
>
>>
>> Only that process appears affected.
>
> Yes, it's just waiting for a message that does not arrive, one that
> should be sent from the VM when statistics for the schedulers are
> available...
>
> Cheers,
> /Patrik