[erlang-bugs] Scheduler Wall Time Statistics live|dead locking a process.
pan@REDACTED
Wed Jul 18 16:30:27 CEST 2012
Hi!
On Wed, 18 Jul 2012, Fred Hebert wrote:
> Hi there Patrick!
>
> I can't exactly afford to dump core on one of the servers right now because
> yeah, it would interrupt the service. I could however set one VM up on the
> server that just sits and calls vmstats and does some busy test work, pushing
> it to a fake StatsD server to reproduce it; maybe it could work.
That would be great!
>
> I can get going setting stuff up; how should I dump the core for it to be
> useful for you?
Oh, just kill -ABRT on the beam pid when it hangs. Don't forget to 'ulimit
-c unlimited' before starting Erlang, though; I tend to forget that all
the time and get very sad when I've reproduced a problem after three
days of trying and get no core whatsoever :)
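In shell terms, the recipe above amounts to something like this (a sketch; the node name and the pgrep pattern are placeholders, and the core file location is kernel-dependent):

```shell
# Allow unlimited-size core files in this shell; the limit is inherited
# by every process started from it, including the Erlang VM.
ulimit -c unlimited
ulimit -c                      # verify: should print "unlimited"

# Start the node from this same shell, e.g.:
#   erl -sname bugnode
# When the process hangs, abort the VM to force a core dump:
#   kill -ABRT "$(pgrep -f beam.smp)"
# The core file normally lands in the VM's working directory.
```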
>
> Regards,
> Fred.
Cheers,
Patrik
>
> On 12-07-18 9:44 AM, pan@REDACTED wrote:
>> Hi Fred!
>>
>> On Wed, 18 Jul 2012, Fred Hebert wrote:
>>
>>> Hi there,
>>>
>>> If you go on erlang-questions, you'll find the following thread I started
>>> regarding one of my gen_servers locking up forever until I try to connect
>>> to the VM:
>>> http://erlang.org/pipermail/erlang-questions/2012-July/068097.html
>>>
>>> And the information following it in
>>> http://erlang.org/pipermail/erlang-questions/2012-July/068099.html
>>>
>>> The gist of it is that apparently, the gen_server gets stuck while calling
>>> erlang:statistics(scheduler_wall_time). A process info dump on it returns:
>>>
>>> [{registered_name,vmstats_server},
>>> {current_function,{erlang,sched_wall_time,3}},
>>> {initial_call,{proc_lib,init_p,5}},
>>> {status,waiting},
>>> {message_queue_len,2},
>>> {messages,[{system,{<5998.7341.243>,#Ref<5998.0.3810.221818>},get_status},
>>> {system,{<5998.28757.800>,#Ref<5998.0.3811.260443>},get_status}]},
>>> {links,[<5998.918.0>]},
>>> {dictionary,[{random_seed,{17770,13214,15044}},
>>> {'$ancestors',[vmstats_sup,<5998.917.0>]},
>>> {'$initial_call',{vmstats_server,init,1}}]},
>>> {trap_exit,false},
>>> {error_handler,error_handler},
>>> {priority,normal},
>>> {group_leader,<5998.916.0>},
>>> {total_heap_size,122003},
>>> {heap_size,121393},
>>> {stack_size,21},
>>> {reductions,314325681},
>>> {garbage_collection,[{min_bin_vheap_size,46368},
>>> {min_heap_size,233},
>>> {fullsweep_after,65535},
>>> {minor_gcs,23774}]},
>>> {suspending,[]}]
>>> ok
>>>
>>> with the interesting parts:
>>> {current_function,{erlang,sched_wall_time,3}},
>>> {status,waiting},
>>>
>>> I'm unsure what exactly causes the problem, and we're running the VM with
>>> default arguments when it comes to scheduling and layout. It happens even
>>> when the virtual machine is under relatively low load (scheduler active
>>> wall time is less than 5%, but more than 2% of the total wall time when
>>> averaging all cores) and can also happen under higher load.
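For context, the scheduler_wall_time sampling that hangs here normally looks like this (a minimal sketch, not the actual vmstats code; the one-second interval is arbitrary):

```erlang
%% Enable wall-time accounting (off by default), take two snapshots,
%% and derive per-scheduler utilisation from the deltas. Each snapshot
%% is a list of {SchedulerId, ActiveTime, TotalTime} tuples.
erlang:system_flag(scheduler_wall_time, true),
T0 = lists:sort(erlang:statistics(scheduler_wall_time)),
timer:sleep(1000),
T1 = lists:sort(erlang:statistics(scheduler_wall_time)),
Util = [{Id, (A1 - A0) / (W1 - W0)}
        || {{Id, A0, W0}, {Id, A1, W1}} <- lists:zip(T0, T1)].
```

It is the second `erlang:statistics(scheduler_wall_time)` call pattern that blocks in this report, with the calling process stuck in `erlang:sched_wall_time/3`.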
>>
>> Ouch... It seems like one of the schedulers does not understand that it
>> should report data back to the process. Is there any chance of dumping
>> core on a machine where it hangs, or would that mean an interruption of
>> service? I *really* would like to know what the schedulers are doing when
>> they should be reporting back...
>>
>>
>>>
>>> Only that process appears affected.
>>
>> Yes, it's just waiting for a message that never arrives, one that should
>> be sent from the VM when statistics for the scheduler are available...
>>
>>> _______________________________________________
>>> erlang-bugs mailing list
>>> erlang-bugs@REDACTED
>>> http://erlang.org/mailman/listinfo/erlang-bugs
>>>
>> Cheers,
>> /Patrik
>