[erlang-bugs] R12B-3/64bit/smp Stuck disk_log_server

Ulf Wiger ulf@REDACTED
Fri Jan 16 11:18:49 CET 2009


Hi Rickard,

Which versions of OTP seem to have this bug?

BR,
Ulf W

2009/1/12 Rickard Green <rickard.s.green@REDACTED>:
> Hi Geoff,
>
> I've looked at this and found a bug that may have caused this. When a
> process garbage collects another process, and the process being garbage
> collected also receives a message during the garbage collection, the
> process being garbage collected can end up in the state that you described.
>
> This kind of garbage collection only happens when someone calls the
> garbage_collect/1 BIF or when code is purged. In the case of the stuck
> disk_log server I think we can rule out the purge, i.e., if it is this
> bug that caused your problem, another process must have garbage
> collected the disk_log server via the garbage_collect/1 BIF. Do you
> have any code that may have garbage collected the disk_log server via
> the garbage_collect/1 BIF? The garbage collection may also have been
> done explicitly in the shell.
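>
> For illustration, the kind of code that could trigger this looks roughly
> like the sketch below (a hypothetical periodic "memory scrubber"; the
> function name is made up and not from any known system):
>
>     %% Hypothetical sketch: forcing a GC on every other process from the
>     %% outside via the garbage_collect/1 BIF. If a target process receives
>     %% a message while it is being collected, the bug described above can
>     %% leave it waiting with a non-empty message queue.
>     force_gc_all() ->
>         [erlang:garbage_collect(P) || P <- erlang:processes(), P =/= self()],
>         ok.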
>
> Regards,
> Rickard Green, Erlang/OTP, Ericsson AB.
>
>
> Geoff Cant wrote:
>> Hi all, we just took another look at the cluster and discovered another
>> stuck gen_server. This time we sent it a bogus message of '$ignore_me' -
>> the process then woke up, processed the first message in its queue (an
>> ejabberd internal message) and exited (as expected from the ejabberd
>> code).
>>
>> It appears that this bug sometimes causes a process not to get
>> scheduled in when it receives a message. It seems to strike randomly,
>> and subsequent messages cause the process to be scheduled properly.
>>
>> Most of the time this doesn't cause major problems, as the affected
>> process will receive another message in the course of normal events and
>> will then run normally. However, sometimes this strikes the wrong
>> process in just the wrong way (the disk_log_server case) and we get
>> visible error behaviour.
>>
>> This problem has been discovered on two different machines in the same
>> ejabberd cluster, so I don't think this is a heisenbug due to bad RAM.
>>
>> We're going to try replicating this with a tsung test of the same
>> emulator package (http://packages.debian.org/lenny/erlang-base-hipe
>> 1:12.b.3-dfsg-4) and then see if the same problem exists with a source
>> compile of R12B-5.
>>
>> Thanks,
>> --Geoff Cant
>>
>> The debugging session transcript follows.
>>
>> Running
>> [ Pid
>>  || Pid <- erlang:processes(),
>>     element(2, erlang:process_info(Pid, current_function)) =:= {gen_server,loop,6},
>>     element(2, erlang:process_info(Pid, status)) =:= waiting,
>>     element(2, erlang:process_info(Pid, message_queue_len)) > 0].
>>
>> gave us the process <0.19313.279>:
>>
>> (ejabberd@REDACTED)10> process_info(pid(0,19313,279)).
>> [{current_function,{gen_server,loop,6}},
>>  {initial_call,{proc_lib,init_p,5}},
>>  {status,waiting},
>>  {message_queue_len,1},
>>  {messages,[{timeout,#Ref<0.0.1009.52090>,activate}]},
>>  {links,[#Port<0.15757334>,<0.235.0>,#Port<0.15757327>]},
>>  {dictionary,[{'$ancestors',[ejabberd_receiver_sup,
>>                              ejabberd_sup,<0.37.0>]},
>>               {'$initial_call',{gen,init_it,
>>                                     [gen_server,<0.235.0>,<0.235.0>,ejabberd_receiver,
>>                                      [#Port<0.15757327>,gen_tcp,none,131072,<0.19312.279>],
>>                                      []]}}]},
>>  {trap_exit,false},
>>  {error_handler,error_handler},
>>  {priority,normal},
>>  {group_leader,<0.36.0>},
>>  {total_heap_size,987},
>>  {heap_size,987},
>>  {stack_size,12},
>>  {reductions,922822},
>>  {garbage_collection,[{fullsweep_after,0},{minor_gcs,0}]},
>>  {suspending,[]}]
>>
>> This process stayed at {reductions,922822} for over a minute.
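>>
>> A stall like this can be confirmed with something along the lines of the
>> sketch below (illustrative only; here we simply re-ran process_info/2 by
>> hand):
>>
>>     %% Sample reductions twice; a waiting process with queued messages
>>     %% whose reduction count does not advance is not being scheduled in.
>>     is_stalled(Pid) ->
>>         {reductions, R1} = erlang:process_info(Pid, reductions),
>>         timer:sleep(60000),
>>         {reductions, R2} = erlang:process_info(Pid, reductions),
>>         {message_queue_len, Q} = erlang:process_info(Pid, message_queue_len),
>>         R1 =:= R2 andalso Q > 0.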
>>
>> It was sitting on a backtrace of:
>> Program counter: 0x00007f3374b86af8 (gen_server:loop/6 + 288)
>> CP: 0x00007f3371644fe8 (proc_lib:init_p/5 + 400)
>> arity = 0
>>
>> 0x00007f32d07fd988 Return addr 0x00007f3371644fe8 (proc_lib:init_p/5 + 400)
>> y(0)     []
>> y(1)     infinity
>> y(2)     ejabberd_receiver
>> y(3)     {state,{tlssock,#Port<0.15757327>,#Port<0.15757329>},tls,{maxrate,32768,3.247837e+04,1231372801805878},<0.19312.279>,131072,{xml_stream_state,<0.19312.279>,#Port<0.15757334>,[{xmlelement,"stream:stream",[{"to","fake.domain"},{"xmlns","jabber:client"},{"xmlns:stream","http://etherx.jabber.org/streams"},{"xml:lang","de"},{"version","1.0"}],[]}],0,131072},infinity}
>> y(4)     <0.19313.279>
>> y(5)     <0.235.0>
>>
>> 0x00007f32d07fd9c0 Return addr 0x000000000084bd18 (<terminate process normally>)
>> y(0)     Catch 0x00007f3371645008 (proc_lib:init_p/5 + 432)
>> y(1)     gen
>> y(2)     init_it
>> y(3)     [gen_server,<0.235.0>,<0.235.0>,ejabberd_receiver,[#Port<0.15757327>,gen_tcp,none,131072,<0.19312.279>],[]]
>>
>> As this process ignores messages it doesn't understand, we sent it a
>> bogus message:
>>
>> pid(0,19313,279) ! '$ignore_me'.
>>
>> The process then logged:
>> =ERROR REPORT==== 2009-01-08 17:33:37 ===
>> E(<0.19313.279>:ejabberd_receiver:264): ejabberd_reciever:activate_socket missed the tcp_closed event
>>
>> before exiting. This is the expected behaviour on receiving a message
>> like {timeout,#Ref<0.0.1009.52090>,activate} - the one in the queue
>> while it was stuck before we sent the '$ignore_me' message.
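>>
>> The same nudge could be applied in bulk with something like the sketch
>> below (illustrative only, and only safe for servers that, like this one,
>> ignore messages they do not understand):
>>
>>     %% Send a harmless message to every gen_server that looks stuck, so
>>     %% that it gets scheduled in and processes its real, queued messages.
>>     nudge_stuck() ->
>>         [Pid ! '$ignore_me'
>>          || Pid <- erlang:processes(),
>>             erlang:process_info(Pid, current_function)
>>                 =:= {current_function, {gen_server, loop, 6}},
>>             element(2, erlang:process_info(Pid, message_queue_len)) > 0].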
>>
>> So, it appears that this bug sometimes causes a process not to get
>> scheduled in when it receives a message. It seems to strike randomly,
>> and subsequent messages cause the process to be scheduled properly.
>>