[erlang-bugs] R12B-3/64bit/smp Stuck disk_log_server

Mon Jan 12 13:46:22 CET 2009

Hi Geoff,

I've looked at this and found a bug that may have caused this. When a 
process garbage collect another process and the process being garbage 
collected also receives a message during the garbage collect, the 
process being garbage collected can end up in the state that you described.

This kind of garbage collect only happen when someone calls the 
garbage_collect/1 BIF or when code is purged. In the case with the 
disk_log server being stuck I think we can rule out the purge, i.e., if 
it is this bug that caused your problem another process must have 
garbage collected the disk_log server via the garbage_collect/1 BIF. Do 
you have any code that may have garbage collected the disk_log server 
via the garbage_collect/1 BIF? The garbage collect may also have been 
done explicitly in the shell.

Regards,
Rickard Green, Erlang/OTP, Ericsson AB.

Geoff Cant wrote:
> Hi all, we just took another look at the cluster and discovered another
> stuck gen_server. This time we sent it a bogus message of '$ignore_me' -
> the process then woke up, processed the first message in its queue (an
> ejabberd internal message) and exited (as expected from the ejabberd
> code). 
> 
> It appears that this bug causes processes to sometimes not get
> scheduled in when they receive a message. It seems to strike randomly
> and subsequent messages cause the process to be scheduled properly.
> 
> Most of the time this doesn't cause major problems as the affected
> process will receive another message in the course of normal events and
> will now run normally. However, sometimes this strikes the wrong process
> in just the wrong way (the disk_log_server case) and we get visible
> error behaviour.
> 
> This problem has been discovered on two different machines in the same
> ejabberd cluster, so I don't think this is a heisenbug due to bad RAM.
> 
> We're going to try replicating this with a tsung test of the same
> emulator package (http://packages.debian.org/lenny/erlang-base-hipe
> 1:12.b.3-dfsg-4) and then see if the same problem exists with a source
> compile of R12B-5.
> 
> Thanks,
> --Geoff Cant
> 
> The debugging session transcript follows.
> 
> Running
> [ Pid 
>  || Pid <- erlang:processes(),
>     element(2, erlang:process_info(Pid, current_function)) =:= {gen_server,loop,6},
>     element(2, erlang:process_info(Pid, status)) =:= waiting,
>     length(element(2, erlang:process_info(Pid, message_queue))) > 0].
> 
> gave us the process <0.19313.279>:
> 
> (ejabberd@REDACTED)10> process_info(pid(0,19313,279)).
> [{current_function,{gen_server,loop,6}},
>  {initial_call,{proc_lib,init_p,5}},
>  {status,waiting},
>  {message_queue_len,1},
>  {messages,[{timeout,#Ref<0.0.1009.52090>,activate}]},
>  {links,[#Port<0.15757334>,<0.235.0>,#Port<0.15757327>]},
>  {dictionary,[{'$ancestors',[ejabberd_receiver_sup,
>                              ejabberd_sup,<0.37.0>]},
>               {'$initial_call',{gen,init_it,
>                                     [gen_server,<0.235.0>,<0.235.0>,ejabberd_receiver,
>                                      [#Port<0.15757327>,gen_tcp,none,131072,<0.19312.279>],
>                                      []]}}]},
>  {trap_exit,false},
>  {error_handler,error_handler},
>  {priority,normal},
>  {group_leader,<0.36.0>},
>  {total_heap_size,987},
>  {heap_size,987},
>  {stack_size,12},
>  {reductions,922822},
>  {garbage_collection,[{fullsweep_after,0},{minor_gcs,0}]},
>  {suspending,[]}]
> 
> This process stayed at {reductions,922822} for over a minute.
> 
> It was sitting on a backtrace of:
> Program counter: 0x00007f3374b86af8 (gen_server:loop/6 + 288)
> CP: 0x00007f3371644fe8 (proc_lib:init_p/5 + 400)
> arity = 0
> 
> 0x00007f32d07fd988 Return addr 0x00007f3371644fe8 (proc_lib:init_p/5 + 400)
> y(0)     []
> y(1)     infinity
> y(2)     ejabberd_receiver
> y(3)     {state,{tlssock,#Port<0.15757327>,#Port<0.15757329>},tls,{maxrate,32768,3.247837e+04,1231372801805878},<0.19312.279>,131072,{xml_stream_state,<0.19312.279>,#Port<0.15757334>,[{xmlelement,"stream:stream",[{"to","fake.domain"},{"xmlns","jabber:client"},{"xmlns:stream","http://etherx.jabber.org/streams"},{"xml:lang","de"},{"version","1.0"}],[]}],0,131072},infinity}
> y(4)     <0.19313.279>
> y(5)     <0.235.0>
> 
> 0x00007f32d07fd9c0 Return addr 0x000000000084bd18 (<terminate process normally>)
> y(0)     Catch 0x00007f3371645008 (proc_lib:init_p/5 + 432)
> y(1)     gen
> y(2)     init_it
> y(3)     [gen_server,<0.235.0>,<0.235.0>,ejabberd_receiver,[#Port<0.15757327>,gen_tcp,none,131072,<0.19312.279>],[]]
> 
> As this process ignores messages it doesn't understand, we sent it a
> bogus message:
> 
> pid(0,19313,279) ! '$ignore_me'.
> 
> The process then logged:
> =ERROR REPORT==== 2009-01-08 17:33:37 ===
> E(<0.19313.279>:ejabberd_receiver:264): ejabberd_reciever:activate_socket missed the tcp_closed event
> 
> before exiting. This is the expected behaviour on receiving a message
> like {timeout,#Ref<0.0.1009.52090>,activate} - the one in the queue
> while it was stuck before we sent the '$ignore_me' message.
> 
> So, it appears that this bug causes processes to sometimes not get
> scheduled in when they receive a message. It seems to strike randomly
> and subsequent messages cause the process to be scheduled properly.
> 
> _______________________________________________
> erlang-bugs mailing list
> erlang-bugs@REDACTED
> http://www.erlang.org/mailman/listinfo/erlang-bugs
>