[erlang-bugs] R12B-3/64bit/smp Stuck disk_log_server

Thu Jan 8 18:44:59 CET 2009

Hi all, we just took another look at the cluster and discovered another
stuck gen_server. This time we sent it a bogus message of '$ignore_me' -
the process then woke up, processed the first message in its queue (an
ejabberd internal message) and exited (as expected from the ejabberd
code). 

It appears that this bug causes processes to sometimes not get
scheduled in when they receive a message. It seems to strike randomly
and subsequent messages cause the process to be scheduled properly.

Most of the time this doesn't cause major problems as the affected
process will receive another message in the course of normal events and
will now run normally. However, sometimes this strikes the wrong process
in just the wrong way (the disk_log_server case) and we get visible
error behaviour.

This problem has been discovered on two different machines in the same
ejabberd cluster, so I don't think this is a heisenbug due to bad RAM.

We're going to try replicating this with a tsung test of the same
emulator package (http://packages.debian.org/lenny/erlang-base-hipe
1:12.b.3-dfsg-4) and then see if the same problem exists with a source
compile of R12B-5.

Thanks,
--Geoff Cant

The debugging session transcript follows.

Running
[ Pid 
 || Pid <- erlang:processes(),
    element(2, erlang:process_info(Pid, current_function)) =:= {gen_server,loop,6},
    element(2, erlang:process_info(Pid, status)) =:= waiting,
    length(element(2, erlang:process_info(Pid, message_queue))) > 0].

gave us the process <0.19313.279>:

(ejabberd@REDACTED)10> process_info(pid(0,19313,279)).
[{current_function,{gen_server,loop,6}},
 {initial_call,{proc_lib,init_p,5}},
 {status,waiting},
 {message_queue_len,1},
 {messages,[{timeout,#Ref<0.0.1009.52090>,activate}]},
 {links,[#Port<0.15757334>,<0.235.0>,#Port<0.15757327>]},
 {dictionary,[{'$ancestors',[ejabberd_receiver_sup,
                             ejabberd_sup,<0.37.0>]},
              {'$initial_call',{gen,init_it,
                                    [gen_server,<0.235.0>,<0.235.0>,ejabberd_receiver,
                                     [#Port<0.15757327>,gen_tcp,none,131072,<0.19312.279>],
                                     []]}}]},
 {trap_exit,false},
 {error_handler,error_handler},
 {priority,normal},
 {group_leader,<0.36.0>},
 {total_heap_size,987},
 {heap_size,987},
 {stack_size,12},
 {reductions,922822},
 {garbage_collection,[{fullsweep_after,0},{minor_gcs,0}]},
 {suspending,[]}]

This process stayed at {reductions,922822} for over a minute.

It was sitting on a backtrace of:
Program counter: 0x00007f3374b86af8 (gen_server:loop/6 + 288)
CP: 0x00007f3371644fe8 (proc_lib:init_p/5 + 400)
arity = 0

0x00007f32d07fd988 Return addr 0x00007f3371644fe8 (proc_lib:init_p/5 + 400)
y(0)     []
y(1)     infinity
y(2)     ejabberd_receiver
y(3)     {state,{tlssock,#Port<0.15757327>,#Port<0.15757329>},tls,{maxrate,32768,3.247837e+04,1231372801805878},<0.19312.279>,131072,{xml_stream_state,<0.19312.279>,#Port<0.15757334>,[{xmlelement,"stream:stream",[{"to","fake.domain"},{"xmlns","jabber:client"},{"xmlns:stream","http://etherx.jabber.org/streams"},{"xml:lang","de"},{"version","1.0"}],[]}],0,131072},infinity}
y(4)     <0.19313.279>
y(5)     <0.235.0>

0x00007f32d07fd9c0 Return addr 0x000000000084bd18 (<terminate process normally>)
y(0)     Catch 0x00007f3371645008 (proc_lib:init_p/5 + 432)
y(1)     gen
y(2)     init_it
y(3)     [gen_server,<0.235.0>,<0.235.0>,ejabberd_receiver,[#Port<0.15757327>,gen_tcp,none,131072,<0.19312.279>],[]]

As this process ignores messages it doesn't understand, we sent it a
bogus message:

pid(0,19313,279) ! '$ignore_me'.

The process then logged:
=ERROR REPORT==== 2009-01-08 17:33:37 ===
E(<0.19313.279>:ejabberd_receiver:264): ejabberd_reciever:activate_socket missed the tcp_closed event

before exiting. This is the expected behaviour on receiving a message
like {timeout,#Ref<0.0.1009.52090>,activate} - the one in the queue
while it was stuck before we sent the '$ignore_me' message.

So, it appears that this bug causes processes to sometimes not get
scheduled in when they receive a message. It seems to strike randomly
and subsequent messages cause the process to be scheduled properly.