[erlang-questions] R11B-2 SMP Timer Race Condition Bug [Re: bug in timer:sleep/1 smp implementation (R11B-0)]

Serge Aleynikov serge@REDACTED
Fri Dec 22 21:22:03 CET 2006


Additionally, I should say that we've been able to reproduce this bug on 
several Linux platforms (RH ES 4.0, Fedora Core 5.0, 32bit and 64bit) in 
R11B-0, R11B-1 and R11B-2.  The bug (what appears to be a race 
condition) is seen only if the emulator is started in the SMP mode and 
results in the following construct blocking infinitely in the context of 
some Erlang process handing a message dispatching function:

receive
after N ->    % Where N is between 1 and 999
     ok
end.

It happens when all the CPUs in SMP mode are over 75% loaded.  The bug 
doesn't happen immediately after starting a release, but after a period 
of 5 min to 3 hours, which makes it pretty hard to diagnose.  The 
tracing method that we initially tried to use was to include printf 
statements in the emulator to stderr.  However, this prevented the bug 
from showing up.  Further it was changed to using SysV message queue to 
communicate trace to an external process that dumped the trace to a 
file.  This allowed to gain further understanding of the problem, but as 
Dmitry indicated any attempt to reduce the code to a minimal example 
made the problem disappear.

The emulator code is quite involved, but hopefully someone in the OTP 
team could come up with a recommendation of how/where to put a missing 
synchronization.  If needed we can arrange for a remote SSH login to the 
system(s) where the problem is reproducible.

Regards,

Serge

Dmitriy Kargapolov wrote:
> Unfortunately I can not create standalone test for this bug, even when I 
> became much more close to understanding the effect.
> This bug appears only in highly loaded system.
> 
> Recently I did manage to trace some points in the code and see at least 
> one scenario for the race condition bug.
> 
>  1. Thread A    erl_set_timer (time.c)            Lock Timing Wheel
>  2. Thread A    insert_timer (time.c)             Insert Timer T1
>  3. Thread A    erl_set_timer (time.c)            Unlock Timing Wheel
>  4. Thread B    bump_timer_internal (time.c)      Lock Timing Wheel
>  5. Thread A    cancel_timer (erl_process.c)      Cancel timer T1
>  6. Thread B    bump_timer_internal (time.c)      Build list of Expired 
> Timers
>  7. Thread A    erl_cancel_timer (time.c)         Cancel timer T1: 
> Waiting for Timing Wheel Lock
>  8. Thread B    bump_timer_internal (time.c)      Unlock Timing Wheel
>  9. Thread C    set_timer (erl_process.c)         New Timeout Request (T2)
> 10. Thread B    bump_timer_internal (time.c)      Call Expired Timers 
> Callbacks
> 11. Thread B    free_ptimer (utils.c)             Timer T1 callback 
> invokes free_ptimer()
> 12. Thread C    erts_create_smp_ptimer (utils.c)  Create Timer 
> ErtsSmpPTimer for T2
> 13. Thread B    free_ptimer (utils.c)             Free ErtsSmpPTimer 
> memory block
> 14. Thread C    erts_create_smp_ptimer (utils.c)  Allocate ErtsSmpPTimer 
> for T2, block reused!
> 15. Thread C    erl_set_timer (time.c)            erl_set_timer invoked 
> for T2
> 16. Thread C    erl_set_timer (time.c)            Lock Timing Wheel
> 17. Thread C    insert_timer (time.c)             Insert Timer T2
> 18. Thread C    erl_set_timer (time.c)            Unlock Timing Wheel
> 19. Thread A    erl_cancel_timer (time.c)         Lock Timing Wheel
> 20. Thread A    erl_cancel_timer (time.c)         Remove ex-T1 == T2 
> from the timing wheel
> 21. Thread A    erl_cancel_timer (time.c)         Unlock Timing Wheel
> 
> See also attached diagram.
> 
> Looks like one more mutex required, excluding release of ErtsSmpPTimer 
> memory block by timeout callback if cancel request was issued for the 
> timer and vise versa. The two point of control - cancel timer and timer 
> expiration should not interfere.
> This bug happens only in SMP mode since there additional timer control 
> structure ErtsSmpPTimer is used between emulator and timing wheel.
> 
> Mikael Pettersson wrote:
>> Dmitriy Kargapolov writes:
>>  >  > When running erl with -smp +S 2 option, sometimes process gets 
>> stuck in  > timer:sleep/1.
>>  > Process code looks like:
>>  >  > some_receiver(State) ->
>>  >      NewState = receive
>>  >          % legal packet
>>  >          {some_keyword, Address, Port, Packet} ->
>>  >              State1 = handle_packet(Address, Port, Packet, State),
>>  >              timer:sleep(get_loop_delay()),
>>  >              State1;
>>  >          % unknown message
>>  >          _ ->
>>  >              State
>>  >      end,
>>  >      some_receiver(NewState).
>>  >  > Delay value varies in range 1..999
>>  >  > Since timer:sleep/1 implemented as:
>>  > sleep(T) ->
>>  >      receive
>>  >      after T -> ok
>>  >      end.
>>  > it seems to be problem with "after" in smp implementation in R11B-0
>>  >  > I don't have more details yet but will continue testing.
>>  > My platform: 2.6.9-5.ELsmp #1 SMP i686 i686 i386 GNU/Linux
>>  
>> Interesting. Please send us a small standalone module that exhibits
>> the bug, and I'll see if I can reproduce it.
>>
>> /Mikael
>>
> 
> ------------------------------------------------------------------------
> 
> _______________________________________________
> erlang-questions mailing list
> erlang-questions@REDACTED
> http://www.erlang.org/mailman/listinfo/erlang-questions

-- 
Serge Aleynikov
Routing R&D, IDT Telecom
Tel: +1 (973) 438-3436
Fax: +1 (973) 438-1464



More information about the erlang-questions mailing list