[erlang-bugs] [erlang-questions] R11B-2 SMP Timer Race Condition Bug [Re: bug in timer:sleep/1 smp implementation (R11B-0)]

Rickard Green rickard.s.green@REDACTED
Sun Dec 24 01:36:25 CET 2006


Thanks for your detailed bug report. I'll look at this as soon as possible.

BR,
Rickard Green, Erlang/OTP

Serge Aleynikov wrote:
> Additionally, I should say that we've been able to reproduce this bug on 
> several Linux platforms (RH ES 4.0, Fedora Core 5.0, 32bit and 64bit) in 
> R11B-0, R11B-1 and R11B-2.  The bug (what appears to be a race 
> condition) is seen only if the emulator is started in the SMP mode and 
> results in the following construct blocking infinitely in the context of 
> some Erlang process handing a message dispatching function:
> 
> receive
> after N ->    % Where N is between 1 and 999
>      ok
> end.
> 
> It happens when all the CPUs in SMP mode are over 75% loaded.  The bug 
> doesn't happen immediately after starting a release, but after a period 
> of 5 min to 3 hours, which makes it pretty hard to diagnose.  The 
> tracing method that we initially tried to use was to include printf 
> statements in the emulator to stderr.  However, this prevented the bug 
> from showing up.  Further it was changed to using SysV message queue to 
> communicate trace to an external process that dumped the trace to a 
> file.  This allowed to gain further understanding of the problem, but as 
> Dmitry indicated any attempt to reduce the code to a minimal example 
> made the problem disappear.
> 
> The emulator code is quite involved, but hopefully someone in the OTP 
> team could come up with a recommendation of how/where to put a missing 
> synchronization.  If needed we can arrange for a remote SSH login to the 
> system(s) where the problem is reproducible.
> 
> Regards,
> 
> Serge
> 
> Dmitriy Kargapolov wrote:
>> Unfortunately I can not create standalone test for this bug, even when I 
>> became much more close to understanding the effect.
>> This bug appears only in highly loaded system.
>>
>> Recently I did manage to trace some points in the code and see at least 
>> one scenario for the race condition bug.
>>
>>  1. Thread A    erl_set_timer (time.c)            Lock Timing Wheel
>>  2. Thread A    insert_timer (time.c)             Insert Timer T1
>>  3. Thread A    erl_set_timer (time.c)            Unlock Timing Wheel
>>  4. Thread B    bump_timer_internal (time.c)      Lock Timing Wheel
>>  5. Thread A    cancel_timer (erl_process.c)      Cancel timer T1
>>  6. Thread B    bump_timer_internal (time.c)      Build list of Expired 
>> Timers
>>  7. Thread A    erl_cancel_timer (time.c)         Cancel timer T1: 
>> Waiting for Timing Wheel Lock
>>  8. Thread B    bump_timer_internal (time.c)      Unlock Timing Wheel
>>  9. Thread C    set_timer (erl_process.c)         New Timeout Request (T2)
>> 10. Thread B    bump_timer_internal (time.c)      Call Expired Timers 
>> Callbacks
>> 11. Thread B    free_ptimer (utils.c)             Timer T1 callback 
>> invokes free_ptimer()
>> 12. Thread C    erts_create_smp_ptimer (utils.c)  Create Timer 
>> ErtsSmpPTimer for T2
>> 13. Thread B    free_ptimer (utils.c)             Free ErtsSmpPTimer 
>> memory block
>> 14. Thread C    erts_create_smp_ptimer (utils.c)  Allocate ErtsSmpPTimer 
>> for T2, block reused!
>> 15. Thread C    erl_set_timer (time.c)            erl_set_timer invoked 
>> for T2
>> 16. Thread C    erl_set_timer (time.c)            Lock Timing Wheel
>> 17. Thread C    insert_timer (time.c)             Insert Timer T2
>> 18. Thread C    erl_set_timer (time.c)            Unlock Timing Wheel
>> 19. Thread A    erl_cancel_timer (time.c)         Lock Timing Wheel
>> 20. Thread A    erl_cancel_timer (time.c)         Remove ex-T1 == T2 
>> from the timing wheel
>> 21. Thread A    erl_cancel_timer (time.c)         Unlock Timing Wheel
>>
>> See also attached diagram.
>>
>> Looks like one more mutex required, excluding release of ErtsSmpPTimer 
>> memory block by timeout callback if cancel request was issued for the 
>> timer and vise versa. The two point of control - cancel timer and timer 
>> expiration should not interfere.
>> This bug happens only in SMP mode since there additional timer control 
>> structure ErtsSmpPTimer is used between emulator and timing wheel.
>>
>> Mikael Pettersson wrote:
>>> Dmitriy Kargapolov writes:
>>>  >  > When running erl with -smp +S 2 option, sometimes process gets 
>>> stuck in  > timer:sleep/1.
>>>  > Process code looks like:
>>>  >  > some_receiver(State) ->
>>>  >      NewState = receive
>>>  >          % legal packet
>>>  >          {some_keyword, Address, Port, Packet} ->
>>>  >              State1 = handle_packet(Address, Port, Packet, State),
>>>  >              timer:sleep(get_loop_delay()),
>>>  >              State1;
>>>  >          % unknown message
>>>  >          _ ->
>>>  >              State
>>>  >      end,
>>>  >      some_receiver(NewState).
>>>  >  > Delay value varies in range 1..999
>>>  >  > Since timer:sleep/1 implemented as:
>>>  > sleep(T) ->
>>>  >      receive
>>>  >      after T -> ok
>>>  >      end.
>>>  > it seems to be problem with "after" in smp implementation in R11B-0
>>>  >  > I don't have more details yet but will continue testing.
>>>  > My platform: 2.6.9-5.ELsmp #1 SMP i686 i686 i386 GNU/Linux
>>>  
>>> Interesting. Please send us a small standalone module that exhibits
>>> the bug, and I'll see if I can reproduce it.
>>>
>>> /Mikael
>>>
>> ------------------------------------------------------------------------
>>
>> _______________________________________________
>> erlang-questions mailing list
>> erlang-questions@REDACTED
>> http://www.erlang.org/mailman/listinfo/erlang-questions
> 



More information about the erlang-bugs mailing list