[erlang-bugs] [erlang-questions] R11B-2 SMP Timer Race Condition Bug [Re: bug in timer:sleep/1 smp implementation (R11B-0)]

Rickard Green rickard.s.green@REDACTED
Wed Dec 27 18:54:29 CET 2006


The process lock plays an important role here. Unfortunately a faulty
optimization (blush) prevented the process lock from playing that role.
ptimer_timeout() has to acquire the process lock before looking at the
ptimer flags. I've attached a patch that should fix the problem.

      $ tar -zxf otp_src_R11B-2.tar.gz
      $ patch -p0 < ptimer.patch
      patching file `otp_src_R11B-2/erts/emulator/beam/utils.c'

Please, report to us whether or not the problem went away.

Great work Dmitriy and Serge! Many thanks!

BR,
Rickard Green, Erlang/OTP

Rickard Green wrote:
> Thanks for your detailed bug report. I'll look at this as soon as possible.
> 
> BR,
> Rickard Green, Erlang/OTP
> 
> Serge Aleynikov wrote:
>> Additionally, I should say that we've been able to reproduce this bug on 
>> several Linux platforms (RH ES 4.0, Fedora Core 5.0, 32bit and 64bit) in 
>> R11B-0, R11B-1 and R11B-2.  The bug (what appears to be a race 
>> condition) is seen only if the emulator is started in the SMP mode and 
>> results in the following construct blocking infinitely in the context of 
>> some Erlang process handing a message dispatching function:
>>
>> receive
>> after N ->    % Where N is between 1 and 999
>>      ok
>> end.
>>
>> It happens when all the CPUs in SMP mode are over 75% loaded.  The bug 
>> doesn't happen immediately after starting a release, but after a period 
>> of 5 min to 3 hours, which makes it pretty hard to diagnose.  The 
>> tracing method that we initially tried to use was to include printf 
>> statements in the emulator to stderr.  However, this prevented the bug 
>> from showing up.  Further it was changed to using SysV message queue to 
>> communicate trace to an external process that dumped the trace to a 
>> file.  This allowed to gain further understanding of the problem, but as 
>> Dmitry indicated any attempt to reduce the code to a minimal example 
>> made the problem disappear.
>>
>> The emulator code is quite involved, but hopefully someone in the OTP 
>> team could come up with a recommendation of how/where to put a missing 
>> synchronization.  If needed we can arrange for a remote SSH login to the 
>> system(s) where the problem is reproducible.
>>
>> Regards,
>>
>> Serge
>>
>> Dmitriy Kargapolov wrote:
>>> Unfortunately I can not create standalone test for this bug, even when I 
>>> became much more close to understanding the effect.
>>> This bug appears only in highly loaded system.
>>>
>>> Recently I did manage to trace some points in the code and see at least 
>>> one scenario for the race condition bug.
>>>
>>>  1. Thread A    erl_set_timer (time.c)            Lock Timing Wheel
>>>  2. Thread A    insert_timer (time.c)             Insert Timer T1
>>>  3. Thread A    erl_set_timer (time.c)            Unlock Timing Wheel
>>>  4. Thread B    bump_timer_internal (time.c)      Lock Timing Wheel
>>>  5. Thread A    cancel_timer (erl_process.c)      Cancel timer T1
>>>  6. Thread B    bump_timer_internal (time.c)      Build list of Expired 
>>> Timers
>>>  7. Thread A    erl_cancel_timer (time.c)         Cancel timer T1: 
>>> Waiting for Timing Wheel Lock
>>>  8. Thread B    bump_timer_internal (time.c)      Unlock Timing Wheel
>>>  9. Thread C    set_timer (erl_process.c)         New Timeout Request (T2)
>>> 10. Thread B    bump_timer_internal (time.c)      Call Expired Timers 
>>> Callbacks
>>> 11. Thread B    free_ptimer (utils.c)             Timer T1 callback 
>>> invokes free_ptimer()
>>> 12. Thread C    erts_create_smp_ptimer (utils.c)  Create Timer 
>>> ErtsSmpPTimer for T2
>>> 13. Thread B    free_ptimer (utils.c)             Free ErtsSmpPTimer 
>>> memory block
>>> 14. Thread C    erts_create_smp_ptimer (utils.c)  Allocate ErtsSmpPTimer 
>>> for T2, block reused!
>>> 15. Thread C    erl_set_timer (time.c)            erl_set_timer invoked 
>>> for T2
>>> 16. Thread C    erl_set_timer (time.c)            Lock Timing Wheel
>>> 17. Thread C    insert_timer (time.c)             Insert Timer T2
>>> 18. Thread C    erl_set_timer (time.c)            Unlock Timing Wheel
>>> 19. Thread A    erl_cancel_timer (time.c)         Lock Timing Wheel
>>> 20. Thread A    erl_cancel_timer (time.c)         Remove ex-T1 == T2 
>>> from the timing wheel
>>> 21. Thread A    erl_cancel_timer (time.c)         Unlock Timing Wheel
>>>
>>> See also attached diagram.
>>>
>>> Looks like one more mutex required, excluding release of ErtsSmpPTimer 
>>> memory block by timeout callback if cancel request was issued for the 
>>> timer and vise versa. The two point of control - cancel timer and timer 
>>> expiration should not interfere.
>>> This bug happens only in SMP mode since there additional timer control 
>>> structure ErtsSmpPTimer is used between emulator and timing wheel.
>>>
>>> Mikael Pettersson wrote:
>>>> Dmitriy Kargapolov writes:
>>>>  >  > When running erl with -smp +S 2 option, sometimes process gets 
>>>> stuck in  > timer:sleep/1.
>>>>  > Process code looks like:
>>>>  >  > some_receiver(State) ->
>>>>  >      NewState = receive
>>>>  >          % legal packet
>>>>  >          {some_keyword, Address, Port, Packet} ->
>>>>  >              State1 = handle_packet(Address, Port, Packet, State),
>>>>  >              timer:sleep(get_loop_delay()),
>>>>  >              State1;
>>>>  >          % unknown message
>>>>  >          _ ->
>>>>  >              State
>>>>  >      end,
>>>>  >      some_receiver(NewState).
>>>>  >  > Delay value varies in range 1..999
>>>>  >  > Since timer:sleep/1 implemented as:
>>>>  > sleep(T) ->
>>>>  >      receive
>>>>  >      after T -> ok
>>>>  >      end.
>>>>  > it seems to be problem with "after" in smp implementation in R11B-0
>>>>  >  > I don't have more details yet but will continue testing.
>>>>  > My platform: 2.6.9-5.ELsmp #1 SMP i686 i686 i386 GNU/Linux
>>>>  
>>>> Interesting. Please send us a small standalone module that exhibits
>>>> the bug, and I'll see if I can reproduce it.
>>>>
>>>> /Mikael
>>>>
>>> ------------------------------------------------------------------------
>>>
>>> _______________________________________________
>>> erlang-questions mailing list
>>> erlang-questions@REDACTED
>>> http://www.erlang.org/mailman/listinfo/erlang-questions
> _______________________________________________
> erlang-bugs mailing list
> erlang-bugs@REDACTED
> http://www.erlang.org/mailman/listinfo/erlang-bugs
> 



-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: ptimer.patch
URL: <http://erlang.org/pipermail/erlang-bugs/attachments/20061227/1fd5a234/attachment.ksh>


More information about the erlang-bugs mailing list